Legacy Contract Data Extraction: Software, Manual or Both?

Contract data extraction or contract review is a vital step of the contract management process for any business to thrive. In contract management, review means critically analyzing the information present in a contract. The old way of capturing essential data attributes from legal documents has been transformed, thanks to technology. We hear about Artificial Intelligence (AI), we hear about Machine Learning software systems and other new technologies for data extraction from contracts. But, most of these modern-day AI software solutions run on predefined algorithms and patterns, at times failing to identify errors that might have occurred while converting the scanned documents into text-based OCR documents, resulting in incorrect abstraction and migration of data.

What is OCR?

OCR is the acronym for Optical Character Recognition. When you scan in a paper document, the resulting scan is a picture or an image. The computer only sees “dots”, and does not recognize characters – it is not meant as data. The formats that we are used to for pictures and images are. JPG, .TIF, .GIF, BMP. These are digital picture or image formats. If there is a copy of some kind on these digital images, comprising numerous dots then they need to be identified as information. They need to be “recognized” and converted to text. This process is known as Optical Character Recognition or OCR.

Challenges

Poorly scanned documents, hand-written text, and calculations, and date fields within the contracts can make it a little tricker to extract the information from contracts. OCR technologies are commonly used to convert scanned documents into machine-readable formats. But if there is poor quality data, handwritten text, and fuzzy scanning present in the documents, even the best software fails to convert the documents with 100% accuracy, which can lead to error-filled extraction.

Here is an example of a poor OCR converted document for extraction

If an extraction engine is configured to extract the above information i.e. confidentiality and non-solicitation agreement/clause, and come across this type of poorly scanned or OCRed document with incorrect spellings and spaces in the middle of the text, the software may fail to identify and extract the appropriate information.

Software is just 1/3rd the solution, & you need a stringent process and a trained team of lawyers to achieve the best quality extraction.

You can’t rely on software only. The software can be used as a first step to extract voluminous data fast, but a manual review is required to ensure 100% accuracy in the extracted data, especially in the cases like above.

Technology itself won’t fix the problem. You need to have the right people who will streamline the process in place and then you bring in technology to fix them

Human review is essential to ensure quality. A team of lawyers should be used to check what the software extracts, fix/fill-in-the-blanks of what the software couldn’t, maybe because of some OCR read errors or handwritten attributes such as signatures, etc.