Document images often have quality issues that make it difficult to extract text or data elements from them. For example:
- Forms can have lines running through much of the text.
- Watermarks can interfere with text recognition.
- Text orientation may be skewed.
Once specific issues have been identified, advanced image enhancement techniques can greatly improve the quality and quantity of information that can be extracted.
The main obstacle has been automating the identification of these issues. Many text-based tools are available for file or document analysis, but they are of little use until there is text to analyze.
Visual classification technology is unique in that it neither uses nor needs text to make initial classifications. In simplistic terms, it can be thought of as facial recognition for documents: just as facial recognition identifies faces without text, visual classification groups visually similar files without text. The groupings can be reviewed starting with the largest groups, so decisions can quickly be made for over 99% of the files in the collection.
Image review has three outcomes:
- Discard. These files are not needed because they have no ongoing business, regulatory or legal value.
- Keep without enhancing. These are documents that have ongoing value but do not need or would not benefit from enhancement.
- Keep but enhance. These are documents where searchability or attribute extraction would benefit from enhancement.
Under this approach, no resources are wasted on document images that will be discarded or that already yield quality text, and specific issues are identified for the subset of images where enhancement will improve retrieval or data extraction.
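The group-then-decide workflow described above can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the average-hash signature is a crude stand-in for visual similarity and is not the method used by any particular product, the 2x2 "images" are toy grayscale grids, and the names `average_hash`, `group_by_appearance`, and `triage` are hypothetical.

```python
from collections import defaultdict
from enum import Enum


class Outcome(Enum):
    DISCARD = "discard"
    KEEP = "keep without enhancing"
    ENHANCE = "keep but enhance"


def average_hash(pixels):
    """Return a tuple of bits: 1 where a pixel is brighter than the image mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return tuple(1 if value > mean else 0 for value in flat)


def group_by_appearance(images):
    """Group file names whose signatures match exactly (assumed similarity test)."""
    groups = defaultdict(list)
    for name, pixels in images.items():
        groups[average_hash(pixels)].append(name)
    # Largest groups first, so one review decision covers the most files.
    return sorted(groups.values(), key=len, reverse=True)


def triage(groups, decisions):
    """Apply one reviewer decision per group to every file in that group."""
    routed = {}
    for group, outcome in zip(groups, decisions):
        for name in group:
            routed[name] = outcome
    return routed


# Toy 2x2 grayscale "images" (0 = black, 255 = white), purely illustrative.
images = {
    "form_a.png": [[250, 10], [10, 250]],
    "form_b.png": [[240, 20], [30, 245]],
    "memo.png": [[10, 10], [250, 250]],
}
groups = group_by_appearance(images)
routed = triage(groups, [Outcome.ENHANCE, Outcome.DISCARD])
```

No text is read at any point: the two form images land in the same group because they look alike, and a single decision on that group routes both files.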
Visual classification also provides awareness of what a collection contains. For the first time, collection owners can examine virtually all of the document types in their collection and see how well each type is represented by tools that are restricted to text.
In one example, the background watermark on a birth certificate was removed, greatly improving the ability to extract textual values from the certificate.
For document groupings involving forms, BeyondRecognition (BR) can recognize data elements that occur in every document, meaning they are part of the underlying form, and then negate or remove them. Because the geometric relationships among the graphical elements remain the same regardless of the resolution of the image, negation can be used on files or documents originating in a variety of ways.
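The idea of negating elements that always occur can be illustrated with a small Python sketch. It makes simplifying assumptions that the text above does not: the pages are already binarized (1 = ink, 0 = blank), aligned, and the same size, whereas a real system would first have to normalize scale and position. The function names `find_form_mask` and `negate_form` are hypothetical.

```python
def find_form_mask(documents):
    """Mark pixels that carry ink (1) in every document of the group."""
    height, width = len(documents[0]), len(documents[0][0])
    return [
        [int(all(doc[y][x] == 1 for doc in documents)) for x in range(width)]
        for y in range(height)
    ]


def negate_form(document, mask):
    """Clear ink wherever the mask says the pixel belongs to the shared form."""
    return [
        [0 if mask_bit else pixel for pixel, mask_bit in zip(row, mask_row)]
        for row, mask_row in zip(document, mask)
    ]


# Two filled-in copies of the same hypothetical form.
# The middle row is a ruled line that appears in both.
doc_a = [[1, 0, 0, 0],
         [1, 1, 1, 1],
         [0, 0, 1, 0]]
doc_b = [[0, 0, 1, 0],
         [1, 1, 1, 1],
         [0, 1, 0, 0]]

mask = find_form_mask([doc_a, doc_b])
cleaned_a = negate_form(doc_a, mask)
```

After negation only the ink unique to each page remains. One limitation of this naive version is that any filled-in stroke overlapping a form line is cleared along with the line.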
If you have questions on how BeyondRecognition can help you manage your supposedly “unstructured” content, please contact us at IGDoneRight@BeyondRecognition.net.