OCR is often used to obtain text from image-only files so that those files can be classified or made searchable. However, several limitations of OCR result in inaccurate or missing text or make classification difficult:
- Font Size. OCR may not convert characters with very large or very small font sizes. This can make the most important characters and words unavailable for text-based systems.
- Uni-Dimensional. With OCR, individual words have only one dimension: they are either before or after other words. Plain OCR text output does not catalog page coordinate information for characters, even though page coordinates can be quite useful for classification and attribution.
- Sequential Editing. OCR errors typically have to be corrected sequentially, with the same errors being edited repeatedly. Global spell checking can introduce other errors.
- Case Sensitivity for Editing. Using spell checking to correct OCR text typically does not allow the case of the letters to be considered, e.g., cat and CAT will be treated alike.
- Non-Textual Glyphs. Important non-textual characters or glyphs, e.g., logos or map symbols, often do not get converted to characters by OCR, leaving them invisible to text analytics and text-based retrieval.
- Languages. Many languages have special characters, and unless the correct OCR software is loaded, those characters can be lost or incorrectly recognized.
- Non-Symmetrical DPI for Faxes. Faxes are often stored in files where the number of dots per inch horizontally is not the same as the DPI vertically, and OCR engines can have difficulty with this non-symmetrical DPI.
- Incorrect Document Boundaries. Image-only files often contain multiple documents per file, and OCR does not provide a way to correct document boundaries. This causes downstream problems for systems that classify files by comparing the words they contain. Some files are missed, and those that are classified can be misclassified. For more information, see the blog postings "Basic Assumptions Gone Wrong: ECM and Document Unitization" and "Information Governance Lessons from 4 AFEs and a Daily Drilling Report."
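To make the non-symmetrical DPI item above concrete, here is a minimal sketch of the arithmetic involved in bringing a fax page to the same DPI on both axes before OCR. The function name `symmetric_size` and the page dimensions are hypothetical and chosen only for illustration; 204 by 98 DPI is used as an example of a commonly cited standard-mode fax resolution. The sketch only computes the target pixel size; the actual resampling would be done with whatever imaging library is in use.

```python
def symmetric_size(width_px, height_px, x_dpi, y_dpi):
    """Return (new_width, new_height, dpi) for a page resampled to equal DPI on both axes.

    Hypothetical helper for illustration; not taken from any OCR product.
    """
    if x_dpi <= 0 or y_dpi <= 0:
        raise ValueError("DPI values must be positive")
    # Upsample the coarser axis rather than discard detail from the finer one.
    target_dpi = max(x_dpi, y_dpi)
    new_width = round(width_px * target_dpi / x_dpi)
    new_height = round(height_px * target_dpi / y_dpi)
    return new_width, new_height, target_dpi


# Illustrative page: 1728 x 1078 pixels stored at 204 x 98 DPI.
print(symmetric_size(1728, 1078, 204, 98))  # (1728, 2244, 204)
```

In this example the horizontal axis is left alone and the vertical axis is stretched by 204/98, so characters regain roughly normal proportions. Choosing the higher of the two DPI values is a design assumption here; downsampling to the lower value is also possible but throws away horizontal detail.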
A better approach to classification is visual classification, which draws on what documents actually look like, a richer source of information than extracted text alone. It provides scalable, consistent classification of all types of document files.
This posting is based on the book Guide to Managing Unstructured Content, Practical Advice on Gaining Control of Unstructured Content, due out later this summer. You can sign up at http://beyondrecognition.net/guide-to-managing-unstructured-content/ to receive your copy of the book when it is available.