Most file auto-classification systems rely on each file having an accurate textual representation. Organizations that use those auto-classification systems need to be aware of several problems with a text-reliant approach:
- Ignoring Non-Textual Files. Many files have no text associated with them, e.g., files output as PDF or TIF files from user software, or captured as image-only documents by scanning or faxing. This may be a minor issue in some collections, but in others non-textual files may account for appreciable percentages of all files that need to be classified. At the very least, the percentage of non-text files ought to be measured to help determine what sort of remedial effort may be justified.
- Misclassifying Poor-Quality Text Files. Text layers can be created by optical character recognition (“OCR”) software, but the resulting text can be riddled with errors, making text-based classification unreliable. One area of particular concern is classifying all versions of the same document consistently, e.g., being able to classify the original Word document the same way as the PDF version and the scanned TIF version.
- Sentence Dependence. Some auto-classification systems analyze text as presented in sentences and ignore non-sentence text. As a result, they fail to accurately classify documents like checklists, spreadsheets, PowerPoint presentations, and many form-based documents.
- Language Dependence. Systems that seem to work fine with English documents may fail completely when presented with other languages that were not part of the original training sets or scripted rules. There can also be OCR conversion issues when there are language-specific characters that were not converted properly. Multiple languages will also cause obvious problems with approaches that are based on language-specific taxonomies. Machine translation of content may not yield the desired accuracy for classification purposes.
- Ignoring Numeric Text. Some text analytics or text search systems may completely ignore numeric strings and treat pages as if they did not contain any numeric values. An empty spreadsheet will be treated like a completed spreadsheet if all the data cell values are numeric.
- Missing Embedded Documents. Virtually all text classification systems assign one document type classification per file, even when a file contains multiple documents. This problem is compounded because there is also only one set of related fields or tags for properties like document date, author, subject, etc. Only one document gets classified and coded. Further, systems that weight a file's terms against the terms used across the file population will be counting terms from embedded documents that should not be attributed to the primary document.
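As a starting point for the audit suggested in the first bullet, the share of files with no text layer can be estimated before committing to remediation. A minimal sketch in Python, using extension screening as a rough first pass (the extension set is an illustrative assumption; a real audit would attempt actual text extraction on each file rather than trust extensions):

```python
import os

# Extensions that typically carry no machine-readable text layer.
# This set is an illustrative assumption, not an exhaustive list.
IMAGE_ONLY_EXTS = {".tif", ".tiff", ".png", ".jpg", ".jpeg", ".bmp"}

def non_text_percentage(root):
    """Walk a collection and estimate what share of files lack text."""
    total = non_text = 0
    for _dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            total += 1
            ext = os.path.splitext(name)[1].lower()
            if ext in IMAGE_ONLY_EXTS:
                non_text += 1
    return 100.0 * non_text / total if total else 0.0
```

Running this over a collection gives a single percentage that helps decide whether OCR or visual classification effort is justified for that collection.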
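The numeric-text problem above is easy to reproduce. A minimal sketch, assuming a simplified tokenizer that, like many text pipelines, keeps only alphabetic tokens (the sample spreadsheet text is invented for illustration):

```python
import re

def alpha_tokens(text):
    # Keep only alphabetic runs; numbers vanish, as in text pipelines
    # that index words but discard numeric values.
    return re.findall(r"[A-Za-z]+", text)

# A blank spreadsheet template versus a completed one.
empty_sheet = "Item Jan Feb Mar\nWidgets\nGadgets"
filled_sheet = "Item Jan Feb Mar\nWidgets 100 200 300\nGadgets 40 50 60"

# Both yield the identical token stream, so a classifier relying on
# these tokens cannot tell the empty template from the filled report.
```

Here `alpha_tokens(empty_sheet)` and `alpha_tokens(filled_sheet)` are identical, which is exactly why a number-blind pipeline conflates the two files.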
Visual file or document classification avoids all the above limitations of text-based file classification. Using an approach that works like facial recognition for documents, files can be classified consistently without the use of text, even on enterprise-scale collections with hundreds of classifications. More at LINK.
This posting is based on the book Guide to Managing Unstructured Content: Practical Advice on Gaining Control of Unstructured Content, due out later this summer. You can sign up at https://beyondrecognition.net/guide-to-managing-unstructured-content/ to receive your copy of the book when it is available.