“Unstructured” content is a term used to describe the seemingly infinite types of documents that can be found in file shares and personal computing devices. In my last posting I considered some of the differences between structured and unstructured content and discussed how enterprise content management systems represent an attempt to provide the advantages of structured systems for managing “unstructured” content. This posting suggests a best practices approach to doing that: classification and attribution.
The objective in managing unstructured content is to be able to overlay or impose some type of structure or organization. The wide variety of types of documents causes two types of issues: (1) how to classify those documents into groups of similar documents, and then (2) how to differentiate among the documents within a group.
First Issue: File Classification
Once files are accurately classified, virtually all basic information governance tasks are made far easier. Content managers can:
- Discard unneeded files
- Set retention periods for preserved records
- Designate who has access to them
- Determine where they should be stored
- Detect & protect PII
Having consistent classifications tells us a lot about the types of information in the documents within those classifications, and enables us to differentiate among the documents sharing a classification. For example, it will be useful for many purposes to have distinct classifications for agreements, invoices, inspection reports, and tax filings. For tax filings, the following attributes will be useful:
- Tax year
- Tax jurisdiction
- Type of tax
- Reporting entity
For inspection reports the following attributes may be useful:
- Type of inspection
- Date of inspection
- Inspector name
- Inspecting company
- Property or item inspected
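The idea that each classification carries its own attribute list can be sketched as a simple lookup. The attribute names below come from the two lists above; the dictionary layout, the key spellings, and the `missing_attributes` helper are illustrative assumptions, not part of any particular product.

```python
# Attribute names follow the lists above; the structure itself is hypothetical.
ATTRIBUTE_SCHEMAS = {
    "tax_filing": [
        "tax_year",
        "tax_jurisdiction",
        "type_of_tax",
        "reporting_entity",
    ],
    "inspection_report": [
        "type_of_inspection",
        "date_of_inspection",
        "inspector_name",
        "inspecting_company",
        "property_or_item_inspected",
    ],
}


def missing_attributes(classification, record):
    """Return the attributes the schema expects but the record lacks."""
    expected = ATTRIBUTE_SCHEMAS.get(classification, [])
    return [name for name in expected if not record.get(name)]


# A partially attributed tax filing (values are made up for illustration).
filing = {"tax_year": "2013", "tax_jurisdiction": "Texas"}
print(missing_attributes("tax_filing", filing))
```

Because the schema is keyed by classification, knowing the document type immediately tells us which attributes to look for and which are still missing.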
The good news in performing the high-level classification is that documents that perform similar functions within an organization tend to look alike. While there is no one structure for all the documents in a collection, there are definable structures within the individual groupings. In other words, “unstructured” content is more accurately described as heterogeneously structured.
The causes for similarity within document types include:
- Use of templates
- Organization or industry conventions or practices
- System-generated reports and notices
- Use of vendor- or third-party-provided documents
Traditional Text-Based Classification
Traditional text-based approaches attempt to perform both classification and attribute extraction functions based on an analysis of the text values within the files. The typical project workflow looks something like the following:
The key points of the text-based workflow are:
Development Time: Development can take several months to a year or more, most of which is spent on obtaining consistent classifications with minimal false positives. Once consistent classification is achieved, the process of attribute extraction is considerably easier.
Tuning: Text classification involves establishing and maintaining a type of scoring ecosystem. For example, one way to be sure that no emails are classified as something else (i.e., no false negatives) is to classify everything as emails, but that causes numerous false positives. With text-based systems, tuning the classification rules is an ever-expanding, never-ending problem. The larger the collection, the more difficult this becomes.
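The "classify everything as emails" trade-off can be made concrete with a small calculation. This is a minimal sketch using standard precision and recall definitions; the ten-file sample and the function name are made up for illustration.

```python
def precision_recall(actual, predicted, label):
    """Compute precision and recall for one label from paired lists."""
    pairs = list(zip(actual, predicted))
    true_pos = sum(1 for a, p in pairs if a == label and p == label)
    false_pos = sum(1 for a, p in pairs if a != label and p == label)
    false_neg = sum(1 for a, p in pairs if a == label and p != label)
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    return precision, recall


# Hypothetical collection: 3 emails and 7 other documents.
actual = ["email"] * 3 + ["invoice"] * 4 + ["agreement"] * 3

# The degenerate rule from the text: call every file an email.
predicted = ["email"] * len(actual)

precision, recall = precision_recall(actual, predicted, "email")
print(f"precision={precision:.2f} recall={recall:.2f}")
```

No email is missed, so recall is perfect, but seven of the ten "emails" are false positives. Tuning is the ongoing work of moving between those two extremes.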
Text Limitations: Text-based classification is difficult enough assuming all files have high-quality text and all files contain just one document with all its pages. They don’t.
Cascading Error Rates: If the initial file classification is incorrect, the attribute extraction process may not attempt to extract all the right attributes, causing attribution error rates to balloon. As with classification tuning, error correction is a never-ending, ever-expanding problem that worsens as collections grow.
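A rough worked example shows how the two error rates compound. It assumes, as a simplification, that extraction always fails on a misclassified file; the accuracy figures are hypothetical and chosen only to show the arithmetic.

```python
def end_to_end_accuracy(classification_accuracy, extraction_accuracy):
    """Overall attribute accuracy when extraction depends on classification.

    Simplifying assumption: a misclassified file never yields correct
    attributes, so the two accuracies multiply.
    """
    return classification_accuracy * extraction_accuracy


# Hypothetical rates: 90% of files classified correctly, and 95% of
# attributes extracted correctly from correctly classified files.
overall = end_to_end_accuracy(0.90, 0.95)
print(f"overall attribute accuracy: {overall:.3f}")
print(f"overall attribute error rate: {1 - overall:.3f}")
```

Under these assumptions, a 5% extraction error rate becomes an overall error rate of roughly 14.5% once classification mistakes are included.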
New Classification Technology
Visual classification technology involves self-forming groups of visually-similar documents. The use of visual appearance normalizes files regardless of file type, e.g., Word, PDF, faxed, and scanned versions of the same documents will cluster together. Furthermore, visual classification does not require or use text values per se and can be used to determine logical document boundaries. This greatly reduces both classification development time and ongoing QC requirements.
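To give a feel for grouping by appearance rather than by text, here is a toy sketch. It is not the algorithm used by any visual classification product; it swaps in a simple average-hash signature with greedy grouping by Hamming distance. The 4x4 grayscale grids, file names, and threshold are all made-up assumptions standing in for rendered page images.

```python
def average_hash(grid):
    """Turn a grayscale grid into bits: 1 where a pixel is above the mean."""
    pixels = [value for row in grid for value in row]
    mean = sum(pixels) / len(pixels)
    return tuple(1 if value > mean else 0 for value in pixels)


def hamming(a, b):
    """Count the positions where two equal-length signatures differ."""
    return sum(x != y for x, y in zip(a, b))


def cluster_pages(pages, max_distance=2):
    """Greedily group pages whose signatures are within max_distance bits."""
    clusters = []  # each entry: (representative signature, member names)
    for name, grid in pages.items():
        signature = average_hash(grid)
        for representative, members in clusters:
            if hamming(representative, signature) <= max_distance:
                members.append(name)
                break
        else:
            clusters.append((signature, [name]))
    return [members for _, members in clusters]


# Hypothetical downsampled pages: a scan and a PDF of the same invoice
# layout (slightly different pixel values), plus an unrelated letter.
pages = {
    "invoice_scan.tif": [[250] * 4, [20, 20, 250, 250], [250] * 4, [20] * 4],
    "invoice.pdf": [[255] * 4, [0, 0, 255, 255], [255] * 4, [0, 0, 0, 10]],
    "letter.docx": [[0] * 4, [255] * 4, [0] * 4, [0] * 4],
}
print(cluster_pages(pages))
```

The scanned and PDF invoices land in the same group even though their file types and exact pixel values differ, while the letter forms its own group. No text is read at any point.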
Differentiating Among Files within Document Types: Attribution
Extracted file attributes are data elements that differentiate documents from each other within the clusters of visually-similar documents, e.g., the names of borrowers on loan files, or API well numbers on well logs. Once files are clustered based on visual similarity, attribute extraction is considerably easier because of the consistency of content among the clustered files.
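Once documents are grouped, extraction rules can be written per cluster. The sketch below assumes text is available for the clustered files and uses hypothetical patterns: a hyphenated 10-digit form for the API well number and a "Borrower:" label for loan files. Real documents will need patterns matched to their actual layouts.

```python
import re

# Hypothetical per-cluster patterns; cluster names and formats are assumptions.
CLUSTER_PATTERNS = {
    "well_log": {
        "api_well_number": re.compile(
            r"API\s*(?:No\.?|Number)?\s*:?\s*(\d{2}-\d{3}-\d{5})"
        ),
    },
    "loan_file": {
        "borrower_name": re.compile(r"Borrower:\s*(.+)"),
    },
}


def extract_attributes(cluster, text):
    """Apply only the patterns registered for the document's cluster."""
    found = {}
    for name, pattern in CLUSTER_PATTERNS.get(cluster, {}).items():
        match = pattern.search(text)
        if match:
            found[name] = match.group(1).strip()
    return found


# Made-up sample text for illustration.
well_log_text = "WELL LOG\nAPI No. 42-501-20130\nOperator: Example Oil Co."
loan_text = "LOAN APPLICATION\nBorrower: Jane Doe\nAmount: 100,000"

print(extract_attributes("well_log", well_log_text))
print(extract_attributes("loan_file", loan_text))
```

Because every file in a cluster shares the same layout, a small set of narrow patterns per cluster can replace one large, fragile rule set for the whole collection.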
Guide to Managing Unstructured Content
For more information on managing unstructured content, you can download your free, personal-use copy of our Guide to Managing Unstructured Content at: http://beyondrecognition.net/download-john-martins-guide-to-managing-unstructured-content/