Writers in the information management space often speak of structured vs. unstructured data and then analyze documents as if they were “unstructured.” However, when documents are clustered by visual similarity, they are actually fairly structured within clusters, e.g., invoices, letters, and emails each have recurring attributes or data elements located in generally the same place in the documents in that cluster.
It is more accurate to say that documents are heterogeneously structured – once they are clustered into groups of visually similar documents, each cluster has recurring attributes or data elements.
Because many documents have no text or poor-quality text, the key is to be able to cluster based on visual similarity so that the clustering can address all the documents in a collection, i.e., native files as well as scanned paper copies, documents with text and documents without text.
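To make the idea of text-free clustering concrete, here is a minimal sketch, not BR's actual method, which the article does not describe. It assumes each page has already been rasterized into a small grid of 0/1 "ink" values, summarizes each page as a coarse layout signature, and groups pages greedily by signature distance. The names `make_page`, `layout_signature`, `cluster_pages`, the 4x4 grid, and the 0.1 threshold are all illustrative assumptions.

```python
from typing import List, Sequence, Tuple

Page = List[List[int]]  # toy stand-in for a rasterized page: 1 = ink, 0 = blank


def make_page(blocks: Sequence[Tuple[int, int, int, int]],
              rows: int = 8, cols: int = 8) -> Page:
    """Build a toy page with ink in each (row0, row1, col0, col1) block."""
    page = [[0] * cols for _ in range(rows)]
    for r0, r1, c0, c1 in blocks:
        for r in range(r0, r1):
            for c in range(c0, c1):
                page[r][c] = 1
    return page


def layout_signature(page: Page, grid: int = 4) -> List[float]:
    """Fraction of inked pixels in each cell of a grid x grid overlay."""
    rows, cols = len(page), len(page[0])
    signature = []
    for gr in range(grid):
        for gc in range(grid):
            r0, r1 = gr * rows // grid, (gr + 1) * rows // grid
            c0, c1 = gc * cols // grid, (gc + 1) * cols // grid
            cell = [page[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            signature.append(sum(cell) / len(cell))
    return signature


def distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean absolute difference between two signatures."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def cluster_pages(pages: Sequence[Page], threshold: float = 0.1) -> List[int]:
    """Greedy clustering: join the first cluster whose founding page is
    within the threshold, otherwise start a new cluster."""
    founders: List[List[float]] = []
    labels: List[int] = []
    for page in pages:
        sig = layout_signature(page)
        for idx, founder in enumerate(founders):
            if distance(sig, founder) <= threshold:
                labels.append(idx)
                break
        else:
            founders.append(sig)
            labels.append(len(founders) - 1)
    return labels


# Two invoice-like layouts (header block top right, table at the bottom)
# and one letter-like layout (a body block in the middle of the page).
invoice_a = make_page([(0, 2, 4, 8), (6, 8, 0, 8)])
invoice_b = make_page([(0, 2, 4, 8), (6, 8, 0, 8), (3, 4, 3, 4)])
letter = make_page([(2, 6, 1, 7)])

print(cluster_pages([invoice_a, letter, invoice_b]))  # [0, 1, 0]
```

No text is read at any point, so the same procedure applies to a scanned page with no text layer and to a native file rendered to an image. A production system would need a far richer visual representation and a more robust clustering method than this greedy pass.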
BR’s visual classification technology has several advantages over text-restricted technology:
- Rapid project startup and completion – BR’s visual clusters are self-forming: there are no rules to write to try to force documents into classifications, no exemplars to identify, and no seed sets to select.
- Scalability – BR can simply add cores or servers to scale the process.
- Document unitization – accurate classification, analysis, and subsequent retrieval depend on having proper document boundaries. BR learns what the first pages look like in a collection and can then create logical document boundaries within files. BR’s approach has been shown to be more accurate and consistent than manual document unitization. See “Information Governance Lessons from 4 AFEs and a Daily Drilling Report.”
- Attribute extraction – within clusters, BR can extract attributes that are displayed in the document, e.g., well number, date, or geotags. This is far faster and typically more accurate than manual coding.
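The unitization and attribute-extraction points above can be sketched in a few lines. This is an illustration under stated assumptions, not BR's implementation. It assumes that some upstream step has already flagged which pages look like first pages, and that each cluster has a hypothetical zone template mapping an attribute name to the page region where it usually appears. The function names, the zone coordinates, and the sample values are all made up for the example.

```python
from typing import Dict, List, Tuple

Word = Tuple[str, float, float]            # (text, x, y), coordinates as fractions of page size
Zone = Tuple[float, float, float, float]   # (x_min, y_min, x_max, y_max)


def unitize(first_page_flags: List[bool]) -> List[List[int]]:
    """Group page indexes into logical documents.

    A new document starts at every page flagged as a first page; pages
    before the first flag are kept together as one leading document.
    """
    documents: List[List[int]] = []
    for index, is_first in enumerate(first_page_flags):
        if is_first or not documents:
            documents.append([])
        documents[-1].append(index)
    return documents


def extract_attributes(words: List[Word], zones: Dict[str, Zone]) -> Dict[str, str]:
    """Collect the words that fall inside each named zone of a cluster template."""
    result: Dict[str, str] = {}
    for name, (x0, y0, x1, y1) in zones.items():
        hits = [text for text, x, y in words if x0 <= x <= x1 and y0 <= y <= y1]
        result[name] = " ".join(hits)
    return result


# A five-page file in which pages 0 and 3 look like first pages.
print(unitize([True, False, False, True, False]))  # [[0, 1, 2], [3, 4]]

# Hypothetical template for one cluster: date at top left, well number at top right.
report_zones = {
    "well_number": (0.6, 0.0, 1.0, 0.1),
    "date": (0.0, 0.0, 0.4, 0.1),
}
# Made-up words with positions, as a recognition step might report them.
page_words = [
    ("2015-03-02", 0.1, 0.05),
    ("W-1042", 0.8, 0.05),
    ("Mud", 0.2, 0.5),
    ("weight", 0.3, 0.5),
]
print(extract_attributes(page_words, report_zones))
# {'well_number': 'W-1042', 'date': '2015-03-02'}
```

A template like this only works because documents within a cluster place the same attribute in generally the same location, which is why extraction follows clustering rather than preceding it.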
Conclusion: All information management initiatives depend on having accurate, consistent classification of the documents being managed. Classification by visual similarity is the only technology that can comprehensively classify all the documents in an organization, including those with no text or poor-quality text.