Simple text search architecture, in which every non-noise word of every document is indexed, doesn’t work well at enterprise scale. It consumes considerable IT resources and, from an end-user perspective, returns large numbers of irrelevant results. It may work on small personal collections, where wading through dozens of search results to find what you want is not too burdensome, but it is unsustainable when collections run to millions or billions of documents and result sets can number in the tens of thousands.
The initial approach to reducing index size is to minimize the number of files being indexed. This can be done by eliminating exact hash duplicates and implementing an effective document disposition program. It can be extended to visual duplicates, i.e., indexing only one instance of content that is stored in several file formats, such as the same document in Word, PDF, and TIFF.
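As a rough illustration of the exact-hash step, the sketch below keeps one file per unique content hash and records which files were skipped as duplicates. The function names, the in-memory `files` mapping, and the choice of SHA-256 are assumptions made for the example, not details from any particular product. Visual duplicates (same content, different format) would need a different comparison and are not covered here.

```python
import hashlib
from collections import defaultdict


def content_hash(data: bytes) -> str:
    """Hash the raw bytes; identical bytes always give identical digests."""
    return hashlib.sha256(data).hexdigest()


def select_files_to_index(files: dict[str, bytes]):
    """Pick one path per unique content (hypothetical helper).

    `files` maps a path to its bytes. Returns (paths_to_index, duplicates),
    where `duplicates` maps each kept path to the paths skipped in its favor.
    """
    first_seen = {}                 # digest -> first path with that content
    duplicates = defaultdict(list)  # kept path -> skipped duplicate paths
    for path in sorted(files):      # sorted so the kept copy is deterministic
        digest = content_hash(files[path])
        if digest in first_seen:
            duplicates[first_seen[digest]].append(path)
        else:
            first_seen[digest] = path
    return sorted(first_seen.values()), dict(duplicates)
```

Keeping the duplicate map means a hit on the indexed copy can still be traced to every location where the same content is stored.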
Those incremental approaches are only part of an effective solution. Another approach is to revise search architecture to make use of the different values that words have when they perform two different functions: (1) identifying common document types, and (2) differentiating among documents within a type. For example, analyzing the words used in common contractual clauses can help separate contracts from other documents, e.g., well inspection reports or performance evaluations. However, once the contracts are all grouped together, searching for the term “venue” won’t be of much help in differentiating among agreements, as it will be in practically all of them.
With classification-based text indexing, one index points to the words that are useful in differentiating among documents of the same type, and a second index is used to find words that are common within individual document types. The key is to first classify documents and then determine which words have low differentiation value within each classification. Those low-differentiation-value terms are then used to create cluster-based document surrogates that hold the common or shared terms.
Instead of word-level pointers to every instance of the common words, the surrogate documents carry document-level pointers back to the original files that contained the surrogate’s terms.
This approach greatly reduces the number of terms in the primary text index. Because there would be only a limited number of surrogate documents per classification, the cluster surrogate index also stays quite manageable. A search hit on any surrogate document could be correlated to all the documents in the classification.
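The two-index idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a description of any vendor’s implementation: documents arrive already classified, there is one surrogate per document type, a term counts as “common” when it appears in at least `min_share` of the documents of that type, and tokenizing is a plain whitespace split. All names (`build_indexes`, `search`, `min_share`) are hypothetical.

```python
from collections import defaultdict


def tokenize(text):
    """Naive tokenizer for the sketch: lowercase, split on whitespace."""
    return set(text.lower().split())


def build_indexes(docs, min_share=0.9):
    """docs maps doc_id -> (doc_type, text); classification is assumed done.

    Returns (primary, surrogate_terms, members):
      primary         term -> doc ids, differentiating terms only
      surrogate_terms term -> doc types whose surrogate holds the term
      members         doc type -> doc ids (document-level pointers)
    """
    by_type = defaultdict(dict)
    for doc_id, (doc_type, text) in docs.items():
        by_type[doc_type][doc_id] = tokenize(text)

    primary = defaultdict(set)
    surrogate_terms = defaultdict(set)
    members = {}
    for doc_type, tokenized in by_type.items():
        members[doc_type] = set(tokenized)
        doc_freq = defaultdict(int)  # docs of this type containing the term
        for terms in tokenized.values():
            for term in terms:
                doc_freq[term] += 1
        # Low-differentiation terms: present in most documents of the type.
        common = {t for t, n in doc_freq.items()
                  if n / len(tokenized) >= min_share}
        for term in common:
            surrogate_terms[term].add(doc_type)
        for doc_id, terms in tokenized.items():
            for term in terms - common:
                primary[term].add(doc_id)
    return primary, surrogate_terms, members


def search(term, primary, surrogate_terms, members):
    """Union of primary hits and every document behind a surrogate hit."""
    term = term.lower()
    hits = set(primary.get(term, set()))
    for doc_type in surrogate_terms.get(term, set()):
        hits |= members[doc_type]
    return hits
```

With `min_share` below 1.0, a surrogate hit is correlated to every document in the classification, including the few that may not contain the term. That is the trade-off for storing the common term once per type rather than once per occurrence.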
Having a classification available for every file is important for several reasons:
- Context. Document types provide inherent context. For example, an invoice memorializes a sale and purchase. An engineering drawing details how something was to be or was built. A change order involves changes to construction. Each context will have different types of actors involved, e.g., buying or selling agent, engineer, construction manager.
- Results Evaluation. Being able to provide summaries of the types of documents returned by searches helps users narrow down the result set to those documents most likely to be wanted. Users can often exclude large percentages of the search results by knowing what the document type classifications were.
- Search Specificity. Users waste a lot of time and ingenuity figuring out what search terms and logic to use to find specific document types without getting burdensome false hits. Consistent document typing lets them skip that aggravating and wasteful process.
- Screened File Access. With consistent classification, users’ access to enterprise files can be limited to the type needed to perform their job function, greatly reducing not only the results they have to review to find what they want, but also reducing security risks.
- Disposition. In the absence of consistent classification, scheduled content disposition is mostly an aspirational goal, not a reality.
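The results-evaluation point above can be illustrated with a small sketch: given search results that already carry a document type, summarize the counts per type and let the user drop the types they do not want. The result format, a list of `(doc_id, doc_type)` pairs, and the function names are assumptions made for the example.

```python
from collections import Counter


def summarize_by_type(results):
    """Count how many results fall under each document type."""
    return Counter(doc_type for _, doc_type in results)


def exclude_types(results, unwanted):
    """Drop results whose type the user has ruled out, preserving order."""
    return [(doc_id, doc_type) for doc_id, doc_type in results
            if doc_type not in unwanted]
```

A user who sees that most hits are invoices, for instance, can exclude that type in one step rather than refining the query terms.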
For more information on managing unstructured content, download your free personal copy of the book, Guide to Managing Unstructured Content at: https://beyondrecognition.net/download-john-martins-guide-to-managing-unstructured-content/
“Disambiguation and Role Determination in ‘Unstructured’ Content,” (April 2015): https://beyondrecognition.net/disambiguation-and-role-determination-in-unstructured-content/