Imagine the internet with great search functionality but no hyperlinks. You could locate any individual page or at least have it included in extensive search results, but then you’d have to conduct other searches to find related pages, even on the same website. Not very useful, right? The point is that text search functionality alone is not a very efficient or effective way of working with or analyzing unstructured content.
Of course, most websites are highly curated, with considerable thought given to providing the best navigation within each site. It is not practical to expend the same amount of effort on each file in an enterprise-scale unstructured content collection. Some sort of automated linking or associating of related content is needed.
The first step in enhancing the usability of unstructured content is to classify all such files and documents. With stable, consistent classifications, users don’t have to experiment with creating the perfect search syntax to find desired document types without including a lot of noise or unwanted files in the results. For example, having a consistent classification for “Agreements” would make it far easier to identify contracts with a specific party without having to view all the other documents that mention the words in the name of the party.
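As a rough illustration of the difference a stable classification makes, the sketch below compares an unaided word search with the same search restricted to one classification. The `Document` type, the sample documents, and the function names are all hypothetical and invented for this example; they do not describe any particular product.

```python
from dataclasses import dataclass


@dataclass
class Document:
    doc_id: str
    classification: str  # e.g. "Agreements", "Invoices"
    text: str


# Hypothetical sample collection, invented for illustration only.
DOCS = [
    Document("d1", "Agreements",
             "This Agreement is made between Acme Supply Co. and Borealis Ltd."),
    Document("d2", "Emails",
             "Lunch with the Acme team went well. Supply room keys are at reception."),
    Document("d3", "Invoices",
             "Invoice 1001 from Acme Supply Co. for services rendered."),
]


def contains_all(text, words):
    """True if every word occurs somewhere in the text (case-insensitive)."""
    lowered = text.lower()
    return all(word.lower() in lowered for word in words)


def full_text_search(docs, words):
    """Unaided search: any document mentioning all the words is a hit."""
    return [d.doc_id for d in docs if contains_all(d.text, words)]


def classified_search(docs, classification, words):
    """Same word test, but only within one classification."""
    return [d.doc_id for d in docs
            if d.classification == classification and contains_all(d.text, words)]


print(full_text_search(DOCS, ["Acme", "Supply"]))                 # ['d1', 'd2', 'd3']
print(classified_search(DOCS, "Agreements", ["Acme", "Supply"]))  # ['d1']
```

The unaided search returns the email and the invoice simply because they mention the words in the party's name; restricting the search to "Agreements" leaves only the contract.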
The next step is to make it easy to move from specific documents in one classification to the relevant documents in other classifications. This is done by identifying the key data elements or facets within each classification and then leveraging the facets that overlap or are shared between classifications. For example, Agreements and Change Orders share the names of the parties and the title or name of the Agreement. This lets users navigate from a specific Agreement to the related Change Orders. Viewing the Change Orders would in turn let users navigate to the form authorizing people to request or approve the change orders, all without having to run text searches across the entire contents of the collection. Alternatively, users could navigate from the Agreement to the Invoices submitted under it. The Invoice Numbers would then let users obtain the amounts invoiced or the dates of the invoices.
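The navigation described above can be sketched as a lookup on shared facet values. This is a minimal sketch, assuming each document already carries a dictionary of extracted facets; the facet names (`parties`, `agreement_title`, `invoice_number`, and so on) and the sample records are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    doc_id: str
    classification: str
    facets: dict  # facet name -> extracted value (names are assumptions)


# Hypothetical facet values, invented for illustration only.
PARTIES = frozenset({"Acme Supply Co.", "Borealis Ltd."})
TITLE = "Master Services Agreement"

DOCS = [
    Doc("a1", "Agreements", {"parties": PARTIES, "agreement_title": TITLE}),
    Doc("co1", "Change Orders", {"parties": PARTIES, "agreement_title": TITLE}),
    Doc("co2", "Change Orders", {"parties": PARTIES, "agreement_title": "Equipment Lease"}),
    Doc("inv1", "Invoices", {"agreement_title": TITLE, "invoice_number": "1001",
                             "amount": 2500.00, "invoice_date": "2023-04-01"}),
    Doc("inv2", "Invoices", {"agreement_title": "Equipment Lease", "invoice_number": "1002",
                             "amount": 900.00, "invoice_date": "2023-05-15"}),
]


def related(source, docs, target_classification, shared_facets):
    """Documents in the target classification whose shared facets all match the source."""
    def matches(candidate):
        return all(
            facet in source.facets
            and candidate.facets.get(facet) == source.facets[facet]
            for facet in shared_facets
        )

    return [d for d in docs
            if d.classification == target_classification
            and d is not source
            and matches(d)]


agreement = DOCS[0]
change_orders = related(agreement, DOCS, "Change Orders", ["parties", "agreement_title"])
invoices = related(agreement, DOCS, "Invoices", ["agreement_title"])

print([d.doc_id for d in change_orders])                                    # ['co1']
print([(d.facets["invoice_number"], d.facets["amount"]) for d in invoices])  # [('1001', 2500.0)]
```

No text search is involved at any point: each hop is an exact match on facet values, which is why unrelated documents that merely mention the same words never appear.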
Using the key facets or attributes of documents to provide linking or associating avoids the noise or unwanted hits that can occur with unaided full text search. Each document facet becomes a dimension or path that can be used to navigate within the many-dimensional representations of documents in a collection.
Identifying key facets or attributes is made easier by the fact that, within a classification, the same types of data elements tend to appear in the same relative positions in the documents and are identified by the same flags. For example, the term “Seller” is a flag that appears in the first paragraph of many Agreements, right after the company name of the seller.
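To make the "flag" idea concrete, here is a minimal sketch that pulls a party name out of the first paragraph of an agreement by looking for a capitalized name immediately followed by a parenthesized flag such as ("Seller"). The regular expression, the function name `extract_party`, and the sample text are assumptions made for this example; real agreements vary far more than this pattern allows.

```python
import re

# A run of capitalized words, e.g. "Acme Supply Co." (a simplifying assumption).
NAME = r"(?:[A-Z][\w&.'-]*\s+)*[A-Z][\w&.'-]*"


def extract_party(text, flag):
    """Return the name that directly precedes (flag) in the first paragraph, or None."""
    first_paragraph = text.split("\n\n", 1)[0]
    pattern = re.compile(
        "(" + NAME + ")"                       # the candidate company name
        + r'\s*\(\s*["“]?' + re.escape(flag)   # opening parenthesis, optional quote, flag
        + r'["”]?\s*\)'                        # optional closing quote, closing parenthesis
    )
    match = pattern.search(first_paragraph)
    return match.group(1) if match else None


# Hypothetical agreement text, invented for illustration only.
AGREEMENT = (
    'This Agreement is entered into by Acme Supply Co. ("Seller") '
    'and Borealis Ltd. ("Buyer").\n\n'
    "The Seller shall deliver the goods described in Exhibit A."
)

print(extract_party(AGREEMENT, "Seller"))  # Acme Supply Co.
print(extract_party(AGREEMENT, "Buyer"))   # Borealis Ltd.
```

Restricting the search to the first paragraph reflects the positional regularity noted above: later mentions of "Seller" in the body are not where the party is named.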
One big challenge to this sort of navigational approach is the necessary normalization or standardization of terms that can appear in enterprise content. Small differences in how the same underlying objects are described can cause problems. For example, abbreviations or acronyms can prevent the same items from being associated with each other. With visual classification, all of the unique terms used for specific attributes or facets can be reviewed and mapped to the desired term, regardless of what was used in the underlying files.
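One simple way to picture this mapping step is a reviewed lookup table from observed variants to a preferred term. The sketch below is an assumption about how such a table might be applied, not a description of any specific tool; the `CANONICAL` table, the function names, and the company names are hypothetical.

```python
import re

# Hypothetical reviewed mapping: normalized variant -> desired term.
CANONICAL = {
    "acme supply co": "Acme Supply Company",
    "acme supply company": "Acme Supply Company",
    "asc": "Acme Supply Company",
    "borealis ltd": "Borealis Limited",
}


def variant_key(term):
    """Lowercase, drop punctuation, and collapse whitespace so close variants collide."""
    without_punctuation = re.sub(r"[^\w\s]", "", term)
    return re.sub(r"\s+", " ", without_punctuation).strip().lower()


def normalize(term, mapping=CANONICAL):
    """Return the desired term if the variant has been mapped, else the trimmed original."""
    return mapping.get(variant_key(term), term.strip())


def unmapped_terms(terms, mapping=CANONICAL):
    """Unique terms that still need a reviewer's decision."""
    return sorted({t for t in terms if variant_key(t) not in mapping})


observed = ["Acme Supply Co.", "A.S.C.", "acme  supply company", "Zenith Inc."]

print([normalize(t) for t in observed])
print(unmapped_terms(observed))  # ['Zenith Inc.']
```

The `unmapped_terms` helper reflects the review step described above: anything not yet in the table is surfaced for a person to map, rather than being guessed at automatically.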
User interfaces should be able to use the classification facet as a way of focusing search results to navigate within unstructured content collections without having to use far more problematic and imprecise general full text search logic.
For related postings, see:
- Converting Unstructured to Structured Content
- Text-Dependency Limitations of Auto File-Classification
- The Four Key Dimensions of Purpose-Driven Data Quality
- Using Positional Frequency to Identify High and Low-Value Words
For more information on managing unstructured content, be sure to sign up to receive a copy of “Guide to Managing Unstructured Content, Practical Advice on Gaining Control of Unstructured Content” at https://beyondrecognition.net/guide-to-managing-unstructured-content/