…Hitting the Sweet Spot
The value-complexity curve provides a visualization of the value added to organizations by enterprise search. Initially value grows as the volume of content being managed grows. However, at some point enterprise search becomes more difficult to use as volume continues to expand and users experience increasingly cluttered and incomplete search results. Further layers of cost and complexity are then added for file classification and taxonomies in an attempt to overcome text-only challenges, but with limited success.
This posting takes a high-level look at the background of the normal value-complexity curve and suggests a way for organizations to hit the sweet spot on the curve where the appropriate level of cost and complexity yields the maximum value.
Full-text search is typically the primary retrieval tool used by enterprises but its use becomes very problematic as the collection scales. Two additional tools are typically implemented to aid in retrieval:
File classification. Consistent file classification helps minimize clutter, or false positives, i.e., files the searcher doesn’t want to see. Searchers can specify the document types they want and exclude those they don’t. Consistent file classifications are much more efficient at narrowing search results than the alternative of constructing complex Boolean text searches to include the wanted document types and exclude the unwanted ones.
The problem with most file classification technology is that it is inherently subjective. How effective it is depends on the knowledge and the skill of the person creating the rules to identify particular document types. Even when classifications work when written (which may take some time), they can include incorrect files or miss desired files as new types of content get added to the collection that were never initially considered or tested by the experts.
Taxonomies. Another approach to helping searchers is to develop taxonomies, or lists of words that help discriminate among document types. Again, these are expensive and time-consuming to develop, and whatever effectiveness they had when developed erodes as new types of content are added to collections.
Hitting the Sweet Spot
Visual file classification technology represents a new approach to file classification. It is based on using technology to objectively group files based on their visual appearance. The grouping is automatic without consultants or others writing rules or scripts or selecting exemplars. This objectivist approach makes it possible to have truly consistent file classification.
Subject matter experts review one or two files per grouping and apply document-type name labels to them using a user-definable document-type classification tree or matrix. Once classifications are established, all future files that fall within that visual grouping receive the same classification, ensuring consistent document-typing. By viewing groupings starting with the largest ones and working to the smallest, teams of subject matter experts can review and classify over 99% of the files in an organization in a matter of a week to 10 days.
New visual groupings may form when new content is added. In that case system administrators are alerted to those new groupings so they can be classified. Consistent, persistent classifications become the bedrock upon which all downstream operations depend, e.g., setting retention schedules, determining access rights or storage locations, detecting PII, etc.
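The review-and-propagate workflow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the visual grouping itself is taken as given (a hypothetical mapping from file name to grouping ID), and the function names are invented for this example rather than taken from any product.

```python
from collections import Counter


def review_order(file_groups):
    """Return (group_id, file_count) pairs, largest grouping first."""
    return Counter(file_groups.values()).most_common()


def coverage(file_groups, labeled_groups):
    """Fraction of all files whose grouping already has a label."""
    if not file_groups:
        return 0.0
    covered = sum(1 for g in file_groups.values() if g in labeled_groups)
    return covered / len(file_groups)


def propagate_labels(file_groups, group_labels):
    """Give every file the document type assigned to its grouping.

    Files in a grouping nobody has labeled yet map to None.
    """
    return {name: group_labels.get(g) for name, g in file_groups.items()}


def unlabeled_groups(file_groups, group_labels):
    """Groupings that still need review, e.g. ones formed by new content."""
    return sorted(set(file_groups.values()) - set(group_labels))


if __name__ == "__main__":
    # Hypothetical data: file name -> visual grouping ID (assumed input).
    files = {
        "a.pdf": "g1", "b.pdf": "g1", "c.pdf": "g1",
        "d.tif": "g2", "e.tif": "g2",
        "f.doc": "g3",
    }
    labels = {"g1": "Invoice", "g2": "Well log"}  # set by reviewers
    print(review_order(files))
    print(coverage(files, labels))
    print(propagate_labels(files, labels))
    print(unlabeled_groups(files, labels))
```

Because labels attach to groupings rather than to individual files, reviewing the largest groupings first raises coverage quickest, and any grouping that appears later simply shows up in the unlabeled list until someone classifies it.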
People conducting searches can reliably use document type to include or exclude certain files. Additionally, system administrators can use document types to control what files users are able to see, so search users will automatically see fewer false positives even if they don’t use document type as a search criterion.
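As a rough sketch of how the two uses of document type combine, the function below applies an administrator's allow-list first and the searcher's own include and exclude choices second. The hit structure and parameter names are assumptions made for this example, not the interface of any particular search engine.

```python
def visible_results(hits, allowed_types, include=None, exclude=()):
    """Filter search hits by document type.

    hits          -- list of dicts with at least a "doc_type" key (assumed shape)
    allowed_types -- types the administrator lets this user see
    include       -- types the searcher asked for, or None for "any"
    exclude       -- types the searcher asked to leave out
    """
    results = []
    for hit in hits:
        doc_type = hit["doc_type"]
        if doc_type not in allowed_types:
            continue  # access control: hidden regardless of the query
        if include is not None and doc_type not in include:
            continue  # searcher wanted other types
        if doc_type in exclude:
            continue  # searcher explicitly excluded this type
        results.append(hit)
    return results


if __name__ == "__main__":
    hits = [
        {"name": "a.pdf", "doc_type": "Invoice"},
        {"name": "b.pdf", "doc_type": "Loan application"},
        {"name": "c.tif", "doc_type": "Well log"},
    ]
    # This user may not see loan applications and did not filter further.
    print(visible_results(hits, allowed_types={"Invoice", "Well log"}))
```

Even when the searcher supplies no document-type criteria, the allow-list alone removes files the user should never see, which is where the automatic reduction in false positives comes from.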
Attribute extraction, or identifying specific data elements within document types (e.g., pulling the social security number from loan applications, or the well number from well logs), is also made far simpler by visual classification. Subject matter experts identify the data elements to extract from each document type, and document specialists simply click and drag to create the zones from which those elements are pulled.
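The zone idea can be illustrated with a small Python sketch. It assumes each page has already been reduced to words with center coordinates (for instance by an OCR step that is outside this example) and that each document type has a list of rectangular zones; the class names, coordinates, and sample values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    text: str
    x: float  # horizontal center of the word on the page
    y: float  # vertical center of the word on the page


@dataclass(frozen=True)
class Zone:
    field: str
    x0: float
    y0: float
    x1: float
    y1: float

    def contains(self, word):
        """True when the word's center falls inside the rectangle."""
        return self.x0 <= word.x <= self.x1 and self.y0 <= word.y <= self.y1


# Hypothetical zones, as a specialist might draw them for one document type.
ZONES_BY_TYPE = {
    "Well log": [Zone("well_number", 400, 40, 580, 70)],
}


def extract_attributes(doc_type, words):
    """Pull the text found inside each zone defined for this document type."""
    attributes = {}
    for zone in ZONES_BY_TYPE.get(doc_type, []):
        inside = [w for w in words if zone.contains(w)]
        # Simplifying assumption: top-to-bottom, then left-to-right order.
        inside.sort(key=lambda w: (w.y, w.x))
        attributes[zone.field] = " ".join(w.text for w in inside)
    return attributes


if __name__ == "__main__":
    page = [Word("WELL", 90, 20), Word("LOG", 140, 20),
            Word("42-501", 450, 55), Word("Depth", 50, 200)]
    print(extract_attributes("Well log", page))
```

Fixed zones like these only work when every file of a given type shares the same layout, which is why the approach depends on groupings that are visually consistent.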
Attribution based on visual classification is much easier and more dependable than attribution based on text-based classification, because text-based groupings of documents contain far more false negatives and are visually far more variable within groupings or clusters.
Visual classification and document attribution provide solid, consistent data that people conducting enterprise searches can use to achieve high-precision, high-recall searches without the need for expensive and time-consuming text-based classification systems or taxonomies. To the extent an organization wants to continue using taxonomies, visual classification can help identify terms that are unique or important in individual document types. Visual classification helps organizations find the sweet spot of the value-complexity curve.