“Unstructured” content is a term used to describe content stored on file shares, personal computing devices, and content management systems. A major challenge to making effective use of such content is that words can have multiple meanings, and a name can refer to more than one person. Even worse, there can be multiple forms of the same name. Even when there is only one name for a person, that person can have multiple roles in an organization, e.g., as supervisor, employee, United Way volunteer, scientist, inventor, profit center manager, health coverage insured, or a volunteer in a professional association. Documents related to some roles may be irrelevant for many purposes.

The end result is that trying to find relevant documents in unstructured content often produces too much clutter: too many irrelevant documents. Effectively using and searching unstructured content requires “disambiguation.” That term was made popular by Wikipedia, which has disambiguation pages to help users decide which article to read when multiple articles use the same term, e.g., “Mercury” as an element on the periodic table, as a planet, or as a character from mythology.

People trying to find specific content need to be able to disambiguate terms and people, and to identify when people are acting in particular roles. Fortunately, organizations use specific types of documents to initiate or record specific actions. Knowing which document types a search returned is invaluable in identifying the actions described or recorded in them and the roles involved. For example, a vacation request form involves a manager and an employee, an authorization for expenditure involves a profit center manager, and a lab report may involve a lab worker and a scientist.

Knowing document types tells us the contexts in which terms and names were used and those contexts help determine which documents are worth further examination. How names appear within document types also indicates what roles people are playing, e.g., on a vacation request form, the employee’s name appears in one area and the manager’s name appears in another.

Of course all of this is irrelevant if there is no way to consistently determine what document-type labels to associate with the documents in a collection. Fortunately, visual classification automatically groups or clusters both native electronic files and scanned paper documents based on their visual similarity, and document type labels can be permanently designated for those documents with only a modest amount of effort.

The document type field or metadata value then becomes an extremely powerful tool in differentiating between documents of potential interest and documents of no interest. Document types can be displayed in results or can be used as part of the search criteria.
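As a rough illustration of using document type as a search criterion, here is a minimal Python sketch. The `Document` class, the `search` function, and the sample documents are all hypothetical and invented for this example; they do not represent any particular product's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    doc_id: str
    doc_type: str  # label assumed to have been assigned after classification
    text: str


def search(docs, term, doc_types=None):
    """Return documents containing `term`, optionally limited to `doc_types`."""
    term_lower = term.lower()
    hits = [d for d in docs if term_lower in d.text.lower()]
    if doc_types is not None:
        allowed = set(doc_types)
        hits = [d for d in hits if d.doc_type in allowed]
    return hits


# Invented sample collection: the same name appears in different contexts.
docs = [
    Document("d1", "Vacation Request", "Employee: Pat Smith. Manager: Lee Jones."),
    Document("d2", "Lab Report", "Prepared by Pat Smith, lab worker."),
    Document("d3", "Authorization for Expenditure", "Approved by Lee Jones."),
]

all_hits = search(docs, "Pat Smith")
lab_hits = search(docs, "Pat Smith", doc_types=["Lab Report"])

print([d.doc_id for d in all_hits])  # ['d1', 'd2']
print([d.doc_id for d in lab_hits])  # ['d2']
```

The unrestricted search returns every document that mentions the name, while adding the document type narrows the results to the context of interest.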

Zonal attribute extraction within clusters of visually similar documents is a powerful way to associate names with specific roles or extract other document attributes. As with the document-type field or metadata value, these other document attributes can be displayed in search results or used as search criteria. Because all the values for specific attributes are extracted from the collection, term normalization or tables of equivalent values can be employed to improve the consistency of the underlying data elements. For example, there may be several ways that “well numbers” are entered (Well Number, Well#, API#), and once all of the entries are extracted and examined, strategies can be developed to normalize the entries or otherwise overcome the variability in naming conventions.
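The “table of equivalent values” idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the label table, the function names, and the sample values are hypothetical, and treating the three labels as equivalent simply follows the well-number example above.

```python
import re

# Hypothetical table of equivalent labels, as might be built after reviewing
# every label extracted from the collection.
EQUIVALENT_LABELS = {
    "well number": "well_number",
    "well#": "well_number",
    "api#": "well_number",
}


def normalize_label(raw_label):
    """Map a raw field label to its canonical name, or None if unrecognized."""
    # Lowercase and collapse runs of whitespace to a single space.
    cleaned = re.sub(r"\s+", " ", raw_label.strip().lower())
    # Drop a trailing colon and any space before '#'.
    cleaned = cleaned.rstrip(":").strip().replace(" #", "#")
    return EQUIVALENT_LABELS.get(cleaned)


def normalize_entries(entries):
    """Split (label, value) pairs into normalized and unrecognized lists."""
    normalized, unrecognized = [], []
    for label, value in entries:
        canonical = normalize_label(label)
        if canonical is None:
            unrecognized.append((label, value))
        else:
            normalized.append((canonical, value.strip()))
    return normalized, unrecognized


# Invented sample entries; the values are placeholders, not real identifiers.
extracted = [
    ("Well Number", "A-101"),
    ("Well#", " A-102"),
    ("API #:", "A-103"),
    ("Operator", "Example Co."),
]

normalized, unrecognized = normalize_entries(extracted)
print(normalized)
print(unrecognized)
```

Keeping unrecognized labels in a separate list, rather than discarding them, makes it easy to review them and extend the table as new variants turn up.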

Visual classification and zonal attribute extraction are invaluable in disambiguating search terms and identifying the roles in which names appear. Search efficiency and effectiveness can also be improved at an administrative level by permitting employees to see only those documents that pertain to their job function. Such access control can go a long way to removing clutter from search results.
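Access control by job function can be pictured as one more filter on search results. The sketch below is a simplified assumption of how such a rule might look; the job-function names, the mapping, and `visible_results` are hypothetical.

```python
# Hypothetical mapping from job function to the document types it may see.
ALLOWED_TYPES_BY_JOB = {
    "hr_specialist": {"Vacation Request"},
    "lab_scientist": {"Lab Report"},
    "finance_analyst": {"Authorization for Expenditure"},
}


def visible_results(results, job_function):
    """Keep only results whose document type is permitted for the job function."""
    # Unknown job functions see nothing (deny by default).
    allowed = ALLOWED_TYPES_BY_JOB.get(job_function, set())
    return [r for r in results if r["doc_type"] in allowed]


# Invented search results for illustration.
results = [
    {"doc_id": "d1", "doc_type": "Vacation Request"},
    {"doc_id": "d2", "doc_type": "Lab Report"},
    {"doc_id": "d3", "doc_type": "Authorization for Expenditure"},
]

hr_view = visible_results(results, "hr_specialist")
unknown_view = visible_results(results, "contractor")
print([r["doc_id"] for r in hr_view])       # ['d1']
print([r["doc_id"] for r in unknown_view])  # []
```

Denying by default for unknown job functions is a design choice made for this sketch; a real deployment would follow the organization's own access policy.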

Perhaps more important than assisting search, visual classification also permits the assignment of enforceable retention periods for specific document types, making it possible to help stem the ever-growing volume of retained documents. For information on how BeyondRecognition can assist your organization in managing its “unstructured” content, please email us at IGDoneRight@BeyondRecognition.net.

Related Content:

Blog, “Documents ARE Structured, Just Heterogeneously”

Blog, “How to Detect, Define, & Use an Enterprise Document Attribute Matrix”

