Documents in file shares, content management systems, and scanned archives are often described as “unstructured.” However, there is typically a high level of structure in and interconnectedness among those documents. This structure and interconnectedness occur because specific document types contain recurring attributes or data elements, and those attributes or data elements are shared with other document types.
The relationships can be represented in a matrix with document types across the top and document attributes or data elements down the left side. In the simplest form, check marks indicate which data elements are typically found in a given document type. In a large enterprise there may be hundreds of document types.
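As a rough illustration, the matrix can be sketched in a few lines of Python. The document types and attribute names here are hypothetical examples, not taken from any particular organization, and a real matrix would be far larger.

```python
# Hypothetical document types and attributes, for illustration only.
# Each document type maps to the set of attributes typically found in it.
MATRIX = {
    "Loan Application": {"Loan Number", "Borrower Name", "Borrower SSN", "Property Address"},
    "Credit Check Authorization": {"Loan Number", "Borrower Name", "Borrower SSN"},
    "Appraisal Report": {"Loan Number", "Property Address"},
}


def all_attributes(matrix):
    """Return every attribute that appears in any document type, sorted."""
    return sorted(set().union(*matrix.values()))


def shared_by(matrix, attribute):
    """Return the document types that contain the given attribute."""
    return [doc_type for doc_type, attrs in matrix.items() if attribute in attrs]


def render(matrix):
    """Build a text grid: document types across the top, attributes down the side."""
    doc_types = list(matrix)
    lines = [" | ".join(["Attribute"] + doc_types)]
    for attr in all_attributes(matrix):
        marks = ["x" if attr in matrix[dt] else "" for dt in doc_types]
        lines.append(" | ".join([attr] + marks))
    return "\n".join(lines)


if __name__ == "__main__":
    print(render(MATRIX))
    print(shared_by(MATRIX, "Property Address"))
```

An attribute that appears in several columns, such as the loan number in this sketch, is what ties the document types together.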
The following graphic shows a few document types with a few attributes as an example of how the document types and attributes are essentially woven together to form the information fabric of an organization:
Using Matrix Information
A major use of information from the document attribute matrix is to make content management systems more useful and efficient by providing fielded data to search and to use in creating reports about select documents in the CMS.
The cells in a document attribute matrix can also be mapped to fields and tables in business process systems. For example, a financial institution may have a database-driven loan processing application with a field to indicate whether a borrower authorized a credit check. The entry in that field should be supported by a credit check authorization form signed by the borrower. That form can be found via the matrix by searching the “credit check” document type for either the loan number or the borrower’s Social Security number.
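The credit check example can be sketched as a simple lookup. This is a minimal sketch under stated assumptions: the `Document` record, the field names (`loan_number`, `borrower_ssn`), and the sample values are all hypothetical and stand in for whatever the content management system actually stores.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """A classified document with its extracted attributes (hypothetical shape)."""
    doc_id: str
    doc_type: str
    attributes: dict = field(default_factory=dict)


def find_supporting(documents, doc_type, **keys):
    """Return documents of doc_type whose attributes match ANY supplied key value."""
    return [
        d for d in documents
        if d.doc_type == doc_type
        and any(d.attributes.get(name) == value for name, value in keys.items())
    ]


# Hypothetical sample data.
DOCS = [
    Document("d1", "Credit Check Authorization",
             {"loan_number": "L-1001", "borrower_ssn": "123-45-6789"}),
    Document("d2", "Appraisal Report", {"loan_number": "L-1001"}),
    Document("d3", "Credit Check Authorization",
             {"loan_number": "L-2002", "borrower_ssn": "987-65-4321"}),
]

if __name__ == "__main__":
    hits = find_supporting(DOCS, "Credit Check Authorization", loan_number="L-1001")
    print([d.doc_id for d in hits])
```

A loan record whose credit check field is set but for which this lookup returns nothing would be a candidate for follow-up.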
Detecting and Defining the Matrix
Identifying the document attribute matrix in use in an organization is an intensely practical exercise. It has to be data-driven, that is, based on which attributes are actually used, and it has to recognize the vagaries of how documents come into existence and are transmitted to the enterprise. The organization can define the document-type labels it uses and how they are organized, and it can define the labels and formats for document attributes, but at least in the near term it has to work with the data elements that are actually present.
First Step: Classify Documents by Document Type
The key to uncovering the de facto document attribute matrix that exists in an enterprise is to start by identifying the document types that populate the matrix. Visual classification technology provides the means to accomplish that end. It automatically clusters all documents, native electronic and scanned paper alike, by their visual appearance. This graphical analysis approach essentially normalizes content regardless of the types of files used to store individual copies of documents and regardless of the amount or quality of text associated with them. For example, visual classification would group all of the following documents in the same cluster:
- A Word file used to create the original document.
- A PDF created by saving the Word file to PDF format.
- A scanned TIF image of either the Word or PDF file that had been printed to paper and then scanned.
- A faxed image of the document.
The number of visually similar clusters in a document population is typically less than 1% of the number of documents, and the largest clusters contain most of the documents in an organization. Because the documents in a cluster are all alike, the entire cluster can be assigned a document type based on a review of just one or two documents per cluster. In most business units, well over 99% of the documents can be consistently classified within three days.
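The review step lends itself to a short sketch. The code below does not perform visual clustering; it assumes a cluster ID has already been assigned to each document by some classification tool, and only shows how one label per cluster propagates to every member. File names, cluster IDs, and function names are hypothetical.

```python
from collections import defaultdict


def group_by_cluster(assignments):
    """Turn {doc_id: cluster_id} into {cluster_id: [doc_id, ...]}."""
    clusters = defaultdict(list)
    for doc_id, cluster_id in assignments.items():
        clusters[cluster_id].append(doc_id)
    return dict(clusters)


def review_order(clusters):
    """List cluster IDs largest first, so reviewers cover the most documents early."""
    return sorted(clusters, key=lambda cid: len(clusters[cid]), reverse=True)


def apply_labels(clusters, cluster_labels):
    """Propagate each reviewed cluster's document type to all of its members."""
    return {
        doc_id: cluster_labels[cid]
        for cid, doc_ids in clusters.items()
        if cid in cluster_labels
        for doc_id in doc_ids
    }


# Hypothetical output of a clustering step: four copies of one form, one other file.
ASSIGNMENTS = {
    "auth.docx": 1, "auth.pdf": 1, "auth_scan.tif": 1, "auth_fax.tif": 1,
    "appraisal.pdf": 2,
}

if __name__ == "__main__":
    clusters = group_by_cluster(ASSIGNMENTS)
    print(review_order(clusters))
    print(apply_labels(clusters, {1: "Credit Check Authorization"}))
```

Clusters left out of the label mapping stay unlabeled, which matches the case where a cluster has not yet been reviewed or has been designated for disposition.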
The following graphic shows how order or structure is created from seeming chaos by visual classification. Each cluster can be reviewed, disposition decisions implemented, and document-type labels designated for clusters that will be retained.
One of the significant benefits of the classification step is that when organizations are able to designate non-record clusters for disposition, they free up considerable storage resources and make the remaining content much easier to work with.
Second Step: Identify and Extract Document Attributes
Once documents have been clustered visually, attributes or data elements can be extracted from them by clicking and dragging a box around the data of interest on one document in each cluster, with data being extracted from all members of the cluster by this action. When data elements are selected for extraction, the operator indicates the field in which to place the data and indicates how the data are to be formatted.
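One way to picture this is as a zone applied across a cluster. The sketch below is an assumption about how such extraction could work, not a description of any product's internals: each document is represented as a list of words with page coordinates, and the box drawn on one document is reused for every member of the cluster. All names and coordinates are hypothetical.

```python
from typing import NamedTuple


class Word(NamedTuple):
    """A word and the page position of its top-left corner (hypothetical units)."""
    text: str
    x: float
    y: float


class Box(NamedTuple):
    """The rectangle an operator drags around the data of interest."""
    left: float
    top: float
    right: float
    bottom: float


def extract_zone(words, box):
    """Join the words whose position falls inside the box, in reading order."""
    inside = [
        w for w in words
        if box.left <= w.x <= box.right and box.top <= w.y <= box.bottom
    ]
    inside.sort(key=lambda w: (w.y, w.x))
    return " ".join(w.text for w in inside)


def extract_from_cluster(cluster, box, field_name):
    """Apply one box to every document in the cluster, filing results under field_name."""
    return {doc_id: {field_name: extract_zone(words, box)} for doc_id, words in cluster.items()}


# Hypothetical cluster: two visually alike forms with the loan number in the same place.
CLUSTER = {
    "form_a": [Word("Loan", 10, 10), Word("No.", 40, 10), Word("L-1001", 80, 10), Word("Jane", 10, 40)],
    "form_b": [Word("Loan", 10, 10), Word("No.", 40, 10), Word("L-2002", 80, 10), Word("Raj", 10, 40)],
}
LOAN_NUMBER_BOX = Box(left=70, top=5, right=150, bottom=20)

if __name__ == "__main__":
    print(extract_from_cluster(CLUSTER, LOAN_NUMBER_BOX, "loan_number"))
```

The approach relies on the cluster members sharing a layout, which is why the clustering step comes first.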
Once data elements have been extracted, they can be normalized and loaded into whatever content management system or business process support system is in use.
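Normalization might look like the following minimal sketch, which assumes two hypothetical fields, a date and a Social Security number, and a short list of input formats chosen for illustration. Values that cannot be parsed are returned as `None` so they can be routed to manual review rather than loaded as-is.

```python
import re
from datetime import datetime

# Input date formats assumed for this example; a real project would derive
# the list from the documents actually encountered.
DATE_FORMATS = ("%m/%d/%Y", "%Y-%m-%d", "%B %d, %Y")


def normalize_date(raw):
    """Return the date as ISO 8601 (YYYY-MM-DD), or None if no format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def normalize_ssn(raw):
    """Return the SSN as NNN-NN-NNNN, or None unless exactly nine digits are present."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) != 9:
        return None
    return f"{digits[:3]}-{digits[3:5]}-{digits[5:]}"


if __name__ == "__main__":
    print(normalize_date("03/07/2014"))
    print(normalize_ssn("123 45 6789"))
```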
For more information on how BeyondRecognition’s visual classification technology can help you in detecting and using the document attribute matrix existing in your organization, email us at IGDoneRight@BeyondRecognition.net.