Faceted classification represents the collective judgment of knowledge workers or subject matter experts from multiple areas in an organization on how to classify documents and grant access to them. It is a logical outgrowth of visual classification and builds on the organization’s existing access authorization infrastructure. Faceted classification is an extremely efficient way to remove documents that are not needed by an organization and to remove exact and visual duplicates of the documents that are needed. When used as an integral part of information governance it can reduce required file storage by 90% or more.

Visual classification clusters documents by analyzing visual representations of them. By basing clustering on appearance, visual classification normalizes documents regardless of the file types used to store them. Word, PDF, and TIF versions of the same underlying content are all classified into the same clusters.

Visual classification has unique characteristics that enable collective decision-making on classification and access rights. It takes millions of documents and resolves them into tens of thousands of visually similar clusters. The clusters are self-forming, meaning that nobody had to write rules to form the clusters and nobody had to select exemplars beforehand to use as a basis for selecting cluster members. Documents in a visually similar cluster are so alike that classification and access decisions made by examining one or two members of each cluster can be safely extended to all members of the cluster.

Senior knowledge workers rarely have time to make repetitive decisions on document retention and access authorization for individual documents. They can, however, decide how to classify one document per cluster, and they can say whether their business unit or function needs access to a particular document type, based on the group-based IT permission schema that already exists in the organization. Even in large organizations, the time required of individual knowledge workers is typically less than 40 hours.

The term “faceted classification” is used to indicate that multiple facets or characteristics of documents are collected as part of the process. Those facets include:

  • Cluster ID
  • Document type assigned to the cluster, based on a user-defined three-level document type taxonomy.
  • Group-level access rights to the document type for each security group, e.g., Legal, Accounting, Refining, Exploration.
  • PII indicator for document clusters that typically contain personally identifiable information.
  • Extractable attributes, which is a list of the data elements or fields to extract from documents of a given document type.
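
As a rough illustration, the facets listed above could be held in one record per cluster. The sketch below is only an assumption about how such a record might look; the class name, field names, and sample values are hypothetical and are not taken from BeyondRecognition's software.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ClusterFacets:
    """Hypothetical record holding the facets collected for one cluster."""
    cluster_id: str
    # Path through the three-level document type taxonomy.
    document_type: Tuple[str, str, str]
    # Security group name -> whether that group is granted access.
    group_access: Dict[str, bool] = field(default_factory=dict)
    # True when the cluster typically contains personally identifiable information.
    contains_pii: bool = False
    # Names of the data elements to extract for this document type.
    extractable_attributes: List[str] = field(default_factory=list)


# Illustrative values only.
well_log = ClusterFacets(
    cluster_id="C-000123",
    document_type=("Exploration", "Well Records", "Well Log"),
    group_access={"Legal": True, "Accounting": False,
                  "Refining": False, "Exploration": True},
    contains_pii=False,
    extractable_attributes=["well_number", "afe_number"],
)
```

Keeping the taxonomy path as a fixed three-part tuple mirrors the three-level taxonomy described in the list; an organization with a different structure would adjust that field.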


The workflow for faceted classification is adaptable for different circumstances but looks something like this:


BeyondRecognition (BR) forms clusters automatically, so no upfront work is required. Within days of processing, knowledge workers can begin meeting to classify clusters and determine which groups will be granted access to them, e.g., HR doesn’t need to see well logs, and Exploration doesn’t need to see refinery maintenance records.

The knowledge workers build a three-level document type taxonomy, either independently or starting from one suggested by BeyondRecognition. For each document type they determine which attributes to extract and store as metadata about documents of that type, e.g., well number or AFE number. When they examine a cluster they decide whether it needs to be retained at all and, if so, which document type to assign to it.

At the same time as they set up the document type taxonomy, the knowledge workers determine which groups within the organization need access to each type of document. The access determinations will be implemented in different ways depending on how the organization has set up its IT infrastructure. For example, document types flagged for PII may be placed on more secure storage than other document types. The key is that documents are tied to clusters, clusters are tied to document types, and access rights are tied to document types.
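
The chain from document to cluster to document type to access rights can be sketched as three lookups. This is a minimal sketch under stated assumptions: the dictionaries, identifiers, and the `can_access` function are hypothetical, and a real deployment would enforce access through the organization's existing permission infrastructure rather than in-memory tables.

```python
# Hypothetical lookup tables; all identifiers are illustrative.
document_cluster = {
    "doc-001": "C-1",
    "doc-002": "C-1",
    "doc-003": "C-2",
}
cluster_doc_type = {
    "C-1": "Well Log",
    "C-2": "Refinery Maintenance Record",
}
doc_type_groups = {
    "Well Log": {"Exploration", "Legal"},
    "Refinery Maintenance Record": {"Refining", "Legal"},
}


def can_access(group: str, document_id: str) -> bool:
    """Resolve access by walking document -> cluster -> document type."""
    cluster_id = document_cluster.get(document_id)
    doc_type = cluster_doc_type.get(cluster_id)
    # Unknown documents or unclassified clusters grant access to no one.
    return group in doc_type_groups.get(doc_type, set())


print(can_access("Exploration", "doc-001"))
print(can_access("Exploration", "doc-003"))
```

Because access rights attach to the document type rather than to individual documents, changing one entry in `doc_type_groups` changes access for every document in every cluster of that type.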

A separate group of workers performs zonal attribute extraction on the clusters, essentially drawing boxes around the data elements to be extracted from documents in the cluster, based on the attribute list for that document type. Attribute values can be used to direct subsequent workflow, e.g., well logs from wells in a particular field may be placed on a different server.
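
To make the idea of zones concrete, the sketch below reads attribute values from boxes drawn on a page and then uses one value to pick a destination server. It assumes that word positions are already available from an earlier text recognition step; the `Word` and `Zone` types, the coordinates, and the server names are all hypothetical and only illustrate the general approach.

```python
from typing import Dict, List, NamedTuple


class Word(NamedTuple):
    text: str
    x: float  # horizontal position of the word on the page
    y: float  # vertical position of the word on the page


class Zone(NamedTuple):
    left: float
    top: float
    right: float
    bottom: float


def extract_zone(words: List[Word], zone: Zone) -> str:
    """Join the words whose position falls inside the zone, in reading order."""
    inside = [w for w in words
              if zone.left <= w.x <= zone.right and zone.top <= w.y <= zone.bottom]
    inside.sort(key=lambda w: (w.y, w.x))
    return " ".join(w.text for w in inside)


def extract_attributes(words: List[Word], zones: Dict[str, Zone]) -> Dict[str, str]:
    """Apply every zone defined for a cluster to one document's words."""
    return {name: extract_zone(words, zone) for name, zone in zones.items()}


# Zones drawn once for the cluster, then reused for each member document.
zones = {
    "well_number": Zone(100, 40, 300, 60),
    "field_name": Zone(100, 70, 300, 90),
}
words = [
    Word("Well", 20, 50), Word("No.", 60, 50), Word("W-1042", 120, 50),
    Word("Field:", 20, 80), Word("North", 120, 80), Word("Ridge", 170, 80),
]
attrs = extract_attributes(words, zones)

# Route on an extracted value; fall back to a default location.
field_servers = {"North Ridge": "server-a"}
target = field_servers.get(attrs["field_name"], "default-server")
```

Since the members of a cluster look alike, a set of zones defined on one member can be reused across the rest, which is what makes per-cluster box drawing practical.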

Over time the work required to classify and assign access rights to documents diminishes considerably as most incoming documents will fall into clusters that have already been classified.
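
The shrinking workload can be pictured as a simple triage step. In this sketch, which assumes that cluster assignment has already happened upstream, documents landing in a classified cluster inherit its document type and only unseen clusters are queued for review. The `triage` function and all identifiers are hypothetical.

```python
from typing import Dict, Set, Tuple


def triage(incoming: Dict[str, str],
           cluster_doc_type: Dict[str, str]) -> Tuple[Dict[str, str], Set[str]]:
    """Split incoming documents into auto-classified ones and clusters needing review.

    incoming maps document ID -> cluster ID (assigned by the clustering step).
    cluster_doc_type maps cluster ID -> document type already decided.
    """
    classified: Dict[str, str] = {}
    needs_review: Set[str] = set()
    for doc_id, cluster_id in incoming.items():
        if cluster_id in cluster_doc_type:
            # Inherit the decision already made for this cluster.
            classified[doc_id] = cluster_doc_type[cluster_id]
        else:
            # One review per new cluster, however many documents it holds.
            needs_review.add(cluster_id)
    return classified, needs_review


known = {"C-1": "Well Log"}
batch = {"doc-101": "C-1", "doc-102": "C-1", "doc-103": "C-9", "doc-104": "C-9"}
classified, needs_review = triage(batch, known)
```

In this example four incoming documents produce a single review item, because the two unclassified documents share one new cluster.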
