A recent AIIM survey report, “Information Governance, records, risks and retention in the litigation age,” highlights issues faced by organizations in trying to manage their documents:

  • Custodian-based classification doesn’t work
  • Disk storage is steadily growing with no end in sight
  • Nobody ever seems to delete any electronic records
  • Organizations want to unify their treatment of paper and electronic records so that similar records are treated in a similar fashion regardless of file type

Instead of leaving content in relatively unmanaged environments like SharePoint or on file shares, more organizations are placing content in an ECM system like EMC-Documentum, Alfresco, or IBM-FileNet. For an ECM system to work, however, the documents placed in it have to be properly classified.

The good news for organizations that have already invested time classifying documents in an ECM system is that they can leverage the document classification decisions they’ve already made to avoid having to make the same decisions on unclassified documents.


Basically, many organizations have two categories of documents: one set of “records” in a managed ECM system, and another set of record and non-record documents in file shares and other repositories.

The goal is to identify the “records” stored in the unmanaged content, move them into a managed environment, and then flag the non-record documents for deletion.

The suggested process starts with what you know, i.e., what the documents look like that are currently managed, and then leverages that knowledge by comparing that information with what the unmanaged documents look like.

1. Identify visual classifications of existing ECM documents

The first step is to sort the known records in the ECM into visually similar classifications.


After this step, the system now literally knows what the “records” look like.

2. Compare with what non-managed documents look like

The next step is to visually compare all unmanaged content of interest against those classifications. The unmanaged documents that are placed in the same classifications that were formed by documents from the ECM are most likely “records,” and ought to be migrated to the ECM.
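As a rough illustration of this comparison step, here is a minimal Python sketch. It assumes each document has already been reduced to a numeric "visual feature" vector and that each ECM classification is summarized by one representative vector; the feature values, the cosine-similarity measure, the 0.9 threshold, and all names are hypothetical choices for the example, not a description of any particular product.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def assign_to_known_classes(unmanaged, class_vectors, threshold=0.9):
    """Match unmanaged documents against classifications formed from ECM records.

    Documents that match a known classification are likely records;
    the rest are left for new clusters and further review.
    """
    likely_records = {}
    unmatched = []
    for doc_id, vec in unmanaged.items():
        best_label, best_score = None, -1.0
        for label, class_vec in class_vectors.items():
            score = cosine(vec, class_vec)
            if score > best_score:
                best_label, best_score = label, score
        if best_label is not None and best_score >= threshold:
            likely_records[doc_id] = best_label
        else:
            unmatched.append(doc_id)
    return likely_records, unmatched

# Hypothetical feature vectors, for illustration only.
class_vectors = {"invoice": [1.0, 0.0, 0.2], "contract": [0.0, 1.0, 0.1]}
unmanaged = {
    "share/a.pdf": [0.9, 0.1, 0.2],
    "share/b.pdf": [0.1, 0.95, 0.1],
    "share/c.pdf": [0.5, 0.5, 0.9],
}
likely_records, unmatched = assign_to_known_classes(unmanaged, class_vectors)
```

Documents in `likely_records` would be candidates for migration to the ECM, while those in `unmatched` would go on to form new clusters.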


New clusters or classifications can be examined to determine if members of those classifications can be regarded as being all records or all non-records, or whether they will need further analysis or review to separate records from non-records.

Collections with millions of documents typically consolidate into 10,000 to 20,000 clusters, and samples from those clusters can be examined to determine the level of treatment required for each classification. Within a few days, the project or process manager can have an extremely high level of awareness of virtually all document types present in the collection.
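Pulling review samples from each cluster could look something like the following Python sketch. The cluster IDs, file names, and sample size are made up for illustration, and a fixed random seed is assumed so the same samples come back on a rerun.

```python
import random

def sample_clusters(clusters, per_cluster=3, seed=0):
    """Pick up to `per_cluster` documents from each cluster for human review."""
    rng = random.Random(seed)  # fixed seed so the review set is reproducible
    samples = {}
    for cluster_id, doc_ids in clusters.items():
        k = min(per_cluster, len(doc_ids))
        # Sort first so the result does not depend on input ordering.
        samples[cluster_id] = rng.sample(sorted(doc_ids), k)
    return samples

# Hypothetical clusters, for illustration only.
clusters = {
    "cluster-001": ["a.tif", "b.tif", "c.tif", "d.tif", "e.tif"],
    "cluster-002": ["x.pdf"],
}
samples = sample_clusters(clusters, per_cluster=2)
```

A reviewer would then look at the sampled documents and decide whether the whole cluster can be treated as records, as non-records, or needs closer analysis.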

Visual coding rules can be quickly developed for each classification, either to provide richer metadata for each record beyond just date created or date last modified, or to help differentiate between records and non-records, e.g., looking for specific dates mentioned on the face of the documents, project names, client names, etc.
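To make the idea of a coding rule concrete, here is a small Python sketch that pulls dates and project names from the text on the face of a document. It assumes the text has already been extracted (for example by OCR), that dates appear in MM/DD/YYYY form, and that a list of project names is available; the field names and the sample text are hypothetical.

```python
import re
from datetime import datetime

# Assumed date format for this sketch: MM/DD/YYYY.
DATE_PATTERN = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")

def extract_dates(text):
    """Return all valid MM/DD/YYYY dates found in the text, in order."""
    dates = []
    for month, day, year in DATE_PATTERN.findall(text):
        try:
            dates.append(datetime(int(year), int(month), int(day)).date())
        except ValueError:
            continue  # skip impossible dates such as 13/45/2012
    return dates

def apply_rule(text, project_names):
    """Build metadata from the face of a document: first date and known projects."""
    dates = extract_dates(text)
    lowered = text.lower()
    projects = [p for p in project_names if p.lower() in lowered]
    return {"document_date": dates[0] if dates else None, "projects": projects}

# Hypothetical document text and project list.
text = "Invoice dated 03/15/2012 for Project Apollo. Payment due 04/14/2012."
metadata = apply_rule(text, ["Apollo", "Gemini"])
```

In practice a rule like this would be written once per classification, since documents that look alike tend to carry the same fields in the same places.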

This discussion has been a high-level overview of how to use visual similarity to leverage the records management decisions that have already been made for content placed in an enterprise content management system. There are further refinements to enrich metadata values or speed processing, e.g., the use of hash values to speed the collection and analysis process, but more about those in later posts!
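As a preview of the hash idea, one plausible use (an assumption on my part, pending the later posts) is to fingerprint each file's bytes so that exact copies of documents already in the ECM can be recognized without any visual analysis. A minimal Python sketch using SHA-256 from the standard library, with made-up file names and contents:

```python
import hashlib

def content_hash(data: bytes) -> str:
    """SHA-256 fingerprint of a file's raw bytes."""
    return hashlib.sha256(data).hexdigest()

def split_by_known_hashes(unmanaged, known_hashes):
    """Separate exact copies of managed records from files needing visual analysis."""
    exact_matches, needs_analysis = [], []
    for name, data in unmanaged.items():
        if content_hash(data) in known_hashes:
            exact_matches.append(name)
        else:
            needs_analysis.append(name)
    return exact_matches, needs_analysis

# Hypothetical file contents, for illustration only.
managed = {"ecm/contract-17.pdf": b"signed contract bytes"}
known_hashes = {content_hash(data) for data in managed.values()}
unmanaged = {
    "share/copy-of-contract.pdf": b"signed contract bytes",
    "share/notes.doc": b"meeting notes",
}
exact_matches, needs_analysis = split_by_known_hashes(unmanaged, known_hashes)
```

Only byte-identical files match this way; a rescanned or re-saved copy of the same document would still need the visual comparison.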
