Organizations should perform periodic “spring cleaning” of their document management and file share systems to ensure they keep needed records while disposing of duplicate or unnecessary documents.

A logical time to do this is when upgrading content management systems, e.g., moving to EMC Documentum’s D2, or when increasing the scope of managed content. The files have to be processed anyway, so why not take the opportunity to implement consistent, efficient practices?

The traditional way to decide what content to place in a document management system is to examine individual documents and decide what to include or exclude. The biggest problem is that much of the effort goes into reviewing documents that are ultimately discarded. Without some way to group or classify documents automatically, this is an extremely labor-intensive, possibly cost-prohibitive process that is difficult to apply consistently going forward.

One way to avoid wasting effort on documents that are ultimately discarded is to use visual similarity technology to classify or cluster documents. The classification is automatic: no manual effort is required to produce it. Visual classification normalizes content across file types, e.g., a Word document is grouped with the PDF that was printed from it, and the same group includes scanned images made from printed copies.
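The grouping idea can be illustrated with a minimal sketch. It assumes each file has already been reduced to a short appearance fingerprint (here a made-up 16-bit string); how a real visual similarity product computes its fingerprints is not described in this article, so the fingerprints, the `cluster` function, and the distance threshold are all hypothetical.

```python
def hamming(a, b):
    """Count the positions at which two equal-length fingerprints differ."""
    return sum(x != y for x, y in zip(a, b))


def cluster(fingerprints, max_distance=2):
    """Greedy grouping: a file joins the first group whose representative
    fingerprint is within max_distance, otherwise it starts a new group."""
    clusters = []  # list of (representative_fingerprint, [file names])
    for name, fp in fingerprints.items():
        for rep, members in clusters:
            if hamming(rep, fp) <= max_distance:
                members.append(name)
                break
        else:
            clusters.append((fp, [name]))
    return [members for _, members in clusters]


# Hypothetical fingerprints: the scan differs from the native files by one bit.
docs = {
    "invoice.docx": "1111000011110000",
    "invoice.pdf": "1111000011110000",
    "invoice_scan.tif": "1111000011010000",
    "well_log.pdf": "0000111100001111",
}

groups = cluster(docs)
print(groups)
```

With these sample values the Word file, its PDF, and the scan land in one group, and the unrelated file forms its own group, regardless of file type.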

With visual similarity, native and scanned paper files are grouped according to their appearance. This process can be used not only to evaluate unmanaged content, but also to validate the current contents of the ECM system being upgraded. In a recent energy company project, we found that two of the top five classifications by frequency were documents for which the company had no ongoing retention requirements; they were literally wasting space.
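Once every file carries a classification, checking the largest classes against the retention schedule is a simple counting exercise. The sketch below assumes a list of per-document class names and a set of classes that have a retention requirement; the class names and counts are invented for illustration and are not the figures from the project mentioned above.

```python
from collections import Counter


def top_classes_without_retention(doc_classes, retained, n=5):
    """Return (class, count) pairs from the n most frequent classes
    that have no retention requirement."""
    counts = Counter(doc_classes)
    return [(cls, k) for cls, k in counts.most_common(n) if cls not in retained]


# Hypothetical inventory: one entry per document, naming its classification.
doc_classes = (
    ["fax cover"] * 40
    + ["invoice"] * 30
    + ["blank form"] * 25
    + ["contract"] * 10
    + ["well log"] * 5
    + ["memo"] * 2
)
retained = {"invoice", "contract", "well log"}  # classes with a retention rule

candidates = top_classes_without_retention(doc_classes, retained)
print(candidates)
```

The classes this returns are the first candidates for disposal review, because they take up the most space without any requirement to keep them.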

Consistent Document Type Labels. One of the challenges facing large organizations is that different business units may use different labels for the same documents, and may have different retention needs for those documents. Visual similarity lets each business unit use its own labels and apply its own retention requirements.

The classification provided by visual similarity serves as “true north” for classification purposes: the classifications are consistent and persistent on a day-forward basis. Business units can apply whatever label they want to each classification, and the access privileges for those classifications can be set according to the needs of each unit. There is no need to make a duplicate set of documents or to set up different retention periods for what are exact copies of the same documents. If one business unit wants 180 document types with a retention period of six months and another unit wants 540 document types with a retention period of 36 months, the needs of both units can be satisfied: after six months, the first business unit will simply stop being able to see those documents.
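One way to picture this arrangement is a single stored classification with a separate label and retention period per business unit. The following is a minimal sketch under that assumption; the `UnitView` type, the `VIEWS` table, the class ID, and the unit names are hypothetical and do not represent any product's data model.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class UnitView:
    label: str             # the unit's own name for the classification
    retention_months: int  # how long the unit can see documents in it


# Hypothetical mapping: one classification, two unit-specific views of it.
VIEWS = {
    ("class-0042", "operations"): UnitView("Field Ticket", 6),
    ("class-0042", "legal"): UnitView("Service Record", 36),
}


def months_between(start, end):
    """Whole calendar months elapsed from start to end."""
    months = (end.year - start.year) * 12 + (end.month - start.month)
    if end.day < start.day:
        months -= 1
    return months


def visible_to(unit, class_id, filed_on, today):
    """A unit sees a document only while its own retention period is running."""
    view = VIEWS.get((class_id, unit))
    if view is None:
        return False
    return months_between(filed_on, today) < view.retention_months


filed = date(2024, 1, 15)
print(visible_to("operations", "class-0042", filed, date(2024, 8, 1)))  # False
print(visible_to("legal", "class-0042", filed, date(2024, 8, 1)))       # True
```

The document itself is stored once; only the per-unit view decides what it is called and who can still see it.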

Single Object Management. Because visual similarity technology can detect duplicates (even duplicate scanned paper documents), it can achieve far higher levels of deduplication than hashing, which matches only files that are byte-for-byte identical. This works despite possible differences in scanning resolution or orientation. Visual similarity can also be used to correlate which Word file was used to print which PDF, and then to embed the Word file in the appropriate PDF to further reduce the number of objects being managed.
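The difference between the two kinds of matching can be shown with a small sketch. It uses a simple average-brightness fingerprint as a stand-in for visual similarity; this is a generic illustrative technique, not the method used by any particular product, and unlike the technology described above it does not handle rotated pages. The two 4x4 grayscale thumbnails are invented sample data.

```python
import hashlib


def average_hash(pixels):
    """Fingerprint: 1 where a pixel is brighter than the page mean, else 0."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return tuple(1 if p > mean else 0 for p in flat)


def hamming(a, b):
    """Count the positions at which two fingerprints differ."""
    return sum(x != y for x, y in zip(a, b))


def sha256_of(pixels):
    """Cryptographic hash of the raw pixel bytes."""
    return hashlib.sha256(bytes(p for row in pixels for p in row)).hexdigest()


# Hypothetical thumbnails of the same page: a clean render and a noisy scan.
native = [[250, 250, 250, 250],
          [20, 20, 20, 250],
          [250, 250, 250, 250],
          [20, 20, 250, 250]]
scan = [[240, 245, 238, 242],
        [35, 28, 40, 236],
        [241, 239, 244, 240],
        [30, 33, 235, 243]]

same_bytes = sha256_of(native) == sha256_of(scan)
distance = hamming(average_hash(native), average_hash(scan))
print(same_bytes, distance)
```

The cryptographic hashes differ because the pixel values are not identical, while the appearance fingerprints match exactly, so only the second approach would flag the scan as a duplicate.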

Visual Coding. While visually similar classes of documents are being examined, the examiner can identify which data elements to extract from the documents that will be retained in the ECM system. It can be as simple as drawing a box around the data to be retained and associating that box with a field or data value name. The extracted values can be validated against external controlled vocabulary lists, or used to populate such lists. The visual coding process can be used in lieu of outsourcing coding work, thereby providing faster, more accurate turnaround without the data security issues inherent in outsourcing.
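The box-drawing step and the vocabulary check can be sketched as follows. The sketch assumes the page text is already available as words with page coordinates (for example from OCR); the `Word` and `Zone` types, the field names, and the vocabulary are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    text: str
    x: float  # horizontal position on the page
    y: float  # vertical position on the page


@dataclass(frozen=True)
class Zone:
    field: str  # field name the examiner assigned to the drawn box
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, w):
        return self.left <= w.x <= self.right and self.top <= w.y <= self.bottom


def extract(words, zones):
    """Collect, per field, the words that fall inside that field's box."""
    return {z.field: " ".join(w.text for w in words if z.contains(w)) for z in zones}


def validate(values, vocab):
    """Split values into accepted and needs-review using controlled lists.
    Fields without a controlled list are accepted as-is."""
    accepted, review = {}, {}
    for field, value in values.items():
        allowed = vocab.get(field)
        if allowed is None or value in allowed:
            accepted[field] = value
        else:
            review[field] = value
    return accepted, review


# Hypothetical page content and examiner-drawn zones.
words = [Word("Well", 100, 50), Word("A-17", 140, 50),
         Word("Permian", 100, 90), Word("Page", 500, 700)]
zones = [Zone("well_name", 90, 40, 200, 60), Zone("basin", 90, 80, 200, 100)]

values = extract(words, zones)
accepted, review = validate(values, {"basin": {"Permian", "Bakken"}})
print(accepted, review)
```

Because every document in a visual class shares the same layout, zones drawn once on a representative page can be applied to the rest of the class, with out-of-vocabulary values routed to review.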

