Different IG stakeholders look at ECM systems differently. IT may be primarily concerned with the resources it takes to store, access, and back up the ECM content, and with the time it takes for the system to respond to a query. IT staff are apt to view the content in ECM as a given and to use technology like block-level deduping to reduce the resources needed to manage that content.

The acid test for end users is more likely, “Does the ECM help me find what I want quickly and painlessly?”

End users care about seeing just one copy of the same content, being able to frame precise queries, and being able to rank or order search results in an easy, intuitive manner. To them, storage optimization techniques like block-level deduping are largely a matter of indifference. All those techniques may do is let the ECM system hold more content, and so do a better job of burying users in copies when they try to find what they want.

Example: One Document, 10 Copies.

Consider the example of a Word document that is saved as a *.doc file by one user and as a *.docx file by another. Both users then save a PDF version for distribution, and one user prints a paper copy that gets scanned twice, each time at a different resolution. Both scanned copies are then saved as image-only PDFs, which are in turn saved again with associated text.

To end users, the 10 files all represent the same content. When IT saves some of the space it takes to store them using block-level deduping, it doesn’t shield the end users from having to wade through all those copies.
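The gap between storage savings and what users see can be sketched in a few lines. This is a minimal illustration of fixed-size block-level deduping, not a description of any particular storage product; the function name `store_with_block_dedup`, the tiny block size, and the sample file contents are all assumptions made for the example.

```python
import hashlib

BLOCK_SIZE = 8  # unrealistically small, chosen only to keep the example readable


def store_with_block_dedup(files, block_size=BLOCK_SIZE):
    """Store each distinct block once; keep a per-file list of block digests."""
    block_store = {}  # digest -> block bytes (each unique block stored once)
    manifests = {}    # file name -> ordered list of block digests
    for name, data in files.items():
        digests = []
        for start in range(0, len(data), block_size):
            block = data[start:start + block_size]
            digest = hashlib.sha256(block).hexdigest()
            block_store.setdefault(digest, block)
            digests.append(digest)
        manifests[name] = digests
    return block_store, manifests


# Hypothetical content: two files that share their first two blocks.
files = {
    "report_v1.bin": b"AAAAAAAABBBBBBBBCCCCCCCC",
    "report_v2.bin": b"AAAAAAAABBBBBBBBDDDDDDDD",
}
block_store, manifests = store_with_block_dedup(files)

raw_bytes = sum(len(data) for data in files.values())
stored_bytes = sum(len(block) for block in block_store.values())
print(raw_bytes, stored_bytes, len(manifests))  # 48 32 2
```

Storage drops from 48 bytes to 32, but the number of files a user has to look through is still two.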

File-Level Hashing. IT is often able to use file-level hash deduping to good effect, removing files that are bit-for-bit identical. However, as you would expect, each of the 10 copies in this scenario has a different hash value (see table at the end of this posting), preventing hash deduping from addressing this problem.
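As a rough sketch of why file-level hashing falls short here, the snippet below groups files by the SHA-256 digest of their bytes. The file names and byte strings are made-up stand-ins for real file contents, and `group_by_hash` is a hypothetical helper, not part of any ECM product.

```python
import hashlib
from collections import defaultdict


def group_by_hash(files):
    """Group file names by the SHA-256 digest of their raw bytes."""
    groups = defaultdict(list)
    for name, data in files.items():
        groups[hashlib.sha256(data).hexdigest()].append(name)
    return list(groups.values())


# Hypothetical bytes: the same text stored in two different encodings,
# plus one bit-for-bit copy of the second file.
files = {
    "contract.doc": b"legacy-binary-encoding of the text",
    "contract.docx": b"zipped-xml-encoding of the text",
    "contract_copy.docx": b"zipped-xml-encoding of the text",
}
groups = group_by_hash(files)
print(groups)
# [['contract.doc'], ['contract.docx', 'contract_copy.docx']]
```

Only the bit-for-bit copy is caught. The *.doc and *.docx files carry the same content for a reader but hash to different values, so they stay in separate groups.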

The Answer. Visual classification provides a scalable, practical solution to the challenge of removing redundant content and permitting precise searching with meaningful sorting or ranking of results. Because visual classification analyzes graphical representations of documents, it does not need recognizable text to group visually similar documents. All 10 copies of the document in this scenario would be grouped together, i.e., the *.docx, *.doc, *.tif, and *.pdf documents would all be in the same cluster of visually similar documents.
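To give a feel for grouping by appearance rather than by bytes or text, here is a toy sketch. It is not BeyondRecognition's method, which this post does not describe; it swaps in a simple "average hash" over tiny grayscale grids, and the grids, file names, and `max_distance` threshold are all invented for illustration.

```python
def average_hash(pixels):
    """One bit per pixel: 1 if the pixel is brighter than the page mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return tuple(1 if p > mean else 0 for p in flat)


def hamming(a, b):
    """Number of positions at which two equal-length bit tuples differ."""
    return sum(x != y for x, y in zip(a, b))


def cluster(pages, max_distance=1):
    """Greedy grouping: join the first cluster whose representative is close."""
    clusters = []  # list of (representative_hash, [file names])
    for name, pixels in pages.items():
        h = average_hash(pixels)
        for rep, names in clusters:
            if hamming(rep, h) <= max_distance:
                names.append(name)
                break
        else:
            clusters.append((h, [name]))
    return [names for _, names in clusters]


# Hypothetical 3x3 grayscale renderings (0 = black, 255 = white).
pages = {
    "agreement.docx": [[0, 0, 0], [255, 255, 255], [0, 0, 0]],
    "agreement_scan_lowres.tif": [[20, 10, 15], [230, 240, 225], [12, 18, 25]],
    "invoice.pdf": [[255, 0, 255], [0, 255, 0], [255, 0, 255]],
}
clusters = cluster(pages)
print(clusters)
# [['agreement.docx', 'agreement_scan_lowres.tif'], ['invoice.pdf']]
```

The noisy scan lands in the same cluster as the native file even though their bytes, formats, and hashes differ, which is the property the scenario calls for.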

BeyondRecognition (BR) clients assign document-type labels to the clusters, and this information is loaded in the ECM, permitting users to search only the document types they are interested in. Depending on how the ECM is configured, visual classification also enables end users to sort search results by document type, so they can quickly identify the documents of most interest. The document-type classifications can also be used to set access rights to content in the ECM system and keep users from even seeing documents they have no business reason to access.
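Once every document carries a type label, filtering, sorting, and access control become simple operations on that label. The sketch below assumes a hypothetical `Doc` record and `search` function; real ECM systems expose this through their own query and permission settings, which are not shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    name: str
    doc_type: str  # label assigned to the document's cluster


def search(docs, allowed_types, wanted_types=None):
    """Hide types the user may not see, optionally narrow, then sort by type."""
    visible = [d for d in docs if d.doc_type in allowed_types]
    if wanted_types is not None:
        visible = [d for d in visible if d.doc_type in wanted_types]
    return sorted(visible, key=lambda d: (d.doc_type, d.name))


# Hypothetical repository contents and labels.
docs = [
    Doc("q3_invoice.pdf", "Invoices"),
    Doc("salary_review.docx", "HR Reviews"),
    Doc("supply_agreement.pdf", "Agreements"),
    Doc("q2_invoice.pdf", "Invoices"),
]

# A user in accounts payable has no business reason to see HR reviews.
results = search(docs, allowed_types={"Invoices", "Agreements"})
print([d.name for d in results])
# ['supply_agreement.pdf', 'q2_invoice.pdf', 'q3_invoice.pdf']
```

Passing `wanted_types={"Invoices"}` would narrow the same search to a single document type.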

Visual classification can also identify “visual duplicates,” i.e., copies that are visually indistinguishable from one another, and enable business units to select which versions to retain. For example, with the document type “Agreements,” the business unit or the legal department might want scanned copies of executed documents, and hence want to keep PDF versions of scanned copies.
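A retention choice like this can be expressed as a per-document-type preference list. The following is only a sketch: the format names, the `RETENTION_PREFERENCE` table, and the default order are assumptions standing in for whatever policy a business unit actually sets.

```python
# Hypothetical policy: for each document type, formats in order of preference.
RETENTION_PREFERENCE = {
    "Agreements": ["scanned_pdf", "pdf", "docx"],  # keep the executed, scanned copy
}
DEFAULT_PREFERENCE = ["docx", "pdf", "scanned_pdf"]  # assumed fallback order


def pick_version_to_retain(doc_type, versions):
    """Return the file to keep from a set of visual duplicates.

    `versions` maps a format name to the file stored in that format.
    """
    for fmt in RETENTION_PREFERENCE.get(doc_type, DEFAULT_PREFERENCE):
        if fmt in versions:
            return versions[fmt]
    return None  # no known format present; leave the decision to a person


duplicates = {
    "docx": "supply_agreement.docx",
    "pdf": "supply_agreement.pdf",
    "scanned_pdf": "supply_agreement_signed_scan.pdf",
}
print(pick_version_to_retain("Agreements", duplicates))
# supply_agreement_signed_scan.pdf
```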

Finally, visual classification can involve attribute extraction in which certain data elements are extracted from specified document types and loaded in the ECM system, enabling further field-limited searching and output sorting and reporting options.
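As an illustration of type-specific attribute extraction, the sketch below applies a small set of patterns only to documents of a given type. It assumes text is already available for the document and uses regular expressions as a stand-in; the post does not say how BeyondRecognition extracts attributes, and the rule table, field names, and sample text are all hypothetical.

```python
import re

# Hypothetical rules: which fields to pull from which document types.
EXTRACTION_RULES = {
    "Invoices": {
        "invoice_number": re.compile(r"Invoice\s+No\.?\s*[:#]?\s*(\w+)"),
        "total": re.compile(r"Total\s*:\s*\$?([\d,]+\.\d{2})"),
    },
}


def extract_attributes(doc_type, text):
    """Return the fields defined for this document type that appear in the text."""
    attributes = {}
    for field, pattern in EXTRACTION_RULES.get(doc_type, {}).items():
        match = pattern.search(text)
        if match:
            attributes[field] = match.group(1)
    return attributes


sample_text = "Invoice No: A1042\nTotal: $1,250.00"
print(extract_attributes("Invoices", sample_text))
# {'invoice_number': 'A1042', 'total': '1,250.00'}
```

The extracted fields are what would then be loaded into the ECM to support field-limited searches and sorted reports.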

Implementation. One of the big keys to success in implementing visual classification is the active involvement of the business units whose content is maintained in the ECM system. The business unit is the stakeholder that has to identify which document types to keep as records, what document-type labels to associate with given clusters, and which attributes to extract. As in most IG initiatives, it takes a multi-disciplinary team to achieve the best results.

For information on how your organization can implement visual classification, contact BeyondRecognition at IGDoneRight@beyondrecognition.net.

