Whack-a-mole describes a situation in which attempts to solve a problem are piecemeal or superficial, resulting only in temporary or minor improvement, as in, “the site’s security team has an ongoing battle against spammers, but it’s a game of whack-a-mole.” See Oxford Dictionaries.
The whack-a-mole concept is familiar to those attempting to classify documents using text-based rules or analytics. You spend months or perhaps years developing hundreds or thousands of complex rules to classify documents, yet false positives keep occurring, e.g., mission-critical documents get classified as invoices, or invoices get classified as something completely different.
Adjusting the rules fixes the immediate problem, but the change ripples through the rule set and produces a whole new batch of false positives.
There is an answer: use visual classification to cluster visually similar documents. The clustering is automatic; there are no rules to write and no exemplars or seed sets to select. The technology analyzes graphical representations of the documents or files, not their text, so it works equally well on scanned or faxed images and on native electronic files. By examining the largest clusters first, subject matter experts can generally review and assign consistent document-type labels to 99% of an organization's documents in three days. Once a cluster is classified, documents later placed in that cluster receive the same document-type label and the same treatment. Subject matter experts can then focus their attention on any new clusters that form, typically a very small percentage of documents after the initial processing. Identifying the document attributes to extract from each cluster is a second process that involves different skill sets and takes somewhat longer, generally measured in months for enterprise-scale collections.
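The review workflow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the visual clustering itself is not shown, cluster assignments are assumed to arrive as a plain mapping from document ID to cluster ID, and the names `review_order` and `apply_labels` are hypothetical, not part of any product.

```python
from collections import Counter


def review_order(doc_clusters):
    """Return cluster IDs sorted largest first, so experts label the
    clusters that cover the most documents before anything else."""
    counts = Counter(doc_clusters.values())
    return [cluster_id for cluster_id, _ in counts.most_common()]


def apply_labels(doc_clusters, cluster_labels):
    """Give every document the label of its cluster.

    Returns (labeled, needs_review): a dict of document ID -> label,
    and the set of cluster IDs that no expert has labeled yet.
    """
    labeled = {}
    needs_review = set()
    for doc_id, cluster_id in doc_clusters.items():
        if cluster_id in cluster_labels:
            labeled[doc_id] = cluster_labels[cluster_id]
        else:
            needs_review.add(cluster_id)
    return labeled, needs_review


# Hypothetical cluster assignments (document ID -> cluster ID).
doc_clusters = {
    "a": "c1", "b": "c1", "c": "c2",
    "d": "c1", "e": "c3", "f": "c2",
}
# Labels assigned so far by subject matter experts.
cluster_labels = {"c1": "invoice", "c2": "contract"}

order = review_order(doc_clusters)
labeled, needs_review = apply_labels(doc_clusters, cluster_labels)
```

Because labels attach to clusters rather than to individual documents, a document that lands in an already-labeled cluster needs no further expert attention; only the clusters returned in `needs_review` do.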
While consistent classification is the primary objective of this exercise, a significant secondary objective is the defensible disposal of the large percentage of files and documents that serve no ongoing business, regulatory, or legal purpose.
The process also contributes significantly to information security for “unstructured” content. Document-type labels can be assigned using a three-layer classification tree: the top layer is the business unit or function, the second layer is the document type, and the third layer is the sub-document type. By using this classification scheme and performing a risk assessment on each document type, organizations can determine where to store certain types of documents and restrict access to those who have a business need to see the content.
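As a rough sketch of how such a three-layer label could drive storage and access decisions, consider the Python below. Everything in it is an assumption made for illustration: the `DocLabel` class, the example labels, the two risk levels, the store names, and the access rule (high-risk documents are visible only to the owning business unit) are hypothetical and would come from an organization's own risk assessment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DocLabel:
    business_unit: str  # top layer: business unit or function
    doc_type: str       # second layer: document type
    sub_type: str       # third layer: sub-document type


# Hypothetical outcome of a risk assessment, one entry per label.
RISK_BY_LABEL = {
    DocLabel("Finance", "Invoice", "Vendor"): "low",
    DocLabel("HR", "Personnel File", "Medical Leave"): "high",
}

# Hypothetical storage locations for each risk level.
STORE_BY_RISK = {"low": "general-share", "high": "restricted-vault"}


def risk_for(label):
    # Labels that have not been assessed are treated as high risk.
    return RISK_BY_LABEL.get(label, "high")


def storage_for(label):
    return STORE_BY_RISK[risk_for(label)]


def can_access(user_unit, label):
    """Assumed rule: high-risk documents are visible only to the
    business unit that owns them; low-risk documents to everyone."""
    if risk_for(label) == "high":
        return user_unit == label.business_unit
    return True


invoice = DocLabel("Finance", "Invoice", "Vendor")
leave = DocLabel("HR", "Personnel File", "Medical Leave")
unassessed = DocLabel("Legal", "Contract", "NDA")
```

Treating unassessed labels as high risk is a deliberately conservative default: a document type is locked down until someone has assessed it.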
For those who will continue using text-based document classification and want to stay in practice, here’s a link to iTunes where you can download Mattel, Inc.’s Whac-A-Mole e-game for iPads and iPhones: https://itunes.apple.com/us/app/whac-a-mole/id823703847
And here’s a link to Hasbro’s table-top Whac-A-Mole game (batteries required): http://www.amazon.com/Hasbro-40509-Whac-A-Mole-Game/dp/B0001GDP00
For those who would like to quit playing games with document classification, contact BeyondRecognition at IGDoneRight@BeyondRecognition.net.