Without the right tools, even basic information governance tasks can be difficult. The most glaring example is document classification, which is the bedrock upon which virtually all information governance initiatives rest. If you can’t accurately classify an ever-increasing volume of documents and correspondence, you can’t apply the correct retention schedules, you can’t specify which attributes to extract, you can’t separate records from non-records, and access privileges will necessarily be overly broad or overly restrictive. It is no overstatement to say that without accurate document classification, records management and information governance initiatives may be more aspirational than achievable.
Until recently there were four equally unattractive document classification options:
- User-assigned classifications. This has been a total flop – users can’t or won’t accurately classify the content they create. While this option can be used to prove that some form of automation is necessary, it is not a viable alternative in the medium to long term.
- Rules-based technology-assisted classifications (“TAC”). This is a tantalizing option, one that requires the process owner to identify which document types require classification and then to engage high-level consultants to develop the classification scripts or rules. The results are typically under- or over-inclusive, and the classifications become out of date as the document types in a given process evolve over time. One advantage of this approach is that it can at least attempt to identify which textual attributes to extract and use for subsequent searching and reporting.
- Exemplar-based TAC. Another tantalizing option, one where the process owner has to identify document types to classify, then find exemplars of each type so that the technology can attempt to identify or predict which other documents are “like” the exemplars. As with rules-based TAC, results are often over- or under-inclusive, and it is challenging to update the classifications and exemplars as new document types are introduced into the collection or process and old document types morph over time. Many exemplar-based classification technologies also leave clients in the lurch when it comes to actually extracting document attributes for subsequent structured searching and analysis.
- Manual coding. Having people examine documents and extract attributes or data elements of interest is extremely costly and carries a heavy upfront investment to create a coding manual with data input forms appropriate for each document type. There is a limit to how many classifications or rules humans can keep in their heads at one time, so consistency and currency both suffer. Add to that the often unstated security risks posed by having to outsource much of this work, and the time delay inherent in manual data entry and QA.
The Cloud-Based BPO Option
Information governance professionals now have a fifth option: visual classification of documents. This is profoundly different from the other approaches for three reasons.
- Self-forming classifications. The visual classification algorithms do the work of associating like documents. There are no up-front costs to define what document types there are in the collection by either writing rules or identifying exemplars. As new document types are processed or document types morph over time, operators will be alerted that new classifications have formed. The heavy lifting involved in starting or maintaining a document classification process is practically eliminated.
- Graphical, not text-based, document comparisons. Rules-based and exemplar-based TAC rely on the documents having text. Unfortunately, not all documents have enough text or accurate-enough text to permit accurate classifications. Legacy TAC approaches will fail to classify 30% or more of the documents in many industries. (Would you fly an airline with a successful landing rate of 70%?) Visual similarity is based on examining what documents look like and will classify scanned paper documents as well as it classifies native electronic files. One consequence of this is that paper and electronic silos can now be consolidated for analysis and receive consistent treatment.
- Classification-specific attribute extraction. With visual similarity, operators can “paint” or draw boxes around the attributes they’d like to extract from each classification by examining one document per classification. The boxes they create are applied to all documents in the classification, enabling them to effectively extract attributes from all the documents by examining what is generally around 1% of the number of documents in a collection. Because the attribution is persistent, companies can leverage the “stored intelligence” of previously classified data to automatically classify new content that comes into the environment.
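The idea of painting a box once and applying it to every document in a classification can be illustrated with a minimal Python sketch. This is not BR's implementation. The `Word` and `Zone` types, the coordinates, and the sample invoices are all hypothetical, and the sketch assumes that an earlier step has already produced positioned words for each page.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Word:
    """One recognized word and the page coordinates of its center (hypothetical input)."""
    text: str
    x: float
    y: float


@dataclass(frozen=True)
class Zone:
    """A box drawn once on a representative document of a classification."""
    name: str
    x0: float
    y0: float
    x1: float
    y1: float

    def contains(self, word: Word) -> bool:
        return self.x0 <= word.x <= self.x1 and self.y0 <= word.y <= self.y1


def extract_attributes(words: List[Word], zones: List[Zone]) -> Dict[str, str]:
    """Collect the text that falls inside each zone, in input order."""
    return {
        zone.name: " ".join(w.text for w in words if zone.contains(w))
        for zone in zones
    }


# Zones painted on ONE invoice (made-up coordinates).
invoice_zones = [
    Zone("invoice_no", 400, 40, 560, 60),
    Zone("total", 400, 700, 560, 720),
]

# Two documents that share the same layout, hence the same classification.
invoice_a = [Word("Invoice", 50, 50), Word("INV-1001", 420, 50), Word("$250.00", 430, 710)]
invoice_b = [Word("Invoice", 50, 50), Word("INV-1002", 421, 51), Word("$99.50", 432, 709)]

for doc in (invoice_a, invoice_b):
    print(extract_attributes(doc, invoice_zones))
```

Because the zones belong to the classification rather than to any single file, the same two boxes yield the invoice number and total for both invoices.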
Cloud-based Visual Classification
Visual classification is available from BeyondRecognition, LLC (“BR”), on a cloud basis. Clients can FTP their document collections or can send drives to BR’s secure facilities. Computing resources can be scaled to achieve throughput of millions of documents per day. Confusion in, classifications out.
BR’s document classifications are persistent, meaning that as new documents are added to the process they can be associated with previously established classifications, and the attribute extraction boxes for those classifications will extract the attributes from the new documents. The significance of this is that within a day of the start of processing, subject matter experts can begin assigning name labels for the document type in each classification and can begin painting the attribute extraction elements.
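One way to picture persistent, self-forming classifications is a simple threshold-based grouping: a new document joins the nearest existing classification if it is close enough, and otherwise starts a new one that operators are told about. The Python sketch below is only an illustration under stated assumptions. It is not BR's algorithm, and the `VisualClassifier` class, the two-number feature vectors, and the threshold value are hypothetical stand-ins for whatever visual features a real system computes.

```python
import math
from typing import List, Sequence, Tuple


class VisualClassifier:
    """Greedy grouping of documents by distance between feature vectors (illustrative only)."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold
        # One representative vector per classification; kept between runs
        # so that later documents can join existing classifications.
        self.representatives: List[Tuple[float, ...]] = []

    def assign(self, features: Sequence[float]) -> Tuple[int, bool]:
        """Return (classification id, True if a new classification was formed)."""
        best_id, best_distance = -1, math.inf
        for class_id, rep in enumerate(self.representatives):
            d = math.dist(rep, features)  # Euclidean distance, Python 3.8+
            if d < best_distance:
                best_id, best_distance = class_id, d
        if best_distance <= self.threshold:
            return best_id, False
        # Nothing is close enough: a new classification forms on its own.
        self.representatives.append(tuple(features))
        return len(self.representatives) - 1, True


classifier = VisualClassifier(threshold=1.0)
for vector in [(0.0, 0.0), (0.5, 0.0), (10.0, 10.0), (10.2, 9.9)]:
    class_id, is_new = classifier.assign(vector)
    if is_new:
        print(f"alert: new classification {class_id} formed")
    else:
        print(f"document joined existing classification {class_id}")
```

In this toy run the first and third vectors each form a classification, while the second and fourth join the ones already established.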
Within classifications, BR can also identify visual duplicates, that is, documents that are visually indistinguishable from one another. This functionality is far more effective at reducing redundant content than the more common but limited approach of comparing hash values, which fails to recognize that the same content has been stored in different file types, e.g., Word documents saved as PDF documents.
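The limitation of hash-based deduplication can be shown with a short Python sketch. The byte strings and pixel grids below are made-up stand-ins for a Word file, a PDF file, and their rendered pages, and `visual_fingerprint` is a hypothetical, deliberately crude comparison, not BR's method. `hashlib.sha256` is the standard library call.

```python
import hashlib
from typing import Sequence, Tuple


def file_hash(data: bytes) -> str:
    """Conventional dedup key: a hash of the raw file bytes."""
    return hashlib.sha256(data).hexdigest()


def visual_fingerprint(page: Sequence[Sequence[int]], threshold: int = 128) -> Tuple[Tuple[int, ...], ...]:
    """Reduce a grayscale page to dark/light cells so small rendering differences collapse."""
    return tuple(tuple(1 if px >= threshold else 0 for px in row) for row in page)


# Same visible content stored in two containers (placeholder bytes, not real files).
word_bytes = b"PK\x03\x04 word container ... Invoice 1001"
pdf_bytes = b"%PDF-1.7 pdf container ... Invoice 1001"

# Stand-ins for the two files rendered to tiny grayscale grids (0 = black, 255 = white).
word_page = ((250, 10, 245), (12, 8, 251))
pdf_page = ((255, 0, 240), (5, 15, 255))

print(file_hash(word_bytes) == file_hash(pdf_bytes))                # False
print(visual_fingerprint(word_page) == visual_fingerprint(pdf_page))  # True
```

The file hashes differ because the bytes differ, even though a reader would see the same page. Comparing what the pages look like is what lets the two copies be treated as duplicates.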
The benefits of the cloud-based visual classification approach can be summarized as:
- Rapid project launch – days instead of months.
- Self-forming classifications greatly simplify handling shifting or morphing document types over time.
- Even non-text documents are accurately classified.
- Security risks associated with having large numbers of people analyzing and reviewing a company’s documents are eliminated.
Regardless of the type of IG initiative you are conducting or considering, if you haven’t examined BR, you haven’t completed your due diligence in examining the most powerful options available to you.