Technology-Assisted Review (“TAR”) for e-discovery has received a fair amount of favorable publicity over the last several years, accompanied by extensive claims of statistically sound measures such as precision, recall, fallout, and F-measures. What may not be explicitly stated is that the “Technology” in TAR is limited to some form of textual analysis, i.e., TAR is really Text-Assisted Review, and actual performance may fall short of expectations.
Measuring what wasn’t processed. To estimate true recall, you would have to know what percentage of the original document population was never loaded into the document review system in the first place because those documents did not contain searchable text. For example, some of our energy clients have told us that 30% or more of their documents have no associated text, e.g., maps, image-only scanned documents, drawings, designs, and pictures. In workflows that use text searches to identify potentially responsive documents, or in which image-only/non-text documents are simply not loaded, those non-text or poor-text documents will never be processed or reviewed. A claim of 80% recall may really mean 80% of the 70% of documents that had text, i.e., 56% true recall.
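The arithmetic above can be sketched in a few lines. The figures below are the illustrative 70%/80% numbers from the text, not measured data, and the calculation assumes responsive documents are spread evenly across text and non-text files (a generous assumption, since graphical documents may well be more responsive in some matters):

```python
text_fraction = 0.70      # share of the collection with usable text (illustrative)
measured_recall = 0.80    # recall the TAR system reports on what it saw

# The TAR system only ever sees the text-bearing documents,
# so its reported recall applies to that subset alone.
# Effective recall over the *whole* collection:
true_recall = measured_recall * text_fraction

print(f"Effective recall: {true_recall:.0%}")  # prints "Effective recall: 56%"
```

The point is not the exact numbers but the structure: any recall figure must be discounted by the fraction of the original population that never entered the system.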
Non-random distribution of responsive documents. Another point to consider when measuring TAR systems is that statistical measures usually assume a random distribution of responsive items, but that assumption simply does not hold for document reviews: some document types are far more likely to be responsive than others, and other document types are all but guaranteed to be nonresponsive. In the old days, a lawyer sent into a document warehouse to look for responsive documents would simply skip all the boxes of invoices, standard reports, and other non-relevant material.
Credit where none is due. Proponents of TAR efficiency often claim credit for TAR identifying as nonresponsive entire document types that would never have been considered for review in the first place had reliable document-type classifications been available. Unable to remove those documents early on, TAR claims credit for removing them only after further processing costs have already been incurred.
The key: the meaning of “document.” The key to understanding the difficulties with TAR systems is to recognize that a “document” is a record or communication intended to be comprehended by the reader or recipient through visual examination, whether viewed on screen or as a printout. The format of the document provides the context within which meaning is derived; to ignore the visual context is to lose much of the meaning of the document.
Automatic visual classification. BeyondRecognition provides technology that classifies paper and electronic source files based on their visual similarity, not on an analysis of the associated text values. The process is completely automatic and extremely reliable: there are no false positives. By examining a single document from a classification, a senior reviewer or subject matter expert can determine whether all the documents in that classification can be safely ignored, deemed always responsive, or routed for further review.
1% Rule. Typically the number of classifications will be about 1% of the number of documents, though the ratio varies by collection. A senior reviewer can therefore quickly gain awareness of what types of documents are in a collection, make decisions on how to treat all the documents in many classifications, and triage the documents that do require further review, e.g., send medical records to a nurse paralegal. The senior reviewer will see at least one document from every classification, including classifications that might have been missed entirely by the small sample sizes TAR systems often use in cases with relatively large document populations, and even classifications with no text or poor-quality text.
Visual duplicates. BR’s ability to identify visual duplicates (e.g., Word, PDF, and TIF representations of the same document) will accelerate the review of those classifications that do require further review.
Quick start. BR is a highly scalable solution and can process hundreds of thousands of pages in a day, getting a project off to a quick start. Rather than hypothesizing or predicting what might be in the collection, senior people can examine all the document types and manage the litigation from a state of awareness from the very outset.
Case Study – 10M Documents for Energy Client
10M Docs, 3.5K Classifications. In a recent information governance project for an energy client, BR processed over 10 million documents, including items such as well logs and seismic charts. Approximately 40% of the documents were graphical in nature, e.g., maps, graphs, diagrams, and designs. The documents had been created over a 50-year span. BR classified them into a little over 3,500 document types. Subject matter experts reviewed each classification and determined whether it needed to be retained as a company “record,” and for the classifications being retained, they used BR’s zonal attribute extraction to pull the desired data elements.
Zonal Attribute Extraction. In a discovery review process, the document attributes would be available for field-limited searching to narrow queries, and would also be available to create formatted reports or for further analysis. BR’s automated glyph-to-text editing and quality review tools yield fielded metadata that is typically more comprehensive and accurate than could be obtained by manual data entry.