Technology-Assisted Review (TAR), or Predictive Coding (PC), is an attempt to minimize discovery review costs by minimizing the number of review decisions attorneys have to make. TAR/PC proponents point out that TAR is generally as effective as the “gold standard” of human review in identifying relevant records. However, it has at least two soft spots compared to what is achievable with alternative technology: text dependence and document unitization.
Virtually all TAR offerings are based on some form of text analysis, which examines the frequency and patterns of words within individual documents compared to the frequencies and patterns of words in other documents or in the corpus as a whole. The soft spots in the defensibility of TAR arise from two indisputable facts: not all document files have associated text, or text of the quality TAR needs to work effectively; and document files are quite often aggregations of what were originally separate documents.
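The word-frequency comparison at the heart of most text-analysis engines can be sketched as a TF-IDF computation. This is a minimal, hypothetical illustration (the documents and tokenization are invented for the example; real engines add stemming, feature selection, and trained models):

```python
import math
from collections import Counter

def tfidf(corpus):
    """Score each term in each document by term frequency times
    inverse document frequency across the whole corpus."""
    doc_tokens = [doc.lower().split() for doc in corpus]
    n_docs = len(doc_tokens)
    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for tokens in doc_tokens:
        df.update(set(tokens))
    scores = []
    for tokens in doc_tokens:
        tf = Counter(tokens)
        scores.append({term: (count / len(tokens)) * math.log(n_docs / df[term])
                       for term, count in tf.items()})
    return scores

docs = ["daily drilling report",
        "expenditure authorization report",
        "daily drilling report summary"]
scores = tfidf(docs)
# "report" appears in every document, so it carries zero weight;
# "drilling" appears in only two, so it helps differentiate them.
```

Note how a term present in every document scores zero: the engine can only differentiate documents on terms that are unevenly distributed, which is exactly why missing or corrupted text undermines the whole approach.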
First, the issue of text dependency.
No Text. In many corporate collections, significant percentages of documents do not have associated text. Examples include files saved to PDF from within applications that don’t add text layers to the PDFs, and scanned documents where the TIF or PDF files do not contain text.
It’s simple: no text means no text analysis, no textual clustering, no text searching, no text anything. While there may be some textual metadata associated with non-textual documents, that metadata provides far less granularity and far less useful differentiation among files than the content discernible to people on the face of the documents. Regardless of how well TAR may deal with files that have accurate textual representations, the fact remains that it can do nothing with documents that have no associated text.
Poor Text. Text-based engines struggle where the text associated with document files is of poor quality in terms of what the engine needs for optimal performance. Examples where this occurs include:
- Foreign-language documents, where language detection and difficulties in tokenizing compound words can prevent the engines from grouping documents that most people would consider similar. Furthermore, clusters formed in one language may differ completely from clusters of the same documents in another language, causing duplicative review.
- Numeric data, where text strings made up of only numeric characters may not contribute to the weighting or evaluation of a document’s content, despite the fact that numeric data may be critical in many cases.
- OCR-generated text, where the quality of the OCR’d text varies widely with the quality of the original document image. Because of OCR errors, the OCR’d version of a document may well not be evaluated the same way as the original native file.
- Short or long documents, where there are indications that some text-based systems perform poorly on both very short and very long documents (http://www.cs.umn.edu/tech_reports_upload/tr2013/13-004.pdf).
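The OCR problem in particular is easy to demonstrate with a simple token-overlap (Jaccard) measure. The character substitutions below are hypothetical but typical of OCR misrecognition (l→1, O→0, rn→m):

```python
def jaccard(a, b):
    """Token-set overlap between two texts, in the range [0, 1]."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

original = "daily drilling report for well number eleven"
# Simulated OCR output with common misrecognitions of the same page.
ocr_text = "dai1y dri11ing rep0rt for we11 nurnber e1even"

print(jaccard(original, original))   # 1.0 -- identical texts
print(jaccard(original, ocr_text))   # low overlap despite identical pages
```

To a human the two pages look the same; to a token-matching engine they share almost no vocabulary, so the OCR’d copy may land in an entirely different cluster than the native original.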
Then there’s the document unitization issue. When people distribute content electronically, they often do it in PDF format, and as a convenience to the recipients and themselves they aggregate what were originally multiple different documents into one PDF. Sometimes it’s to make it easier to email or distribute a collection of documents, sometimes it’s so the sender can bookmark the whole collection. Whatever the reason, many times individual PDFs are really container files for multiple documents.
In one example we ran across, four authorizations for expenditure (AFEs) were lumped into one PDF that also included a daily drilling report. With a TAR system, one of two things will happen: the PDF will be evaluated either as an AFE or as a daily drilling report. Either way, the other document or documents essentially become invisible to the text-restricted software.
The document unitization issue can be especially problematic with scanned paper documents where entire boxes may be scanned as one multi-page file.
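A toy keyword classifier makes the dominance effect concrete (the labels and keyword lists here are hypothetical, invented for illustration): when separate documents are concatenated into one file, whichever content dominates determines the single label, and the minority documents vanish from view.

```python
from collections import Counter

# Hypothetical keyword profiles for two document types.
PROFILES = {
    "AFE": {"expenditure", "authorization", "budget", "approval"},
    "drilling_report": {"drilling", "depth", "mud", "bit"},
}

def classify(text):
    """Assign the single label whose keywords appear most often."""
    tokens = Counter(text.lower().split())
    scores = {label: sum(tokens[w] for w in words)
              for label, words in PROFILES.items()}
    return max(scores, key=scores.get)

afe = "authorization for expenditure budget approval expenditure"
report = "drilling depth mud bit drilling depth drilling mud bit depth"

combined = " ".join([afe] * 4 + [report])  # four AFEs plus one report
print(classify(combined))  # → AFE: the drilling report is invisible
```

The drilling report is present in full, yet nothing in the output records its existence, because a document-level engine emits exactly one verdict per file.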
A Better Approach
There is a better, more defensible approach to document review: visual classification. As the name suggests, visual classification clusters or groups documents based on their visual appearance, not on the presence or absence of associated text. In other words, visual classification is not a text-dependent technology. Text can be an important attribute of documents for other purposes, but it is not needed for classification.
Attorneys review clusters of documents to decide whether they are apt to include relevant documents. Many clusters or groupings can be eliminated because they are simply irrelevant to the matter at hand; in an employment discrimination case, for example, invoices are undoubtedly irrelevant. Decisions are made with an awareness of the documents in the clusters being evaluated, and the review is comprehensive because it analyzes all the documents in a collection: not just those with adequate textual representations, not just those in a particular language, not just non-numeric files, and not just average-length documents.
Page-Level Classification to Address Unitization Issues. While TAR operates at the document-object level, visual classification analyzes and groups individual pages as well as documents. Visual classification learns what the pages that begin documents look like, and those beginning-of-document (“BOD”) pages are tracked even when they fall within the pages of longer documents. When attorneys select a cluster or grouping for inclusion, the system can select the larger documents in which those pages occur, providing far greater inclusion of potentially relevant content. The daily drilling report buried in a single “document” along with four AFEs does not have to disappear. Visual classification technology can be used to disaggregate, or split apart, the combined documents, although there may be legal or business reasons why a client would prefer to leave combined documents as collected rather than split them.
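Once BOD pages have been flagged, splitting a combined file back into its logical documents is mechanical. A minimal sketch, assuming pages arrive as a list with a parallel list of per-page BOD flags (the flags here are supplied by hand for illustration; in practice the classifier produces them):

```python
def split_at_bod(pages, bod_flags):
    """Split a list of pages into logical documents, starting a new
    document wherever a beginning-of-document (BOD) flag is set."""
    documents = []
    for page, is_bod in zip(pages, bod_flags):
        if is_bod or not documents:
            documents.append([])
        documents[-1].append(page)
    return documents

# A nine-page PDF: four one-page AFEs followed by a five-page report.
pages = [f"page-{i}" for i in range(1, 10)]
bod = [True, True, True, True, True, False, False, False, False]
docs = split_at_bod(pages, bod)
print(len(docs))  # 5 logical documents recovered from one file
```

Whether to apply the split, or simply use the BOD pages to pull the whole container file into review, remains the kind of legal or business judgment noted above.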
Visual classification can also be augmented with Find technology, which searches for text or glyphs (i.e., graphical elements), so that documents containing certain concepts can be included whether or not they fell within the documents selected on the basis of visual clustering or grouping.
For more about the use of visual classification for document review, see www.BeyondReview.us.
- Information Governance Lessons from 4 AFEs and a Daily Drilling Report
- “Predictive” Coding and the Naked Emperor
- Measuring Text Bias/Tunnel Vision in Content Search and ECM Systems
- Is Your TAR Really Text-Assisted-Review? (And Why It Should Matter to You)
- Sampling Resolves Conjecture on Significance of Non-Textual Documents
- Basic Assumptions Gone Wrong: ECM and Document Unitization