Technology Assisted Review (“TAR”) and visual classification take two different approaches to classifying documents. TAR uses the text associated with the documents being classified while visual classification bases its analysis on graphical representations of those documents.
TAR is an outgrowth of tools designed to extract meaning from collections of textual content. Visual classification was developed as an enterprise-scale information governance tool for completing document-centric initiatives like content migration, archive digitization, and silo consolidation.
The different approaches and origins of TAR and visual classification lead to major differences in awareness, comprehensiveness, transparency, repurposing work product, attribute extraction, redaction, and correcting document boundaries among other things.
Technology-assisted review (or “TAR”) and visual classification have the same overarching objective in litigation or investigative contexts: to defensibly determine which documents or electronic files are responsive to a given set of requests with the least effort. They go about meeting that objective in markedly different ways that yield substantially different benefits and options. This document describes those differences.
First we will define some terms and then examine the differences.
II. Different Underlying Technology
A. Technology Assisted Review
The phrase Technology Assisted Review (or “TAR”) has come to be associated with a process that uses text analysis to extend decisions made reviewing a subset of documents to an entire population of documents. Depending on the specific TAR application, the initial set of documents to be reviewed (the “seed” set) may be selected randomly or chosen by other techniques such as witness interviews or full text searching. TAR applications use weighting algorithms to analyze the patterns of words used in the documents that have been designated as responsive compared to the patterns of words in the nonresponsive documents, and then rank the unreviewed documents that are most like the initial responsive documents. Depending on the specific TAR system, there can be several training sessions during which TAR rankings are compared to human review decisions to determine how accurately the algorithm is predicting the review decisions.
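The ranking step described above can be illustrated with a minimal sketch. It assumes a very simple word tokenizer and a smoothed log-odds weighting; actual TAR products use their own, typically more sophisticated, weighting algorithms, and the function names here are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokens; an assumed, deliberately simple tokenizer."""
    return re.findall(r"[a-z]+", text.lower())


def rank_unreviewed(responsive, nonresponsive, unreviewed):
    """Order unreviewed documents from most to least like the responsive seed set."""
    pos = Counter(t for doc in responsive for t in tokenize(doc))
    neg = Counter(t for doc in nonresponsive for t in tokenize(doc))
    vocab = set(pos) | set(neg)
    # Add-one smoothing so a word seen on only one side does not produce log(0).
    pos_total = sum(pos.values()) + len(vocab)
    neg_total = sum(neg.values()) + len(vocab)

    def score(doc):
        total = 0.0
        for t in tokenize(doc):
            if t in vocab:  # words never seen in the seed set carry no weight
                total += math.log((pos[t] + 1) / pos_total)
                total -= math.log((neg[t] + 1) / neg_total)
        return total

    return sorted(unreviewed, key=score, reverse=True)


seed_responsive = ["merger agreement draft", "merger term sheet"]
seed_nonresponsive = ["lunch menu friday", "office party lunch"]
queue = ["friday lunch order", "revised merger agreement"]
ranked = rank_unreviewed(seed_responsive, seed_nonresponsive, queue)
```

In this toy example the document that shares words with the responsive seed documents is ranked ahead of the one that shares words with the nonresponsive seed documents.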
Text tokenization. TAR systems are not all alike. Some TAR systems feed their algorithms only certain types of data. For example, some ignore numbers, some use primarily sentences, and others may use primarily noun phrases. Differences in how text is tokenized or passed to the algorithms can affect how things like spreadsheets, presentations, and lists of items are analyzed and compared. Organizations considering TAR systems ought to learn what method is used by the systems being considered.
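To see why the tokenization method matters, consider the following sketch. The two tokenizers are hypothetical and are not taken from any particular TAR product; they only illustrate how ignoring numbers can make two different spreadsheet rows look identical to the algorithm.

```python
import re


def tokens_with_numbers(text):
    """Keep both words and numeric values as tokens."""
    return re.findall(r"[A-Za-z]+|\d+(?:\.\d+)?", text)


def tokens_words_only(text):
    """Drop numbers entirely, as some systems are described as doing."""
    return re.findall(r"[A-Za-z]+", text)


row_a = "Widget 4150 units 12.5 percent"
row_b = "Widget 9 units 0.3 percent"

same_without_numbers = tokens_words_only(row_a) == tokens_words_only(row_b)
same_with_numbers = tokens_with_numbers(row_a) == tokens_with_numbers(row_b)
```

With numbers dropped, the two rows reduce to the same three word tokens, so any difference between them is invisible to the downstream analysis.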
Use of subsequent review decisions. Depending on the system, decisions made on the documents queued up for subsequent review may be taken into consideration for selecting or queueing up other documents for review, and in some cases there may not be further human review of the documents selected as being most like the responsive set of documents.
Non-textual or poor text documents. When documents do not contain text, e.g., image-only PDFs or scanned documents, TAR systems may attempt to differentiate between responsive and non-responsive documents by analyzing the metadata associated with the documents, e.g., date created, file size, original folder path, author, etc.
Statistical Sampling. The TAR approach is commonly associated with statistical sampling of the non-responsive documents to estimate how many responsive documents may not have been selected.
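As a rough illustration of this kind of estimate, the sketch below projects the responsive rate found in a random sample of the non-selected documents onto the whole non-selected set. The function name and the use of a Wilson score interval are assumptions made for illustration; a given TAR workflow may use a different sampling protocol or interval method.

```python
import math


def estimate_missed(unselected_count, sample_size, responsive_in_sample, z=1.96):
    """Return (point, low, high) estimates of responsive documents left unselected.

    Uses a Wilson score interval; z=1.96 corresponds to roughly 95% confidence.
    """
    p = responsive_in_sample / sample_size
    denom = 1 + z**2 / sample_size
    center = (p + z**2 / (2 * sample_size)) / denom
    half = z * math.sqrt(
        p * (1 - p) / sample_size + z**2 / (4 * sample_size**2)
    ) / denom
    low = max(0.0, center - half)
    high = min(1.0, center + half)
    return (
        round(p * unselected_count),
        round(low * unselected_count),
        round(high * unselected_count),
    )


# Hypothetical numbers: 100,000 unselected documents, 1,500 sampled, 15 found responsive.
point, low, high = estimate_missed(100_000, 1_500, 15)
```

The interval makes explicit that the sample supports only a range for the number of missed documents, not an exact count.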
B. Visual Classification
While TAR is based on text as augmented with metadata, visual classification automatically groups documents based on a graphical analysis of their appearance, regardless of whether or not there is text associated with the documents.
Figure 1. Visual Classification Groups Visually-Similar Documents
Tokenization. Visual classification considers all visible glyphs or graphical elements, including those associated with numbers, letters, and punctuation. Visual classification also groups glyphs at multiple levels including word, sentence, paragraph, and page.
Cluster size graphs. The number of visually-similar groupings is usually less than one percent of the total number of documents, and the frequency distribution of the number of documents in the groupings is such that a relatively small number of groupings contain the vast majority of the documents in the collection.
Figure 2. Frequency Distribution of Number of Documents in Clusters of Visually-Similar Documents
Documents within the same cluster are very similar and review decisions can often be made for all the documents in a cluster by reviewing one or two documents per cluster. The review decisions can be to exclude all the documents, include all, or to review further.
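The practical effect of this skewed distribution can be sketched as follows. The cluster sizes are invented for illustration, and the helper is hypothetical; it simply counts how many of the largest clusters must be reviewed to cover a target share of the collection.

```python
def clusters_needed(cluster_sizes, target_share=0.9):
    """Return how many of the largest clusters cover target_share of all documents."""
    total = sum(cluster_sizes)
    covered = 0
    for count, size in enumerate(sorted(cluster_sizes, reverse=True), start=1):
        covered += size
        if covered >= target_share * total:
            return count
    return len(cluster_sizes)


# Hypothetical collection of 10,000 documents in 10 clusters.
sizes = [6000, 2500, 1000, 300, 100, 50, 25, 15, 5, 5]
needed = clusters_needed(sizes, target_share=0.9)
```

Here three of the ten clusters account for more than 90 percent of the documents, so reviewing one or two examples from each of those three clusters addresses most of the collection.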
Document-type labels. When clusters are reviewed, document-type labels can be designated for each cluster or certain clusters. The document-type designations can be used for many purposes as discussed later.
Figure 3. Clusters Reviewed for Responsiveness and Assigned Document-Type Labels
Concept searching. Following cluster-level review, the entire population can be searched using enhanced text searching for concepts to identify further documents that may be responsive. Even if documents were in an excluded cluster or grouping they can still be considered for review. In other words, there are two ways documents can be considered for further review: based on visual clustering or based on concept searching. So while visual classification does not use or require text to group documents, it can use text for other purposes like concept searching or attribute extraction.
As might be expected, the different approaches taken by TAR and visual classification result in several significant differences.
With visual classification, reviewing one or two documents per cluster starting with the largest clusters gives the senior team a personal awareness of the types of documents in a collection and their relative frequencies. This personal awareness is invaluable in making practical decisions on how to proceed with the matter, which types of team members to use (e.g., nurse paralegals for medical documents, reviewers with accounting backgrounds for financial records), how to negotiate with opposing counsel, and how to explain to a court what the collection contained and how it was winnowed down to the set of documents that is ultimately produced.
With TAR, the review team doesn’t have this sense of what the forest looks like compared to individual trees or leaves. The team may have a summary by general file types such as Word, Excel, or PDF documents, but that is far less granular and far less helpful. For example, the same content can be displayed in virtually the same format in Word or PDF, so knowing the frequency of a general file type like “PDF” doesn’t really mean anything other than, “I don’t know.”
Net-net: Visual classification provides senior project leaders with a far richer awareness of the type and number of documents in a collection from the very beginning of the project.
In TAR systems, words have a one-dimensional order with each word either in front of or behind other words on a sort of virtual ticker tape. There is no way to analyze things like logos, graphics, form lines, signatures, page orientation, and how things are arranged or placed on a page – the sort of thing that authors spend long hours composing and adjusting to convey the correct meaning. TAR, being text restricted, is not able to derive meaning or group documents based on those rich visual cues, and even where documents have associated text, its analysis will be far less nuanced. In a sense, it is like facial recognition that works only on words that can describe a face.
Visual classification is the only technology that will accurately classify documents regardless of the presence or absence of text in the documents and regardless of the use of complete sentences. This means it is far more comprehensive in its treatment of all the documents in an organization. The text-based TAR approach will be left to rely on OCR accuracy or on the use of metadata associated with image-only or poor-text quality documents in its attempt to differentiate between responsive and nonresponsive documents.
Having become familiar with their own documents by reviewing clusters of visually-similar documents, counsel using visual classification are in a far better position to be transparent with opposing counsel and the court than those using TAR. Counsel can provide accurate descriptions of what types of documents were in a collection, how they were processed, and what types of documents were included or excluded.
With TAR systems counsel have a harder time understanding the characteristics of their collections and are inherently limited in how transparent they can be about those characteristics.
Defensibility is of course closely linked to transparency: it’s hard to defend something you can’t describe. With visual classification, a lawyer can show a judge examples of how documents are grouped visually and how document-type labels can be applied, and can provide examples of what was included or excluded. The judge can literally see how visual classification works. By contrast, with TAR a lawyer is left talking about a black box that provides informed guesses and how statistical theory supports the notion that most or at least much of what should have been produced was produced.
E. Cooperation Options
Commentators and judges are pressuring litigants to lower costs by engaging in reasonable cooperation over discovery matters; see, e.g., The Sedona Conference® Cooperation Proclamation.
With visual classification, counsel have far more options in terms of cooperating with opposing counsel. For example, with visual classification counsel have the ability to create document-type binders (either traditional 3-ring paper binders or virtual electronic binders like tabbed PDF collections) that could be shown to opposing counsel along with frequency reports to show how many of each type of document there are.
Because the visual clustering is automatic, and not controlled by the producing party, the requesting party will have far fewer concerns about being sandbagged by the process – the clustering is completely objective.
F. Power of Document-Typing
Human progress has been based on the development of language – on being able to apply labels to things and actions and use those terms in communications. By providing counsel with the option to label clusters of visually-similar documents, visual classification enables much more precise and powerful downstream functions. For example:
1. Even Greater Transparency Options. If counsel uses the option to apply document-type labels to at least the largest clusters, counsel can provide actual statistics on what types of documents are in a given collection and can substantiate them.
2. Workload Allocations. Knowing what types of documents are present enables the review team to assign certain types of documents to certain types of reviewers, e.g., health records to nurse-paralegals and accounting records to CPAs. The result is a better work product and a better use of the most expensive part of discovery review: labor costs.
3. Processing Priorities. Having document type labels enables counsel to communicate within the team, with opposing counsel, and with the court what the processing priorities are in a case.
4. Confidentiality. Confidentiality restrictions can be applied at the document-type level, rather than depending on search terms that may or may not be present in the documents that warrant protection.
5. Repurposing Work Product. With visual classification, the work product that goes into reviewing and labeling clusters can be transferred to other similar cases so that in each succeeding case the party starts from a more established base. The team knows what the document type labels mean, knows which ones were excluded before, knows which team members were allocated which document types, and can even know which ones resulted in documents being produced that were ultimately used as exhibits in depositions, at trial, or in briefings.
The only clusters of visually-similar documents that would have to be labeled would be ones where documents formed completely new clusters. With visual classification a point of “convergence” occurs where virtually all the documents being processed fall into previously reviewed classifications.
G. Document Attribute Extraction
With visual classification the producing party has the ability to extract attributes that appear on the face of the documents and place them in a specific field. For example, the producing party might want to extract the invoice number from invoices so it could check whether there were duplicate or missing invoices in the collection. Visual classification includes the ability to look in either absolute or relative positions on a page within given document types and pull the data values found there. This includes a variety of delimiters to define what to pull within the given zones and several formatting options to normalize those values, e.g., to specify the date/time format when extracting dates.
Figure 5. Example of Attribute Extraction from Visually-Similar Cluster
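The idea of absolute and relative extraction zones can be sketched in a few lines. The page model below (tuples of text and position in inches) and all function names are assumptions made for illustration; they do not describe the product's internal data structures.

```python
from datetime import datetime

# Hypothetical page model: (text, x, y), with x and y in inches from the top-left corner.
invoice_page = [
    ("Invoice", 5.0, 1.0), ("number:", 5.6, 1.0), ("INV-1042", 6.4, 1.0),
    ("Date:", 5.0, 1.3), ("03/07/2014", 5.6, 1.3),
]


def words_in_zone(words, left, top, right, bottom):
    """Absolute zone: text of all words positioned inside the rectangle, in reading order."""
    hits = [w for w in words if left <= w[1] <= right and top <= w[2] <= bottom]
    return " ".join(w[0] for w in sorted(hits, key=lambda w: (w[2], w[1])))


def value_after_label(words, label, line_tolerance=0.1):
    """Relative zone: the nearest word to the right of `label` on the same line."""
    for text, x, y in words:
        if text == label:
            right = [w for w in words if abs(w[2] - y) <= line_tolerance and w[1] > x]
            return min(right, key=lambda w: w[1])[0] if right else None
    return None


def normalize_date(raw, source_format="%m/%d/%Y"):
    """Normalize an extracted date string to ISO 8601 (YYYY-MM-DD)."""
    return datetime.strptime(raw, source_format).date().isoformat()


invoice_number = value_after_label(invoice_page, "number:")
invoice_date = normalize_date(value_after_label(invoice_page, "Date:"))
header_text = words_in_zone(invoice_page, 5.0, 0.9, 7.0, 1.1)
```

Once invoice numbers sit in their own field, checking a collection for duplicate or missing numbers becomes an ordinary database-style comparison.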
H. Content-Enabling Image-Only Documents
When visual classification encounters image-only documents, it creates a text layer by converting the glyphs or graphical elements on the page to text which it then saves with the document in a format specified by the client, typically an image-with-text PDF document. As discussed later, TAR systems that are able to analyze metadata as well as text from the face of the document itself can use the document type labels and extracted metadata values to improve their performance.
I. Correcting Logical Document Boundaries
In the ordinary course of business, users combine multiple documents into single PDFs. For example, when sending out documents following a meeting, a business person might combine the agenda with handouts that were given out during the meeting, and notes or minutes from the meeting into a single PDF. The same sort of thing happens when paper documents are scanned.
Having multiple documents essentially buried or hidden under the top document can cause all sorts of problems when reviewing or searching the documents, as often the first or top page receives most or all of the attention. Combining multiple documents into one file can also cause anomalous results when using text-based TAR systems because all the text from multiple documents is attributed to one document, causing the document to be evaluated or ranked improperly.
Visual classification knows what the first pages of documents look like. It can therefore identify document boundaries within documents so that hidden or buried documents can be properly identified. This option is lacking in TAR systems.
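Once each page has been judged as looking like a first page or not, splitting a combined file into logical documents is straightforward. The sketch below assumes those per-page judgments are already available as a list of booleans; how they are produced is outside its scope, and the function name is hypothetical.

```python
def split_into_documents(first_page_flags):
    """Return inclusive (start, end) page index ranges, one per logical document.

    first_page_flags[i] is True when page i looks like the first page of a document.
    """
    ranges = []
    start = 0
    for i, is_first in enumerate(first_page_flags):
        if is_first and i > start:
            ranges.append((start, i - 1))
            start = i
    if first_page_flags:
        ranges.append((start, len(first_page_flags) - 1))
    return ranges


# Hypothetical 6-page PDF: a 2-page agenda, a 3-page handout, and 1 page of minutes.
flags = [True, False, True, False, False, True]
documents = split_into_documents(flags)
```

Each resulting range can then be reviewed, searched, and ranked as its own document rather than being hidden behind the top page.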
J. Enhanced Find Functionality
Visual classification offers enhanced find functionality to differentiate among documents. For example, if a particular cluster of visually-similar documents contained both responsive and nonresponsive documents, the enhanced find functionality permits users to identify documents that have certain values in specific locations on the documents, e.g., a date range within the upper right two inches of the document, or an invoice range to the right of the term “Invoice number:” in the upper right part of the document. Another example would be to find the word “Entertainment” in the left third of a document and a value greater than $2,000 in the right two-thirds of the document on the same line as “Entertainment.” (Fig. 6. Enhanced Find Functionality)
These sorts of absolute and relative positional operators are missing from TAR systems.
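The “Entertainment” example can be expressed as a small positional query. This is a sketch under assumed conventions: words are (text, x, y) tuples with positions in inches, the page is 8.5 inches wide, and the function names are hypothetical rather than part of any product.

```python
import re


def parse_amount(text):
    """Parse strings like '$2,450.00' into a float; return None for non-amounts."""
    match = re.fullmatch(r"\$?(\d[\d,]*(?:\.\d+)?)", text)
    return float(match.group(1).replace(",", "")) if match else None


def find_line_amount(words, term, threshold, page_width=8.5, line_tolerance=0.1):
    """True if `term` is in the left third of the page and an amount greater than
    `threshold` appears on the same line in the right two-thirds."""
    third = page_width / 3
    for text, x, y in words:
        if text == term and x < third:
            for other, ox, oy in words:
                amount = parse_amount(other)
                if (amount is not None and ox >= third
                        and abs(oy - y) <= line_tolerance
                        and amount > threshold):
                    return True
    return False


expense_report = [
    ("Entertainment", 1.0, 4.0), ("$2,450.00", 6.5, 4.0),
    ("Travel", 1.0, 4.3), ("$3,100.00", 6.5, 4.3),
]
hit = find_line_amount(expense_report, "Entertainment", 2000)
```

A plain text search for “Entertainment” and a large dollar amount could not express the requirement that the two appear on the same line in particular parts of the page.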
K. Text and Zone-Based Redactions
Visual classification catalogs the page coordinates of each glyph or graphical element, and when redactions are required to protect PII, it can be very precise in placing redactions to mask the sensitive data. In addition, it can redact zones within clusters of visually similar documents so that regardless of the terms that appear in the zone or even if there was handwriting used, all content in the zone will be redacted. This automated redaction can be performed at many hundreds of thousands of redactions per hour.
Figure 7. Example of Text-Based Redaction of Social Security Numbers Based on Text Pattern
Figure 8. Example of Zoned Redaction Based on Zone within Visually-Similar Cluster
Some TAR applications will be able to perform text-based redactions using word patterns or word lists, either using included functionality or by using third-party applications, but automated zoned redactions require that documents first be grouped by visual similarity.
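The difference between the two redaction styles can be sketched as follows. This is an illustration only: it operates on extracted text and on assumed (text, x, y) word tuples rather than on page images, the Social Security number pattern is a simplified one, and the function names are hypothetical.

```python
import re

# Simplified pattern for Social Security numbers written as 123-45-6789.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact_ssns(text, mask="[REDACTED]"):
    """Text-based redaction: mask anything matching the SSN pattern."""
    return SSN_PATTERN.sub(mask, text)


def redact_zone(words, left, top, right, bottom, mask="[REDACTED]"):
    """Zone-based redaction: mask every word inside the zone, whatever it says."""
    return [
        (mask if left <= x <= right and top <= y <= bottom else text, x, y)
        for text, x, y in words
    ]


redacted_text = redact_ssns("SSN: 123-45-6789 on file")

form_page = [("Jane", 1.0, 2.0), ("Doe", 1.5, 2.0), ("Total", 1.0, 5.0)]
redacted_page = redact_zone(form_page, left=0.5, top=1.8, right=3.0, bottom=2.2)
```

The pattern-based function depends on the sensitive value being recognizable as text, while the zone-based function masks whatever falls in the defined area, which is why the latter only makes sense once documents with the same layout have been grouped together.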
L. Potentially Complementary Approaches
Visual classification and TAR are not mutually exclusive approaches. TAR can be made more effective if visual classification is used up front to weed out extraneous content, content-enable image-only documents, and make document types and extracted attributes available in the form of metadata for use by the TAR system.
M. Litigation Expenditures Can Fund Information Governance Initiatives
The intellectual effort that goes into applying document-type labels can be applied to reducing the effort required to accomplish document-centric information governance initiatives like content migration, silo consolidation, digitization, or file share remediation. If the organization has already used visual classification, the analysis that was based on those uses can reduce the effort required to select and produce documents from within those collections.
N. Completeness of Processing Stack
Visual classification is part of a processing stack that includes integrated collection, document-type designation, redaction, searching, glyph-to-text conversion, and attribute extraction. TAR systems will have major gaps in those areas and have to overcome them with third-party software or service offerings.
In many ways it is unfair to compare TAR and visual classification. Visual classification was designed from the ground up to enable information governance initiatives that involve collecting, classifying, content-enabling, extracting attributes, redacting, and creating load files for enterprise-scale document collections. TAR’s mission was from the beginning far more modest in scope and its choice of approaches was limited by the mission – to do the best job classifying documents assuming that the only information available was the text that appeared in the document.
Within the scope of its original mission TAR does a good job. However, the mission has changed and the capabilities of modern technology have grown past just being able to analyze documents using text surrogates. Discovery review is more properly viewed as part of an ongoing information governance initiative, more of a process than a series of unrelated projects.
A. Summary Comparison Table
This is a summary of the features or benefits of visual classification vs. TAR:
Figure 9. Summary Comparison Table
B. About Working with BeyondRecognition
There are several options on how to work with BeyondRecognition or a member of its network of companies. We can provide services using a cloud model or place systems behind your firewall.