The Grossman-Cormack article, “Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery,” has kicked off some useful discussions. Here are our comments on two blog posts about the article, one by Ralph Losey, the other by John Tredennick and Mark Noel:
Losey: The Text Streetlight
Ralph Losey made an interesting point in his July 6, 2014, blog posting about the Grossman-Cormack article: people tend to look for things where it’s easiest to look, which may not be where those things are most likely to be.
He called this type of observational bias the “streetlight effect,” and illustrated it with a “lost keys” example:
“You know you dropped your keys near your front door, but you do not look there because it is dark, it is hard to search there. You take the easy way out. You search by the street lamp.”
He was discussing the use of random sampling to select seed documents for Technology-Assisted Review (“TAR”), but the same point can be made about TAR itself – relying on text to find and analyze documents may be easy in the sense that there are many text-based tools, but it ignores the fact that significant proportions of many document collections don’t have meaningful, accurate text to analyze. TAR looks in the area illuminated by the text streetlamp, but it can’t “see” image-only documents.
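To make the blind spot concrete, here is a minimal sketch (hypothetical code, toy data) of how one might gauge what fraction of a collection a text-based TAR tool can actually “see” – documents whose extraction yields no meaningful text are simply invisible to it:

```python
# Hypothetical sketch: estimate how much of a collection is visible to
# text-based tools. Filenames and contents are toy examples, not real data.

def has_meaningful_text(text, min_chars=25):
    """Treat a document as text-searchable only if extraction
    produced a minimum amount of alphabetic content."""
    alpha = sum(ch.isalpha() for ch in text)
    return alpha >= min_chars

# Simulated extraction results: image-only PDFs, well logs, and
# schematics typically yield empty or junk text.
collection = {
    "memo.txt": "Quarterly drilling report for the north field...",
    "contract.docx": "This agreement is entered into by and between...",
    "scan001.pdf": "",            # image-only scan, no OCR text
    "well_log_17.tif": "|||~~|",  # extraction junk
    "schematic_a.png": "",
}

visible = [name for name, text in collection.items() if has_meaningful_text(text)]
invisible = [name for name in collection if name not in visible]

print(f"Text-visible: {len(visible)} of {len(collection)} documents")
```

In this toy collection, three of five documents fall outside the streetlight: no amount of text analytics will surface them.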
Tredennick & Noel: Oversampling & the Value of Clustering
John Tredennick and Mark Noel of Catalyst offer a very useful model of sampling in their blog post, “Comparing Active Learning to Random Sampling: Using Zipf’s Law to Evaluate Which is More Effective for TAR.” They state that topic clusters of documents follow the “Zipf” distribution, with the largest cluster being twice as large as the second-largest and three times the third-largest, and so on.
They point out that random sampling essentially oversamples the larger clusters while completely missing smaller ones. For example, in the diagram below, each circle is a cluster, the area of each circle represents the number of documents in it, the intersections of the grid lines represent sampling points, and green circles represent clusters that would be sampled under a random approach. You can see that the largest cluster, in the center, would have been sampled 13 times, while many other clusters would not have been sampled at all.
They call their solution “contextual diversity,” and it apparently operates by clustering documents based on the terms (i.e., “text”) used in them, and then using the cluster information to ensure that more clusters are represented in the review set than simple random sampling would achieve. They have helpful Venn diagrams depicting how their contextual diversity approach does a more thorough job than random sampling.
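The intuition behind cluster-aware seed selection is easy to demonstrate. This is a minimal sketch – not Catalyst’s actual contextual diversity algorithm, and the cluster sizes are made-up toy numbers – comparing one-seed-per-cluster selection against a random sample of the same size:

```python
import random

random.seed(1)

# Toy Zipf-like cluster sizes; each document is tagged (cluster, index).
sizes = [240, 120, 80, 60, 12, 6, 3]
docs = [(k, i) for k, size in enumerate(sizes) for i in range(size)]

# Cluster-aware selection: one representative per cluster
# guarantees every cluster is seen by a reviewer.
per_cluster = {}
for k, i in docs:
    per_cluster.setdefault(k, (k, i))
diverse_seeds = list(per_cluster.values())

# A random sample of the same size tends to miss the small clusters.
random_seeds = random.sample(docs, len(diverse_seeds))

diverse_cov = {k for k, _ in diverse_seeds}
random_cov = {k for k, _ in random_seeds}
print(f"Diversity selection covers {len(diverse_cov)} of {len(sizes)} clusters")
print(f"Random sampling covers {len(random_cov)} of {len(sizes)} clusters")
```

The diversity approach covers all seven clusters by construction; the random sample of seven documents almost always lands repeatedly in the big clusters and skips the small ones.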
What they don’t talk about is that even if contextual diversity correctly samples all the text-based documents, it still depends on having text to analyze, and in some collections up to half the documents may be non-textual, e.g., image-only PDFs, well logs, schematics, etc.
Of course, most of the people blogging about these topics are promoting technology they use or sell – Roitblat, Grossman, Cormack, Losey, Tredennick, and Noel among them. We’re no different, but we do think it’s fair to ask: “If clustering is useful on the proportion of documents that have text, wouldn’t it be even more useful if you could cluster non-textual documents as well? If you found a relevant non-textual document, wouldn’t it be nice to have a way to find similar documents?”
The fact that we provide enterprise-scale visual similarity technology shouldn’t detract from the value of asking these questions.
The problem with many e-discovery approaches is that they seem to be based on a standalone crisis-management model, with little pre-litigation information management and little carry-forward to the next case. Visual classification can be used in a steady-state mode, where document classification, retention, and indexing are done as an integral part of day-to-day operations. Document classification and granular retention schedules can be used to minimize content on file shares and to select which types of documents are apt to be relevant to any given litigation. Attribute extraction developed for migrating content to ECM systems can also be used to provide specificity or particularity for documents selected for review.
For other related blog posts, see: