Simple text search architecture, in which every non-noise word of every document is indexed, doesn’t work well at enterprise scale. It consumes considerable IT resources and, from the end user’s perspective, returns large numbers of irrelevant results. This approach may work on small personal collections where it’s not too burdensome to wade […]
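
To make the cost concrete, here is a minimal sketch (our illustration, not an architecture from the post) of that naive approach: post every non-stopword token of every document into an inverted index. At enterprise scale, common terms accumulate enormous posting lists, which is exactly why a broad query returns so many irrelevant hits.

```python
from collections import defaultdict

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}  # tiny illustrative noise-word list

def build_index(docs):
    """Naive inverted index: every non-stopword token of every document is posted."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            if token not in STOPWORDS:
                index[token].add(doc_id)
    return index

docs = {
    1: "the quarterly report is due in march",
    2: "march sales report and forecast",
}
index = build_index(docs)
print(sorted(index["report"]))  # [1, 2] -- common business terms match most documents
```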

Selection bias occurs when data are selected for analysis in such a way that not all objects being evaluated are equally likely to be chosen. The result is samples that are not representative of the entire population. An extreme example would be predicting the presidential race by sampling only New York City or Los Angeles, or predicting all […]
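
A quick simulation (illustrative numbers, not from the post) makes the effect visible: an estimate built from one subgroup diverges from the population value, while a sample in which every member is equally likely to be chosen tracks it.

```python
import random

random.seed(42)

# Hypothetical electorate: cities support a candidate more than rural areas do.
city = [1] * 70 + [0] * 30       # 70% support among 100 city voters
rural = [1] * 40 + [0] * 60      # 40% support among 100 rural voters
population = city + rural * 4    # rural voters outnumber city voters 4:1

def support_rate(sample):
    return sum(sample) / len(sample)

biased = random.sample(city, 50)                 # sampling only the city subgroup
representative = random.sample(population, 50)   # every voter equally likely

print(f"true rate:             {support_rate(population):.2f}")  # 0.46
print(f"city-only sample:      {support_rate(biased):.2f}")      # ~0.70
print(f"representative sample: {support_rate(representative):.2f}")
```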

Whack-a-mole describes a situation in which attempts to solve a problem are piecemeal or superficial, resulting only in temporary or minor improvement, as in, “the site’s security team has an ongoing battle against spammers, but it’s a game of whack-a-mole.” See Oxford Dictionaries. The whack-a-mole concept is familiar to those attempting to classify documents using […]

Technology-Assisted Review (TAR), or Predictive Coding (PC), is an attempt to minimize discovery review costs by minimizing the number of review decisions attorneys have to make. TAR/PC proponents point out that TAR is generally as effective as the “gold standard” of human review at identifying relevant records. However, it has at […]
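
TAR implementations vary by vendor; the sketch below shows only the core idea, with scikit-learn as a stand-in rather than any particular product. A classifier is trained on a small attorney-coded seed set, then scores the unreviewed documents so the likeliest-relevant ones are reviewed first, cutting the number of decisions attorneys must make by hand.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical seed set an attorney has already coded (1 = relevant).
seed_docs = [
    "merger agreement term sheet draft",
    "acquisition due diligence checklist",
    "office holiday party signup",
    "cafeteria menu for next week",
]
seed_labels = [1, 1, 0, 0]

unreviewed = [
    "revised merger agreement with counsel comments",
    "parking garage maintenance notice",
]

vectorizer = TfidfVectorizer()
model = LogisticRegression().fit(vectorizer.fit_transform(seed_docs), seed_labels)

# Score unreviewed documents; the highest-scoring ones go to attorneys first.
scores = model.predict_proba(vectorizer.transform(unreviewed))[:, 1]
for doc, score in sorted(zip(unreviewed, scores), key=lambda pair: -pair[1]):
    print(f"{score:.2f}  {doc}")
```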

For many years, information governance failed to achieve many of its goals, primarily because document classification, the necessary first step in nearly all IG initiatives, had proven difficult at enterprise scale. No more: 2015 marks the beginning of a whole new era. Background: People involved in information governance have long known […]

The Grossman-Cormack article, “Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery,” has kicked off some useful discussions. Here are our comments on two blog posts about the article, one by Ralph Losey and the other by John Tredennick and Mark Noel. Losey: The Text Streetlight. Ralph Losey made an interesting point in his July 6, […]

There has been ongoing debate in information governance and e-discovery circles on the significance of documents that do not contain searchable text, with evidence that half or more of the documents in some collections cannot be analyzed or managed because the tools used for those purposes require textual representations. How important is this limitation in […]
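
One way to quantify the limitation for a given collection is simply to measure it. The sketch below is hedged: `extract_text` is a stand-in for whatever extraction tool a real pipeline would use (Apache Tika, pdftotext, an OCR pass), and the 25-character threshold is an illustrative cutoff for “effectively no text.”

```python
from pathlib import Path

def extract_text(path: Path) -> str:
    """Stand-in for a real extractor; here we can only read plain-text
    files and treat anything undecodable as having no text layer."""
    try:
        return path.read_text(encoding="utf-8")
    except (UnicodeDecodeError, OSError):
        return ""

def textless_fraction(root: str, min_chars: int = 25) -> float:
    """Fraction of files whose extracted text is effectively empty --
    the documents a text-dependent tool cannot analyze or manage."""
    files = [p for p in Path(root).rglob("*") if p.is_file()]
    if not files:
        return 0.0
    textless = sum(1 for p in files if len(extract_text(p).strip()) < min_chars)
    return textless / len(files)

print(f"{textless_fraction('./collection'):.0%} of files yield no usable text")
```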

Calculating MTV Ratio and True Recall. Many tools designed to search or classify documents as part of the enterprise content management and electronic discovery functions in organizations depend on having accurate textual representations of the documents being analyzed or indexed. They have text-tunnel vision – they cannot “see” non-textual objects. If the only documents of […]
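
The post’s own definitions are truncated above, so the following is one hedged reading of the terms, not necessarily the author’s formulas: if the MTV ratio is the fraction of a collection a text-only tool cannot “see,” and relevant documents are assumed to be spread evenly across textual and non-textual files, then the recall the tool reports over the textual subset overstates recall against the whole collection.

```python
def true_recall(reported_recall: float, mtv_ratio: float) -> float:
    """Our interpretation, not the post's formula: documents the tool
    cannot 'see' are never found, so reported recall is scaled down
    by the textual fraction of the collection."""
    return reported_recall * (1 - mtv_ratio)

# Hypothetical collection: 40% of files lack usable text (MTV ratio 0.4)
# and the tool reports 80% recall over the textual subset it can index.
print(f"true recall: {true_recall(0.80, 0.40):.0%}")  # 48% of all relevant documents
```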

The Emperor Has No Clothes – and PC Can’t See Image-Only Documents. There are several parallels between predictive coding (AKA technology-assisted review) and Hans Christian Andersen’s tale, “The Emperor’s New Clothes.” In the story, two weavers tell the emperor they will make him a suit of clothes that will be invisible to those people […]
