Simple text-search architecture – where every non-noise word of every document is indexed – doesn’t work well at enterprise scale. This approach consumes considerable IT resources and, from an end-user perspective, returns large numbers of irrelevant results. It may work on small personal collections where it’s not too burdensome to wade […]
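
A minimal sketch of the architecture the excerpt describes: a toy inverted index that records every non-noise word of every document. The stopword list and sample documents are hypothetical; the point is that the index grows with every distinct word, and a one-word query matches every document containing that word, relevant or not.

```python
from collections import defaultdict

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in"}  # hypothetical noise-word list

def build_index(docs):
    """Map every non-noise word to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            if word not in STOPWORDS:
                index[word].add(doc_id)
    return index

docs = {
    1: "quarterly report of widget sales",
    2: "widget assembly safety manual",
    3: "notes on the annual widget picnic",
}
index = build_index(docs)
# A one-word search returns every document containing the word,
# regardless of relevance -- the behavior the excerpt criticizes.
print(index["widget"])  # {1, 2, 3}
```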

Simpson’s Paradox is a statistical brain teaser that offers lessons for text analytics and for choosing the best tools to work with enterprise content. The “paradox” is that trends that seem apparent when data are analyzed as separate groups can reverse or disappear when the groups are combined. An example of Simpson’s Paradox […]
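
A worked illustration of the reversal, using the kidney-stone treatment numbers often cited to demonstrate the paradox: each treatment’s success rate is computed per group and then pooled, and the pooled comparison flips.

```python
# (successes, attempts) per treatment, split by stone size.
data = {
    "Treatment A": {"small": (81, 87),  "large": (192, 263)},
    "Treatment B": {"small": (234, 270), "large": (55, 80)},
}

for name, groups in data.items():
    total_s = sum(s for s, n in groups.values())
    total_n = sum(n for s, n in groups.values())
    per_group = ", ".join(f"{g}: {s / n:.0%}" for g, (s, n) in groups.items())
    print(f"{name}: {per_group}, combined: {total_s / total_n:.0%}")

# Treatment A wins within BOTH groups (93% vs 87% on small stones,
# 73% vs 69% on large ones), yet Treatment B wins when the groups are
# combined (83% vs 78%), because each treatment saw a very different
# mix of easy and hard cases.
```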

Most file auto-classification systems rely on the presence of accurate textual representations of the files being classified. Organizations that use those auto-classification systems need to be aware of several problems with a text-reliant approach: Ignoring Non-Textual Files. Many files have no text associated with them, e.g., files output as PDF or TIF from user software or captured as […]
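
A minimal sketch of one way to flag the problem the excerpt raises, using the open-source pypdf library (an assumption; the post does not name a tool): a PDF whose pages yield no extractable text is likely image-only and invisible to a text-reliant classifier.

```python
from pypdf import PdfReader

def has_text_layer(pdf_path, min_chars=25):
    """Return True if any page yields more than min_chars of extractable text.

    min_chars is an arbitrary threshold to ignore stray artifacts.
    """
    reader = PdfReader(pdf_path)
    for page in reader.pages:
        text = page.extract_text() or ""
        if len(text.strip()) > min_chars:
            return True
    return False

# Files that fail this check would be ignored by a text-reliant classifier.
if not has_text_layer("scanned_contract.pdf"):  # hypothetical file
    print("image-only PDF: no usable text for classification")
```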

Whack-a-mole describes a situation in which attempts to solve a problem are piecemeal or superficial, resulting only in temporary or minor improvement, as in, “the site’s security team has an ongoing battle against spammers, but it’s a game of whack-a-mole.” See Oxford Dictionaries. The whack-a-mole concept is familiar to those attempting to classify documents using […]

Summary: Technology Assisted Review (“TAR”) and visual classification take two different approaches to classifying documents. TAR uses the text associated with the documents being classified, while visual classification bases its analysis on graphical representations of those documents. TAR is an outgrowth of tools […]
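
A schematic contrast of the two feature spaces the excerpt describes, with hypothetical helpers standing in for real pipelines: TAR-style tools derive features from a document’s words, while visual classification derives them from a rendered image of the page.

```python
from PIL import Image
from sklearn.feature_extraction.text import TfidfVectorizer

def text_features(documents):
    """TAR-style representation: one TF-IDF vector per document's text.

    A document with no text layer contributes an empty, useless row.
    """
    return TfidfVectorizer().fit_transform(documents)

def visual_features(image_path, size=(32, 32)):
    """Visual-classification-style representation: a normalized pixel
    vector from a rendered page image, independent of any text layer."""
    page = Image.open(image_path).convert("L").resize(size)
    return [p / 255.0 for p in page.getdata()]

# TAR sees only the words...
tfidf = text_features(["invoice for widget order", "meeting agenda"])
# ...while visual classification sees the page itself (hypothetical file).
pixels = visual_features("page_001.png")
```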

The Grossman-Cormack article, “Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery,” has kicked off some useful discussions. Here are our comments on two blog posts about the article, one by Ralph Losey and the other by John Tredennick and Mark Noel. Losey: The Text Streetlight. Ralph Losey made an interesting point in his July 6, […]

There has been ongoing debate in information governance and e-discovery circles about the significance of documents that lack searchable text. There is evidence that half or more of the documents in some collections cannot be analyzed or managed because the tools used for those purposes require textual representations. How important is this limitation in […]
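
A rough sketch of how a “half or more” figure could be estimated for a collection, using file extensions as a crude proxy for whether a file carries searchable text. The extension lists and the path are assumptions, and ambiguous formats such as PDF would still need a per-file text check like the one sketched earlier.

```python
from pathlib import Path

# Crude proxies -- real collections need per-file inspection.
TEXT_BEARING = {".txt", ".docx", ".html", ".eml", ".csv"}
IMAGE_ONLY = {".tif", ".tiff", ".png", ".jpg", ".jpeg"}

def no_text_share(root):
    """Fraction of files in a collection that likely lack searchable text."""
    counts = {"text": 0, "image": 0, "unknown": 0}
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        ext = path.suffix.lower()
        if ext in TEXT_BEARING:
            counts["text"] += 1
        elif ext in IMAGE_ONLY:
            counts["image"] += 1
        else:
            counts["unknown"] += 1  # e.g., PDFs: could be either
    total = sum(counts.values())
    return counts["image"] / total if total else 0.0

print(f"{no_text_share('/data/collection'):.0%} of files lack searchable text")
```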

Calculating MTV Ratio and True Recall. Many tools designed to search or classify documents as part of the enterprise content management and electronic discovery functions in organizations depend on having accurate textual representations of the documents being analyzed or indexed. They have text-tunnel vision – they cannot “see” non-textual objects. If the only documents of […]
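
The excerpt is truncated before the definitions, so what follows is only a sketch under stated assumptions: take the MTV ratio to be the fraction of a collection that text-reliant tools cannot “see,” and assume relevant documents are spread proportionally across text-bearing and non-text files. True recall then discounts the recall reported on the visible subset.

```python
def true_recall(reported_recall, docs_without_text, total_docs):
    """Sketch of an MTV-adjusted recall, under the assumptions above.

    reported_recall: recall measured on the text-visible subset only.
    """
    mtv_ratio = docs_without_text / total_docs
    # Non-text documents contribute zero recall to a text-only tool.
    return reported_recall * (1 - mtv_ratio)

# A tool reporting 90% recall on a collection where 40% of documents
# have no usable text actually finds only about 54% overall.
print(f"{true_recall(0.90, docs_without_text=400, total_docs=1000):.0%}")  # 54%
```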

The Emperor Has No Clothes – and PC Can’t See Image-Only Documents. There are several parallels between predictive coding (AKA technology assisted review) and Hans Christian Andersen’s tale, “The Emperor’s New Clothes.” In the story, two weavers tell the emperor they will make him a suit of clothes that will be invisible to those people […]
