Selection bias occurs when data are selected for analysis in such a way that not all objects being evaluated are equally likely to be chosen. The result is a sample that is not representative of the entire population. An extreme example would be predicting a presidential race by sampling only New York City or Los Angeles, or predicting all voter behavior by analyzing only raw social media posts.
Similarly, selection bias can be a significant problem in e-discovery when text-search software is used to select files and text analytics is used to evaluate them. Although text-based tools have long been available and are widely used, relying on them can be like looking for lost keys under a streetlight because that's where the light is, even if the keys were lost somewhere else.
Text-based selection and text-based analysis can both introduce selection bias. These biases are not apt to be random; they tend to recur in particular situations, e.g., when files are created by specific processes or originate from particular departments, functions, or vendors.
There are at least three dimensions to text-related selection biases:
- Non-Extracted Text. Some files have no text associated with them, e.g., image-only PDFs or TIFs.
- Unconverted Embedded Graphics. Other files can contain both textual characters and embedded graphics that display textual information, but the embedded graphics don’t get converted to text.
- Relevant Terms Not in Recognized Grammatical Units. Files can contain relevant text, but the terms may appear in grammatical units that are ignored by the analytical software.
Many files are saved in image-only formats. These image-only files are invisible to text-based key-term selection software, text analytics, and predictive coding engines unless they are successfully converted to text. Here is an example of such files taken from a recent 400K file collection:
Files without Associated Text from 400K File Collection
As the table shows, over 25% of all the files were image-only and would not have been selected or analyzed by text-based software. The files came from an oil & gas company, and many of the JPG and PNG images were screen captures of important analyses taken from large, high-resolution geophysical workstations.
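To make the exposure concrete, a quick audit can estimate how much of a collection would be invisible to text-based tools. The following is a minimal Python sketch: the `text_coverage` helper, the filenames, and the sample contents are all hypothetical, and a real audit would read extraction results from the processing platform rather than an in-memory dict.

```python
from collections import Counter

def text_coverage(extracted):
    """Given a mapping of filename -> extracted text (possibly empty),
    report the share of files with no extractable text, broken out
    by file extension. Such files are invisible to text-based tools."""
    no_text = [name for name, text in extracted.items() if not text.strip()]
    by_ext = Counter(name.rsplit(".", 1)[-1].lower() for name in no_text)
    share = len(no_text) / len(extracted) if extracted else 0.0
    return share, by_ext

# Hypothetical mini-collection: two screen captures with no text layer,
# one searchable PDF, one email.
sample = {
    "well_log.png": "",
    "seismic_view.jpg": "",
    "contract.pdf": "This agreement is made...",
    "status.eml": "Drilling resumed on pad 7.",
}
share, by_ext = text_coverage(sample)
print(f"{share:.0%} of files have no extractable text: {dict(by_ext)}")
# -> 50% of files have no extractable text: {'png': 1, 'jpg': 1}
```

In a real matter the mapping would come from the extraction step of the processing pipeline, but the arithmetic is the same: every file in the no-text bucket is a file that key-term search and text analytics will silently skip.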
Mixed Text & Image Files
Data can be presented in documents either as text characters or as embedded images. For example, the table shown above uses HTML code to present its information as text characters; if you click and drag on it, you can copy the text into a Word document and edit it. The table shown below presents the same information as an embedded image. It can't be edited in a text editor because it isn't text, and the terms in the image won't be found by text-search software or evaluated by text analytics engines.
Relevant Text Not in Recognized Grammatical Units
Some files contain textual terms that cause them to be collected for e-discovery because the text-search software used for collection detects those terms. The text analytics or predictive coding software used to classify the files, however, may still ignore those terms if they do not appear in the grammatical units the software recognizes, e.g., if they are not used in sentences. Excluding terms that fall outside the preferred grammatical unit can cause the omission of lists, presentations, or documents consisting primarily of large data tables.
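The effect can be illustrated with a toy example. The sketch below uses a deliberately naive regex-based sentence splitter to mimic an engine that evaluates only sentence-like units; the `sentence_terms` function and the sample document are hypothetical illustrations, not the behavior of any particular product.

```python
import re

def sentence_terms(text):
    """Naive sketch: keep only terms that appear inside sentence-like
    units (capitalized runs ending in . ! or ?), mimicking an analytics
    engine that ignores non-sentence text such as lists and table cells."""
    sentences = re.findall(r"[A-Z][^.!?\n]*[.!?]", text)
    terms = set()
    for s in sentences:
        terms.update(w.lower().strip(".,;") for w in s.split())
    return terms

doc = """The quarterly report is attached.
Key figures:
- benzene exposure 4.2 ppm
- settlement reserve $2.1M
"""
terms = sentence_terms(doc)
print("benzene" in terms)  # -> False: the list item never entered a sentence
print("report" in terms)   # -> True: it appeared inside a full sentence
```

A keyword search for "benzene" would have collected this document, yet a sentence-oriented classifier never sees the term, which is exactly the gap between text-based collection and text-based classification described above.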
Evaluating Responsiveness Claims
Claims about the percentage of responsive files produced need to be discounted by the percentage of files NOT considered. For example, if a party claims it produced 70% of all potentially responsive files but considered only 70% of the organization's files, it may have produced only 49% (i.e., 70% of 70%). And if the excluded files were proportionately more responsive than the files considered, the actual percentage of responsive files produced may be substantially lower than 49%.
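The compounding is simple multiplication, as this short sketch shows (the `effective_rate` function name is ours, used only for illustration):

```python
def effective_rate(share_considered, rate_within_considered):
    """Overall share of responsive files produced, assuming the excluded
    files are at least as responsive as the considered ones. If they are
    MORE responsive, the true overall rate is lower still."""
    return share_considered * rate_within_considered

# 70% of files considered, 70% production rate within that subset:
print(f"{effective_rate(0.70, 0.70):.0%}")  # -> 49%
```

The point is that the headline 70% figure describes only the searched subset; the 30% of files never considered silently drags the true production rate down.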
Visual classification avoids these text-bias issues because it classifies files by their visual representations, NOT by their associated text values. In essence, the technology shines more light on more kinds of files and classifies consistently across file types, whether or not they contain text layers. For example, visual classification correctly and consistently classified many of the JPG and PNG screen captures shown in the tables above.
For more information about managing unstructured content, download a free copy of the Guide to Managing Unstructured Content: practical advice on gaining control of unstructured content.