How organizations deal with outliers, those data points that occur where they’re not expected, provides useful insight into the culture and data maturity of those organizations. Outliers in simple frequency graphs could be blips at the extreme ends of the normal curve. In e-discovery, outliers can be documents flagged by analytics software as non-responsive that upon review are considered responsive. They could also be files flagged as responsive that upon review are considered non-responsive.
One response to outliers is to shrug shoulders and say something like, “This is expected with statistics, we handle a lot of data so we see a lot of what might be called anomalous data points.” Shoulder-shrug responses can be a sign of overwork, lack of appropriate tools, or contentedness with current results.
Another type of response is to try to learn from the outliers, to drill down into possible explanations either for specific outliers or for more systemic issues that may be causing them.
When examining frequency-graph outliers, users can try to obtain more information about the characteristics of the outliers. For example, if the graph showed mortality data for men born during a specific year and a small group appeared to have unusually early dates of death, users might want to examine the characteristics of those outliers to find possible explanations, e.g., weight, tobacco or alcohol usage, exercise levels, family medical history, or exposure to toxic chemicals or asbestos.
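The drill-down described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the `age_at_death` and `smoker` field names, and the cutoff of two standard deviations are all hypothetical, and a real analysis would use far more data and more careful statistics.

```python
from statistics import mean, pstdev

def flag_low_outliers(values, k=2.0):
    """Return indices of values more than k standard deviations below the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, so nothing stands out
    return [i for i, v in enumerate(values) if (v - mu) / sigma < -k]

def attribute_rate(records, indices, attribute):
    """Share of the selected records for which a boolean attribute is true."""
    if not indices:
        return 0.0
    return sum(1 for i in indices if records[i][attribute]) / len(indices)

# Hypothetical records: age at death plus one candidate explanatory attribute.
ages = [78, 81, 75, 80, 77, 79, 82, 76, 74, 83, 41, 79]
smokers = {3, 10}  # positions of the hypothetical smokers
records = [
    {"age_at_death": age, "smoker": i in smokers}
    for i, age in enumerate(ages)
]

outliers = flag_low_outliers([r["age_at_death"] for r in records])
others = [i for i in range(len(records)) if i not in outliers]

# Compare how common the attribute is among outliers versus everyone else.
outlier_smoker_rate = attribute_rate(records, outliers, "smoker")
other_smoker_rate = attribute_rate(records, others, "smoker")
print(outliers, outlier_smoker_rate, other_smoker_rate)
```

A large gap between the two rates does not prove a cause, but it tells users which characteristic is worth a closer look.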
In examining e-discovery outliers, there are several things that managers ought to ask:
- Why Responsive. Are there specific attributes that make the documents flagged by analytics software as non-responsive actually responsive? E.g., signatures, checkmarks, stamps, routing information, dates?
- Finding Other Like Documents. Can the system programmatically check other documents of the same type for the attributes that made the outliers responsive? Or, a more basic question, can the system identify other documents of the same type as the outlier regardless of specific attributes? If the analytics engine is basically a black box, users may be left trying to guess why other similar files weren’t found. There are other possible causes: Were the other files embedded in larger documents? Were they in languages different from that of the newly discovered outlier? Are there critical OCR errors? Do some of the files have embedded graphics that don’t get converted to textual characters?
- Biases Excluding Other Documents of the Same Document Type from the Collection. Are there reasons why more documents of the same document type were not included within the collection being analyzed? For example, an earlier posting on selection bias discussed how keyword selection and text-restricted analytics can systematically exclude image-only or limited-text files (see: https://beyondrecognition.net/selection-bias-e-discovery/). Sometimes image-only files can slip through, e.g., as attachments to emails. Lacking text, they will be machine-classified as non-responsive or “other.” Even when they are identified as responsive by a manual review, there may be no way to specifically identify other like files.
- Quasi-Outliers: “Other” Files. Although not normally thought of as “outliers,” the “other” files that the analytics software cannot classify as either responsive or non-responsive are in a sense outliers, in that they don’t fall into the desired classifications. They can represent a significant percentage of all files requiring review, and system managers ought to ask questions like those posed above about why they can’t be handled programmatically. For example, is there a confirmation bias that implicitly expects that image-only or poor-text documents will have to be treated as exceptions or, worse yet, won’t get processed at all? On the latter point, see my earlier blog posting: https://beyondrecognition.net/eliminating-confirmation-bias-file-classification-ecm/. Would the rest of the organization be happy with a production system that could only produce 70-80% of the required items?
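As a rough illustration of how a manager might start answering these questions, the Python sketch below compares machine labels with review labels and tallies the disagreements by document attributes. Everything in it is assumed for the example: the field names (`doc_type`, `has_text`, `machine_label`, `review_label`), the label values, and the sample documents are hypothetical and do not come from any particular analytics product.

```python
from collections import Counter

def find_disagreements(docs):
    """Documents with a definite machine label that the reviewer overturned."""
    return [
        d for d in docs
        if d["machine_label"] != "other"
        and d["machine_label"] != d["review_label"]
    ]

def tally_by(docs, field):
    """Count documents by the value of one attribute, e.g. doc_type."""
    return Counter(d[field] for d in docs)

def make_doc(doc_id, doc_type, has_text, machine_label, review_label):
    """Build one hypothetical review-log entry."""
    return {
        "id": doc_id, "doc_type": doc_type, "has_text": has_text,
        "machine_label": machine_label, "review_label": review_label,
    }

# Hypothetical review log; a real one would be exported from the review platform.
docs = [
    make_doc("D1", "email", True, "responsive", "responsive"),
    make_doc("D2", "invoice", False, "non-responsive", "responsive"),
    make_doc("D3", "invoice", False, "other", "responsive"),
    make_doc("D4", "memo", True, "responsive", "non-responsive"),
    make_doc("D5", "email", True, "non-responsive", "non-responsive"),
    make_doc("D6", "invoice", False, "non-responsive", "responsive"),
]

disagreements = find_disagreements(docs)
by_type = tally_by(disagreements, "doc_type")
by_text = tally_by(disagreements, "has_text")

# Share of files the software could not place in either class.
other_share = sum(d["machine_label"] == "other" for d in docs) / len(docs)

print([d["id"] for d in disagreements], by_type, by_text, other_share)
```

If the disagreements cluster in one document type, or among files without text, that points to a systemic cause rather than random statistical noise.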
For more information on managing unstructured content, download my new e-book, Guide to Managing Unstructured Content. It includes a discussion on how visual classification overcomes many of the problems of dealing with outliers discussed in this post.