Implicit biases – those that we form and use without explicit consideration – can wreak havoc on efforts to achieve critical goals. One such bias, confirmation bias, is especially damaging when designing file classification systems. It is the
“…tendency to search for, interpret, favor, and recall information in a way that confirms one’s preexisting beliefs or hypotheses, while giving disproportionately less consideration to alternative possibilities.” *1/
Examining the beliefs and hypotheses underlying confirmation bias can help eliminate many of its dangers. Consider the normal thought process for setting up file classification systems:
- Goals: We need to be able to manage all the files that our users generate or receive over time. We want to be able to assign appropriate retention schedules, restrict access to sensitive content, and permit our users to search for and find the files and documents they need for their jobs.
- Methodology: Rather than relying on user-assigned classifications we can use the words that appear in the files as a way to classify and retrieve them. Words that uniquely identify specific types of files can be arranged in taxonomies to aid classification and retrieval; and scripts, rules or analytics software can be developed to classify document types by analyzing the associated text.
- Validation: We can select files from within classifications and confirm whether the classification is appropriate for them. We can also randomly select files from the search results and determine if those files were appropriately classified.
Here are the beliefs or hypotheses that underlie the thought process about file classification:
- The files we want to manage all have text that can be used by the analytics and search software used for enterprise content management (ECM).
- The text associated with files is the best, and really the only, way to classify and search them.
- Text-based taxonomies are able to consistently identify which classifications should be assigned to specific files based on the words that appear in them.
- Text-based scripts, rules, or analytics software is able to consistently assign the correct classification to files at the scale and volume required for our organization.
- Examining files within classifications provides accurate validation or quality control of the process.
Here is information that challenges the beliefs or hypotheses underlying those biases:
- Not all files have associated text that can be used for classification and retrieval. PDFs and TIF files are often image-only, and image-only files are invisible in text-based approaches. In some collections image-only files account for 30-40% or more of all files. Do you know what percentage of your files are non-textual?
- The text associated with individual files is useful for many purposes but it is essentially one-dimensional – words are either before or after other words. Text alone lacks the richness of the visual appearance of files, which includes additional attributes that help differentiate among classifications, such as layout and the size and placement of graphical elements. A visual classification approach handles far more files, far more consistently, than text-based approaches and scales to the largest collections, with service level agreements guaranteeing above 99% accuracy.
- Taxonomies don’t work for non-textual files and are problematic for many files including foreign language documents and OCR’d files. Taxonomies are expensive to develop and maintain, and misclassify many files.
- Text-based scripts, rules, and algorithms suffer the same limitations as taxonomies, and often do not scale well into the hundreds of millions or billions of files encountered in large organizations.
- QC of text-based systems normally doesn’t examine non-textual files thereby yielding misleading quality statistics. Examining only files within specific classifications won’t show the false negatives – the files that incorrectly received other classifications – and won’t show where there are overlapping classifications, i.e., different classifications that are assigned to essentially the same type of files.
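To make the QC point concrete, here is a minimal Python sketch of how a review limited to files inside a classification can look perfect while many files of that type sit elsewhere. It assumes a random sample drawn from the whole collection, including image-only files, with each file labeled by a human reviewer. The `SampledFile` record, its field names, the labels, and the ten sample rows are all hypothetical and made up for illustration; they are not measurements from any real collection.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class SampledFile:
    """One file from a random sample of the whole collection (hypothetical record)."""
    has_text: bool           # False for image-only files (e.g. scanned PDFs or TIFs)
    assigned: Optional[str]  # label the text-based system gave it; None if it gave none
    actual: str              # label a human reviewer gave it


def non_textual_share(sample: List[SampledFile]) -> float:
    """Fraction of sampled files that have no usable text."""
    return sum(1 for f in sample if not f.has_text) / len(sample)


def precision_and_recall(
    sample: List[SampledFile], label: str
) -> Tuple[Optional[float], Optional[float]]:
    """Precision is what an in-class review measures; recall exposes false negatives."""
    in_class = [f for f in sample if f.assigned == label]
    truly_label = [f for f in sample if f.actual == label]
    correct = sum(1 for f in in_class if f.actual == label)
    precision = correct / len(in_class) if in_class else None
    recall = correct / len(truly_label) if truly_label else None
    return precision, recall


# Made-up illustrative sample of ten files, not real data.
sample = (
    [SampledFile(True, "invoice", "invoice")] * 3      # correctly classified invoices
    + [SampledFile(False, None, "invoice")] * 2        # image-only invoices, never classified
    + [SampledFile(True, "contract", "invoice")]       # invoice misfiled as a contract
    + [SampledFile(True, "contract", "contract")] * 2  # correctly classified contracts
    + [SampledFile(False, None, "contract")] * 2       # image-only contracts, never classified
)

share = non_textual_share(sample)
precision, recall = precision_and_recall(sample, "invoice")
print(f"non-textual share: {share:.0%}")
print(f"invoice precision: {precision:.0%}, invoice recall: {recall:.0%}")
```

In this made-up sample, every file inside the "invoice" classification really is an invoice, so an in-class review reports 100% precision. Yet only half of the actual invoices were found, because the rest were image-only or received another classification. Sampling across the whole collection, rather than within classifications, is what surfaces that gap.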
The executive summary of this post is that decision-makers whose confirmation bias leads them to consider only traditional text-based solutions for file classification will miss out on the significant advantages of visual classification.
To reserve your copy of Guide to Managing Unstructured Content, Practical Advice on Gaining Control of Unstructured Content, please visit http://beyondrecognition.net/guide-to-managing-unstructured-content/