The usual approach to classifying files or documents in an enterprise collection of unstructured content is top-down: determine what the classifications should be, and then write rules or scripts that place individual files in those predetermined classifications. This presupposes a comprehensive knowledge of what’s in a collection and what attributes can be used to differentiate individual files into meaningful classifications.
In his book, Everything is Miscellaneous,* David Weinberger suggests another approach for categorizing things: start with undisputed examples or prototypes without worrying about having clear definitions. He observes that this is the way that people are accustomed to using categorization:
“The biological aim of categorization [is] dealing rapidly with an ever-changing environment by assimilating the new to the already established.” [p. 179, Kindle location 2995]
As he states later:
“We can stipulate a definition that will work at least pretty well, even though it is arbitrary and artificial. But that’s not what experience looks like. First comes a hands-on, body-and-soul roughhouse of organization built on multifaceted resemblances to clear examples. Lines come later, and only when we’re forced to draw them.” [p. 188, Kindle location 3154]
The obvious question then is how to select the prototypes or examples. One answer for enterprise content collections is visual classification, a technology which groups visually similar files. There is no human effort involved in trying to associate visually similar things: they’re already grouped, and the technology can present the most representative files. When reviewing the groups or clusters, content managers can be sure that no major category of files was overlooked.
One of the unique values of visual classification is that it considers layout and the size of graphical elements in its grouping of visually similar objects. Nobody needs to try to explicitly define the various file facets or attributes that cause like things to be grouped. This mimics the way humans group like things. In fact, Weinberger talks about how humans are excellent at reading multiple implicit cues, and gives the example of how things like the overall page layout, the size and placement of words, and the typefaces used all help us determine what a Parade article looks like. As he says,
“We grasp all of this without any of it being explicitly labeled, because it’s obvious from the implicit cues. We can read metadata before we learn to read.” [p. 149, Kindle location 2497]
After the visually similar groupings are formed, they are reviewed and given a classification from a two- or three-level classification tree that typically includes the business unit or function as the top level and the document type as the second level. If the organization later wants to restructure the classification tree, the grouping number remains associated with the files, so they can all be reclassified just by associating the grouping number with a different classification.
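To make the relabeling idea concrete, here is a minimal Python sketch of one way the grouping number could sit between files and the classification tree. Everything in it is an assumption made for illustration: the file names, the group numbers, the `Classification` type, and the helper functions are hypothetical and do not describe how any particular visual classification product stores its data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Classification:
    """A two-level label: business unit on top, document type below (hypothetical)."""
    business_unit: str
    document_type: str


# Hypothetical output of visual grouping: each file keeps one grouping number.
file_groups = {
    "scan_0001.tif": 17,
    "scan_0002.tif": 17,
    "scan_0003.pdf": 42,
}

# Labels are attached to grouping numbers, never to individual files.
group_labels = {
    17: Classification("Finance", "Invoice"),
    42: Classification("Human Resources", "Job Application"),
}


def classify(path: str) -> Classification:
    """Look up a file's label through its grouping number."""
    return group_labels[file_groups[path]]


def relabel_group(group_id: int, new_label: Classification) -> None:
    """Reclassify every file in a group by changing a single mapping entry."""
    group_labels[group_id] = new_label


# Restructuring the tree touches one entry; no file record is edited.
relabel_group(17, Classification("Accounts Payable", "Vendor Invoice"))
print(classify("scan_0001.tif"))
print(classify("scan_0002.tif"))
```

Because the files point to a grouping number rather than to a label, a change to the classification tree costs one update per group instead of one update per file.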
The resulting classification is meaningful because it was built from the bottom up, starting with groups of visually similar files that are labelled in an organizationally meaningful classification scheme. The data has informed the organization what groups of files need to be classified. This approach meets the goals for a classification scheme as described by Alan Stern, a planetary scientist at the Southwest Research Institute:
“I’m agnostic. I want the data to inform me. I want to have a classification scheme that illuminates.” [p. 38, Kindle location 798]
For more information on ECM classification, contact info@BeyondRecognition.net.
For further reading, see:
- The Four Key Dimensions of Purpose-Driven Data Quality: http://beyondrecognition.net/four-key-dimensions-purpose-driven-data-quality/
- Limitations of using OCR for File Classification: http://beyondrecognition.net/limitations-using-ocr-file-classification/
- Need More than Text Search for Unstructured Content: http://beyondrecognition.net/need-more-than-text-search-for-unstructured-content/
To receive a complimentary copy of Guide to Managing Unstructured Content, Practical Advice on Gaining Control of Unstructured Content, visit: http://beyondrecognition.net/guide-to-managing-unstructured-content/
*Everything is Miscellaneous, The Power of the New Digital Disorder, by David Weinberger, is available on Amazon at: https://www.amazon.com/Everything-Miscellaneous-Power-Digital-Disorder-ebook/dp/B000R7PUW4/#navbar