The three most important criteria by which to judge file or document classification and coding systems are
- Consistency,
- Consistency, &
- Consistency.
The reason is pretty obvious: without consistency, a file classification scheme cannot deliver any of its promised downstream benefits, such as enhanced retrievability, selection of appropriate retention schedules, and appropriate security access permissions (see graphic).
As long as file classifications are consistent, classification-dependent actions can be corrected. However, if classification errors are essentially random, the whole process may have to be redone to correct any problems.
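The difference between consistent and random errors can be illustrated with a minimal sketch. It assumes a reviewer has audited a sample of files and recorded, for each one, the label that was assigned and the label that should have been assigned; the function name and the sample data are hypothetical. When every file carrying a given assigned label shares one correct label, a single relabeling rule fixes the whole group. When the correct labels are scattered, no such rule exists.

```python
from collections import defaultdict

def bulk_fix_rules(audit):
    """Map each assigned label to its correct label when the audit shows
    exactly one correct label for it (a consistent, bulk-fixable pattern)."""
    correct_by_assigned = defaultdict(set)
    for assigned, correct in audit:
        correct_by_assigned[assigned].add(correct)
    return {
        assigned: next(iter(correct_set))
        for assigned, correct_set in correct_by_assigned.items()
        if len(correct_set) == 1
    }

# Hypothetical audit samples: (assigned label, label the reviewer says is correct).
consistent_audit = [
    ("Lease", "Lease"),
    ("Invoice", "Division Order"),
    ("Invoice", "Division Order"),
]
random_audit = [
    ("Invoice", "Division Order"),
    ("Invoice", "Lease"),
    ("Lease", "Invoice"),
    ("Lease", "Lease"),
]

print(bulk_fix_rules(consistent_audit))  # one rule per assigned label
print(bulk_fix_rules(random_audit))      # no rules: files need individual review
```

In the consistent case, the one rule for "Invoice" corrects every file with that label. In the random case the result is empty, which corresponds to redoing the classification.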
Let’s consider how initial file classification error rates impact downstream tasks of (a) coding or extracting document attributes, and (b) text recognition.
Coding or Attribution
File classification is often tied to document coding, so that certain attributes are extracted from certain document types. On leases, for example, organizations may want the lessor, the lessee, property description, term, and lease rate extracted and placed in certain fields.
The number and types of elements extracted will naturally vary with the document-type classification assigned. For example, in Oil & Gas there are Division Orders and Division of Interest documents. One may have nine fields to extract, the other five, with only three of them overlapping. As a consequence, if the wrong classification is applied, either:
- Six of the nine desired attributes will be missed, leading to a 67% error rate at the field attribution level, OR
- Two of the five desired attributes will be omitted, leading to a 40% error rate.
As can be seen, a relatively small error rate on initial file classification can have a cascading effect and lead to unacceptably high error rates for extracted attributes.
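The field counts above can be checked with a short set calculation. This is only a sketch: the field names are placeholders, since the text gives counts (nine fields, five fields, three shared) but not the actual field names, and it does not say which document type has which count.

```python
def missed_fraction(wanted_fields, extracted_fields):
    """Fraction of the wanted fields that the applied extraction template skips."""
    missed = wanted_fields - extracted_fields
    return len(missed) / len(wanted_fields)

# Placeholder field names; only the counts come from the text.
shared = {"shared_1", "shared_2", "shared_3"}
nine_field_type = shared | {f"nine_only_{i}" for i in range(1, 7)}
five_field_type = shared | {"five_only_1", "five_only_2"}

# A nine-field document processed with the five-field template.
print(f"{missed_fraction(nine_field_type, five_field_type):.0%}")
# A five-field document processed with the nine-field template.
print(f"{missed_fraction(five_field_type, nine_field_type):.0%}")
```

The first case misses six of nine fields and the second misses two of five. In the second case the nine-field template also tries to fill six fields that do not belong to the document, which this sketch does not count.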
Text Recognition or Conversion
File classification may determine the level of text recognition or conversion to be applied to files or documents. For example, if a particular type of document is created using a template where most of the text is boilerplate, the enterprise may decide to only track data that was entered on the form, but not all the terms from the boilerplate. This conserves conversion resources and avoids cluttering the ultimate content management or file retrieval system with what are essentially noise words.
When file classification assigns the wrong document type, it can introduce errors like those in the coding or attribution process: words that should have been converted are not, and some words that should not have been converted are. Either way, word-level accuracy can suffer greatly from the initial classification errors.
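Both kinds of conversion error can be sketched with a small policy table. Everything here is an illustrative assumption: the document-type names, the region names ("entered" for data typed into a form, "printed" for the rest of the text), and the idea that a policy is simply a list of regions to convert.

```python
# Hypothetical policy: which regions of a document get converted to text.
CONVERSION_POLICY = {
    "Template Form": ("entered",),       # skip the boilerplate
    "Letter": ("entered", "printed"),    # convert everything
}

def converted_words(doc_type, regions):
    """Words converted when the policy for doc_type is applied."""
    return {
        word
        for region_name in CONVERSION_POLICY[doc_type]
        for word in regions.get(region_name, [])
    }

def conversion_errors(true_type, assigned_type, regions):
    """Return (missed, noise): words wrongly skipped and words wrongly converted."""
    should_convert = converted_words(true_type, regions)
    did_convert = converted_words(assigned_type, regions)
    return should_convert - did_convert, did_convert - should_convert

letter = {"printed": ["please", "renew", "lease"]}
form = {"entered": ["Acme", "2024"], "printed": ["hereby", "agrees"]}

# A letter treated as a template form: its text is never converted.
print(conversion_errors("Letter", "Template Form", letter))
# A template form treated as a letter: boilerplate ends up in the index.
print(conversion_errors("Template Form", "Letter", form))
```

The first call returns only missed words and the second only noise words, matching the two failure modes described above.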
The key to classification is consistency, and one of the huge benefits of visual classification is that the initial grouping or clustering is objective in nature: no operator intervention is required or used. Once a document type classification is assigned, it is assigned consistently even for files subsequently analyzed, and if the classification needs to be changed, it can be changed consistently.
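One way to picture this behavior is to attach the document type to the cluster rather than to each file. The sketch below is not a description of any particular product; the class and method names are hypothetical, and the clustering step itself is assumed to have already produced a cluster id for every file.

```python
class ClusterLabels:
    """Stores one document-type label per cluster; files inherit it."""

    def __init__(self):
        self.label_of_cluster = {}
        self.cluster_of_file = {}

    def add_file(self, file_id, cluster_id):
        # The cluster id is assumed to come from an upstream clustering step.
        self.cluster_of_file[file_id] = cluster_id

    def set_label(self, cluster_id, label):
        # Assigning or changing a label touches the cluster, not the files.
        self.label_of_cluster[cluster_id] = label

    def label(self, file_id):
        return self.label_of_cluster.get(self.cluster_of_file[file_id])

labels = ClusterLabels()
labels.add_file("a.tif", cluster_id=7)
labels.add_file("b.tif", cluster_id=7)
labels.set_label(7, "Division of Interest")

# A file analyzed later lands in the same cluster and gets the same label.
labels.add_file("c.tif", cluster_id=7)
print(labels.label("c.tif"))

# Changing the classification once changes it for every file in the cluster.
labels.set_label(7, "Division Order")
print([labels.label(f) for f in ("a.tif", "b.tif", "c.tif")])
```

Because the label lives in one place, a later file cannot receive a different label than its cluster-mates, and a correction is a single update.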