Calculating MTV Ratio and True Recall
Many tools designed to search or classify documents as part of the enterprise content management and electronic discovery functions in organizations depend on having accurate textual representations of the documents being analyzed or indexed. They have text-tunnel vision – they cannot “see” non-textual objects.
If the only documents of interest were text-based, that could be an excusable shortcoming, depending on what tasks were being performed. However, there are some collections where as many as one-half of the documents of interest contain no textual representation. For example, in the oil & gas industry, engineering drawings are often converted to image-only PDFs for distribution and comment, and other documents such as printed well logs are challenging for many systems to convert to accurate text.
Instead of ignoring the text bias, corporations should quantify its severity by making a few basic calculations.
Maximum Textual Vision (“MTV”) Ratio is a percentage that indicates the extent to which a text-restricted technology can “see” the documents of interest. It is a best-case figure: the largest proportion of documents that a text-restricted technology could possibly classify or search. It is the number of documents with accurate text divided by the total number of documents – in the diagram below it is the green circle divided by the tan circle.
MTV = (Documents with Accurate Text) divided by (Total Documents in Entire Population)
The Total Documents in the Entire Population should include scanned TIF and PDF files, any CAD/CAM output files, and any other image-based documents. The number of documents with accurate text can be calculated by extracting text from files and counting the files that contain five or more standard stop words (words like “the,” “and,” “is”). Requiring stop words avoids counting a file as having text when the text is only stray characters or noise from the conversion process, or when the file does not have a useful amount or quality of text.
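The stop-word test and the MTV calculation can be sketched in a few lines of Python. This is a minimal illustration rather than a production method: the stop-word list is a short hypothetical sample (the article names only “the,” “and,” and “is”), the function names are invented for this example, and the sketch assumes that “five or more stop words” means five or more occurrences rather than five distinct words. Text extraction or OCR is assumed to have been done already.

```python
import re

# Illustrative stop-word list; an assumption, not a standard list.
STOP_WORDS = {"the", "and", "is", "of", "to", "in", "a", "that", "for", "it"}


def has_accurate_text(text, min_stop_words=5):
    """Return True if the extracted text has at least min_stop_words stop-word occurrences."""
    tokens = re.findall(r"[a-z]+", text.lower())
    hits = sum(1 for token in tokens if token in STOP_WORDS)
    return hits >= min_stop_words


def mtv_ratio(extracted_texts):
    """MTV = documents with accurate text / total documents in the population."""
    total = len(extracted_texts)
    if total == 0:
        raise ValueError("population is empty")
    with_text = sum(1 for text in extracted_texts if has_accurate_text(text))
    return with_text / total


# One string of extracted text per document in the population.
docs = [
    "The well log is attached and the report is in the folder.",   # usable text
    "",                                                            # image-only PDF
    "x7#q ~~ lll 0O0 |||",                                         # conversion noise
    "It is the intent of the parties that the drawing is final.",  # usable text
]
print(mtv_ratio(docs))  # 0.5
```

In this toy population two of the four documents pass the stop-word test, so the MTV Ratio is 50%.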
The MTV Ratio can be used to assess claims of recall, which is usually defined as the number of relevant documents identified by a process divided by the number of relevant documents actually in the collection. For example, if a text-restricted vendor claims 80% recall but the MTV Ratio of the collection is 50%, meaning only one-half of the documents had useful text, then the vendor is really claiming it got 80% of the documents of interest from one-half of the collection.
As another example, if a text-restricted ECM vendor claims it can classify 90% of a company’s documents, but the MTV Ratio is 70%, that means the vendor can classify at most 63% of the company’s documents (90% of 70%).
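The ECM arithmetic above is a single multiplication, shown here as a small sketch. The function name is hypothetical, and the sketch assumes the vendor's claimed rate applies only to the documents that have accurate text.

```python
def effective_rate(claimed_rate, mtv):
    """Best-case share of the whole collection a text-restricted tool can reach."""
    if not (0.0 <= claimed_rate <= 1.0 and 0.0 <= mtv <= 1.0):
        raise ValueError("rates must be between 0 and 1")
    return claimed_rate * mtv


# A vendor claims 90% classification; the collection's MTV Ratio is 70%.
print(round(effective_rate(0.90, 0.70), 2))  # 0.63
```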
The Impact of Non-Managed Objects – Estimating True Recall
What most organizations want to measure is the extent to which the textual tunnel vision of any technology could be impacting their operating metrics. To see the potential impact, first calculate the Non-Managed Objects (“NMO”), which is the total number of documents less the number of documents with accurate text.
NMO = (Total Documents in Collection) – (Documents with Accurate Text)
NMO can be used to identify the potential tunnel vision impact, because the number of documents of interest among the NMO documents can range anywhere from zero to all of them.
To illustrate, assume that a collection has 2 million documents, one million of which have text and one million of which do not. If there are 100,000 text-based documents of interest, and a text-restricted vendor locates 90,000 of them, the claimed recall rate is 90%. If there are no NMO documents of interest, the claimed recall rate is actually the true recall rate. However, in the extreme case where all the NMO documents were of interest, the total documents of interest would be 1,100,000 and the true recall would be 90,000/1,100,000 or a little over 8% – less than 10% of the claimed recall.
If the proportion of NMO documents that are actually of interest is the same as for the text-based documents (i.e., 10%), the true recall would be 90,000/200,000 or 45%.
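The three scenarios in this illustration (no NMO documents of interest, 10% of interest, all of interest) can be reproduced with a short sketch. The function and parameter names are hypothetical; the share of NMO documents that are of interest is an input that would have to be estimated, for example by sampling.

```python
def true_recall(found, text_docs_of_interest, nmo, nmo_interest_rate):
    """Recall measured against all documents of interest, with or without text.

    nmo_interest_rate is the assumed share of NMO documents that are of interest.
    """
    nmo_of_interest = nmo * nmo_interest_rate
    return found / (text_docs_of_interest + nmo_of_interest)


total_docs = 2_000_000
docs_with_text = 1_000_000
nmo = total_docs - docs_with_text  # NMO = total documents - documents with accurate text

for rate in (0.0, 0.10, 1.0):
    recall = true_recall(90_000, 100_000, nmo, rate)
    print(f"NMO of interest: {rate:.0%} -> true recall: {recall:.1%}")
# NMO of interest: 0% -> true recall: 90.0%
# NMO of interest: 10% -> true recall: 45.0%
# NMO of interest: 100% -> true recall: 8.2%
```

The claimed 90% recall holds only at the first extreme; every NMO document of interest pushes the true figure lower.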
Organizations should evaluate their document collections to determine whether text-restricted technologies are leaving a large percentage of documents of interest unmanaged. Sampling non-text documents could provide guidance on this question, although text-only technology will most likely be unable to locate other documents like the relevant non-text documents found during the sampling.
Vendor-provided metrics like recall should be evaluated in light of the Maximum Textual Vision of such providers in the population under consideration. Finally, organizations should be aware that visual classification technology can classify all the documents the organization wants to manage, regardless of the presence or absence of text in them – they are no longer “stuck” with older text-based technology.
For related posts, see “Predictive” Coding and the Naked Emperor and Cloud-Based BPO: Document Classification Made Easy.