There has been ongoing debate in information governance and e-discovery circles on the significance of documents that do not contain searchable text, with evidence that half or more of the documents in some collections cannot be analyzed or managed because the tools used for those purposes require textual representations.

How important is this limitation in any given collection? Without sampling nobody knows – it’s all conjecture – and conjecture is a poor foundation for sound information governance.

Sampling provides a way to estimate the impact of using text-restricted document management tools in any given collection. The basic idea is to first identify the proportion of documents that do not have meaningful textual representations and then assess how significant they are. The first is an objective measure; the second is more subjective.

First Step: Calculating the Proportion of Documents without Meaningful Textual Representations

One critical point is that managers should not assume that they know what file types are used to store, distribute, or view their “documents” – documents are not restricted to just Word documents. Just about every computer file is at the very least a potential document. For example, CAD/CAM drawings are as much “documents” as printed engineering drawings, only more useful. The original native files and files created for view-only purposes all ought to be considered “documents.” The best approach is to sample all files and disregard specific file types only after having examined some of them.

There are two ways to quantify the proportion of documents lacking meaningful textual representation:

1. Using Textual Analysis. Depending on the sophistication of the text tools available, document managers could have those tools extract text from the documents being managed and then count how many of those documents have five or more “stop” words (words that occur so frequently that they are often not indexed, words like “is,” “are,” “and,” and “like”). The reason for requiring five or more stop words is to avoid counting documents for which the attempt to convert to text produced essentially gibberish, or for which an error message was generated indicating that no text was extracted.

The proportion of documents lacking textual representations is a fraction, with the numerator being the number of files without adequate textual representations and the denominator being the total number of files. Include all TIFF, PDF, EPS, and CAD/CAM drawings in counts for both the numerator and denominator as appropriate. If you’re relying on your text-restricted system to do the inventorying, double-check the totals using non-text-based software.
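The stop-word check and the resulting fraction can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop-word list is a small hypothetical one, the threshold counts stop-word occurrences rather than distinct stop words, and the function names `has_meaningful_text` and `proportion_without_text` are invented for this example. It also assumes text has already been extracted by whatever tool is in use.

```python
import re

# Hypothetical, deliberately short stop-word list; a real tool would use a longer one.
STOP_WORDS = {"is", "are", "and", "like", "the", "of", "to", "a", "in"}
MIN_STOP_WORDS = 5  # threshold suggested in the text


def has_meaningful_text(extracted_text, min_stop_words=MIN_STOP_WORDS):
    """Return True if the text has at least min_stop_words stop-word occurrences."""
    tokens = re.findall(r"[a-z']+", extracted_text.lower())
    count = sum(1 for token in tokens if token in STOP_WORDS)
    return count >= min_stop_words


def proportion_without_text(extracted_by_file):
    """Fraction of files whose extracted text fails the stop-word check.

    extracted_by_file maps a file name to its extracted text ('' if none).
    The denominator is every file in the mapping, whatever its type.
    """
    total = len(extracted_by_file)
    if total == 0:
        return 0.0
    lacking = sum(
        1 for text in extracted_by_file.values() if not has_meaningful_text(text)
    )
    return lacking / total


# Made-up extraction results for four files.
sample = {
    "memo.docx": "The meeting is on Monday and the minutes are in the shared folder.",
    "scan.tif": "",
    "drawing.pdf": "x7#q 0O0 ||| zz",
    "error.eps": "Error: no text extracted",
}
print(proportion_without_text(sample))
```

In this made-up sample only the memo passes the check, so three of the four files count toward the numerator. The empty scan, the gibberish, and the error message are all caught by the same threshold.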

2. Sampling All Documents. A second approach to calculating the proportion of documents without meaningful textual representations is to sample all documents in a collection, including all TIFF, PDF, EPS, and CAD/CAM drawings, and then see if they can be located by the text tools in use. The distribution of documents without text will not be random: some file types will have a far greater proportion of non-text-manageable documents than others, so be sure to sample within file types. For example, in one file share remediation project BR found that 80% of the PDF files had no text layers. There may be other file types, e.g., some Excel files, that particular text indexing tools fail to index properly. Again, without sampling it’s all conjecture.

Another suggestion is to sort the documents within file types by date so that documents created by different versions of software will be generally grouped together. As indicated above, the distribution of files that will not be accurately indexed by text-restricted systems is not random, so conduct your sampling to find specific problems.
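One way to combine both suggestions, sampling within file types and sorting by date, is sketched below. It is a minimal illustration rather than a prescribed procedure. The function name `stratified_sample`, the `(path, file_type, modified_date)` tuple layout, and the choice to draw one file from each contiguous date band are all assumptions made for this example.

```python
import random
from collections import defaultdict


def stratified_sample(files, per_type=5, seed=0):
    """Draw up to per_type files from each file type, spread across dates.

    files is a list of (path, file_type, modified_date) tuples; dates may be
    ISO strings or date objects, as long as they sort chronologically.
    Each type's files are sorted by date and split into contiguous bands,
    and one file is drawn at random from each band.
    """
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    by_type = defaultdict(list)
    for path, file_type, modified in files:
        by_type[file_type.lower()].append((modified, path))

    sample = {}
    for file_type, items in by_type.items():
        items.sort()  # oldest first, so each band covers one date range
        n = len(items)
        k = min(per_type, n)
        picks = []
        for i in range(k):
            start = i * n // k
            end = (i + 1) * n // k
            picks.append(items[rng.randrange(start, end)][1])
        sample[file_type] = picks
    return sample


# Made-up inventory: ten PDFs from different years and one spreadsheet.
inventory = [(f"a{i}.pdf", "PDF", f"{2010 + i}-01-01") for i in range(10)]
inventory.append(("b.xlsx", "xlsx", "2015-01-01"))
print(stratified_sample(inventory, per_type=5))
```

Drawing from date bands rather than from the whole group means that files produced by older and newer software versions each get a chance to appear in the sample, which is the point of sorting by date in the first place.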

Second Step: Assessing the Significance of Non-Textual Documents

Assessing the significance of non-textual documents is a more subjective endeavor. Basically, a subject matter expert (SME) reviews the documents identified as being non-textual and assesses how significant it is to the organization that its content tools can’t find or manage those objects. The SME would consider questions like:

  • What is the business risk of not being able to locate and use the information in these files?
  • Can the organization comply with discovery or regulatory investigations without being able to locate these documents?
  • If there are paper counterparts to these files, how can we classify and manage both the paper and electronic versions in the same way?

However you select or analyze the sample, the goal is to be able to create a summary analysis that looks something like this:

[Image: sampling results table]

The results of the sampling can guide whether further action is required to properly manage specific file types within a collection. Whatever the outcome, a modest level of sampling of documents in a collection will go a long way to answering the question of how significant non-textual documents are in a collection.
