Document attributes provide ways to find and navigate among documents of interest. In fact, one of the biggest challenges in e-discovery and content management is to assign the best classifications and attributes to documents in a collection. Here is a checklist of where to look for different types of attributes:
Within the Files Themselves, Either Visible or Hidden
- All Visible Text (AVT). This is the default retrieval mechanism for many collections. It does not provide for retrieval of graphical elements like logos and signatures, and it may or may not provide for retrieval of graphs, charts, or diagrams. It does not retrieve files that do not have text layers.
- Contained Hidden Metadata (CHM). Often files contain metadata that is not normally visible to the user, and this may be useful in retrieval, although it may also contain misleading information, e.g., “author” may be the person who created a template, not the person who actually created a document.
- Selected Visual Attributes (SVA). One of the most effective ways to find and navigate content is to classify files and then identify the attributes that help differentiate among members of the classification. For example, in a collection of direct mail pieces, the “boilerplate” or “form” part of the direct mail will remain constant with only the names and addresses of the recipients changing. The changing part of the documents can be extracted and placed in fields created for that data. Note that the differentiating attributes can also be graphical in nature.
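The direct mail example can be sketched in a few lines of Python. This is only an illustration, assuming the boilerplate is already known and the pieces are plain text; the function name `extract_variable_parts`, the placeholder words `NAME` and `ADDRESS`, and the sample letter are all hypothetical. It compares a piece against the template word by word and keeps only the runs that differ:

```python
import difflib


def extract_variable_parts(template: str, document: str) -> list[str]:
    """Return the runs of words in `document` that differ from `template`."""
    template_words = template.split()
    document_words = document.split()
    matcher = difflib.SequenceMatcher(
        a=template_words, b=document_words, autojunk=False
    )
    parts = []
    for tag, _a1, _a2, b1, b2 in matcher.get_opcodes():
        # "replace" and "insert" mark words present in the document
        # but not in the boilerplate at that position.
        if tag in ("replace", "insert"):
            parts.append(" ".join(document_words[b1:b2]))
    return parts


# Hypothetical boilerplate and one mail piece built from it.
template = "Dear NAME , your offer expires soon. Reply to ADDRESS today."
letter = "Dear Jane Doe , your offer expires soon. Reply to 12 Elm Street today."

fields = extract_variable_parts(template, letter)
```

Each extracted run could then be stored in a field created for it. Scanned or graphical pieces would need a different comparison step, which this sketch does not cover.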
Externally Maintained Attributes, i.e., Metadata Maintained Outside the Files
- Operating System Metadata (OSM). Operating systems typically track items that are not within the files themselves, e.g., file size, folder path and filename, and date last accessed.
- Content Management Attributes (CMA). ECM or CMS systems can be a rich source of attributes about an organization's documents. This can include the basic classification, all people editing a document, document purpose and audience, taxonomy terms applied, and folksonomy terms applied.
- Attribute Linking Lists (ALL). Some attributes will appear on lists of related attributes that can be used to further classify documents. For example, in the oil & gas industry, wells have API numbers that are cross-referenced to exact location, including state, county, and GPS coordinates, as well as date first drilled, date closed in, etc. Using attribute linking lists makes it possible to find documents based on the linked information, e.g., locating documents based on locations referenced.
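To make the attribute linking list idea concrete, here is a minimal Python sketch. It assumes the API numbers referenced in each document have already been identified. The `WellRecord` type, the function `find_documents_by_county`, and every value in the sample data are made up for illustration and do not describe real wells.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WellRecord:
    state: str
    county: str
    latitude: float
    longitude: float


# Hypothetical linking list keyed by API number (placeholder values only).
WELLS = {
    "00-000-00001": WellRecord("State A", "County X", 0.0, 0.0),
    "00-000-00002": WellRecord("State A", "County Y", 1.0, 1.0),
}

# Hypothetical collection: document name -> API numbers found in it.
DOCUMENTS = {
    "lease.pdf": {"00-000-00001"},
    "inspection.pdf": {"00-000-00002"},
    "invoice.pdf": {"00-000-00001", "00-000-00002"},
    "memo.pdf": set(),
}


def find_documents_by_county(documents, wells, county):
    """Return names of documents that reference a well in `county`."""
    hits = []
    for name, api_numbers in documents.items():
        # Unknown API numbers are skipped rather than treated as errors.
        linked = (wells[n] for n in api_numbers if n in wells)
        if any(record.county == county for record in linked):
            hits.append(name)
    return sorted(hits)


county_x_docs = find_documents_by_county(DOCUMENTS, WELLS, "County X")
```

None of the sample documents mentions a county by name, yet two of them can be located by county because the API number links them to it.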
Collection Inferred, i.e., Based on Analysis of Subsets of Documents in Collection
- Paper Box & Folder Info (PBF). Information contained on the labels of stored boxes or folders of information can be useful in finding and evaluating the pages that appear within those boxes or folders, and the box and folder labels are often scanned along with the underlying documents. To the extent there is a document archive system, the box number or folder label can also provide a way to associate archive system information with the contents.
- Cluster High-Value Variables (CHV). When files and documents are clustered visually, documents in some clusters will be essentially the same, with only some information changing, e.g., clusters of retail installment notes. It may be much more useful to just extract these high-value variables and use them for retrieving and differentiating among members of the cluster. Basically, the recurring content is negated or dropped for some purposes. Note that storing just the high-value variables does not require placing each data element in a separate field. Just identifying them can aid retrieval and permit associating family groups of documents even though there may be multiple document types in a family, e.g., to bring together all the loan documents for a specific borrower.
- Common Folder Attributes (CFA). When many of the documents in a paper or electronic folder reference or contain the same attribute, it may be appropriate to associate that attribute with the other documents, even if they don’t explicitly use the term associated with that attribute. For example, if most of the documents in a folder of oil & gas documents reference the same API number, it may be useful to associate that API number with equipment purchase documents in the folder, even if they don’t explicitly reference that number.
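The common folder attribute example can be sketched as follows. This is one possible reading, not a prescribed method: it assumes "most" means more than half of the documents in the folder, and the function name `propagate_folder_attribute`, the `min_share` parameter, and the sample folder contents are hypothetical.

```python
from collections import Counter
from typing import Optional


def propagate_folder_attribute(
    folder_docs: dict[str, Optional[str]], min_share: float = 0.5
) -> dict[str, Optional[str]]:
    """Fill in a missing attribute when most documents in a folder share it.

    `folder_docs` maps document name -> attribute value found in the
    document, or None when the document does not mention one.
    """
    explicit = [value for value in folder_docs.values() if value is not None]
    if not explicit:
        return dict(folder_docs)
    common_value, count = Counter(explicit).most_common(1)[0]
    # Assumption: only propagate when more than `min_share` of ALL
    # documents in the folder carry the same value.
    if count / len(folder_docs) <= min_share:
        return dict(folder_docs)
    return {
        name: value if value is not None else common_value
        for name, value in folder_docs.items()
    }


# Hypothetical folder: three documents cite the same API number,
# the equipment purchase order cites none.
folder = {
    "drilling_permit.pdf": "00-000-00001",
    "completion_report.pdf": "00-000-00001",
    "production_log.pdf": "00-000-00001",
    "purchase_order.pdf": None,
}

tagged = propagate_folder_attribute(folder)
```

In practice it may be worth recording which values were inferred from the folder and which were found in the document itself, so that reviewers can tell the two apart.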
There are undoubtedly other sources of attributes that are useful in locating or navigating among relevant documents, but hopefully this will provide food for thought on places to look.
- “E-Discovery: Using Folder Paths and Filenames to Create Folksonomies,” http://beyondrecognition.net/e-discovery-path-names-filenames-for-folksonomies/
- “Requesting Document Attributes from ECM/CMS Systems in e-Discovery,” http://beyondrecognition.net/requesting-document-attributes-ecm-cms/
For more information on managing unstructured content, go to the following link for a free personal-use download of the e-book, Guide to Managing Unstructured Content: http://beyondrecognition.net/download-john-martins-guide-to-managing-unstructured-content/