Sometimes profound implications emerge from thinking through direct observations, and from sampling to determine how widespread the observed conditions are. This is a story about the consequences of observing four Authorizations for Expenditure (“AFEs”) and a Daily Drilling Report (“DDR”) during a meeting with an energy client about file share remediation.
The thing was, the AFEs and the DDR were in one “document”: the four single-page AFEs came first, followed by the three-page DDR.
“No big deal,” you say? Consider these additional observations: the four AFEs were for four different wells, and the DDR pages related to yet a fifth well. What happens to users’ faith in the ECM system when they search for a well and pull up a document whose top page names a different well? If the indexing scheme contemplates one well name per document, taken from the first page, are the other six pages, relating to the other four wells, essentially invisible for purposes of fielded searching?
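The fielded-search problem can be made concrete with a small sketch. The document IDs, well names, and page layout below are hypothetical, but the indexing scheme is the one described above: one well field per document, populated from the first page only.

```python
# One physical "document": four single-page AFEs plus a three-page DDR,
# each page tagged with the well it actually concerns (hypothetical names).
pages = (
    ["Well-A", "Well-B", "Well-C", "Well-D"]  # the four AFE pages
    + ["Well-E"] * 3                          # the three DDR pages
)

# Indexing scheme: one well name per document, taken from the first page.
index = {"doc-001": {"well": pages[0]}}

def fielded_search(well_name):
    """Return the IDs of documents whose indexed well field matches."""
    return [doc_id for doc_id, fields in index.items()
            if fields["well"] == well_name]

print(fielded_search("Well-A"))  # ['doc-001'] -- found via the first page
print(fielded_search("Well-E"))  # [] -- the DDR's well is invisible
```

Six of the seven pages never make it into the well field, so fielded searches for four of the five wells return nothing.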
And think about the search technology itself. If the enterprise is using advanced text search, how does the search algorithm weight the terms that appear across the combined AFEs and the DDR pages? If documents are processed for review by a system that clusters like documents based on the text they contain, where does this document fall? With AFEs? With DDRs? Is this document neither fish nor fowl, or is it both fish and fowl? Conceptually superior technology may not fare so well with free-range documents.
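The “both fish and fowl” problem shows up directly in text similarity measures. Here is a minimal sketch using term counts and cosine similarity; the vocabularies are invented stand-ins for AFE-like and DDR-like language, not real form contents.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    terms = set(a) | set(b)
    dot = sum(a[t] * b[t] for t in terms)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

# Hypothetical vocabularies for the two document types.
afe = Counter("authorization expenditure cost approval".split())
ddr = Counter("drilling depth mud rotary".split())

# The combined AFE+DDR "document" contains both vocabularies.
mixed = afe + ddr

print(round(cosine(mixed, afe), 2))  # 0.71
print(round(cosine(mixed, ddr), 2))  # 0.71
```

The mixed document is equally similar to both prototypes, so a clustering engine has no principled basis for assigning it to either group, and its blended term statistics dilute relevance ranking for searches aimed at either document type.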
The other thing was, 76% of the collection had similar logical document boundary errors. This raises another implication: which retention schedule applies, the one for AFEs or the one for DDRs? With a 76% error rate, the organization might have to keep everything indefinitely, because document classification is clearly unreliable.
One lesson here is to sample your content for document boundary issues. In some cases, such as the transfer of documents following an oil & gas asset sale or purchase, there is a significant chance that each box of documents was treated as a single “document.” But even ordinary file shares contain a mixture of native electronic and scanned documents, with document boundary issues in both types.
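Sampling for boundary issues does not require reviewing the whole collection. A minimal sketch of the arithmetic, using a normal-approximation confidence interval for a proportion (the sample counts below are hypothetical, chosen to match the 76% figure):

```python
import math

def error_rate_ci(errors, sample_size, z=1.96):
    """Point estimate and ~95% normal-approximation confidence interval
    for the proportion of documents with boundary errors."""
    p = errors / sample_size
    margin = z * math.sqrt(p * (1 - p) / sample_size)
    return p, max(0.0, p - margin), min(1.0, p + margin)

# Hypothetical sample: 304 of 400 randomly selected documents had
# logical document boundary errors.
p, low, high = error_rate_ci(304, 400)
print(f"{p:.0%} (95% CI: {low:.0%} to {high:.0%})")  # 76% (95% CI: 72% to 80%)
```

A few hundred randomly selected documents is enough to establish whether boundary errors are rare or, as here, pervasive.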
To conclude the story, our client opted to use BR technology to assign logical document boundaries. BR’s visual classification technology learns what the pages that begin documents in the client’s collection look like (the BODs, or Beginnings of Documents). It does not base its analysis on the text associated with a page. Being text-independent, the classification and document boundary determinations work equally well on scanned non-text TIF or PDF images as on native electronic files.
Each logical document is then indexed separately, with attributes extracted using BR’s automated zonal extraction. The computer record for each logical document also contains a pointer back to the hash value of the original document, permitting users to determine what sequence of pages was in the physical document. So the second lesson is that there is a highly automated and reliable way to resolve widespread document unitization issues.
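The record structure described above can be sketched as follows. This is not BR’s implementation; it is a simplified illustration of splitting one physical file at detected beginning-of-document pages while keeping a hash pointer back to the original. The function name, field names, and page data are all hypothetical.

```python
import hashlib

def split_document(original_bytes, bod_pages, page_texts):
    """Split one physical document into logical documents at the given
    beginning-of-document (BOD) page numbers, keeping a pointer back to
    the original file's hash and page range (a simplified sketch)."""
    original_hash = hashlib.sha256(original_bytes).hexdigest()
    starts = sorted(bod_pages)
    records = []
    for i, start in enumerate(starts):
        end = starts[i + 1] - 1 if i + 1 < len(starts) else len(page_texts)
        records.append({
            "text": " ".join(page_texts[start - 1:end]),  # indexed separately
            "source_hash": original_hash,  # pointer to the physical document
            "source_pages": (start, end),  # original page sequence
        })
    return records

# Seven-page example: BODs detected at pages 1-4 (the AFEs) and page 5 (the DDR).
pages = [f"page {n}" for n in range(1, 8)]
recs = split_document(b"raw file bytes", [1, 2, 3, 4, 5], pages)
print(len(recs))                 # 5 logical documents
print(recs[4]["source_pages"])   # (5, 7) -- the three-page DDR
```

Because every logical record carries the same `source_hash`, a user can always reconstruct which pages sat together in the original physical document.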
Other posts relating to sampling:
- Measuring Text Bias/Tunnel Vision in Content Search and ECM Systems LINK
- Sampling Resolves Conjecture on Significance of Non-Textual Documents LINK
- The Text Streetlight, Oversampling, and Information Governance LINK