Often it’s the hidden assumptions, the ones that are never made explicit or periodically confirmed, that cause the most damage when they turn out to be wrong. In Information Governance and Enterprise Content Management (ECM), one of the most basic assumptions is that each electronic file holds exactly one logical document, and that each physical document in the paper world is exactly one logical unit. Wrong.
Nowadays practically anybody can take multiple documents from word processing, spreadsheet, and presentation applications and combine them into one large PDF. Scanning operations also routinely set incorrect document boundaries. ECM systems, however, are generally set up so that there is one document type, one title, one author, and one document date per “document.”
When there are multiple documents in one compound document, typically only the attributes of the top document are entered. The embedded documents that follow the top document are essentially lost in space, and in some collections 20% or more of the documents are compound. Users can’t find the embedded documents by performing field-limited searches on the relevant document type, title, author, or date, can’t sort search results by the attributes of the non-primary logical documents, and can’t create formatted reports using those attributes. An asset purchase agreement can be hidden in the ECM record for an invoice.
Enterprises can use sophisticated search algorithms to try to compensate for the loss of precision in the fielded data, but even those algorithms are degraded when multiple logical documents are stored as a single document. In addition, many of an organization’s documents may have no associated text, so they can’t be located by text-based searches or analyzed by text analytics.
In the long run, ECM systems could be modified to permit multiple logical documents per file. This would require that the systems be able to point to the locations within each file where the embedded logical documents begin. In PDFs this could be done by bookmarking the first page of each embedded document.
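To make the idea of “pointing to locations within files” concrete, here is a minimal sketch of what such a record might look like. It is not drawn from any particular ECM product: the `LogicalDocument` class, its field names, the `find_by_type` helper, and the sample values are all hypothetical, chosen only to show how page ranges would let a field-limited search reach an embedded document.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class LogicalDocument:
    """One logical document located inside a (possibly compound) file.

    Hypothetical record layout, not any real ECM schema.
    """
    file_id: str      # identifier of the stored file
    first_page: int   # 1-based, inclusive
    last_page: int    # 1-based, inclusive
    doc_type: str
    title: str
    author: str
    doc_date: date


def find_by_type(index, doc_type):
    """Field-limited search over every logical document, not just the top one."""
    return [d for d in index if d.doc_type == doc_type]


# Made-up sample data: one 40-page file holding two logical documents.
index = [
    LogicalDocument("F-001", 1, 2, "Invoice",
                    "Example invoice", "Example Vendor", date(2015, 3, 2)),
    LogicalDocument("F-001", 3, 40, "Asset Purchase Agreement",
                    "Example agreement", "Example Counsel", date(2015, 2, 27)),
]

hits = find_by_type(index, "Asset Purchase Agreement")
for d in hits:
    print(f"{d.file_id}: pages {d.first_page}-{d.last_page}")
```

With one record per logical document rather than one per file, the agreement that sits behind the invoice has its own type, title, author, and date, so it can be searched, sorted, and reported on like any other document.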
An alternative is to split larger compound documents into the smaller logical documents contained within them so that each ECM record points to just one logical “document.” This would then require a way to associate the extracted documents back to the original larger document, much as email attachments are typically associated with the email that transmitted them.
Visual classification provides the technology that can separate larger compound documents into the smaller logical documents that they contain. Whether the compound documents are native files or scanned documents, BeyondRecognition (BR) learns what the top pages of logical documents look like, and can then identify these beginning-of-document pages in the collection. Depending on what the client wants to do, BR can then place a bookmark at each such location or break the larger file into smaller files and create load files for updating the ECM system.
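Once the beginning-of-document pages are known, the rest is bookkeeping. The sketch below is not BR’s implementation; it only illustrates, under assumed names, how a list of first-page numbers could be turned into page ranges and then into load-file rows that keep a link back to the original compound file, in the same way attachments stay linked to their email. The functions `page_ranges`, `load_file_rows`, and `to_csv`, the column names, and the ID format are all hypothetical.

```python
import csv
import io


def page_ranges(first_pages, total_pages):
    """Turn 1-based beginning-of-document pages into (first, last) ranges."""
    starts = sorted(set(first_pages))
    if any(p < 1 or p > total_pages for p in starts):
        raise ValueError("first page outside the file")
    if not starts or starts[0] != 1:
        starts = [1] + starts  # page 1 always begins a logical document
    ranges = []
    for i, start in enumerate(starts):
        # A document ends on the page before the next one begins.
        end = starts[i + 1] - 1 if i + 1 < len(starts) else total_pages
        ranges.append((start, end))
    return ranges


def load_file_rows(parent_id, first_pages, total_pages):
    """Build one row per logical document, each pointing back to its parent."""
    rows = []
    ranges = page_ranges(first_pages, total_pages)
    for n, (first, last) in enumerate(ranges, start=1):
        rows.append({
            "doc_id": f"{parent_id}-{n:03d}",  # hypothetical child ID scheme
            "parent_id": parent_id,
            "first_page": first,
            "last_page": last,
        })
    return rows


def to_csv(rows):
    """Serialize rows as a simple CSV load file (assumed format)."""
    fields = ["doc_id", "parent_id", "first_page", "last_page"]
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Made-up example: a 40-page file with new documents starting on pages 1 and 3.
rows = load_file_rows("F-001", [1, 3], 40)
print(to_csv(rows))
```

The same page ranges serve either option described above: they can drive bookmarks in the original file, or they can define where to cut the file into smaller ones, with `parent_id` preserving the association to the source.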
BeyondRecognition’s logical document boundary determination process has been found to be far faster, more accurate, and far more consistent than logical document breaks provided by manual review.