Sometimes a large percentage of the files found in unstructured content locations, such as file shares and ECM systems, were actually created by database-driven business systems. These documents are essentially filled-in templates populated with specified database elements. Whether stored as PDF or TIF, these computer-generated files are completely redundant with information stored in the database and can be safely disposed of so long as the database elements and a blank copy of the template are retained.
The benefits are obvious:
- Less clutter in the unstructured content systems.
- Less legal review expense.
- More usability – structured data is far more amenable to analysis, sorting, and reporting than normal unstructured content.
The key is being able to identify, consistently and accurately, the computer-generated documents that can be disposed of. For this purpose, visual classification is in a class by itself, with classification accuracy well above 99.99%, even on image-only files. Enterprise content managers can confidently identify which classifications consist entirely of computer-generated files and confirm that the original system of record retains the needed information before ordering disposition.
For an even higher level of defensibility, visual classification can also extract the data elements used to create the computer-generated report, and those data elements can be used to confirm that the database-driven system does in fact hold all the same content.
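The confirmation step described above can be sketched as follows. This is only a minimal illustration, assuming the extracted elements and the database rows are both available as dictionaries that share a document identifier. The field names (`invoice_id`, `customer`, `total`), the sample values, and the function `find_unconfirmed` are hypothetical and not taken from any particular product.

```python
def find_unconfirmed(extracted_docs, database_rows, key="invoice_id"):
    """Return the keys of extracted documents the database does not fully confirm.

    A document counts as confirmed only if the database holds a row with the
    same key and every extracted field matches that row's value.
    """
    db_by_key = {row[key]: row for row in database_rows}
    unconfirmed = []
    for doc in extracted_docs:
        row = db_by_key.get(doc[key])
        if row is None or any(row.get(field) != value for field, value in doc.items()):
            unconfirmed.append(doc[key])
    return unconfirmed


# Hypothetical sample data: values extracted from documents vs. the system of record.
extracted = [
    {"invoice_id": "INV-001", "customer": "Acme", "total": "150.00"},
    {"invoice_id": "INV-002", "customer": "Globex", "total": "75.50"},
    {"invoice_id": "INV-003", "customer": "Initech", "total": "20.00"},
]
database = [
    {"invoice_id": "INV-001", "customer": "Acme", "total": "150.00"},
    {"invoice_id": "INV-002", "customer": "Globex", "total": "80.00"},  # total differs
]

# INV-002 has a mismatched total and INV-003 is missing from the database.
print(find_unconfirmed(extracted, database))
```

Under this sketch, only files whose identifiers are absent from the returned list would be candidates for disposition. Anything listed would need review first.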
The ability to extract original data elements gives extra flexibility in working with “unstructured” content, even when the original database is lost or unavailable. Content managers can extract data elements and rebuild the tables and rows that were originally used, which can be quite useful in a number of contexts.
For example, if a company were required to produce data showing sales to a particular type of customer in a particular region for a given time period, individual invoices could be processed, the results loaded into a database or spreadsheet, and a simple report created. This could be far less burdensome than producing potentially tens of thousands of individual invoices, and it would give the producing party the flexibility to leave out of the report any information that is superfluous for the stated purpose or that consists of PII or trade secrets.
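The invoice scenario above might look something like the following sketch, which uses Python's built-in `sqlite3` module. It assumes the data elements have already been extracted from the invoices. The table layout, column names, sample rows, and the functions `load_invoices` and `sales_report` are all hypothetical, chosen only to illustrate the idea.

```python
import sqlite3

# Hypothetical rows standing in for data elements extracted from individual invoices.
# Dates are ISO 8601 strings, which compare correctly as text.
# The last column (a contact email) is PII and is kept out of the report.
EXTRACTED_INVOICES = [
    ("INV-001", "2023-01-15", "Acme", "retail", "West", 15000, "pat@acme.example"),
    ("INV-002", "2023-02-03", "Globex", "retail", "East", 7550, "lee@globex.example"),
    ("INV-003", "2023-03-20", "Initech", "retail", "West", 2000, "sam@initech.example"),
    ("INV-004", "2023-07-01", "Acme", "retail", "West", 9900, "pat@acme.example"),
]


def load_invoices(conn, rows):
    """Rebuild an invoices table from the extracted rows."""
    conn.execute(
        """CREATE TABLE invoices (
               invoice_id TEXT PRIMARY KEY,
               invoice_date TEXT,
               customer TEXT,
               customer_type TEXT,
               region TEXT,
               amount_cents INTEGER,
               contact_email TEXT
           )"""
    )
    conn.executemany("INSERT INTO invoices VALUES (?, ?, ?, ?, ?, ?, ?)", rows)
    conn.commit()


def sales_report(conn, customer_type, region, start, end):
    """Total sales per customer for one customer type, region, and date range.

    Only the customer name and the summed amount are selected, so the PII
    column never appears in the output.
    """
    cursor = conn.execute(
        """SELECT customer, SUM(amount_cents)
           FROM invoices
           WHERE customer_type = ? AND region = ?
             AND invoice_date BETWEEN ? AND ?
           GROUP BY customer
           ORDER BY customer""",
        (customer_type, region, start, end),
    )
    return cursor.fetchall()


conn = sqlite3.connect(":memory:")
load_invoices(conn, EXTRACTED_INVOICES)
report = sales_report(conn, "retail", "West", "2023-01-01", "2023-03-31")
print(report)  # [('Acme', 15000), ('Initech', 2000)]
```

Because the query names its output columns explicitly, anything not needed for the stated purpose stays in the rebuilt table and out of what is produced.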
For more information on managing your unstructured content, email firstname.lastname@example.org.
- Documents ARE Structured – Just Heterogeneously: http://beyondrecognition.net/documents-are-structured-just-heterogeneously/
- Boosting PII Detection and Protection in “Unstructured” Content: http://beyondrecognition.net/boosting-pii-detection-and-protection-in-unstructured-content/
- Need More than Text Search for Unstructured Content: http://beyondrecognition.net/need-more-than-text-search-for-unstructured-content/