Perhaps the best way to see how new technology can be used is to, in essence, “ride along” as a client puts it to use. That’s what this posting does: it takes you along during the critical phases of implementing a visual classification project.
We pick up after the client has mapped out which files are to be collected, the files have been collected and ingested, and the clusters of visually similar documents have been formed automatically. Note that the clustering or classification does not use or require text: documents are classified based on their visual appearance, not on file type.
This is the screen layout the client uses to interact with the clustered data:
The client begins by reviewing a few documents from the largest clusters, those with the most documents in them. At the beginning of a project this may be a somewhat ad hoc process as the client is just trying to gain a sense of the types of documents present in the collection.
Cluster review can begin very early in the project as the clusters will only grow in size over time. Because the largest clusters have, by definition, the most documents, clusters representing a large percentage of the overall documents can be reviewed in a relatively short time. As more documents are processed, the client can review only new clusters that form.
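To make the "largest clusters first" point concrete, here is a minimal sketch of how one might work out how many clusters need to be reviewed to cover a given share of the collection. The function name, the cluster ids, and the document counts are all made up for illustration; they are not taken from any client project or from the product itself.

```python
def clusters_to_review(cluster_sizes, target_coverage):
    """Pick the largest clusters until they cover target_coverage of all documents.

    cluster_sizes: dict mapping a cluster id to its document count (hypothetical input).
    target_coverage: fraction of the collection to cover, e.g. 0.75.
    """
    total = sum(cluster_sizes.values())
    if total == 0:
        return []
    # Largest clusters first, since they account for the most documents.
    ranked = sorted(cluster_sizes.items(), key=lambda item: item[1], reverse=True)
    selected = []
    covered = 0
    for cluster_id, size in ranked:
        if covered / total >= target_coverage:
            break
        selected.append(cluster_id)
        covered += size
    return selected


# Illustrative numbers only: five clusters totalling 10,000 documents.
sizes = {"c1": 5000, "c2": 3000, "c3": 1000, "c4": 600, "c5": 400}
print(clusters_to_review(sizes, 0.75))
```

With these made-up numbers, reviewing samples from just two of the five clusters already touches 80% of the documents, which is why the early review goes quickly.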
Here are some of the things that clients have discovered during this early review process:
- The software code for a major business system had been left unencrypted and in plain view on an unsecured file share.
- Numerous system log files with no ongoing value had never been cleaned up.
- Nontextual documents like image-only PDFs that had been literally invisible to the client’s text-based retrieval system were organized and reviewable for the first time ever.
- Document types that routinely included PII were quite apparent.
- Multiple pages were found in documents that should have been single-page, meaning that in most systems the second and subsequent pages would have been very difficult to locate. For example, in the oil & gas industry, many tests and certifications are one-page reports that are indexed by well number. When the second pages relate to a different well, they are placed in the wrong well file or folder and are virtually hidden from sight.
- Multiple document types had been included in single PDF or TIF files, so the buried documents did not have correct document-type designations and did not have appropriate metadata extracted. They were literally invisible in the system.
- A scanning vendor’s bar code sheets had been left in a repository of scanned documents.
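The multi-page finding in the list above lends itself to a simple check once documents are grouped into clusters. The sketch below is one possible way to do it, not a description of how the clustering software works: it assumes a per-document page count is available, treats the most common page count in a cluster as the "expected" length, and flags anything longer. The function name and the sample ids are hypothetical.

```python
from collections import Counter


def flag_extra_pages(page_counts):
    """Return ids of documents with more pages than is typical for their cluster.

    page_counts: dict mapping a document id to its page count (hypothetical input).
    The "typical" length is assumed to be the most common page count in the cluster.
    """
    if not page_counts:
        return []
    typical, _ = Counter(page_counts.values()).most_common(1)[0]
    return sorted(doc_id for doc_id, pages in page_counts.items() if pages > typical)


# Illustrative cluster of one-page reports with a single two-page file mixed in.
cluster = {"report-001": 1, "report-002": 1, "report-003": 2, "report-004": 1}
print(flag_extra_pages(cluster))
```

A flagged file would still need a human look, since the extra page might be a legitimate continuation rather than a report for a different well.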
After a general perusal of the data, the client begins a systematic review of all the larger clusters to determine whether or not they contain documents that have any ongoing business, regulatory, or litigation value. If they do not, they can be tagged for disposition. This simple step of weeding out unwanted files can save appreciable drive space and dramatically lower the number of files.
If the cluster contains documents that have ongoing value, the reviewer assigns a document-type label to the documents in the cluster, using a user-definable three-level document-type tree or taxonomy. The document-type designations then become the basis for assigning retention periods, assessing PII and other privacy concerns, determining where to store the content, and deciding which job positions in which business units should have access to it.
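As a rough illustration of how a cluster-level decision can drive what happens downstream, here is a small sketch of a three-level document-type label and a lookup of retention periods. Everything specific in it is an assumption made for the example: the class and function names, the taxonomy entries, and the retention periods are invented and do not reflect any client's actual schedule.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DocumentType:
    """A label from a three-level document-type tree (levels are user-defined)."""
    level1: str
    level2: str
    level3: str


@dataclass
class ClusterDecision:
    cluster_id: str
    # None means the reviewer found no ongoing value in the cluster.
    document_type: Optional[DocumentType]


# Hypothetical retention schedule, in years, keyed by document type.
RETENTION_YEARS = {
    DocumentType("Operations", "Well Files", "Daily Drilling Report"): 10,
    DocumentType("Finance", "Approvals", "AFE"): 7,
}


def plan_for(decision):
    """Turn a reviewer's cluster decision into a disposition or retention plan."""
    if decision.document_type is None:
        return {"cluster": decision.cluster_id, "action": "dispose"}
    doc_type = decision.document_type
    return {
        "cluster": decision.cluster_id,
        "action": "retain",
        "document_type": " / ".join((doc_type.level1, doc_type.level2, doc_type.level3)),
        # None here means the type has no retention period assigned yet.
        "retention_years": RETENTION_YEARS.get(doc_type),
    }


print(plan_for(ClusterDecision("c7", None)))
print(plan_for(ClusterDecision("c2", DocumentType("Finance", "Approvals", "AFE"))))
```

The point of the design is that the reviewer makes one decision per cluster, and every document in that cluster inherits the label and whatever rules hang off it.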
Other postings of interest:
- Metrics of File Share Remediation, AKA Defensible Deletion
- Information Governance Lessons from 4 AFEs and a Daily Drilling Report
- Basic Assumptions Gone Wrong, ECM and Document Unitization
- TAR Defensibility Soft Spots: Text-Dependence and Document Unitization
- Metrics of File-Share Remediation
- Document Type Risk Assessment: The InfoSec Key for Unstructured Content