All successful document-centric information governance (IG) initiatives need accurate, consistent document classification.
BeyondRecognition (BR) provides that foundation.
The examples below show where BR has been used to provide rapid, scalable classification of native electronic files as well as scanned or faxed paper documents.
File Share Remediation
BeyondRecognition technology helps clean up file shares and other general file storage devices by removing duplicate files, identifying unneeded document types, and extracting critical document attributes. It does this by:
- Identifying bit-level duplicate files that can be defensibly deleted,
- Clustering visually similar files so the clusters can quickly be reviewed to determine:
  - Whether the clusters contain items of business, regulatory, or legal value and need to be retained,
  - What document-type label to associate with the clusters, and
  - What document attributes to extract from the face of the documents in the clusters.
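As a rough illustration of the first step, bit-level duplicates can be found by hashing file contents and grouping files that share a digest. The sketch below is a generic approach, not BR's implementation; the function names and the choice of SHA-256 are assumptions made for the example.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicate_groups(root):
    """Group files under `root` whose contents hash to the same digest.

    Hypothetical helper: files sharing a SHA-256 digest are treated as
    bit-level duplicates of one another.
    """
    by_digest = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            by_digest[file_digest(path)].append(path)
    # Keep only digests shared by two or more files.
    return [paths for paths in by_digest.values() if len(paths) > 1]
```

Each returned group holds paths with identical content, so one copy can be kept and the rest flagged for deletion.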
The number of clusters is generally less than one percent of the number of files examined, and by reviewing the largest clusters first, domain experts can typically assess clusters representing over 99% of the files for retention/disposition and document-type designation in less than a week. A separate group of specialists performs zonal attribute extraction using a graphical interface, which generally takes weeks or months.
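The largest-first review order works because cluster sizes are usually very uneven: a few large clusters account for most files. The sketch below counts how many clusters must be reviewed to reach a given share of the files. The function name and the sample sizes are invented for illustration and do not come from any BR project.

```python
def clusters_needed_for_coverage(cluster_sizes, target=0.99):
    """Return how many clusters, reviewed largest first, cover `target` of all files."""
    total = sum(cluster_sizes)
    if total == 0:
        return 0
    covered = 0
    for count, size in enumerate(sorted(cluster_sizes, reverse=True), start=1):
        covered += size
        if covered >= target * total:
            return count
    return len(cluster_sizes)


# Made-up cluster sizes: one dominant cluster and a tail of small ones.
sizes = [9000, 950, 30, 10, 10]
needed = clusters_needed_for_coverage(sizes, target=0.99)
```

With these sample sizes, reviewing the two largest clusters already covers more than 99% of the files.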
As a result of eliminating duplicates and discarding unneeded files, BR remediation typically reduces the space required to store the remaining files by up to 70% compared to the starting volume. This approach to remediation has other, equally significant benefits:
- Granular retention implementation. Records retention schedules can be implemented at the document-type level, providing an effective way to avoid the seemingly endless retention of unneeded documents.
- InfoSec risk assessment. While clusters are being assessed initially, information security specialists can evaluate the clusters to determine if they are likely to contain personally identifiable information, payment card industry information, HIPAA-protected information, trade secrets, or other sensitive information. These assessments can drive decisions on where to store specific document types and who should be given access to them.
- Enhanced usability. Being able to associate consistent, predictable document types with the retained documents and being able to extract key document attributes makes the final collection far more usable and useful for ongoing business purposes.
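To make the idea of document-type-level retention concrete, the sketch below maps document types to retention periods and computes when a document becomes eligible for disposition. The type names and periods are hypothetical placeholders, not legal guidance and not part of any BR product.

```python
from datetime import date

# Hypothetical schedule: document type -> retention period in years.
RETENTION_YEARS = {
    "invoice": 7,
    "w2_form": 4,
    "marketing_flyer": 1,
}


def disposition_date(doc_type, record_date, schedule=RETENTION_YEARS):
    """Return the date a document becomes eligible for disposition.

    Returns None when the document type has no entry in the schedule,
    signalling that the document needs manual review.
    """
    years = schedule.get(doc_type)
    if years is None:
        return None
    target_year = record_date.year + years
    try:
        return record_date.replace(year=target_year)
    except ValueError:
        # Feb 29 in a non-leap target year: fall back to Feb 28.
        return record_date.replace(year=target_year, day=28)
```

Because the rule is keyed on document type, changing one entry in the schedule changes the disposition date for every document in the matching clusters.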
Content Migration
BeyondRecognition expedites migrating content to new management platforms in several ways:
- Consistent classification. BR can analyze documents in the target system that have already been classified as records and can use them to learn what records look like and the names of the document types that have been assigned to them (see “Crazy Ivan” on the BR 101, Terms and Concept resource). This expedites the migration by helping ensure that only content that has ongoing business, regulatory, or legal value is retained and that it is retained using consistent classification.
- Full attribution. Target systems typically have a tagging or document attribute system to permit tracking of fielded information about the files and documents being managed. As part of its processing, BR captures existing object-level metadata and carries it forward. It also provides visual attribute extraction technology to capture data elements that are visible on the face of the documents in the clusters.
- Content enabling. As part of its processing, BR adds a text layer to image-only documents, enhancing their retrievability and usefulness in the target system.
- Consolidating native electronic and scanned paper document content. BR’s visual classification system groups or clusters both native electronic and scanned paper images consistently, permitting organizations to manage both electronic and scanned collections in the same management system. Word, PDF, scanned TIF, and faxed documents are all treated consistently for classification, and BR adds a text layer to image-only versions for retrievability and attribution.
Paper Archive Digitization
Nowhere is the value of visual classification more apparent than in digitizing paper archives. Original documents may have been in poor condition, and the image capture may have been done over many years using a variety of hardware and software, leaving document images of varying quality and resolution and sometimes even without document boundaries, e.g., with entire folders or boxes scanned into single PDF or TIF files.
Here is why BR is invaluable in converting scanned or faxed paper archives into useful, functioning content collections:
- Text-independent classification. BR’s initial clustering of documents is done based on graphical analysis of visual representations of the documents, not on associated text or text layers. This means that documents scanned at 100, 150, or 300 dpi and faxed documents can be classified consistently with native electronic files such as Word or PowerPoint.
- Retention/Disposition and Document-Type Assessment. Once the scanned documents have been clustered, the client can review them to determine if they need to be retained, what document-type label to associate with each cluster, and which attributes to extract from the documents in each cluster.
- Logical document boundary determination (“LDBD”). By doing page-level visual classification, BR can quickly learn what the beginnings of documents look like, and this knowledge can be used to separate multi-document files into discrete documents. This greatly improves many information governance tasks, such as assigning retention/disposition schedules, retrieving content, and setting appropriate access rights to discrete documents or document types. LDBD also gives clients the option to speed day-forward scanning operations by skipping document boundary capture tasks at the time of scanning.
- Adding text layers. As part of its processing, BR adds text layers to image-only documents for purposes of enhancing subsequent retrieval in target management systems and for attribution.
- Visual attribute extraction. BR technology permits clients to extract data elements or attributes from documents by clicking and dragging on one document image per cluster.
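The logical document boundary step described above can be pictured as follows: once each page has been flagged as looking (or not looking) like the first page of a document, a multi-document file is split wherever a flagged page occurs. This is a minimal sketch; the function name and the boolean input format are assumptions, and it says nothing about how BR actually classifies pages.

```python
def split_into_documents(page_is_start):
    """Group page indices into documents, starting a new one at each flagged page.

    `page_is_start` is a list of booleans, one per page of the scanned file,
    where True means the page was classified as the beginning of a document.
    Pages before the first flagged page are kept together as one document.
    """
    documents = []
    for index, is_start in enumerate(page_is_start):
        if is_start or not documents:
            documents.append([index])
        else:
            documents[-1].append(index)
    return documents


# Six scanned pages; pages 0, 3, and 4 look like document beginnings.
flags = [True, False, False, True, True, False]
docs = split_into_documents(flags)
```

Here `docs` holds three page groups, `[0, 1, 2]`, `[3]`, and `[4, 5]`, each of which can then be given its own document type, retention schedule, and access rights.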
PII Detection and Protection
Most personally identifiable information and other sensitive data (including PCI, HIPAA-protected data, intellectual property, and trade secret information) occurs within predictable document types. For example, IRS Forms W-2 and 1099 will contain social security numbers or tax IDs, and hospital admission forms will contain sensitive financial and health information.
Visual classification permits clients to cluster all electronic and scanned or faxed paper documents to perform an information security risk audit while those clusters are being assessed for retention/disposition and being assigned document types. Those assessments drive which clusters should be retained, where they should be stored, who should have access to them, and when they can be flagged for disposition.
In addition, BR’s technology permits mass redaction of PII and other sensitive information, based either on text patterns or on zonal page coordinates for specific clusters. BR can perform in excess of 700,000 redactions per hour and provides a log of the specific locations redacted and the reason for each redaction.
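Text-pattern redaction with a log of locations and reasons can be sketched with a regular expression. The example below is a generic illustration rather than BR's redaction engine: the pattern covers only US social security numbers written as 123-45-6789, and the function name, mask text, and log format are assumptions.

```python
import re

# Matches US social security numbers written as 123-45-6789 (assumed format).
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact(text, pattern=SSN_PATTERN, reason="SSN", mask="[REDACTED]"):
    """Replace each pattern match with `mask` and log where it was found.

    Returns the redacted text and a list of log entries. The start and end
    offsets in each entry refer to positions in the original text.
    """
    log = []

    def _replace(match):
        log.append({"start": match.start(), "end": match.end(), "reason": reason})
        return mask

    return pattern.sub(_replace, text), log


redacted, entries = redact("Employee SSN: 123-45-6789, phone 555-1234.")
```

In this sample the SSN is masked, the phone number is left alone because it does not fit the pattern, and `entries` records a single redaction with its offsets and reason.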