BeyondRecognition technology is unique in classifying electronic files and scanned documents using graphical analysis.
All documents have visual representations, but not all document files have text layers or text that accurately represents their contents. As a consequence, BR provides a far more comprehensive classification of an organization’s documents than traditional text-dependent approaches. Further, graphical analysis is far more consistent and reliable in identifying document types than text-based approaches.
Without using any associated text, BR takes your unstructured content and helps you organize and classify it. You can quickly go from unknown, unclassified files to groups of visually-similar files.
BR provides all the technology required to collect files from any URI-addressable storage device, classify native and scanned documents, extract attributes from them, and load the results into any content storage system. More detailed descriptions of BR’s technology are available for download under the Resources section, but here is a high-level summary of BR’s technology, presented in the order it would be used in a typical workflow.
BR provides USB collectors that attach to clients’ networks. Collectors use an inherited security model and have only the access rights established for the user connecting the USB device and the workstation to which it is attached. Collectors contain scripts defining the drives or devices they will examine. All files on the target drives or devices are hashed. The hash values are compared to lists of known system files and to listings of already-collected content files. Each instance of each file is logged to record location, file name, and hash value. If the hash value for a file has not been included on any earlier list of hash values, the file is copied onto the collector, using proprietary compression algorithms and up to 256-bit encryption.
Because of the bit-level file deduplication and compression, collectors typically can collect from drives several times larger than the USB device.
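The hash-and-compare step can be illustrated with a minimal sketch. It is not BR’s collector code: SHA-256 is an assumed hash function (the text does not name one), `collect` and `sha256_of` are hypothetical names, and compression and encryption are left out.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file's bytes in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def collect(root: Path, known_hashes: set[str]):
    """Log every file instance; select only previously unseen content for copying."""
    log = []      # one entry per file instance: (location, file name, hash value)
    to_copy = []  # first instance of each hash not on any earlier list
    seen = set(known_hashes)  # known system files plus already-collected files
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        file_hash = sha256_of(path)
        log.append((str(path.parent), path.name, file_hash))
        if file_hash not in seen:
            seen.add(file_hash)
            to_copy.append(path)
    return log, to_copy
```

Every instance is logged, but each distinct byte sequence is copied at most once, which is why the copied volume can be much smaller than the drives examined.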
Data from the collectors are placed on a BR server where further bit-level deduplication across collectors takes place. The BR server can be located on BR’s cloud or behind the client firewall.
The BR server creates visual representations of all files and groups or clusters documents that are visually similar. The clusters are self-forming: there are no rules to write and no exemplars or seed sets to select.
Because grouping or clustering is achieved using visual classification, native files will be clustered with visually-similar image-only PDFs, faxed documents, or scanned documents.
BeyondRecognition provides workflow technology that enables clients to quickly assess the clusters of visually-similar documents that have formed during the clustering phase. There can be variations in the workflow depending on the reason for the assessment. For example, during file share remediation the assessment decisions are usually (1) are the clusters “records,” (2) if they are, what document-type labels should be used, and (3) what attributes should be extracted from them. For discovery review the decisions may be (1) are the documents relevant and (2) if so, do they contain PII.
Each of the companies in the BR Network of companies has domain expertise in applying BR technology in different industries or contexts. For present purposes we’ll discuss the file share remediation example.
a. Retention or Disposition. The initial assessment is whether the documents in a given cluster serve any ongoing business, regulatory, or legal purpose. This disposition or retention decision can be made by reviewing one or two documents per cluster, yet it affects all documents ever placed in the cluster, so a single decision can apply to tens of thousands of documents.
Clusters are typically reviewed in order of size, starting with those containing the most documents. Typically within 20 hours, a domain expert can review one or two documents per cluster for clusters that together hold well over 99% of the documents in the collection.
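The effect of reviewing the largest clusters first can be shown with simple arithmetic. The sketch below is illustrative only: `clusters_needed` is a hypothetical helper and the cluster sizes in the usage note are invented, not measurements from a BR project.

```python
def clusters_needed(cluster_sizes, target=0.99):
    """Count how many clusters, taken largest first, cover the target share of documents."""
    total = sum(cluster_sizes)
    covered = 0
    for reviewed, size in enumerate(sorted(cluster_sizes, reverse=True), start=1):
        covered += size
        if covered >= target * total:
            return reviewed
    return len(cluster_sizes)
```

With made-up sizes of 5000, 3000, 1500, 400, 50, 25, 15, 5, 3, and 2 documents, the four largest clusters already hold 9,900 of the 10,000 documents, so `clusters_needed` returns 4 for a 99% target.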
b. Document-Type Designation. If the documents in a cluster are going to be retained, BR provides an interface where a domain expert can assign document-type labels to the documents in that cluster. The document-type taxonomy is usually set up with the top level being the business unit, then document type for the second level, and sub-type for the third level.
As with disposition or retention decisions, decisions as to document typing will apply to all documents that are ever placed in the cluster. Once a point of convergence is reached, the BR system will be placing new documents in already-established clusters and document-types, making maintenance of the system far simpler.
c. Attribute Extraction. The domain experts who are making the disposition/retention decisions and designating document types can also identify the specific data elements to extract from the document types, e.g., to extract the API Well Number from well logs, or to extract an applicant’s Social Security number from a loan application.
BR provides a special graphical interface for the operators who identify where those data elements are in each cluster. Clicking and dragging on one document in a cluster operates to extract that data element from all documents in a cluster from that location. BR provides a set of delimiters to help select attributes within regions as well as a set of filters to format the extracted content.
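The idea of defining a zone once and applying it across a cluster can be sketched as follows. This is an illustration under stated assumptions, not BR’s interface: it assumes each document is already available as words with page coordinates (for example from OCR), and `Word`, `Zone`, `extract_from_zone`, and `extract_for_cluster` are hypothetical names. The `after` argument stands in for a delimiter and `keep` for a formatting filter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    text: str
    x: float  # horizontal centre of the word on the page
    y: float  # vertical centre of the word (increasing downward)


@dataclass(frozen=True)
class Zone:
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, word: Word) -> bool:
        return self.left <= word.x <= self.right and self.top <= word.y <= self.bottom


def extract_from_zone(words, zone, after=None, keep=None):
    """Return the text found inside the zone of one document."""
    inside = sorted((w for w in words if zone.contains(w)), key=lambda w: (w.y, w.x))
    text = " ".join(w.text for w in inside)
    if after is not None and after in text:
        text = text.split(after, 1)[1]  # delimiter: keep only what follows it
    text = text.strip()
    if keep is not None:
        text = "".join(ch for ch in text if keep(ch))  # filter: format the value
    return text


def extract_for_cluster(cluster, zone, **options):
    """Apply one zone definition to every document in a cluster."""
    return {doc_id: extract_from_zone(words, zone, **options)
            for doc_id, words in cluster.items()}
```

Because documents in a cluster share a layout, a zone drawn on one document can be reused for the rest; for instance, `extract_for_cluster(cluster, zone, after="No:", keep=str.isdigit)` would keep only the digits following a "No:" label in each document.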
After attributes are extracted, they can be run through quality control processes that vary with the purpose of the processing and the constraints involved.
There are a number of options available to clients for receiving the results of BR processing:
- Delimited file containing the document identifier, cluster ID, assigned document type, and extracted attribute values.
- Image-over-text PDF or TIF that can include any of the values from the delimited file.
- Custom load file designed for client’s designated content storage system.
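As an illustration of the first option, a delimited file with these fields could be produced with Python’s standard `csv` module. The column names, the pipe delimiter, and the single `api_well_number` attribute column are assumptions made for this sketch; the actual layout is not specified here.

```python
import csv
import io

# Hypothetical column names; one column per extracted attribute.
FIELDS = ["document_id", "cluster_id", "document_type", "api_well_number"]


def write_delimited(rows, delimiter="|"):
    """Serialize result rows to delimited text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, delimiter=delimiter,
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

A three-level document type can be carried in a single field, for example `Operations/Well Log/Mud Log` for business unit, document type, and sub-type.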
BeyondRecognition provides special processing for emails and can read and extract values from PSTs, Notes, GroupWise, or Novell storage files. Email processing can be summarized as follows:
- In preparation for deduping across different types of email storage containers, normalize all email messages to RFC 5322 to eliminate differences introduced by those systems.
- Dedupe email messages and attachments across custodians.
- Consolidate emails that are included within later, longer emails in a thread.
- Process email attachments along with other native files or scanned documents so that they can be assessed. Those assessments can be assigned to or inherited by the emails which transmitted copies of them.
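The normalize-then-dedupe steps can be sketched with Python’s standard `email` package. This is a simplified illustration, not BR’s process: it assumes single-part plain-text messages that carry a Date header, and the choice of sender, sent time, subject, and body as the fields that define a duplicate is an assumption. `dedupe_key` and `dedupe` are hypothetical names.

```python
import hashlib
from datetime import timezone
from email import message_from_string
from email.utils import parseaddr, parsedate_to_datetime


def dedupe_key(raw_message: str) -> str:
    """Build a container-independent fingerprint for one plain-text message."""
    msg = message_from_string(raw_message)
    sender = parseaddr(msg.get("From", ""))[1].lower()
    sent = parsedate_to_datetime(msg["Date"])
    if sent.tzinfo is None:  # a "-0000" offset parses as naive; treat it as UTC
        sent = sent.replace(tzinfo=timezone.utc)
    sent_utc = sent.astimezone(timezone.utc).isoformat()
    subject = " ".join(msg.get("Subject", "").split())  # collapse whitespace
    body = " ".join(msg.get_payload().split())          # ignore line-ending differences
    fields = "\n".join([sender, sent_utc, subject, body])
    return hashlib.sha256(fields.encode("utf-8")).hexdigest()


def dedupe(raw_messages):
    """Keep the first message seen for each fingerprint, across all custodians."""
    unique = {}
    for raw in raw_messages:
        unique.setdefault(dedupe_key(raw), raw)
    return list(unique.values())
```

Normalizing before hashing matters because the same message exported from two systems can differ in header capitalization, time zone notation, or line endings while being the same email.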
For more detail, download the Technical Overview.
BR provides two ways for clients to achieve high-speed, accurate redactions. First, BR can identify words or terms that match patterns defined by the client (e.g., “nnn-nn-nnnn” for Social Security numbers, where “n” is a digit) or that appear on lists of terms. Second, it enables clients to redact zones within specific areas of document types or clusters, even if there is no searchable text in those zones. Here are two examples:
- Searching for a text pattern where the original document was typed.
- Redacting a zone within a cluster of 1040s to redact even handwritten content.
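Separately from the zone-based example, the pattern-and-list form of redaction can be illustrated on plain text. The sketch below is not BR’s implementation: `pattern_to_regex` and `redact` are hypothetical helpers that translate the “n means digit” notation into a regular expression and mask each match.

```python
import re


def pattern_to_regex(pattern: str):
    """Translate a client pattern such as 'nnn-nn-nnnn' (n = digit) into a regex."""
    parts = [r"\d" if ch == "n" else re.escape(ch) for ch in pattern]
    # Lookarounds keep the pattern from matching inside a longer run of digits.
    return re.compile(r"(?<!\d)" + "".join(parts) + r"(?!\d)")


def redact(text, patterns=(), terms=(), mask="X"):
    """Mask every pattern match and every listed term, preserving length."""
    regexes = [pattern_to_regex(p) for p in patterns]
    regexes += [re.compile(re.escape(t), re.IGNORECASE) for t in terms]
    for regex in regexes:
        text = regex.sub(lambda m: mask * len(m.group()), text)
    return text
```

For example, `redact("SSN 123-45-6789 for John Doe", patterns=["nnn-nn-nnnn"], terms=["john doe"])` masks both the number and the name while leaving the rest of the sentence intact.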