When used by BR the following terms have the indicated meanings (starting with the most basic items):
Document. A writing (whether handwritten, typed, printed or typeset) or communication intended to be comprehended by a person viewing it either in paper form or when displayed on an electronic device, and usually but not always involving visual representations of words or numbers. Movies and animations are not typically included within the definition of “document,” but sometimes may be.
Author. The person or system that initially created a document.
Owner. The person or business unit that “owns” or is responsible for a document.
Textual Characters. Numbers, letters of the alphabet, punctuation, and ideographs used in visually representing a language.
Meaning. The information conveyed in a document. Meaning can be conveyed in a document in several ways including:
- Words expressed using textual characters.
- Text formatting, e.g., large font size for titles or headings; bold, italics, or underlining for significant terms or phrases; and superscript or subscript for formulae or footnotes.
- Document formatting conventions whether general, industry-specific, or company-specific, e.g., the name of the author of a letter is shown at the top of the first page.
- Explicit labeling, e.g., “First Name:”
- Specialized vocabulary, e.g., “CTX1300” may refer to a particular make and model of motorcycle.
- Symbols, e.g., maps may have a special symbol for toll roads or cities of a specified size.
- Signatures, often denoting authorship, acceptance, approval, attendance, or participation.
- Stamps, often denoting receipt or approval.
- Diagrams and charts.
Domain Expert. A person whose unique knowledge is required to evaluate documents or identify the explicit or implicit meanings in them. Ideally the domain expert would be able to provide the context in which certain document types were used, e.g., “if the document has this layout that means it was used in EDI transfers between 2001 and 2003.” Equivalent terms include “Subject Matter Expert” and “Client Knowledge Worker.”
Record. For records management purposes, a document that an organization needs to retain for business or regulatory purposes.
Glyph. A shape or form in a document formed by pixels that are of a sufficiently different color from the background of the document as to be identifiable. A glyph may be a letter, number, punctuation mark, or non-textual symbol with meaning (e.g., the Nike logo), or a glyph may be essentially errata or digital flotsam and jetsam without significant intended meaning (e.g., staple holes on a scanned document).
Global Glyph Catalog (“GGC”). A processing technique used by BR in which the location of each glyph is cataloged or indexed, and glyphs of the same shape are associated with each other. The GGC stores the textual values associated with glyph clusters. The GGC accumulates glyph clusters and their associated values, and the accumulated catalog is rolled forward to other projects and processes so that they do not have to start from scratch.
Glyph Recognition (“GR”). The process of associating text values with the glyphs present in a document.
Single-Instance Text Editing. A feature of BR in which editing the textual value associated with one instance of a glyph corrects all other instances of that glyph in its glyph cluster. This is made possible by BR’s process of building a Global Glyph Catalog.
DIFFERENTIATOR: Old optical character recognition (“OCR”) technology does not build or maintain a Global Glyph Catalog. It is therefore unable to provide single-instance text editing. OCR also generally yields far lower word and character accuracy because of problems such as speckling, skew, and low-resolution images.
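The relationship between a glyph catalog and single-instance text editing can be illustrated with a minimal sketch. This is not BR's implementation: the `GlyphCatalog` class, the `shape_key` strings, and the location tuples are all hypothetical stand-ins, and the sketch assumes that glyphs of the same shape have already been matched to one key.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class GlyphCluster:
    """All occurrences of one glyph shape plus the text value assigned to it."""
    text_value: str = ""
    locations: list = field(default_factory=list)  # (doc_id, page, x, y)


class GlyphCatalog:
    """Hypothetical catalog: shape key -> cluster of locations and one text value."""

    def __init__(self):
        self.clusters = defaultdict(GlyphCluster)

    def add(self, shape_key, doc_id, page, x, y):
        # Record where this shape occurs; same shape means same cluster.
        self.clusters[shape_key].locations.append((doc_id, page, x, y))

    def set_text(self, shape_key, text_value):
        # One edit on the cluster; every cataloged occurrence picks it up.
        self.clusters[shape_key].text_value = text_value

    def render(self, shape_keys):
        # Text of a document expressed as a sequence of shape keys.
        return "".join(self.clusters[key].text_value for key in shape_keys)


catalog = GlyphCatalog()
doc_a = ["s1", "s2", "s3"]
doc_b = ["s3", "s2", "s1"]
for doc_id, keys in (("A", doc_a), ("B", doc_b)):
    for position, key in enumerate(keys):
        catalog.add(key, doc_id, 1, position * 10, 0)

for key, value in (("s1", "c"), ("s2", "a"), ("s3", "t")):
    catalog.set_text(key, value)

# Correcting one cluster value changes the text of both documents.
catalog.set_text("s1", "e")
```

Because the text value lives on the cluster rather than on each occurrence, a single correction applies everywhere that shape was cataloged, including in documents added later.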
Visual Classification. BR’s process of automatically grouping documents based on their appearance.
Granularity. Classification schemes that depend on manual classification typically have to use “big buckets” for classifications because of the limitations of the people doing the classifications. BR’s automated process provides far more useful granularity for the classification process.
Content Normalization. BR compares documents based on their content, not on an analysis of the type of file containing the contents. Scanned documents, native files, and PDFs created directly from within native file applications can all be compared.
Text Independence. Other classification technologies require an analysis of the textual values associated with a document. These technologies will not be able to classify documents with no associated text values, e.g., image-only scanned documents, and will do a poor job when the text associated with documents is of poor quality. These factors may affect 30-40% of a given document population.
No Resource-Intensive Rule Creation and System Training. Other automated processes depend on the creation of textual analysis rules or the selection of multiple exemplar documents for each class. Those rules are time-consuming to create and are apt to be brittle in that they cannot anticipate changes in the format or vocabularies used in later versions of those documents, and manually selecting exemplars for thousands of document types is typically a slow, laborious process.
Awareness. BR’s visual classification groups documents whether or not the system administrator is aware of them. By contrast, other technologies require the creation of rules before documents are grouped or clustered. Periodically reviewing BR clusters or classifications keeps domain experts and administrators aware of the document types in the collection or process they are managing.
Ownership Assignment. BR’s visual classification makes it possible for organizations to assign ownership of document classifications to specific business units.
Cluster ID. The unique control number that BR assigns to each visual document classification.
Classification Labeling. The BR-enabled process in which domain experts associate document-type name labels with visual classifications, e.g., “Invoice.”
Document-Type Taxonomies. Arranging document types in a manner that makes sense to the client, e.g., document types might be arranged by originating business function at the top level, then document type, and then sub-document type as a third level.
DIFFERENTIATOR: BR permits different business units to develop their own taxonomies to reflect the different business purposes for which they use the documents. The different document type labels assigned by the different units all tie back to the same BR Cluster ID.
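As an illustration of several taxonomies tying back to one Cluster ID, the mapping can be pictured as a per-unit lookup keyed by that ID. The business unit names, cluster ID, and labels below are invented for the example.

```python
# Hypothetical data: each business unit keeps its own label for a cluster ID.
labels_by_unit = {
    "Accounts Payable": {1041: "Finance / Invoice / Vendor Invoice"},
    "Legal": {1041: "Contracts / Supporting Evidence / Invoice"},
}


def labels_for_cluster(cluster_id, taxonomy=labels_by_unit):
    """Return every unit's label for one cluster ID."""
    return {
        unit: labels[cluster_id]
        for unit, labels in taxonomy.items()
        if cluster_id in labels
    }
```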
Classification-Dependent Initiatives. Actions or functions that are dependent on the correct classification of documents, e.g., records retention, information management, attribute extraction, and searching.
Attribute Extraction or Visual Coding. Extracting data values from within visual classifications or clusters, generally by clicking and dragging on a visual representation of a document to identify data elements of interest, and providing a field label to associate with that data element, e.g., “First Name.” Users can use inclusive or exclusive delimiters to select data elements and can use filters to format the output. An inclusive delimiter is included in the extracted value, e.g., “STD” as a text string used to find product standards starting with those letters. An exclusive delimiter is not included in the output, e.g., “NAME” used to identify names. BR can use non-textual glyphs as delimiters to help locate data elements.
DIFFERENTIATOR: No other process, manual or automated, will permit domain experts to begin working with clusters or classifications of documents for coding purposes within 24 to 48 hours from the beginning of a project or process. Also, no other process can key off of non-textual data glyphs, making BR far more flexible and precise.
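The difference between inclusive and exclusive delimiters can be shown on plain text. This is only a sketch under simplifying assumptions: it works on a single line of already-recognized text, whereas BR's visual coding operates on document positions and glyphs. The `extract_value` function and the sample lines are hypothetical.

```python
def extract_value(line, delimiter, inclusive, output_filter=str.strip):
    """Extract the text that follows (or starts with) a delimiter.

    inclusive=True keeps the delimiter in the result; inclusive=False drops it.
    output_filter formats the raw value before it is returned.
    """
    start = line.find(delimiter)
    if start == -1:
        return None  # delimiter not present on this line
    if not inclusive:
        start += len(delimiter)
    return output_filter(line[start:])


# Inclusive: "STD" is part of the value that is returned.
standard = extract_value("Conforms to STD-4471", "STD", inclusive=True)

# Exclusive: "NAME" only locates the value; the filter trims ": " around it.
name = extract_value(
    "NAME: Jane Doe", "NAME", inclusive=False,
    output_filter=lambda raw: raw.strip(" :"),
)
```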
Authority Lists. A listing of the terms that are approved for use in an organization to describe certain things or attributes, e.g., well numbers, titles.
DIFFERENTIATOR: BR’s visual coding process can quickly identify all of the terms that are actually used in an organization’s documents to label or identify certain things. The BR list can be used to update the organizational authority lists. BR can also provide translation tables that will convert the “as found” terms to the preferred terms.
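A translation table of this kind can be sketched as a simple lookup from “as found” terms to preferred terms. The job titles and function names below are made up for illustration; the sketch also lists terms that appear in documents but are neither approved nor mapped, which is the kind of input that could feed an authority-list update.

```python
from collections import Counter


def terms_in_use(as_found_terms):
    """Count how often each term actually appears."""
    return Counter(as_found_terms)


def to_preferred(term, translation_table):
    """Map an as-found term to its preferred form, or leave it unchanged."""
    return translation_table.get(term, term)


def unmapped_terms(as_found_terms, translation_table, authority_list):
    """Terms that are neither approved nor covered by the translation table."""
    return sorted(
        {t for t in as_found_terms
         if t not in translation_table and t not in authority_list}
    )


# Hypothetical sample data.
as_found = ["VP", "Vice Pres.", "Vice President", "VP", "V.P."]
authority_list = {"Vice President"}
translation_table = {"VP": "Vice President", "Vice Pres.": "Vice President"}
```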
Negation Logic. The ability of BR to cause certain glyphs or patterns to be dropped from visual representations of documents, e.g., dropping the watermarks on birth certificates, or dropping the preprinted portions of forms.
DIFFERENTIATOR: No other commercially available software can do this.
Multi-Level Glyph Recognition. In most cases, when BR uses the term “glyph” it refers to single-character glyphs. However, BR is also able to treat words, sentences, and paragraphs as glyphs, either in terms of the visual representations of those units or the patterns formed by the underlying textual values. Typical uses for this capability include looking for documents that share common elements with other documents, or, when combined with negation logic, reviewing large collections of contracts while being sure to have reviewed each unique paragraph (once a clause has been reviewed, it can “disappear” from as-yet-unreviewed agreements).
DIFFERENTIATOR: No other commercially available software can do this.
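The contract-review use case can be sketched using the underlying textual values of paragraphs. This is an assumption-laden simplification: paragraphs are matched by normalized text rather than by visual appearance, and the function names and contract text are hypothetical.

```python
def normalize(paragraph):
    # Collapse case and whitespace so trivially different copies match.
    return " ".join(paragraph.lower().split())


def mark_reviewed(paragraph, reviewed):
    """Record that a paragraph has been reviewed once."""
    reviewed.add(normalize(paragraph))


def remaining_paragraphs(contracts, reviewed):
    """Per contract, keep only paragraphs that have not been reviewed yet."""
    return {
        name: [p for p in paragraphs if normalize(p) not in reviewed]
        for name, paragraphs in contracts.items()
    }


contracts = {
    "A": ["Governing law is Texas.", "Fee is $10."],
    "B": ["governing  law is texas.", "Fee is $12."],
}
reviewed = set()
mark_reviewed("Governing law is Texas.", reviewed)
todo = remaining_paragraphs(contracts, reviewed)
```

After one review of the governing-law clause, it drops out of both contracts and only the two distinct fee clauses remain.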
Single Instance. An attribute of BR in which basic processes like text editing, document type name labeling, and visual coding only need to be performed on one instance of an object and the results are applied to the other instances of that visual classification or glyph cluster as applicable.
Persistence. An attribute of BR in which text editing, document type name labeling, and visual coding remains operative for documents that are processed later.
Logical Document Boundary Determination (“LDBD”). The process of identifying where documents begin. LDBD can be vital for scanned documents where the scanning process may not have identified document beginnings, or for processing PDF documents where document authors may have combined multiple documents without indicating which pages begin documents.
Page-Level Visual Clustering. Clustering each page in a collection based on its visual appearance. Clients can review page-level visual clusters to determine which clusters begin documents, and those determinations can be used to perform LDBD on the collection or process.
DIFFERENTIATOR: BR’s LDBD using page-level clustering has been evaluated by clients and found to provide more consistent and reliable document boundary determinations than those provided by manual review operations. With BR, scanning operations do not need to insert scanning sheets or have scanner operators press special function keys to indicate document boundaries. The labor savings and throughput improvements are appreciable.
Crazy Ivan. (From the movie The Hunt for Red October, in which the submarine commander would stop and look back to see if the sub was being followed.) The BR process of evaluating a collection with known valid document boundaries to identify the page-level visual clusters that begin documents as an aid to performing LDBD on other collections.
DIFFERENTIATOR: No other technology enables clients to make use of prior document unitization decisions. This can be an important selling point when clients have already invested significant sums trying to resolve document processing challenges and don’t want to have to basically write off those sums.
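The way page-level clusters can drive LDBD, and the way a Crazy Ivan pass can reuse earlier boundary decisions, can be sketched as follows. The cluster labels and function names are hypothetical, and the sketch assumes each cluster either always or never begins a document, which real collections may not satisfy.

```python
def learn_start_clusters(page_clusters, known_start_pages):
    """Crazy Ivan step: collect the clusters of pages known to begin documents."""
    return {page_clusters[index] for index in known_start_pages}


def split_documents(page_clusters, start_clusters):
    """LDBD step: start a new document at every page in a start cluster."""
    documents = []
    for index, cluster in enumerate(page_clusters):
        if cluster in start_clusters or not documents:
            documents.append([])
        documents[-1].append(index)
    return documents


# Collection with known boundaries: documents start at pages 0, 3 and 5.
reference_pages = ["A", "B", "B", "C", "D", "A", "B"]
start_clusters = learn_start_clusters(reference_pages, [0, 3, 5])

# New collection with no boundaries, unitized using the learned clusters.
new_pages = ["C", "D", "D", "A", "B", "C"]
documents = split_documents(new_pages, start_clusters)
```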
Segment or Chunk. Documents that are submitted for BR processing at the same time, typically about 200,000 pages.
Convergence. A measurement of the extent to which the documents in a segment fall into visual classifications or clusters which the client has already reviewed, and, if desired, applied document type name labels and performed visual coding. Over time convergence can approach and even equal 100%.
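One plausible way to compute convergence for a segment is the share of its documents that fall into already-reviewed clusters. The formula is an assumption based on the definition above, not a documented BR calculation, and the function name is hypothetical.

```python
def convergence(segment_cluster_ids, reviewed_cluster_ids):
    """Percentage of documents in a segment whose cluster was already reviewed."""
    if not segment_cluster_ids:
        return 0.0
    hits = sum(1 for cid in segment_cluster_ids if cid in reviewed_cluster_ids)
    return 100.0 * hits / len(segment_cluster_ids)


# Four documents; three fall into clusters 1 and 2, which were already reviewed.
example = convergence([1, 1, 2, 3], {1, 2})
```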
Remediation. The process of removing unwanted files from document collections, e.g., duplicates, non-records, documents created before a certain period, or documents owned by a particular person or business unit.
Migration. The process of moving documents to another environment, often accompanied by remediation and the extraction of additional document attributes or fielded data to enhance retrievability of the documents.
Hash Duplicates. Files that are bit-for-bit the same as identified by hashing algorithms such as MD5 or SHA.
Visual Duplicates. Documents that are visually indistinguishable from one another, e.g., a Word document and the PDF of the document created from within Word.
DIFFERENTIATOR: Hash values are too sensitive to inconsequential differences in documents and fail to group or associate documents that are for all practical purposes the same, e.g., the same document in Word 2003 and Word 2010. BR is the only technology that can identify visual duplicates on an enterprise scale.
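The sensitivity of hash matching can be seen in a short sketch using Python's standard `hashlib`. SHA-256 is chosen here only as an example of a SHA algorithm; the file names and byte contents are invented.

```python
import hashlib
from collections import defaultdict


def group_hash_duplicates(files):
    """Group paths whose contents are bit-for-bit identical.

    files: dict mapping path -> bytes content.
    """
    groups = defaultdict(list)
    for path, content in files.items():
        groups[hashlib.sha256(content).hexdigest()].append(path)
    return [sorted(paths) for paths in groups.values() if len(paths) > 1]


files = {
    "a.doc": b"same",
    "copy/a.doc": b"same",
    "a2.doc": b"same ",  # one trailing space: a different hash, so no match
}
duplicates = group_hash_duplicates(files)
```

A single extra byte is enough to separate `a2.doc` from the other two files, even though a reader would consider all three the same document.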
Faceted Deduplication. Although BR does not carry forward copies of each instance of a duplicate, it does track the location attributes of each copy so that BR can tell where every copy was originally located.
Tape Remediation. BR process of determining which unique files are contained on tape backups, accomplished by calculating hash values of the files while they are in computer memory and comparing them to the list of hash values of files that have already been copied onto hard drives, and then copying off only files with new hash values.
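The hash comparison described for tape remediation can be sketched as a filter over file contents held in memory. The function names, the use of SHA-256, and the sample data are assumptions; reading from tape and writing to disk are left out.

```python
import hashlib


def digest(content):
    """Hash file contents while they are in memory."""
    return hashlib.sha256(content).hexdigest()


def files_to_copy(tape_files, known_hashes):
    """Return paths of tape files whose hash has not been seen before.

    tape_files: iterable of (path, bytes). known_hashes is updated in place
    so that later tapes are compared against everything copied so far.
    """
    selected = []
    for path, content in tape_files:
        file_hash = digest(content)
        if file_hash not in known_hashes:
            known_hashes.add(file_hash)
            selected.append(path)
    return selected


known = {digest(b"old")}
tape = [("t/1", b"old"), ("t/2", b"new"), ("t/3", b"new")]
copied = files_to_copy(tape, known)
```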
Email. An electronic message that is conveyed or conveyable as specified in RFC 2822, “Internet Message Format,” or its successor specifications.
Email Thread. An originating email (one that was not a forward of another email nor a reply to an earlier email) plus all subsequent replies or forwards.
Email Attachment. A file that was attached to an email.
Payload. All the email attachments associated with an email thread.
Email Cloth. Threads that are associated by matching the cluster IDs of the attachments to the various emails in the threads.
DIFFERENTIATOR: Email clothing permits the identification of groups of people who are talking about the same types of things, even if there are no common senders or recipients.
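One way to picture an email cloth is as an index from attachment cluster IDs to the threads whose payloads contain them. The thread names and cluster IDs below are invented, and the function is only a sketch of the matching idea.

```python
from collections import defaultdict


def email_cloths(thread_payloads):
    """Associate threads that share an attachment cluster ID.

    thread_payloads: dict mapping thread ID -> set of cluster IDs found in
    that thread's payload. Returns cluster ID -> sorted thread IDs, keeping
    only cluster IDs shared by more than one thread.
    """
    threads_by_cluster = defaultdict(set)
    for thread_id, cluster_ids in thread_payloads.items():
        for cluster_id in cluster_ids:
            threads_by_cluster[cluster_id].add(thread_id)
    return {
        cluster_id: sorted(threads)
        for cluster_id, threads in threads_by_cluster.items()
        if len(threads) > 1
    }


payloads = {"T1": {1041, 2210}, "T2": {1041}, "T3": {3300}}
cloths = email_cloths(payloads)
```

Threads T1 and T2 are linked through cluster 1041 without any reference to who sent or received the emails.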
Redaction. The process of making unreadable certain portions of a document. Manual redactions typically achieve a throughput of about 20 redactions per hour. Correctly redacting documents with an image layer and a text layer can be problematic for some systems, and occasionally other providers will redact the image layer but leave the text layer, meaning that the redacted terms may be recoverable.
BeyondRedaction. BR process of identifying data elements or document positions that need to be redacted to satisfy court orders or statutory and regulatory requirements. BeyondRedaction can perform 700,000 redactions per hour. BeyondRedaction also provides an audit trail or log of each redaction that was made, including the terms that were redacted, the location coordinates of those terms and the reason for the redaction.
DIFFERENTIATOR: Very few document collections have documents with 100% accurate text. BR’s ability to analyze and group documents based on visual appearance provides far greater assurance that all the terms that need to be redacted are in fact redacted.
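A redaction log entry carrying the items listed above (term, location coordinates, reason) might look like the record below, and the same entries can be applied to a text layer so that redacted terms are not left recoverable. The field names, coordinate units, and the text-replacement approach are assumptions for illustration, not a description of how BeyondRedaction works.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RedactionLogEntry:
    document: str
    page: int
    term: str
    box: tuple  # (x, y, width, height); units are hypothetical
    reason: str


def redact_text_layer(text, entries):
    """Overwrite each logged term in the text layer with same-length filler."""
    for entry in entries:
        text = text.replace(entry.term, "X" * len(entry.term))
    return text


log = [
    RedactionLogEntry(
        document="claim-001.pdf", page=1, term="123-45-6789",
        box=(72.0, 540.0, 90.0, 12.0), reason="Social Security number",
    )
]
clean_text = redact_text_layer("SSN: 123-45-6789", log)
```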
Collector. A physical USB device provided by BR that attaches to a client’s network and collects electronic documents for further processing.
DIFFERENTIATOR: The collector does not collect software related files as identified on the list maintained by the National Institute of Standards and Technology (“NIST”), and does not collect more than one copy of files that have the same hash values. As a result documents are collected far more quickly and more efficiently than with other processes.
Terminator. BR technology that eliminates files found to be duplicates, or non-records, or which have been migrated to another platform for ongoing management.
Document Factory. A BR business model for processing paper-based documents that involves an embedded scanning unit characterized by an annual requirements contract with cost plus 15% pricing on image capture, and with charges for glyph recognition and visual coding based only on documents that are actually retained as records after visual deduping.
DIFFERENTIATOR: Many scanning operators charge three times labor for image capture and then charge separately to perform coding on all documents to determine whether they have to be retained as records. BR’s pricing model results in far lower overall cost with far faster, more accurate processing.
BR Deliverables. BR can provide the following deliverables in virtually any format:
- Fielded Visual Coding Values – as a delimited text file, in an XML file, or as values embedded in PDF versions of the documents.
- Text – as text-only files, as part of an XML deliverable including visual coding values, and as a text layer in PDF files.
- Visual Classifications – BR can provide files that provide the original path/file name for all the documents in each classification.
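As a rough illustration of two of these formats, fielded values can be written as a delimited text file or as XML with Python's standard library. The pipe delimiter, element names, and sample record are assumptions, not a BR delivery specification.

```python
import csv
import io
import xml.etree.ElementTree as ET


def to_delimited(records, fieldnames, delimiter="|"):
    """Render records as delimited text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fieldnames, delimiter=delimiter, lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def to_xml(records):
    """Render records as XML, one <document> per record."""
    root = ET.Element("documents")
    for record in records:
        document = ET.SubElement(root, "document", path=record["path"])
        for name, value in record.items():
            if name != "path":
                # Field labels may contain spaces, so they go in an attribute.
                field = ET.SubElement(document, "field", name=name)
                field.text = value
    return ET.tostring(root, encoding="unicode")


records = [{"path": "ap/0001.pdf", "Cluster ID": "1041", "First Name": "Jane"}]
delimited = to_delimited(records, ["path", "Cluster ID", "First Name"])
xml_text = to_xml(records)
```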
Glyph Search. The ability to search for a glyph independent of whether there is a text label associated with it.
Video File Comparison. Comparing video files by extracting and comparing one frame per second for the first five minutes of a video. Identifies visually identical video files regardless of differences in frame rates or resolutions.
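Sampling by time rather than by frame number is what makes the comparison independent of frame rate. A minimal sketch of choosing which frames to extract follows; the function name is hypothetical, and decoding and comparing the frames themselves is out of scope here.

```python
def sample_frame_indices(fps, duration_seconds, window_seconds=300):
    """Frame index nearest to each whole second in the first five minutes."""
    seconds = min(int(duration_seconds), window_seconds)
    return [round(second * fps) for second in range(seconds)]


# The same timestamps (0 s, 1 s, 2 s) are sampled at different frame rates.
frames_24 = sample_frame_indices(24, 3)
frames_30 = sample_frame_indices(30, 3)
```

Two encodings of the same video yield different frame indices but the same moments in time, so the extracted frames can be compared one for one.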
Code Ownership. BR owns the source code for its technology and can provide flexible, customized applications.
Linear Review. The traditional process of processing documents by manually examining each of them one at a time.