Negation is a powerful new tool used to identify high-value words or graphical elements in documents, detect patterns across document types, and add a new dimension to Boolean logic.
The idea is simple: within clusters of visually-similar documents, the words and graphical elements differentiating one document from another are the ones that don’t occur in the same places on all of the documents. Disregarding or dropping off the recurring items leaves only the high-value items. Three examples:
1. Loan Application Files
Each document in a loan-application cluster will show what was on the original forms. By dropping the graphical elements associated with the blank loan applications, what is left will be the information filled in by the various applicants plus approval stamps and signatures – i.e., the information of the most value for many purposes.
To illustrate this, assume applicants had to provide copies of their IRS Form 1040’s. This is what the blank form looks like before being filled in:
This is a filled-in version. Note that much of the document image for the filled-in form is the same as the image of the blank form:
After negating the recurring elements of the 1040’s, what is left is the unique content that was filled in on each form:
Once only the highest-value terms are left, they can be used to group documents across clusters and across document types to find different patterns. In the loan application example, a borrower’s name, address, social security number, and phone number would be among the high-value words left after the forms were negated. The high-value words from one cluster can be used to locate other clusters where they also appear. For example, the borrower’s name, address, and phone number would also appear in credit check authorizations, utility bills, magazine subscriptions, driver’s licenses, employment applications, etc.
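The two steps above (dropping recurring elements within a cluster, then using what is left to find other clusters) can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: each document is modeled as a set of (text, position) pairs, and the function names, the `min_share` parameter, and the sample data are all hypothetical, not part of any BeyondRecognition product.

```python
from collections import Counter

def negate_cluster(docs, min_share=1.0):
    """docs maps doc_id -> set of (text, position) elements.
    Drops every element that occurs at the same position in at least
    min_share of the documents and returns doc_id -> remaining elements."""
    counts = Counter(el for elements in docs.values() for el in elements)
    threshold = min_share * len(docs)
    recurring = {el for el, n in counts.items() if n >= threshold}
    return {doc_id: elements - recurring for doc_id, elements in docs.items()}

def clusters_containing(terms, clusters):
    """Return the names of clusters where some document contains every term."""
    hits = []
    for name, docs in clusters.items():
        for elements in docs.values():
            words = {text for text, _ in elements}
            if terms <= words:
                hits.append(name)
                break
    return hits

# Made-up sample data: positions are (row, column) slots on a form.
loan_apps = {
    "app1": {("Name:", (0, 0)), ("Jane Doe", (0, 1)),
             ("SSN:", (1, 0)), ("000-00-0001", (1, 1))},
    "app2": {("Name:", (0, 0)), ("John Roe", (0, 1)),
             ("SSN:", (1, 0)), ("000-00-0002", (1, 1))},
}
utility_bills = {
    "bill1": {("Account holder", (0, 0)), ("Jane Doe", (0, 1))},
}

remaining = negate_cluster(loan_apps)
high_value = {text for text, _ in remaining["app1"]}
found = clusters_containing(
    {"Jane Doe"}, {"loan_apps": loan_apps, "utility_bills": utility_bills}
)
print(high_value)
print(found)
```

Here the form labels "Name:" and "SSN:" are negated because they sit in the same place on every application, while the borrower's name survives and leads to the utility-bill cluster.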
2. Negation Applied to En Masse Contract Reviews
Negation can be applied at the level of the glyph or graphical element, or at the word, paragraph, or page level. For example, if an organization wanted to undertake a review of all of its agreements, negation could be applied at the paragraph level so that once a particular paragraph was reviewed and approved, it would be suppressed on the remaining agreements. After all, if the same limitations of damages or venue provisions are used in most or all of the agreements, there is no value in repeatedly examining each use of them. Those common provisions become clutter that detracts from identifying and analyzing clauses that are exceptions to generally used provisions.
The review can be optimized by presenting agreements ranked according to how many negations can be achieved by reviewing them.
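One way to picture paragraph-level negation and the ranking step is the sketch below. It assumes paragraphs match after simple whitespace and case normalization, which is far cruder than matching on visual similarity, and the scoring rule (count how many paragraph occurrences in other agreements would be suppressed) is one plausible reading of "how many negations can be achieved." All names and sample clauses are hypothetical.

```python
def normalize(paragraph):
    """Collapse whitespace and case so trivially different copies match."""
    return " ".join(paragraph.lower().split())

def unreviewed(paragraphs, approved):
    """Paragraphs still needing review once approved ones are suppressed."""
    return [p for p in paragraphs if normalize(p) not in approved]

def rank_by_negations(agreements, approved):
    """Rank agreements by how many paragraph occurrences in OTHER agreements
    would be suppressed if this agreement's new paragraphs were approved."""
    scores = {}
    for name, paras in agreements.items():
        new = {normalize(p) for p in paras} - approved
        scores[name] = sum(
            1
            for other, other_paras in agreements.items()
            if other != name
            for p in other_paras
            if normalize(p) in new
        )
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

agreements = {
    "A": ["Venue is Texas.", "Damages are limited to fees paid."],
    "B": ["Venue is Texas.", "Damages are limited to fees paid.",
          "Supplier grants an exclusive license."],
    "C": ["Venue is  texas.", "Damages are unlimited."],
}

ranking = rank_by_negations(agreements, approved=set())

# After a reviewer approves everything in agreement A, only the
# exceptions remain visible in the other agreements.
approved = {normalize(p) for p in agreements["A"]}
left_in_b = unreviewed(agreements["B"], approved)
left_in_c = unreviewed(agreements["C"], approved)
print(ranking, left_in_b, left_in_c)
```

In practice the ranking would be recomputed after each review, since every approval changes which paragraphs are still new.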
3. Overcoming Brittleness of Boolean Search Logic (Adding “Disregard” to “Include” and “Exclude”)
Negation can also be used in search logic to overcome the inherent over- or under-inclusiveness of Boolean logic. Boolean logic either includes documents because search conditions are met (AND or OR), or excludes them (NOT or XOR). For example, if a researcher wanted to find documents in a collection taken from a forest products company, they might want documents that contained “pine” but might not want to see documents just because they contained the phrase “Bill Pine.” Searching for “pine NOT ‘Bill Pine’” could exclude documents that had “pine” just because they also contained “Bill Pine.”
With negation, the phrase “Bill Pine” could be disregarded, meaning that it would not contribute to the inclusion or exclusion of documents containing those words. The search logic could be applied to documents as if “Bill Pine” never occurred in them. Negation essentially adds the ability to “disregard” terms prior to applying Boolean logic’s original “include” and “exclude” operations.
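The difference between "exclude" and "disregard" can be shown with a small text-only sketch. It assumes plain strings and simple case-insensitive matching; the function names and the three sample documents are invented for illustration.

```python
import re

def contains(text, term):
    """True if term occurs in text as a whole word or phrase (case-insensitive)."""
    pattern = r"\b" + re.escape(term) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

def disregard(text, phrases):
    """Blank out each disregarded phrase before any search logic runs."""
    for phrase in phrases:
        text = re.sub(re.escape(phrase), " ", text, flags=re.IGNORECASE)
    return text

def search_boolean_not(docs, include, exclude):
    """Classic Boolean search: include AND NOT exclude."""
    return [doc_id for doc_id, text in docs.items()
            if contains(text, include) and not contains(text, exclude)]

def search_with_negation(docs, include, disregarded):
    """Apply the include test as if the disregarded phrases never occurred."""
    return [doc_id for doc_id, text in docs.items()
            if contains(disregard(text, disregarded), include)]

docs = {
    "d1": "Bill Pine approved the shipment of pine lumber.",
    "d2": "Memo from Bill Pine about vacation schedules.",
    "d3": "Pine and fir inventory for the third quarter.",
}

boolean_hits = search_boolean_not(docs, "pine", "Bill Pine")
negation_hits = search_with_negation(docs, "pine", ["Bill Pine"])
print(boolean_hits)
print(negation_hits)
```

The Boolean NOT query loses d1, which discusses pine lumber but happens to mention Bill Pine. Disregarding the phrase keeps d1 and still drops d2, where the name is the only occurrence of "pine."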
Simple But Not Easy
The concept behind negation is simple and straightforward. What is difficult is the implementation. It requires cataloging the individual glyphs or graphical elements even when the original documents come in multiple dots-per-inch resolutions; e.g., the system must be able to match documents scanned at 300 dpi to faxes transmitted at a much lower resolution, and match both to original native files. In other words, the BeyondRecognition system has to look at the geometrical relationships among the glyphs or graphical elements, not just the pixels themselves.
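To give a rough sense of what comparing geometrical relationships rather than pixels can mean, the toy sketch below rescales glyph positions to the bounding box of the page content, so the same layout compares equal whatever the scan resolution. This is an assumption-laden illustration, not a description of how BeyondRecognition works: glyph labels are taken as already known, and the function names, tolerance, and coordinates are hypothetical.

```python
def relative_layout(glyphs):
    """glyphs: list of (label, x, y) in pixels.
    Returns positions rescaled to a 0..1 box spanning the glyphs, so the
    result does not depend on horizontal or vertical resolution."""
    if not glyphs:
        return []
    xs = [x for _, x, _ in glyphs]
    ys = [y for _, _, y in glyphs]
    width = (max(xs) - min(xs)) or 1   # avoid dividing by zero
    height = (max(ys) - min(ys)) or 1
    return sorted(
        (label, (x - min(xs)) / width, (y - min(ys)) / height)
        for label, x, y in glyphs
    )

def same_layout(a, b, tol=0.02):
    """True if both glyph lists share labels and relative positions.
    Pairing by sorted order is a simplification that suits small examples."""
    ra, rb = relative_layout(a), relative_layout(b)
    if len(ra) != len(rb):
        return False
    return all(
        la == lb and abs(xa - xb) <= tol and abs(ya - yb) <= tol
        for (la, xa, ya), (lb, xb, yb) in zip(ra, rb)
    )

# The same three glyphs as seen in a 300 dpi scan and in a fax whose
# horizontal and vertical resolutions differ (204 x 98 dpi here).
scan_300 = [("A", 300, 300), ("B", 900, 300), ("C", 300, 1200)]
fax_low = [("A", 204, 98), ("B", 612, 98), ("C", 204, 392)]
other_form = [("A", 300, 300), ("B", 900, 300), ("C", 600, 1200)]

print(same_layout(scan_300, fax_low))
print(same_layout(scan_300, other_form))
```

A production system would also have to cope with skew, noise, cropping, and glyphs that are missing or misread, none of which this sketch attempts.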
Perhaps the biggest challenge is to be able to perform all those calculations and operations at enterprise scale so that tens or hundreds of millions of documents can be compared. BeyondRecognition has that scalability.
For more information on how visual classification and negation can be applied in your organization, contact us at IGDoneRight@BeyondRecognition.net or submit information on the form to the right of this blog.