In content management, negation is the ability to focus on the items that are relevant for a particular purpose by removing irrelevant items from consideration. The basic idea is familiar to anyone who has used the Boolean operators “NOT” or “XOR” in full-text search: those operators remove irrelevant documents from the search results.
The power of negation for content management is greatly expanded when it can be applied at varying levels of granularity, especially when it can be applied to non-textual graphical elements and to document zones.
Non-Textual Glyph Negation at Document Cluster/Document Level
Being able to ignore or drop out some graphical elements can make downstream processing much more effective. For example, some official documents such as birth certificates or marriage certificates have watermarks that are intended to make forgeries or alterations more difficult. One of the unintended consequences of watermarks is that they make character recognition more difficult because adjacent characters appear to be touching, making it problematic to extract the correct text values.
One solution is to negate the watermarks, causing them to drop out of the image for purposes of extracting text values.
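A minimal sketch of the dropout idea follows. It assumes the watermark is printed lighter than the text, so a plain intensity threshold can separate the two. The pixel values, the threshold, and the function name `drop_watermark` are illustrative assumptions, not part of any particular product, and real watermarks often need more than a threshold.

```python
def drop_watermark(image, text_max=100, white=255):
    """Return a copy of a grayscale image in which every pixel lighter
    than text_max is set to white, keeping only dark, text-like pixels."""
    return [[px if px <= text_max else white for px in row] for row in image]


# Made-up values: 0 = black text ink, 180 = gray watermark, 255 = white paper.
page = [
    [255, 180, 180, 255],
    [0, 0, 180, 255],
    [255, 0, 180, 180],
]
cleaned = drop_watermark(page)
```

After the call, only the dark text pixels remain in `cleaned`, so a character recognition step would no longer see the watermark strokes between characters.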
Non-Textual Glyph Negation for Document Retrieval
When retrieving documents, non-textual objects can sometimes be used to narrow the search. For example, the results of an initial search may include many irrelevant documents that contain a particular graphical element, such as a logo in the heading or an illustration with no accompanying text, and you would like to exclude, or negate, those documents. This is essentially NOT logic applied to non-textual criteria, and it can be accomplished using glyph-based negation. For example, you could select only the documents that do not contain a particular logo.
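The NOT logic over non-textual criteria can be sketched as a simple filter. This assumes an upstream step has already recognized the graphical elements in each document and recorded them as glyph identifiers. The `glyphs` field, the identifier `acme_logo`, and the function name are hypothetical.

```python
def exclude_by_glyph(documents, negated_glyphs):
    """Keep only documents that contain none of the negated glyph IDs."""
    negated = set(negated_glyphs)
    return [doc for doc in documents if negated.isdisjoint(doc["glyphs"])]


# Hypothetical search results, each tagged with the glyphs found in it.
search_results = [
    {"id": "doc-1", "glyphs": {"acme_logo", "signature"}},
    {"id": "doc-2", "glyphs": {"signature"}},
    {"id": "doc-3", "glyphs": set()},
]
kept = exclude_by_glyph(search_results, ["acme_logo"])
kept_ids = [doc["id"] for doc in kept]
```

Here `kept_ids` holds the documents without the logo, which is the glyph equivalent of adding a NOT term to a text query.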
Zonal Coordinate Attribution at Document Cluster Level
When coding documents, i.e., extracting attributes that are visible on the document and loading them into specified fields in a content management system, it is useful to be able to click and drag to define the zones from which you want to extract values. Defining a zone by its page coordinates effectively negates any values found elsewhere on the page.
Zonal negation makes targeted value extraction much easier because the extraction rules do not have to account for all the false positives that can result from examining all the content in a document.
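As a rough illustration of zonal negation, the sketch below keeps only the words whose position falls inside a rectangle and ignores everything else. It assumes each recognized word comes with a single anchor coordinate. The `Word` type, the coordinates, and the sample values are invented for the example.

```python
from typing import NamedTuple


class Word(NamedTuple):
    text: str
    x: float  # horizontal position of the word's anchor point
    y: float  # vertical position of the word's anchor point


def words_in_zone(words, left, top, right, bottom):
    """Return the text of words whose anchor point lies inside the zone."""
    return [w.text for w in words
            if left <= w.x <= right and top <= w.y <= bottom]


# Made-up page content: two date-like values, only one inside the zone.
page_words = [
    Word("Invoice", 50, 40),
    Word("Date:", 400, 40),
    Word("2015-03-01", 460, 40),
    Word("2014-12-31", 120, 600),
]
zone_values = words_in_zone(page_words, left=440, top=20, right=560, bottom=60)
```

The second date never reaches the extraction rule, so the rule does not need any logic to tell the two dates apart.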
Identifying High-Value Variables in Document
In content management, document classification and data element extraction are sequential steps. Documents are first classified, and then attributes are extracted based on the type of document. The common word patterns that help group documents for classification become clutter when the goal is to discriminate among members of the class for search or for attribute extraction.
Take the case of contracts that were originally created using an automated document assembly program.
When there is no access to the database that held the variables, negation can be used to help identify the variables. The final executed contracts are digitized and grouped based on visual similarity, and the system can identify which visual elements occur in the same relative places in all the documents and then negate or drop them, leaving just the high-value variables.
There are three primary benefits of negating the terms from the underlying template or boilerplate:
- It is much easier to parse the variables into desired fields without having to worry about false positives occurring in the common terms.
- Having isolated the high-value variables, the system manager also has the option of storing only the high-value variables in one field.
- The text index to the documents could exclude the common terms, greatly reducing the computer resources needed to build and maintain the content and greatly reducing the number of false hits when searching the content.
Note that because the negation of common terms is positional, words that were high-value variables will be retained even if other instances of those terms were in the boilerplate. For example, if the venue provision in the contract specified Dallas, Texas, that provision gets negated, but if a signature block had the name Dallas Jones, that “Dallas” would be retained.
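Positional negation can be sketched in a few lines. The version below is a simplification: it uses line and word indices as a stand-in for page position, whereas the approach described above compares visual elements. It also assumes a non-empty cluster of at least two documents with the same layout. The function name and sample contracts are hypothetical.

```python
def negate_boilerplate(documents):
    """For each document, keep only the tokens that are NOT identical at
    the same (line, word) position in every document of the cluster."""
    def positioned(doc):
        return {(i, j): tok
                for i, line in enumerate(doc)
                for j, tok in enumerate(line.split())}

    maps = [positioned(doc) for doc in documents]
    # A position is boilerplate when every document has the same token there.
    common = {pos for pos, tok in maps[0].items()
              if all(m.get(pos) == tok for m in maps[1:])}
    return [[tok for pos, tok in sorted(m.items()) if pos not in common]
            for m in maps]


# Two made-up contracts from the same template, one line per list entry.
cluster = [
    ["Venue: Dallas, Texas", "Signed: Dallas Jones"],
    ["Venue: Dallas, Texas", "Signed: Maria Lopez"],
]
variables = negate_boilerplate(cluster)
```

The venue line is dropped from both contracts because it matches position for position, while the "Dallas" in the first signature block survives because the second contract has a different token in that position.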
Paragraph Negation – En Masse Contract Review
There are occasions when an organization needs to review all agreements of a particular type, e.g., a company acquires another company’s assets and wants to review the acquired equipment leases to evaluate conformance to certain standards. Negation can be used to view only one instance of each unique paragraph.
Assuming the contracts are all in digital form, the critical step is to identify the paragraphs in the contracts and then group duplicate paragraphs. Hash-based deduping of the text associated with paragraphs may be overly restrictive, as extra spaces, differences in line breaks or fonts, or text conversion errors can cause false mismatches, resulting in needlessly duplicative review effort.
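One way to soften the strictness of exact hashing is to normalize each paragraph before comparing it, as in the sketch below. This is a swapped-in text technique for illustration only: it absorbs extra spaces, line-break differences, and case, but it does not handle text conversion errors, which would need a similarity measure. The function names and sample leases are hypothetical, and paragraphs are assumed to be separated by blank lines.

```python
import re


def normalize(paragraph):
    """Collapse whitespace and case so layout differences do not matter."""
    return re.sub(r"\s+", " ", paragraph).strip().lower()


def unique_paragraphs(contracts):
    """Map each distinct normalized paragraph to the contracts containing it."""
    seen = {}
    for name, text in contracts.items():
        for para in re.split(r"\n\s*\n", text):
            key = normalize(para)
            if key:
                seen.setdefault(key, []).append(name)
    return seen


# Made-up leases: the first paragraph differs only in spacing and line breaks.
contracts = {
    "lease_a": "Lessee shall maintain  the equipment.\n\nTerm is 36 months.",
    "lease_b": "Lessee shall maintain\nthe equipment.\n\nTerm is 48 months.",
}
groups = unique_paragraphs(contracts)
```

A reviewer would then read each key of `groups` once, and the list of contract names shows where that paragraph appears.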
Using paragraph-level negation not only greatly reduces the volume of content to be reviewed but also results in a far more consistent review. Quality is enhanced because the reduced volume of content means the review can be performed by top-rated knowledge workers rather than by temporary labor or lower-level workers.
For more information on how to manage unstructured content, download your free, personal-use copy of Guide to Managing Unstructured Content, Practical advice on gaining control of unstructured content, at: http://beyondrecognition.net/download-john-martins-guide-to-managing-unstructured-content/