It can cost over $8,000 to originate a mortgage and take a month to complete the processing. A good part of the cost and delay is attributable to the use of non-delineated, multipage and non-searchable TIF/PDF “Blobs” to store scanned or faxed copies of the underlying documents. This is an overview of problems associated with storing loan documentation in blobs, and how they can be overcome.
Here are potential problems with using Blobs to store scanned or faxed documents:
1 – Imprecise pointers. Database entries in loan tracking systems should all be supported by loan file documents. However, databases can typically point only to an entire Blob, not to individual documents or pages stored in the Blob. This makes it difficult to find the documentation supporting specific data points. Loan analysts and underwriters may have to review the hundreds of pages that can accumulate in a loan file, much like manually browsing through large stacks of loose paper. This is a real problem when specific loan documentation has to be produced, for example:
- The borrower defaults and key documents must be analyzed by lawyers before foreclosure proceedings can be instituted.
- Fannie Mae, the VA or some other agency wants to audit loans.
- The lender has to perform post-closing QA on their loans.
2 – Missing, incorrect, or inconsistent document classifications. People familiar with a given loan tracking system and underlying mortgage documents will know the types of documents from which certain data elements are usually extracted. They will also know which document types are needed for certain tasks such as foreclosure review. However, document classification issues make it difficult to find these documents:
- Unclassified documents. Documents can be added to a blob without being classified.
- Misclassified documents. Classifications are often wrong. When users look for one document type they miss some documents but see others that should have been given a different classification.
- Inconsistent classifications. Some classifications are “correct” in that they accurately describe a document, but they are unusable because they are not consistently applied or normalized. For example, an IRS Form 1099 could be given any of the following labels:
  - Federal 1099
  - IRS 1099
  - US IRS 1099
  - Fed IRS 1099
  - Treasury 1099
  - Miscellaneous Income
3 – Incorrect document units. Blobs typically contain multi-page TIF or PDF files, each of which may contain multiple documents. In other cases, folders contain single-page TIF or PDF files, where several files may be pages of the same document. The multiple-document files need to be split apart, and the single-page files need to be grouped together into documents, to provide a solid foundation for classification.
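The inconsistent labels described in item 2 are commonly tamed with a lookup table that maps every variant to one canonical name. The following is a minimal Python sketch of that idea only. The canonical label "IRS Form 1099", the `ALIASES` table, and the `normalize_label` function are illustrative assumptions, not part of any particular loan tracking system.

```python
import re
from typing import Optional

# Hypothetical canonical label; the keys are the variants listed above,
# lowercased and stripped of punctuation.
CANONICAL_1099 = "IRS Form 1099"
ALIASES = {
    "federal 1099": CANONICAL_1099,
    "irs 1099": CANONICAL_1099,
    "us irs 1099": CANONICAL_1099,
    "fed irs 1099": CANONICAL_1099,
    "treasury 1099": CANONICAL_1099,
    "miscellaneous income": CANONICAL_1099,
}


def normalize_label(raw: str) -> Optional[str]:
    """Map a free-form label to its canonical name, or None if unknown."""
    # Lowercase and collapse punctuation/whitespace runs into single spaces
    # so that "Fed. IRS-1099" and "fed irs 1099" compare equal.
    key = re.sub(r"[^a-z0-9]+", " ", raw.lower()).strip()
    return ALIASES.get(key)


if __name__ == "__main__":
    for label in ["Treasury 1099", "  Fed. IRS-1099 ", "W-2"]:
        print(repr(label), "->", normalize_label(label))
```

A table like this only helps with variants someone has already seen and recorded. Unknown labels come back as `None` and still need review.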
Limitations of Text Analytics
Companies that automate the loan process typically try to use text analytics to classify loan documents. This is only partially successful for several reasons:
- Non-textual documents. Many documents in loan files are faxed or scanned copies of signed originals, and they often do not yield accurate text from OCR processing.
- Incorrect document boundaries. As noted above, blob files can contain multi-document or single-page TIF/PDF files where one file does not equal one document. Text analytics does a poor job of establishing correct document boundaries. For that reason, manual document unitization has historically been required before classification has any hope of working. This is expensive, causes delays in processing, and creates privacy and security risks for the loan files.
- Inherent text limitations. Even with perfect text and with perfect document unitization, text analytics does a poor job of classifying files. It takes extensive upfront effort to write text-based rules to classify documents or to select exemplars for use by text analytics engines. With either approach, even extensive tuning will not avoid false positives and false negatives. In fact, classification problems become worse with text-based systems as the size of the collection increases.
New Technology and Approach
Visual classification technology resolves multiple document issues associated with blob storage. The technology is a kind of facial recognition for documents: it groups documents based on their visual appearance rather than on text analytics.
Visual classification mimics what someone could do if handed a stack of Russian-language documents and told to put them in piles of similar-looking documents. Even if that person couldn’t read Cyrillic, documents could be grouped with other visually-similar documents, e.g., closing checklists could be put in one pile, inspection reports in another, bank statements in another, Truth-In-Lending disclosures in another, and so on.
Once documents are placed in visually-similar clusters, those clusters can be easily labeled or classified. Here is how visual classification resolves the various blob-related document challenges:
Consistent Classification without Text Analytics
The grouping of visually-similar documents is objective in nature. The clustering is automatic and does not involve a front-loaded effort to guess what document types might be present and how to define them. The clustering remains constant over the life of the project or process; it is figuratively the North Star of document management. Any subsequent document that falls in a cluster gets the label you applied to that cluster. The only documents needing new classifications are those that form new clusters. In fact, the effort required to maintain classifications shrinks as the collection grows, because more documents fall into already-established clusters.
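The label-inheritance idea can be illustrated with a short Python sketch. It assumes the visual clustering has already been done by some engine and that each document arrives with a cluster ID. The function name `apply_cluster_labels`, the cluster IDs, and the sample labels are all hypothetical, and the sketch does not attempt to show how visual similarity itself is computed.

```python
from typing import Dict, Iterable, List, Optional, Tuple


def apply_cluster_labels(
    docs: Iterable[Tuple[str, str]],
    cluster_labels: Dict[str, str],
) -> Tuple[Dict[str, Optional[str]], List[str]]:
    """Give each document the label of its cluster.

    docs: (doc_id, cluster_id) pairs, with cluster IDs assumed to come
          from an upstream visual clustering step.
    Returns the per-document labels plus the list of clusters that have
    no label yet and therefore need a person to name them once.
    """
    labeled: Dict[str, Optional[str]] = {}
    needs_label: List[str] = []
    for doc_id, cluster_id in docs:
        label = cluster_labels.get(cluster_id)
        labeled[doc_id] = label
        # Only a brand-new cluster creates labeling work, and only once.
        if label is None and cluster_id not in needs_label:
            needs_label.append(cluster_id)
    return labeled, needs_label


if __name__ == "__main__":
    labels = {"c1": "Closing Checklist", "c2": "Bank Statement"}
    incoming = [("d1", "c1"), ("d2", "c2"), ("d3", "c3"), ("d4", "c3"), ("d5", "c1")]
    labeled, todo = apply_cluster_labels(incoming, labels)
    print(labeled)
    print("clusters needing a label:", todo)
```

In this toy run five documents arrive but only one cluster needs a human decision, which is the effect described above: labeling work tracks the number of new clusters, not the number of documents.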
Being text independent, visual classification works even where there is no text or only poor-quality text. Faxes and scanned images are classified consistently because the geometric relationships among the graphical elements on the pages remain despite differences in scanning resolution and even with wide variances in image quality.
Visually-similar clusters become the building blocks for whatever document-related tracking system you want to design or improve. You can combine multiple clusters by giving them a common classification or label, as often happens with invoices, for example. By having one name for a cluster or set of clusters, you avoid the recurring problem with text-based systems of having multiple labels that may apply to essentially the same type of document.
Visual classification operates at the page level as well as the document level. In other words, it can cluster visually-similar pages as well as visually-similar documents. As part of the process of clustering documents, the system learns what the first pages, or beginning-of-document (“BOD”) pages, look like. Alternatively, the intelligence gained from processing other mortgage file collections can be applied to new collections. This intelligence flags the BOD pages, which mark where each document begins in multi-document or single-page PDF/TIF files.
The automated approach to identifying correct document boundaries is more accurate and more consistent than document boundaries established by operator review of image files – and certainly much faster.
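Once BOD pages have been flagged, turning a stream of pages into document units is simple bookkeeping. The Python sketch below assumes the flags are already available as one boolean per page, in page order. The function name `split_at_bod` and the convention of zero-based, inclusive page ranges are assumptions made for illustration, and the hard part (deciding which pages are BOD pages) is not shown.

```python
from typing import List, Sequence, Tuple


def split_at_bod(bod_flags: Sequence[bool]) -> List[Tuple[int, int]]:
    """Split a page sequence into documents at beginning-of-document pages.

    bod_flags[i] is True when page i was flagged as a BOD page.
    Returns (first_page, last_page) ranges, zero-based and inclusive.
    Pages before the first flagged page are kept together as one unit
    rather than dropped.
    """
    documents: List[Tuple[int, int]] = []
    start = None
    for i, is_bod in enumerate(bod_flags):
        if is_bod or start is None:
            if start is not None:
                # Close the document that was open before this BOD page.
                documents.append((start, i - 1))
            start = i
    if start is not None:
        documents.append((start, len(bod_flags) - 1))
    return documents


if __name__ == "__main__":
    # Six pages: a 3-page document, a 1-page document, a 2-page document.
    flags = [True, False, False, True, True, False]
    print(split_at_bod(flags))
```

The same function covers both cases described earlier: pages from one multi-document file are split apart, and pages from a run of single-page files, concatenated in order, are grouped together.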
Alternatives to Blobs
Once files have been split or combined as needed into logical document units and the documents have been consistently classified, there are several options on how to store and access them:
- Bookmarked PDF Loan File. The whole loan file can be stored in one PDF with bookmarks for individual documents, using document classification labels for bookmark names. The loan tracking database could point to specific pages or bookmarks within the PDF.
- Loan Folder & Documents with Naming Conventions. Another option would be to create server folders to store loan-related documents, naming the folders based on the loan numbers and borrowers’ names. Individual documents could then be placed in the folder, with files named for the loan number, borrower, document type, and date in YYYYMMDD format. The loan tracking system could point to specific documents, pages, or even page coordinates.
- Improved Blob. The properly unitized and classified documents could be stored in blobs where the unitization and classification would make it easier to work with them even if the loan tracking system couldn’t point to specific documents, pages, or page coordinates.
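For the folder-and-naming-convention option, a small helper can keep file names consistent. The Python sketch below is one possible convention, not a prescribed one. The underscore separator, the `slug` and `document_filename` helpers, and the sample loan number and borrower are all hypothetical. Only the fields (loan number, borrower, document type, and date in YYYYMMDD format) come from the description above.

```python
import re
from datetime import date


def slug(text: str) -> str:
    """Replace runs of characters that are awkward in file names with '-'."""
    return re.sub(r"[^A-Za-z0-9]+", "-", text).strip("-")


def document_filename(
    loan_number: str,
    borrower: str,
    doc_type: str,
    doc_date: date,
    ext: str = "pdf",
) -> str:
    """Build a name from loan number, borrower, document type, and date."""
    parts = [
        slug(loan_number),
        slug(borrower),
        slug(doc_type),
        doc_date.strftime("%Y%m%d"),  # YYYYMMDD sorts chronologically
    ]
    return "_".join(parts) + "." + ext


if __name__ == "__main__":
    print(
        document_filename(
            "1234567", "Jane Doe", "Truth-In-Lending Disclosure", date(2016, 6, 6)
        )
    )
```

Because the document type in the name is the normalized classification label, a loan tracking system could locate a specific document by constructing the expected file name rather than scanning a whole Blob.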
Data Extraction/Document Coding
Visual classification offers multiple advantages for extracting data values from loan documents, and this will be covered in a subsequent posting.
“Limitations of Using OCR for File Classification,” June 6, 2016, http://beyondrecognition.net/limitations-using-ocr-file-classification/