At one time, sextants were cutting-edge technology. By enabling sailors to measure the angle between the horizon and a heavenly body, sextants let them determine how far north or south of the equator they were. While that was useful information, sextants didn’t tell seamen how far east or west they were, with sometimes tragic consequences: ships occasionally ran aground at night or in bad weather when they were closer to shore than the sailors realized.

The subsequent development of chronometers enabled sailors to determine how far east or west they were of the Prime Meridian, an imaginary north-south line that runs through Greenwich, England. For example, by keeping the chronometer set to Greenwich time, sailors would know how many hours past noon it was in Greenwich when they observed local noon, the moment when the sun was highest and shadows were shortest. Because the earth rotates 15 degrees per hour, that time difference told them how many degrees west of the Prime Meridian they were.
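The arithmetic behind that reading can be sketched in a few lines of Python. This is a simplified illustration, not a navigation tool: the function name and sign convention are assumptions made here, and it ignores refinements such as the equation of time (the seasonal gap between sundial time and clock time).

```python
DEGREES_PER_HOUR = 360 / 24  # the earth turns 15 degrees each hour


def longitude_from_chronometer(greenwich_time_at_local_noon: float) -> float:
    """Estimate longitude from the chronometer reading at local noon.

    greenwich_time_at_local_noon: Greenwich time in decimal hours (0-24)
    shown on the chronometer when the sun is highest locally.
    Returns degrees of longitude; positive means west of Greenwich,
    negative means east (a convention chosen for this sketch).
    """
    hours_after_greenwich_noon = greenwich_time_at_local_noon - 12.0
    return hours_after_greenwich_noon * DEGREES_PER_HOUR


# Chronometer reads 15:00 Greenwich time at local noon: 3 hours behind Greenwich.
west = longitude_from_chronometer(15.0)  # 45.0 degrees west
```

A reading earlier than 12:00 gives a negative result, meaning the ship is east of Greenwich.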

Only when sailors had both latitude and longitude could they plot the intersection of those coordinates and know precisely where they were. Of course, today GPS receivers using satellite positioning are far more precise, as well as inexpensive and widely available.


Text analytics is in many ways like a sextant: It helps analyze one type of data, and at one point in time it provided the best information available. However, subsequent technology is now augmenting and overtaking simple text analytics.

Limitations of Textual Analysis

Sextants were of little value under dense fog or overcast skies, and text analytics is of little value when conditions are less than optimal, e.g.:

  • Missing text. Files without text layers are essentially invisible for text analytics purposes.
  • Incorrect document boundaries. Text analytics works poorly when file boundaries don’t match document boundaries, e.g., when multiple documents are included within one file, or each page is a separate file.
  • Multiple languages. Similar documents that are in different languages generally won’t be grouped together using text.
  • Non-transparency. Most users don’t understand how text analytics works, and its functioning can vary from collection to collection.

All of this is compounded by the cumulative bias of a text-based approach. For example, if keyword searching is used to select documents to analyze, the initial selection process will have an extreme bias against no-text or poor-text-quality documents. And even with perfect text, full-text searching often returns numerous false hits while missing many relevant documents.
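A small Python sketch makes the selection bias concrete. The file names, sample text, and the `keyword_select` helper are all hypothetical, invented for illustration; real review platforms are more sophisticated, but the blind spots are the same in kind.

```python
from typing import Optional


def keyword_select(docs: dict[str, Optional[str]], keyword: str) -> list[str]:
    """Return the ids of documents whose extracted text contains the keyword.

    docs maps a document id to its extracted text, or None when the
    file has no text layer at all.
    """
    selected = []
    for doc_id, text in docs.items():
        # A document with no text can never match, whatever it shows.
        if text and keyword.lower() in text.lower():
            selected.append(doc_id)
    return selected


# Hypothetical collection: three contracts and one memo.
docs = {
    "contract_scan.tif": None,                          # image only, no text layer
    "contract_ocr.pdf": "C0ntr4ct f0r s3rvices",        # poor-quality OCR text
    "contract.docx": "Contract for services",           # clean text
    "memo.docx": "Please review the contract template", # mentions the word only
}

hits = keyword_select(docs, "contract")
```

Here `hits` contains only `contract.docx` and `memo.docx`: two of the three contracts are missed because of missing or poor text, and the memo is a false hit.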

The New Analytical Framework: Visual Appearance

New technology is enabling another useful framework for the analysis of documents: their visual appearance. In this approach, document grouping and classification is based on thumbnail-type views in which individual words are not distinguishable.
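To give a feel for the idea, here is a toy Python sketch of grouping pages by coarse appearance. It is an assumption-laden illustration, not how any particular product works: the block-averaged thumbnail, the distance measure, the greedy grouping, and the 0.1 threshold are all choices made here for simplicity.

```python
Page = list[list[float]]  # grayscale pixels: 0.0 = black, 1.0 = white


def thumbnail(page: Page, size: int = 4) -> list[float]:
    """Average the page into a size x size grid, blurring away word-level detail."""
    rows, cols = len(page), len(page[0])
    cells = []
    for r in range(size):
        for c in range(size):
            r0, r1 = r * rows // size, (r + 1) * rows // size
            c0, c1 = c * cols // size, (c + 1) * cols // size
            block = [page[y][x] for y in range(r0, r1) for x in range(c0, c1)]
            cells.append(sum(block) / len(block))
    return cells


def distance(a: list[float], b: list[float]) -> float:
    """Mean absolute difference between two thumbnails."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def group_by_appearance(pages: dict[str, Page], threshold: float = 0.1) -> list[list[str]]:
    """Greedily put each page in the first group whose first member looks similar."""
    clusters: list[tuple[list[float], list[str]]] = []
    for name, page in pages.items():
        thumb = thumbnail(page)
        for representative, members in clusters:
            if distance(representative, thumb) <= threshold:
                members.append(name)
                break
        else:
            clusters.append((thumb, [name]))
    return [members for _, members in clusters]


def make_page(dark_rows=(), dark_cols=(), size: int = 8) -> Page:
    """Build a synthetic page with dark bands standing in for layout features."""
    return [[0.0 if (y in dark_rows or x in dark_cols) else 1.0
             for x in range(size)] for y in range(size)]


letter_a = make_page(dark_rows=(0, 1))   # dark letterhead band across the top
letter_b = make_page(dark_rows=(0, 1))
letter_b[5][5] = 0.0                     # same layout, slightly different content
form = make_page(dark_cols=(0, 1))       # dark label column down the left side

groups = group_by_appearance({"letter_a": letter_a, "letter_b": letter_b, "form": form})
```

The two letters land in one group and the form in another, even though no text was read at any point.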

Here are some advantages of visual classification:

  • Groups of visually-similar documents can have consistent classification labels assigned. This enables precise retrieval based on document type, avoiding much of the need for full text search.
  • The technology learns what beginnings of documents look like and can assign document boundaries to multi-document files and single-page files.
  • Visual classification is language- and text-agnostic for classification purposes.
  • The automatic, self-forming clusters of visually similar documents avoid the substantial project setup and maintenance problems associated with text-based classification.
  • The technology is scalable to millions of documents per day.
  • Reusable intelligence. The classifications and labels can be applied to other document sets, repurposing the initial work in classification and labelling.

Just as chronometers and eventually GPS augmented and largely replaced sextants, visual classification is augmenting and, for some purposes, supplanting text analytics.


For more information on managing unstructured content, you can download your free, personal-use copy of my book, Guide to Managing Unstructured Content, at:

http://beyondrecognition.net/download-john-martins-guide-to-managing-unstructured-content/

Further reading:

Dava Sobel, Longitude: The True Story of a Lone Genius Who Solved the Greatest Scientific Problem of His Time, https://www.amazon.com/Longitude-Genius-Greatest-Scientific-Problem-ebook/dp/B003WUYE66
