Text analytics does some remarkable things with what it’s able to see, but in one critical respect it is a giant leap backwards to the days of telegraphs and stock ticker tapes, when information was delivered on continuous strips of paper with just numbers, letters, and basic punctuation printed on them.
In those days, the long strips of paper could be cut into pieces and taped or pasted together to form pages. Nowadays, text analytics takes documents, cuts them into short strips of characters and punctuation, and assembles the short strips into continuous text strings. A ten-page document becomes the equivalent of a 100-foot-long ticker tape.
For text analytics, a word is a word. Words have a one-dimensional order with each word either in front of or behind other words on this virtual ticker tape. There are no concepts like logos, graphics, form lines, signatures, page orientation, and how things are arranged or placed on a page – the sort of thing that authors spend long hours composing and adjusting to convey the correct meaning.
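To make the ticker-tape analogy concrete, here is a minimal Python sketch of how a text-only pipeline might flatten pages into one ordered run of words. The function name `flatten_to_tokens` and the two sample documents are hypothetical illustrations, not the behavior of any particular text analytics product.

```python
import re

def flatten_to_tokens(pages):
    """Collapse a list of page strings into one flat token sequence."""
    tokens = []
    for page in pages:
        # Keep only runs of letters, digits, and apostrophes. Spacing,
        # line breaks, and page boundaries are all discarded.
        tokens.extend(re.findall(r"[A-Za-z0-9']+", page))
    return tokens

# Two hypothetical documents: the same words, laid out very differently.
letter = ["ACME Corp\n\n        Invoice\n\nTotal due: 100"]
form = ["ACME Corp Invoice", "Total due:\n100"]

# Once flattened, the two layouts cannot be told apart.
print(flatten_to_tokens(letter) == flatten_to_tokens(form))
```

In this sketch the one-page letter and the two-page form produce identical token sequences, so everything the layout conveyed is gone before any analysis begins.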
In fact, text analytics can often do some pretty useful things with the information that it is able to see. But keep in mind that it is inferring meaning from just the text it is able to see. Visual classification is able to use a much richer set of data on which to base its analysis – what documents actually look like.
Because visual classification knows what documents look like, it doesn’t get confused by differences in the formats of the files used to store and transmit the documents. Copies of the same documents in Word, PDFs saved from Word, and scanned TIF versions all look alike and get classified alike.
Some text analytics engines disregard numbers, which means that a spreadsheet is represented only by its row and column headings. Others infer meaning from sentence structure, which means that spreadsheets may be completely disregarded, along with PowerPoint slides and numerous other categories of documents.
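As a rough illustration of the first case, the Python sketch below drops every numeric cell from a small spreadsheet exported as CSV. The function `tokens_without_numbers` and the sample data are assumptions made for this example; real engines apply their own filtering rules.

```python
import csv
import io

def tokens_without_numbers(csv_text):
    """Return the cell values that survive when numeric cells are dropped."""
    kept = []
    for row in csv.reader(io.StringIO(csv_text)):
        for cell in row:
            cell = cell.strip()
            if not cell:
                continue
            try:
                # Treat anything that parses as a number as "disregarded".
                float(cell.replace(",", ""))
            except ValueError:
                kept.append(cell)
    return kept

# Hypothetical spreadsheet: one header row and two data rows.
sheet = "Region,Q1,Q2\nNorth,1200,1350\nSouth,980,1010\n"
print(tokens_without_numbers(sheet))
```

Only the headings and row labels remain in this sketch; the figures that make up the actual content of the spreadsheet never reach the analysis.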
Because so much useful information is unavailable to text analytics engines, they are unsuited for enterprise-scale document classification processes that involve placing documents in discrete document types so that subsequent classification-dependent initiatives can be undertaken, e.g., retention, remediation, migration, and digitization.
Technology-assisted review (TAR) is often used as an expedient to address short-term needs to sort documents into general categories of responsive and non-responsive for discovery in litigation, but it falls short of the scalability, granularity, and consistency of visual classification for purposes of meeting ongoing enterprise-scale information governance needs.
An early forerunner to today’s TAR project teams:
For more information on how visual classification can help you meet your document-centric information governance needs, contact BR at IGDoneRight@BeyondRecognition.net or use one of the several contact forms on this web site.