Glyphs can be used to consistently deconstruct, classify, and attribute large volumes of files, which makes those files easier to manage. Deconstruction breaks files into their smallest visual elements, classification uses data visualization at the page level, and attribution selects specified glyphs or their text values from within classifications.
The word “glyph” has several meanings. In typography, it refers to a symbol that has an agreed meaning in a set of symbols used to represent the alphabet, numbers, or punctuation within a typeface. The essential idea is that for typographic purposes, glyphs are the lowest unit of information, the building blocks for other larger units such as words, sentences, paragraphs, etc. As used in visual file classification, a glyph is any graphical element apparent on the face of the document. It could be a letter, number, or punctuation, or it could be a staple hole, a logo, or a line separating two parts of a document.
Deconstructing a document means identifying and describing all the glyphs or graphical units on each page and cataloging the page coordinates where they occur. The initial cataloging does not involve determining if there is any text directly correlated to any glyph. Non-textual glyphs such as lines, boxes, logos, graphs, charts, maps, plats, illustrations, or signatures are included because they can help define and identify document types during classification, and managers may want to extract these non-textual attributes for special purposes during attribution, e.g., to compare signatures of the same person from different documents.
By cataloging the shape and page coordinates of each individual glyph, the system can analyze various aspects of the documents, as discussed next.
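A glyph catalog of this kind can be pictured as a simple data structure. The following is only a minimal sketch, not the actual system: the `GlyphOccurrence` record, the `build_catalog` helper, and the shape identifiers are all hypothetical names chosen for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class GlyphOccurrence:
    """One glyph found on a page (hypothetical record layout)."""
    shape_id: str   # assumed identifier for a distinct glyph shape
    page: int       # page number where the glyph occurs
    x: float        # left edge, in page units
    y: float        # top edge, in page units
    width: float
    height: float


def build_catalog(occurrences):
    """Group occurrences by shape so each distinct glyph lists every place it appears."""
    catalog = defaultdict(list)
    for occ in occurrences:
        catalog[occ.shape_id].append(occ)
    return dict(catalog)


# Made-up sample data: two textual shapes and one non-textual rule line.
occurrences = [
    GlyphOccurrence("shape-001", page=1, x=72.0, y=90.0, width=6.0, height=9.0),
    GlyphOccurrence("shape-002", page=1, x=80.0, y=90.0, width=6.0, height=9.0),
    GlyphOccurrence("shape-001", page=2, x=72.0, y=300.0, width=6.0, height=9.0),
    GlyphOccurrence("rule-line", page=1, x=72.0, y=400.0, width=468.0, height=1.0),
]
catalog = build_catalog(occurrences)
```

Note that nothing in the catalog records a text value yet; textual and non-textual glyphs are stored the same way, by shape and position.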
“Glyph” as used in data visualization refers to the graphical elements representing characteristics of objects in a visualization. “Glyph” also refers to the overall visualization.
In glyph-based classification, the system builds visual profiles of what each page looks like and clusters visually similar files based on their profiles. This works whether the file was an original native file, a scanned document, or a fax, and it works with documents written in different languages.
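To make the idea of a visual profile concrete, here is a rough sketch under stated assumptions: the profile is a coarse grid marking which regions of the page contain glyphs, similarity is the Jaccard overlap of occupied cells, and clustering is a simple greedy pass. These are stand-in techniques chosen for brevity; the function names, the grid size, and the threshold are all hypothetical and are not taken from any particular product.

```python
def page_profile(boxes, page_w, page_h, grid=4):
    """Return a grid*grid tuple of 0/1 flags marking cells that contain a glyph center."""
    cells = [0] * (grid * grid)
    for x, y, w, h in boxes:
        cx, cy = x + w / 2, y + h / 2
        col = min(int(cx / page_w * grid), grid - 1)
        row = min(int(cy / page_h * grid), grid - 1)
        cells[row * grid + col] = 1
    return tuple(cells)


def similarity(a, b):
    """Jaccard similarity of two profiles: shared occupied cells over all occupied cells."""
    both = sum(1 for p, q in zip(a, b) if p and q)
    either = sum(1 for p, q in zip(a, b) if p or q)
    return both / either if either else 1.0


def cluster(profiles, threshold=0.6):
    """Greedy clustering: a page joins the first cluster whose first member is similar enough."""
    clusters = []  # list of (representative_profile, [page names])
    for name, prof in profiles.items():
        for rep, members in clusters:
            if similarity(rep, prof) >= threshold:
                members.append(name)
                break
        else:
            clusters.append((prof, [name]))
    return [members for _, members in clusters]


# Made-up glyph bounding boxes (x, y, width, height) on 612 x 792 pages.
pages = {
    "invoice_a": [(50, 40, 100, 50), (400, 40, 150, 40), (50, 300, 500, 20), (450, 700, 100, 20)],
    "invoice_b": [(55, 45, 100, 50), (405, 50, 150, 40), (50, 310, 500, 20), (440, 690, 100, 20)],
    "letter": [(72, 100, 300, 12), (72, 250, 450, 200), (72, 500, 200, 12)],
}
profiles = {name: page_profile(boxes, 612, 792) for name, boxes in pages.items()}
groups = cluster(profiles)
```

With this sample data the two invoices land in one group and the letter in another, even though no text is read at any point.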
Most importantly, the glyph-based classification process does not depend on having extractable text in the file. For example, the following thumbnails are representations of an invoice, a letter, an agreement, and an email. Although none of the individual characters can be read, it is clear from the layout of the pages which document types are represented:
Knowledge workers assign document-type classifications to the clusters using a multi-tiered classification scheme. Glyph-based classification is more accurate and consistent than text-restricted algorithms because it has a richer set of data to analyze.
Managers responsible for ECM and Information Governance typically need more than classification to manage their files. They also need to know the attributes of those files. Sometimes that means just specific values from particular document types, e.g., the API Well Number from Well Logs, and sometimes it means being able to search or edit any of the text associated with the files.
Having a catalog of all glyphs from all files in a collection enables knowledge workers to map the text values associated with individual glyphs. Once the associations are established, the text values are known for all occurrences of those glyphs. For example, there might be millions of occurrences of the glyphs associated with the letter “a” as shown in the illustration at the top of this post. By associating the text character “a” with all nine of those glyphs, the knowledge worker would establish the text value for every such occurrence. Mapping the text values for glyphs is a far faster and more accurate way of creating text layers and extracting text attributes than optical character recognition technology.
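The mapping step described above can be sketched in a few lines. This is an illustration only: the shape identifiers, the `text_layers` helper, and the tuple layout `(shape_id, page, x, y)` are hypothetical, and the reading order is simplified by assuming glyphs on the same line share an exact `y` value.

```python
def text_layers(occurrences, shape_to_text, unknown="?"):
    """Build a per-page text string by looking up each glyph's assigned text value.

    Occurrences are (shape_id, page, x, y) tuples. Shapes with no assigned
    value yet are rendered with the `unknown` placeholder.
    """
    pages = {}
    # Sort by page, then top-to-bottom, then left-to-right (simplified reading order).
    for shape_id, page, x, y in sorted(occurrences, key=lambda o: (o[1], o[3], o[2])):
        pages[page] = pages.get(page, "") + shape_to_text.get(shape_id, unknown)
    return pages


# Two different shapes for "a" are each assigned once; every occurrence then resolves.
shape_to_text = {
    "c-serif": "c",
    "a-double-story": "a",
    "a-single-story": "a",
    "t-serif": "t",
}

# Made-up occurrences across two pages.
occurrences = [
    ("t-serif", 1, 30, 100),
    ("c-serif", 1, 10, 100),
    ("a-double-story", 1, 20, 100),
    ("a-single-story", 2, 10, 100),
    ("t-serif", 2, 20, 100),
]
layers = text_layers(occurrences, shape_to_text)
```

The work a knowledge worker does is limited to filling in `shape_to_text`, one entry per distinct shape; the number of occurrences in the collection does not change that effort.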
Using glyphs to create graphical representations of documents allows the analysis to take different focal points depending on its purpose. For example, the focus can be on individual symbols, words, phrases, lines, paragraphs, blocks, tables, pages, or documents. By building associations between text characters and those various glyph focal points, the analysis can shift between purely graphical and textual representations during attribute extraction and content enablement.
Documents within any visual cluster are very similar, and identifying where a desired attribute occurs in one member of the cluster usually results in identifying where those values occur in all the files in the cluster, making the extraction of specific attributes much easier.
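As a rough illustration of zone-based extraction across a cluster, the sketch below marks a rectangular zone on one document and applies the same zone to every member. It assumes word-level glyphs that already carry text values, and the names `extract_zone` and `extract_from_cluster`, the coordinates, and the sample values are all made up for the example.

```python
def extract_zone(page_glyphs, zone):
    """Join the text of glyphs whose top-left corner falls inside zone = (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = zone
    inside = [g for g in page_glyphs if x0 <= g[1] <= x1 and y0 <= g[2] <= y1]
    inside.sort(key=lambda g: (g[2], g[1]))  # top-to-bottom, then left-to-right
    return " ".join(g[0] for g in inside)


def extract_from_cluster(cluster_pages, zone):
    """Apply one zone, identified on a single member, to every page in the cluster."""
    return {name: extract_zone(glyphs, zone) for name, glyphs in cluster_pages.items()}


# Made-up cluster of two visually similar pages; glyphs are (text, x, y).
# "A-100" and "A-200" are placeholder values, not real well numbers.
cluster_pages = {
    "well_log_1": [("WELL LOG", 72, 40), ("A-100", 410, 52), ("Depth", 72, 200)],
    "well_log_2": [("WELL LOG", 70, 42), ("A-200", 405, 50), ("Depth", 72, 198)],
}

# Zone drawn around the value on well_log_1, with some slack for small shifts.
zone = (380, 30, 560, 70)
values = extract_from_cluster(cluster_pages, zone)
```

Because cluster members share a layout, a little slack in the zone is usually enough to absorb small positional differences between scans; pages that fall outside that tolerance would need to be reviewed separately.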
For information on managing unstructured content, go to the following link for your personal copy of my recent book, Guide to Managing Unstructured Content: http://beyondrecognition.net/download-john-martins-guide-to-managing-unstructured-content/
- Glyph Definition – Typography: https://en.wikipedia.org/wiki/Glyph
- Glyphs in Data Visualization: https://en.wikipedia.org/wiki/Glyph_(data_visualization)
- Egyptian Hieroglyphics: https://www.shutterstock.com/pic-258362537/stock-photo-egyptian-hieroglyphs-on-the-wall.html, copyright Fedor Selivanov.
- Glyphs for “a”: https://en.wikipedia.org/wiki/Glyph#/media/File:A-small_glyphs.svg
- Data Visualization: https://en.wikipedia.org/wiki/Glyph_(data_visualization)#/media/File:Scatter_plot.jpg