There are profound differences in the capabilities of a glyph-based document processing engine compared to legacy optical character recognition (“OCR”) systems.

From a process efficiency viewpoint, OCR treats each potential character as a fresh recognition task: even if precisely the same pattern of pixels has already been recognized, that pattern is put through the full OCR process again, with presumably the same result as before. If there are a thousand pages of images with an average of 1,500 characters per page, the OCR process will run 1.5 million times, repeatedly performing the same task.
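The per-occurrence arithmetic above can be made concrete in a few lines (the figures are the illustrative ones from the text, not measurements):

```python
# Illustrative cost model: legacy OCR runs one recognition per character
# occurrence, so the call count scales with total characters, not with
# the number of distinct shapes on the pages.
pages = 1000
chars_per_page = 1500
ocr_calls = pages * chars_per_page  # one recognition task per occurrence
assert ocr_calls == 1_500_000
```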

By contrast, if the same font had been used in all thousand pages, there will be one glyph cluster for each upper-case and lower-case letter, each number, and each punctuation mark: somewhere fewer than 100 glyph clusters with potential text values. The text creation process only has to process one glyph per cluster, using the optimal glyph from each cluster. Because glyph clustering can take place much faster than character recognition, the glyph cataloging process runs far faster than OCR.
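The recognize-once-per-cluster idea can be sketched as follows. This is a minimal illustration, not the engine's actual implementation: `recognize` stands in for an expensive character recognition call, and a glyph's pixel pattern is assumed to be hashable so identical shapes share a cluster key.

```python
def cluster_and_recognize(glyphs, recognize):
    """Map each glyph occurrence to text, calling `recognize` once per cluster.

    glyphs: iterable of hashable pixel patterns (identical patterns cluster together).
    recognize: expensive function, pattern -> text value.
    """
    cluster_text = {}  # cluster key -> recognized text value
    result = []
    for glyph in glyphs:
        if glyph not in cluster_text:
            cluster_text[glyph] = recognize(glyph)  # one call per distinct cluster
        result.append(cluster_text[glyph])          # reuse for every repeat occurrence
    return result, cluster_text
```

With 1.5 million occurrences drawn from fewer than 100 distinct patterns, `recognize` would run under 100 times instead of 1.5 million.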

The huge difference between legacy OCR applications and glyph clustering comes when it is time to edit or clean up the initial text results. Legacy OCR applications typically require brute-force, linear editing in which each error in the final text must be reviewed in turn. With glyph cataloging, the relationship between a glyph cluster and its associated text value only needs to be corrected once. The text values for all occurrences of the glyphs in that cluster are then updated across every image processed to date, and the corrected text value persists even for documents that have yet to be scanned.
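One way to picture why a single correction propagates everywhere is to note that pages can store cluster keys rather than literal text, so rendered text is always a lookup into the cluster table. The structure below is a hypothetical sketch of that idea, not the product's actual data model:

```python
# Cluster table: one text value per glyph cluster.
cluster_text = {"g1": "rn", "g2": "o", "g3": "t"}

# Pages store cluster keys, not text, so they always render through the table.
pages = {
    "page1": ["g1", "g2", "g2", "g3"],
    "page2": ["g3", "g1", "g2"],
}

def render(page):
    """Render a page's text by looking up each cluster's current text value."""
    return "".join(cluster_text[key] for key in pages[page])

# One correction: cluster "g1" was misrecognized as "rn" but is really "m".
cluster_text["g1"] = "m"
# Every occurrence on every page now renders with the corrected value,
# and any page scanned later that maps to "g1" inherits it too.
```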
