Simpson’s Paradox is a kind of statistical brain teaser that provides lessons on text analytics and choosing the best tools to work with enterprise content. The “paradox” is that sometimes trends that seem apparent when data are analyzed as separate groups become reversed or disappear when the groups are combined. An example of Simpson’s Paradox […]
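The reversal is easier to see with numbers. The sketch below is only an illustration: the two classifiers ("A" and "B"), the "short" and "long" document groups, and every count are hypothetical and do not come from the post. It computes accuracy within each group and again after pooling the groups.

```python
from fractions import Fraction

# Hypothetical (correct, total) counts for two classifiers,
# split by document length. None of these figures are real data.
RESULTS = {
    "A": {"short": (90, 100), "long": (300, 500)},
    "B": {"short": (400, 500), "long": (50, 100)},
}


def group_rates(results):
    """Accuracy of each classifier within each group."""
    return {
        name: {group: Fraction(c, t) for group, (c, t) in groups.items()}
        for name, groups in results.items()
    }


def combined_rates(results):
    """Accuracy of each classifier after pooling all groups."""
    return {
        name: Fraction(
            sum(c for c, _ in groups.values()),
            sum(t for _, t in groups.values()),
        )
        for name, groups in results.items()
    }


if __name__ == "__main__":
    for name, groups in group_rates(RESULTS).items():
        for group, r in groups.items():
            print(f"{name} {group}: {float(r):.2f}")
    for name, r in combined_rates(RESULTS).items():
        print(f"{name} combined: {float(r):.2f}")
```

With these made-up counts, A is more accurate than B in both groups (0.90 vs 0.80 on short documents, 0.60 vs 0.50 on long ones), yet B looks better once the groups are pooled (0.75 vs 0.65), because most of B's documents came from the easier short group.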

Implicit biases – those that we form and use without explicit consideration – can wreak havoc on achieving critical goals. One such type of bias is especially damaging when designing file classification systems – confirmation bias. That is the “…tendency to search for, interpret, favor, and recall information in a way that confirms one’s preexisting […]

The Data-Information-Knowledge-Wisdom (“DIKW”) model is useful for examining how well an organization is doing in deriving value from its unstructured content. In his book, Too Big to Know,* David Weinberger credits Russell Ackoff, a leading organizational theorist, with making a pyramid-shaped depiction of the DIKW model in a 1988 address to the International Society for […]

The central theme of David Weinberger’s book Everything is Miscellaneous* is that no single method of classification serves all purposes, a concept worth considering when designing classification schemes for enterprise content management (“ECM”). One example of a classification scheme that he uses is the well-known periodic table, which arranges basic elements in […]

In Everything is Miscellaneous, David Weinberger points out that no single classification system will necessarily best serve all those who use the classified content, and he describes several tools used by popular websites to let individual users create and share what they consider to be significant information. Many of those tools could be applied to improve the […]

Sometimes a large percentage of files found in unstructured content locations like file shares and ECM systems were actually created by database-driven business systems. These documents are essentially filled-in templates populated with specified database elements. Whether stored as PDF or TIF, these computer-generated files are completely redundant with information stored in the database and could […]

The usual approach to classifying files or documents in an enterprise collection of unstructured content is top-down: determine what the classifications should be and then write rules or scripts on how to place individual files in the predetermined classifications. This presupposes a comprehensive knowledge of what’s in a collection and what attributes can be used […]

Imagine the internet with great search functionality but no hyperlinks. You could locate any individual page or at least have it included in extensive search results, but then you’d have to conduct other searches to find related pages, even on the same website. Not very useful, right? The point is that text search functionality alone is […]

Positional word frequency involves identifying how many times individual words appear at the same relative locations within the pages or documents in a collection. Positional word frequency solves major problems that occur when performing three basic functions involving unstructured content: classification, attribute extraction/coding, and search. Without positional word frequency, low-value words can cause clutter in text […]
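As a rough illustration of the counting step described above, the sketch below tallies how many documents carry each word at each token position. It rests on several assumptions that are not in the post: whitespace tokenization, a zero-based token index standing in for "relative location", the hypothetical helper names `positional_word_frequency` and `recurring_words`, and the made-up sample documents. It should not be read as the method any particular product uses.

```python
from collections import Counter


def positional_word_frequency(documents):
    """Count, for each (position, word) pair, how many documents have it.

    Assumes whitespace tokenization and uses the token index as the
    'relative location'; both are simplifications for illustration.
    """
    counts = Counter()
    for text in documents:
        for position, word in enumerate(text.lower().split()):
            counts[(position, word)] += 1
    return counts


def recurring_words(counts, n_documents, min_share=1.0):
    """Return (position, word) pairs present in at least min_share of documents."""
    return sorted(
        key for key, n in counts.items() if n / n_documents >= min_share
    )


if __name__ == "__main__":
    # Hypothetical template-like documents.
    docs = [
        "Invoice Number 1001 Total 250",
        "Invoice Number 1002 Total 975",
        "Invoice Number 1003 Total 40",
    ]
    counts = positional_word_frequency(docs)
    print(recurring_words(counts, len(docs)))
```

In this toy collection, the words that recur at the same position in every document ("invoice", "number", "total") look like template text, while the values that vary from one document to the next appear only once at their positions.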

Accuracy, Fitness, Velocity, & Proportionality

Data are used to represent real world events or objects and are collected to serve multiple purposes, e.g., to justify tax deductions, memorialize customer orders, or identify customer trends. Individual data points could be the price paid for an item, its weight, color, and dimensions, or they could be the […]
