In the IT and Information Governance (“InfoGov”) worlds, organizational data is usually thought of as being either structured or unstructured. This posting looks at the differences between the two, and in the next posting I’ll suggest a best-practices approach to thinking about and managing unstructured content.
Overview of Structured vs. Unstructured
Information professionals often prefer dealing with structured content like that contained in databases. They know what type of information is stored in them and there are defined relationships among the tables, columns, and rows.
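To make "defined relationships" concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names, column names, and sample rows are hypothetical and chosen only for illustration; the point is that the schema tells us in advance what each column holds and how the tables relate.

```python
import sqlite3

# In-memory database; the schema and data below are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    );
    CREATE TABLE invoices (
        invoice_id  INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
        amount      REAL NOT NULL
    );
""")

conn.execute("INSERT INTO customers VALUES (1, 'Example Buyer Inc.')")
conn.executemany(
    "INSERT INTO invoices VALUES (?, ?, ?)",
    [(100, 1, 250.0), (101, 1, 75.5)],
)

# Because the relationship between the tables is declared, we can ask a
# precise question without searching any free text.
rows = conn.execute("""
    SELECT c.name, COUNT(*), SUM(i.amount)
    FROM invoices AS i
    JOIN customers AS c ON c.customer_id = i.customer_id
    GROUP BY c.name
""").fetchall()
print(rows)
```

Nothing comparable is available out of the box for a folder of files on a file share, which is the contrast drawn in the rest of this posting.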
The same professionals are often much less comfortable trying to manage “unstructured” content like that contained on file shares. They often think of unstructured content as if it were a residual, catch-all category with no discernible rhyme or reason to what is contained in it. At a macro level that may be correct – there is no necessary connection between any particular file of unstructured content and any other file stored there. However, that is not to say that there is no structure.
Here is an overview of the major differences between structured and unstructured content:
The Rationale for Enterprise Content Management Systems
Content management systems represent an attempt to provide the advantages of structured content for unstructured content, i.e., to provide ways to find and work with files without having to rely on full-text search alone. In the next posting, I’ll suggest a best-practices approach to thinking about how to manage supposedly unstructured content.
In the meantime, note that the high-level differentiator between structured and unstructured content – not knowing what’s in unstructured content – goes away if there is consistent document type or file classification. We do know in a general way what is in particular document types and what attributes they contain. Some examples:
- Invoices include the names of the buying and selling parties along with quantities and descriptions of the items sold. While the number of possible invoice formats is theoretically infinite, in practice the number is manageable.
- Well logs are created by one of a small number of companies capable of producing them, and each log will be in one of a limited number of formats, enabling us to determine where the well name, API number, date, and GPS coordinates are located.
- Construction change orders will be issued using either a form or a template, so the orders will, once again, look alike and contain specific information in particular locations.
In other words, the lack of uniformity across the entire collection does not keep subsets of stored documents from being quite structured within document types. Consistent classification and attribute extraction within document types provide the most precise and flexible indexing and management of seemingly unstructured content.
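The idea of classifying first and then extracting attributes by document type can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the marker phrases, field labels ("Seller:", "Well Name:", and so on), and sample text are hypothetical, and real-world classification would typically rely on much richer signals than a keyword match.

```python
import re

# Hypothetical document types. Each has a marker used to recognize the type
# and a set of patterns for the attributes expected in that type.
DOC_TYPES = {
    "invoice": {
        "marker": re.compile(r"\binvoice\b", re.IGNORECASE),
        "attributes": {
            "seller": re.compile(r"^Seller:[ \t]*(.+)$", re.MULTILINE),
            "buyer": re.compile(r"^Buyer:[ \t]*(.+)$", re.MULTILINE),
        },
    },
    "well_log": {
        "marker": re.compile(r"\bwell log\b", re.IGNORECASE),
        "attributes": {
            "well_name": re.compile(r"^Well Name:[ \t]*(.+)$", re.MULTILINE),
            "api_number": re.compile(r"^API Number:[ \t]*([\d-]+)", re.MULTILINE),
        },
    },
}


def classify(text):
    """Return the first document type whose marker appears in the text, else None."""
    for doc_type, spec in DOC_TYPES.items():
        if spec["marker"].search(text):
            return doc_type
    return None


def extract_attributes(text):
    """Classify the text, then apply only the patterns defined for that type."""
    doc_type = classify(text)
    if doc_type is None:
        return None, {}
    found = {}
    for name, pattern in DOC_TYPES[doc_type]["attributes"].items():
        match = pattern.search(text)
        if match:
            found[name] = match.group(1).strip()
    return doc_type, found


# Made-up sample; the values are placeholders, not real well data.
sample = """WELL LOG
Well Name: Example 1-A
API Number: 00-000-00000
"""
doc_type, attrs = extract_attributes(sample)
print(doc_type, attrs)
```

The design choice worth noting is that attribute patterns hang off the document type: once a file is known to be a well log, we only look for the things a well log contains, and the extracted values can then be indexed much like columns in a database.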
Next posting: a suggested best practices approach to unstructured content.
Guide to Managing Unstructured Content
For more information on managing unstructured content, you can download your free, personal-use copy of our Guide to Managing Unstructured Content at: https://beyondrecognition.net/download-john-martins-guide-to-managing-unstructured-content/