In the 1986 film Crocodile Dundee, the character Michael J. “Crocodile” Dundee uses the term walkabout to describe disappearing on his wife for 18 months and shares the perhaps predictable outcome:

“I was sorta married once – nice girl, good cook . . . Then I went walkabout, and when I came back, she’d gone.”

In Australia, the term walkabout is also used to describe the rite of passage in which adolescent Aboriginal Australian males journey into the wilderness for up to six months to make the spiritual and traditional transition into manhood.

Organizations sometimes do walkabouts on their unstructured content. They leave it in place to fend for itself for extended periods, or they wander about in the technological wilderness, reactively dealing with specific pain points as they occur. There is no particular plan, but the outcomes are predictable: files get lost or can’t be found, unauthorized people gain access to them, most files don’t get destroyed when mandated, and overall costs, inefficiencies, and risks rise.

In my book, Guide to Managing Unstructured Content, I provide a simple four-step plan (“RCAV”) to end unstructured content walkabouts:

  1. Rationalize the unstructured content: inventory where all the files are.
  2. Classify the content: determine what type of documents are in the files.
  3. Attribute the content: identify and extract the high-value variables from each document type.
  4. Validate the process: ensure that all content has been processed and that the results are accurate.

Regardless of the systems being used for each of these steps, there are common pitfalls that need to be avoided. Here are just some of the generic issues that can affect all systems:


Rationalize:

  1. 260-character Windows Path/Filename limit. There are several ways that files can end up on folder paths where the combined path and filename exceed 260 characters. Most Windows software won’t index or find such files. They are lost in place.
  2. Duplicate Files. Large numbers of files are duplicates, consuming unnecessary resources. Duplicates should be identified before collection.
  3. Duplicate Emails. Each email client system can store messages in a different format, making it difficult to use hash values to identify duplicates. This can happen even with different versions of the same email client software. Email messages need to be normalized to a standard format before duplicate identification.
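
The three checks above can be sketched in a few lines of Python. This is a minimal illustration rather than a description of any particular product: the function names are hypothetical, SHA-256 is just one reasonable hash choice, and the set of email fields used for the fingerprint (sender, date, subject, body text) is an assumption that would need tuning for a real collection. Depending on system configuration, the scan itself may also need long-path support to reach the files it is trying to report.

```python
import hashlib
import os
from collections import defaultdict
from email import message_from_bytes, policy

# Classic Windows limit. MAX_PATH counts the terminating NUL, so a path of
# 260 visible characters is already too long for software that honors it.
MAX_PATH = 260


def find_long_paths(root, limit=MAX_PATH):
    """Return absolute file paths under root whose length is >= limit."""
    hits = []
    for folder, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.abspath(os.path.join(folder, name))
            if len(full) >= limit:
                hits.append(full)
    return hits


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def duplicate_groups(paths):
    """Group byte-identical files; return only groups with 2+ members."""
    by_hash = defaultdict(list)
    for path in paths:
        by_hash[sha256_of_file(path)].append(path)
    return [sorted(group) for group in by_hash.values() if len(group) > 1]


def email_fingerprint(raw_bytes):
    """Hash a normalized view of a message instead of its stored bytes.

    Assumption: sender, date, subject, and body text are enough to treat
    two stored copies as the same message.
    """
    msg = message_from_bytes(raw_bytes, policy=policy.default)
    body = msg.get_body(preferencelist=("plain", "html"))
    text = body.get_content() if body is not None else ""
    parts = [
        str(msg.get("From", "")).strip().lower(),
        str(msg.get("Date", "")).strip(),
        " ".join(str(msg.get("Subject", "")).split()).lower(),
        " ".join(text.split()),  # collapse whitespace and line endings
    ]
    return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()
```

Hashing the raw bytes of two copies of the same message will usually give different values because header order, line endings, and client-added headers differ. Hashing the normalized fields is what lets the copies match.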


Classify:

  1. Image-Only Files. Text-based classification algorithms don’t work with image-only files. I have worked with corporate collections containing over 40% image-only files.
  2. Language Dependence. Text-based classification algorithms are typically language-dependent. Depending on the approach used, organizations may have to develop a unique template or script for each language for each document type.
  3. Morphing. Existing document types morph 10-15% each year, and new types come onstream each year.
  4. Incorrect Document Boundaries. Many PDFs and TIFFs contain multiple documents per file. Classifying at the file level ignores and hides the embedded documents.
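
One way to see how exposed a collection is to the image-only problem is to measure it before choosing a classifier. The sketch below assumes some earlier step has already attempted text extraction and produced a list of per-page strings for each file. The function names and the 20-character threshold are illustrative assumptions, not values from any specific tool.

```python
def is_image_only(page_texts, min_chars=20):
    """True when no page yields at least min_chars of non-whitespace text."""
    return all(len("".join(text.split())) < min_chars for text in page_texts)


def image_only_share(collection, min_chars=20):
    """Fraction of files with no usable text.

    collection maps a file name to its list of per-page extracted text.
    """
    if not collection:
        return 0.0
    flagged = [
        name for name, pages in collection.items()
        if is_image_only(pages, min_chars)
    ]
    return len(flagged) / len(collection)
```

Files flagged this way would need a non-textual route (or OCR first) rather than being silently passed to a text-based classifier.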


Attribute:

  1. Classification-Dependence. Extracting desired document attributes will generally be only as good as the underlying classification. Any classification errors quickly cascade into multiple attribute extraction errors.
  2. Non-textual Attribution. When pulling desired attributes from within documents, organizations should look for opportunities to pull non-textual graphical elements, such as signatures.
  3. Normalize. Data values extracted from business documents should be normalized to overcome the variations that normally occur and to enable integration with multiple systems of record.
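
As a small illustration of the normalization point, the sketch below maps a few common date and amount spellings onto single canonical forms (ISO 8601 dates and `Decimal` amounts). The list of accepted formats is an assumption for the example: it treats slash dates as month/day/year, expects English month names, and assumes a period as the decimal separator. A real collection would need its own list.

```python
import re
from datetime import datetime
from decimal import Decimal, InvalidOperation

# Assumed input formats; extend or reorder for the collection at hand.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d %B %Y", "%B %d, %Y", "%d-%b-%Y")


def normalize_date(raw):
    """Return an ISO 8601 date string, or None if no known format matches."""
    cleaned = " ".join(raw.split())
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def normalize_amount(raw):
    """Strip currency symbols, commas, and spaces; return Decimal or None."""
    cleaned = re.sub(r"[^\d.\-]", "", raw)
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None
```

Returning `None` for values that do not parse keeps failures visible, so they can be routed to review instead of being loaded into a system of record in an unknown format.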


Validate:

  1. Audit Trail. Logs should permit auditing every classification or attribution value, from original sources and duplicates all the way through to the specific file, document, page, and page coordinates for each extracted attribute.
  2. Performance Guarantees. The agreements under which files are collected, classified, and attributed should have written service terms ensuring accuracy well above 99%.
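
To make the two validation points concrete, here is a minimal sketch of what one audit record might carry and how an accuracy target could be checked against a manually reviewed sample. The record fields and function names are hypothetical, chosen to mirror the items listed above. The check uses simple observed accuracy on the sample, which is an assumption about how a service term would be measured.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttributeAuditRecord:
    """One extracted value traced back to where it was found."""
    attribute: str               # e.g. "invoice_date"
    value: str                   # extracted (normalized) value
    source_file: str             # original file the value came from
    document_index: int          # which document inside a multi-document file
    page: int                    # page within that document
    box: tuple                   # (x0, y0, x1, y1) page coordinates
    duplicate_paths: tuple = ()  # other locations holding the same file


def observed_accuracy(reviewed):
    """reviewed is a list of (extracted_value, verified_value) pairs."""
    if not reviewed:
        raise ValueError("no reviewed samples")
    correct = sum(1 for extracted, verified in reviewed if extracted == verified)
    return correct / len(reviewed)


def meets_target(reviewed, target=0.99):
    """True when observed accuracy on the sample is above the target."""
    return observed_accuracy(reviewed) > target
```

Because each record is immutable and names the file, document, page, and coordinates, a reviewer can go from any reported value straight back to the pixels it was read from.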

For more specifics on these and other common pitfalls in managing unstructured content, as well as more information on the RCAV process, you can download your copy of the Guide to Managing Unstructured Content at https://beyondrecognition.net/download-john-martins-guide-to-managing-unstructured-content/

