In enterprise content management, tags help users find documents. Tags can provide terms that don’t appear on the face of the documents or they can highlight the importance of terms that do appear. Without tags, full-text searching can return large numbers of irrelevant hits while missing key documents that don’t contain a required search term.
Tagging systems employing taxonomies permit only terms in the taxonomies. The alternative approach of permitting users to enter whatever terms they find useful is described as using folksonomies. Collecting tags from multiple users who have interacted with the documents for multiple business reasons greatly increases the chances of finding relevant content and being able to interpret its significance in context.
Either way, asking users to help manage content by tagging or classifying their own documents usually fails because they simply don’t take the time. One way to obtain user-provided tags is to use the terms that appear in the folder path and filename for the files. To the extent that users can name their own folders this represents a de facto folksonomy.
Users may not be interested in tagging per se, but they do have to put the files somewhere, and the folder structures they create are, after all, hierarchical arrangements of significant terms. Where someone places a file and how they name it is a good indicator of how they think about it. One advantage of the Windows file system is that its traditional 260-character limit on path length enforces a discipline that prevents completely unfettered assigning of tags.
Using folder path/filenames can be particularly useful when duplicate documents have been collected from the files of multiple people because of the multiple insights gained. Here are two examples of where folder information could be very useful:
- In an antitrust case, a witness may deny that he viewed Company X as a competitor, but the files he had on Company X may have been filed under the path “Competitors/Company X.”
- In a products liability case, a witness might maintain that they never knew that certain emails contained notices of potentially deadly defects but those emails were under the folder “Potential Liability Issues.”
One suggestion to make paths easier to read and more effective for keyword indexing is to replace underscores, hyphens, dots, and other special characters with spaces and to insert spaces where capitalization appears within a name. For example:
“RetailCompetitors/CompanyA/PriceSheets.xlsx” would be converted to “Retail Competitors / Company A / Price Sheets xlsx.”
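The conversion above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the function name `path_to_tags` is hypothetical, names are treated as plain ASCII, and runs of consecutive capitals (acronyms) are left together rather than split.

```python
import re

def path_to_tags(path: str) -> str:
    """Turn a folder path/filename into a readable, indexable string of terms.

    Hypothetical helper: assumes ASCII names and leaves acronyms unsplit.
    """
    # Split on either separator so Windows and POSIX style paths both work.
    segments = [s for s in re.split(r"[\\/]+", path) if s]
    cleaned = []
    for seg in segments:
        # Replace underscores, hyphens, dots, and other special characters with spaces.
        seg = re.sub(r"[^0-9A-Za-z]+", " ", seg)
        # Insert a space where a lowercase letter or digit is followed by a capital.
        seg = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", seg)
        # Collapse any runs of whitespace left over from the steps above.
        seg = " ".join(seg.split())
        if seg:
            cleaned.append(seg)
    return " / ".join(cleaned)

print(path_to_tags("RetailCompetitors/CompanyA/PriceSheets.xlsx"))
```

Keeping the " / " separators between folder levels preserves the hierarchy for a human reader, while the individual words remain available to a keyword index.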
The parsed path/filenames for each custodian who had a copy of a duplicated document can be placed in a searchable field along with that custodian’s name. Having a dedicated field lets users search for the tag/folder terms without having to search the general full text of the documents.
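One way to picture such a field is the minimal Python sketch below. It assumes duplicates have already been identified by some shared key (here a made-up `doc_hash` string) and that each record carries a custodian name and an already parsed path. The function names, the record layout, and the sample data are all hypothetical, and a real review platform would use its own indexing rather than a substring match.

```python
from collections import defaultdict

def build_path_tag_field(records):
    """Group (doc_hash, custodian, parsed_path) records into one text field per document."""
    by_doc = defaultdict(list)
    for doc_hash, custodian, parsed_path in records:
        entry = f"{custodian}: {parsed_path}"
        # Skip exact repeats so the field lists each custodian/path pair once.
        if entry not in by_doc[doc_hash]:
            by_doc[doc_hash].append(entry)
    return {h: "; ".join(entries) for h, entries in by_doc.items()}

def search_path_tags(field_index, term):
    """Return the hashes of documents whose path-tag field contains the term."""
    term = term.lower()
    return sorted(h for h, text in field_index.items() if term in text.lower())

# Illustrative data only: two custodians held the same document (hash "abc").
records = [
    ("abc", "Smith", "Competitors / Company X / Pricing memo docx"),
    ("abc", "Jones", "Sales / Q3 / Pricing memo docx"),
    ("def", "Smith", "Potential Liability Issues / Defect notice msg"),
]
field_index = build_path_tag_field(records)
print(field_index["abc"])
print(search_path_tags(field_index, "competitors"))
```

Because the field is separate from the document text, a search for "competitors" here matches only where a custodian actually filed the document under that term.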
For earlier posts on this general topic, see:
- User-Augmented ECM Classifications: http://beyondrecognition.net/user-augmented-ecm-classifications/
- What the Periodic Table Teaches about ECM Classification: http://beyondrecognition.net/what-the-periodic-table-teaches-about-ecm-classification/
For more information about managing unstructured content, you can download my book, Guide to Managing Unstructured Content, for free personal use at http://beyondrecognition.net/download-john-martins-guide-to-managing-unstructured-content/