Migrating unstructured content involves many moving parts, including substantial investments of time and money to extract and load the files themselves and to quality-control, aggregate, and supplement existing file attributes. Here are some points to consider when migrating content:
1 – Identify Relevant Stakeholders
What business units or functions have used the content in the past? Which ones would like to use it in the future? Does Records Management or Legal have concerns or requirements on the content?
2 – Do the Stakeholders Want the Content?
Just because an organization has files doesn’t mean it needs to keep or migrate them. Any thoughts of migrating content should start with the question, “Why do we want to keep this content?” One way to discourage data hoarding is to designate who the “data owners” are and allocate some or all of the costs of keeping the content to them.
3 – Migrate only Unique Content
If an organization has made a complete file and hash value inventory of all content in all its data stores, it can evaluate the level of duplication in a collection and target the movement of only truly unique files. There’s no point in moving more than one copy of a file, and if the file already exists in managed content, there’s no need to move it at all.
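As a minimal sketch of that approach, the following Python groups files by SHA-256 content hash so that only one representative of each hash group needs to move; the directory path in the usage comment is illustrative:

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def hash_file(path, chunk_size=1 << 20):
    """Return the SHA-256 digest of a file, read in 1 MB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def inventory(root):
    """Group every file under root by content hash; identical files
    land in the same group regardless of name or location."""
    groups = defaultdict(list)
    for p in Path(root).rglob("*"):
        if p.is_file():
            groups[hash_file(p)].append(p)
    return groups

# groups = inventory("/data/legacy_share")        # hypothetical path
# to_migrate = [paths[0] for paths in groups.values()]  # one copy per group
```

Comparing the hash groups against the hashes already in the managed repository then identifies files that need not move at all.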
4 – Eliminate ROT
Just because an organization needs some of a collection doesn’t mean it needs all of it. ARMA has published many articles on ROT (Redundant, Obsolete, or Trivial files). Take the opportunity to eliminate ROT before loading the content onto a new system. This step can readily be incorporated into a classification normalization pass during the migration. Eliminating ROT typically removes a substantial portion of a collection.
5 – Measure Accuracy of the Associated Metadata
Content migration involves not just moving the files themselves, but also the metadata kept in content management systems to track and manage that content. Before migrating content, assess the accuracy and consistency of that metadata. The most important type of metadata is file classification, but the assessment should also cover whether other data elements extracted from the files are accurate.
When duplicate files are identified, there should be a way to normalize and aggregate the metadata for the different copies. For example, folder names can provide terms that are essentially a folksonomy of how earlier users viewed or classified the files. When multiple copies were kept under differently named folder paths, those additional terms should be aggregated and migrated.
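Assuming the duplicate detection step has produced a list of paths for each duplicate group, one way to aggregate those folder-path terms might look like this (the example paths are hypothetical):

```python
from pathlib import PurePosixPath

def folder_terms(path):
    """Split a file's folder path into candidate classification terms."""
    return {part.lower()
            for part in PurePosixPath(path).parent.parts
            if part not in ("/", "")}

def aggregate_terms(duplicate_paths):
    """Union the folder terms from every copy of the same file, so the
    migrated record carries how all earlier users filed it."""
    terms = set()
    for p in duplicate_paths:
        terms |= folder_terms(p)
    return terms

# Two copies of the same contract, filed differently by earlier users:
paths = ["/shares/legal/contracts/acme.docx",
         "/shares/projects/acme_renewal/acme.docx"]
# aggregate_terms(paths)
#   -> {'shares', 'legal', 'contracts', 'projects', 'acme_renewal'}
```

The surviving unique copy then migrates with the combined term set rather than only the terms from whichever copy happened to be kept.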
6 – Identify Shortfalls or Gaps in Current Tracking Info
Are there additional data elements that should be captured from certain document types to support changing business needs? Should existing data elements be normalized to aid in retrieval? A method for articulating and reaching agreement on business rules and data standards should be established before the project begins.
7 – Who’s Responsible?
Whether the migration succeeds or fails, who’s responsible? What are the measures of success?
8 – Resources & Timelines
What resources does the person responsible have to achieve the migration, both in-house and vendor-provided? What’s the timeline? Remember that at large scale, just transferring files can be time consuming and can saturate existing bandwidth. Before deciding how to move files, benchmark the different approaches by moving meaningfully sized samples. Two related points here:
- De-duping and deNISTing before moving files can substantially decrease the volume to be moved.
- People become so accustomed to moving individual files over network or Internet connections that they forget the option of copying the initial collection onto a locally attached drive and then shipping or moving the drive to where it will be processed or loaded into the final target system.
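A deNISTing pass can be as simple as filtering the inventory against a reference set of known-file hashes (such as an NSRL-style list). A minimal sketch, assuming the file hashes have already been computed and the hash-list loader is hypothetical:

```python
def denist(file_hashes, known_hashes):
    """Drop files whose content hash appears in a reference set of known
    system/application files; returns only the survivors."""
    return {path: h for path, h in file_hashes.items() if h not in known_hashes}

# known = load_nsrl_hashes("rds_hashes.txt")   # hypothetical loader
# survivors = denist(inventory_hashes, known)  # only these files move
```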
9 – Migration’s Target System
The technical requirements of the target system that will manage the migrated content will determine the potential deliverables for the project. Usually there will be a range of possible deliverables; be sure to specify which ones, e.g.,
- Original native files, e.g., Word or Excel
- Image-Only PDF
- Image w/Text PDF
- Redacted PDF
- Extracted Text
Also, specify how any metadata will be loaded in the target system, e.g., tab-delimited files, XML, etc.
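For example, a tab-delimited load file with a header row can be produced with Python’s standard csv module; the field names below are illustrative, not a required schema:

```python
import csv

def write_load_file(records, out_path, fields):
    """Write metadata records as a tab-delimited load file with a header row."""
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fields, delimiter="\t")
        writer.writeheader()
        writer.writerows(records)

# Illustrative record; real field names come from the target system's spec.
records = [{"doc_id": "0001",
            "classification": "Contract",
            "source_path": "/shares/legal/acme.docx"}]
# write_load_file(records, "load.tsv",
#                 ["doc_id", "classification", "source_path"])
```

Whatever format is chosen, get a sample load file accepted by the target system before producing the full set.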
10 – How will Key Quality, Cost, and Timeline Data be Validated?
The best advice on migration projects is to trust but verify, and to verify early and often. For example, if timeline projections are based on the assumption of being able to move a petabyte of content in three weeks, how has that been validated? If it takes a week to move a terabyte of content, it is unlikely that a petabyte can be moved within the time allotted. Using sample data from the actual content collection can provide reality-based estimates.
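A back-of-the-envelope extrapolation from a timed sample move makes such claims easy to check; the sample figures below are hypothetical:

```python
def projected_days(sample_gb, sample_hours, total_tb):
    """Extrapolate total transfer time (in days) from a timed sample move."""
    gb_per_hour = sample_gb / sample_hours     # observed throughput
    total_gb = total_tb * 1024                 # collection size in GB
    return total_gb / gb_per_hour / 24

# If moving a 500 GB sample took 10 hours (50 GB/hour), then a 1 PB
# (1024 TB) collection at that rate needs roughly:
# projected_days(500, 10, 1024) -> ~874 days, nowhere near a three-week target
```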
Several common-sense validation techniques can augment validation results without relying on just one migration tool. For example, each item in an inventory of the files in the existing collection should be accounted for in the subsequent processing: the beginning total should equal the sum of unique content files, duplicate content files, ROT files, and system files. If it doesn’t, something has either fallen through the cracks or been processed twice.
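That reconciliation check is simple arithmetic; a sketch, with illustrative counts:

```python
def reconcile(total, unique, duplicates, rot, system):
    """Every file in the starting inventory must land in exactly one bucket.
    Returns (balanced?, gap): gap > 0 means files fell through the cracks,
    gap < 0 means something was counted twice."""
    accounted = unique + duplicates + rot + system
    return total == accounted, total - accounted

# ok, gap = reconcile(total=1_000_000, unique=412_000, duplicates=388_000,
#                     rot=185_000, system=15_000)   # balances: ok is True
```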
Another good check of the end-to-end process is for users to examine files from the starting collection and see if they can find them in the final collection by searching the classifications, extracted attributes, or special flags.
- “Checklist on Sources of File or Document Attributes,” March 23, 2017, http://beyondrecognition.net/file-document-attributes/
- “E-Discovery: Using Path Names and Filenames to Create Folksonomies,” March 11, 2017, https://www.linkedin.com/pulse/e-discovery-using-path-names-filenames-create-john-martin
- “Ending Unstructured Content Walkabouts,” Nov. 2, 2016, http://beyondrecognition.net/ending-unstructured-content-walkabouts/