Speed matters. This is particularly true when seemingly instantaneous operations are repeated at large scale. The few seconds it takes to copy a single file can become weeks or months when enough files are involved and the target drive is remote from the source drive. Slow data transfers can postpone or even kill data migration or decommissioning projects.
The logic of normal file transfer software is fairly simplistic. It starts at the first file, copies and transfers it, then goes on to the next file and repeats. There is a much wiser and more efficient approach: read everything, then decide what to transfer.
Here are five steps that speed up file transfers of enterprise-scale data collections. As a general matter, transfers using this approach can be anywhere from two to eight times faster than brute-force approaches. Results will, of course, vary with several factors, including the read speed of the drives holding the source data, the available throughput on the network carrying the files, the write speed of the target drives, and the nature of the content.
- Inventory. Rather than initiate a brute-force transfer of files, inventory them so you know what you have.
- Dedupe. If a collection contains duplicate files, you don’t need to repeatedly transfer the same sequence of bits for each instance of each duplicate. Dedupe the files first and transfer only one copy of each unique file. Deduping can cut out as much as one-quarter to three-quarters of required transfers.
- DeNIST. It may not be necessary to transfer software executables and related documentation. For example, if files are being transferred for the purpose of an e-discovery review in litigation, there is no need to transfer files created by software publishers. Doing so would just drive up the costs of downstream storage and review. The NIST NSRL *1/ contains hash values of known software-related files and can be used to exclude software files from transfer if desired.
- Compress. Compressing the data stream using standard lossless compression algorithms boosts effective throughput, and decompressing the transmitted files on the target drive yields files that are bit-for-bit identical to the originals.
- Multi-Thread. For true enterprise-scale collections measured in petabytes, the transfer process should be multi-threaded so that more than one file at a time can be read, compressed, transferred, and decompressed. With appropriate tuning, a 10 Gigabit channel could transfer a petabyte in a little over 11 days. (For reference, 10 Gb/s is 1.25 GB/s, so 10^15 bytes at the raw line rate would take about 9.3 days before any protocol overhead.)
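The first three steps above (inventory, dedupe, DeNIST) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the article says "SHA hash" without naming a variant, so SHA-256 is assumed here; the function names and the `known_software_hashes` parameter are hypothetical; and loading an actual reference hash list such as the NSRL is not shown.

```python
import hashlib
import os
from collections import defaultdict


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files never load fully into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def inventory(root):
    """Step 1: walk the source tree and record path, size, and hash per file."""
    records = []
    for folder, _subfolders, names in os.walk(root):
        for name in names:
            path = os.path.join(folder, name)
            records.append({
                "path": path,
                "size": os.path.getsize(path),
                "sha256": sha256_of(path),
            })
    return records


def plan_transfer(records, known_software_hashes=frozenset()):
    """Steps 2 and 3: group instances by hash, skip known software hashes,
    and pick one representative path per remaining unique hash."""
    instances = defaultdict(list)
    for record in records:
        instances[record["sha256"]].append(record["path"])
    to_send = {
        digest: paths[0]
        for digest, paths in instances.items()
        if digest not in known_software_hashes  # assumed: caller-supplied set
    }
    return to_send, dict(instances)
```

Here `to_send` holds one path per unique file to transfer, while `instances` keeps every original location of every hash, which is the information a transfer log needs later.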
When transferring large numbers of files, interruptions are fairly common. Use processes that can resume after an interruption without having to start over, and schedule or throttle back transfers when network resources are being used for other essential tasks.
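A minimal sketch of a compressed, multi-threaded, resumable transfer might look like the following. Several things here are assumptions for illustration rather than anything the article prescribes: the target is treated as an ordinary writable path, compression and decompression both run in one process (in a real remote transfer they would sit at opposite ends of the network), `zlib` stands in for "standard compression algorithms", and the checkpoint file format and all function names are hypothetical. Throttling and scheduling are left out.

```python
import hashlib
import os
import zlib
from concurrent.futures import ThreadPoolExecutor

CHUNK = 1 << 20  # 1 MiB read size; an arbitrary choice for this sketch


def copy_compressed(src, dst):
    """Compress src chunk by chunk, decompress into dst, return dst's SHA-256."""
    packer = zlib.compressobj()
    unpacker = zlib.decompressobj()
    digest = hashlib.sha256()
    os.makedirs(os.path.dirname(dst) or ".", exist_ok=True)
    partial = dst + ".part"  # temporary name until the file is complete
    with open(src, "rb") as reader, open(partial, "wb") as writer:
        for chunk in iter(lambda: reader.read(CHUNK), b""):
            wire_bytes = packer.compress(chunk)      # what would cross the network
            plain = unpacker.decompress(wire_bytes)  # what the receiver would do
            digest.update(plain)
            writer.write(plain)
        plain = unpacker.decompress(packer.flush()) + unpacker.flush()
        digest.update(plain)
        writer.write(plain)
    os.replace(partial, dst)
    return digest.hexdigest()


def load_done(checkpoint):
    """Read the hashes already transferred in an earlier, interrupted run."""
    if not os.path.exists(checkpoint):
        return set()
    with open(checkpoint, encoding="utf-8") as handle:
        return {line.strip() for line in handle if line.strip()}


def transfer_all(jobs, checkpoint, workers=4):
    """jobs maps an expected SHA-256 to a (source path, target path) pair."""
    done = load_done(checkpoint)
    pending = {h: job for h, job in jobs.items() if h not in done}

    def run(item):
        expected, (src, dst) = item
        return expected, copy_compressed(src, dst) == expected

    sent = []
    with ThreadPoolExecutor(max_workers=workers) as pool, \
            open(checkpoint, "a", encoding="utf-8") as log:
        for expected, ok in pool.map(run, pending.items()):
            if ok:
                log.write(expected + "\n")  # only the main thread writes the log
                log.flush()
                sent.append(expected)
    return sent
```

Because a hash is appended to the checkpoint only after the received bytes hash to the expected value, rerunning `transfer_all` after an interruption skips completed files and retries the rest.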
Defensibility, Transparency, & Security
The most basic requirement to establish the defensibility, transparency, and security of any file transfer process is to create a complete transfer log of all files encountered, including the:
- SHA hash for each file
- Full folder path and file name of each instance of each file
- File size in bytes
This log becomes the audit trail or chain of evidence for the files transferred. This basic transfer manifest can be used in several important ways:
- Reduping. The transfer log can permit the repopulation of duplicates into the original folder structure if desired. It also shows where each instance of each file appeared in the original collection.
- Searchable File Paths/Names. The path/file name information can be parsed to make individual terms searchable, e.g., by splitting on underscores, slashes, or periods. This information can then be made accessible to end users by placing it in whatever content management or e-discovery review system is going to manage the content.
- Validating Drive Space Filled vs. Transferred. A complete log of all files on the source drives enables auditing of the files transferred and the total bytes transferred. This can reveal that some files were not transferred for reasons such as having non-standard characters in their names or having path/file name combinations that exceeded Windows’ 260-character limit.
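The three uses of the manifest listed above can be sketched as follows, assuming the log is kept as a CSV file with one row per file instance. The column names (`sha`, `path`, `size_bytes`), the CSV layout, and every function name are hypothetical choices for this illustration, not a format the article specifies.

```python
import csv
import io
import re
from collections import defaultdict

FIELDS = ["sha", "path", "size_bytes"]  # assumed column names


def write_manifest(rows, handle):
    """Write one row per file instance encountered on the source."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


def read_manifest(handle):
    """Load the manifest back; every value comes back as a string."""
    return list(csv.DictReader(handle))


def instances_by_hash(rows):
    """Reduping: list every original location of each unique hash."""
    locations = defaultdict(list)
    for row in rows:
        locations[row["sha"]].append(row["path"])
    return dict(locations)


def path_terms(path):
    """Split a path on slashes, backslashes, underscores, periods, and spaces."""
    return [term.lower() for term in re.split(r"[\\/_.\s]+", path) if term]


def audit(rows, transferred_hashes):
    """Compare the source inventory with the hashes that actually arrived."""
    missing = [row for row in rows if row["sha"] not in transferred_hashes]
    return {
        "source_bytes": sum(int(row["size_bytes"]) for row in rows),
        "missing_bytes": sum(int(row["size_bytes"]) for row in missing),
        "missing_paths": [row["path"] for row in missing],
    }
```

In this sketch, files that were deliberately excluded (for example by the DeNIST step) would also show up as missing in `audit`, so a real audit would need to account for them separately.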
*1/ NIST NSRL: http://www.nsrl.nist.gov/
For more information on managing unstructured content, download a free copy of Guide to Managing Unstructured Content for your personal use at http://beyondrecognition.net/download-john-martins-guide-to-managing-unstructured-content/.