A popular quote on data transfer from the 1980s reads:
“Never underestimate the bandwidth of a station wagon full
of tapes hurtling down the highway.” – Andrew Tanenbaum
Tanenbaum’s point remains true today: the physical transfer of media containing information can be faster than using the Internet. Of course, today we use FedEx and removable drives, not allegorical station wagons and tapes.
To examine why Tanenbaum’s statement remains true today, let’s consider an organization that wants to move a petabyte of information from one network to another or to the cloud. The following steps would normally be considered:
- Read the data from the current drive
- Transfer the data over the current LAN
- Transfer over a WAN or the Internet
- Transfer across the target LAN, and
- Write the data to the target drive
The bottlenecks usually occur when transferring data over LANs and the Internet, where speeds are measured in megabits per second (Mbit/sec). *1/ Here are some estimated times to transfer a petabyte over normal-capacity LANs and Internet connection speeds, assuming full capacity was available for the transfer: *2/
Of course, organizations typically don’t have gigabits of spare capacity lying about, and they can’t shut down day-to-day operations just to migrate some data. Also, note that actual achieved throughput may be much lower, so transfers may take far longer than shown above. For example, transferring large numbers of small files can cause significant disk I/O operations that substantially degrade performance.
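As a rough illustration of the arithmetic behind estimates like these, here is a minimal Python sketch. It is not the online tool cited in the footnotes. It assumes a decimal petabyte (10^15 bytes), that the entire link is available, and that the 10% overhead simply reduces usable throughput by 10%. The function name and the three link speeds are hypothetical choices for illustration.

```python
PETABYTE_BYTES = 10**15  # assumption: decimal petabyte


def transfer_days(size_bytes, link_mbit_per_sec, overhead=0.10):
    """Days needed to move size_bytes over a link of the given speed.

    Assumes the full link is dedicated to the transfer and that
    `overhead` is the fraction of capacity lost to protocol and
    administrative traffic (an assumption, not a measured value).
    """
    usable_bits_per_sec = link_mbit_per_sec * 1_000_000 * (1 - overhead)
    seconds = size_bytes * 8 / usable_bits_per_sec  # 8 bits per byte
    return seconds / 86_400  # seconds in a day


if __name__ == "__main__":
    # Illustrative link speeds only; substitute your own.
    for mbit in (100, 1_000, 10_000):
        days = transfer_days(PETABYTE_BYTES, mbit)
        print(f"{mbit:>6} Mbit/sec: {days:,.1f} days")
```

Under these assumptions, even a dedicated 1,000 Mbit/sec link needs on the order of a hundred days to move a petabyte.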
Achieving the speeds theoretically available at high throughputs could involve budgeting for, acquiring, installing, and testing higher-capacity network components, and they may be unneeded after the transfer. To test your current network and Internet capacity, time how long it takes to transfer just a terabyte of information at different times of the day or week.
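A timed one-terabyte test can be scaled up to a petabyte with simple arithmetic. The sketch below assumes decimal units (1 PB = 1,000 TB) and that the measured rate would hold for the whole transfer, which is optimistic. The function names and the three-hour sample are hypothetical.

```python
def petabyte_days_from_sample(sample_seconds, sample_terabytes=1.0):
    """Extrapolate a timed test transfer to a full petabyte, in days."""
    seconds_per_tb = sample_seconds / sample_terabytes
    return seconds_per_tb * 1000 / 86_400  # 1,000 TB per PB, 86,400 s per day


def effective_mbit_per_sec(sample_seconds, sample_terabytes=1.0):
    """Throughput actually achieved during the test, in Mbit/sec."""
    bits = sample_terabytes * 10**12 * 8
    return bits / sample_seconds / 1_000_000


if __name__ == "__main__":
    # Hypothetical measurement: 1 TB took 3 hours.
    sample_seconds = 3 * 3600
    print(f"Effective rate: {effective_mbit_per_sec(sample_seconds):.1f} Mbit/sec")
    print(f"Petabyte estimate: {petabyte_days_from_sample(sample_seconds):.0f} days")
```

Running the test at several times of the day or week gives a range of estimates rather than a single number.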
The Direct Attach Alternative
There is a far less time-consuming and much more affordable solution: attach drives directly to the current server and copy the data without going over a LAN or the Internet. Once the files are copied to the attached drive, ship the drive overnight and attach it to the target system. Note that major cloud providers have procedures in place to accept physical shipment of drives.
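One way to compare this approach with a network transfer is to express the whole copy, ship, and copy cycle as an effective bandwidth, in the spirit of Tanenbaum's quote. The sketch below is illustrative only. The function name and the hours in the example are hypothetical, and it assumes decimal terabytes and steps that run one after another.

```python
def shipped_effective_mbit_per_sec(size_tb, copy_out_hours,
                                   shipping_hours, copy_in_hours):
    """Effective bandwidth of copying to drives, shipping, and copying back.

    Assumes the three steps are strictly sequential and that
    size_tb is in decimal terabytes (10**12 bytes).
    """
    total_seconds = (copy_out_hours + shipping_hours + copy_in_hours) * 3600
    bits = size_tb * 10**12 * 8
    return bits / total_seconds / 1_000_000


if __name__ == "__main__":
    # Hypothetical figures: 1,000 TB, 200 hours to copy out,
    # 24 hours in transit, 200 hours to copy in at the target.
    rate = shipped_effective_mbit_per_sec(1000, 200, 24, 200)
    print(f"Effective rate: {rate:,.0f} Mbit/sec")
```

Comparing the result with the spare capacity actually available on the network shows whether shipping drives is worth considering for a given project.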
Not all drive connection interfaces are equal. Here are some data transfer capacities for different interfaces, using comparable assumptions: *2/
Of course, even on an individual server, there can be resource contention as other applications compete for resources like memory, drive read, and drive write functions, and there can be other hardware limitations like drive write time. However, at least the network and Internet resource contention issues have been removed.
Big Difference: Think Parallel
One of the big differences with the direct-attach alternative is that organizations can perform parallel copying by attaching more than one drive per server or attaching drives to more than one server at a time, further slashing overall times to project completion.
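The effect of parallel copying can be estimated with a short calculation. This sketch assumes the data splits evenly across streams, that each stream sustains the same write rate, and that the source system's total read rate can cap the aggregate. All names and figures are hypothetical, and units are decimal (1 TB = 1,000,000 MB).

```python
def parallel_copy_hours(size_tb, drive_mb_per_sec, streams,
                        source_mb_per_sec=None):
    """Hours to copy size_tb using several drives written in parallel.

    drive_mb_per_sec is the sustained rate of one drive, in megabytes
    per second. source_mb_per_sec, if given, is an assumed ceiling on
    how fast the source can be read in total.
    """
    aggregate = drive_mb_per_sec * streams
    if source_mb_per_sec is not None:
        aggregate = min(aggregate, source_mb_per_sec)
    seconds = size_tb * 1_000_000 / aggregate
    return seconds / 3600


if __name__ == "__main__":
    # Hypothetical: 1,000 TB at 200 MB/sec per drive.
    for streams in (1, 4, 8):
        hours = parallel_copy_hours(1000, 200, streams)
        print(f"{streams} stream(s): {hours:,.0f} hours")
    # Same job with the source limited to 1,000 MB/sec in total.
    print(f"8 streams, capped source: "
          f"{parallel_copy_hours(1000, 200, 8, 1000):,.0f} hours")
```

The capped case shows why adding drives stops helping once the source system itself becomes the bottleneck.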
Using direct-attach drives and then transporting them will often considerably shorten the time it takes to complete large-scale data transfers. There are many potentially complicating factors, but copying to direct-attach drives and then shipping the drives should always be considered.
*1/ To translate megabits per second to megabytes per second, divide Mbit/sec by 8.
*2/ The calculations are from the online tool at https://techinternets.com/copy_calc with only a 10% administrative overhead, which could be fairly low.
*3/ Thunderbolt specs taken from https://www.cnet.com/how-to/usb-type-c-thunderbolt-3-one-cable-to-connect-them-all/. Throughput estimates based on 10% overhead which may be optimistic.
Tanenbaum Quote: https://en.wikiquote.org/wiki/Andrew_S._Tanenbaum
For more information on managing unstructured content, download your free, personal-use copy of Guide to Managing Unstructured Content at http://beyondrecognition.net/download-john-martins-guide-to-managing-unstructured-content/
The Guide presents the RCAV model for analyzing unstructured content issues.
The discussion on rationalization includes sections on inventorying and making single-instance copies of unique content.