The decreasing cost of cloud storage now enables companies to migrate backup tape data to the cloud and lower their overall operating costs. Companies providing the software and hardware to read tapes and upload their contents have essentially virtualized the existing backup-to-tape paradigm: they perform the same functions, just using the cloud as the backup location instead of tape jukeboxes or cartridges. Whatever would have been on each tape simply gets pushed to the cloud.
This post suggests a new approach: backing up only the content that matters. It details how BeyondRecognition (“BR”) achieves this by:
- Eliminating redundant or duplicative files before uploading them,
- Screening files so that files with no organizational value are defensibly disposed of prior to upload, and
- Permitting clients to apply document-type labels that support granular records retention schedules.
In the following sections we describe the various processing options and how to move forward with evaluating BR technology.
BR’s Basic Tape-to-Cloud Technology (the “Vanilla” Option)
This is how the basic BR-to-Cloud process works:
- A BR Restoration Server is connected locally to the present tape backup system (benefit: no data transmission or upload costs).
- Tape content is restored to the Restoration Server.
- Another BR server, the Collector, calculates and records the SHA hash values for each of the files and logs each file.
- The BR Collector identifies known system files, such as software executables, and updates the log, but does not move copies of system files onto the Collector. The list BeyondRecognition uses is an expanded version of the software file list published by the National Institute of Standards and Technology (“NIST”).
- The BR Collector moves one copy of each non-system file with a unique hash value onto the Collector and updates its log to record every location and time at which that hash value was encountered.
- After the Collector has copied all unique, non-system files, the restored contents are removed from the Restoration Server and another tape is restored, at which point steps three through six are repeated.
- After the tapes have been processed, the unique, non-system files are uploaded to the Cloud, along with the logs detailing where and when the files were originally located. In the event a server or other storage device has to be restored from the cloud backup data, the non-system files could be restored from either the Collector Server or the Cloud, using the logs to determine which files need to be restored to which location.
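The hash, log, and single-copy steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not BR's implementation: the source says "SHA" without naming a variant, so SHA-256 is assumed here, and `process_tape`, `restore_plan`, the in-memory `(path, bytes)` file pairs, and the tape labels are all hypothetical names chosen for the example.

```python
import hashlib


def process_tape(tape_label, files, system_hashes, unique_store, log):
    """Log every file from one restored tape; keep one copy per new hash.

    files is an iterable of (path, content_bytes) pairs, a hypothetical
    in-memory stand-in for the restored tape contents.
    """
    for path, content in files:
        digest = hashlib.sha256(content).hexdigest()  # SHA-256 is an assumption
        is_system = digest in system_hashes
        # Every file is logged, including duplicates and system files.
        log.append({"tape": tape_label, "path": path,
                    "sha256": digest, "system": is_system})
        # Only the first copy of each non-system hash is retained.
        if not is_system and digest not in unique_store:
            unique_store[digest] = content


def restore_plan(tape_label, log):
    """List (path, hash) pairs needed to rebuild one tape's non-system files."""
    return [(rec["path"], rec["sha256"]) for rec in log
            if rec["tape"] == tape_label and not rec["system"]]


# Hypothetical example: two tapes with overlapping content.
system_hashes = {hashlib.sha256(b"os-binary").hexdigest()}
unique_store, log = {}, []
process_tape("tape-01", [("docs/a.txt", b"alpha"), ("bin/os", b"os-binary")],
             system_hashes, unique_store, log)
process_tape("tape-02", [("docs/a.txt", b"alpha"), ("docs/b.txt", b"beta")],
             system_hashes, unique_store, log)

print(len(log), "files logged;", len(unique_store), "unique files kept")
# prints "4 files logged; 2 unique files kept"
```

Because the log keeps every original location while the store keeps one copy per hash, `restore_plan("tape-02", log)` can still name both files that were on the second tape even though only one of them was new.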
Savings: Given the high level of duplication from tape to tape, storing only single instances of files with unique hash values takes far less space than storing largely duplicative data sets. Actual metrics will vary depending on how many tapes are stored and which periods they cover, but the deduplicated store could easily need under 5% of the space required for complete backup copies. The bandwidth needed to upload the files also drops sharply, and those savings accrue each time an update is sent to the cloud.
The graphic to the right presents the simplified case where data is doubling every 18 months. By identifying only content files that are new on each tape there is a dramatic drop in the total volume uploaded each month. In actual practice there would be daily and weekly backups in addition to monthly and the total number of tapes being held could vary, but the overall savings are generally consistent across a wide range of variables.
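The simplified doubling case can be worked through numerically. The sketch below assumes monthly full backups only, steady doubling every 18 months, and that existing files never change or disappear, so each month's new content equals the growth in total volume. The starting volume of 10 TB and the function names are hypothetical, and real results depend on the backup schedule and retention period.

```python
def monthly_volumes(initial_tb, months, doubling_months=18.0):
    """Total data volume at each monthly backup under steady doubling."""
    return [initial_tb * 2 ** (m / doubling_months) for m in range(months)]


def compare_uploads(initial_tb, months, doubling_months=18.0):
    """Return (full_backup_total, new_content_total) in terabytes."""
    volumes = monthly_volumes(initial_tb, months, doubling_months)
    full_total = sum(volumes)  # every monthly backup uploaded whole
    # First month uploads everything; later months upload only the growth.
    new_only = [volumes[0]] + [b - a for a, b in zip(volumes, volumes[1:])]
    return full_total, sum(new_only)


full, new = compare_uploads(initial_tb=10.0, months=12)
print(f"full backups: {full:.1f} TB, new content only: {new:.1f} TB "
      f"({new / full:.0%} of full)")
```

Under these assumptions the new-content total is simply the size of the most recent backup, while the full-backup total grows with every tape retained, so the gap widens as more backup generations are kept.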
BR’s Intermediate Tape-to-Cloud Migration: Adding Visual Classification (the “Chocolate” Option)
The Intermediate or Visual Classification version of BR’s tape-to-cloud offering performs the Basic steps described above to remove system files and duplicate files, but adds a visual classification filter.
Visual classification groups visually similar documents based on visual representations of them. Unlike other file classification systems, BR does NOT use text to initially cluster or group the documents. The process works on scanned or faxed documents as well as native files and files saved to PDF.
In the Visual Classification version of the process, the BR Collector Server clusters visually similar files. This is what one set of visually clustered oil & gas documents looks like:
And here is what a set of visually-clustered mortgage documents looks like:
While documents from different industries or sectors will cluster differently, the key point is that subject matter experts can quickly review the clusters to determine whether they have any ongoing business, regulatory, or legal value. If they don’t, they can be tagged for defensible disposition before being uploaded. In some collections up to 70% of the files can be deleted because they are not “records.”
BR’s Intermediate Tape-to-Cloud Migration with Document-Type Designations Added (the “Chocolate plus Sprinkles” Option)
Once the BR server has clustered documents based on their visual similarity, subject matter experts can use a BR interface to designate document-type labels for each cluster. Associating document types with document clusters makes it possible to determine where to store files and who should have access to them. Document typing also enables the use of granular retention schedules and greatly improves the specificity of searching.
This is how document type designations could be added to the visual clusters displayed earlier:
Here are the document type designations that might be applied to the mortgage industry documents:
PII. Another capability that can be applied at this stage is to indicate which document types typically contain personally identifiable information (PII) and then afford those documents appropriate levels of security.
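One way to picture how document-type labels drive retention and PII handling is a small policy table keyed by document type. Everything in this sketch is hypothetical: the type names, the retention periods, and the `DocTypePolicy` structure are illustrative assumptions, not BR's schema or any actual retention schedule.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DocTypePolicy:
    retention_years: int
    contains_pii: bool


# Hypothetical policies; real schedules come from the records program.
POLICIES = {
    "Loan Application": DocTypePolicy(retention_years=7, contains_pii=True),
    "Well Log": DocTypePolicy(retention_years=30, contains_pii=False),
}


def add_years(start, years):
    """Add whole years, mapping Feb 29 to Feb 28 in non-leap target years."""
    try:
        return start.replace(year=start.year + years)
    except ValueError:
        return start.replace(year=start.year + years, day=28)


def disposition_date(doc_type, created):
    """Earliest date a document of this type becomes eligible for disposal."""
    return add_years(created, POLICIES[doc_type].retention_years)


def needs_restricted_storage(doc_type):
    """True when the document type is flagged as typically containing PII."""
    return POLICIES[doc_type].contains_pii


print(disposition_date("Loan Application", date(2015, 3, 1)))
print(needs_restricted_storage("Well Log"))
```

Because the policy attaches to the document type rather than to individual files, labeling one cluster applies the retention period and security flag to every document in it.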
BR Complete Tape-to-Cloud Processing: Attribution and Content-Enablement (the “Strawberry” Option)
The complete BR process also adds the ability to extract document attributes from documents being retained, e.g., to capture the loan applicant’s name and social security number from a loan application.
Being able to extract specific document attributes that are unique to each document type can greatly increase the informational value of those documents, making it much easier to integrate them into business processes. The resultant fielded information makes it possible to perform many quality assurance steps programmatically rather than manually, e.g., the program could check to make sure that the loan applicant’s social security number is the same on all the pertinent documents in the loan file.
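The Social Security number check mentioned above is a good example of quality assurance that becomes a few lines of code once attributes are fielded. The sketch below assumes each document is a dictionary with an `id` and an `attributes` mapping; that record shape, the field name `ssn`, and the function names are hypothetical, and the numbers are placeholders rather than real data.

```python
import re
from collections import defaultdict


def normalize_ssn(value):
    """Keep digits only so '000-00-0001' and '000 00 0001' compare equal."""
    return re.sub(r"\D", "", value)


def find_mismatches(documents, field, normalize=lambda v: v):
    """Group document ids by normalized field value; return {} when all agree."""
    seen = defaultdict(list)
    for doc in documents:
        value = doc["attributes"].get(field)
        if value is not None:  # documents lacking the field are skipped
            seen[normalize(value)].append(doc["id"])
    return dict(seen) if len(seen) > 1 else {}


# Hypothetical loan file with placeholder values.
loan_file = [
    {"id": "application", "attributes": {"ssn": "000-00-0001"}},
    {"id": "credit-report", "attributes": {"ssn": "000 00 0001"}},
    {"id": "disclosure", "attributes": {"ssn": "000-00-0002"}},
    {"id": "appraisal", "attributes": {}},
]

print(find_mismatches(loan_file, "ssn", normalize_ssn))
```

An empty result means every document that carries the field agrees; a non-empty result shows which documents hold which value, so a reviewer can go straight to the outlier.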
To extract attributes, document analysts simply click and drag boxes around the document image zone containing the desired information. They can use a number of delimiters to specify the exact content within the zone, and a number of filters to convert the data, e.g., dates, to a common format. Clicking and dragging on one document in a cluster serves to extract those attributes from all documents in the cluster, even those added to the cluster at some future point in time.
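The delimiter and filter idea can be illustrated with plain string handling. This sketch assumes the text inside each zone has already been read from the image, and it does not model the click-and-drag interface or BR's actual delimiter and filter options. The function names, the list of accepted date formats, and the sample zone texts are hypothetical.

```python
from datetime import datetime

# Hypothetical set of input formats the date filter accepts.
DATE_FORMATS = ("%m/%d/%Y", "%Y-%m-%d", "%d.%m.%Y")


def text_after(zone_text, delimiter):
    """Delimiter step: keep only what follows the delimiter, if present."""
    _, found, rest = zone_text.partition(delimiter)
    return rest.strip() if found else zone_text.strip()


def to_iso_date(raw):
    """Filter step: convert a recognized date to YYYY-MM-DD, else None."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def extract_from_cluster(zone_texts, delimiter=":"):
    """Apply one zone rule to the same zone on every document in a cluster."""
    return [to_iso_date(text_after(text, delimiter)) for text in zone_texts]


# Hypothetical zone contents from three documents in one cluster.
zones = ["Application Date: 3/7/2014", "Application Date: 2014-03-09", "Date: n/a"]
print(extract_from_cluster(zones))
```

Defining the rule once and mapping it over the cluster mirrors the point made above: a document added to the cluster later simply becomes one more input to the same function.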
Here is an example of attribution from a loan file document:
This is what it might look like on a well log in the oil & gas industry:
For more information on the BeyondRecognition approach to migrating tape to cloud, contact BR at IGDoneRight@BeyondRecognition.net.