In simplest terms, information security involves identifying and protecting information that could somehow damage an organization legally or competitively if it were misused. Achieving those objectives in unstructured content is far easier if the organization first classifies documents by document type and evaluates the types and levels of risk associated with each type. Once that is done there are many ways to protect that information.
For structured content in databases or business support systems, sensitive information can often be identified by examining field and table definitions, e.g., if a field holds social security numbers, that is obviously sensitive and access to that field must be tightly controlled.
It’s a different story with unstructured content like that held in file shares, scanned document repositories, and content management systems. That content is called “unstructured” because it can contain virtually any type of file or document, and identifying sensitive content in it is not nearly as neat and tidy as it can be in structured content.
However large the challenge, classification and assessment are vital. If individual files are not properly classified and identified as containing sensitive content, an organization can’t answer basic InfoSec questions like:
- When can we dispose of sensitive content so that it no longer poses a risk? If we can’t classify it, do we just keep everything forever?
- What level of security is needed on drives where the content is stored (e.g., is encryption necessary)?
- Who should have access to it?
- Should we log such access?
Furthermore, without fact-based assessments of what is sensitive, why it’s sensitive, and what type of documents are involved, an organization cannot comply with basic information security best practices, such as:
- Provide access to sensitive information only to those who need it.
- Provide levels of storage security commensurate with the risk.
- Dispose of sensitive data when it is no longer needed.
Without classification and risk assessment, access is invariably too broad: too many people are given too much access to too much content. (Questions for which we don’t have answers: Was Snowden that great of a hacker or was he given access to far more content than he had a need to see? To the extent hacking was involved, was it made far easier because of the number of people with access?)
The Need for Document-Type Classification and Risk Assessment
Setting security at the level of an entire collection is clearly not the optimal approach, and organizations have tried two different approaches to document-level classification: manual classification and text analytics. Neither of them has yielded satisfactory results.
Manual document classification schemes don’t work because of the difficulty in achieving compliance with classification instructions and the inherent inconsistency of subjective classifications even with full compliance.
Automated text analytics approaches use techniques like customized taxonomies to attempt to identify sensitive content. These avoid the compliance issue of manual classification but are expensive and time-consuming to create, generally imprecise, and difficult to update in the face of ever-evolving content. One of the biggest problems with text analytics is that many files or documents don’t have sufficient good-quality text for text analysis, either because PDFs were saved from applications that didn’t embed text or because of paper scanning or faxing operations. In some organizations, 30% or more of the documents have no text at all, or the sensitive information in them has no associated text, e.g., forms completed in handwriting.
New Classification Technology. Visual classification is a new risk classification technology that avoids the limitations of manual and text-based classification. It begins by clustering native electronic files and scanned or faxed documents based on their visual appearance. Because all documents have visual representations but may not have analyzable text, visual classification is far more comprehensive than text-based systems. It is definitive and classifies all documents, not just documents with good quality text.
Visual clustering is automatic and the clusters are self-forming, meaning there are no rules to write, no exemplars to select, and no upfront work to start the clustering. The number of clusters is generally less than one percent of the number of files or documents examined, and by evaluating the largest clusters first, well over 99% of the documents in a collection can typically be examined in less than three days. Decisions made on the examined clusters need not be repeated when more documents are later added to the clusters.
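The largest-first review order described above can be sketched in a few lines of Python. This is only an illustration under assumed inputs, not the product's implementation: the function name, the cluster ids, and the cluster sizes are all hypothetical.

```python
def clusters_to_review(cluster_sizes, target_coverage):
    """Pick clusters, largest first, until the target share of documents is covered.

    cluster_sizes: dict mapping a (hypothetical) cluster id to its document count.
    target_coverage: fraction of all documents to cover, e.g. 0.95.
    Returns (selected cluster ids, fraction of documents actually covered).
    """
    total = sum(cluster_sizes.values())
    if total == 0:
        return [], 0.0

    # Largest clusters first, so each review decision covers the most documents.
    ordered = sorted(cluster_sizes.items(), key=lambda kv: kv[1], reverse=True)

    selected = []
    covered = 0
    for cluster_id, size in ordered:
        if covered / total >= target_coverage:
            break
        selected.append(cluster_id)
        covered += size
    return selected, covered / total


# Made-up sizes: a few large clusters hold most of the documents.
sizes = {"invoices": 900, "timesheets": 90, "memos": 9, "misc": 1}
selected, coverage = clusters_to_review(sizes, 0.95)
```

With these made-up sizes, two of the four clusters already account for 99% of the documents, which is the effect that makes reviewing the largest clusters first so efficient.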
Because of the low time demands on reviewing visually similar clusters, it is feasible to have more than one risk specialist working with other domain experts to perform the initial cluster evaluations. The whole team can examine the same documents on a shared screen or monitor, ensuring consistency and consensus across the organization. The initial evaluations are for disposition, risk, and document-type designations:
Disposition. All documents in a cluster are alike, and reviewing one or two documents per cluster often suffices for making disposition decisions about the entire cluster. If documents in a cluster serve no business, regulatory, or legal purpose they can be disposed of, ending any ongoing risk associated with maintaining them, and freeing up expensive storage infrastructure.
InfoSec Risk. Risk assessment specialists can view one or two documents per retained cluster to determine whether documents in that cluster are apt to pose any sort of information security risk. They can then tag those clusters to identify the type of risk in each one, e.g., health information might be assigned a HIPAA tag or credit card information could be assigned a PCI tag.
The cluster risk evaluation is augmented by regular expression searches (e.g., search for social security numbers by looking for NNN-NN-NNNN where “N” is a digit) and dictionary-based searches (e.g., search for any term used to describe proprietary chemical formulations) to find sensitive information that may occur in retained clusters. In other words, there are two ways to identify sensitive content – through cluster review and by text-based searching. Those two techniques provide the most comprehensive approach to identifying sensitive content.
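The two text-based searches mentioned above can be sketched as follows. This is a minimal illustration using Python's standard `re` module; the dictionary terms and the `find_sensitive` helper are hypothetical, and the pattern only checks the NNN-NN-NNNN layout, not whether a number was ever issued.

```python
import re

# NNN-NN-NNNN where N is a digit. The \b word boundaries stop the pattern
# from matching inside a longer run of digits.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

# Hypothetical dictionary of proprietary terms; a real list would come from
# the organization's own subject-matter experts.
SENSITIVE_TERMS = {"catalyst blend", "formulation x-17"}


def find_sensitive(text):
    """Return a list of tags for sensitive content found in the text."""
    hits = []
    if SSN_PATTERN.search(text):
        hits.append("SSN")
    lowered = text.lower()  # case-insensitive dictionary lookup
    for term in sorted(SENSITIVE_TERMS):
        if term in lowered:
            hits.append("TERM:" + term)
    return hits
```

A search like this only works on documents that have extractable text, which is why it supplements cluster review rather than replacing it.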
Designating Document-Types. Content that is to be retained is also assigned a document-type label based on a three-level document-type tree that is generally set up with levels for business unit, document type, and sub-document type. These document-level classifications can be used not only to set retention periods but also to control access to all content, not just content with explicitly identified sensitive information. For example, in an energy company an HR manager should not have access to seismic data or right-of-way easements for pipelines, and a drilling rig roustabout should not have access to managers’ performance reports.
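One way to picture how a three-level label can drive access decisions is the sketch below. It is an assumption-laden illustration, not a description of any particular product: the role names, the label values, and the prefix-matching rule are all hypothetical.

```python
from typing import NamedTuple


class DocType(NamedTuple):
    """Three-level document-type label: business unit, type, sub-type."""
    business_unit: str
    document_type: str
    sub_type: str


# Hypothetical rules: each role lists the label prefixes it may read.
# A one-element prefix grants a whole business unit; a longer prefix
# narrows access to one document type within it.
ACCESS_RULES = {
    "hr_manager": [("HR",)],
    "geophysicist": [("Exploration", "Seismic Data")],
}


def can_access(role, label):
    """True if any prefix granted to the role matches the start of the label."""
    prefixes = ACCESS_RULES.get(role, [])  # unknown roles get no access
    return any(label[:len(prefix)] == prefix for prefix in prefixes)


report = DocType("HR", "Performance Report", "Manager")
survey = DocType("Exploration", "Seismic Data", "3D Survey")
```

Under these made-up rules the HR manager can read the performance report but not the seismic survey, and a role with no entry, such as a roustabout, can read neither.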
Benefits of DocType Classification and Risk Assessment
Once files and documents have been clustered and evaluated for retention, risk, and document-type designations, the organization is able to fully answer the basic information security questions and adhere to InfoSec best practices:
- It has disposed of content that serves no business, regulatory, or legal purpose.
- It knows what types of documents it has and how long they should be kept.
- It knows which types of documents require special storage precautions.
- It knows who should be able to access which documents.
For a more detailed explanation of visual classification, see the Technology page.
If you would like more information on how visual classification can help your organization achieve document type risk assessment and put you in control of information security for your “unstructured” content, please contact us at IGDoneRight@BeyondRecognition.net.