The legal and reputational risks associated with mismanaging PII and other sensitive data are well known, and one of the most challenging areas is managing PII in “unstructured” content – the files and documents found on file shares, local drives, and removable media.
You know that PII (or PCI or PHI or IP) is in there, but exactly where? You can’t lock down and manage all content as if it were PII.
The good news is that practically all PII, PCI, or PHI in unstructured content sits in completely predictable places. For example, Social Security numbers appear on W-2s, 1099s, loan applications, and credit check authorizations; PHI is contained in medical histories, lab results, and prescriptions.
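To illustrate how predictable this mapping is, the examples above can be written as a simple lookup table. This is only a minimal sketch: the names `SENSITIVE_DATA_BY_DOC_TYPE` and `sensitive_categories` are hypothetical, and the document-type labels are just the examples named in the text, not a complete taxonomy.

```python
# Hypothetical lookup built only from the examples named in the text.
# A real deployment would hold many more document types.
SENSITIVE_DATA_BY_DOC_TYPE = {
    "W-2": {"PII"},
    "1099": {"PII"},
    "loan application": {"PII"},
    "credit check authorization": {"PII"},
    "medical history": {"PHI"},
    "lab result": {"PHI"},
    "prescription": {"PHI"},
}


def sensitive_categories(doc_type: str) -> set:
    """Return the sensitive-data categories expected in a document type.

    An unknown document type returns an empty set, which means "no
    expectation recorded", not "guaranteed free of sensitive data".
    """
    # Copy the stored set so callers cannot mutate the lookup table.
    return set(SENSITIVE_DATA_BY_DOC_TYPE.get(doc_type, set()))
```

With a table like this, knowing a file's document type is enough to know what kind of sensitive data to expect in it.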
The key is to assign a specific document type to each file or document stored in your “unstructured” content. There are many challenges in doing this, including the sheer scale of collections spanning years or even decades with multiple file types, multiple sources, and multiple file creation or acquisition technologies.
But there is more good news: visual classification groups or clusters files and documents consistently and automatically across file types. Scanned or faxed versions of documents are classified the same as the native file or PDF versions.
Once files have been classified, several actions can be taken to protect PII (or PCI or PHI):
- Delete document types with no ongoing operational, regulatory, or legal value. If they aren’t there, they can’t be stolen or misused! Eliminating duplicative or unneeded files can dramatically reduce the number of files and the drive space required to store them.
- Apply granular retention schedules and make provision for disposition at a fixed point in time, so that the over-preservation problem does not recur.
- Store each document type in the level of secured storage that its content warrants.
- Restrict access to specific document types to those with a legitimate need to see them.
- Make redacted copies for certain uses, using either automated text-based redaction or automated cluster-based zonal redaction.
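
The actions above can be sketched as a per-document-type policy table plus a simple text-based redaction step. This is a hedged illustration rather than any product's actual implementation: `HandlingPolicy`, `POLICIES`, `can_access`, and `redact_ssns` are hypothetical names, and the retention periods, storage tiers, and roles are placeholder values, not legal or regulatory guidance. Cluster-based zonal redaction is not shown.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HandlingPolicy:
    delete: bool                    # True if the type has no ongoing value
    retention_years: Optional[int]  # None when the type is deleted outright
    storage_tier: str               # e.g. "encrypted" or "standard"
    allowed_roles: frozenset        # roles with a legitimate need to see
    redact_copies: bool             # make redacted copies for certain uses


# Placeholder policies for illustration only; every value is an assumption.
POLICIES = {
    "W-2": HandlingPolicy(False, 7, "encrypted", frozenset({"payroll", "hr"}), True),
    "lab result": HandlingPolicy(False, 10, "encrypted", frozenset({"clinical"}), True),
    "expired marketing flyer": HandlingPolicy(True, None, "standard", frozenset(), False),
}

# Matches Social Security numbers written as 123-45-6789 only.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def can_access(doc_type: str, role: str) -> bool:
    """Allow access only if the role is listed for this document type.

    Unclassified document types are denied by default.
    """
    policy = POLICIES.get(doc_type)
    return policy is not None and role in policy.allowed_roles


def redact_ssns(text: str) -> str:
    """Text-based redaction: mask hyphenated SSNs in extracted text."""
    return SSN_PATTERN.sub("XXX-XX-XXXX", text)
```

For example, `redact_ssns("SSN 123-45-6789 on file")` returns `"SSN XXX-XX-XXXX on file"`. The pattern deliberately covers only the hyphenated format; unhyphenated or spaced SSNs would need additional patterns and validation to avoid false matches.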
Related Posts: Documents ARE Structured, Just Heterogeneously
42-Second Video Clip: Classify and Label
Free Whitepaper: “Boosting Detection and Protection of PII in Unstructured Content”
Prezi used by John Martin at his presentation to the New York City ARMA chapter: “Detecting and Protecting PII”