Background. The Electronic Discovery Reference Model (“EDRM”) is an e-discovery industry standards-setting group, and the EDRM Enron Email Data Set v2 (“EDRM Data”) is a collection of documents originally gathered by the Federal Energy Regulatory Commission (“FERC”) as part of its investigation of Enron’s energy trading practices and later made public by FERC. EDRM Data is a reworked version of the original documents, with a label added to each email that reads,

“EDRM Enron Email Data Set has been produced in EML, PST and NSF format by ZL Technologies, Inc. This Data Set is licensed under a Creative Commons Attribution 3.0 United States License <>. To provide attribution, please cite to ZL Technologies, Inc. (”

EDRM served as a direct download point for the EDRM Data for a period of time and later moved it to Amazon Web Services for downloading.
Breach Discovery. While working with the EDRM Data that we downloaded from the EDRM website, BeyondRecognition discovered that it contained more than 7,500 instances of unredacted social security numbers, credit card numbers, dates of birth, home addresses and phone numbers – a startling breach of privacy. Most of the data breach victims were Enron employees, but the victims also included spouses and children of employees as well as third-party contractors.


There are some lessons that can be drawn from these breaches of personal privacy:
PII Protection Step. The primary lesson is that, regardless of how documents are selected for production, there needs to be a separate and final stage to screen for personally identifiable information (“PII”) such as social security numbers. Once PII is located it should be redacted or, as a less attractive alternative, the producing party should obtain a protective order covering those documents and apply a special legend to them prior to production.

Redaction. The reason for favoring redaction is that it does the best job protecting the PII of the individuals mentioned in the documents. There is no reason why PII should be viewable by the opposing party’s lawyers, paralegals, database administrators, hosting providers, consultants, or ESI processing vendors. Redaction of all PII produced to another party also furthers the general privacy policy interests behind section 205(c)(3) of the E-Government Act of 2002, Public Law No. 107–347, that led to the adoption of Rule 5.2 of the Federal Rules of Civil Procedure, Rule 49.1 of the Federal Rules of Criminal Procedure and Rule 9037 of the Federal Rules of Bankruptcy Procedure. Those rules require the redaction of PII on documents filed with the indicated courts or the filing of those documents under seal.

Redaction also takes PII risks off the table as to post-production breaches – the federal and state laws pertaining to disclosure of PII won’t be triggered by subsequent data breaches involving the redacted documents.

Assign Responsibility. Another lesson is to assign PII responsibility to a specific person – otherwise it can easily be overlooked by litigation teams whose members are each focused on their own specific issues. Many research teams have pored over the EDRM Data for years, but nobody was specifically tasked with identifying and removing PII. For example, the NIST-sponsored Text Retrieval Conference (“TREC”) Legal Track for 2010 and 2011 used that data set and, for one or both years, included teams from the United States, Canada, Australia, Greece, India, Israel and China.

BeyondRecognition is no exception – we had worked with the data set for various purposes over several months without checking for PII. We only made the discovery as part of testing our mass redaction tool – we ran it out of curiosity never really expecting to find significant numbers of social security numbers. When you’re looking for PII, it’s not hard to find at least some of it, and what you find can inform how much further effort is required.

Use Best Practices. A third lesson is to create a standard, best practices approach to use on every case. While each case may present unique problems, there will be common elements and there ought to be a baseline approach so teams are not continuously reinventing the wheel. The baseline can include a number of techniques for identifying documents containing PII.


Ask Custodians. The most fundamental (and non-technical) method is for the producing party to use a custodian interview checklist so that custodians are always asked about the PII that might be found in their documents. This is especially important considering that some of those documents or files may not be amenable to text searching or text analysis.
Simple Searches. The simplest computer-based technique and one that is available in all e-discovery review platforms is to have stored queries that are run against all documents prior to production, e.g. search for “social security number” or “SSAN” or “SSNo” or “SS#” or “Soc Sec No” or “Visa” or “M/C”. Producing parties familiar with the types of their documents that typically contain PII (e.g., employment application forms) can develop stored searches that identify those document types.
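
As a rough illustration of the stored-query idea, the sketch below runs a fixed list of terms against document text and reports which documents hit on which terms. It is only a sketch under stated assumptions: the term list is taken from the examples above, while the function names and the in-memory dictionary of documents are hypothetical stand-ins for whatever a review platform actually provides.

```python
import re

# Stored query terms, taken from the examples in the text above.
PII_TERMS = ["social security number", "SSAN", "SSNo", "SS#",
             "Soc Sec No", "Visa", "M/C"]


def term_to_regex(term):
    """Turn one stored term into a regex string (hypothetical helper)."""
    # Escape each word and allow any run of whitespace between words.
    body = r"\s+".join(re.escape(word) for word in term.split())
    # Only add a word boundary where the term starts or ends with a letter
    # or digit, so "Visa" does not match inside "advisable" but "SS#" still
    # matches when a number follows it directly.
    prefix = r"(?<!\w)" if term[0].isalnum() else ""
    suffix = r"(?!\w)" if term[-1].isalnum() else ""
    return prefix + body + suffix


def flag_documents(docs, terms=PII_TERMS):
    """Return {doc_id: [terms that hit]} for every document with a hit.

    docs is assumed to be a dict mapping a document ID to its extracted text.
    """
    compiled = [(t, re.compile(term_to_regex(t), re.IGNORECASE)) for t in terms]
    hits = {}
    for doc_id, text in docs.items():
        found = [term for term, pattern in compiled if pattern.search(text)]
        if found:
            hits[doc_id] = found
    return hits


if __name__ == "__main__":
    sample = {
        "a": "Please send your SS# 123-45-6789",
        "b": "It is advisable to wait.",
        "c": "Paid by VISA, social  security number on file",
    }
    for doc_id, terms in flag_documents(sample).items():
        print(doc_id, terms)
```

A hit only means the document deserves a look; terms such as “Visa” will also flag documents that mention the word without containing a card number.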

Text String Searches. More advanced searches may use pattern matching, e.g. search for “###-##-####” where the pound or hash symbol (“#”) stands for any digit from 0 to 9. Patterns like this are written as “regular expressions” (or in programmer-speak, “regex”), which can be run on many systems. See, e.g., kCura’s documentation on how to search for social security numbers in Relativity:
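
To make the “###-##-####” idea concrete, here is a minimal sketch in Python of how such a pattern might be written and run over plain text. It is not taken from any vendor's documentation; the function name is hypothetical, and the pattern only covers the hyphenated form, so it returns candidates to review rather than confirmed social security numbers.

```python
import re

# Three digits, two digits, four digits, separated by hyphens. The
# lookarounds stop the pattern from matching inside a longer run of digits.
SSN_PATTERN = re.compile(r"(?<!\d)\d{3}-\d{2}-\d{4}(?!\d)")


def find_ssn_candidates(text):
    """Return a list of (offset, matched_text) for each SSN-shaped string."""
    return [(m.start(), m.group(0)) for m in SSN_PATTERN.finditer(text)]


if __name__ == "__main__":
    print(find_ssn_candidates("SSN 123-45-6789 and phone 555-1234"))
```

Numbers written without hyphens, or with spaces instead, would need additional patterns, and any nine-digit identifier in the same format will also match.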

Visual Similarity. Another approach is to use visual similarity technology from the beginning of a case to assess what is relevant and what might contain PII. Documents stored in different types of files, e.g. Word, PDF, or image-only scanned TIF, are all clustered on what they look like, not on the type of file that contains the document information or the amount of text that is discernible in each file.

– Up Front Review. Document collections with millions of documents will typically resolve to a few thousand document type clusters, and one or two examples of each cluster can be examined to determine (a) whether the documents in that cluster are likely to be responsive and (b) whether they appear to contain PII. Taking an afternoon or a day to review the cluster exemplars can provide an excellent start on screening for responsiveness and for PII. In the EDRM Data, for example, there was one discrete cluster for IRS Form 1040s and another for employee tax forms.

– In Combination. Visual clustering can be used as a free-standing technique at the outset of the case, as described above, or can be used once other techniques have identified PII documents. Clusters found to contain PII documents can be reviewed to determine if other documents in the cluster contain PII even if those documents were not located by that other technology. For example, there could be documents in a cluster for which there is no searchable text (e.g. an image-only PDF) but they will still be grouped with other visually similar documents that were located via text searching or text analysis.
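
The combination step can be sketched in a few lines, assuming the cluster assignments have already been produced by some visual clustering tool and the PII hits by text searching. Everything named here (the function, the cluster labels, the document IDs) is hypothetical; the sketch simply lists, for each cluster that contains a known PII document, the cluster-mates that text searching did not flag.

```python
from collections import defaultdict


def unflagged_cluster_mates(doc_clusters, flagged_docs):
    """List documents that share a cluster with a known PII document.

    doc_clusters: dict mapping doc_id -> cluster_id (assumed to come from
                  a visual clustering tool).
    flagged_docs: set of doc_ids already identified as containing PII.
    Returns {cluster_id: sorted list of unflagged doc_ids} for every
    cluster that has at least one flagged document.
    """
    members = defaultdict(list)
    for doc_id, cluster_id in doc_clusters.items():
        members[cluster_id].append(doc_id)

    to_review = {}
    for cluster_id, docs in members.items():
        if any(doc in flagged_docs for doc in docs):
            rest = sorted(doc for doc in docs if doc not in flagged_docs)
            if rest:
                to_review[cluster_id] = rest
    return to_review


if __name__ == "__main__":
    clusters = {"d1": "form1040", "d2": "form1040", "d3": "memo", "d4": "form1040"}
    # d1 was found by text search; d2 and d4 might be image-only scans.
    print(unflagged_cluster_mates(clusters, {"d1"}))
```

The output is a review queue, not a finding: documents such as image-only scans land on it precisely because no text search could have reached them.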

Manual Coding. In the absence of visual similarity technology, document collections can be manually coded for document type, and every document type in which some documents have been found to contain PII can then be examined in full. However, there can be consistency, granularity (having specific-enough document types), cost, and delay issues with that approach, and the whole coding process exposes PII to yet more people.


Redaction. Once documents with PII are identified, the next question is, “How will they be treated?” Depending on the technology available to the producing party it may be possible to have automated redaction en masse for all social security numbers and to generate redaction logs documenting document number, page and terms redacted. Depending on the system, it may also be possible to redact zones on certain document types without requiring operator involvement. For example, on IRS Form 1040s, the text entry area for social security numbers could be redacted on every form, even those that were filled in by hand. Both types of redaction are available through BeyondRecognition – see
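
For documents with extracted text, mass redaction with a log can be sketched as follows. This is an illustrative sketch only and does not represent how any particular product works: the function name, the mask string and the log fields are assumptions, and it handles only hyphenated social security numbers in text, not zones on page images.

```python
import re

# Hyphenated SSN shape; lookarounds avoid matching inside longer digit runs.
SSN_PATTERN = re.compile(r"(?<!\d)\d{3}-\d{2}-\d{4}(?!\d)")


def redact_pages(doc_id, pages, mask="[REDACTED-SSN]"):
    """Redact SSN-shaped strings and build a redaction log.

    pages: list of page texts for one document, in page order.
    Returns (redacted_pages, log), where log has one entry per redaction
    giving the document ID, the 1-based page number and the kind of term.
    """
    redacted, log = [], []
    for page_no, text in enumerate(pages, start=1):
        def replace(match, page_no=page_no):
            # Record the kind of term, not its value, so the log itself
            # does not become another copy of the PII.
            log.append({"doc": doc_id, "page": page_no, "type": "SSN"})
            return mask
        redacted.append(SSN_PATTERN.sub(replace, text))
    return redacted, log


if __name__ == "__main__":
    pages = ["SSN 123-45-6789", "no pii", "a 111-22-3333 b 444-55-6666"]
    new_pages, log = redact_pages("DOC-1", pages)
    print(new_pages)
    print(log)
```

Logging the type of term rather than the redacted value is a design choice made for this sketch; a log that repeated the numbers would need the same protection as the original documents.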

Protective Order & Legends. If some automated process is not available, the only economical alternative for large numbers of redactions may be to obtain a protective order for such documents and to apply a label or legend to such documents that restricts their use.


Regardless of the effort expended or the technology used, it is possible that some PII may escape detection and be produced. To a certain extent, all a producing party can do is make a good-faith, proportionate effort to protect PII. However, here are some further considerations:

  • Data Breach Plan. It can be confusing trying to figure out the federal or state agencies to which you should give notification in the event of a data breach. The best idea is to have some sort of plan in place ahead of time. The FTC and the state attorney general will probably be high on the list of agencies to notify, although note that there may be more than one state attorney general involved.
  • Stop the Bleeding without Spoliation. If you have a data breach, give urgent attention to identifying and remediating the breach – ignoring the problem will only make it worse. In any data breach you should also seek legal advice on what your obligations are to preserve evidence and this may involve forensic-level preservation.
  • HIPAA. Law firms handling health information for healthcare clients should most likely enter into “business associate contracts” with them, as defined by the HIPAA regulations. Data breach notification obligations for health data may well include notifying the US Dept. of Health & Human Services Office for Civil Rights.
  • Credit Card Information. Law firms handling credit card information on behalf of a client who is a merchant, merchant acquiring bank, or credit card company should be aware of the data breach notification obligations the client has by reason of their credit card handling status.
  • Insurance. Don’t assume that a professional liability policy will provide adequate protection – costs relating to forensic analysis, notification, and compensating data breach victims can be significant. You may well want to obtain additional data breach coverage in the form of riders or supplemental policies.
  • Encryption. Proper data encryption may be a sort of “get out of jail free” card. If a properly encrypted data storage device is lost or stolen, the fact that it was adequately encrypted can remove reporting obligations.
  • Training and Auditing. The best systems and policies in the world cannot overcome having poorly trained or poorly motivated people using them. Train your people and audit their compliance on a regular basis.

Conclusion: The bottom line on all this is that protecting PII is serious business with serious financial and reputational risks. It only makes sense to have well thought out processes and technologies in place to fully protect the PII you handle.

-John Martin-

Post script:

BeyondRecognition has reported the data privacy issues in the EDRM Data to EDRM, FERC, Amazon Web Services (which currently distributes the data set), the FTC (Reference Number 45277727), and the Texas Attorney General. We have offered lists of those social security numbers to the latter two agencies to aid in notifying the data breach victims and monitoring their SSAN accounts. As of April 30, 2013, that data set was still available for download from Amazon Web Services via a link from

Historical Notes:

FERC’s initial release of Enron data in 2003 had PII issues that were supposedly cleaned up after Enron objected – see the ABC News story on the early release:

However, digging into Google search results shows there was at least one blogger who had noticed PII breaches in 2006, although it isn’t clear which data set the blogger was using. See the SecuriTeam blog
