Some parties are producing thousands of e-discovery documents and emails bundled into a single PDF. That is clearly not how those items were ordinarily maintained, and it is certainly not a reasonably usable form (Fed. R. Civ. P. 34(b)(2)(E)(ii)). Here are the reasons why all-in-one PDF productions are unreasonably burdensome for the receiving party:
Pages with No Text Are Invisible for Text-Based Analytics or Search
While all PDFs have a layer that displays visual representations of the pages in the file, not all of them have a searchable text layer containing the words, letters, and other characters used in the document. Furthermore, a PDF file can have some pages with text layers and some without text layers. This is crucial for lawyers who use text-based search and predictive coding software to review and search discovery: non-textual pages are essentially invisible to such software. With all-in-one productions, there can be hundreds or thousands of such hidden pages.
Unscrupulous parties could use this to minimize the chances of an opposing party identifying “hot” documents by embedding image-only versions of them in consolidated PDFs that have text for some of the other pages. The problem becomes more serious as the number of documents in a PDF grows, because the missing text becomes harder to detect.
One way to see if a PDF you’re working on has a text layer is to try to click and drag to highlight some of the words. Another way is to search for a term you can see on a page and check whether your search engine finds the document you’re viewing.
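For large productions, the same check can be automated page by page. The following is a minimal sketch, not a finished tool. It assumes the text of each page has already been extracted with whatever PDF library you use and is supplied as a list of strings (or None for pages with no text); the function name, the `min_chars` threshold, and the sample data are hypothetical.

```python
def find_textless_pages(page_texts, min_chars=1):
    """Return 1-based page numbers whose extracted text is empty or near-empty.

    page_texts: one entry per page, as produced by a PDF text extractor
    (assumed; the extraction step itself is not shown here).
    """
    missing = []
    for number, text in enumerate(page_texts, start=1):
        # Treat None and whitespace-only strings as "no text layer".
        if len((text or "").strip()) < min_chars:
            missing.append(number)
    return missing


# Hypothetical extraction results for a five-page consolidated PDF.
sample_pages = ["Invoice 1001", "", "Email from A to B", None, "   "]
textless = find_textless_pages(sample_pages)
print(f"{len(textless)} of {len(sample_pages)} pages have no text: {textless}")
```

Raising `min_chars` would also flag pages that carry only a stray Bates stamp or page number in the text layer, at the cost of some false positives.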
It May Not Be Easy to Use OCR to Add Text Layers for Image-Only Pages
Parties who rely on optical character recognition (“OCR”) software to add text layers to image-only pages may not appreciate how that software decides whether a file needs OCR. When OCR packages find a pre-existing text layer for any of the pages in a PDF, they usually won’t perform OCR on the file at all. See the following graphic.
One way to avoid this is to (1) create an image-only PDF and then (2) OCR each page, as shown below.
The problem with this solution is that it takes computing resources to generate the image-only PDFs and then OCR them. Also, while OCR may provide text for pages that didn’t have text layers, it will also introduce OCR errors on the pages that did, potentially degrading their text compared to the pre-existing text layer.
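To make the difference concrete, here is a small sketch that contrasts a file-level OCR decision with a page-level one. It is a simplified model of the behavior described above, not the logic of any particular OCR product, and both function names are hypothetical. As before, it assumes per-page text has already been extracted into a list.

```python
def file_level_ocr_decision(page_texts):
    """Model a tool that skips the whole file if ANY page already has text.

    Returns the 1-based page numbers that would be sent to OCR.
    """
    any_text = any((text or "").strip() for text in page_texts)
    if any_text:
        return []  # the whole file is skipped, image-only pages included
    return list(range(1, len(page_texts) + 1))


def page_level_ocr_decision(page_texts):
    """Select only the pages that lack text, leaving existing text untouched."""
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if not (text or "").strip()
    ]


# Hypothetical four-page PDF: pages 2 and 3 are image-only.
pages = ["Cover letter text", "", "", "Closing page text"]
print(file_level_ocr_decision(pages))  # no pages selected
print(page_level_ocr_decision(pages))  # only the image-only pages selected
```

Whether a given OCR tool can be told to process only selected pages is something to verify for that tool; where it can, a page-level approach would avoid re-OCRing pages that already have good text.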
Document Unitization Issues for Text Search and Analytics
The default logical unit for most text search and text analytics engines is the “document.” Search and analytics software typically assumes that one file contains one document. That default doesn’t work well when a PDF file is more like a box or filing cabinet than a “document.” Hits on multi-term searches become essentially meaningless because the terms can occur in completely different logical documents within the same file.
For example, if a search looks for “Term A and Term B,” a PDF with Term A on page 12 and Term B on page 2,412 will result in a hit. To compensate, searchers must use proximity operators that define how near search terms must be to each other, e.g., within a specified number of words, sentences, or paragraphs, making it more cumbersome to search.
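The effect is easy to reproduce. The sketch below treats an entire consolidated PDF as one token stream and compares a plain AND search with a within-N-words proximity search. It is an illustration only: the tokenizer is deliberately crude, the functions handle single-word terms, and the terms and filler text are made up.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into simple alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def and_hit(tokens, term_a, term_b):
    """True if both terms occur anywhere in the file."""
    return term_a in tokens and term_b in tokens


def within_n_words(tokens, term_a, term_b, n):
    """True if some occurrence of term_a is within n tokens of term_b."""
    positions_a = [i for i, tok in enumerate(tokens) if tok == term_a]
    positions_b = [i for i, tok in enumerate(tokens) if tok == term_b]
    return any(abs(a - b) <= n for a in positions_a for b in positions_b)


# Hypothetical all-in-one PDF: the two terms sit thousands of words apart,
# i.e., in what are really two unrelated documents.
pdf_text = "merger " + "filler " * 5000 + "indemnity"
tokens = tokenize(pdf_text)

print(and_hit(tokens, "merger", "indemnity"))             # True
print(within_n_words(tokens, "merger", "indemnity", 10))  # False
```

With properly unitized documents, the plain AND search would already give a meaningful answer and the proximity operator would be optional rather than mandatory.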
Bookmarks Are NOT as Useful as Folder Names
A producing party’s use of bookmarks to provide a structure to the production is not as useful as having documents in folders. For example:
- Non-Sequential Export. Most standard PDF utility programs will not recreate the folder structure that one might assume by stepping through each page of a PDF from page one to the end and observing the bookmarks for each page. Bookmarks can be added after the pages are loaded, and bookmarks can be used to create a hierarchy that has an order that is different from the page-order. For example, the first bookmark could point to the last page in a PDF.
- No 260-Character Path Limit. While Windows by default limits the total path to a given file to 260 characters, PDFs can have nested bookmarks whose combined text contains far more characters. This makes it difficult to use bookmark text for folder paths.
- Non-Unique Folder Names. In computer operating systems, folders at the same level in a path must have unique names. However, in PDFs the same text can be used for more than one bookmark at the same level. Using PDF bookmarks as folder names can therefore merge pages that belong to two or more separate documents into a single folder, or can result in the loss of pages whose bookmarks produce duplicate folder names.
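The three problems above can be checked mechanically before anyone tries to turn bookmarks into folders. The following sketch assumes the bookmark tree has already been read out of the PDF into nested `(title, page, children)` tuples; that structure, the function names, and the `C:\production` root are hypothetical, and the length check is simplified because it ignores the file name that would be added to each folder path.

```python
from collections import Counter

MAX_PATH = 260  # default Windows path limit discussed above


def bookmark_paths(bookmarks, prefix=""):
    """Yield (path, page) for nested bookmarks, in bookmark (outline) order."""
    for title, page, children in bookmarks:
        path = f"{prefix}\\{title}" if prefix else title
        yield path, page
        yield from bookmark_paths(children, path)


def audit(bookmarks, root="C:\\production"):
    """Report duplicate paths, over-long paths, and out-of-page-order bookmarks."""
    entries = [(f"{root}\\{path}", page) for path, page in bookmark_paths(bookmarks)]
    counts = Counter(path for path, _ in entries)
    duplicates = sorted(path for path, count in counts.items() if count > 1)
    too_long = sorted(path for path in counts if len(path) > MAX_PATH)
    pages = [page for _, page in entries]
    out_of_order = pages != sorted(pages)  # outline order differs from page order
    return duplicates, too_long, out_of_order


# Hypothetical bookmark tree: two top-level "Contracts" bookmarks, one very
# long title, and a first bookmark that points past later ones.
sample = [
    ("Contracts", 40, [("Exhibits", 45, [])]),
    ("Contracts", 1, []),
    ("Correspondence " + "x" * 300, 10, []),
]

duplicates, too_long, out_of_order = audit(sample)
print("Duplicate folder paths:", duplicates)
print("Paths over the limit:", len(too_long))
print("Bookmarks out of page order:", out_of_order)
```

A real conversion would also have to deal with characters that are legal in bookmark text but not in folder names, which this sketch does not attempt.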
Lost Parent-Child Relationships
Most receiving parties are interested in the context established by parent-child relationships among produced documents, e.g., they want to know which attachments go with which emails. This can be difficult to discern when all documents are in the same PDF.
The best advice for requesting parties is to be specific about the requested form of production and to use the meet-and-confer process to be sure you obtain reasonably usable productions.
For earlier postings on the use of the PDF format and folder name issues, see:
- “PDFs – Versatile Containers for E-Discovery Review” (does not suggest making an entire production in a single PDF) – http://beyondrecognition.net/pdfs-versatile-containers-e-discovery-review/
- “Why Embedding Referential Metadata in PDFs is a Good Idea” – http://beyondrecognition.net/why-embedding-referential-metadata-in-pdfs-is-a-good-idea/
- “REALLY Compressing PDFs…” – http://beyondrecognition.net/really-compressing-pdfs/
- “E-Discovery: Using Path Names and Filenames to Create Folksonomies” – http://beyondrecognition.net/e-discovery-path-names-filenames-for-folksonomies/