PDFs with embedded metadata can achieve many useful functions

The PDF standard lets users embed non-visible metadata within PDFs as attribute name and value pairs. This feature can be used to embed referential metadata, which is normally stored and used outside the files, to help find or otherwise work with them.

Here are some reasons why embedding metadata values can be a good idea:

Make Original Folder Paths Searchable

People who create folder structures often use folder names that carry informational value. For example, an energy company might have top-level folders for country, then a deeper folder level for document type, then folders named for specific projects with latitude and longitude coordinates. When those files are indexed, the search or content management system will usually not permit searching the folder and file names, especially if they contain characters like underscores or dashes.

One solution is to parse the folders in the file path and build attribute name/value pairs to include within the PDF. Then, when the search or content management system indexes the PDFs with the embedded metadata, those terms become searchable and usable.

This is particularly useful for organizations that use systems like:

  • EMC Documentum D2
  • OpenText
  • SharePoint with either FAST or MOSS
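
As a rough illustration of the path-parsing step, the sketch below turns folder levels into attribute name/value pairs. It is a minimal example under stated assumptions, not a description of any particular product: the level names (Country, DocumentType, Project), the OriginalPath attribute, the sample path, and the function name path_to_metadata are all hypothetical. Writing the resulting pairs into the PDF would be done separately with a PDF library.

```python
from pathlib import PureWindowsPath

# Hypothetical attribute names for each folder level, outermost first.
LEVEL_NAMES = ["Country", "DocumentType", "Project"]


def path_to_metadata(path, level_names=LEVEL_NAMES):
    """Turn the folder levels of a file path into attribute name/value pairs."""
    p = PureWindowsPath(path)
    # Drop the drive/root anchor and the file name, keeping only folder names.
    folders = [part for part in p.parts[:-1] if part != p.anchor]
    # Keep the full original path as its own (hypothetical) attribute.
    metadata = {"OriginalPath": str(p)}
    for name, folder in zip(level_names, folders):
        # Replace underscores and dashes with spaces so that an indexer
        # sees each word as a separate term.
        metadata[name] = folder.replace("_", " ").replace("-", " ")
    return metadata


# Hypothetical example path following the country / document type / project layout.
pairs = path_to_metadata(r"S:\Norway\Well_Logs\Troll-West_60.64N_3.72E\report.pdf")
```

Replacing underscores and dashes with spaces is one simple choice; a real deployment would need to decide how to treat characters that carry meaning, such as the minus sign in a negative coordinate.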

Facilitate Transferring Files

Content management systems accumulate metadata about individual files, so copying or transferring the entire collection, or a subset of it, means exporting the metadata along with path and file names that sync up with the target system. One way to avoid this complexity is to embed the metadata within the PDFs themselves so that the target system can just index them and extract the metadata name and value pairs at the same time.

Simplify Coordination of Multiple Versions

Some systems maintain different versions of documents for various reasons, e.g., there may be *.txt files with the same name and folder structure as image-only TIF files in order to provide search capability for the documents, or there may be separate files containing translations. Rather than coordinating the maintenance of all those copies, an organization can simply embed them in the PDF representations.

Overcome Broken Link Issues

Search and content management systems depend on files remaining where they were located when they were initially indexed. However, files sometimes get moved to new drives or drive mappings can change the paths to them. In either case the pointers to the indexed files are no longer valid and those files can be essentially lost from view. By including the referential metadata values within the PDF those values are no longer susceptible to being disconnected from the files. Whenever the files are indexed those metadata values can be re-incorporated in the index to help find and work with those files.
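
The re-indexing idea can be sketched in a few lines. This is only an illustration: it assumes the embedded name/value pairs have already been read out of each PDF by some other tool, and the function name rebuild_index, the sample paths, and the sample attributes are hypothetical.

```python
def rebuild_index(current_files):
    """Map each attribute name/value pair to the files that carry it.

    `current_files` maps a file's current path to the metadata read from
    inside that file, so stale paths from an older index play no part.
    """
    index = {}
    for path, metadata in current_files.items():
        for name, value in metadata.items():
            index.setdefault((name, value), []).append(path)
    return index


# Hypothetical result of reading embedded metadata from files that were
# moved to a new drive after they were first indexed.
current_files = {
    r"T:\archive\report.pdf": {"Country": "Norway", "DocumentType": "Well Logs"},
    r"T:\archive\survey.pdf": {"Country": "Norway", "DocumentType": "Surveys"},
}
index = rebuild_index(current_files)
norway_files = index[("Country", "Norway")]
```

Because the lookup keys come from inside the files, the index can be rebuilt from wherever the files now live, without reference to their original locations.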

Get Benefits of Both PDF and Native Files

A major benefit of the PDF format is that PDFs can be widely distributed regardless of whether the recipients have the software that created the original files. However, distributing just PDFs can deprive the recipients who do have the original software of the ability to edit and work with those native files. Embedding the original native file in the PDF version gives a solution that is the best of both worlds.

Use Same Approach for Inferential Metadata

Inferential metadata is data that can be inferred from examination of the files themselves, e.g., Loan Applicant, Well Number, Property Description, etc. Having data values associated with specific fields enables far easier and more precise searching than is available with text alone. Inferential attribute name and value pairs can be stored within PDFs in the same manner as referential metadata so that they can be used independently of the search or content system that manages them.

More Information

For more information on how BeyondRecognition can assist you with your information governance challenges, please contact info@beyondrecognition.net.
