PDF Document Metadata

Understanding the two metadata systems in PDF — the legacy DocInfo dictionary and the modern XMP metadata stream — and how to use them effectively

What Is PDF Metadata?

PDF metadata is descriptive information embedded within a PDF file that describes the document itself rather than its visual content. Also known as document properties, metadata includes fields such as the document's title, the name of its author, a subject description, keywords, the application that created the content (creator), the application that produced the PDF (producer), and the dates the document was created and last modified.

Metadata serves numerous practical purposes: it enables indexing and search in document management systems and operating system file search, supports archival by capturing provenance information, aids accessibility by providing programmatic access to the document's title, and is often required for regulatory compliance. For large organisations managing thousands of documents, well-structured metadata is the foundation of effective document governance.

The PDF format supports two distinct — and somewhat overlapping — mechanisms for storing metadata: the document information dictionary (DocInfo) and XMP metadata streams. Understanding both is important because they can contain conflicting information, and different tools read from different locations.

The Document Information Dictionary (DocInfo)

The DocInfo dictionary is the legacy metadata mechanism, present since the earliest versions of PDF. It is referenced from the PDF trailer via the /Info key:

trailer
<< /Size 42
   /Root 1 0 R
   /Info 41 0 R
>>

The information dictionary is a simple, flat dictionary. Most values are text strings; the two dates use the PDF date string format, and /Trapped is a name. The standard keys defined in the PDF specification are:

  • /Title — the document's title as a human-readable string
  • /Author — the name of the person who created the document
  • /Subject — a summary of the document's topic or content
  • /Keywords — a space- or comma-delimited list of keywords for indexing
  • /Creator — the name of the application that created the original document (e.g. "Microsoft Word 2024", "Adobe InDesign 2024")
  • /Producer — the name of the application that converted or generated the PDF (e.g. "Adobe PDF Library 23.0", "Ghostscript 10.0")
  • /CreationDate — the date and time the document was created, in PDF date format: D:YYYYMMDDHHmmSSOHH'mm', where O is +, -, or Z to give the relationship to UTC, and every field after the year is optional
  • /ModDate — the date and time the document was most recently modified
  • /Trapped — a name value (/True, /False, or /Unknown) indicating whether the document has been trapped for printing

The DocInfo dictionary's simplicity is also its limitation: each standard field holds a single plain value, with no support for multiple values per field, language variants, or structured data. For richer metadata, XMP is needed.
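
The PDF date format used by /CreationDate and /ModDate differs from the ISO 8601 form that XMP uses, so tools that read both usually need a conversion. The following is a minimal sketch in plain JavaScript, not tied to any PDF library; the function name pdfDateToIso is hypothetical, and it assumes that fields omitted after the year take the defaults the PDF specification gives them (01 for month and day, 00 for the time fields).

```javascript
// Convert a PDF date string (e.g. "D:20260115093000+01'00'") to ISO 8601.
// Returns null when the input does not look like a PDF date.
function pdfDateToIso(pdfDate) {
  const pattern =
    /^(?:D:)?(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?(?:([Zz+-])(?:(\d{2})'?(\d{2})?'?)?)?$/;
  const match = pattern.exec(String(pdfDate).trim());
  if (!match) return null;

  // Fields after the year are optional; unmatched groups are undefined,
  // so the destructuring defaults below apply.
  const [, year, month = "01", day = "01", hour = "00", minute = "00",
    second = "00", sign, offsetHours = "00", offsetMinutes = "00"] = match;

  let zone = ""; // no offset given: the relationship to UTC is unknown
  if (sign === "Z" || sign === "z") zone = "Z";
  else if (sign) zone = `${sign}${offsetHours}:${offsetMinutes}`;

  return `${year}-${month}-${day}T${hour}:${minute}:${second}${zone}`;
}

console.log(pdfDateToIso("D:20260115093000+00'00'"));
// prints "2026-01-15T09:30:00+00:00"
```

The sketch only reshapes the string; it does not check that the month, day, or time values are in range.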

XMP: Extensible Metadata Platform

XMP (Extensible Metadata Platform) is Adobe's XML-based metadata standard, introduced in 2001 and subsequently published as ISO 16684. XMP stores metadata as an RDF/XML document embedded within a file's binary stream. In PDF, the XMP packet is stored as a stream object referenced from the document catalog:

<< /Type /Catalog
   /Pages 2 0 R
   /Metadata 40 0 R
>>

The XMP stream contains an XML document wrapped in processing instructions:

<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about=""
        xmlns:dc="http://purl.org/dc/elements/1.1/"
        xmlns:xmp="http://ns.adobe.com/xap/1.0/">
      <dc:title>
        <rdf:Alt><rdf:li xml:lang="x-default">Annual Report 2026</rdf:li></rdf:Alt>
      </dc:title>
      <dc:creator>
        <rdf:Seq><rdf:li>Jane Smith</rdf:li></rdf:Seq>
      </dc:creator>
      <xmp:CreateDate>2026-01-15T09:30:00+00:00</xmp:CreateDate>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>
<?xpacket end="w"?>

XMP supports the standard Dublin Core properties (title, creator, description, subject, etc.) under the dc: namespace, Adobe-specific properties under the xmp: and xmpMM: namespaces (including document ID, version ID, and document history), and PDF-specific properties under the pdf: namespace (keywords, PDF version, producer). The RDF data model allows multi-valued fields (sequences and bags of values), language-alternative strings, and structured complex types — none of which are possible in the flat DocInfo dictionary.
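
To make the container types concrete, here is a minimal sketch that assembles a packet like the one above from plain values: the title goes into an rdf:Alt language alternative and the creators into an ordered rdf:Seq. The function names buildXmpPacket and escapeXml are hypothetical, and the sketch uses plain string templates rather than the XMP Toolkit, so it illustrates the structure only.

```javascript
// Escape the characters that are significant in XML text and attributes.
function escapeXml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Build a small XMP packet with a title, a list of creators, and a create date.
function buildXmpPacket({ title, creators = [], createDate }) {
  const creatorItems = creators
    .map((name) => `<rdf:li>${escapeXml(name)}</rdf:li>`)
    .join("");
  return [
    // The begin attribute conventionally holds the byte order mark U+FEFF.
    '<?xpacket begin="\uFEFF" id="W5M0MpCehiHzreSzNTczkc9d"?>',
    '<x:xmpmeta xmlns:x="adobe:ns:meta/">',
    '  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">',
    '    <rdf:Description rdf:about=""',
    '        xmlns:dc="http://purl.org/dc/elements/1.1/"',
    '        xmlns:xmp="http://ns.adobe.com/xap/1.0/">',
    // Language alternative: one value per language, x-default as the fallback.
    `      <dc:title><rdf:Alt><rdf:li xml:lang="x-default">${escapeXml(title)}</rdf:li></rdf:Alt></dc:title>`,
    // Ordered sequence: the order of creators is significant.
    `      <dc:creator><rdf:Seq>${creatorItems}</rdf:Seq></dc:creator>`,
    `      <xmp:CreateDate>${escapeXml(createDate)}</xmp:CreateDate>`,
    "    </rdf:Description>",
    "  </rdf:RDF>",
    "</x:xmpmeta>",
    '<?xpacket end="w"?>',
  ].join("\n");
}

const packet = buildXmpPacket({
  title: "Annual Report 2026",
  creators: ["Jane Smith"],
  createDate: "2026-01-15T09:30:00+00:00",
});
console.log(packet);
```

A production writer would normally also add padding before the closing xpacket instruction so the packet can be edited in place; that is omitted here for brevity.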

XMP Across Adobe Creative Suite

XMP is not limited to PDF. Adobe embedded XMP metadata support across its entire Creative Suite and released the XMP toolkit as open source. As a result, XMP metadata is found in JPEG, TIFF, PNG, PSD, AI, INDD, and many other file formats. This means that metadata authored in Photoshop — such as IPTC copyright information and location data — can persist through the InDesign layout process and into the final exported PDF, preserving a complete provenance chain.

XMP also supports document history (tracked in the xmpMM:History sequence), recording each save operation with a timestamp, the software that performed the save, and the parameters used. This history can become very lengthy in documents that have been edited many times, and is sometimes stripped during PDF optimisation to reduce file size.

Reading and Editing Metadata in Acrobat

In Adobe Acrobat, document properties are accessible via File > Properties (or Ctrl+D / Cmd+D). The Description tab shows the standard DocInfo fields — Title, Author, Subject, and Keywords — and the Created, Modified, Application (Creator), and PDF Producer values. These fields can be edited directly in this dialog, and Acrobat will update both the DocInfo dictionary and the XMP packet to keep them synchronised.

For access to the raw XMP data, the Additional Metadata button in the Description tab opens the Full Metadata dialog, which shows the XMP packet as a tree of namespaces and properties and allows viewing the raw XML. However, editing the raw XML in Acrobat's UI is not supported; raw XMP manipulation requires scripting or external tools.

Custom Metadata Schemas

One of XMP's most powerful features is extensibility through custom namespaces. An organisation can define its own metadata schema — for example, to track document classification levels, internal project codes, approval status, or regulatory submission identifiers — and embed this data in the XMP packet alongside standard properties.

A custom XMP namespace is defined with a unique URI and a preferred prefix. Properties within the namespace follow the same RDF typing rules as standard XMP properties. For instance, a pharmaceutical company might define a namespace http://www.example-pharma.com/xmp/submission/1.0/ with properties such as sub:CTDSection, sub:SubmissionID, and sub:ApprovalDate.
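
As an illustration, the sketch below serialises simple text-valued properties from the example namespace into an rdf:Description element that could sit inside an XMP packet. The function name customDescription and the property values are hypothetical placeholders; only the namespace URI, the sub: prefix, and the property names come from the example above.

```javascript
// Escape the characters that are significant in XML text and attributes.
function escapeXml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Serialise simple text-valued properties under one custom namespace.
function customDescription(prefix, namespaceUri, properties) {
  const lines = Object.entries(properties).map(
    ([name, value]) =>
      `  <${prefix}:${name}>${escapeXml(value)}</${prefix}:${name}>`
  );
  return [
    `<rdf:Description rdf:about="" xmlns:${prefix}="${escapeXml(namespaceUri)}">`,
    ...lines,
    "</rdf:Description>",
  ].join("\n");
}

// Placeholder values, for illustration only.
console.log(
  customDescription("sub", "http://www.example-pharma.com/xmp/submission/1.0/", {
    CTDSection: "3.2.P.5",
    SubmissionID: "SUB-0001",
    ApprovalDate: "2026-01-15",
  })
);
```

Multi-valued or structured custom properties would need the same rdf:Seq, rdf:Bag, or rdf:Alt containers that the standard properties use.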

Custom metadata schemas are particularly valuable in document management and archival systems where the PDF's embedded metadata drives automated classification, routing, and retention rules without requiring the file name or folder path to convey this information.

Why Metadata Matters

The practical importance of accurate, complete PDF metadata spans several domains:

  • Search and retrieval — Document management systems, enterprise search engines, and operating system search (Windows Search, macOS Spotlight) index PDF metadata, making documents findable by title, author, or keywords even when the full text cannot be searched.
  • Accessibility — The PDF/UA standard (ISO 14289) requires the document title to be present as dc:title in the XMP metadata stream, and requires the /DisplayDocTitle viewer preference to be set to true so that the viewer's title bar, and the assistive technologies that read it, present the document's title rather than its file name.
  • Archival and preservation — The PDF/A archival standard mandates XMP metadata and constrains what it may contain. Archive management systems rely on embedded metadata to identify, catalogue, and apply retention policies to archived documents.
  • Compliance and audit — Regulatory filings, contract management, and quality management systems may require specific metadata fields to be populated correctly as part of submission or audit requirements.
  • Print production — The /Trapped field and related production information, such as the output intent and its ICC profile, are used by prepress workflows and RIPs to handle documents correctly.

Metadata in PDF/A

PDF/A is the ISO standard for long-term archiving of PDF documents. All parts of PDF/A (PDF/A-1, PDF/A-2, PDF/A-3, and PDF/A-4) require XMP metadata to be present and conformant. Specific requirements include:

  • The XMP packet must be present in the document catalog's /Metadata stream
  • The PDF/A conformance level must be declared using the pdfaid: namespace with pdfaid:part (the part number, e.g. "2") and pdfaid:conformance (the conformance level, e.g. "B" for basic, "U" for unicode, "A" for accessible)
  • The DocInfo dictionary, if present, must be consistent with the XMP (values for Title, Author, Subject, Keywords, Creator, Producer, CreationDate, and ModDate must match between the two)
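  • Extension schemas (custom namespaces) must be described inside the XMP packet using the extension schema description vocabulary that the PDF/A standard itself defines (the pdfaExtension: namespace and its companions)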
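
The pdfaid: declaration from the list above can be sketched as follows. The function name pdfaIdDescription is hypothetical, and the sketch covers only parts 1 to 3 with the conformance letters named above (A and B for PDF/A-1; A, B, and U for PDF/A-2 and PDF/A-3); it does not attempt PDF/A-4.

```javascript
// Conformance letters accepted for each PDF/A part covered by this sketch.
const PDFA_LEVELS = { 1: ["A", "B"], 2: ["A", "B", "U"], 3: ["A", "B", "U"] };

// Build the rdf:Description element that declares PDF/A part and conformance.
function pdfaIdDescription(part, conformance) {
  const level = String(conformance).toUpperCase();
  const allowed = PDFA_LEVELS[part];
  if (!allowed || !allowed.includes(level)) {
    throw new Error(`Unsupported PDF/A part/conformance: ${part}/${conformance}`);
  }
  return [
    '<rdf:Description rdf:about="" xmlns:pdfaid="http://www.aiim.org/pdfa/ns/id/">',
    `  <pdfaid:part>${part}</pdfaid:part>`,
    `  <pdfaid:conformance>${level}</pdfaid:conformance>`,
    "</rdf:Description>",
  ].join("\n");
}

console.log(pdfaIdDescription(2, "b"));
```

Declaring a conformance level does not make a file conformant; a validator still has to check the rest of the document against the declared part.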

Inconsistency between DocInfo and XMP is one of the most common PDF/A validation failures, often introduced when metadata is edited through tools that update only one of the two locations.
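
A consistency check of this kind can be sketched as a comparison between two sets of already-extracted values. The sketch below assumes the usual correspondence between DocInfo keys and XMP properties (Title to dc:title, Author to dc:creator, Subject to dc:description, Keywords to pdf:Keywords, Creator to xmp:CreatorTool, Producer to pdf:Producer). The function name findMetadataMismatches and the flat input objects are hypothetical, and the two date fields are left out because they need format conversion before they can be compared.

```javascript
// DocInfo key -> corresponding XMP property (text fields only).
const DOCINFO_TO_XMP = {
  Title: "dc:title",
  Author: "dc:creator",
  Subject: "dc:description",
  Keywords: "pdf:Keywords",
  Creator: "xmp:CreatorTool",
  Producer: "pdf:Producer",
};

// Return one record per DocInfo entry whose XMP counterpart differs or is missing.
function findMetadataMismatches(docInfo, xmp) {
  const mismatches = [];
  for (const [infoKey, xmpKey] of Object.entries(DOCINFO_TO_XMP)) {
    if (!(infoKey in docInfo)) continue; // only entries present in DocInfo are checked
    const infoValue = String(docInfo[infoKey]).trim();
    const xmpValue = xmpKey in xmp ? String(xmp[xmpKey]).trim() : null;
    if (infoValue !== xmpValue) {
      mismatches.push({ infoKey, xmpKey, infoValue, xmpValue });
    }
  }
  return mismatches;
}

// Hypothetical extracted values: the author was edited in only one location.
const report = findMetadataMismatches(
  { Title: "Annual Report 2026", Author: "Finance Department" },
  { "dc:title": "Annual Report 2026", "dc:creator": "Jane Smith" }
);
console.log(report.map((m) => m.infoKey)); // prints [ 'Author' ]
```

Extracting the values from a real file is left to whichever PDF or XMP library is in use; the comparison itself stays this simple once both sides are plain strings.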

Scripting Metadata Changes

For bulk metadata operations across large document sets, Acrobat JavaScript and Adobe ExtendScript provide programmatic access to document properties.

In Acrobat JavaScript, the Doc object exposes document properties directly:

// Read document title
var title = this.info.Title;

// Set multiple properties
this.info.Title = "2026 Annual Report";
this.info.Author = "Finance Department";
this.info.Subject = "Financial Results";
this.info.Keywords = "annual report, financials, 2026";

For XMP metadata, the Acrobat JavaScript Doc object exposes a metadata property (available in Acrobat Pro), which returns the document's XMP packet as an XML string and can be assigned a replacement packet. This allows arbitrary XMP properties, including custom namespace properties, to be read and written. The Acrobat SDK also provides C++ APIs for XMP manipulation at the plugin level, and Adobe's open-source XMP Toolkit SDK allows server-side metadata processing without Acrobat.

For high-volume metadata remediation, such as backfilling missing titles across a large document repository or normalising author names to a canonical form, either the Acrobat JavaScript batch framework or a server-side tool built on the XMP Toolkit is the most practical approach.
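
The normalisation step itself is independent of whichever tool reads and writes the files. A minimal sketch, assuming a hand-maintained alias table: the function name normaliseAuthor and the table entries are hypothetical, and lookups ignore case and repeated whitespace.

```javascript
// Collapse whitespace, then map known variants to a canonical author name.
// Names without an alias are returned unchanged apart from whitespace cleanup.
function normaliseAuthor(raw, aliases) {
  const cleaned = String(raw).trim().replace(/\s+/g, " ");
  const key = cleaned.toLowerCase();
  return aliases.has(key) ? aliases.get(key) : cleaned;
}

// Hypothetical alias table; keys are lower-case variants seen in the repository.
const authorAliases = new Map([
  ["finance dept", "Finance Department"],
  ["finance dept.", "Finance Department"],
  ["j. smith", "Jane Smith"],
]);

console.log(normaliseAuthor("  Finance   Dept ", authorAliases));
// prints "Finance Department"
```

In an Acrobat batch sequence the result could be assigned to this.info.Author; in a server-side tool it would be written to both the DocInfo dictionary and the XMP packet so the two stay consistent.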

PDF Metadata and Document Automation

Mapsoft can help you automate metadata management across large PDF document sets — from compliance checking to bulk remediation and custom schema deployment.