PDF documents

From Exterior Memory
Revision as of 21:24, 9 January 2013 by MacFreek (Talk | contribs) (Structure of a PDF document)


PDF (Portable Document Format) is a device-independent file format. A PDF file describes a fixed layout and contains all data needed to display the contents. The advantage is that a PDF looks the same on all devices. A disadvantage is that a PDF cannot adapt to the display capabilities of a particular device.

PDF is paginated: a PDF document consists of one or more pages. This is unlike HTML and CSS, which are also device-independent, but which are not paginated and may adjust the layout to the display capabilities of a particular device. An advantage of PDF over HTML and CSS is that it can embed all the contents, including fonts and images.

Structure of a PDF document

For details, see the official PDF standard. The current standard is ISO 32000-1, which is available for a charge. The same text is available for free from Adobe, the original developer of the format.

General Outline

A PDF document consists of four parts:

  1. A header, specifying the version (e.g. "%PDF-1.3" or "%PDF-1.7"), usually followed by a comment line with some non-ASCII "garbage bytes" (so that transfer tools treat the file as binary data)
  2. The body, a sequence of objects
  3. The cross-reference table (xref), listing the byte offset of each object (for fast random access to a PDF file)
  4. The trailer, which tells where to find the xref table, the /Root object and the /Info object, and which gives the /Size, the total number of entries in the xref table.

Newer versions of the PDF standard allow for multiple bodies, xref tables and trailers. This facilitates incremental updates, where data is appended to a PDF file without the need to rewrite existing parts.
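The four parts listed above can be illustrated with a short Python sketch. It is not a general PDF writer: the function names build_minimal_pdf and read_layout are made up for this example, the three objects are the bare minimum for a one-page document, and the reader assumes a classic (uncompressed) xref table with a single trailer.

```python
import re

def build_minimal_pdf() -> bytes:
    """Assemble header, body, xref table and trailer for a one-page PDF."""
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 595 842] >>",
    ]
    # 1. Header: version line plus a comment with non-ASCII bytes.
    out = bytearray(b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n")
    # 2. Body: remember the byte offset at which each object starts.
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"
    # 3. Cross-reference table: one 20-byte entry per object number.
    xref_offset = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"          # entry 0 is always free
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset
    # 4. Trailer: /Size and /Root, then the offset of the xref table.
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_offset
    return bytes(out)

def read_layout(data: bytes):
    """Return (version, xref_offset) from the header and the end of the file."""
    version = re.match(rb"%PDF-(\d\.\d)", data).group(1).decode("ascii")
    tail = re.search(rb"startxref\s+(\d+)\s+%%EOF\s*$", data)
    return version, int(tail.group(1))

pdf = build_minimal_pdf()
version, xref_offset = read_layout(pdf)
print(version, pdf[xref_offset:xref_offset + 4])
```

A reader works backwards: it finds startxref at the end of the file, jumps to the xref table, and only then looks up the objects it needs.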


The body of a PDF document consists of multiple objects and references. Objects include:

  • Metadata and document info
  • Fonts and font information
  • Images
  • Pages

Each object contains properties, in a mix of these forms:

  • key-value pairs
  • key-reference pairs
  • a data stream

For example, a Page object may look as follows (the closing endobj keyword is omitted in these examples):

6 0 obj
<</Contents 8 0 R /CropBox [0 0 595 842] /MediaBox [0 0 595 842]
  /Parent 5 0 R /Resources 9 0 R /Rotate 0 /Type /Page>>

This object is identified by the pair (6, 0): the object number and the generation number. The latter is used when a PDF is updated incrementally. The object contains seven properties (/Contents, /CropBox, /MediaBox, /Parent, /Resources, /Rotate and /Type). The value of /Type is /Page, so we know this is a page object. Some properties, such as /Contents and /Resources, hold a reference to another object, called an indirect object. For example, "8 0 R" is a reference to the object with object number 8 and generation number 0. Other properties, such as /CropBox, /Rotate and /Type, have an inline value. This value can be an integer (such as 0 for /Rotate), a name (such as /Page for /Type) or a more complex type, such as the array of /CropBox. Other possible values include strings (between parentheses), hexadecimal strings (between angle brackets), and dictionaries (between double angle brackets), as shown in the following examples:

1 0 obj
<</Author (Dijkstra) /CreationDate (D:20070524152509+01'00') /Creator
  (PScript5.dll Version 5.2.2) /ModDate (D:20070524152509+01'00')
  /Producer (Acrobat Distiller 6.0 \(Mac\)) /Title (My Document)>>
2 0 obj
<</Metadata 3 0 R /PageLabels 4 0 R /Pages 5 0 R /Type /Catalog>>
5 0 obj
<</Count 2 /Kids [6 0 R 7 0 R] /Type /Pages>>
9 0 obj
<</ColorSpace <</Cs6 10 0 R>> /ExtGState <</GS1 11 0 R>> /Font
  <</TT2 12 0 R /TT4 13 0 R>> /ProcSet [/PDF /Text /ImageC] /XObject
  <</Im1 21 0 R>>>>
trailer
<</ID
  [<01f19c06b79a321dc878ac6d6b7ccf6a> <da1e61ccd1740d4cb7484ba3c5d57b7f>]
  /Info 1 0 R /Root 2 0 R /Size 23>>
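As a rough illustration of how object identifiers and references appear in the text of an object, the following Python sketch pulls them out of the Page object shown earlier with two regular expressions. The helper names object_id and keyed_references are invented for this example. It only finds references that directly follow a key, so references inside arrays (such as /Kids) are not reported; a real parser would tokenize the dictionary instead.

```python
import re

# "N G obj" opens an indirect object; "/Key N G R" is a key holding a reference.
OBJ_HEADER = re.compile(r"(\d+)\s+(\d+)\s+obj\b")
REFERENCE = re.compile(r"/(\w+)\s+(\d+)\s+(\d+)\s+R\b")

page = """6 0 obj
<</Contents 8 0 R /CropBox [0 0 595 842] /MediaBox [0 0 595 842]
  /Parent 5 0 R /Resources 9 0 R /Rotate 0 /Type /Page>>"""

def object_id(text):
    """Return (object number, generation number) from the 'N G obj' header."""
    number, generation = OBJ_HEADER.match(text).groups()
    return int(number), int(generation)

def keyed_references(text):
    """Map each key whose value is a reference to (number, generation)."""
    return {key: (int(num), int(gen)) for key, num, gen in REFERENCE.findall(text)}

print(object_id(page))
print(keyed_references(page))
```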

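Some objects may have an additional data stream. Typical uses include page contents, images and metadata. In the file, the stream data sits between the keywords stream and endstream, which are left out of the examples below.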

3 0 obj
<</Length 3346 /Subtype /XML /Type /Metadata>>
<?xpacket begin='w' id='W5M0MpCehiHzreSzNTczkc9d'?>
<?adobe-xap-filters esc="CRLF"?>
<x:xmpmeta xmlns:x='adobe:ns:meta/' x:xmptk='XMP toolkit 2.9.1-13, framework 1.6'>
<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#' xmlns:iX='http://ns.adobe.com/iX/1.0/'>
<rdf:Description rdf:about='uuid:7600fa4d-b451-407b-8421-6c477cc2d751' xmlns:pdf='http://ns.adobe.com/pdf/1.3/' pdf:Producer='Acrobat Distiller 6.0 (Mac)' />
<rdf:Description rdf:about='uuid:7600fa4d-b451-407b-8421-6c477cc2d751' xmlns:xap='http://ns.adobe.com/xap/1.0/' xap:CreateDate='2007-05-24T15:25:09+01:00' xap:CreatorTool='PScript5.dll Version 5.2.2' xap:ModifyDate='2007-05-24T15:25:09+01:00' />
<rdf:Description rdf:about='uuid:7600fa4d-b451-407b-8421-6c477cc2d751' xmlns:xapMM='http://ns.adobe.com/xap/1.0/mm/' xapMM:DocumentID='uuid:9776c93a-64a2-4a75-b81c-3122e81eaa64' />
<rdf:Description rdf:about='uuid:7600fa4d-b451-407b-8421-6c477cc2d751' xmlns:dc='http://purl.org/dc/elements/1.1/' dc:format='application/pdf'>
  <dc:title><rdf:Alt><rdf:li xml:lang='x-default'>Microsoft Word - My Document final.rtf</rdf:li></rdf:Alt></dc:title>
<?xpacket end='w'?>
8 0 obj
<</Filter /FlateDecode /Length 1908>>
H$¤WÛrã¸q}÷Wà•LYñ  .....  ¦y™Ù¹¸¼5qM­

As you can see, a stream can contain binary data or text. The dictionary of such an object typically has properties describing the data size (/Length) and the compression type (/Filter); /FlateDecode is the Flate (Deflate) compression supported by zlib.
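Because /FlateDecode data is in zlib format, Python's zlib module can be used to see how such a stream object fits together. The sketch below builds a stream object from made-up page content and decodes it again. The helper name decode_stream is hypothetical, and the code assumes that /Length is a plain integer and that /FlateDecode is the only filter; in real files /Length may be a reference and several filters may be chained.

```python
import re
import zlib

# Made-up, repetitive content so that the compression is clearly visible.
content = b"BT /TT2 12 Tf 72 720 Td (Call for Papers) Tj ET\n" * 20
compressed = zlib.compress(content)

# Dictionary with /Filter and /Length, then the data between stream/endstream.
stream_object = (
    b"8 0 obj\n<</Filter /FlateDecode /Length %d>>\nstream\n" % len(compressed)
    + compressed
    + b"\nendstream\nendobj\n"
)

def decode_stream(obj: bytes) -> bytes:
    """Take /Length bytes after the 'stream' keyword and inflate them."""
    length = int(re.search(rb"/Length\s+(\d+)", obj).group(1))
    start = obj.index(b"stream\n") + len(b"stream\n")
    return zlib.decompress(obj[start:start + length])

print(len(content), len(compressed))
print(decode_stream(stream_object) == content)
```

Using /Length rather than searching for endstream matters, because the compressed bytes may themselves contain any byte sequence.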

Page objects

In a simple PDF, the trailer points to a /Root object, which points to a /Pages object, which points to one or more /Page objects. See the above examples. In actual PDFs, a /Pages object may point to other /Pages objects, recursively building a page tree.
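The walk from /Root to the individual pages can be sketched in Python as a recursive traversal. The objects are modelled here as plain dictionaries keyed by object number, which is a simplification made for this example: a real reader would first parse the objects and resolve the references. The first table mirrors the example objects 2, 5 and 6 above (the contents of object 7 are assumed); the second is a hypothetical nested tree.

```python
# Simplified model: object number -> dictionary, references as plain numbers.
objects = {
    2: {"Type": "Catalog", "Pages": 5},
    5: {"Type": "Pages", "Count": 2, "Kids": [6, 7]},
    6: {"Type": "Page", "Parent": 5, "Contents": 8},
    7: {"Type": "Page", "Parent": 5},   # assumed; not shown in the examples
}

# Hypothetical document in which a /Pages node contains another /Pages node.
nested = {
    1: {"Type": "Catalog", "Pages": 2},
    2: {"Type": "Pages", "Kids": [3, 6]},
    3: {"Type": "Pages", "Kids": [4, 5]},
    4: {"Type": "Page"},
    5: {"Type": "Page"},
    6: {"Type": "Page"},
}

def iter_pages(table, node_number):
    """Yield the object numbers of all /Page leaves below a node, in order."""
    node = table[node_number]
    if node["Type"] == "Page":
        yield node_number
    else:
        for kid in node["Kids"]:
            yield from iter_pages(table, kid)

def pages_from_root(table, root_number):
    """Follow /Root -> /Pages and collect the pages in document order."""
    return list(iter_pages(table, table[root_number]["Pages"]))

print(pages_from_root(objects, 2))
print(pages_from_root(nested, 1))
```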

A page object points to a /Contents object and a /Resources object. The resource object describes different objects that are required to render the page, including images and font descriptions. For example:

<</ColorSpace <</Cs6 10 0 R>> /ExtGState <</GS1 11 0 R>> /Font
  <</TT2 12 0 R /TT4 13 0 R>> /ProcSet [/PDF /Text /ImageC] /XObject
  <</Im1 21 0 R>>>>

The /Contents object holds a stream, called the Content Stream, that describes the page layout.

Content Streams

A Content Stream is a sequence of instructions describing how elements are laid out on a page.

The syntax of a content stream is derived from the PostScript language: operands come first, followed by the operator that consumes them. It uses regular PDF objects as operands, but has none of PostScript's programmatic constructs, such as loops, conditionals or variables.

The content stream includes information to place certain text, place figures, add transparency, and select the font. Text is typically included inline, while figures and fonts are referred to by a dictionary name. These dictionary names match the names that are included in the /Resources object associated with a certain /Page object.

For example:

1 i 
72 813.74 67.74 -90 re
W* n
/GS1 gs
67.740005 0 0 89.939995 72 723.740295 cm
/Im1 Do
/GS1 gs
/TT2 1 Tf
12 0 0 12 139.74 723.7403 Tm
/Cs6 cs 0 0 0 scn
0 Tc
0 Tw
( )Tj
10.49 -0.93 TD
0.0022 Tw
[(Call for Pa)6.2(pers )]TJ

The example uses the following operators: i (set the flatness tolerance), re, W* and n (append a rectangle to the path, mark it as the clipping path using the even-odd rule, and end the path without painting it), gs (set graphics state parameters from a resource, in this case the one named /GS1), cm (concatenate a matrix to the current transformation matrix), Do (paint the specified XObject, in this case the image named /Im1), Tf (set the font and size, in this case the font named /TT2), Tm (set the text matrix), cs and scn (select a color space named in the resources, and set the color), Tc and Tw (set character and word spacing), TD (move to the next line), and Tj and TJ (show text, and show text with individual glyph positioning). Complete content streams also contain operators that this excerpt leaves out, such as q and Q (save and restore the graphics state) and BT and ET (begin and end a text object).
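Since operands always precede their operator, a content stream can be split into operations with a small tokenizer. The Python sketch below is a simplified illustration, not a complete parser: the names TOKEN, NUMBER and parse_operations are invented here, strings with nested parentheses, dictionaries and inline images are not handled, and arrays are assumed not to be nested.

```python
import re

TOKEN = re.compile(r"""
    \((?:\\.|[^\\()])*\)      # literal string without nested parentheses
  | <[0-9A-Fa-f\s]*>          # hexadecimal string
  | [\[\]]                    # array delimiters
  | /[^\s/\[\]()<>]*          # name, such as /TT2
  | [^\s/\[\]()<>]+           # number or operator keyword
""", re.VERBOSE)
NUMBER = re.compile(r"[+-]?(?:\d+\.?\d*|\.\d+)")

def parse_operations(stream):
    """Return a list of (operator, operands) pairs in stream order."""
    operations = []
    operands = []     # operands collected for the next operator
    array = None      # list being filled while inside [ ... ]
    for token in TOKEN.findall(stream):
        if token == "[":
            array = []
        elif token == "]":
            operands.append(array)
            array = None
        elif token[0] in "(/<" or NUMBER.fullmatch(token):
            value = float(token) if NUMBER.fullmatch(token) else token
            (operands if array is None else array).append(value)
        else:
            # Anything else is an operator: it consumes the pending operands.
            operations.append((token, operands))
            operands = []
    return operations

sample = """/GS1 gs
67.740005 0 0 89.939995 72 723.740295 cm
/Im1 Do
/TT2 1 Tf
( )Tj
[(Call for Pa)6.2(pers )]TJ"""

operations = parse_operations(sample)
for operator, args in operations:
    print(operator, args)
```

The names that appear as operands (/GS1, /Im1, /TT2) are the keys to look up in the /Resources dictionary of the page.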