PDF documents

PDF (Portable Document Format) is a device-independent file format. A PDF file describes a fixed layout and contains all the data needed to display its contents. The advantage is that a PDF looks the same on all devices. A disadvantage is that a PDF does not take advantage of the display capabilities of a particular device.

PDF is paginated: a PDF document consists of one or more pages. This is unlike HTML and CSS, which are also device-independent, but which are not paginated and may adjust the layout based on the display capabilities of a particular device. An advantage of PDF over HTML and CSS is that a single file can embed all of its contents, including fonts and images.

Structure of a PDF document
For details, see the official PDF standard. PDF 1.7 is standardized as ISO 32000-1, which is available for a charge; it has since been succeeded by ISO 32000-2 (PDF 2.0). The text of ISO 32000-1 is available for free from Adobe, the original developer of the format.

General Outline
A PDF document consists of four parts:


 * 1) A header, specifying the version (e.g. "%PDF-1.3" or "%PDF-1.7"), followed by a comment line with some "garbage bytes" (non-ASCII bytes that signal to transfer programs that the file contains binary data)
 * 2) The body, a sequence of objects
 * 3) The cross-reference table (XREF), listing the byte offset of each object (for fast random access within a PDF file)
 * 4) The trailer, which tells where to find the XREF and contains the /Root reference, the /Info reference, and /Size, the number of entries in the XREF.
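
The four parts can be illustrated by assembling a very small PDF by hand. The following is a minimal sketch, not a full writer: the function name `build_minimal_pdf`, the three objects, and the page size are assumptions made for illustration, and no /Info object is written.

```python
def build_minimal_pdf() -> bytes:
    """Assemble a tiny one-page PDF from the four parts described above."""
    # 1) Header: version line plus a comment with four bytes >= 128.
    out = bytearray(b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n")

    # 2) Body: a catalog, a page tree node, and one empty page.
    bodies = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] >>",
    ]
    offsets = []
    for number, body in enumerate(bodies, start=1):
        offsets.append(len(out))  # byte offset where this object starts
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    # 3) Cross-reference table: entry 0 (free) plus one 20-byte entry per object.
    xref_offset = len(out)
    out += b"xref\n0 %d\n" % (len(bodies) + 1)
    out += b"0000000000 65535 f \n"
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset

    # 4) Trailer: /Size and /Root, then the byte offset of the xref table.
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(bodies) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_offset
    return bytes(out)
```

Writing the returned bytes to a file in binary mode should give a file with a single blank page. The offsets are counted in bytes, which is why the sketch builds the file as `bytes` rather than as text.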

Newer versions of the PDF standard allow for multiple bodies, XREFs and trailers. This is to facilitate incremental updates, where data is appended to a PDF file without the need to rewrite existing parts.
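
As a rough illustration of incremental updates, the sketch below reads the version from the header and lists every XREF offset recorded in a file. The function names and the two byte strings (with made-up offsets 120 and 310) are hypothetical. Scanning the whole file is a simplification: a real reader starts at the end of the file and follows the /Prev entry of each trailer, because stream data may happen to contain the same bytes.

```python
import re
from typing import List


def pdf_version(data: bytes) -> str:
    """Return the version given in the header, e.g. '1.7'."""
    match = re.match(rb"%PDF-(\d+\.\d+)", data)
    if match is None:
        raise ValueError("data does not start with a PDF header")
    return match.group(1).decode("ascii")


def startxref_offsets(data: bytes) -> List[int]:
    """Return every startxref value in file order; the last one is current."""
    pattern = re.compile(rb"startxref\s+(\d+)\s+%%EOF")
    return [int(match.group(1)) for match in pattern.finditer(data)]


# Hypothetical, heavily abbreviated file: an original revision ...
ORIGINAL = (
    b"%PDF-1.4\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"
    b"xref\n0 2\ntrailer\n<< /Size 2 /Root 1 0 R >>\n"
    b"startxref\n120\n%%EOF\n"
)
# ... and an appended update whose trailer points back with /Prev.
UPDATE = (
    b"2 0 obj\n<< /Title (second revision) >>\nendobj\n"
    b"xref\n2 1\ntrailer\n<< /Size 3 /Root 1 0 R /Prev 120 >>\n"
    b"startxref\n310\n%%EOF\n"
)

updated_file = ORIGINAL + UPDATE
print(pdf_version(updated_file), startxref_offsets(updated_file))
```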

Objects
The body of a PDF document consists of multiple objects and references. Objects include:


 * Metadata and document info
 * Fonts and font information
 * Images
 * Pages

Each object contains properties, in a mix of these forms:
 * key-value pairs
 * key-reference pairs
 * a data stream

For example, a Page object may look as follows:

6 0 obj <> endobj

This object is identified by the (6,0) pair: the object number and the object generation number (the latter is used in case of incremental updates to a PDF). It contains 7 properties, each consisting of a key that starts with a slash, followed by a value. The value of /Type is /Page, so we know this is a page object. Some properties have a reference to another object instead of a value. This is called an indirect reference. For example, "6 0 R" is a reference to the object with object number 6 and object generation number 0. Other properties, such as /Type and /Rotate, have an inline value. This value can be an integer (such as 0 for /Rotate), a name (such as /Page for /Type), or a more complex type such as an array of integers. Other possible values include strings (between parentheses), hexadecimal strings (between angle brackets), or dictionaries (between double angle brackets), such as shown in the following example objects:

1 0 obj <> endobj 2 0 obj <> endobj 5 0 obj <> endobj 9 0 obj <> /ExtGState <> /Font <> /ProcSet [/PDF /Text /ImageC] /XObject <>>> endobj

trailer < ] /Info 1 0 R /Root 2 0 R /Size 23>>
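
To make the notation concrete, the sketch below pulls the object identifier and the indirect references out of an object written as text. The page object in `PAGE_OBJECT` is hypothetical (its keys, object numbers and /MediaBox values are invented for illustration), and a regular expression is only a stand-in for a real parser: it ignores references inside arrays and nested dictionaries.

```python
import re

# Hypothetical page object; all keys and numbers are invented for this sketch.
PAGE_OBJECT = (
    "6 0 obj <</Type /Page /Parent 5 0 R /Resources 9 0 R /Contents 8 0 R "
    "/MediaBox [0 0 595 842] /Rotate 0>> endobj"
)


def object_id(text):
    """Return the (object number, generation number) pair of an object."""
    match = re.match(r"\s*(\d+)\s+(\d+)\s+obj\b", text)
    if match is None:
        raise ValueError("text does not start with 'N G obj'")
    return int(match.group(1)), int(match.group(2))


def indirect_references(text):
    """Map each key whose value has the form 'N G R' to the pair (N, G)."""
    pattern = re.compile(r"/(\w+)\s+(\d+)\s+(\d+)\s+R\b")
    return {key: (int(num), int(gen)) for key, num, gen in pattern.findall(text)}


print(object_id(PAGE_OBJECT))
print(indirect_references(PAGE_OBJECT))
```

Keys with inline values, such as /Type and /Rotate in this hypothetical object, do not match the pattern and are therefore left out of the result.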

Some objects may have an additional data stream. Typical uses include Page Contents, Images and Metadata.

3 0 obj <> stream <?xpacket begin='w' id='W5M0MpCehiHzreSzNTczkc9d'?> <?adobe-xap-filters esc="CRLF"?>      <rdf:Description rdf:about='uuid:7600fa4d-b451-407b-8421-6c477cc2d751' xmlns:dc='http://purl.org/dc/elements/1.1/' dc:format='application/pdf'> <dc:title><rdf:Alt><rdf:li xml:lang='x-default'>Microsoft Word - My Document final.rtf</rdf:li></rdf:Alt></dc:title> <dc:creator><rdf:Seq><rdf:li>Dijkstra</rdf:li></rdf:Seq></dc:creator> </rdf:Description> </rdf:RDF> </x:xmpmeta> <?xpacket end='w'?> endstream endobj

8 0 obj <</Filter /FlateDecode /Length 1908>> stream H$¤WÛrã¸q}÷Wà•LYñ ..... ¦y™Ù¹¸¼5qM­ endstream endobj

As you can see, the stream can be binary or text, and these objects typically have properties describing the compression type (/FlateDecode is the Flate/Deflate compression as supported by zlib) and the data size (/Length).
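
Since /FlateDecode is plain zlib compression, such a stream can be unpacked with Python's standard `zlib` module. The following is a minimal sketch with hypothetical helper names. It assumes a single object, a /Length given as an inline integer, and a line feed after the `stream` keyword; real files may store /Length as an indirect reference, use CR LF line endings, or chain several filters.

```python
import re
import zlib


def make_stream_object(number: int, payload: bytes) -> bytes:
    """Wrap payload in a Flate-compressed stream object."""
    compressed = zlib.compress(payload)
    header = b"%d 0 obj\n<</Filter /FlateDecode /Length %d>>\nstream\n" % (
        number,
        len(compressed),
    )
    return header + compressed + b"\nendstream\nendobj\n"


def read_stream_object(data: bytes) -> bytes:
    """Extract the stream of a single object and decompress it if needed."""
    length = int(re.search(rb"/Length\s+(\d+)", data).group(1))
    start = data.index(b"stream\n") + len(b"stream\n")
    raw = data[start:start + length]  # /Length counts the stored bytes
    if re.search(rb"/Filter\s*/FlateDecode", data[:start]):
        return zlib.decompress(raw)
    return raw


obj = make_stream_object(8, b"BT /TT2 1 Tf (Hello) Tj ET")
print(read_stream_object(obj))
```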

Page objects
In a simple PDF, the trailer's /Root entry points to a /Catalog object, which points to a /Pages object, which points to one or more /Page objects. See the above examples. In actual PDFs, the /Pages object may point to other /Pages objects (recursively building a page structure).
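
The recursive page structure can be walked with a short recursive function. The sketch below works on a hypothetical, already-parsed object table in which each object is a Python dict and each reference is reduced to the object number it points to. The table contents and function names are assumptions for illustration.

```python
# Hypothetical object table: object number -> dictionary. References to other
# objects are modeled as plain integers (the object number).
OBJECTS = {
    2: {"Type": "Catalog", "Pages": 5},
    5: {"Type": "Pages", "Kids": [6, 7], "Count": 2},
    6: {"Type": "Page", "Parent": 5},
    7: {"Type": "Pages", "Kids": [10], "Count": 1, "Parent": 5},
    10: {"Type": "Page", "Parent": 7},
}


def iter_pages(objects, number):
    """Yield the object numbers of all page objects below a node, in order."""
    node = objects[number]
    if node["Type"] == "Page":
        yield number
    else:  # a Pages node: descend into each of its kids
        for kid in node["Kids"]:
            yield from iter_pages(objects, kid)


def document_pages(objects, root_number):
    """Follow the catalog to the page tree and list its pages."""
    catalog = objects[root_number]
    return list(iter_pages(objects, catalog["Pages"]))


print(document_pages(OBJECTS, 2))
```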

A page object points to a /Resources object and a /Contents object. The resource object describes different objects that are required to render the page, including images and font descriptions. For example:

<</ColorSpace <</Cs6 10 0 R>> /ExtGState <</GS1 11 0 R>> /Font <</TT2 12 0 R /TT4 13 0 R>> /ProcSet [/PDF /Text /ImageC] /XObject <</Im1 21 0 R>>>>

The /Contents object describes a stream, called the Content Stream, that describes the page layout.
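
The link between the two objects is by name: the content stream mentions names such as /Im1, and the resource object says which object each name stands for. The sketch below shows that lookup. `RESOURCES` is the resource example above rewritten as a Python dict, while `CONTENT` is a shortened, hypothetical content stream. The regular expression only covers four operators and is an illustration, not a content stream parser.

```python
import re

# The resource example above as a Python dict; each value is an
# (object number, generation number) pair.
RESOURCES = {
    "ColorSpace": {"Cs6": (10, 0)},
    "ExtGState": {"GS1": (11, 0)},
    "Font": {"TT2": (12, 0), "TT4": (13, 0)},
    "XObject": {"Im1": (21, 0)},
}

# Operator -> resource category in which its name operand is looked up.
CATEGORY = {"gs": "ExtGState", "Do": "XObject", "Tf": "Font", "cs": "ColorSpace"}

# Shortened, hypothetical content stream.
CONTENT = "/GS1 gs q 67.74 0 0 89.94 72 723.74 cm /Im1 Do Q BT /TT2 1 Tf /Cs6 cs ET"


def used_resources(content, resources):
    """Return sorted (category, name, reference) tuples for the names used."""
    # A name, optionally one number (the font size for Tf), then the operator.
    pattern = re.compile(r"/(\w+)\s+(?:[\d.]+\s+)?(gs|Do|Tf|cs)\b")
    found = set()
    for name, operator in pattern.findall(content):
        category = CATEGORY[operator]
        found.add((category, name, resources[category][name]))
    return sorted(found)


for entry in used_resources(CONTENT, RESOURCES):
    print(entry)
```

In this sketch the font /TT4 is listed in the resources but never selected by the stream, so it does not show up in the result.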

Content Streams
A Content Stream is a sequence of instructions describing how elements are laid out on a page.

The syntax of a content stream is derived from the PostScript language. It includes regular PDF objects and instructions, but not the programmatic constructs such as loops.

The content stream includes information to place certain text, place figures, add transparency, and select the font. Text is typically included inline, while figures and fonts are referred to by a dictionary name. These dictionary names match the names that are included in the /Resources object associated with a certain /Page object.

For example:

q 1 i 72 813.74 67.74 -90 re W* n /GS1 gs q 67.740005 0 0 89.939995 72 723.740295 cm /Im1 Do Q Q /GS1 gs BT /TT2 1 Tf 12 0 0 12 139.74 723.7403 Tm /Cs6 cs 0 0 0 scn 0 Tc 0 Tw Tj 10.49 -0.93 TD 0.0022 Tw [(Call for Pa)6.2(pers )]TJ ...

You see instructions such as q, Q, and cm (save and restore the graphics state, and modify its transformation matrix), i (set flatness tolerance for graphics), re, n, and W* (construct, paint and clip a path; n ends the path without actually painting it), gs (set the graphics state, as specified in a resource object, in this case the object identified by /GS1), Do (paint the specified XObject, in this case an image identified by /Im1), BT and ET (begin and end text), Tf (set the font, in this case the font specified in resource object /TT2), Tc and Tw (set character and word spacing), Tm (set text matrix), cs and scn (set the color space based on a resource object, and set the color), Tj and TJ (show text and show text allowing individual glyph positioning).
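
Because operands always come before their operator, a content stream can be split into instructions by collecting tokens until an operator appears. The sketch below does this for a shortened version of the text part of the example. The string "(Hello)" is a made-up operand for Tj, and the tokenizer is deliberately simplified: it assumes no nested parentheses, no hexadecimal strings, no dictionaries, and no "]" inside a string.

```python
import re

TOKEN = re.compile(
    r"\((?:\\.|[^\\()])*\)"   # literal string without nested parentheses
    r"|\[[^\]]*\]"            # array, kept as one raw token
    r"|/[^\s/\[\]()<>]+"      # name
    r"|[-+]?\d*\.?\d+"        # number
    r"|[A-Za-z*']+"           # operator
)

# Shortened from the example above; "(Hello)" is a made-up operand.
CONTENT = (
    "BT /TT2 1 Tf 12 0 0 12 139.74 723.7403 Tm 0 Tc 0 Tw (Hello) Tj "
    "10.49 -0.93 TD [(Call for Pa)6.2(pers )]TJ ET"
)


def parse_content_stream(content):
    """Return a list of (operator, operands) pairs in stream order."""
    instructions, operands = [], []
    for match in TOKEN.finditer(content):
        token = match.group()
        if token[0] in "([/+-.0123456789":
            operands.append(token)  # operands precede their operator
        else:
            instructions.append((token, operands))
            operands = []
    return instructions


def shown_text(content):
    """Concatenate the strings shown by Tj and TJ, ignoring kerning numbers."""
    parts = []
    for operator, operands in parse_content_stream(content):
        if operator in ("Tj", "TJ"):
            parts += re.findall(r"\(((?:\\.|[^\\()])*)\)", operands[-1])
    return "".join(parts)


print(shown_text(CONTENT))
```

The numbers inside the TJ array (6.2 here) only adjust the spacing between glyphs, which is why the two string fragments join back into one word. Reconstructing spaces and line breaks from the text matrix is a much harder problem and is left out of this sketch.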

PDF libraries for Python
I've used the following libraries to programmatically parse and write PDF files from a script. Since there are many libraries, this overview only lists open source libraries for Python:


 * pyPDF, pyPDF2, and PyPDF4: Can read and write PDF files, and parse up to the level of objects. Limited text extraction from content streams. Has some knowledge about the document info, but not about metadata embedded in a stream (such as RDF-based metadata). Similar to pdfrw and pdftools.
 * pdfrw: Can read and write PDF files, and parse up to the level of objects. Does not parse content streams (it can only compress/extract them). Clean, Pythonic interface. Limited functionality for encryption and compression, and has some problems with newer PDFs (version 1.4 and up). Similar to pyPDF and pdftools.
 * pdftools: Can read and write PDF files, and parse up to the level of objects. Does not parse content streams (it can only compress/extract them). Poor documentation. Development seems to have ended in 2008. Similar to pyPDF and pdfrw.
 * pdfminer: Can read PDF files, and is extremely good at extracting text, including elimination of spurious whitespace and reconstruction of text spread over multiple columns. Support for decryption. Cannot create PDF files.
 * ReportLab Toolkit: Can write PDF files, with lots of control. Great documentation. Support for vector-based graphics may require the paid ReportLab PLUS version.
 * PDFLib Lite: C library with Python bindings. Can write PDF files. Support for parsing files requires the paid PDF import library (PDI). Good documentation, but not specific to Python.

See also Other PDF manipulation libraries and tools for some good references.