PDF (Portable Document Format) is a device-independent file format. A PDF file describes a fixed layout and contains all data needed to display the contents. The advantage is that a PDF looks the same on all devices. A disadvantage is that a PDF does not take advantage of the display capabilities of a particular device.
PDF is paginated: a PDF document consists of one or more pages. This is unlike HTML and CSS, which are also device-independent, but which are not paginated, and may adjust the layout based on the display capabilities of a particular device. An advantage of PDF over HTML and CSS is that it can embed all the contents, including fonts and images.
Structure of a PDF document
A PDF document consists of four parts:
- A header, specifying the version (e.g. "%PDF-1.3" or "%PDF-1.7") and some "garbage bytes" (to ensure proper transmission of binary data)
- The body, a sequence of objects
- The cross-reference table (XREF), describing where to find each object (for fast random access to a PDF file)
- The trailer, which tells where to find the XREF, the /Root object and the /Info object, and gives the /Size (the number of entries in the cross-reference table, which is one more than the highest object number).
Newer versions of the PDF standard allow for multiple bodies, XREFs and trailers. This facilitates incremental updates, where data is appended to a PDF file without the need to rewrite existing parts.
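To make the four parts concrete, here is a minimal sketch in Python that writes a tiny single-section PDF by hand. The function names `build_minimal_pdf` and `find_xref_offset` are my own, the three objects are only a skeleton (a page without contents), and the sketch ignores incremental updates and cross-reference streams.

```python
def build_minimal_pdf(objects):
    """Serialize object bodies (bytes) into header, body, XREF and trailer."""
    # Header: version line plus a comment line with bytes above 127.
    out = bytearray(b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n")
    offsets = []
    # Body: objects are numbered 1..n, all with generation 0.
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of "N 0 obj"
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"
    # XREF: one 20-byte entry per object, plus entry 0 (head of the free list).
    xref_offset = len(out)
    size = len(objects) + 1
    out += b"xref\n0 %d\n" % size
    out += b"0000000000 65535 f \n"
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset
    # Trailer: /Root and /Size, followed by the offset of the XREF.
    out += b"trailer\n<</Root 1 0 R /Size %d>>\n" % size
    out += b"startxref\n%d\n%%%%EOF\n" % xref_offset
    return bytes(out)


def find_xref_offset(data):
    """Return the byte offset that follows the last 'startxref' keyword."""
    tail = data[data.rfind(b"startxref"):]
    return int(tail.split()[1])


pdf = build_minimal_pdf([
    b"<</Type /Catalog /Pages 2 0 R>>",
    b"<</Type /Pages /Kids [3 0 R] /Count 1>>",
    b"<</Type /Page /Parent 2 0 R /MediaBox [0 0 595 842]>>",
])
print(pdf[find_xref_offset(pdf):].decode("ascii"))
```

A reader works the other way around: it starts at the end of the file, follows `startxref` to the XREF, and only then jumps to the objects it needs.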
The body of a PDF document consists of multiple objects and references. Objects include:
- Metadata and document info
- Fonts and font information
Each object contains properties, in a mix of these forms:
- key-value pairs
- key-reference pairs
- a data stream
For example, a Page object may look as follows:
6 0 obj <</Contents 8 0 R /CropBox [0 0 595 842] /MediaBox [0 0 595 842] /Parent 5 0 R /Resources 9 0 R /Rotate 0 /Type /Page>> endobj
This object is identified by the (6,0) pair (the object number and object generation number -- the latter is used in case of incremental updates to a PDF). It contains 7 properties (/Contents, /CropBox, /MediaBox, /Parent, /Resources, /Rotate and /Type). The value of /Type is /Page, so we know this is a page object. Some properties, such as /Contents and /Resources, have a reference to another object. This is called an indirect reference. For example "6 0 R" is a reference to the object with object number 6 and object generation number 0. Other properties, such as /Rotate and /Type, have an inline value. This value can be an integer (such as 0 for /Rotate), a name (such as /Page for /Type) or a more complex type such as an array of integers for /MediaBox and /CropBox. Other possible values include strings (between parentheses), hexadecimal strings (between angle brackets), or dictionaries (between double angle brackets), such as shown in the following example objects:
1 0 obj <</Author (Dijkstra) /CreationDate (D:20070524152509+01'00') /Creator (PScript5.dll Version 5.2.2) /ModDate (D:20070524152509+01'00') /Producer (Acrobat Distiller 6.0 \(Mac\)) /Title (My Document)>> endobj
2 0 obj <</Metadata 3 0 R /PageLabels 4 0 R /Pages 5 0 R /Type /Catalog>> endobj
5 0 obj <</Count 2 /Kids [6 0 R 7 0 R] /Type /Pages>> endobj
9 0 obj <</ColorSpace <</Cs6 10 0 R>> /ExtGState <</GS1 11 0 R>> /Font <</TT2 12 0 R /TT4 13 0 R>> /ProcSet [/PDF /Text /ImageC] /XObject <</Im1 21 0 R>>>> endobj
trailer <</ID [<01f19c06b79a321dc878ac6d6b7ccf6a> <da1e61ccd1740d4cb7484ba3c5d57b7f>] /Info 1 0 R /Root 2 0 R /Size 23>>
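As a rough illustration of the object syntax, the following Python sketch pulls the object identifier and the key-reference pairs out of the textual form of an object. The helper names `object_id` and `key_reference_pairs` are hypothetical, and the regular expressions are a shortcut: they do not look inside arrays such as /Kids, and a real parser would tokenize the object properly.

```python
import re

# "6 0 obj" at the start of an object: object number, generation number.
OBJ_HEADER = re.compile(r"^\s*(\d+)\s+(\d+)\s+obj\b")
# "/Key 8 0 R": a key whose value is an indirect reference.
KEY_REFERENCE = re.compile(r"/(\w+)\s+(\d+)\s+(\d+)\s+R\b")


def object_id(text):
    """Return the (object number, generation number) pair of an object."""
    match = OBJ_HEADER.match(text)
    if match is None:
        raise ValueError("not an indirect object definition")
    return int(match.group(1)), int(match.group(2))


def key_reference_pairs(text):
    """Map each key that directly holds an indirect reference to its target."""
    return {
        key: (int(number), int(generation))
        for key, number, generation in KEY_REFERENCE.findall(text)
    }


page = ("6 0 obj <</Contents 8 0 R /CropBox [0 0 595 842] "
        "/MediaBox [0 0 595 842] /Parent 5 0 R /Resources 9 0 R "
        "/Rotate 0 /Type /Page>> endobj")
print(object_id(page))
print(key_reference_pairs(page))
```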
Some objects may have an additional data stream. Typical uses include Page Contents, Images and Metadata.
3 0 obj <</Length 3346 /Subtype /XML /Type /Metadata>> stream
<?xpacket begin='w' id='W5M0MpCehiHzreSzNTczkc9d'?>
<?adobe-xap-filters esc="CRLF"?>
<x:xmpmeta xmlns:x='adobe:ns:meta/' x:xmptk='XMP toolkit 2.9.1-13, framework 1.6'>
<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#' xmlns:iX='http://ns.adobe.com/iX/1.0/'>
<rdf:Description rdf:about='uuid:7600fa4d-b451-407b-8421-6c477cc2d751' xmlns:pdf='http://ns.adobe.com/pdf/1.3/' pdf:Producer='Acrobat Distiller 6.0 (<Mac)' />
<rdf:Description rdf:about='uuid:7600fa4d-b451-407b-8421-6c477cc2d751' xmlns:xap='http://ns.adobe.com/xap/1.0/' xap:CreateDate='2007-05-24T15:25:09+01:00' xap:CreatorTool='PScript5.dll Version 5.2.2' xap:ModifyDate='2007-05-24T15:25:09+01:00' />
<rdf:Description rdf:about='uuid:7600fa4d-b451-407b-8421-6c477cc2d751' xmlns:xapMM='http://ns.adobe.com/xap/1.0/mm/' xapMM:DocumentID='uuid:9776c93a-64a2-4a75-b81c-3122e81eaa64' />
<rdf:Description rdf:about='uuid:7600fa4d-b451-407b-8421-6c477cc2d751' xmlns:dc='http://purl.org/dc/elements/1.1/' dc:format='application/pdf'>
<dc:title><rdf:Alt><rdf:li xml:lang='x-default'>Microsoft Word - My Document final.rtf</rdf:li></rdf:Alt></dc:title>
<dc:creator><rdf:Seq><rdf:li>Dijkstra</rdf:li></rdf:Seq></dc:creator>
</rdf:Description>
</rdf:RDF>
</x:xmpmeta>
<?xpacket end='w'?>
endstream endobj
8 0 obj <</Filter /FlateDecode /Length 1908>> stream H$¤WÛrã¸q}÷Wà•LYñ ..... ¦y™Ù¹¸¼5qM endstream endobj
As you can see, the stream can be binary or text, and these objects typically have properties describing the compression type (/Filter; /FlateDecode is the Flate/Deflate compression as supported by zlib) and the data size (/Length).
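Since /FlateDecode is plain zlib compression, a stream object can be written and read back with Python's standard `zlib` module. The sketch below uses two helper names of my own (`make_flate_stream`, `read_flate_stream`) and assumes that /Length is an inline integer and that /FlateDecode is the only filter; in real files /Length may be an indirect reference and filters can be chained.

```python
import re
import zlib


def make_flate_stream(number, payload):
    """Wrap payload (bytes) in a Flate-compressed stream object."""
    compressed = zlib.compress(payload)
    header = b"%d 0 obj\n<</Filter /FlateDecode /Length %d>>\nstream\n" % (
        number, len(compressed))
    return header + compressed + b"\nendstream\nendobj\n"


def read_flate_stream(obj):
    """Return the decompressed data of a stream object built as above."""
    # /Length counts the compressed bytes between "stream" and "endstream".
    length = int(re.search(rb"/Length\s+(\d+)", obj).group(1))
    start = obj.index(b"stream\n") + len(b"stream\n")
    return zlib.decompress(obj[start:start + length])


content = b"BT /TT2 1 Tf 12 0 0 12 72 720 Tm (Hello) Tj ET"
obj = make_flate_stream(8, content)
print(read_flate_stream(obj))
```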
In a simple PDF, the trailer object points to a /Root object, which points to a /Pages object, which points to one or more /Page objects. See the above examples. In actual PDFs, the /Pages object may point to other /Pages objects (recursively building a page structure).
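The recursive page structure can be walked with a few lines of Python. This is only a toy model: objects are plain dictionaries keyed by object number, references are bare integers, and the object numbers 14 and 15 are made up for the example.

```python
def iter_pages(objects, ref):
    """Yield the object numbers of all /Page objects below ref, in order."""
    node = objects[ref]
    if node["Type"] == "Page":
        yield ref
    elif node["Type"] == "Pages":
        # A /Pages node may contain /Page objects and further /Pages nodes.
        for kid in node["Kids"]:
            yield from iter_pages(objects, kid)
    else:
        raise ValueError("unexpected /Type %r in page tree" % node["Type"])


# Toy document: a /Pages root with one page and one nested /Pages node.
objects = {
    2: {"Type": "Catalog", "Pages": 5},
    5: {"Type": "Pages", "Kids": [6, 14], "Count": 3},
    6: {"Type": "Page", "Parent": 5},
    14: {"Type": "Pages", "Kids": [7, 15], "Parent": 5, "Count": 2},
    7: {"Type": "Page", "Parent": 14},
    15: {"Type": "Page", "Parent": 14},
}
trailer = {"Root": 2}

root = objects[trailer["Root"]]
pages = list(iter_pages(objects, root["Pages"]))
print(pages)
```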
A page object points to a /Contents object and a /Resources object. The resource object describes different objects that are required to render the page, including images and font descriptions. For example:
<</ColorSpace <</Cs6 10 0 R>> /ExtGState <</GS1 11 0 R>> /Font <</TT2 12 0 R /TT4 13 0 R>> /ProcSet [/PDF /Text /ImageC] /XObject <</Im1 21 0 R>>>>
The /Contents object describes a stream, called the Content Stream, that describes the page layout.
A Content Stream is a sequence of instructions describing how elements are laid out on a page.
The syntax of a content stream is modeled on a subset of the PostScript language. It includes regular PDF objects and instructions, but not programmatic constructs such as loops.
The content stream includes information to place certain text, place figures, add transparency, and select the font. Text is typically included inline, while figures and fonts are referred to by a dictionary name. These dictionary names match the names that are included in the /Resources object associated with a certain /Page object. For example:
q 1 i 72 813.74 67.74 -90 re W* n /GS1 gs q 67.740005 0 0 89.939995 72 723.740295 cm /Im1 Do Q Q /GS1 gs BT /TT2 1 Tf 12 0 0 12 139.74 723.7403 Tm /Cs6 cs 0 0 0 scn 0 Tc 0 Tw ( )Tj 10.49 -0.93 TD 0.0022 Tw [(Call for Pa)6.2(pers )]TJ ...
You see instructions such as:
- q, Q and cm (save, restore and set the graphics state),
- i (set flatness tolerance for graphics),
- re, W* and n (construct, paint and clip a path),
- gs (set the graphics state, as specified in a resource object, in this case the object identified by /GS1),
- Do (paint the specified XObject, in this case an image identified by /Im1),
- BT and ET (begin and end text),
- Tf (set the font, in this case the font specified in resource object /TT2),
- Tc and Tw (set character and word spacing),
- Tm (set text matrix),
- cs and scn (set the color space based on a resource object, and set the color),
- Tj and TJ (show text and show text allowing individual glyph positioning).
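Because operands always come before their operator, a content stream can be split into instructions with a small tokenizer. The Python sketch below is a simplification under several assumptions: the names `parse_content_stream` and `shown_text` are my own, literal strings may not contain nested parentheses, escape sequences are not decoded, and inline images and dictionaries are not handled.

```python
import re

TOKEN = re.compile(r"""
    \((?:\\.|[^\\()])*\)          # literal string, no nested parentheses
  | <[0-9A-Fa-f\s]*>              # hexadecimal string
  | \[ | \]                       # array delimiters
  | /[^\s()<>\[\]/]*              # name, such as /TT2
  | [+-]?(?:\d+\.?\d*|\.\d+)      # number
  | [A-Za-z][A-Za-z0-9]*\*?       # operator, such as Tf or W*
""", re.VERBOSE)


def parse_content_stream(stream):
    """Yield (operator, operands) pairs; operands are raw token strings."""
    operands = []
    for match in TOKEN.finditer(stream):
        token = match.group()
        if token[0].isalpha():
            # An operator closes the instruction and consumes the operands.
            yield token, operands
            operands = []
        else:
            operands.append(token)


def shown_text(stream):
    """Concatenate the literal strings passed to the Tj and TJ operators."""
    parts = []
    for operator, operands in parse_content_stream(stream):
        if operator in ("Tj", "TJ"):
            parts.extend(t[1:-1] for t in operands if t.startswith("("))
    return "".join(parts)


stream = ("BT /TT2 1 Tf 12 0 0 12 139.74 723.7403 Tm 0 Tc 0 Tw ( )Tj "
          "10.49 -0.93 TD 0.0022 Tw [(Call for Pa)6.2(pers )]TJ ET")
for operator, operands in parse_content_stream(stream):
    print(operator, operands)
print(repr(shown_text(stream)))
```

The numbers inside a TJ array (6.2 in this case) are kerning adjustments, which is why the sketch drops them when it joins the text fragments.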
PDF libraries for Python
I've used the following libraries to programmatically parse and write PDF files from a script. Since there are many libraries, this overview only lists open source libraries for Python:
- pyPDF, pyPDF2, and PyPDF4
- Can read and write PDF files, and parse up to the level of objects. Limited text extraction from content streams. Has some knowledge about the document info, but not about metadata embedded in a stream (such as RDF-based metadata). Similar to pdfrw and pdftools.
- pdfrw
- Can read and write PDF files, and parse up to the level of objects. Does not parse content streams (it can only compress/extract them). Clean Pythonesque interface. Limited functionality for encryption and compression, and has some problems with newer PDFs (version 1.4 and up). Similar to pyPDF and pdftools.
- pdftools
- Can read and write PDF files, and parse up to the level of objects. Does not parse content streams (it can only compress/extract them). Poor documentation. Development seems to have ended in 2008. Similar to pyPDF and pdfrw.
- Can read PDF files, and is extremely good at extracting text, including elimination of spurious whitespace and reconstruction of text spread over multiple columns. Support for decryption. Cannot create PDF files.
- ReportLab Toolkit
- Can write PDF files, with lots of control. Great documentation. Support for vector-based graphics may require the paid ReportLab PLUS version.
- PDFLib Lite
- C library with Python bindings. Can write PDF files. Support for parsing files requires the paid PDF import library (PDI). Good documentation, but not specific to Python.
See also Other PDF manipulation libraries and tools for some good references.