pikepdf API

Primary objects

class pikepdf.Pdf

In-memory representation of a PDF

Root

the /Root object of the PDF

attach(*, basename, filebytes, mime=None, desc='')

Attach a file to this PDF

Parameters:
  • basename (str) – The basename (filename without path) to name the file. Not necessarily the name of the file on disk. Will be shown to the user by the PDF viewer.
  • filebytes (bytes) – The file contents.
  • mime (str or None) – A MIME type for the filebytes. If omitted, we try to guess based on the standard library’s mimetypes.guess_type(). If this cannot be determined, the generic value application/octet-stream is used. This value is used by PDF viewers to decide how to present the information to the user.
  • desc (str) – An extended description of the file contents. PDF viewers also display this information to the user. In Acrobat DC this is hidden in a context menu.

The PDF will also be modified to request the PDF viewer to display the list of attachments when opened, as opposed to other viewing modes. Some PDF viewers will not make it obvious to the user that attachments are present unless this is done. This behavior may be overridden by changing pdf.Root.PageMode to some other valid value.
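The fallback rule described for the mime parameter can be sketched with the standard library alone. The helper name guess_attachment_mime is hypothetical, and this is not pikepdf's actual implementation; it only illustrates the documented behavior of trying mimetypes.guess_type() first and falling back to application/octet-stream.

```python
import mimetypes


def guess_attachment_mime(basename: str) -> str:
    """Hypothetical helper: guess a MIME type from a file's basename.

    Mirrors the documented rule: use mimetypes.guess_type(), and fall
    back to the generic type when no guess is available.
    """
    guessed, _encoding = mimetypes.guess_type(basename)
    return guessed or "application/octet-stream"


# A well-known extension is resolved by the standard library.
print(guess_attachment_mime("notes.txt"))
# An unrecognized extension falls back to the generic type.
print(guess_attachment_mime("payload.unknown-ext"))
```

Passing mime explicitly to attach() avoids relying on the guess when the basename has no meaningful extension.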

check_linearization(self: pikepdf._qpdf.Pdf, stream: object=sys.stderr) → None

Reports information on the PDF’s linearization

Parameters:stream – A stream to write this information to; must implement the .write() and .flush() methods. Defaults to sys.stderr.
copy_foreign(self: pikepdf._qpdf.Pdf, arg0: QPDFObjectHandle) → QPDFObjectHandle

Copy object from foreign PDF to this one.

filename

the source filename of an existing PDF, when available

get_object(*args, **kwargs)

Overloaded function.

  1. get_object(self: pikepdf._qpdf.Pdf, arg0: Tuple[int, int]) -> QPDFObjectHandle

    Look up an object by ID and generation number

    Returns:

    pikepdf.Object

  2. get_object(self: pikepdf._qpdf.Pdf, arg0: int, arg1: int) -> QPDFObjectHandle

    Look up an object by ID and generation number

    Returns:

    pikepdf.Object

get_warnings(self: pikepdf._qpdf.Pdf) → List[QPDFExc]
is_linearized

Returns True if the PDF is linearized.

Specifically returns True iff the file starts with a linearization parameter dictionary. Does no additional validation.

make_indirect(*args, **kwargs)

Overloaded function.

  1. make_indirect(self: pikepdf._qpdf.Pdf, arg0: QPDFObjectHandle) -> QPDFObjectHandle

    Attach an object to the Pdf as an indirect object

    Direct objects appear inline in the binary encoding of the PDF. Indirect objects appear inline as references (in English, “look up object 4 generation 0”) and are then read from another location in the file. The PDF specification requires that certain objects are indirect - consult the PDF specification to confirm.

    Generally a resource that is shared should be attached as an indirect object. pikepdf.Stream objects are always indirect, and creating one automatically attaches it to the Pdf.

    See also pikepdf.Object.is_indirect().

    Returns:

    pikepdf.Object

  2. make_indirect(self: pikepdf._qpdf.Pdf, arg0: object) -> QPDFObjectHandle

    Encode a Python object and attach to this Pdf as an indirect object

    Returns:

    pikepdf.Object

metadata

access the document information dictionary

new() → pikepdf._qpdf.Pdf

create a new empty PDF from scratch

open(filename_or_stream: object, password: str='', hex_password: bool=False, ignore_xref_streams: bool=False, suppress_warnings: bool=True, attempt_recovery: bool=True, inherit_page_attributes: bool=True) → pikepdf._qpdf.Pdf

Open an existing file at filename_or_stream.

If filename_or_stream is path-like, the file will be opened.

If filename_or_stream has .read() and .seek() methods, the file will be accessed as a readable binary stream. pikepdf will read the entire stream into a private buffer.
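The two accepted input kinds can be told apart by duck typing. The following is a minimal sketch of that distinction, assuming only what the text above states; describe_source is a hypothetical helper and does not reflect how pikepdf itself dispatches.

```python
import io
import os
import pathlib


def describe_source(filename_or_stream) -> str:
    """Hypothetical helper: classify an argument as a path or a stream."""
    # Path-like inputs: plain strings, bytes paths, or os.PathLike objects.
    if isinstance(filename_or_stream, (str, bytes, os.PathLike)):
        return "path"
    # Stream inputs: anything exposing both .read() and .seek().
    if hasattr(filename_or_stream, "read") and hasattr(filename_or_stream, "seek"):
        return "stream"
    raise TypeError("expected a path-like object or a seekable readable stream")


print(describe_source(pathlib.Path("input.pdf")))
print(describe_source(io.BytesIO(b"%PDF-1.7\n")))
```

Because a stream is read entirely into a private buffer, opening very large PDFs from a path may use less memory than opening them from a stream.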

Parameters:
  • filename_or_stream (os.PathLike) – Filename of PDF to open
  • password (str or bytes) – User or owner password to open an encrypted PDF. If a str is given it will be converted to UTF-8.
  • hex_password (bool) – If True, interpret the password as a hex-encoded version of the exact encryption key to use, without performing the normal key computation. Useful in forensics.
  • ignore_xref_streams (bool) – If True, ignore cross-reference streams. See qpdf documentation.
  • suppress_warnings (bool) – If True (default), warnings are not printed to stderr. Use get_warnings() to retrieve warnings.
  • attempt_recovery (bool) – If True (default), attempt to recover from PDF parsing errors.
  • inherit_page_attributes (bool) – If True (default), push attributes set on a group of pages to individual pages
Raises:
pdf_version

the PDF standard version, such as ‘1.7’

remove_unreferenced_resources(self: pikepdf._qpdf.Pdf) → None

Remove from /Resources of each page any object not referenced in the page’s contents

PDF pages may share resource dictionaries with other pages. If pikepdf is used for page splitting, pages may reference resources in their /Resources dictionary that are not actually required. This purges all unnecessary resource entries.

Suggested before saving.

root

alias for .Root, the /Root object of the PDF

save(self: pikepdf._qpdf.Pdf, filename: object, static_id: bool=False, preserve_pdfa: bool=True, min_version: str='', force_version: str='', compress_streams: bool=True, object_stream_mode: pikepdf._qpdf.ObjectStreamMode=ObjectStreamMode.preserve, stream_data_mode: pikepdf._qpdf.StreamDataMode=StreamDataMode.preserve, normalize_content: bool=False, linearize: bool=False, progress: object=None) → None

Save all modifications to this pikepdf.Pdf

Parameters:
  • filename (str or stream) – Where to write the output
  • static_id (bool) – Indicates that the /ID metadata, normally calculated as a hash of certain PDF contents and metadata including the current time, should instead be generated deterministically. Normally for debugging.
  • preserve_pdfa (bool) – Ensures that the file is generated in a manner compliant with PDF/A and other stricter variants. This should be True, the default, in most cases.
  • min_version (str) – Sets the minimum version of PDF specification that should be required. If left alone QPDF will decide.
  • force_version (str) – Override the version recommended by QPDF, potentially creating an invalid file that does not display in old versions. See QPDF manual for details.
  • object_stream_mode (pikepdf.ObjectStreamMode) – disable prevents the use of object streams. preserve keeps object streams from the input file. generate uses object streams wherever possible, creating the smallest files but requiring PDF 1.5+.
  • stream_data_mode (pikepdf.StreamDataMode) – uncompress decompresses all data. preserve keeps existing compressed objects compressed. compress attempts to compress all objects.
  • normalize_content (bool) – Enables parsing and reformatting the content stream within PDFs. This may make debugging PDFs easier.
  • linearize (bool) – Enables creating linearized or “fast web view” files, where the file’s contents are organized sequentially so that a viewer can begin rendering before it has the whole file. As a drawback, it tends to make files larger.

You may call .save() multiple times with different parameters to generate different versions of a file, and you may continue to modify the file after saving it. .save() does not modify the Pdf object in memory.

Note

Calling pikepdf.Pdf.remove_unreferenced_resources() before saving may eliminate unnecessary resources from the output file, so doing so is recommended. This is not done automatically because .save() is intended to be idempotent.

show_xref_table(self: pikepdf._qpdf.Pdf) → None

Pretty-print the Pdf’s xref (cross-reference table)

trailer

Provides access to the PDF trailer object.

See section 7.5.5 of the PDF reference manual. Generally speaking, the trailer should not be modified with pikepdf, and modifying it may not work. Some of the values in the trailer are automatically changed when a file is saved.

pikepdf.open(*args, **kwargs)

Alias for pikepdf.Pdf.open().

class pikepdf.ObjectStreamMode
disable
preserve
generate
class pikepdf.StreamDataMode
uncompress
preserve
compress
exception pikepdf.PdfError
exception pikepdf.PasswordError

Object construction

class pikepdf.Object
as_dict(self: pikepdf._qpdf.Object) → Dict[str, pikepdf._qpdf.Object]
as_list(self: pikepdf._qpdf.Object) → List[pikepdf._qpdf.Object]
get(self: pikepdf._qpdf.Object, key: str, default_: object=None) → object

for dictionary objects, behave as dict.get(key, default=None)

get_raw_stream_buffer(self: pikepdf._qpdf.Object) → pikepdf._qpdf.Buffer

Return a buffer protocol buffer describing the raw, encoded stream

get_stream_buffer(self: pikepdf._qpdf.Object) → pikepdf._qpdf.Buffer

Return a buffer protocol buffer describing the decoded stream

is_owned_by(self: pikepdf._qpdf.Object, arg0: pikepdf._qpdf.Pdf) → bool

Test if this object is owned by the indicated possible_owner.

keys(self: pikepdf._qpdf.Object) → Set[str]
objgen

Return the object-generation number pair for this object

If this is a direct object, then the returned value is (0, 0). By definition, if this is an indirect object, it has an “objgen”, and can be looked up using this in the cross-reference (xref) table. Direct objects cannot necessarily be looked up.

The generation number is usually 0, except for PDFs that have been incrementally updated.

page_contents_add(self: pikepdf._qpdf.Object, contents: pikepdf._qpdf.Object, prepend: bool=False) → None

Append or prepend to an existing page’s content stream.

page_contents_coalesce(self: pikepdf._qpdf.Object) → None
parse(stream: str, description: str='') → pikepdf._qpdf.Object

Parse PDF binary representation into PDF objects.

read_bytes(self: pikepdf._qpdf.Object) → bytes

Decode and read the content stream associated with this object

read_raw_bytes(self: pikepdf._qpdf.Object) → bytes

Read the content stream associated with this object without decoding

unparse(self: pikepdf._qpdf.Object, resolved: bool=False) → bytes

Convert PDF objects into their binary representation, optionally resolving indirect objects.

write(self: pikepdf._qpdf.Object, arg0: bytes, *args, **kwargs) → None

Replace the content stream with data, compressed according to filter and decode_parms

Parameters:
  • data (bytes) – the new data to use for replacement
  • filter – The filter(s) with which the data is (already) encoded
  • decode_parms – Parameters for the filters with which the object is encoded

If only one filter is specified, it may be a name such as Name('/FlateDecode'). If there are multiple filters, then an array of names should be given.

If there is only one filter, decode_parms is a Dictionary of parameters for that filter. If there are multiple filters, then decode_parms is an Array of Dictionary, where each array index corresponds to the filter.
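Since filter names the encoding the data already has, the caller is expected to encode the bytes before passing them in. As a minimal sketch, /FlateDecode data can be produced with the standard library's zlib module. The sample content stream is made up, and the final call is left as a comment because stream_obj is a hypothetical, already existing stream object.

```python
import zlib

# Made-up content stream bytes, used only for illustration.
raw = b"BT /F1 12 Tf 72 720 Td (Hello) Tj ET"

# /FlateDecode streams use the zlib format, so zlib.compress()
# yields bytes that are already encoded for that filter.
encoded = zlib.compress(raw)

# Round-trip check: decoding must give back the original bytes.
assert zlib.decompress(encoded) == raw

# Hypothetical usage, assuming `stream_obj` is a pikepdf stream object:
# stream_obj.write(encoded, filter=pikepdf.Name('/FlateDecode'))
```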

class pikepdf.Name

Constructs a PDF Name object

Names can be constructed with two notations:

  1. Name.Resources
  2. Name('/Resources')

The two are semantically equivalent. The former is preferred for names that are normally expected to be in a PDF. The latter is preferred for dynamic names and attributes.

static __new__(cls, name)

Create and return a new object. See help(type) for accurate signature.

class pikepdf.String

Constructs a PDF String object

static __new__(cls, s)
Parameters:s (str or bytes) – The string to use. A str will be encoded for PDF; bytes will be used as-is without encoding.
Returns:pikepdf.Object
class pikepdf.Array

Constructs a PDF Array object

static __new__(cls, a=[])
Parameters:a (iterable) – A list of objects. All objects must be either pikepdf.Object or convertible to pikepdf.Object.
Returns:pikepdf.Object
class pikepdf.Dictionary

Constructs a PDF Dictionary object

static __new__(cls, d=None, **kwargs)

Constructs a PDF Dictionary from either a Python dict or keyword arguments.

These two examples are equivalent:

pikepdf.Dictionary({'/NameOne': 1, '/NameTwo': 'Two'})

pikepdf.Dictionary(NameOne=1, NameTwo='Two')

In either case, the keys must be strings, and the strings correspond to the desired Names in the PDF Dictionary. The values must all be convertible to pikepdf.Object.

Returns:pikepdf.Object
class pikepdf.Stream

Constructs a PDF Stream object

static __new__(cls, owner, obj)
Parameters:
Returns:

pikepdf.Object

class pikepdf.Operator(arg0: str) → pikepdf._qpdf.Object

Construct a PDF Operator object for use in content streams

Support models

pikepdf.parse_content_stream(page_or_stream, operators='')

Parse a PDF content stream into a sequence of instructions.

A PDF content stream is a list of instructions that describe where to render the text and graphics in a PDF. This is the starting point for analyzing PDFs.

If the input is a page and page.Contents is an array, then the content stream is automatically treated as one coalesced stream.

Each instruction contains at least one operator and zero or more operands.

Parameters:
  • page_or_stream (pikepdf.Object) – A page object, or the content stream attached to another object such as a Form XObject.
  • operators (str) – A space-separated string of operators to whitelist. For example ‘q Q cm Do’ will return only operators that pertain to drawing images. Use ‘BI ID EI’ for inline images. All other operators and associated tokens are ignored. If blank, all tokens are accepted.
Returns:

List of (operands, command) tuples, where command is an operator (str) and operands is a tuple of str: the PDF drawing command and the command’s operands, respectively.

Return type:

list
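The effect of the operators whitelist can be illustrated on a hand-written instruction list with the same (operands, command) shape as the return value. This is only a sketch of the documented semantics, not pikepdf's parser; filter_instructions and the sample data are hypothetical.

```python
def filter_instructions(instructions, operators=""):
    """Hypothetical helper: keep only instructions whose command is listed.

    `operators` is a space-separated string; a blank string keeps everything.
    """
    wanted = set(operators.split())
    if not wanted:
        return list(instructions)
    return [(operands, command) for operands, command in instructions
            if command in wanted]


# Hand-written sample in (operands, command) form.
sample = [
    ((), "q"),
    (("1", "0", "0", "1", "72", "720"), "cm"),
    (("/Im0",), "Do"),
    ((), "Q"),
    ((), "BT"),
    (("(Hello)",), "Tj"),
    ((), "ET"),
]

# Keep only the operators that pertain to drawing images.
for operands, command in filter_instructions(sample, "q Q cm Do"):
    print(command, operands)
```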

Example

>>> pdf = pikepdf.Pdf.open(input_pdf)
>>> page = pdf.pages[0]
>>> for operands, command in pikepdf.parse_content_stream(page):
...     print(command)
class pikepdf.PdfMatrix(*args)

Support class for PDF content stream matrices

PDF content stream matrices are 3x3 matrices summarized by a shorthand (a, b, c, d, e, f) which correspond to the first two column vectors. The final column vector is always (0, 0, 1) since this is using homogeneous coordinates.

PDF uses row vectors. That is, vr @ A' gives the effect of transforming a row vector vr=(x, y, 1) by the matrix A'. Most textbook treatments use A @ vc where the column vector vc=(x, y, 1)'.

(@ is the Python matrix multiplication operator added in Python 3.5.)
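The row-vector convention can be checked with plain Python tuples. This sketch does not use PdfMatrix; the helper names to_matrix, matmul and transform are hypothetical and only expand the shorthand into a full 3x3 matrix and multiply a row vector (x, y, 1) from the left.

```python
import math


def to_matrix(a, b, c, d, e, f):
    """Expand the shorthand into a 3x3 matrix whose last column is (0, 0, 1)."""
    return ((a, b, 0.0), (c, d, 0.0), (e, f, 1.0))


def matmul(m, n):
    """Multiply two 3x3 matrices."""
    return tuple(
        tuple(sum(m[i][k] * n[k][j] for k in range(3)) for j in range(3))
        for i in range(3)
    )


def transform(point, m):
    """Apply m to the row vector (x, y, 1) and return the new (x, y)."""
    row = (point[0], point[1], 1.0)
    out = tuple(sum(row[k] * m[k][j] for k in range(3)) for j in range(3))
    return out[0], out[1]


translate = to_matrix(1, 0, 0, 1, 10, 20)
scale = to_matrix(2, 0, 0, 3, 0, 0)
angle = math.radians(90)
rotate90 = to_matrix(math.cos(angle), math.sin(angle),
                     -math.sin(angle), math.cos(angle), 0, 0)

print(transform((1, 2), translate))                  # translation only
print(transform((1, 2), matmul(translate, scale)))   # translate, then scale
print(transform((1, 2), matmul(scale, translate)))   # scale, then translate
print(transform((1, 0), rotate90))                   # 90 degrees counterclockwise
```

With row vectors, the leftmost matrix in a product is applied first, which is why the two compositions above give different points.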

Addition and other operations are not implemented because they’re not that meaningful in a PDF context (they can be defined and are mathematically meaningful in general).

PdfMatrix objects are immutable. All transformations on them produce a new matrix.

a
b
c
d
e
f

Return one of the six “active values” of the matrix.

encode()

Encode this matrix in binary suitable for including in a PDF

static identity()

Constructs and returns an identity matrix

rotated(angle_degrees_ccw)

Concatenates a rotation matrix on this matrix

scaled(x, y)

Concatenates a scaling matrix on this matrix

shorthand

Return the 6-tuple (a,b,c,d,e,f) that describes this matrix

translated(x, y)

Translates this matrix

class pikepdf.PdfImage(obj)

Support class to provide a consistent API for manipulating PDF images

The data structure for images inside PDFs is irregular and flexible, making it difficult to work with without introducing errors for less typical cases. This class addresses these difficulties by providing a regular, Pythonic API similar in spirit to (and convertible to) the Python Pillow imaging library.

as_pil_image()

Extract the image as a Pillow Image, using decompression as necessary

Returns:PIL.Image.Image
extract_to(*, stream)

Attempt to extract the image directly to a usable image file

If possible, the compressed data is extracted and inserted into a compressed image file format without transcoding the compressed content. If this is not possible, the data will be decompressed and extracted to an appropriate format.

Because it is not known until attempted what image format will be extracted, users should not assume what format they are getting back. When saving the image to a file, use a temporary filename, and then rename the file to its final name based on the returned file extension.

Parameters:stream – Writable stream to write data to
Returns:The file format extension
Return type:str
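The temporary-file-then-rename advice above can be sketched as follows. The helper save_with_detected_extension is hypothetical, and extract stands in for any callable with the same shape as extract_to (keyword argument stream, returns the extension). Whether the returned extension includes a leading dot is an assumption, so the sketch accepts both forms.

```python
import os
import tempfile


def save_with_detected_extension(extract, directory, stem):
    """Hypothetical helper: write to a temp file, then rename by extension."""
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as stream:
            extension = extract(stream=stream)
        # Assumption: tolerate extensions with or without a leading dot.
        if not extension.startswith("."):
            extension = "." + extension
        final_path = os.path.join(directory, stem + extension)
        os.replace(tmp_path, final_path)
    except BaseException:
        # Do not leave a stray temporary file behind on failure.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return final_path


def fake_extract(*, stream):
    """Stand-in for PdfImage.extract_to: writes a PNG signature only."""
    stream.write(b"\x89PNG\r\n\x1a\n")
    return ".png"


with tempfile.TemporaryDirectory() as workdir:
    path = save_with_detected_extension(fake_extract, workdir, "page1-image")
    print(os.path.basename(path))
```

In real use, a bound method such as image.extract_to would be passed in place of fake_extract, assuming image is a PdfImage.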
get_stream_buffer()

Access this image with the buffer protocol

is_inline

False for image XObject

read_bytes()

Decompress this image and return it as unencoded bytes

show()

Show the image however PIL wants to

class pikepdf.PdfInlineImage(*, image_data, image_object: tuple)

Support class for PDF inline images