TextPage
This class represents text and images shown on a document page. All MuPDF document types are supported.
The usual ways to create a textpage are DisplayList.getTextPage() and Page.getTextPage(). Because this class has only a limited set of methods, the Page class offers wrappers that create an intermediate textpage and then invoke one of the methods below. The last column of the following table shows the corresponding Page methods.
For a description of what this class is all about, see Appendix 2.
Method | Description | Page getText or search method |
---|---|---|
extractText() | extract plain text | “text” |
extractTEXT() | synonym of previous | “text” |
extractBLOCKS() | plain text grouped in blocks | “blocks” |
extractWORDS() | all words with their bbox | “words” |
extractHTML() | page content in HTML format | “html” |
extractJSON() | page content in JSON format | “json” |
extractXHTML() | page content in XHTML format | “xhtml” |
extractXML() | page text in XML format | “xml” |
extractDICT() | page content in dict format | “dict” |
extractRAWDICT() | page content in dict format | “rawdict” |
search() | search for a string in the page | searchFor() |
Class API
class TextPage

extractText()

extractTEXT()
Return a string of the page’s complete text. The text is UTF-8 unicode and in the same sequence as specified at the time of document creation.
- Return type
str
extractBLOCKS()
Textpage content as a list of text lines grouped by block. Each list item looks like this:
(x0, y0, x1, y1, "lines in blocks", block_type, block_no)
The first four entries are the block’s bbox coordinates. block_type is 1 for an image block and 0 for a text block. block_no is the block sequence number.
For an image block, the item contains its bbox and a text line with image meta information, not the image data itself.
This is a high-speed method with enough information to rebuild a desired text sequence.
- Return type
list
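As a rough illustration of rebuilding a text sequence from such items, the sketch below sorts text blocks by their top-left corner and joins their text. The sample data and the helper name text_in_reading_order are hypothetical, and the tuple layout is assumed to be exactly the one described above. Sorting by (y0, x0) is only one possible ordering and will not suit multi-column layouts.

```python
# Hypothetical sample shaped like the items described above:
# (x0, y0, x1, y1, "lines in block", block_type, block_no)
sample_blocks = [
    (50.0, 300.0, 400.0, 330.0, "Second paragraph.\n", 0, 2),
    (50.0, 100.0, 400.0, 130.0, "Title line\n", 0, 0),
    (50.0, 150.0, 250.0, 280.0, "<image: hypothetical meta text>", 1, 1),
]


def text_in_reading_order(blocks):
    """Join the text of all text blocks, sorted by top edge, then left edge."""
    text_blocks = [b for b in blocks if b[5] == 0]  # block_type 0 = text
    text_blocks.sort(key=lambda b: (b[1], b[0]))    # sort by (y0, x0)
    return "".join(b[4] for b in text_blocks)


print(text_in_reading_order(sample_blocks))
```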
extractWORDS()
Textpage content as a list of single words with bbox information. An item of this list looks like this:
(x0, y0, x1, y1, "word", block_no, line_no, word_no)
Everything delimited by spaces is treated as a “word” by this method.
This is a high-speed method which, for example, allows extracting text from within a given rectangle.
- Return type
list
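A minimal sketch of the rectangle use case follows. It works on plain tuples shaped like the items above and does not call the library. The sample data, the helper name words_in_rect, and the rule that a word counts only if its bbox lies fully inside the rectangle are all assumptions made for illustration.

```python
# Hypothetical sample shaped like the items described above:
# (x0, y0, x1, y1, "word", block_no, line_no, word_no)
sample_words = [
    (114.0, 100.0, 160.0, 112.0, "world", 0, 0, 1),
    (72.0, 100.0, 110.0, 112.0, "Hello", 0, 0, 0),
    (72.0, 400.0, 120.0, 412.0, "footer", 1, 0, 0),
]


def words_in_rect(words, rect):
    """Return the words whose bbox lies fully inside rect, joined by spaces."""
    rx0, ry0, rx1, ry1 = rect
    inside = [
        w for w in words
        if w[0] >= rx0 and w[1] >= ry0 and w[2] <= rx1 and w[3] <= ry1
    ]
    # Restore the original sequence: block, then line, then word number.
    inside.sort(key=lambda w: (w[5], w[6], w[7]))
    return " ".join(w[4] for w in inside)


print(words_in_rect(sample_words, (50.0, 90.0, 200.0, 120.0)))
```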
extractHTML()
Textpage content in HTML format. This version contains complete formatting and positioning information. Images are included (encoded as base64 strings). You need an HTML package to interpret the output in Python. Your internet browser should be able to adequately display this information, but see Controlling Quality of HTML Output.
- Return type
str
extractDICT()
Textpage content as a Python dictionary. Provides the same level of detail as HTML. See below for the structure.
- Return type
dict
extractJSON()
Textpage content in JSON format. Created by json.dumps(TextPage.extractDICT()). It is included for backward compatibility. You will probably only ever use this method for writing the result to a file. The method detects binary image data, like bytearray and bytes (Python 3 only), and converts them to base64-encoded strings on JSON output.
- Return type
str
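The binary-to-base64 conversion can be imitated with the standard json module, as sketched below. This is not the library’s own implementation, only an illustration of the idea. The dictionary sample_page and the helper encode_binary are hypothetical.

```python
import base64
import json


def encode_binary(obj):
    """Fallback for json.dumps: turn bytes-like values into base64 text."""
    if isinstance(obj, (bytes, bytearray)):
        return base64.b64encode(bytes(obj)).decode("ascii")
    raise TypeError("cannot serialize object of type " + type(obj).__name__)


# Hypothetical dictionary with one image block holding binary data.
sample_page = {
    "width": 595.0,
    "height": 842.0,
    "blocks": [
        {"type": 1, "bbox": (0.0, 0.0, 10.0, 10.0), "image": b"\x89PNG"},
    ],
}

json_text = json.dumps(sample_page, default=encode_binary)
print(json_text)
```

Decoding the string with base64.b64decode() recovers the original bytes.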
extractXHTML()
Textpage content in XHTML format. Text information detail is comparable with extractTEXT(), but the output also contains images (base64 encoded). This method makes no attempt to re-create the original visual appearance.
- Return type
str
extractXML()
Textpage content in XML format. This contains complete formatting information about every single character on the page: font, size, line, paragraph, location, color, etc. Contains no images. You probably need an XML package to interpret the output in Python.
- Return type
str
extractRAWDICT()
Textpage content as a Python dictionary. It is technically similar to extractDICT() and contains that information as a subset (including any images). It provides additional detail down to each character, which makes using XML obsolete in many cases. See below for the structure.
- Return type
dict
search(string, hit_max=16, quads=False)
Search for string and return a list of found locations.
- Parameters
string (str) – the string to search for. The search is case-insensitive.
hit_max (int) – maximum number of returned hits (default 16).
quads (bool) – return quadrilaterals instead of rectangles.
- Return type
list
- Returns
a list of Rect or Quad objects, each surrounding a found occurrence of the string. The search string may contain spaces, so its parts may be located on different lines. In this case, more than one rectangle (or quadrilateral) is returned. The method does not support hyphenation, so it will not find “meth-od” when searching for “method”.
Example: If the search for the string “pymupdf” yields a hit like the one shown, then the corresponding entry will be either the blue rectangle or, if quads was specified, Quad(ul, ur, ll, lr).
Dictionary Structure of extractDICT() and extractRAWDICT()

Page Dictionary
Key | Value |
---|---|
width | page width in pixels (float) |
height | page height in pixels (float) |
blocks | list of block dictionaries |
Block Dictionaries
Blocks come in two different formats: image blocks and text blocks.
Image block:
Key | Value |
---|---|
type | 1 = image (int) |
bbox | block / image rectangle, formatted as tuple(fitz.Rect) |
ext | image type (str), as file extension, see below |
width | original image width (int) |
height | original image height (int) |
colorspace | colorspace.n (int) |
xres | resolution in x-direction (int) |
yres | resolution in y-direction (int) |
bpc | bits per component (int) |
image | image content (bytes or bytearray) |
Possible values of key “ext” are “bmp”, “gif”, “jpeg”, “jpx” (JPEG 2000), “jxr” (JPEG XR), “png”, “pnm”, and “tiff”.
Note
In some error situations, all of the above values may be zero or empty. So, please be prepared to digest items like:
{"type": 1, "bbox": (0.0, 0.0, 0.0, 0.0), ..., "image": b""}
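One defensive way to handle such items is to skip image blocks whose content is empty or whose dimensions are zero. The sketch below operates on plain dictionaries shaped like the image block table above. The sample list and the helper name usable_image_blocks are hypothetical.

```python
# Hypothetical block list: one usable image block, one "empty" error-case
# block, and one text block.
sample_blocks = [
    {"type": 1, "bbox": (10.0, 10.0, 110.0, 60.0), "ext": "png",
     "width": 100, "height": 50, "image": b"\x89PNG..."},
    {"type": 1, "bbox": (0.0, 0.0, 0.0, 0.0), "ext": "",
     "width": 0, "height": 0, "image": b""},
    {"type": 0, "bbox": (10.0, 70.0, 200.0, 90.0), "lines": []},
]


def usable_image_blocks(blocks):
    """Keep only image blocks that carry data and non-zero dimensions."""
    return [
        b for b in blocks
        if b.get("type") == 1
        and b.get("image")
        and b.get("width", 0) > 0
        and b.get("height", 0) > 0
    ]


for block in usable_image_blocks(sample_blocks):
    print(block["ext"], block["width"], block["height"])
```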
TextPage and the corresponding method Page.getText() are available for all document types. Only for PDF documents, the methods Document.getPageImageList() / Page.getImageList() offer some overlapping functionality as far as image lists are concerned. But the two lists may or may not contain the same items. Any differences are most probably caused by one of the following:
- “Inline” images (see page 352 of the Adobe PDF References) of a PDF page are contained in a textpage, but not in Page.getImageList().
- Image blocks in a textpage are generated for every image location, whether or not there are any duplicates. This is in contrast to Page.getImageList(), which will contain each image only once.
- Images mentioned in the page’s object definition will always appear in Page.getImageList() [1]. But it may happen that there is no “display” command in the page’s contents (erroneously or on purpose). In this case the image will not appear in the textpage.
Text block:
Key | Value |
---|---|
type | 0 = text (int) |
bbox | block rectangle, formatted as tuple(fitz.Rect) |
lines | list of text line dictionaries |
Line Dictionary
Key | Value |
---|---|
bbox | line rectangle, formatted as tuple(fitz.Rect) |
wmode | writing mode (int): 0 = horizontal, 1 = vertical |
dir | writing direction (list of floats): [x, y] |
spans | list of span dictionaries |
The value of key “dir” is a unit vector and should be interpreted as follows:
x: positive = “left-right”, negative = “right-left”, 0 = neither
y: positive = “top-bottom”, negative = “bottom-top”, 0 = neither
The values indicate the “relative writing speed” in each direction, such that x² + y² = 1. In other words, dir = [cos(beta), sin(beta)], where beta is the writing angle relative to the horizontal.
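Given dir = [cos(beta), sin(beta)], the angle beta can be recovered with atan2. The following sketch uses only the standard math module. The helper name writing_angle_degrees is hypothetical. Because positive y means “top-bottom”, a positive angle here turns towards the bottom of the page.

```python
import math


def writing_angle_degrees(direction):
    """Return the writing angle beta in degrees for dir = [cos(beta), sin(beta)]."""
    x, y = direction
    return math.degrees(math.atan2(y, x))


print(writing_angle_degrees([1.0, 0.0]))   # left-to-right text
print(writing_angle_degrees([0.0, 1.0]))   # top-to-bottom text
print(writing_angle_degrees([-1.0, 0.0]))  # right-to-left text
```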
Span Dictionary
Spans contain the actual text. A line contains more than one span only if it contains text with different font properties.
(Changed in version 1.14.17) Spans now also have a bbox key (again).
Key | Value |
---|---|
bbox | span rectangle, formatted as tuple(fitz.Rect) |
font | font name (str) |
size | font size (float) |
flags | font characteristics (int) |
color | text color in sRGB format (int) |
text | (only for extractDICT()) text (str) |
chars | (only for extractRAWDICT()) list of character dictionaries |
(New in version 1.16.0)
“color” is the text color encoded in sRGB format, e.g. 0xFF0000 for red.
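A small sketch of unpacking such an integer into its red, green and blue components, assuming the usual 0xRRGGBB layout implied by the 0xFF0000 example. The helper name srgb_int_to_rgb is hypothetical.

```python
def srgb_int_to_rgb(color):
    """Split an integer of the form 0xRRGGBB into an (r, g, b) tuple."""
    red = (color >> 16) & 0xFF
    green = (color >> 8) & 0xFF
    blue = color & 0xFF
    return (red, green, blue)


print(srgb_int_to_rgb(0xFF0000))
```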
“flags” is an integer, encoding bools of font properties:
bit 0: superscripted (2⁰)
bit 1: italic (2¹)
bit 2: serifed (2²)
bit 3: monospaced (2³)
bit 4: bold (2⁴)
Test these characteristics like so:
>>> if flags & 2**1: print("italic")
>>> # etc.
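The same test can be applied to all five bits at once. The sketch below maps the bit positions listed above to their names. The names FLAG_NAMES and decode_flags are hypothetical helpers, not part of the library.

```python
# Bit positions as listed above.
FLAG_NAMES = {
    0: "superscripted",
    1: "italic",
    2: "serifed",
    3: "monospaced",
    4: "bold",
}


def decode_flags(flags):
    """Return the names of all font properties whose bit is set in flags."""
    return [
        name
        for bit, name in sorted(FLAG_NAMES.items())
        if flags & (1 << bit)
    ]


print(decode_flags(2**1 + 2**4))
```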
Character Dictionary for extractRAWDICT()
We are currently providing the bbox in rect_like format. In a future version, we might change that to quad_like. This image shows the relationship between items in the following table:
Key | Value |
---|---|
origin | tuple coordinates of the character’s bottom left point |
bbox | character rectangle, formatted as tuple(fitz.Rect) |
c | the character (unicode) |
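To show how the nested dictionaries fit together, the sketch below walks a hand-made structure from page to block to line to span and collects the text of each span. It assumes that spans carry a “chars” list of character dictionaries in extractRAWDICT() output and a “text” string in extractDICT() output. The sample data and the helper name span_texts are hypothetical.

```python
# Hypothetical structure shaped like the dictionary tables in this section.
sample_rawdict = {
    "width": 595.0,
    "height": 842.0,
    "blocks": [
        {"type": 0, "bbox": (72.0, 100.0, 84.0, 112.0), "lines": [
            {"bbox": (72.0, 100.0, 84.0, 112.0), "wmode": 0,
             "dir": [1.0, 0.0], "spans": [
                 {"bbox": (72.0, 100.0, 84.0, 112.0), "font": "Helvetica",
                  "size": 11.0, "flags": 0, "color": 0, "chars": [
                      {"origin": (72.0, 110.0),
                       "bbox": (72.0, 100.0, 78.0, 112.0), "c": "H"},
                      {"origin": (78.0, 110.0),
                       "bbox": (78.0, 100.0, 84.0, 112.0), "c": "i"},
                  ]},
             ]},
        ]},
        {"type": 1, "bbox": (0.0, 0.0, 0.0, 0.0), "image": b""},
    ],
}


def span_texts(page_dict):
    """Collect the text of every span, from either 'chars' or 'text'."""
    texts = []
    for block in page_dict["blocks"]:
        if block["type"] != 0:  # skip image blocks
            continue
        for line in block["lines"]:
            for span in line["spans"]:
                if "chars" in span:  # character-level detail available
                    texts.append("".join(ch["c"] for ch in span["chars"]))
                else:                # span-level text only
                    texts.append(span["text"])
    return texts


print(span_texts(sample_rawdict))
```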
Footnotes
- 1
Image specifications for a PDF page are done in the page’s sub-dictionary /Resources. Because PDF is a text format specification, it does not prevent one from having arbitrary image entries in this dictionary, whether actually in use by the page or not. On top of this, resource dictionaries can be inherited from the page’s parent object, such as a node of the PDF’s pagetree or the catalog object. So the PDF creator may e.g. define one file-level /Resources naming all images and fonts ever used by any page. In this case, Page.getImageList() and Page.getFontList() will always return the same lists for all pages.