Merged
35 changes: 31 additions & 4 deletions docs/textpage.rst
@@ -48,11 +48,21 @@ For a description of what this class is all about, see Appendix 2.

Textpage content as a list of text lines grouped by block. Each list item looks like this::

(x0, y0, x1, y1, "lines in the block", block_no, block_type)
``(x0, y0, x1, y1, "lines in the block", block_no, block_type)``

The first four entries are the block's bbox coordinates, *block_type* is 1 for an image block, 0 for text. *block_no* is the block sequence number. Multiple text lines are joined via line breaks.
The first four entries are the block's bbox coordinates. *block_type* is 1 for an image block, 3 for a vector block, and 0 for text. *block_no* is the block sequence number. Multiple text lines are joined via line breaks.

For an image block, its bbox and a text line with some image meta information is included -- **not the image content**.
For an **image block**, its bbox and a text line with some image meta information are included -- not the image **content**. Image blocks are included only if the extraction flag bit :data:`TEXT_PRESERVE_IMAGES` is set. An image block tuple will look like this::

``(x0, y0, x1, y1, "<image: colorspace-name, w: width, h: height, bpc: bits_per_component>\n", block_no, 1)``

For a **vector block**, the following item will be included. Vector blocks are included only if the extraction flag bit :data:`TEXT_COLLECT_VECTORS` is set. A vector block tuple will look like this::

``(x0, y0, x1, y1, "<vector stroked, color: #rrggbb, alpha: 255, is-rect: true, continues: false>\n", block_no, 3)``

The keyword "vector" is followed by either "stroked" or "filled". The color is given in HTML (hexadecimal RGB) format. Property ``is-rect`` is true if the vector is not a curve and is parallel to the x- or y-axis, so in essence it is either a real rectangle or an axis-parallel line segment. Property ``continues`` indicates whether the vector is part of a path (and not the first item).

.. note:: When no further details are needed (as provided by :meth:`Page.get_drawings`), then this is an **inexpensive** way to extract basic vector graphics information. Another major advantage is that all block types (text, images and vectors) are included in the output in the same order as they are present in the page's :data:`contents` stream.

This is a high-speed method with just enough information to output plain text in desired reading sequence.
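The vector description line can be turned back into structured values with a small parser. The following is a minimal sketch, assuming the text line is spelled exactly as in the example tuple above; the helper name ``parse_vector_block`` and the sample tuple are hypothetical and not part of PyMuPDF.

```python
import re

# Pattern for the vector description line. The exact spelling is an
# assumption taken from the example tuple in the documentation text.
_VECTOR_RE = re.compile(
    r"<vector (?P<kind>stroked|filled), "
    r"color: #(?P<color>[0-9a-fA-F]{6}), "
    r"alpha: (?P<alpha>\d+), "
    r"is-rect: (?P<isrect>true|false), "
    r"continues: (?P<continues>true|false)>"
)


def parse_vector_block(block):
    """Return a dict for a vector block tuple, or None for any other block."""
    x0, y0, x1, y1, text, block_no, block_type = block
    if block_type != 3:  # 3 = vector block, 1 = image, 0 = text
        return None
    m = _VECTOR_RE.match(text.strip())
    if m is None:  # unexpected layout of the description line
        return None
    return {
        "bbox": (x0, y0, x1, y1),
        "number": block_no,
        "stroked": m["kind"] == "stroked",
        "color": int(m["color"], 16),  # 0xRRGGBB as an integer
        "alpha": int(m["alpha"]),
        "isrect": m["isrect"] == "true",
        "continues": m["continues"] == "true",
    }


# Hypothetical sample tuple, written by hand for illustration only.
sample = (
    72.0, 100.0, 300.0, 100.0,
    "<vector stroked, color: #ff0000, alpha: 255, is-rect: true, continues: false>\n",
    5, 3,
)
info = parse_vector_block(sample)
```

In practice the tuples would presumably come from a blocks extraction with :data:`TEXT_COLLECT_VECTORS` set; text and image blocks simply yield ``None`` here and can be handled separately.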

@@ -200,7 +210,24 @@ blocks *list* of block dictionaries

Block Dictionaries
~~~~~~~~~~~~~~~~~~
Block dictionaries come in two different formats for **image blocks** and for **text blocks**.
Block dictionaries come in different formats for **vector blocks**, **image blocks** and **text blocks**. Vector blocks are included only if the extraction flag bit :data:`TEXT_COLLECT_VECTORS` is set. Image blocks are included only if the extraction flag bit :data:`TEXT_PRESERVE_IMAGES` is set.

**Vector block:**

=============== =========================================================================================================================
**Key**         **Value**
=============== =========================================================================================================================
type            3 = vector (``int``)
bbox            vector bbox on page (:data:`rect_like`)
number          block count (``int``)
stroked         either stroked (``True``) or filled (``False``) (``bool``)
isrect          whether the vector is axis-parallel (``bool``). Can be a line or a rectangle. Curves or diagonal lines are ``False``.
continues       whether the vector is (not the last) part of a sequence of vectors in a *path* (``bool``).
color           sRGB integer, e.g. 0xRRGGBB (``int``).
alpha           Transparency, a value in ``range(256)`` (``int``).
=============== =========================================================================================================================

This information is a proper subset of the output of :meth:`Page.get_drawings`. Its advantages are speed (because it is extracted as part of a single :ref:`TextPage` creation) and the fact that vector blocks are included in the overall page content sequence together with text and images.
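As an illustration of how vector block dictionaries might be consumed, here is a minimal sketch that picks out stroked, axis-parallel vectors (typical candidates for table borders or rules) and splits the ``color`` integer into RGB components. The helper names and the sample ``blocks`` list are hypothetical; only the dictionary keys are taken from the table above.

```python
def split_color(color):
    """Split an sRGB integer 0xRRGGBB into an (r, g, b) tuple of ints."""
    return (color >> 16) & 0xFF, (color >> 8) & 0xFF, color & 0xFF


def axis_parallel_strokes(blocks):
    """Yield (bbox, rgb) for every stroked, axis-parallel vector block."""
    for block in blocks:
        if block.get("type") != 3:  # skip text (0) and image (1) blocks
            continue
        if block["stroked"] and block["isrect"]:
            yield tuple(block["bbox"]), split_color(block["color"])


# Hypothetical block list, written by hand for illustration only.
blocks = [
    {"type": 0, "bbox": (72, 72, 300, 90), "number": 0},
    {"type": 3, "bbox": (72, 100, 300, 100), "number": 1, "stroked": True,
     "isrect": True, "continues": False, "color": 0x0000FF, "alpha": 255},
    {"type": 3, "bbox": (72, 110, 300, 200), "number": 2, "stroked": False,
     "isrect": True, "continues": False, "color": 0xFFFF00, "alpha": 255},
]
result = list(axis_parallel_strokes(blocks))
```

Because vector blocks appear in page content order, a filter like this keeps each vector's position relative to the surrounding text and image blocks.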

**Image block:**

15 changes: 9 additions & 6 deletions docs/vars.rst
@@ -251,26 +251,29 @@ For the PyMuPDF programmer, some combination (using Python's `|` operator, or si

.. py:data:: TEXT_COLLECT_STRUCTURE

256 -- Not supported.
256 -- Extract or generate the :ref:`Document` structure. Detailed documentation is pending.

.. py:data:: TEXT_ACCURATE_BBOXES

512 -- Ignore metric values of all fonts when computing character boundary boxes -- most prominently the `ascender <https://en.wikipedia.org/wiki/Ascender_(typography)>`_ and `descender <https://en.wikipedia.org/wiki/Descender>`_ values. Instead, follow the drawing commands of each character's glyph and compute its rectangle hull. This is the smallest rectangle wrapping all points used for drawing the visual appearance - see the :ref:`Shape` class for understanding the background. This will especially result in individual character heights. For instance a (white) space will have a **bbox of height 0** (because nothing is drawn) -- in contrast to the non-zero boundary box generated when using font metrics. This option may be useful to cope with getting meaningful boundary boxes even for fonts containing errors. Its use will slow down text extraction somewhat because of the incurred computational effort.
512 -- Ignore metric values of all fonts when computing character bounding boxes -- most prominently the `ascender <https://en.wikipedia.org/wiki/Ascender_(typography)>`_ and `descender <https://en.wikipedia.org/wiki/Descender>`_ values. Instead, follow the drawing commands of each character's glyph and compute their rectangle hull as the bbox. This is the smallest rectangle wrapping all points used for drawing the visual appearance - see the :ref:`Shape` class for understanding the background. In particular, this results in individual character heights. For instance, a (white) space will have a **bbox of zero height** (because nothing is drawn) -- in contrast to the non-zero bounding box generated when using font metrics. This option may help to obtain meaningful bounding boxes even for fonts containing errors. Its use will slow down text extraction somewhat because of the additional computational effort.

Note that this has no effect by default - one must also disable the global
quad corrections setting with `pymupdf.TOOLS.unset_quad_corrections(True)`.
Note that this has no effect by default - one must also disable the global quad corrections setting with `pymupdf.TOOLS.unset_quad_corrections(True)`.

.. py:data:: TEXT_COLLECT_VECTORS

1024 -- Not supported.
1024 -- Collect vector drawings into the :ref:`TextPage`. These are stored as blocks alongside text and image blocks, depending on other extraction flags. See :meth:`TextPage.extractBLOCKS` and :meth:`TextPage.extractDICT` for details. Beyond these two methods, vector graphics extraction is also available for :meth:`TextPage.extractJSON`, :meth:`TextPage.extractRAWDICT`, :meth:`TextPage.extractRAWJSON` and :meth:`TextPage.extractXML`.

.. py:data:: TEXT_IGNORE_ACTUALTEXT

2048 -- Ignore built-in differences between text appearing in e.g. PDF viewers versus text stored in the PDF. See :ref:`AdobeManual`, page 615 for background. If set, the **stored** ("replacement") text is ignored in favor of the displayed text.

.. py:data:: TEXT_SEGMENT

4096 -- Attempt to segment page into different regions.
4096 -- Attempt to segment the page into different regions. Detailed documentation is pending.

.. py:data:: TEXT_COLLECT_STYLES

32768 -- Request collecting text **decoration** properties. This includes text underlining and strikeout. Contrary to common belief, these are not font properties, but are drawn separately as vector graphics or annotations on top of the text. In addition, the flag bit will also cause MuPDF to detect "fake bold" text. In many cases, document creators **simulate bold** text by printing the same text multiple times with slight offsets. If this flag is set, such text will be marked as bold in the resulting text spans.
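Since each constant is a single bit, flags are combined with ``|`` and tested with ``&``. The sketch below shows this with the numeric values listed above; the constants are redefined locally so the example runs without PyMuPDF, and the helper ``is_set`` is hypothetical.

```python
# Values copied from the flag descriptions above; in real code these
# names would be taken from the pymupdf module instead.
TEXT_COLLECT_STRUCTURE = 256
TEXT_ACCURATE_BBOXES = 512
TEXT_COLLECT_VECTORS = 1024
TEXT_IGNORE_ACTUALTEXT = 2048
TEXT_SEGMENT = 4096
TEXT_COLLECT_STYLES = 32768


def is_set(flags, bit):
    """Return True if the given flag bit is switched on in flags."""
    return bool(flags & bit)


# Switch on vector collection and text decoration detection.
flags = TEXT_COLLECT_VECTORS | TEXT_COLLECT_STYLES

# Temporarily add accurate bboxes, then switch that bit off again.
flags |= TEXT_ACCURATE_BBOXES
with_accurate = flags
flags &= ~TEXT_ACCURATE_BBOXES
```

The resulting integer would then be passed as the ``flags`` argument of a text extraction call; which other bits to include depends on the desired defaults for the chosen extraction method.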

The following constants represent the default combinations of the above for text extraction and searching:
