Commit 05edf45

JorjMcKie authored and jamie-lemon committed
Documentation improvements
Document the use of `TEXT_ACCURATE_BBOXES` and `TEXT_IGNORE_ACTUALTEXT` extraction options.
1 parent b0e0526 commit 05edf45

2 files changed

+46
-11
lines changed

docs/textpage.rst

Lines changed: 8 additions & 7 deletions
@@ -333,16 +333,17 @@ The following shows the original span rectangle in red and the rectangle with re
333333

334334
*"flags"* is an integer whose bits represent font properties, except for bit 0. They are to be interpreted like this:
335335

336-
* bit 0: superscripted (2\ :sup:`0`) -- not a font property, detected by MuPDF code.
337-
* bit 1: italic (2\ :sup:`1`)
338-
* bit 2: serifed (2\ :sup:`2`)
339-
* bit 3: monospaced (2\ :sup:`3`)
340-
* bit 4: bold (2\ :sup:`4`)
336+
* bit 0: superscripted (:data:`TEXT_FONT_SUPERSCRIPT`) -- not a font property, detected by MuPDF code.
337+
* bit 1: italic (:data:`TEXT_FONT_ITALIC`)
338+
* bit 2: serifed (:data:`TEXT_FONT_SERIFED`)
339+
* bit 3: monospaced (:data:`TEXT_FONT_MONOSPACED`)
340+
* bit 4: bold (:data:`TEXT_FONT_BOLD`)
341341

342342
Test these characteristics like so:
343343

344-
>>> if flags & 2**1: print("italic")
345-
>>> # etc.
344+
>>> if flags & pymupdf.TEXT_FONT_BOLD and flags & pymupdf.TEXT_FONT_ITALIC:
345+
...     print(f"{span['text']=} is bold and italic")
346+
346347

347348
Bits 1 through 4 are font properties, i.e. encoded in the font program. Please note that this information is not necessarily correct or complete: fonts quite often contain wrong data here.
348349
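The bit layout described above can be decoded with a small helper. The following is a minimal sketch, not part of the commit: the constants are redefined locally with the values listed for ``TEXT_FONT_*`` in docs/vars.rst so that the snippet runs without PyMuPDF, and the names ``describe_font_flags`` and ``has_all`` are hypothetical. Note that two different single-bit constants ANDed with each other always yield 0, so a combined test has to compare against an ORed mask.

```python
# Local copies of the values documented for pymupdf.TEXT_FONT_* (assumption:
# defined here only so the sketch is self-contained).
TEXT_FONT_SUPERSCRIPT = 1
TEXT_FONT_ITALIC = 2
TEXT_FONT_SERIFED = 4
TEXT_FONT_MONOSPACED = 8
TEXT_FONT_BOLD = 16

_FLAG_NAMES = (
    (TEXT_FONT_SUPERSCRIPT, "superscript"),
    (TEXT_FONT_ITALIC, "italic"),
    (TEXT_FONT_SERIFED, "serifed"),
    (TEXT_FONT_MONOSPACED, "monospaced"),
    (TEXT_FONT_BOLD, "bold"),
)


def describe_font_flags(flags):
    """Return the names of all property bits set in a span's flags (hypothetical helper)."""
    return [name for bit, name in _FLAG_NAMES if flags & bit]


def has_all(flags, mask):
    """True only if every bit of mask is set in flags (hypothetical helper)."""
    return (flags & mask) == mask


bold_italic = TEXT_FONT_BOLD | TEXT_FONT_ITALIC
print(describe_font_flags(bold_italic))
print(has_all(TEXT_FONT_BOLD, bold_italic))
```

With a real span dictionary, ``span["flags"]`` would take the place of the literal values used here.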

docs/vars.rst

Lines changed: 38 additions & 4 deletions
@@ -3,7 +3,7 @@
33
===============================
44
Constants and Enumerations
55
===============================
6-
Constants and enumerations of :title:`MuPDF` as implemented by |PyMuPDF|. Each of the following variables is accessible as *pymupdf.variable*.
6+
Constants and enumerations of :title:`MuPDF` as implemented by |PyMuPDF|. Each of the following values is accessible as `pymupdf.value`.
77

88

99
Constants
@@ -187,9 +187,35 @@ Text Alignment
187187

188188
.. _TextPreserve:
189189

190+
.. _FontProperties:
191+
192+
Font Properties
193+
-----------------------
194+
Please note that the following bits are derived from what a font has to say about its properties. This information may not be (and quite often is not) correct.
195+
196+
.. py:data:: TEXT_FONT_SUPERSCRIPT
197+
198+
1 -- the character or span is a superscript. This property is computed by MuPDF and not part of any font information.
199+
200+
.. py:data:: TEXT_FONT_ITALIC
201+
202+
2 -- the font is italic.
203+
204+
.. py:data:: TEXT_FONT_SERIFED
205+
206+
4 -- the font is serifed.
207+
208+
.. py:data:: TEXT_FONT_MONOSPACED
209+
210+
8 -- the font is mono-spaced.
211+
212+
.. py:data:: TEXT_FONT_BOLD
213+
214+
16 -- the font is bold.
215+
190216
Text Extraction Flags
191217
---------------------
192-
Option bits controlling the amount of data, that are parsed into a :ref:`TextPage` -- this class is mainly used only internally in PyMuPDF.
218+
Option bits controlling the amount of data that is parsed into a :ref:`TextPage`.
193219

194220
For the PyMuPDF programmer, some combination (using Python's `|` operator, or simply use `+`) of these values is aggregated in the ``flags`` integer, a parameter of all text search and text extraction methods. Depending on the individual method, different default combinations of the values are used. Please use a value that meets your situation. In particular, make sure to switch off image extraction unless you really need the images. The impact on performance and memory is significant!
195221
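Combining and switching off option bits, as described above, might look like the following sketch. It makes several assumptions: the constants are redefined locally so the snippet runs without PyMuPDF, the value 4 for ``TEXT_PRESERVE_IMAGES`` is not part of this excerpt, and ``base`` is a made-up starting combination rather than any method's actual default.

```python
# Local copies of documented values (assumption: mirrored here for a
# self-contained example; in real code use pymupdf.TEXT_* instead).
TEXT_PRESERVE_IMAGES = 4  # assumption: value not shown in this excerpt
TEXT_MEDIABOX_CLIP = 64
TEXT_CID_FOR_UNKNOWN_UNICODE = 128
TEXT_ACCURATE_BBOXES = 512

# Hypothetical starting combination, not an actual PyMuPDF default.
base = TEXT_PRESERVE_IMAGES | TEXT_MEDIABOX_CLIP | TEXT_CID_FOR_UNKNOWN_UNICODE

# Switch off image extraction: clear the bit with AND-NOT.
flags = base & ~TEXT_PRESERVE_IMAGES

# Switch on accurate bounding boxes: set the bit with OR.
flags |= TEXT_ACCURATE_BBOXES

# OR is idempotent, whereas "+" changes the value if the bit is already set.
same = flags | TEXT_ACCURATE_BBOXES
different = flags + TEXT_ACCURATE_BBOXES

print(flags, same == flags, different == flags)
```

The resulting integer would then be passed as the ``flags`` argument of a text extraction or search method. Using ``|`` rather than ``+`` is the safer habit whenever a bit may already be present in the value.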

@@ -219,11 +245,19 @@ For the PyMuPDF programmer, some combination (using Python's `|` operator, or si
219245

220246
.. py:data:: TEXT_MEDIABOX_CLIP
221247
222-
64 -- If set, characters entirely outside a page's **mediabox** will be ignored. This is default in PyMuPDF.
248+
64 -- Characters entirely outside a page's **mediabox** or contained in other "clipped" areas will be ignored. This is the default in PyMuPDF.
223249

224250
.. py:data:: TEXT_CID_FOR_UNKNOWN_UNICODE
225251
226-
128 -- If set, use raw character codes instead of U+FFFD. This is the default for **text extraction** in PyMuPDF. If you **want to detect** when encoding information is missing or uncertain, toggle this flag and scan for the presence of U+FFFD (= `chr(0xfffd)`) code points in the resulting text.
252+
128 -- Use raw character codes instead of U+FFFD. This is the default for **text extraction** in PyMuPDF. If you **want to detect** when encoding information is missing or uncertain, toggle this flag and scan for the presence of U+FFFD (= `chr(0xfffd)`) code points in the resulting text.
253+
254+
.. py:data:: TEXT_ACCURATE_BBOXES
255+
256+
512 -- Ignore metric values of all fonts when computing character bounding boxes -- most prominently the `ascender <https://en.wikipedia.org/wiki/Ascender_(typography)>`_ and `descender <https://en.wikipedia.org/wiki/Descender>`_ values. Instead, follow the drawing commands of each character's glyph and compute its rectangular hull, i.e. the smallest rectangle wrapping all points used for drawing the glyph's visual appearance (see the :ref:`Shape` class for background). As a result, character heights vary individually. For instance, a (white) space will have a **bbox of height 0** (because nothing is drawn), in contrast to the non-zero bounding box generated when using font metrics. This option may help to obtain meaningful bounding boxes even for fonts containing errors. It slows down text extraction somewhat because of the additional computation.
257+
258+
.. py:data:: TEXT_IGNORE_ACTUALTEXT
259+
260+
2048 -- Ignore built-in differences between text appearing in e.g. PDF viewers versus text stored in the PDF. See :ref:`AdobeManual`, page 615 for background. If set, the **stored** ("replacement") text is ignored in favor of the displayed text.
227261

228262

229263
The following constants represent the default combinations of the above for text extraction and searching:
