Commit f57cb6b

Remove dependency on MuPDF version
Add all MuPDF STEXT flags up to v1.26.0 to PyMuPDF. Use hard-coded values where a flag is unknown in an earlier MuPDF version that we still want or need to support. The intention is to switch to MuPDF's symbolic names as soon as we drop support for the corresponding version. Flag bits representing current MuPDF features can always be used because they are ignored by older MuPDF versions. Also remove some duplicate definitions.
1 parent 104051b commit f57cb6b
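
The commit hard-codes each flag value and keeps the symbolic MuPDF name in a trailing comment. As a point of comparison only, a fallback lookup could be sketched as below. This is not what the commit does: `stext_flag`, `old_mupdf`, and `new_mupdf` are hypothetical names, and the `SimpleNamespace` objects merely stand in for the MuPDF binding module as an older and a newer build might expose it.

```python
from types import SimpleNamespace

# Hypothetical stand-ins for the MuPDF binding module (not the real `mupdf` module).
old_mupdf = SimpleNamespace(FZ_STEXT_PRESERVE_SPANS=32)
new_mupdf = SimpleNamespace(
    FZ_STEXT_PRESERVE_SPANS=32,
    FZ_STEXT_COLLECT_STYLES=32768,
)


def stext_flag(module, name, fallback):
    """Return module.<name> if the binding defines it, else the hard-coded fallback."""
    return getattr(module, name, fallback)


# Both lookups yield the same bit value; only the source of the value differs.
styles_old = stext_flag(old_mupdf, "FZ_STEXT_COLLECT_STYLES", 32768)
styles_new = stext_flag(new_mupdf, "FZ_STEXT_COLLECT_STYLES", 32768)
print(styles_old, styles_new)
```

Hard-coding, as the commit does, removes the version check entirely, but the numeric values then have to be kept in sync with MuPDF by hand until the symbolic names can be used.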

2 files changed: 44 additions & 24 deletions

docs/vars.rst

Lines changed: 32 additions & 4 deletions
@@ -253,7 +253,7 @@ For the PyMuPDF programmer, some combination (using Python's `|` operator, or si
 
 .. py:data:: TEXT_COLLECT_STRUCTURE
 
-    256 -- Not supported.
+    256 -- Not supported yet.
 
 .. py:data:: TEXT_ACCURATE_BBOXES
 
@@ -264,17 +264,45 @@ For the PyMuPDF programmer, some combination (using Python's `|` operator, or si
 
 .. py:data:: TEXT_COLLECT_VECTORS
 
-    1024 -- Not supported.
+    1024 -- Not supported yet.
 
 .. py:data:: TEXT_IGNORE_ACTUALTEXT
 
-    2048 -- Ignore built-in differences between text appearing in e.g. PDF viewers versus text stored in the PDF. See :ref:`AdobeManual`, page 615 for background. If set, the **stored** ("replacement" text) is ignored in favor of the displayed text.
+    2048 -- Ignore built-in differences between text appearing in e.g. PDF viewers versus text stored in the PDF. See :ref:`AdobeManual`, page 615 for background. If set, the **stored** ("replacement" text) is ignored in favor of the **displayed** text.
 
 .. py:data:: TEXT_STEXT_SEGMENT
 
     4096 -- Attempt to segment page into different regions.
 
-The following constants represent the default combinations of the above for text extraction and searching:
+.. py:data:: TEXT_STEXT_PARAGRAPH_BREAK
+
+    8192 -- Not supported yet.
+
+.. py:data:: TEXT_STEXT_TABLE_HUNT
+
+    16384 -- Not supported yet.
+
+.. py:data:: TEXT_COLLECT_STYLES
+
+    32768 -- Detect underlined and strikeout text. Also detect and handle faked bold text in most cases.
+
+.. py:data:: TEXT_GID_FOR_UNKNOWN_UNICODE
+
+    65536 -- An alternative to `TEXT_CID_FOR_UNKNOWN_UNICODE` that uses the GID (glyph ID) instead of the CID (character ID). Both flags should never be used together, because results are undefined.
+
+.. py:data:: TEXT_CLIP_RECT
+
+    1 << 17 -- Not supported yet.
+
+.. py:data:: TEXT_ACCURATE_ASCENDERS
+
+    1 << 18 -- Not supported yet.
+
+.. py:data:: TEXT_ACCURATE_SIDE_BEARINGS
+
+    1 << 19 -- Not supported yet.
+
+The following constants represent default combinations of the above for text extraction and searching:
 
 .. py:data:: TEXTFLAGS_TEXT
 
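
The documented values are single bits, so they combine with Python's `|` operator, and the note that the CID and GID flags should never be used together can be checked before extraction. The sketch below is self-contained and does not import PyMuPDF. The constants are copied from the documentation diff, except `TEXT_CID_FOR_UNKNOWN_UNICODE = 128`, which is an assumption because its value does not appear in this diff. `check_flags` is a hypothetical helper, not part of PyMuPDF.

```python
# Values copied from the documentation diff in this commit.
TEXT_COLLECT_STYLES = 32768
TEXT_GID_FOR_UNKNOWN_UNICODE = 65536
TEXT_CLIP_RECT = 1 << 17
TEXT_ACCURATE_ASCENDERS = 1 << 18
TEXT_ACCURATE_SIDE_BEARINGS = 1 << 19

# Assumption: this value is not shown in the diff.
TEXT_CID_FOR_UNKNOWN_UNICODE = 128


def check_flags(flags: int) -> int:
    """Hypothetical helper: reject CID and GID set together, else return flags."""
    both = TEXT_CID_FOR_UNKNOWN_UNICODE | TEXT_GID_FOR_UNKNOWN_UNICODE
    if (flags & both) == both:
        raise ValueError("CID and GID flags must not be combined")
    return flags


# The decimal and shift notations used in the docs continue the same bit series.
assert TEXT_GID_FOR_UNKNOWN_UNICODE == 1 << 16
assert TEXT_CLIP_RECT == 131072

flags = check_flags(TEXT_COLLECT_STYLES | TEXT_GID_FOR_UNKNOWN_UNICODE)
print(flags)  # 98304
```

With PyMuPDF installed, a combined value like this would typically be passed as the `flags` argument of a text extraction call such as `page.get_text(...)`.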

src/__init__.py

Lines changed: 12 additions & 20 deletions
@@ -13516,18 +13516,18 @@ def width(self):
 TEXT_PRESERVE_SPANS = mupdf.FZ_STEXT_PRESERVE_SPANS
 TEXT_MEDIABOX_CLIP = mupdf.FZ_STEXT_MEDIABOX_CLIP
 TEXT_CID_FOR_UNKNOWN_UNICODE = mupdf.FZ_STEXT_USE_CID_FOR_UNKNOWN_UNICODE
-if mupdf_version_tuple >= (1, 25):
-    TEXT_COLLECT_STRUCTURE = mupdf.FZ_STEXT_COLLECT_STRUCTURE
-    TEXT_ACCURATE_BBOXES = mupdf.FZ_STEXT_ACCURATE_BBOXES
-    TEXT_COLLECT_VECTORS = mupdf.FZ_STEXT_COLLECT_VECTORS
-    TEXT_IGNORE_ACTUALTEXT = mupdf.FZ_STEXT_IGNORE_ACTUALTEXT
-    TEXT_STEXT_SEGMENT = mupdf.FZ_STEXT_SEGMENT
-else:
-    TEXT_COLLECT_STRUCTURE = 256
-    TEXT_ACCURATE_BBOXES = 512
-    TEXT_COLLECT_VECTORS = 1024
-    TEXT_IGNORE_ACTUALTEXT = 2048
-    TEXT_STEXT_SEGMENT = 4096
+TEXT_COLLECT_STRUCTURE = 256 # mupdf.FZ_STEXT_COLLECT_STRUCTURE
+TEXT_ACCURATE_BBOXES = 512 # mupdf.FZ_STEXT_ACCURATE_BBOXES
+TEXT_COLLECT_VECTORS = 1024 # mupdf.FZ_STEXT_COLLECT_VECTORS
+TEXT_IGNORE_ACTUALTEXT = 2048 # mupdf.FZ_STEXT_IGNORE_ACTUALTEXT
+TEXT_STEXT_SEGMENT = 4096 # mupdf.FZ_STEXT_SEGMENT
+TEXT_STEXT_PARAGRAPH_BREAK = 8192 # mupdf.FZ_STEXT_PARAGRAPH_BREAK
+TEXT_STEXT_TABLE_HUNT = 16384 # mupdf.FZ_STEXT_TABLE_HUNT
+TEXT_COLLECT_STYLES = 32768 # mupdf.FZ_STEXT_COLLECT_STYLES
+TEXT_GID_FOR_UNKNOWN_UNICODE = 65536 # mupdf.FZ_STEXT_USE_GID_FOR_UNKNOWN_UNICODE
+TEXT_CLIP_RECT = 1 << 17 # mupdf.FZ_STEXT_CLIP_RECT
+TEXT_ACCURATE_ASCENDERS = 1 << 18 # mupdf.FZ_STEXT_ACCURATE_ASCENDERS
+TEXT_ACCURATE_SIDE_BEARINGS = 1 << 19 # mupdf.FZ_STEXT_ACCURATE_SIDE_BEARINGS
 
 TEXTFLAGS_WORDS = (0
         | TEXT_PRESERVE_LIGATURES
@@ -13620,14 +13620,6 @@ def width(self):
 PDF_BM_Screen = "Screen"
 PDF_BM_SoftLight = "Softlight"
 
-# General text flags
-TEXT_FONT_SUPERSCRIPT = 1
-TEXT_FONT_ITALIC = 2
-TEXT_FONT_SERIFED = 4
-TEXT_FONT_MONOSPACED = 8
-TEXT_FONT_BOLD = 16
-
-
 annot_skel = {
     "goto1": lambda a, b, c, d, e: f"<</A<</S/GoTo/D[{a} 0 R/XYZ {_format_g((b, c, d))}]>>/Rect[{e}]/BS<</W 0>>/Subtype/Link>>",
     "goto2": lambda a, b: f"<</A<</S/GoTo/D{a}>>/Rect[{b}]/BS<</W 0>>/Subtype/Link>>",