Commit c38223c

Merge branch 'main' into markdown-export
2 parents 9a16da9 + d990d7c commit c38223c

19 files changed, 371 additions and 107 deletions

.github/ISSUE_TEMPLATE/bug_report.yml

Lines changed: 2 additions & 7 deletions
@@ -46,14 +46,9 @@ body:
 attributes:
 label: PyMuPDF version
 options:
+- 1.26.1
 - 1.26.0
-- 1.25.5
-- 1.25.4
-- 1.25.3
-- 1.25.2
-- 1.25.1
-- 1.25.0
-- 1.24.x or earlier
+- 1.25.x or earlier
 - Built from source
 description: |
 * For example from `pymupdf.VersionBind`.

.github/workflows/test.yml

Lines changed: 3 additions & 1 deletion
@@ -58,5 +58,7 @@ jobs:
 #
 - uses: actions/upload-artifact@v4
 with:
-path: ./wheelhouse/pymupdf*.whl
+path: |
+wheelhouse/pymupdf*.whl
+wheelhouse/pymupdf*.tar.gz
 name: artifact-${{ matrix.os }}

changes.txt

Lines changed: 21 additions & 0 deletions
@@ -2,6 +2,27 @@ Change Log
 ==========


+**Changes in version 1.26.1**
+
+* Use MuPDF-1.26.2.
+
+* Fixed issues:
+
+* **Fixed** `4520 <https://github.com/pymupdf/PyMuPDF/issues/4520>`_: show_pdf_page does not like empty pages created by new_page
+* **Fixed** `4524 <https://github.com/pymupdf/PyMuPDF/issues/4524>`_: fitz.get_text ignores 'pages' kwarg
+* **Fixed** `4412 <https://github.com/pymupdf/PyMuPDF/issues/4412>`_: Regression? Spurious error? in insert_pdf in v1.25.4
+
+* Other:
+
+* Partial fix for `4503 <https://github.com/pymupdf/PyMuPDF/issues/4503>`_: Undetected character styles
+* New method `Document.rewrite_images()`, useful for reducing file size, changing image formats, or converting color spaces.
+* `Page.get_text()`: restrict positional args to match docs.
+* Removed bogus definition of class `Shape`.
+* Removed release date from module, docs and changelog.
+* `pymupdf.pymupdf_date` and `pymupdf.VersionDate` are now both None.
+* They will be removed in a future release.
+
+
 **Changes in version 1.26.0 (2025-05-22)**

 * Use MuPDF-1.26.1.

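The changelog entry above says that `pymupdf.pymupdf_date` and `pymupdf.VersionDate` are now ``None`` and will be removed later. Code that still reads them may want to tolerate both states. The following is only a sketch under that assumption: `build_date` is a hypothetical helper, not part of PyMuPDF, and the namespaces are stand-ins for the real module so the example runs without it. The date string is illustrative.

```python
from types import SimpleNamespace


def build_date(module, names=("pymupdf_date", "VersionDate")):
    """Return the first non-empty build date found on `module`, else None.

    Hypothetical helper: it works whether the attribute holds a string,
    is set to None (as in 1.26.1), or no longer exists at all.
    """
    for name in names:
        value = getattr(module, name, None)  # a removed attribute yields None
        if value:
            return value
    return None


# Stand-ins for the real module, used only for illustration.
old_style = SimpleNamespace(pymupdf_date="2025-05-22 00:00:01",
                            VersionDate="2025-05-22 00:00:01")
new_style = SimpleNamespace(pymupdf_date=None, VersionDate=None)  # 1.26.1 state
future = SimpleNamespace()  # both attributes removed

for candidate in (old_style, new_style, future):
    print(build_date(candidate))
```

With the real package one would presumably pass the imported `pymupdf` module itself as `module`.
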
docs/page.rst

Lines changed: 98 additions & 45 deletions
@@ -2329,14 +2329,36 @@ This is an overview of homologous methods on the :ref:`Document` and on the :ref
 ====================================== =====================================
 **Document Level** **Page Level**
 ====================================== =====================================
-*Document.get_page_fonts(pno)* :meth:`Page.get_fonts`
-*Document.get_page_images(pno)* :meth:`Page.get_images`
-*Document.get_page_pixmap(pno, ...)* :meth:`Page.get_pixmap`
-*Document.get_page_text(pno, ...)* :meth:`Page.get_text`
-*Document.search_page_for(pno, ...)* :meth:`Page.search_for`
+:meth:`Document.get_page_fonts` :meth:`Page.get_fonts`
+:meth:`Document.get_page_images` :meth:`Page.get_images`
+:meth:`Document.get_page_pixmap` :meth:`Page.get_pixmap`
+:meth:`Document.get_page_text` :meth:`Page.get_text`
+:meth:`Document.search_page_for` :meth:`Page.search_for`
 ====================================== =====================================

-The page number "pno" is a 0-based integer `-∞ < pno < page_count`.
+.. note::
+
+Most document methods (left column) exist for convenience reasons, and are just wrappers for: *Document[pno].<page method>*. So they **load and discard the page** on each execution.
+
+However, the first two methods work differently. They only need a page's object definition statement - the page itself will **not** be loaded. So e.g. :meth:`Page.get_fonts` is a wrapper the other way round and defined as follows: `page.get_fonts` == `page.parent.get_page_fonts(page.number)`.
+
+
+When calling the :ref:`Document` equivalent methods then the page number is sent through as a parameter, e.g.:
+
+`Document.get_page_images(pno)` or `Document.get_page_text(pno)`
+
+.. tip::
+
+The page number parameter, ``pno``, is a 0-based integer `-∞ < pno < page_count`.
+
+
+
+
+
+Tables and Related Classes
+------------------------------------
+
+The `TableFinder` class is returned by :meth:`Page.find_tables` and has related classes as follows:

 .. note::

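The range `-∞ < pno < page_count` in the hunk above admits negative page numbers. A minimal sketch of how such a value could be mapped to a 0-based index follows. It assumes that negative numbers count backwards from the end of the document and wrap around. `normalize_pno` is a hypothetical helper written for illustration, not a PyMuPDF function.

```python
def normalize_pno(pno: int, page_count: int) -> int:
    """Map a page number in the range -inf < pno < page_count to 0..page_count-1.

    Assumption: negative values count backwards from the last page and wrap,
    so -1 is the last page and -page_count is the first page again.
    """
    if page_count <= 0:
        raise ValueError("document has no pages")
    if pno >= page_count:
        raise IndexError("page not in document")
    # Python's modulo returns a non-negative result for a positive divisor.
    return pno % page_count


print(normalize_pno(0, 10))    # first page
print(normalize_pno(-1, 10))   # last page
print(normalize_pno(-11, 10))  # wraps around to the last page again
```
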
@@ -2351,7 +2373,7 @@ The page number "pno" is a 0-based integer `-∞ < pno < page_count`.

 .. attribute:: tables

-A list of `Table` objects, each of which represents a table found on the page. Empty list if no table found.
+A list of :class:`Table` objects, each of which represents a table found on the page. An empty list if no tables are found.

 .. attribute:: page

@@ -2361,93 +2383,124 @@ The page number "pno" is a 0-based integer `-∞ < pno < page_count`.

 A list of tuples `(x0, y0, x1, y1)` representing the bounding boxes of all table cells (in any tables) found on the page. Note that cells may also be ``None`` objects, which are created to enforce a complete rows x columns structure for the affected table.

+:type: :ref:`Page`
+

 .. class:: Table

-An object representing a table found on the page. Attributes of interest:
+An object representing a table found on the page.

-.. attribute:: bbox

-The bounding box of the table given as a tuple `(x0, y0, x1, y1)`. This is the rectangle that contains all cells of the table.
+.. attribute:: page
+
+A back-reference to the owning page.
+
+:type: :ref:`Page`

 .. attribute:: cells

-A list of tuples `(x0, y0, x1, y1)` representing the bounding boxes of the cells in the table. Note that cells may also be ``None`` objects, which will happen to prevent gaps in a rows x columns structure.
+An array of `Rect` objects for each cell in the table.

-.. attribute:: rows
+:type: list

-A list of :ref:`TableRow` objects, each of which represents a row in the table. The order of rows is the same as in the original table. If the table has no rows, this will be an empty list.

-.. attribute:: col_count
+.. attribute:: header
+
+A `TableHeader` object.
+
+:type: `TableHeader`
+
+
+.. attribute:: bbox
+
+The bounding box of all cells of the table header.
+
+
+:type: :ref:`Rect`
+

-The number of columns in the table (integer).

 .. attribute:: row_count

-The number of rows in the table (integer).
+Number of rows in the table.

-.. method:: extract
+:type: int

-Returns a (row-major) list of lists representing the plain text of the table cells. Each sublist contains the text of one row, and each item in that sublist is the text of one cell in that row. So, `Table.extract()[i][j]` will return the text of the cell in row ``i`` and column ``j``. If a cell is empty, the corresponding item will be an empty string. If the corresponding boundary box is ``None``, the item will also be ``None``.

-.. method:: to_markdown(clean=False, fill_empty=True)
+.. attribute:: col_count

-Returns a string in `GitHub Markdown format <https://docs.github.com/en/get-started/writing-on-github/working-with-advanced-formatting/organizing-information-with-tables>`_ representing the table. The string will contain a header line with column names, followed by a separator line, and then the rows of the table. The text of each cell will be enclosed in pipe characters `|`, and each row will be separated by a newline character `\n`. **Line breaks inside a cell** are being replaced by the HTML `<br>` tag. Bold, italic, mono-spaced and strikethrough text will be styled according to the corresponding Markdown syntax.
+Number of columns in the table.

-- Bold text will be enclosed in double asterisks ``"**"``.
+:type: int

-- Italic text will be enclosed in single underscore ``"_"``.

-- Mono-spaced text will be enclosed in backticks ``"`"``.
+.. attribute:: rows

-- Strikethrough text will be enclosed in double tildes ``"~~"``.
-
-:arg bool clean: if ``True``, any hyphen "-" in the text is replaced by a ``"&#45;"`` character.
-
-:arg bool fill_empty: if ``True``, empty cells will be filled with a copy of neighboring cells in an effort to indicate potential column and row spans.
+An array of `TableRow` objects for each row in the table.

-* For each row and starting with index 1, the cell content will be replaced with the content of its left neighbor if it is ``None``.
+:type: list

-* For each column and starting with index 1, the cell content will be replaced with the content of its upper neighbor if it is ``None``.

+.. method:: extract()

-.. method:: to_pandas()
+Extracts table cell text data into a list.

-Returns a `pandas.DataFrame <https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html>`_ representing the table. The DataFrame offers a plethora of functions, among them conversion to 20+ file formats (CSV, markdown, JSON, Excel, HD5 etc.). Where necessary, the table can be refined in multiple ways (e.g. deleting empty rows or columns) and mutliple DataFrames can be joined.
+:type: list

+.. method:: to_markdown(clean=False, fill_empty=True)

-.. class:: TableHeader
+Extracts table data into Markdown text format.

-.. attribute:: names

-A list of strings representing the column names of the `Table`. This is usually the text content of the top row cells, but may instead be content identified above the detected table. The respective situation is encoded in the following attribute.
+:arg bool clean: If ``True`` then markdown syntax is removed from cell content.
+:arg bool fill_empty: If ``True`` then cell content `None` is replaced by the values above (columns) or left (rows) in an effort to approximate row and columns spans.

-.. attribute:: is_external

-Whether the header is part of the originally detected table (``False``) or was identified above the table (``True``). If ``True``, the header is not part of the table, but is used to identify the columns in the table. In this case, the header text will be used as column names in the extracted data.
+:type: string

-.. attribute:: bbox

-The bounding box of the header given as a tuple `(x0, y0, x1, y1)`. This is the rectangle that contains all cells of the header. If the header is not part of the table, this will be the rectangle that contains all cells of the header text, otherwise it is equal of the top row's boundary box.
+.. method:: to_pandas()

-.. attribute:: cells
+Return a `pandas DataFrame <https://pypi.org/project/pandas/>`_ `DataFrame <https://pandas.pydata.org/docs/reference/frame.html>`_ version of the table.

-A list of tuples of boundary boxes `(x0, y0, x1, y1)` of the cells in the header. Note that cells may also be ``None``, which will happen to prevent any gaps in a rows x columns structure. If the header is not part of the table, this will be the bounding boxes of the header text.
+:type: pandas DataFrame


-.. class:: TableRow
+.. class:: TableHeader

-An object defining a row in a `Table` found on the page. Attributes of interest:
+Dedicated class for table headers.

 .. attribute:: bbox

-The bounding box of the row given as a tuple `(x0, y0, x1, y1)`. This is the rectangle that contains all cells of the row.
+The bounding box of the union of cells belonging to the table header, given as a tuple (x0, y0, x1, y1). This rectangle contains all table header cells.
+
+:type: :ref:`Rect`

 .. attribute:: cells

-A list of tuples of boundary boxes `(x0, y0, x1, y1)` of the cells in this row. Note that cells may also be ``None`` objects, which will happen to prevent gaps in a rows x columns structure.
+A list of tuples for each bbox of a column header.
+
+:type: list
+
+.. attribute:: names
+
+A list of strings with column header text.
+
+:type: list
+
+.. attribute:: external
+
+A boolean indicating whether the header is outside the table cells.
+
+:type: `bool`
+
+
+.. class:: TableRow
+
+Dedicated class for table rows.


+----


 .. rubric:: Footnotes

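Both the removed and the added text for `to_markdown(..., fill_empty=True)` describe replacing ``None`` cells with the content of the left neighbor (within a row) or the upper neighbor (within a column). The sketch below illustrates that rule on a plain list of lists such as the one `extract()` is described as returning. It is not the library's implementation: `fill_empty_cells` is a hypothetical name, and running the row pass before the column pass is an assumption.

```python
def fill_empty_cells(rows):
    """Return a copy of `rows` with None cells filled from their neighbors.

    Assumptions: `rows` is a rectangular list of lists, the left-neighbor
    pass runs first, and the upper-neighbor pass runs second.
    """
    filled = [list(row) for row in rows]  # do not modify the caller's data

    # Row pass: starting with column index 1, copy the left neighbor.
    for row in filled:
        for j in range(1, len(row)):
            if row[j] is None:
                row[j] = row[j - 1]

    # Column pass: starting with row index 1, copy the upper neighbor.
    for i in range(1, len(filled)):
        for j in range(len(filled[i])):
            if filled[i][j] is None and j < len(filled[i - 1]):
                filled[i][j] = filled[i - 1][j]

    return filled


table = [
    ["A", None, "C"],
    [None, "x", None],
    ["B", None, None],
]
for row in fill_empty_cells(table):
    print(row)
```

A ``None`` in the first column of the first row has neither neighbor and stays ``None`` in this sketch.
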
docs/vars.rst

Lines changed: 3 additions & 5 deletions
@@ -77,10 +77,8 @@ Constants

 .. py:data:: pymupdf_date

-ISO timestamp *YYYY-MM-DD HH:MM:SS* when these bindings were built.
-
-:type: string
-
+Disabled (set to None) in 1.26.1.
+
 .. py:data:: version

 (pymupdf_version, mupdf_version, timestamp) -- combined version information where `timestamp` is the generation point in time formatted as "YYYYMMDDhhmmss".
@@ -97,7 +95,7 @@ Constants

 .. py:data:: VersionDate

-Legacy equivalent to `mupdf_version`.
+Disabled (set to None) in 1.26.1.


 .. _PermissionCodes:

docs/version.rst

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 ----

-This documentation covers **PyMuPDF v1.26.0** features as of **2025-05-22 00:00:01**.
+This documentation covers **PyMuPDF v1.26.1**.

 The major and minor versions of |PyMuPDF| and |MuPDF| will always be the same. Only the third qualifier (patch level) may deviate from that of |MuPDF|.
