Commit 88881d6

JorjMcKie authored and jamie-lemon committed
Table Improvements
Document improvements to the table module.
1 parent 96d2ebe commit 88881d6

1 file changed

+43
-3
lines changed

docs/page.rst

Lines changed: 43 additions & 3 deletions
@@ -451,7 +451,7 @@ In a nutshell, this is what you can do with PyMuPDF:
451451

452452
:arg bool final_filter: If `True` (default), the method will remove rectangles whose width or height is smaller than the respective tolerance value. If `False`, no such filtering is done.
453453

454-
.. method:: find_tables(clip=None, strategy=None, vertical_strategy=None, horizontal_strategy=None, vertical_lines=None, horizontal_lines=None, snap_tolerance=None, snap_x_tolerance=None, snap_y_tolerance=None, join_tolerance=None, join_x_tolerance=None, join_y_tolerance=None, edge_min_length=3, min_words_vertical=3, min_words_horizontal=1, intersection_tolerance=None, intersection_x_tolerance=None, intersection_y_tolerance=None, text_tolerance=None, text_x_tolerance=None, text_y_tolerance=None, add_lines=None)
454+
.. method:: find_tables(clip=None, strategy=None, vertical_strategy=None, horizontal_strategy=None, vertical_lines=None, horizontal_lines=None, snap_tolerance=None, snap_x_tolerance=None, snap_y_tolerance=None, join_tolerance=None, join_x_tolerance=None, join_y_tolerance=None, edge_min_length=3, min_words_vertical=3, min_words_horizontal=1, intersection_tolerance=None, intersection_x_tolerance=None, intersection_y_tolerance=None, text_tolerance=None, text_x_tolerance=None, text_y_tolerance=None, add_lines=None, add_boxes=None, paths=None)
455455

456456
Find tables on the page and return an object with related information. Typically, the default values of the many parameters will be sufficient. Adjustments should only ever be needed in corner cases.
457457

@@ -485,7 +485,11 @@ In a nutshell, this is what you can do with PyMuPDF:
485485

486486
:arg float text_tolerance: Characters will be combined into words only if their distance is no larger than this value (points). Default is 3. Instead of this value, separate values can be specified for the dimensions using `text_x_tolerance` and `text_y_tolerance`.
487487

488-
:arg tuple,list add_lines: Specify a list of "lines" (i.e. pairs of :data:`point_like` objects) as **additional**, "virtual" vector graphics. These lines may help with table and / or cell detection and will not otherwise influence the detection strategy. Especially, in contrast to parameters `horizontal_lines` and `vertical_lines`, they will not prevent detecting rows or columns in other ways. These lines will be treated exactly like "real" vector graphics in terms of joining, snapping, intersectiing, minimum length and containment in the `clip` rectangle. Similarly, lines not parallel to any of the coordinate axes will be ignored.
488+
:arg tuple,list add_lines: Specify a list of "lines" (i.e. pairs of :data:`point_like` objects) as **additional**, "virtual" vector graphics. These lines may help with table and / or cell detection and will not otherwise influence the detection strategy. Especially, in contrast to parameters `horizontal_lines` and `vertical_lines`, they will not prevent detecting rows or columns in other ways. These lines will be treated exactly like "real" vector graphics in terms of joining, snapping, intersecting, minimum length and containment in the `clip` rectangle. Similarly, lines not parallel to any of the coordinate axes will be ignored.
489+
490+
:arg tuple,list add_boxes: Specify a list of rectangles (:data:`rect_like` objects) as **additional**, "virtual" vector graphics. These rectangles may help with table and / or cell detection and will not otherwise influence the detection strategy. Especially, in contrast to parameters `horizontal_lines` and `vertical_lines`, they will not prevent detecting rows or columns in other ways. These rectangles will be treated exactly like "real" vector graphics in terms of joining, snapping, intersecting, minimum length and containment in the `clip` rectangle.
491+
492+
:arg list paths: A list of vector graphics in the format returned by :meth:`Page.get_drawings`. Using this parameter prevents the method from extracting vector graphics itself. This is useful if the vector graphics are already available and can save significant execution time.
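The parameters `add_lines`, `add_boxes` and `paths` can be combined in one call. The following is a minimal sketch, not taken from the PyMuPDF sources: the helper name `find_tables_with_hints` and its arguments are hypothetical, and `page` is assumed to be a :ref:`Page` object from a PyMuPDF version whose `find_tables` accepts these parameters.

```python
def find_tables_with_hints(page, lines=(), boxes=()):
    """Run table detection with optional "virtual" lines and rectangles.

    Hypothetical helper: `page` is assumed to be a PyMuPDF Page whose
    find_tables() accepts the add_lines, add_boxes and paths parameters.
    """
    # Extract the vector graphics once. Passing them as `paths` lets
    # find_tables() skip its own extraction.
    paths = page.get_drawings()
    return page.find_tables(
        add_lines=list(lines) or None,  # pairs of point_like objects
        add_boxes=list(boxes) or None,  # rect_like objects
        paths=paths,
    )
```

A call might look like `find_tables_with_hints(page, boxes=[(50, 100, 300, 200)])`, where the coordinates are made up for illustration. Empty hint lists are passed as `None`, which matches the documented defaults.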
489493

490494
.. image:: images/img-findtables.*
491495

@@ -500,7 +504,7 @@ In a nutshell, this is what you can do with PyMuPDF:
500504
* ``bbox``: the bounding box of the table as a tuple `(x0, y0, x1, y1)`.
501505
* ``cells``: bounding boxes of the table's cells (list of tuples). A cell may also be `None`.
502506
* ``extract()``: this method returns the text content of each table cell as a list of lists of strings.
503-
* ``to_markdown()``: this method returns the table as a **string in markdown format** (compatible to Github). Supporting viewers can render the string as a table. This output is optimized for **small token** sizes, which is especially beneficial for LLM/RAG feeds. Pandas DataFrames (see method `to_pandas()` below) offer an equivalent markdown table output which however is better readable for the human eye.
507+
* ``to_markdown()``: this method returns the table as a **string in markdown format** (compatible with GitHub). Markdown viewers can render the string as a table. This output is optimized for **small token** sizes, which is especially beneficial for LLM/RAG feeds. Pandas DataFrames (see method `to_pandas()` below) offer an equivalent markdown table output, which, however, is easier for humans to read. Any line breaks (`\n`) in cells are replaced by HTML line break tags (`<br>`).
504508
* `to_pandas()`: this method returns the table as a `pandas <https://pypi.org/project/pandas/>`_ `DataFrame <https://pandas.pydata.org/docs/reference/frame.html>`_. DataFrames are very versatile objects allowing a plethora of table manipulation methods and outputs to almost 20 well-known formats, among them Excel files, CSV, JSON, markdown-formatted tables and more. `DataFrame.to_markdown()` generates a Github-compatible markdown format optimized for human readability. This method however requires the package `tabulate <https://pypi.org/project/tabulate/>`_ to be installed in addition to pandas itself.
505509
* ``header``: a `TableHeader` object containing header information of the table.
506510
* ``col_count``: an integer containing the number of table columns.
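The line break handling described for ``to_markdown()`` can be illustrated with a small pure-Python sketch. This is not PyMuPDF's implementation: the function `rows_to_markdown` is hypothetical, it assumes its input has the list-of-lists shape described for ``extract()`` with the first row acting as the header, and it treats `None` cells as empty strings (also an assumption).

```python
def rows_to_markdown(rows):
    """Render a list of rows as a compact markdown table (illustrative only)."""

    def clean(cell):
        # Line breaks inside a cell would break the markdown row,
        # so they are replaced by HTML line break tags.
        return "" if cell is None else str(cell).replace("\n", "<br>")

    header, *body = rows
    lines = ["|" + "|".join(clean(c) for c in header) + "|"]
    lines.append("|" + "|".join("---" for _ in header) + "|")
    for row in body:
        lines.append("|" + "|".join(clean(c) for c in row) + "|")
    return "\n".join(lines) + "\n"


print(rows_to_markdown([["Name", "Note"], ["a", "x\ny"]]))
```

Omitting padding around cell text keeps the string short, in the spirit of the small token sizes mentioned above.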
@@ -2334,6 +2338,42 @@ This is an overview of homologous methods on the :ref:`Document` and on the :ref
23342338

23352339
The page number "pno" is a 0-based integer `-∞ < pno < page_count`.
23362340

2341+
2342+
.. class:: TableFinder
2343+
2344+
An object always returned by :meth:`Page.find_tables`. Attributes of interest:
2345+
2346+
.. attribute:: tables
2347+
2348+
A list of :ref:`Table` objects, each of which represents a table found on the page. An empty list if no tables were found.
2349+
2350+
.. attribute:: page
2351+
2352+
A reference to the :ref:`Page` object.
2353+
2354+
2355+
.. class:: Table
2356+
2357+
An object representing a table found on the page. Attributes of interest:
2358+
2359+
.. attribute:: bbox
2360+
2361+
The bounding box of the table given as a tuple `(x0, y0, x1, y1)`. This is the rectangle that contains all cells of the table.
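As a hedged sketch of how these attributes fit together, the hypothetical helper `table_bboxes` below collects the bounding boxes of all tables on a page. It assumes `page` is a PyMuPDF :ref:`Page` and relies only on the `tables` and `bbox` attributes described here.

```python
def table_bboxes(page):
    """Return the bounding boxes of all tables found on `page`.

    Hypothetical helper: `page` is assumed to be a PyMuPDF Page.
    """
    finder = page.find_tables()  # a TableFinder, even if nothing was found
    # `tables` is an empty list when no table was detected,
    # so no special casing is needed.
    return [tuple(table.bbox) for table in finder.tables]
```

Because `tables` is always a list, the result is simply `[]` for a page without tables.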
2362+
2363+
2364+
2365+
.. attribute:: cells
2366+
2367+
2368+
2369+
.. class:: TableHeader
2370+
2371+
.. class:: TableRow
2372+
2373+
2374+
2375+
2376+
23372377
.. note::
23382378

23392379
Most document methods (left column) exist for convenience reasons, and are just wrappers for: *Document[pno].<page method>*. So they **load and discard the page** on each execution.
