docs/page.rst: 43 additions & 3 deletions
@@ -451,7 +451,7 @@ In a nutshell, this is what you can do with PyMuPDF:
:arg bool final_filter: If `True` (default), the method removes rectangles whose width or height is smaller than the respective tolerance value. If `False`, no such filtering is done.
Find tables on the page and return an object with related information. Typically, the default values of the many parameters will be sufficient. Adjustments should only be needed in corner cases.
@@ -485,7 +485,11 @@ In a nutshell, this is what you can do with PyMuPDF:
:arg float text_tolerance: Characters will be combined into words only if their distance is no larger than this value (points). Default is 3. Instead of this value, separate values can be specified for the dimensions using `text_x_tolerance` and `text_y_tolerance`.
-
:arg tuple,list add_lines: Specify a list of "lines" (i.e. pairs of :data:`point_like` objects) as **additional**, "virtual" vector graphics. These lines may help with table and / or cell detection and will not otherwise influence the detection strategy. Especially, in contrast to parameters `horizontal_lines` and `vertical_lines`, they will not prevent detecting rows or columns in other ways. These lines will be treated exactly like "real" vector graphics in terms of joining, snapping, intersectiing, minimum length and containment in the `clip` rectangle. Similarly, lines not parallel to any of the coordinate axes will be ignored.
+
:arg tuple,list add_lines: Specify a list of "lines" (i.e. pairs of :data:`point_like` objects) as **additional**, "virtual" vector graphics. These lines may help with table and / or cell detection and will not otherwise influence the detection strategy. Especially, in contrast to parameters `horizontal_lines` and `vertical_lines`, they will not prevent detecting rows or columns in other ways. These lines will be treated exactly like "real" vector graphics in terms of joining, snapping, intersecting, minimum length and containment in the `clip` rectangle. Similarly, lines not parallel to any of the coordinate axes will be ignored.
+
+
:arg tuple,list add_boxes: Specify a list of rectangles (:data:`rect_like` objects) as **additional**, "virtual" vector graphics. These rectangles may help with table and / or cell detection and will not otherwise influence the detection strategy. Especially, in contrast to parameters `horizontal_lines` and `vertical_lines`, they will not prevent detecting rows or columns in other ways. These rectangles will be treated exactly like "real" vector graphics in terms of joining, snapping, intersecting, minimum length and containment in the `clip` rectangle.
+
+
:arg list paths: A list of vector graphics in the format returned by :meth:`Page.get_drawings`. Using this parameter prevents the method from extracting vector graphics itself. This is useful if the vector graphics are already available and can save significant execution time.
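The parameters `add_lines` and `paths` can be combined, as the following minimal sketch suggests. The helper names `column_rules` and `find_tables_with_hints` are hypothetical and not part of PyMuPDF; the sketch assumes that `page` is a PyMuPDF page whose version already supports the `paths` parameter described above.

```python
def column_rules(xs, y0, y1):
    """Hypothetical helper: one vertical "virtual" line per x-coordinate.

    Each line is a pair of point-like tuples. All lines are parallel to the
    y-axis, because lines not parallel to a coordinate axis are ignored.
    """
    return [((x, y0), (x, y1)) for x in xs]


def find_tables_with_hints(page, xs, y0, y1):
    """Hypothetical helper: table detection with virtual column separators."""
    # Extract the vector graphics once, so they can be reused elsewhere ...
    paths = page.get_drawings()
    lines = column_rules(xs, y0, y1)
    # ... and pass them in, so that find_tables need not extract them again.
    return page.find_tables(add_lines=lines, paths=paths)
```

A call such as `find_tables_with_hints(page, [72, 200, 330], 100, 400)` would add three virtual column separators between y=100 and y=400. The coordinates are made up for illustration.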
.. image:: images/img-findtables.*
@@ -500,7 +504,7 @@ In a nutshell, this is what you can do with PyMuPDF:
* ``bbox``: the bounding box of the table as a tuple `(x0, y0, x1, y1)`.
* ``cells``: bounding boxes of the table's cells (list of tuples). A cell may also be `None`.
* ``extract()``: this method returns the text content of each table cell as a list of lists of strings.
-
* ``to_markdown()``: this method returns the table as a **string in markdown format** (compatible to Github). Supporting viewers can render the string as a table. This output is optimized for **small token** sizes, which is especially beneficial for LLM/RAG feeds. Pandas DataFrames (see method `to_pandas()` below) offer an equivalent markdown table output which however is better readable for the human eye.
+
* ``to_markdown()``: this method returns the table as a **string in markdown format** (compatible with GitHub). Markdown viewers can render the string as a table. This output is optimized for **small token** sizes, which is especially beneficial for LLM/RAG feeds. Pandas DataFrames (see method `to_pandas()` below) offer an equivalent markdown table output which, however, is easier for humans to read. Any line breaks (`\n`) in cells are replaced by HTML line break tags `<br>`.
* ``to_pandas()``: this method returns the table as a `pandas <https://pypi.org/project/pandas/>`_ `DataFrame <https://pandas.pydata.org/docs/reference/frame.html>`_. DataFrames are very versatile objects allowing a plethora of table manipulation methods and outputs to almost 20 well-known formats, among them Excel files, CSV, JSON, markdown-formatted tables and more. `DataFrame.to_markdown()` generates a GitHub-compatible markdown format optimized for human readability. This method, however, requires the package `tabulate <https://pypi.org/project/tabulate/>`_ to be installed in addition to pandas itself.
* ``header``: a `TableHeader` object containing header information of the table.
* ``col_count``: an integer containing the number of table columns.
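To illustrate the compact markdown format and the `<br>` replacement mentioned for ``to_markdown()``, here is a minimal sketch that renders rows like those returned by ``extract()``. The function `rows_to_markdown` is a hypothetical helper and not PyMuPDF's implementation. It assumes the first row is the header and that a cell may be `None`.

```python
def rows_to_markdown(rows):
    """Hypothetical helper: render a list of lists of strings as markdown.

    This is an illustration only, not the code behind Table.to_markdown().
    """
    def clean(cell):
        # Assumption: empty cells may arrive as None; show them as empty text.
        if cell is None:
            return ""
        # Line breaks inside a cell would break the table row, so use <br>.
        return str(cell).replace("\n", "<br>")

    header, *body = rows  # assumption: the first row holds the column names
    lines = ["|" + "|".join(clean(c) for c in header) + "|"]
    lines.append("|" + "|".join("---" for _ in header) + "|")
    for row in body:
        lines.append("|" + "|".join(clean(c) for c in row) + "|")
    return "\n".join(lines) + "\n"
```

No padding is added to align the columns, which keeps the string short. This is the trade-off between small token counts and human readability described above.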
@@ -2334,6 +2338,42 @@ This is an overview of homologous methods on the :ref:`Document` and on the :ref
The page number "pno" is a 0-based integer `-∞ < pno < page_count`.
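One way to read the range `-∞ < pno < page_count` is that negative numbers count backwards from the end of the document, as with Python sequences. The following sketch makes that reading explicit. The function `normalize_pno` is a hypothetical helper, and the wrap-around for values below `-page_count` is an assumption rather than a statement about the library's internals.

```python
def normalize_pno(pno, page_count):
    """Hypothetical helper: map a page number to the range 0 <= n < page_count."""
    if page_count < 1:
        raise ValueError("document has no pages")
    if pno >= page_count:
        raise ValueError("page not in document")
    # Assumption: negative numbers wrap around, so -1 is the last page.
    # Python's % always yields a non-negative result for a positive divisor.
    return pno % page_count
```

Under this reading, `normalize_pno(-1, 10)` selects the last page of a 10-page document, and any `pno` of 10 or more is rejected.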
+
+
.. class:: TableFinder
+
+
An object always returned by :meth:`Page.find_tables`. Attributes of interest:
+
+
.. attribute:: tables
+
+
A list of :ref:`Table` objects, each of which represents a table found on the page. The list is empty if no table was found.
+
+
.. attribute:: page
+
+
A reference to the :ref:`Page` object.
+
+
+
.. class:: Table
+
+
An object representing a table found on the page. Attributes of interest:
+
+
.. attribute:: bbox
+
+
The bounding box of the table given as a tuple `(x0, y0, x1, y1)`. This is the rectangle that contains all cells of the table.
+
+
+
+
.. attribute:: cells
+
+
+
+
.. class:: TableHeader
+
+
.. class:: TableRow
+
+
+
+
+
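A short sketch of how the `tables`, `bbox` and `cells` attributes might be used together. The function `summarize_tables` is a hypothetical helper, and `page` is assumed to be a PyMuPDF page object.

```python
def summarize_tables(page):
    """Hypothetical helper: bounding box and usable cell count per table."""
    finder = page.find_tables()  # a TableFinder, even if nothing was found
    summary = []
    for table in finder.tables:  # empty list if the page has no tables
        # A cell may be None, so count only cells that have a bounding box.
        cell_count = sum(1 for cell in table.cells if cell is not None)
        summary.append((tuple(table.bbox), cell_count))
    return summary
```

Because `tables` is an empty list when no table is detected, the function returns an empty list in that case and needs no special handling.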
.. note::
Most document methods (left column) exist for convenience and are just wrappers for *Document[pno].<page method>*, so they **load and discard the page** on each execution.
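When several pieces of information are needed from the same page, loading the page once and calling the page methods directly avoids the repeated load and discard. A minimal sketch, where `page_text_and_images` is a hypothetical helper and `doc` is assumed to be an open PyMuPDF document:

```python
def page_text_and_images(doc, pno):
    """Hypothetical helper: fetch text and image list with one page load."""
    page = doc[pno]  # load the page once
    text = page.get_text()
    images = page.get_images()
    # The wrapper calls doc.get_page_text(pno) and doc.get_page_images(pno)
    # would return the same data, but each of them loads the page again.
    return text, images
```

For a single call per page, the document-level wrappers remain the shorter option.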