Commit 8657e4e

Update Table "to_markdown" information
1 parent 07eb879 commit 8657e4e

2 files changed: +6 -4 lines changed

docs/page.rst

Lines changed: 3 additions & 3 deletions
@@ -492,7 +492,7 @@ In a nutshell, this is what you can do with PyMuPDF:
 * ``bbox``: the bounding box of the table as a tuple `(x0, y0, x1, y1)`.
 * ``cells``: bounding boxes of the table's cells (list of tuples). A cell may also be `None`.
 * ``extract()``: this method returns the text content of each table cell as a list of list of strings.
-* ``to_markdown()``: this method returns the table as a **string in markdown format** (compatible to Github). Supporting viewers can render the string as a table. This output is optimized for **small token** sizes, which is especially beneficial for LLM/RAG feeds. Pandas DataFrames (see method `to_pandas()` below) offer an equivalent markdown table output which however is better readable for the human eye.
+* ``to_markdown()``: this method returns the table as a **string in markdown format** compatible to Github. Supporting viewers can render the string as a table. This output is optimized for **small token sizes**, which is especially beneficial for LLM/RAG feeds. Pandas DataFrame (see method `to_pandas()` below) also offers a markdown output. While better readable for the human eye, it generally is a larger string than produced by the native method.
 * `to_pandas()`: this method returns the table as a `pandas <https://pypi.org/project/pandas/>`_ `DataFrame <https://pandas.pydata.org/docs/reference/frame.html>`_. DataFrames are very versatile objects allowing a plethora of table manipulation methods and outputs to almost 20 well-known formats, among them Excel files, CSV, JSON, markdown-formatted tables and more. `DataFrame.to_markdown()` generates a Github-compatible markdown format optimized for human readability. This method however requires the package `tabulate <https://pypi.org/project/tabulate/>`_ to be installed in addition to pandas itself.
 * ``header``: a `TableHeader` object containing header information of the table.
 * ``col_count``: an integer containing the number of table columns.
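
The size difference described for ``to_markdown()`` in this hunk can be illustrated without PyMuPDF or pandas. The following is a minimal sketch, not the library's actual implementation: the helper names `compact_markdown` and `padded_markdown` and the sample data are hypothetical, and character count is used only as a rough proxy for token count.

```python
def compact_markdown(header, rows):
    """Render a table without padding: cells are separated by a bare '|'."""
    lines = ["|" + "|".join(header) + "|"]
    lines.append("|" + "|".join("---" for _ in header) + "|")
    for row in rows:
        # Empty cells (None) are written as empty strings.
        lines.append("|" + "|".join("" if c is None else str(c) for c in row) + "|")
    return "\n".join(lines) + "\n"


def padded_markdown(header, rows):
    """Render the same table with every column padded to a common width."""
    cells = [list(header)] + [["" if c is None else str(c) for c in r] for r in rows]
    # Column width = widest cell in that column, but at least 3 for the '---' rule.
    widths = [max(3, max(len(r[i]) for r in cells)) for i in range(len(header))]

    def fmt(row):
        return "| " + " | ".join(c.ljust(w) for c, w in zip(row, widths)) + " |"

    lines = [fmt(cells[0]), "| " + " | ".join("-" * w for w in widths) + " |"]
    lines += [fmt(r) for r in cells[1:]]
    return "\n".join(lines) + "\n"


# Hypothetical sample data; one cell is None, as table cells may be empty.
header = ["Name", "Quantity", "Unit price"]
rows = [["Widget", "2", "4.50"], ["Left-handed screwdriver", "1", None]]

compact = compact_markdown(header, rows)
padded = padded_markdown(header, rows)
print(compact)
print(padded)
print("compact is shorter:", len(compact) < len(padded))
```

Both strings render as the same table in a markdown viewer. The padded form is easier to read as raw text, while the compact form carries fewer characters into an LLM prompt.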
@@ -504,11 +504,11 @@ In a nutshell, this is what you can do with PyMuPDF:
 * ``bbox``: the bounding box of the header.
 * `cells`: a list of bounding boxes containing the name of the respective column.
 * `names`: a list of strings containing the text of each of the cell bboxes. They represent the column names -- which are used when exporting the table to pandas DataFrames, markdown, etc.
-* `external`: a bool indicating whether the header bbox is outside the table body (`True`) or not. Table headers are never identified by the `TableFinder` logic. Therefore, if `external` is true, then the header cells are not part of any cell identified by `TableFinder`. If `external == False`, then the first table row is the header.
+* `external`: a bool indicating whether the header bbox is outside the table body (`True`) or not. Table headers are never identified by the `TableFinder` logic. Therefore, if `external` is true, then the header cells are not part of any cell identified by `TableFinder`. If `external == False`, then the first original table row is the header.

 Please have a look at these `Jupyter notebooks <https://github.com/pymupdf/PyMuPDF-Utilities/tree/master/table-analysis>`_, which cover standard situations like multiple tables on one page or joining table fragments across multiple pages.

-.. caution:: The lifetime of the `TableFinder` object, as well as that of all its tables **equals the lifetime of the page**. If the page object is deleted or reassigned, all tables are no longer valid.
+.. caution:: The lifetime of the `TableFinder` object, as well as that of all its tables **equals the lifetime of the page**. If the page object is deleted or reassigned, all **table objects are no longer valid.**

 The only way to keep table content beyond the page's availability is to **extract it** via methods `Table.to_markdown()`, `Table.to_pandas()` or a copy of `Table.extract()` (e.g. `Table.extract()[:]`).
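
The caution in this hunk suggests copying table content into plain Python objects while the page is still alive. A minimal sketch of that idea follows. The helper name `snapshot_tables` and the dictionary keys are hypothetical; the attributes it reads (`bbox`, `header.names`, `header.external`, `to_markdown()`, `extract()`) are the ones listed in the documentation text above, and `page.find_tables().tables` is assumed to yield the `Table` objects.

```python
import copy


def snapshot_tables(page):
    """Return plain-Python copies of every table found on `page`.

    The result holds only strings, lists, tuples and bools, so it stays
    usable after the page object is deleted or reassigned.
    """
    snapshots = []
    for table in page.find_tables().tables:
        snapshots.append(
            {
                "bbox": tuple(table.bbox),
                "header_names": list(table.header.names),
                "header_external": bool(table.header.external),
                "markdown": table.to_markdown(),
                # Deep copy so later changes to the source rows cannot leak in.
                "rows": copy.deepcopy(table.extract()),
            }
        )
    return snapshots


# Usage sketch (requires PyMuPDF; "input.pdf" is a placeholder path):
#
#   import pymupdf
#   doc = pymupdf.open("input.pdf")
#   page = doc[0]
#   saved = snapshot_tables(page)
#   page = None                      # table objects are invalid from here on
#   for item in saved:
#       print(item["markdown"])      # the extracted copies remain usable
```

The deep copy is slightly more defensive than the `Table.extract()[:]` form mentioned above, which copies only the outer list.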

docs/pymupdf4llm/api.rst

Lines changed: 3 additions & 1 deletion
@@ -16,14 +16,16 @@ The |PyMuPDF4LLM| API

 Prints the version of the library.

-.. method:: to_markdown(doc: pymupdf.Document | str, *, pages: list | range | None = None, hdr_info: Any = None, write_images: bool = False, embed_images: bool = False, dpi: int = 150, image_path="", image_format="png", image_size_limit=0.05, force_text=True, margins=(0, 50, 0, 50), page_chunks: bool = False, page_width: float = 612, page_height: float = None, table_strategy="lines_strict", graphics_limit: int = None, ignore_code: bool = False, extract_words: bool = False, show_progress: bool = True) -> str | list[dict]
+.. method:: to_markdown(doc: pymupdf.Document | str, *, pages: list | range | None = None, filename=None, hdr_info: Any = None, write_images: bool = False, embed_images: bool = False, dpi: int = 150, image_path="", image_format="png", image_size_limit=0.05, force_text=True, margins=(0, 50, 0, 50), page_chunks: bool = False, page_width: float = 612, page_height: float = None, table_strategy="lines_strict", graphics_limit: int = None, ignore_code: bool = False, extract_words: bool = False, show_progress: bool = True) -> str | list[dict]

 Read the pages of the file and outputs the text of its pages in |Markdown| format. How this should happen in detail can be influenced by a number of parameters. Please note that there exists **support for building page chunks** from the |Markdown| text.

 :arg Document,str doc: the file, to be specified either as a file path string, or as a |PyMuPDF| Document (created via `pymupdf.open`). In order to use `pathlib.Path` specifications, Python file-like objects, documents in memory etc. you **must** use a |PyMuPDF| Document.

 :arg list pages: optional, the pages to consider for output (caution: specify 0-based page numbers). If omitted all pages are processed.

+:arg filename: optional. Use this if you want to provide or override the file name. This may especially be useful when the document is opened from memory streams (which have no name and where thus ``doc.name`` is the empty string). This parameter will be used in all places where normally ``doc.name`` would have been used.
+
 :arg hdr_info: optional. Use this if you want to provide your own header detection logic. This may be a callable or an object having a method named `get_header_id`. It must accept a text span (a span dictionary as contained in :meth:`~.extractDICT`) and a keyword parameter "page" (which is the owning :ref:`Page <page>` object). It must return a string "" or up to 6 "#" characters followed by 1 space. If omitted, a full document scan will be performed to find the most popular font sizes and derive header levels based on them. To completely avoid this behavior specify `hdr_info=lambda s, page=None: ""` or `hdr_info=False`.

 :arg bool write_images: when encountering images or vector graphics, images will be created from the respective page area and stored in the specified folder. Markdown references will be generated pointing to these images. Any text contained in these areas will not be included in the text output (but appear as part of the images). Therefore, if for instance your document has text written on full page images, make sure to set this parameter to `False`.
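
The `hdr_info` contract described above (a callable taking a span dictionary and a `page` keyword, returning "" or up to six "#" characters followed by one space) can be sketched as follows. The function name `header_by_font_size` and the size thresholds are hypothetical choices for illustration; the sketch assumes the span dictionary carries a `"size"` key with the font size.

```python
def header_by_font_size(span, page=None):
    """Map a text span to a markdown header prefix based on its font size.

    Returns "" for body text, or "# ", "## ", "### " for headings.
    The thresholds below are arbitrary example values.
    """
    size = span.get("size", 0)
    if size >= 20:
        return "# "
    if size >= 16:
        return "## "
    if size >= 13:
        return "### "
    return ""


# Quick check with hand-made span dictionaries:
print(repr(header_by_font_size({"size": 24, "text": "Title"})))
print(repr(header_by_font_size({"size": 11, "text": "Body text"})))

# Usage sketch (requires PyMuPDF and PyMuPDF4LLM; "input.pdf" is a placeholder):
#
#   import pymupdf
#   import pymupdf4llm
#   with open("input.pdf", "rb") as f:
#       data = f.read()
#   doc = pymupdf.open(stream=data, filetype="pdf")   # doc.name is "" here
#   md = pymupdf4llm.to_markdown(
#       doc,
#       filename="input.pdf",             # stands in for the empty doc.name
#       hdr_info=header_by_font_size,
#   )
```

Supplying a callable like this skips the full document scan for popular font sizes, at the cost of hard-coding thresholds that may not suit every document.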
