Table Cell Markdown Support

JorjMcKie · JorjMcKie · commit 9a16da96f918 · 2025-06-11T11:11:52.000-04:00
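
The cell styling that this commit adds to `extract_cells` can be hard to follow inside the diff, so here is a minimal standalone sketch of the same idea. It is not the code from `src/table.py`: the names `md_markers` and `join_spans` are hypothetical, spans are plain `(text, style)` tuples instead of PyMuPDF span dictionaries, and the bbox filtering and trailing-whitespace trimming of the real function are left out.

```python
def md_markers(strikeout=False, bold=False, italic=False, mono=False):
    """Build the Markdown prefix and suffix for one span (hypothetical helper).

    The nesting order follows the diff: strikeout outermost, then bold,
    then italic, with mono-spaced innermost.
    """
    prefix = ""
    suffix = ""
    if strikeout:
        prefix += "~~"
        suffix = "~~" + suffix
    if bold:
        prefix += "**"
        suffix = "**" + suffix
    if italic:
        prefix += "_"
        suffix = "_" + suffix
    if mono:
        prefix += "`"
        suffix = "`" + suffix
    return prefix, suffix


def join_spans(lines):
    """Join styled spans into one Markdown cell string (hypothetical helper).

    'lines' is a list of text lines; each line is a list of
    (span_text, style_dict) tuples. Lines are joined with "<br>".
    """
    text = ""
    for spans in lines:
        if text:  # a new line inside the same cell
            text += "<br>"
        for span_text, style in spans:
            if not span_text:
                continue  # skip empty span
            prefix, suffix = md_markers(**style)
            if suffix and text.endswith(suffix):
                # same closing markers as the previous span: extend it
                text = text[: -len(suffix)] + span_text + suffix
            elif not span_text.strip():
                text += " "  # white-space-only span, no styling
            else:
                text += prefix + span_text + suffix
    return text.strip()


bold = {"bold": True}
print(join_spans([[("Hello ", bold), ("world", bold)]]))  # **Hello world**
```

Two adjacent spans with identical closing markers are merged into one styled run, which avoids output such as `**Hello ****world**`. A line break always starts a new styled run, matching the expected string in `test_md_styles`.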
diff --git a/docs/page.rst b/docs/page.rst
@@ -331,7 +331,7 @@ In a nutshell, this is what you can do with PyMuPDF:
       |history_end|
 
 
-      .. method:: apply_redactions(images=PDF_REDACT_IMAGE_PIXELS|2, graphics=PDF_REDACT_LINE_ART_REMOVE_IF_TOUCHED|2, text=PDF_REDACT_TEXT_REMOVE|0)
+   .. method:: apply_redactions(images=PDF_REDACT_IMAGE_PIXELS|2, graphics=PDF_REDACT_LINE_ART_REMOVE_IF_TOUCHED|2, text=PDF_REDACT_TEXT_REMOVE|0)
 
       **PDF only**: Remove all **content** contained in any redaction rectangle on the page.
 
@@ -2338,18 +2338,28 @@ This is an overview of homologous methods on the :ref:`Document` and on the :ref
 
 The page number "pno" is a 0-based integer `-∞ < pno < page_count`.
 
+.. note::
+
+   Most document methods (left column) exist for convenience reasons, and are just wrappers for: *Document[pno].<page method>*. So they **load and discard the page** on each execution.
+
+   However, the first two methods work differently. They only need a page's object definition statement - the page itself will **not** be loaded. So e.g. :meth:`Page.get_fonts` is a wrapper the other way round and defined as follows: *page.get_fonts == page.parent.get_page_fonts(page.number)*.
+
 
 .. class:: TableFinder
 
    An object always returned by :meth:`Page.find_tables`. Attributes of interest:
 
-   ... attribute:: tables
+   .. attribute:: tables
 
-      A list of :ref:`Table` objects, each of which represents a table found on the page. Empty list if no table found.
+      A list of `Table` objects, each of which represents a table found on the page. Empty list if no table found.
 
-   ... attribute:: page
+   .. attribute:: page
 
-      A reference to the :ref:`Page` object.
+      A reference (weakref proxy) to the owning :ref:`Page` object.
+
+   .. attribute:: cells
+
+      A list of tuples `(x0, y0, x1, y1)` representing the bounding boxes of all table cells (in any tables) found on the page. Note that cells may also be ``None`` objects, which are created to enforce a complete rows x columns structure for the affected table.
 
 
 .. class:: Table
@@ -2360,25 +2370,85 @@ The page number "pno" is a 0-based integer `-∞ < pno < page_count`.
 
       The bounding box of the table given as a tuple `(x0, y0, x1, y1)`. This is the rectangle that contains all cells of the table.   
 
-   
-
    .. attribute:: cells
 
+      A list of tuples `(x0, y0, x1, y1)` representing the bounding boxes of the cells in the table. Note that cells may also be ``None`` objects, which will happen to prevent gaps in a rows x columns structure.
+
+   .. attribute:: rows
+
+      A list of :ref:`TableRow` objects, each of which represents a row in the table. The order of rows is the same as in the original table. If the table has no rows, this will be an empty list.
+
+   .. attribute:: col_count
+
+      The number of columns in the table (integer).
+
+   .. attribute:: row_count
+
+      The number of rows in the table (integer).
+
+   .. method:: extract
+
+      Returns a (row-major) list of lists representing the plain text of the table cells. Each sublist contains the text of one row, and each item in that sublist is the text of one cell in that row. So, `Table.extract()[i][j]` will return the text of the cell in row ``i`` and column ``j``. If a cell is empty, the corresponding item will be an empty string. If the corresponding bounding box is ``None``, the item will also be ``None``.
+
+   .. method:: to_markdown(clean=False, fill_empty=True)
+
+      Returns a string in `GitHub Markdown format <https://docs.github.com/en/get-started/writing-on-github/working-with-advanced-formatting/organizing-information-with-tables>`_ representing the table. The string will contain a header line with column names, followed by a separator line, and then the rows of the table. The text of each cell will be enclosed in pipe characters `|`, and each row will be separated by a newline character `\n`. **Line breaks inside a cell** are replaced by the HTML `<br>` tag. Bold, italic, mono-spaced and strikethrough text will be styled according to the corresponding Markdown syntax.
+
+      - Bold text will be enclosed in double asterisks ``"**"``.
+
+      - Italic text will be enclosed in single underscore ``"_"``.
+
+      - Mono-spaced text will be enclosed in backticks ``"`"``.
+
+      - Strikethrough text will be enclosed in double tildes ``"~~"``.
+      
+      :arg bool clean: if ``True``, any hyphen "-" in the text is replaced by the HTML entity ``"&#45;"``.
+      
+      :arg bool fill_empty: if ``True``, empty cells will be filled with a copy of neighboring cells in an effort to indicate potential column and row spans.
+
+         * For each row and starting with index 1, the cell content will be replaced with the content of its left neighbor if it is ``None``.
+
+         * For each column and starting with index 1, the cell content will be replaced with the content of its upper neighbor if it is ``None``.
+
+
+   .. method:: to_pandas()
+
+      Returns a `pandas.DataFrame <https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html>`_ representing the table. The DataFrame offers a plethora of functions, among them conversion to 20+ file formats (CSV, markdown, JSON, Excel, HDF5 etc.). Where necessary, the table can be refined in multiple ways (e.g. deleting empty rows or columns) and multiple DataFrames can be joined.
 
 
 .. class:: TableHeader
 
+   .. attribute:: names
+
+      A list of strings representing the column names of the `Table`. This is usually the text content of the top row cells, but may instead be content identified above the detected table. Which of the two cases applies is indicated by the following attribute.
+
+   .. attribute:: is_external
+
+      Whether the header is part of the originally detected table (``False``) or was identified above the table (``True``). If ``True``, the header is not part of the table, but is used to identify the columns in the table. In this case, the header text will be used as column names in the extracted data.
+
+   .. attribute:: bbox
+
+      The bounding box of the header given as a tuple `(x0, y0, x1, y1)`. This is the rectangle that contains all cells of the header. If the header is not part of the table, this will be the rectangle that contains all cells of the header text, otherwise it is equal to the top row's bounding box.
+
+   .. attribute:: cells
+
+      A list of bounding box tuples `(x0, y0, x1, y1)` of the cells in the header. Note that cells may also be ``None``, which will happen to prevent any gaps in a rows x columns structure. If the header is not part of the table, this will be the bounding boxes of the header text.
+
+
 .. class:: TableRow
 
+   An object defining a row in a `Table` found on the page. Attributes of interest:
 
+   .. attribute:: bbox
 
+      The bounding box of the row given as a tuple `(x0, y0, x1, y1)`. This is the rectangle that contains all cells of the row.
 
+   .. attribute:: cells
+
+      A list of bounding box tuples `(x0, y0, x1, y1)` of the cells in this row. Note that cells may also be ``None`` objects, which will happen to prevent gaps in a rows x columns structure.
 
-.. note::
 
-   Most document methods (left column) exist for convenience reasons, and are just wrappers for: *Document[pno].<page method>*. So they **load and discard the page** on each execution.
 
-   However, the first two methods work differently. They only need a page's object definition statement - the page itself will **not** be loaded. So e.g. :meth:`Page.get_fonts` is a wrapper the other way round and defined as follows: *page.get_fonts == page.parent.get_page_fonts(page.number)*.
 
 .. rubric:: Footnotes
 
diff --git a/src/table.py b/src/table.py
@@ -89,18 +89,130 @@
     Matrix,
     TEXTFLAGS_TEXT,
     TEXT_FONT_BOLD,
+    TEXT_FONT_ITALIC,
+    TEXT_FONT_MONOSPACED,
     TEXT_FONT_SUPERSCRIPT,
+    TEXT_COLLECT_STYLES,
     TOOLS,
     EMPTY_RECT,
     sRGB_to_pdf,
     Point,
     message,
+    mupdf,
 )
 
 EDGES = []  # vector graphics from PyMuPDF
 CHARS = []  # text characters from PyMuPDF
 TEXTPAGE = None
+TEXT_BOLD = mupdf.FZ_STEXT_BOLD
+TEXT_STRIKEOUT = mupdf.FZ_STEXT_STRIKEOUT
+FLAGS = TEXTFLAGS_TEXT | TEXT_COLLECT_STYLES
+
 white_spaces = set(string.whitespace)  # for checking white space only cells
+
+
+def extract_cells(textpage, cell, markdown=False):
+    """Extract text from a rect-like 'cell' as plain or MD style text.
+
+    This function should ultimately be used to extract text from a table cell.
+    Markdown output will only work correctly if extraction flag bit
+    TEXT_COLLECT_STYLES is set.
+
+    Args:
+        textpage: A PyMuPDF TextPage object. Must have been created with
+            TEXTFLAGS_TEXT | TEXT_COLLECT_STYLES.
+        cell: A tuple (x0, y0, x1, y1) defining the cell's bbox.
+        markdown: If True, return text formatted for Markdown.
+
+    Returns:
+        A string with the text extracted from the cell.
+    """
+    text = ""
+    for block in textpage.extractRAWDICT()["blocks"]:
+        if block["type"] != 0:
+            continue
+        block_bbox = block["bbox"]
+        if (
+            0
+            or block_bbox[0] > cell[2]
+            or block_bbox[2] < cell[0]
+            or block_bbox[1] > cell[3]
+            or block_bbox[3] < cell[1]
+        ):
+            continue  # skip block outside cell
+        line_count = len(block["lines"])
+        for line in block["lines"]:
+            lbbox = line["bbox"]
+            if (
+                0
+                or lbbox[0] > cell[2]
+                or lbbox[2] < cell[0]
+                or lbbox[1] > cell[3]
+                or lbbox[3] < cell[1]
+            ):
+                continue  # skip line outside cell
+
+            if text:  # must be a new line in the cell
+                text += "<br>" if markdown else "\n"
+
+            # strikeout detection only works with horizontal text
+            horizontal = line["dir"] == (0, 1) or line["dir"] == (1, 0)
+
+            for span in line["spans"]:
+                sbbox = span["bbox"]
+                if (
+                    0
+                    or sbbox[0] > cell[2]
+                    or sbbox[2] < cell[0]
+                    or sbbox[1] > cell[3]
+                    or sbbox[3] < cell[1]
+                ):
+                    continue  # skip spans outside cell
+
+                # only include chars with more than 50% bbox overlap
+                span_text = ""
+                for char in span["chars"]:
+                    bbox = Rect(char["bbox"])
+                    if abs(bbox & cell) > 0.5 * abs(bbox):
+                        span_text += char["c"]
+
+                if not span_text:
+                    continue  # skip empty span
+
+                if not markdown:  # no MD styling
+                    text += span_text
+                    continue
+
+                prefix = ""
+                suffix = ""
+                if horizontal and span["char_flags"] & TEXT_STRIKEOUT:
+                    prefix += "~~"
+                    suffix = "~~" + suffix
+                if span["char_flags"] & TEXT_BOLD:
+                    prefix += "**"
+                    suffix = "**" + suffix
+                if span["flags"] & TEXT_FONT_ITALIC:
+                    prefix += "_"
+                    suffix = "_" + suffix
+                if span["flags"] & TEXT_FONT_MONOSPACED:
+                    prefix += "`"
+                    suffix = "`" + suffix
+
+                if len(span["chars"]) > 2:
+                    span_text = span_text.rstrip()
+
+                # if span continues previous styling: extend cell text
+                if (ls := len(suffix)) and text.endswith(suffix):
+                    text = text[:-ls] + span_text + suffix
+                else:  # append the span with new styling
+                    if not span_text.strip():
+                        text += " "
+                    else:
+                        text += prefix + span_text + suffix
+
+    return text.strip()
+
+
 # -------------------------------------------------------------------
 # End of PyMuPDF interface code
 # -------------------------------------------------------------------
@@ -1382,7 +1494,18 @@ def to_markdown(self, clean=False, fill_empty=True):
         output = "|"
         rows = self.row_count
         cols = self.col_count
-        cells = self.extract()[:]  # make local copy of table text content
+
+        # cell coordinates
+        cell_boxes = [[c for c in r.cells] for r in self.rows]
+
+        # cell text strings
+        cells = [[None for i in range(cols)] for j in range(rows)]
+        for i, row in enumerate(cell_boxes):
+            for j, cell in enumerate(row):
+                if cell is not None:
+                    cells[i][j] = extract_cells(
+                        TEXTPAGE, cell_boxes[i][j], markdown=True
+                    )
 
         if fill_empty:  # fill "None" cells where possible
 
@@ -1420,7 +1543,8 @@ def to_markdown(self, clean=False, fill_empty=True):
             for i, cell in enumerate(row):
                 # replace None cells with empty string
                 # use HTML line break tag
-                cell = "" if not cell else cell.replace("\n", "<br>")
+                if cell is None:
+                    cell = ""
                 if clean:  # remove sensitive syntax
                     cell = html.escape(cell.replace("-", "&#45;"))
                 line += cell + "|"
@@ -1944,7 +2068,7 @@ def make_chars(page, clip=None):
     page_number = page.number + 1
     page_height = page.rect.height
     ctm = page.transformation_matrix
-    TEXTPAGE = page.get_textpage(clip=clip, flags=TEXTFLAGS_TEXT)
+    TEXTPAGE = page.get_textpage(clip=clip, flags=FLAGS)
     blocks = page.get_text("rawdict", textpage=TEXTPAGE)["blocks"]
     doctop_base = page_height * page.number
     for block in blocks:
diff --git a/tests/resources/test-styled-table.pdf b/tests/resources/test-styled-table.pdf
diff --git a/tests/test_tables.py b/tests/test_tables.py
@@ -423,3 +423,13 @@ def test_4017():
             ["Weighted Average Life", "4.83", "<=", "9.00", "", "PASS", "4.92"],
         ]
         assert tables[-1].extract() == expected_b
+
+
+def test_md_styles():
+    """Test output of table with MD-styled cells."""
+    filename = os.path.join(scriptdir, "resources", "test-styled-table.pdf")
+    doc = pymupdf.open(filename)
+    page = doc[0]
+    tabs = page.find_tables()[0]
+    text = """|Column 1|Column 2|Column 3|\n|---|---|---|\n|Zelle (0,0)|**Bold (0,1)**|Zelle (0,2)|\n|~~Strikeout (1,0), Zeile 1~~<br>~~Hier kommt Zeile 2.~~|Zelle (1,1)|~~Strikeout (1,2)~~|\n|**`Bold-monospaced`**<br>**`(2,0)`**|_Italic (2,1)_|**_Bold-italic_**<br>**_(2,2)_**|\n|Zelle (3,0)|~~**Bold-strikeout**~~<br>~~**(3,1)**~~|Zelle (3,2)|\n\n"""
+    assert tabs.to_markdown() == text