Commit 9a16da9

Table Cell Markdown Support
1 parent ed96c10 commit 9a16da9

4 files changed: +217 -13 lines changed

docs/page.rst

Lines changed: 80 additions & 10 deletions
@@ -331,7 +331,7 @@ In a nutshell, this is what you can do with PyMuPDF:
331331
|history_end|
332332

333333

334-
.. method:: apply_redactions(images=PDF_REDACT_IMAGE_PIXELS|2, graphics=PDF_REDACT_LINE_ART_REMOVE_IF_TOUCHED|2, text=PDF_REDACT_TEXT_REMOVE|0)
334+
.. method:: apply_redactions(images=PDF_REDACT_IMAGE_PIXELS|2, graphics=PDF_REDACT_LINE_ART_REMOVE_IF_TOUCHED|2, text=PDF_REDACT_TEXT_REMOVE|0)
335335

336336
**PDF only**: Remove all **content** contained in any redaction rectangle on the page.
337337

@@ -2338,18 +2338,28 @@ This is an overview of homologous methods on the :ref:`Document` and on the :ref
23382338

23392339
The page number "pno" is a 0-based integer `-∞ < pno < page_count`.
23402340

2341+
.. note::
2342+
2343+
Most document methods (left column) exist for convenience reasons, and are just wrappers for: *Document[pno].<page method>*. So they **load and discard the page** on each execution.
2344+
2345+
However, the first two methods work differently. They only need a page's object definition statement - the page itself will **not** be loaded. So e.g. :meth:`Page.get_fonts` is a wrapper the other way round and defined as follows: *page.get_fonts == page.parent.get_page_fonts(page.number)*.
2346+
23412347

23422348
.. class:: TableFinder
23432349

23442350
An object always returned by :meth:`Page.find_tables`. Attributes of interest:
23452351

2346-
... attribute:: tables
2352+
.. attribute:: tables
23472353

2348-
A list of :ref:`Table` objects, each of which represents a table found on the page. Empty list if no table found.
2354+
A list of `Table` objects, each of which represents a table found on the page. Empty list if no table found.
23492355

2350-
... attribute:: page
2356+
.. attribute:: page
23512357

2352-
A reference to the :ref:`Page` object.
2358+
A reference (weakref proxy) to the owning :ref:`Page` object.
2359+
2360+
.. attribute:: cells
2361+
2362+
A list of tuples `(x0, y0, x1, y1)` representing the bounding boxes of all table cells (in any tables) found on the page. Note that cells may also be ``None`` objects, which are created to enforce a complete rows x columns structure for the affected table.
23532363

23542364

23552365
.. class:: Table
@@ -2360,25 +2370,85 @@ The page number "pno" is a 0-based integer `-∞ < pno < page_count`.
23602370

23612371
The bounding box of the table given as a tuple `(x0, y0, x1, y1)`. This is the rectangle that contains all cells of the table.
23622372

2363-
2364-
23652373
.. attribute:: cells
23662374

2375+
A list of tuples `(x0, y0, x1, y1)` representing the bounding boxes of the cells in the table. Note that cells may also be ``None`` objects, which will happen to prevent gaps in a rows x columns structure.
2376+
2377+
.. attribute:: rows
2378+
2379+
A list of :ref:`TableRow` objects, each of which represents a row in the table. The order of rows is the same as in the original table. If the table has no rows, this will be an empty list.
2380+
2381+
.. attribute:: col_count
2382+
2383+
The number of columns in the table (integer).
2384+
2385+
.. attribute:: row_count
2386+
2387+
The number of rows in the table (integer).
2388+
2389+
.. method:: extract
2390+
2391+
Returns a (row-major) list of lists representing the plain text of the table cells. Each sublist contains the text of one row, and each item in that sublist is the text of one cell in that row. So, `Table.extract()[i][j]` will return the text of the cell in row ``i`` and column ``j``. If a cell is empty, the corresponding item will be an empty string. If the corresponding bounding box is ``None``, the item will also be ``None``.
2392+
2393+
.. method:: to_markdown(clean=False, fill_empty=True)
2394+
2395+
Returns a string in `GitHub Markdown format <https://docs.github.com/en/get-started/writing-on-github/working-with-advanced-formatting/organizing-information-with-tables>`_ representing the table. The string will contain a header line with column names, followed by a separator line, and then the rows of the table. The text of each cell will be enclosed in pipe characters `|`, and each row will be separated by a newline character `\n`. **Line breaks inside a cell** are replaced by the HTML `<br>` tag. Bold, italic, mono-spaced and strikethrough text will be styled according to the corresponding Markdown syntax.
2396+
2397+
- Bold text will be enclosed in double asterisks ``"**"``.
2398+
2399+
- Italic text will be enclosed in single underscore ``"_"``.
2400+
2401+
- Mono-spaced text will be enclosed in backticks ``"`"``.
2402+
2403+
- Strikethrough text will be enclosed in double tildes ``"~~"``.
2404+
2405+
:arg bool clean: if ``True``, any hyphen "-" in the text is replaced by a ``"&#45;"`` character.
2406+
2407+
:arg bool fill_empty: if ``True``, empty cells will be filled with a copy of neighboring cells in an effort to indicate potential column and row spans.
2408+
2409+
* For each row and starting with index 1, the cell content will be replaced with the content of its left neighbor if it is ``None``.
2410+
2411+
* For each column and starting with index 1, the cell content will be replaced with the content of its upper neighbor if it is ``None``.
2412+
2413+
2414+
.. method:: to_pandas()
2415+
2416+
Returns a `pandas.DataFrame <https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html>`_ representing the table. The DataFrame offers a plethora of functions, among them conversion to 20+ file formats (CSV, markdown, JSON, Excel, HDF5 etc.). Where necessary, the table can be refined in multiple ways (e.g. deleting empty rows or columns) and multiple DataFrames can be joined.
23672417

23682418

23692419
.. class:: TableHeader
23702420

2421+
.. attribute:: names
2422+
2423+
A list of strings representing the column names of the `Table`. This is usually the text content of the top row cells, but may instead be content identified above the detected table. The respective situation is encoded in the following attribute.
2424+
2425+
.. attribute:: is_external
2426+
2427+
Whether the header is part of the originally detected table (``False``) or was identified above the table (``True``). If ``True``, the header is not part of the table, but is used to identify the columns in the table. In this case, the header text will be used as column names in the extracted data.
2428+
2429+
.. attribute:: bbox
2430+
2431+
The bounding box of the header given as a tuple `(x0, y0, x1, y1)`. This is the rectangle that contains all cells of the header. If the header is not part of the table, this will be the rectangle that contains all cells of the header text, otherwise it is equal to the top row's bounding box.
2432+
2433+
.. attribute:: cells
2434+
2435+
A list of bounding box tuples `(x0, y0, x1, y1)` of the cells in the header. Note that cells may also be ``None``, which will happen to prevent any gaps in a rows x columns structure. If the header is not part of the table, this will be the bounding boxes of the header text.
2436+
2437+
23712438
.. class:: TableRow
23722439

2440+
An object defining a row in a `Table` found on the page. Attributes of interest:
23732441

2442+
.. attribute:: bbox
23742443

2444+
The bounding box of the row given as a tuple `(x0, y0, x1, y1)`. This is the rectangle that contains all cells of the row.
23752445

2446+
.. attribute:: cells
2447+
2448+
A list of bounding box tuples `(x0, y0, x1, y1)` of the cells in this row. Note that cells may also be ``None`` objects, which will happen to prevent gaps in a rows x columns structure.
23762449

2377-
.. note::
23782450

2379-
Most document methods (left column) exist for convenience reasons, and are just wrappers for: *Document[pno].<page method>*. So they **load and discard the page** on each execution.
23802451

2381-
However, the first two methods work differently. They only need a page's object definition statement - the page itself will **not** be loaded. So e.g. :meth:`Page.get_fonts` is a wrapper the other way round and defined as follows: *page.get_fonts == page.parent.get_page_fonts(page.number)*.
23822452

23832453
.. rubric:: Footnotes
23842454
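The `fill_empty` rule documented for `Table.to_markdown` in this file can be illustrated on a plain list of lists. The following is a minimal sketch and not the PyMuPDF implementation: the function name `fill_empty_cells` is hypothetical, and it assumes that "starting with index 1" means the left-neighbor pass begins at column 1 and the upper-neighbor pass begins at row 1.

```python
from copy import deepcopy


def fill_empty_cells(cells):
    """Hypothetical helper: fill None cells from their left, then upper, neighbor.

    `cells` is a row-major list of lists; the input is not modified.
    """
    grid = deepcopy(cells)

    # Pass 1: within each row, a None cell takes the content of its left neighbor.
    for row in grid:
        for j in range(1, len(row)):
            if row[j] is None:
                row[j] = row[j - 1]

    # Pass 2: within each column, a None cell takes the content of its upper neighbor.
    for i in range(1, len(grid)):
        for j in range(len(grid[i])):
            if grid[i][j] is None:
                grid[i][j] = grid[i - 1][j]

    return grid
```

A cell in column 0 of row 0 has neither neighbor, so it stays `None` under this reading of the rule.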

src/table.py

Lines changed: 127 additions & 3 deletions
@@ -89,18 +89,130 @@
8989
Matrix,
9090
TEXTFLAGS_TEXT,
9191
TEXT_FONT_BOLD,
92+
TEXT_FONT_ITALIC,
93+
TEXT_FONT_MONOSPACED,
9294
TEXT_FONT_SUPERSCRIPT,
95+
TEXT_COLLECT_STYLES,
9396
TOOLS,
9497
EMPTY_RECT,
9598
sRGB_to_pdf,
9699
Point,
97100
message,
101+
mupdf,
98102
)
99103

100104
EDGES = [] # vector graphics from PyMuPDF
101105
CHARS = [] # text characters from PyMuPDF
102106
TEXTPAGE = None
107+
TEXT_BOLD = mupdf.FZ_STEXT_BOLD
108+
TEXT_STRIKEOUT = mupdf.FZ_STEXT_STRIKEOUT
109+
FLAGS = TEXTFLAGS_TEXT | TEXT_COLLECT_STYLES
110+
103111
white_spaces = set(string.whitespace) # for checking white space only cells
112+
113+
114+
def extract_cells(textpage, cell, markdown=False):
115+
"""Extract text from a rect-like 'cell' as plain or MD style text.
116+
117+
This function should ultimately be used to extract text from a table cell.
118+
Markdown output will only work correctly if extraction flag bit
119+
TEXT_COLLECT_STYLES is set.
120+
121+
Args:
122+
textpage: A PyMuPDF TextPage object. Must have been created with
123+
TEXTFLAGS_TEXT | TEXT_COLLECT_STYLES.
124+
cell: A tuple (x0, y0, x1, y1) defining the cell's bbox.
125+
markdown: If True, return text formatted for Markdown.
126+
127+
Returns:
128+
A string with the text extracted from the cell.
129+
"""
130+
text = ""
131+
for block in textpage.extractRAWDICT()["blocks"]:
132+
if block["type"] != 0:
133+
continue
134+
block_bbox = block["bbox"]
135+
if (
136+
0
137+
or block_bbox[0] > cell[2]
138+
or block_bbox[2] < cell[0]
139+
or block_bbox[1] > cell[3]
140+
or block_bbox[3] < cell[1]
141+
):
142+
continue # skip block outside cell
143+
line_count = len(block["lines"])
144+
for line in block["lines"]:
145+
lbbox = line["bbox"]
146+
if (
147+
0
148+
or lbbox[0] > cell[2]
149+
or lbbox[2] < cell[0]
150+
or lbbox[1] > cell[3]
151+
or lbbox[3] < cell[1]
152+
):
153+
continue # skip line outside cell
154+
155+
if text: # must be a new line in the cell
156+
text += "<br>" if markdown else "\n"
157+
158+
# strikeout detection only works with horizontal text
159+
horizontal = line["dir"] == (0, 1) or line["dir"] == (1, 0)
160+
161+
for span in line["spans"]:
162+
sbbox = span["bbox"]
163+
if (
164+
0
165+
or sbbox[0] > cell[2]
166+
or sbbox[2] < cell[0]
167+
or sbbox[1] > cell[3]
168+
or sbbox[3] < cell[1]
169+
):
170+
continue # skip spans outside cell
171+
172+
# only include chars with more than 50% bbox overlap
173+
span_text = ""
174+
for char in span["chars"]:
175+
bbox = Rect(char["bbox"])
176+
if abs(bbox & cell) > 0.5 * abs(bbox):
177+
span_text += char["c"]
178+
179+
if not span_text:
180+
continue # skip empty span
181+
182+
if not markdown: # no MD styling
183+
text += span_text
184+
continue
185+
186+
prefix = ""
187+
suffix = ""
188+
if horizontal and span["char_flags"] & TEXT_STRIKEOUT:
189+
prefix += "~~"
190+
suffix = "~~" + suffix
191+
if span["char_flags"] & TEXT_BOLD:
192+
prefix += "**"
193+
suffix = "**" + suffix
194+
if span["flags"] & TEXT_FONT_ITALIC:
195+
prefix += "_"
196+
suffix = "_" + suffix
197+
if span["flags"] & TEXT_FONT_MONOSPACED:
198+
prefix += "`"
199+
suffix = "`" + suffix
200+
201+
if len(span["chars"]) > 2:
202+
span_text = span_text.rstrip()
203+
204+
# if span continues previous styling: extend cell text
205+
if (ls := len(suffix)) and text.endswith(suffix):
206+
text = text[:-ls] + span_text + suffix
207+
else: # append the span with new styling
208+
if not span_text.strip():
209+
text += " "
210+
else:
211+
text += prefix + span_text + suffix
212+
213+
return text.strip()
214+
215+
104216
# -------------------------------------------------------------------
105217
# End of PyMuPDF interface code
106218
# -------------------------------------------------------------------
@@ -1382,7 +1494,18 @@ def to_markdown(self, clean=False, fill_empty=True):
13821494
output = "|"
13831495
rows = self.row_count
13841496
cols = self.col_count
1385-
cells = self.extract()[:] # make local copy of table text content
1497+
1498+
# cell coordinates
1499+
cell_boxes = [[c for c in r.cells] for r in self.rows]
1500+
1501+
# cell text strings
1502+
cells = [[None for i in range(cols)] for j in range(rows)]
1503+
for i, row in enumerate(cell_boxes):
1504+
for j, cell in enumerate(row):
1505+
if cell is not None:
1506+
cells[i][j] = extract_cells(
1507+
TEXTPAGE, cell_boxes[i][j], markdown=True
1508+
)
13861509

13871510
if fill_empty: # fill "None" cells where possible
13881511

@@ -1420,7 +1543,8 @@ def to_markdown(self, clean=False, fill_empty=True):
14201543
for i, cell in enumerate(row):
14211544
# replace None cells with empty string
14221545
# use HTML line break tag
1423-
cell = "" if not cell else cell.replace("\n", "<br>")
1546+
if cell is None:
1547+
cell = ""
14241548
if clean: # remove sensitive syntax
14251549
cell = html.escape(cell.replace("-", "&#45;"))
14261550
line += cell + "|"
@@ -1944,7 +2068,7 @@ def make_chars(page, clip=None):
19442068
page_number = page.number + 1
19452069
page_height = page.rect.height
19462070
ctm = page.transformation_matrix
1947-
TEXTPAGE = page.get_textpage(clip=clip, flags=TEXTFLAGS_TEXT)
2071+
TEXTPAGE = page.get_textpage(clip=clip, flags=FLAGS)
19482072
blocks = page.get_text("rawdict", textpage=TEXTPAGE)["blocks"]
19492073
doctop_base = page_height * page.number
19502074
for block in blocks:
73 KB
Binary file not shown.
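The prefix/suffix handling that `extract_cells` adds in src/table.py can be tried without a PDF. The sketch below is a simplified stand-in, not the committed code: `style_spans` and the `MARKERS` table are hypothetical names, spans are given as `(text, set_of_style_names)` pairs instead of rawdict spans, and the trailing-whitespace trimming of longer spans is left out.

```python
# Style markers in the order the diff applies them: the prefix is built
# left to right, the suffix in reverse so that markers nest properly.
MARKERS = [
    ("strikeout", "~~"),
    ("bold", "**"),
    ("italic", "_"),
    ("mono", "`"),
]


def style_spans(spans):
    """Hypothetical helper: join (text, styles) pairs into Markdown text."""
    text = ""
    for span_text, styles in spans:
        prefix = ""
        suffix = ""
        for name, mark in MARKERS:
            if name in styles:
                prefix += mark
                suffix = mark + suffix

        if suffix and text.endswith(suffix):
            # Same styling as the previous span: extend inside the markers.
            text = text[: -len(suffix)] + span_text + suffix
        elif not span_text.strip():
            # Whitespace-only span: emit a single unstyled space.
            text += " "
        else:
            text += prefix + span_text + suffix
    return text.strip()
```

Merging adjacent spans with identical styling avoids output such as `**Bold ****text**`, which Markdown renderers may not display as one bold run.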

tests/test_tables.py

Lines changed: 10 additions & 0 deletions
@@ -423,3 +423,13 @@ def test_4017():
423423
["Weighted Average Life", "4.83", "<=", "9.00", "", "PASS", "4.92"],
424424
]
425425
assert tables[-1].extract() == expected_b
426+
427+
428+
def test_md_styles():
429+
"""Test output of table with MD-styled cells."""
430+
filename = os.path.join(scriptdir, "resources", "test-styled-table.pdf")
431+
doc = pymupdf.open(filename)
432+
page = doc[0]
433+
tabs = page.find_tables()[0]
434+
text = """|Column 1|Column 2|Column 3|\n|---|---|---|\n|Zelle (0,0)|**Bold (0,1)**|Zelle (0,2)|\n|~~Strikeout (1,0), Zeile 1~~<br>~~Hier kommt Zeile 2.~~|Zelle (1,1)|~~Strikeout (1,2)~~|\n|**`Bold-monospaced`**<br>**`(2,0)`**|_Italic (2,1)_|**_Bold-italic_**<br>**_(2,2)_**|\n|Zelle (3,0)|~~**Bold-strikeout**~~<br>~~**(3,1)**~~|Zelle (3,2)|\n\n"""
435+
assert tabs.to_markdown() == text
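The expected string in `test_md_styles` shows the overall layout of the Markdown output: a header line, a `|---|` separator line, one line per row, and a trailing empty line. A minimal sketch of that layout follows, assuming cell texts have already been extracted. The function name `build_md_table` is hypothetical and the layout is inferred from the test string, not taken from the `to_markdown` source.

```python
def build_md_table(header, rows):
    """Hypothetical helper: assemble a Markdown table string.

    `header` is a list of column names, `rows` a row-major list of lists
    of cell strings. None cells are written as empty strings.
    """
    # Header line and separator line, one "---" per column.
    output = "|" + "|".join(header) + "|\n"
    output += "|" + "|".join("---" for _ in header) + "|\n"

    # One line per table row.
    for row in rows:
        cells = ["" if cell is None else cell for cell in row]
        output += "|" + "|".join(cells) + "|\n"

    # The expected string in the test ends with an extra newline.
    return output + "\n"
```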
