**PDF only**: Remove all **content** contained in any redaction rectangle on the page.
337
337
@@ -2338,18 +2338,28 @@ This is an overview of homologous methods on the :ref:`Document` and on the :ref
2338
2338
2339
2339
The page number "pno" is a 0-based integer `-∞ < pno < page_count`.
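The open lower bound suggests that negative page numbers wrap around, much like negative indices of a Python sequence. A minimal sketch of that normalization follows; ``normalize_pno`` is a hypothetical helper written for illustration and is not part of the library.

```python
def normalize_pno(pno: int, page_count: int) -> int:
    """Map a page number in -inf < pno < page_count onto 0 <= n < page_count.

    Hypothetical helper: it assumes negative numbers count backwards from
    the end of the document and may wrap around more than once.
    """
    if page_count <= 0:
        raise ValueError("document has no pages")
    if pno >= page_count:
        raise ValueError("page not in document")
    # Python's % always returns a non-negative result for a positive divisor,
    # which is the same as adding page_count until the value is >= 0.
    return pno % page_count


last_page = normalize_pno(-1, 10)   # last page of a 10-page document
wrapped = normalize_pno(-25, 10)    # wraps around more than once
```

Under this assumption, ``-1`` addresses the last page and any value at or above ``page_count`` is rejected.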
2340
2340
2341
+
.. note::
2342
+
2343
+
Most document methods (left column) exist for convenience reasons, and are just wrappers for: *Document[pno].<page method>*. So they **load and discard the page** on each execution.
2344
+
2345
+
However, the first two methods work differently. They only need a page's object definition statement - the page itself will **not** be loaded. So e.g. :meth:`Page.get_fonts` is a wrapper the other way round and defined as follows: *page.get_fonts == page.parent.get_page_fonts(page.number)*.
2346
+
2341
2347
2342
2348
.. class:: TableFinder
2343
2349
2344
2350
An object always returned by :meth:`Page.find_tables`. Attributes of interest:
2345
2351
2346
-
... attribute:: tables
2352
+
.. attribute:: tables
2347
2353
2348
-
A list of :ref:`Table` objects, each of which represents a table found on the page. Empty list if no table found.
2354
+
A list of `Table` objects, each of which represents a table found on the page. Empty list if no table found.
2349
2355
2350
-
... attribute:: page
2356
+
.. attribute:: page
2351
2357
2352
-
A reference to the :ref:`Page` object.
2358
+
A reference (weakref proxy) to the owning :ref:`Page` object.
2359
+
2360
+
.. attribute:: cells
2361
+
2362
+
A list of tuples `(x0, y0, x1, y1)` representing the bounding boxes of all table cells (in any tables) found on the page. Note that cells may also be ``None`` objects, which are created to enforce a complete rows x columns structure for the affected table.
2353
2363
2354
2364
2355
2365
.. class:: Table
@@ -2360,25 +2370,85 @@ The page number "pno" is a 0-based integer `-∞ < pno < page_count`.
2360
2370
2361
2371
The bounding box of the table given as a tuple `(x0, y0, x1, y1)`. This is the rectangle that contains all cells of the table.
2362
2372
2363
-
2364
-
2365
2373
.. attribute:: cells
2366
2374
2375
+
A list of tuples `(x0, y0, x1, y1)` representing the bounding boxes of the cells in the table. Note that cells may also be ``None`` objects, which will happen to prevent gaps in a rows x columns structure.
2376
+
2377
+
.. attribute:: rows
2378
+
2379
+
A list of :ref:`TableRow` objects, each of which represents a row in the table. The order of rows is the same as in the original table. If the table has no rows, this will be an empty list.
2380
+
2381
+
.. attribute:: col_count
2382
+
2383
+
The number of columns in the table (integer).
2384
+
2385
+
.. attribute:: row_count
2386
+
2387
+
The number of rows in the table (integer).
2388
+
2389
+
.. method:: extract()
2390
+
2391
+
Returns a (row-major) list of lists representing the plain text of the table cells. Each sublist contains the text of one row, and each item in that sublist is the text of one cell in that row. So, `Table.extract()[i][j]` will return the text of the cell in row ``i`` and column ``j``. If a cell is empty, the corresponding item will be an empty string. If the corresponding bounding box is ``None``, the item will also be ``None``.
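The difference between an empty string and ``None`` is easy to overlook, so here is a small sketch of how a caller might tell the two apart. The ``rows`` value is made-up sample data in the shape described above, not output from a real document, and ``describe`` is a hypothetical helper.

```python
# Hypothetical result of Table.extract() for a table with 2 rows and 3 columns.
rows = [
    ["Name", "Qty", "Price"],
    ["Widget", "", None],
]


def describe(cell):
    """Classify one extracted item."""
    if cell is None:
        return "no cell"      # the cell's bounding box is None
    if cell == "":
        return "empty cell"   # the cell exists but contains no text
    return cell


# Row-major access: rows[i][j] is the item in row i, column j.
labels = [[describe(cell) for cell in row] for row in rows]
```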
+
.. method:: to_markdown(clean=False, fill_empty=True)
+
Returns a string in `GitHub Markdown format <https://docs.github.com/en/get-started/writing-on-github/working-with-advanced-formatting/organizing-information-with-tables>`_ representing the table. The string will contain a header line with column names, followed by a separator line, and then the rows of the table. The text of each cell will be enclosed in pipe characters `|`, and each row will be separated by a newline character `\n`. **Line breaks inside a cell** are replaced by the HTML `<br>` tag. Bold, italic, mono-spaced and strikethrough text will be styled according to the corresponding Markdown syntax.
2396
+
2397
+
- Bold text will be enclosed in double asterisks ``"**"``.
2398
+
2399
+
- Italic text will be enclosed in single underscore ``"_"``.
2400
+
2401
+
- Mono-spaced text will be enclosed in backticks ``"`"``.
2402
+
2403
+
- Strikethrough text will be enclosed in double tildes ``"~~"``.
2404
+
2405
+
:arg bool clean: if ``True``, any hyphen "-" in the text is replaced by its HTML entity ``"&#45;"``.
2406
+
2407
+
:arg bool fill_empty: if ``True``, empty cells will be filled with a copy of neighboring cells in an effort to indicate potential column and row spans.
2408
+
2409
+
* For each row and starting with index 1, the cell content will be replaced with the content of its left neighbor if it is ``None``.
2410
+
2411
+
* For each column and starting with index 1, the cell content will be replaced with the content of its upper neighbor if it is ``None``.
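The two fill rules can be sketched in plain Python as follows. This is an illustration only: ``fill_empty_cells`` is a hypothetical name, the grid is assumed to be rectangular, and applying the row pass before the column pass is an assumption taken from the order of the bullets.

```python
def fill_empty_cells(rows):
    """Return a copy of a rectangular grid with None items filled from neighbors."""
    filled = [list(row) for row in rows]  # leave the input untouched

    # Row pass: starting at column index 1, copy the left neighbor into None items.
    for row in filled:
        for j in range(1, len(row)):
            if row[j] is None:
                row[j] = row[j - 1]

    # Column pass: starting at row index 1, copy the upper neighbor into None items.
    for i in range(1, len(filled)):
        for j in range(len(filled[i])):
            if filled[i][j] is None:
                filled[i][j] = filled[i - 1][j]

    return filled


grid = [
    ["A", "B", "C"],
    ["x", None, "z"],
    [None, "y", None],
]
result = fill_empty_cells(grid)
```

In this sample, the ``None`` in the first column of the last row survives the row pass, because it has no left neighbor, and is only filled by the column pass.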
2412
+
2413
+
2414
+
.. method:: to_pandas()
2415
+
2416
+
Returns a `pandas.DataFrame <https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html>`_ representing the table. The DataFrame offers a plethora of functions, among them conversion to 20+ file formats (CSV, markdown, JSON, Excel, HDF5 etc.). Where necessary, the table can be refined in multiple ways (e.g. deleting empty rows or columns) and multiple DataFrames can be joined.
2367
2417
2368
2418
2369
2419
.. class:: TableHeader
2370
2420
2421
+
.. attribute:: names
2422
+
2423
+
A list of strings representing the column names of the `Table`. This is usually the text content of the top row cells, but may instead be content identified above the detected table. The respective situation is encoded in the following attribute.
2424
+
2425
+
.. attribute:: is_external
2426
+
2427
+
Whether the header is part of the originally detected table (``False``) or was identified above the table (``True``). If ``True``, the header is not part of the table, but is used to identify the columns in the table. In this case, the header text will be used as column names in the extracted data.
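As a rough sketch of how a consumer might use this flag when pairing column names with data rows: the helper ``split_header`` below is hypothetical, and it assumes that an internal header occupies the first extracted row, while an external header leaves every extracted row as data.

```python
def split_header(names, is_external, extracted_rows):
    """Return (column_names, data_rows) for row-major extracted text.

    Assumption: if the header is internal, extracted_rows[0] repeats the
    header and is skipped; if it is external, all rows are data.
    """
    data_rows = extracted_rows if is_external else extracted_rows[1:]
    return list(names), [list(row) for row in data_rows]


# Internal header: the top row of the table carries the column names.
cols_int, data_int = split_header(
    ["Name", "Qty"], False, [["Name", "Qty"], ["Widget", "3"]]
)

# External header: the names were found above the table.
cols_ext, data_ext = split_header(
    ["Name", "Qty"], True, [["Widget", "3"], ["Gadget", "5"]]
)
```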
2428
+
2429
+
.. attribute:: bbox
2430
+
2431
+
The bounding box of the header given as a tuple `(x0, y0, x1, y1)`. This is the rectangle that contains all cells of the header. If the header is not part of the table, this will be the rectangle that contains all cells of the header text, otherwise it is equal to the top row's bounding box.
2432
+
2433
+
.. attribute:: cells
2434
+
2435
+
A list of bounding box tuples `(x0, y0, x1, y1)` of the cells in the header. Note that cells may also be ``None``, which will happen to prevent any gaps in a rows x columns structure. If the header is not part of the table, this will be the bounding boxes of the header text.
2436
+
2437
+
2371
2438
.. class:: TableRow
2372
2439
2440
+
An object defining a row in a `Table` found on the page. Attributes of interest:
2373
2441
2442
+
.. attribute:: bbox
2374
2443
2444
+
The bounding box of the row given as a tuple `(x0, y0, x1, y1)`. This is the rectangle that contains all cells of the row.
2375
2445
2446
+
.. attribute:: cells
2447
+
2448
+
A list of bounding box tuples `(x0, y0, x1, y1)` of the cells in this row. Note that cells may also be ``None`` objects, which will happen to prevent gaps in a rows x columns structure.
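Because ``cells`` can contain ``None``, code that iterates over it needs to skip those entries. The sketch below rebuilds a containing rectangle from a list of cells; ``union_bbox`` is a hypothetical helper, and the coordinates are made-up sample values.

```python
def union_bbox(cells):
    """Smallest rectangle (x0, y0, x1, y1) containing all non-None cells.

    Returns None if no real cell is present.
    """
    boxes = [cell for cell in cells if cell is not None]
    if not boxes:
        return None
    return (
        min(box[0] for box in boxes),  # leftmost x0
        min(box[1] for box in boxes),  # topmost y0
        max(box[2] for box in boxes),  # rightmost x1
        max(box[3] for box in boxes),  # bottommost y1
    )


row_cells = [(10, 20, 50, 40), None, (50, 20, 90, 45)]
row_bbox = union_bbox(row_cells)
```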
2376
2449
2377
-
.. note::
2378
2450
2379
-
Most document methods (left column) exist for convenience reasons, and are just wrappers for: *Document[pno].<page method>*. So they **load and discard the page** on each execution.
2380
2451
2381
-
However, the first two methods work differently. They only need a page's object definition statement - the page itself will **not** be loaded. So e.g. :meth:`Page.get_fonts` is a wrapper the other way round and defined as follows: *page.get_fonts == page.parent.get_page_fonts(page.number)*.