Table Feedback #3149

@kyliemsauter

Love the library! Here's some misc. feedback relating to table extraction that may or may not be useful to you guys:

  1. The "bboxes" from clean_graphics() could be used to narrow down the potential table candidates. I'm parsing tables in PDFs that contain technical drawings, often made up of thousands of lines each, which ends up taking a while. So I modified it to check whether there's any text inside each "bbox"; if there isn't, I remove that bbox from the list and then filter "paths" to drop any path contained by it. That cuts the time way down.
  2. For the same reason, it would be useful to have find_tables() skip make_edges() when both line strategies (vertical and horizontal) are set to "explicit".
  3. I'm extracting the table contents to CSV, so I needed the text extracted with respect to row/column spans. I modified the Table extract() method and added a to_csv() method (see below). Not sure if those are something you'd consider including in the library as an option, or maybe just adding to the examples/useful scripts section. (I'm not much of a Python programmer, so I'm sure they could be improved, but conceptually they work.)

import fitz  # PyMuPDF


class Table:
    def extract(self, **kwargs) -> list:
        chars = fitz.table.CHARS
        table_arr = [[None] * self.col_count for i in range(self.row_count)]  # final result

        def char_in_bbox(char, bbox) -> bool:
            v_mid = (char["top"] + char["bottom"]) / 2
            h_mid = (char["x0"] + char["x1"]) / 2
            x0, top, x1, bottom = bbox
            return bool(
                (h_mid >= x0) and (h_mid < x1) and (v_mid >= top) and (v_mid < bottom)
            )

        # Map each distinct cell edge coordinate to its index, so that a cell's
        # right/bottom edge can be converted into a column/row span below.
        table_lines = {
            'x0': {x: i for i, x in enumerate(sorted({cell[0] for cell in self.cells}))},
            'x1': {x: i for i, x in enumerate(sorted({cell[2] for cell in self.cells}))},
            'y1': {y: i for i, y in enumerate(sorted({cell[3] for cell in self.cells}))}
        }

        for row_idx, row in enumerate(self.rows):
            row_chars = [char for char in chars if char_in_bbox(char, row.bbox)]

            for cell in row.cells:
                if cell is not None:
                    cell_chars = [
                        char for char in row_chars if char_in_bbox(char, cell)
                    ]

                    if len(cell_chars):
                        kwargs["x_shift"] = cell[0]
                        kwargs["y_shift"] = cell[1]
                        if "layout" in kwargs:
                            kwargs["layout_width"] = cell[2] - cell[0]
                            kwargs["layout_height"] = cell[3] - cell[1]
                        cell_text = fitz.table.extract_text(cell_chars, **kwargs)
                    else:
                        cell_text = ""

                    # Work out how many grid rows/columns this cell covers.
                    col = table_lines['x0'][cell[0]]
                    row_span = (table_lines['y1'][cell[3]] - row_idx) + 1
                    col_span = (table_lines['x1'][cell[2]] - col) + 1

                    # Copy the text into every grid position the cell spans.
                    for i in range(row_span):
                        for j in range(col_span):
                            if row_idx + i < self.row_count and col + j < self.col_count:
                                table_arr[row_idx + i][col + j] = cell_text

        return table_arr

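    def to_csv(self, delimiter=',', new_line='\n'):
        def cell_text(cell):
            if cell is None:
                return ''
            # Quote cells that contain the delimiter or a quote character,
            # doubling any embedded quotes.
            if delimiter in cell or '"' in cell:
                cell = cell.replace('"', '""')
                return '"' + cell + '"'
            return cell

        contents = self.extract()

        # Don't repeat the table caption: if the first cell spans the full
        # width of the first row, keep a single copy of its text.
        first_row = self.rows[0]
        if first_row.cells[0] and first_row.cells[0][2] == first_row.bbox[2]:
            contents[0] = [contents[0][0]]

        content = ''
        for row in contents:
            # Join with the requested delimiter and flatten line breaks inside cells.
            content += delimiter.join([cell_text(cell) for cell in row]).replace(new_line, ' ') + new_line
        return content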
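
For comparison, the quoting in to_csv() could probably be handed off to Python's standard csv module instead of being done by hand. This is only a sketch, not something the library provides: rows_to_csv is a hypothetical helper name, and it assumes the input is the list-of-lists that extract() returns above (strings or None).

```python
import csv
import io


def rows_to_csv(rows, delimiter=",", new_line="\n"):
    """Serialize a list of rows (lists of str or None) to CSV text.

    Hypothetical helper: csv.writer takes care of quoting cells that
    contain the delimiter or a double quote.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter=delimiter, lineterminator=new_line)
    for row in rows:
        # None becomes an empty field; line breaks inside a cell are
        # flattened to spaces, mirroring what to_csv() does.
        writer.writerow(
            ["" if cell is None else cell.replace("\n", " ") for cell in row]
        )
    return buf.getvalue()
```

With something like this, to_csv() would only need to call extract(), apply the caption check, and pass the result on.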

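The filtering idea from point 1 can be sketched roughly as follows. This is not the actual modification, just a minimal illustration under a few assumptions: boxes are plain (x0, y0, x1, y1) tuples, each path is a dict with a "rect" entry, and text_boxes is a list of text rectangles obtained by whatever means is convenient. The names has_text, contains and prune_graphics are made up for this example.

```python
def has_text(bbox, text_boxes):
    """True if the midpoint of any text box lies inside bbox."""
    x0, y0, x1, y1 = bbox
    for tx0, ty0, tx1, ty1 in text_boxes:
        cx, cy = (tx0 + tx1) / 2, (ty0 + ty1) / 2
        if x0 <= cx < x1 and y0 <= cy < y1:
            return True
    return False


def contains(outer, inner):
    """True if rectangle `inner` lies completely within `outer`."""
    return (
        outer[0] <= inner[0]
        and outer[1] <= inner[1]
        and inner[2] <= outer[2]
        and inner[3] <= outer[3]
    )


def prune_graphics(bboxes, paths, text_boxes):
    """Drop graphics bboxes without text and the paths inside them.

    Returns (kept_bboxes, kept_paths). Paths outside every text-free
    bbox are left alone.
    """
    kept, empty = [], []
    for bbox in bboxes:
        (kept if has_text(bbox, text_boxes) else empty).append(bbox)
    kept_paths = [
        p for p in paths if not any(contains(e, p["rect"]) for e in empty)
    ]
    return kept, kept_paths
```

The point of pruning before edge detection is that a text-free drawing with thousands of lines never reaches the table finder at all.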