⚠️ I may have found a way to improve extraction but I need your help ⚠️ #2255

sergenti · 2023-02-25T13:36:46Z

Hey there, I'm working on a PDF parsing project.

PyMuPDF is arguably one of the best libraries to extract text from a PDF, but if there's a table, the output sucks.
To improve the parsing even more, I'm using an AI model to detect tables and remove them from the page.

🙋‍♂️ approach: draw a white rectangle over every table
🤔 problem: PyMuPDF still extracts the characters under the white rectangle
🧐 question: is there a way to delete all the elements under a certain rectangle or something similar?

  import fitz  # PyMuPDF
  from PIL import Image

  # input_file, DPI, THRESH, model, get_coordinates and clean_page
  # are defined elsewhere in the project
  TEXT = ""

  # open pdf
  with fitz.open(input_file) as pdf:

      # Loop over each page in the PDF
      for page in pdf:

          # convert page into image
          pix = page.get_pixmap(dpi=DPI)
          image = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)

          # find the coordinates of all the tables
          boxes = get_coordinates(image, THRESH, model)

          # Draw a white-filled rectangle on every table
          page = clean_page(page, boxes, DPI)

          # Extract text from the edited page
          TEXT += page.get_text()

TL;DR: Table parsing is terrible. Can we ignore all tables in Page.get_text()?

JorjMcKie · 2023-02-25T16:33:42Z

Maintainer

This is not an issue but a "Discussions" item, so I will transfer it first.

5 replies

JorjMcKie Feb 25, 2023
Maintainer

If a table's bbox is known and its cells are delimited by borders, there is a script that extracts the table and outputs it as CSV.

Your other question, about removing table content:
This is possible using redaction annotations. You can remove text, links and images inside a rectangular area. Line art (drawings), however, cannot be redacted.
So your red rectangle above can be emptied of all text.

sergenti Feb 25, 2023
Author

Hi Jorj, sorry for putting this as an issue; this is my first time contributing to a library, and I wasn't sure where to write it.

Sounds awesome! Could you show me how to use the "redaction annotations" in PyMuPDF? Is there an example in the docs?

sergenti Feb 25, 2023
Author

found it

def clean_page(page, boxes, DPI):
    """Remove text, links and images inside every table box on the page."""

    # loop through every box (nothing happens if there are no tables)
    for box in boxes:

        # convert from image pixels to pdf coordinates (72 points per inch)
        box = [x * 72 / DPI for x in box]

        # add a redaction annotation covering the table
        page.add_redact_annot(box)

    # remove the content inside all redaction annotations
    page.apply_redactions()
    return page

This does not delete the table lines, but it improves the parsing anyway :)

JorjMcKie Feb 26, 2023
Maintainer

Absolutely right - congratulations.
Removal of line art is not supported (yet) as part of redactions.

JorjMcKie Feb 26, 2023
Maintainer

"Hi Jorj, sorry for putting this as an issue; this is my first time contributing to a library, and I wasn't sure where to write it."

No worries - that's fine. Indeed, it is not always clear whether something is a bug, an enhancement, or just a feature not yet known to someone new to PyMuPDF.
