⚠️ I may have found a way to improve extraction but I need your help ⚠️ #2255
Hey there, I'm working on a PDF parsing project. PyMuPDF is arguably one of the best libraries for extracting text from a PDF, but when a page contains a table, the output sucks.

🙋‍♂️ My approach: draw a white rectangle over every table

import fitz  # PyMuPDF
from PIL import Image

# open pdf
with fitz.open(input_file) as pdf:
    # Loop over each page in the PDF
    for page in pdf:
        # convert page into image
        pix = page.get_pixmap(dpi=DPI)
        image = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
        # find the coordinates of all the tables
        boxes = get_coordinates(image, THRESH, model)
        # Draw a white-filled rectangle on every table
        page = clean_page(page, boxes, DPI)
        # Extract text from the edited page
        TEXT += page.get_text()
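The post does not show `clean_page`, so the following is only a sketch of one step it would probably need: the table boxes are detected on an image rendered at `DPI`, while PyMuPDF page coordinates are in points (72 per inch), so the boxes have to be rescaled before anything is drawn on the page. The helper name `pixel_boxes_to_points` and the `(x0, y0, x1, y1)` box layout are assumptions for illustration, not part of the original code, and the sketch assumes an unrotated page rendered in full.

```python
from typing import Iterable, List, Tuple

# Assumed box layout: (x0, y0, x1, y1), top-left origin, as on the rendered image.
Box = Tuple[float, float, float, float]

POINTS_PER_INCH = 72.0  # PDF page coordinates are measured in points


def pixel_boxes_to_points(boxes: Iterable[Box], dpi: float) -> List[Box]:
    """Scale boxes found on a `dpi` render into PDF point coordinates.

    Hypothetical helper: assumes the pixmap covers the whole, unrotated page.
    """
    if dpi <= 0:
        raise ValueError("dpi must be positive")
    scale = POINTS_PER_INCH / dpi
    return [
        (x0 * scale, y0 * scale, x1 * scale, y1 * scale)
        for x0, y0, x1, y1 in boxes
    ]


# Example: one table detected on a 300 DPI render of the page
print(pixel_boxes_to_points([(300, 600, 1500, 1200)], dpi=300))
```

Each scaled box could then be turned into a `fitz.Rect` inside `clean_page`. One thing to keep in mind: a rectangle drawn with `page.draw_rect` only covers the table visually, and `page.get_text()` still returns the text underneath. If the goal is to keep table text out of the extraction, redaction annotations (`page.add_redact_annot(rect, fill=(1, 1, 1))` followed by `page.apply_redactions()`) actually remove the covered text.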
Replies: 1 comment 5 replies
This is not an issue but a "Discussions" item, so I will transfer it first.