Page content conversion suitable as input for LLM / RAG #3293
Replies: 4 comments 10 replies
-
We already have a solution for your request. Let me transfer your post to the "Discussions".
-
I understood you need to extract page content in a way that
Tables also contain text. We want to process them separately and prevent their content from being picked up by standard text extraction.
Starting with an empty string, we iterate over this list of rectangles and, depending on rectangle type, we either
Here is some suggested code to loop over the pages:

import fitz  # PyMuPDF

# Assumed to be defined elsewhere: doc (an open document), out (a writable
# text stream) and write_text(page, rect), which returns MD text for a rectangle.
for page in doc:
    # 1. Locate all tables on page
    tabs = page.find_tables()
    # 2. Make a list of table boundary boxes, sorted vertically by top-left corner
    tab_rects = sorted(
        [fitz.Rect(t.bbox) for t in tabs],
        key=lambda r: (r.y0, r.x0),
    )
    # 3. Compute final list of all text and table rectangles
    text_rects = []
    # Compute non-table rectangles and fill final rect list
    for i, r in enumerate(tab_rects):
        if i == 0:  # compute rect above all tables
            tr = page.rect  # start with full page rect as template
            tr.y1 = r.y0  # set bottom to top of table rect
            if not tr.is_empty:  # there is room above 1st table
                text_rects.append(("text", tr))
            text_rects.append(("table", r))  # append 1st table rect
            continue
        # read previous rectangle in final list: always a table!
        _, r0 = text_rects[-1]
        # check if a non-empty text rect fits in between the tables
        tr = page.rect  # page rect as template
        tr.y0 = r0.y1  # modify top
        tr.y1 = r.y0  # ... and bottom
        if not tr.is_empty:  # may be empty!
            text_rects.append(("text", tr))
        text_rects.append(("table", r))
        # Don't forget text that may be below all tables
        if i == len(tab_rects) - 1:
            tr = page.rect
            tr.y0 = r.y1
            if not tr.is_empty:
                text_rects.append(("text", tr))
    if text_rects == []:  # this happens if page has no tables
        text_rects.append(("text", page.rect))
    else:
        # covers the single-table case, which the loop leaves via "continue"
        rtype, r = text_rects[-1]
        if rtype == "table":
            tr = page.rect
            tr.y0 = r.y1
            if not tr.is_empty:
                text_rects.append(("text", tr))
    # we have all rectangles and can start the output
    for rtype, r in text_rects:
        if rtype == "text":  # a text rectangle
            out.write(write_text(page, r))  # write MD content
        else:  # a table rect
            for tab in tabs:
                # initial sort may have changed sequence, so we need
                # to look up the right table for this rectangle.
                if fitz.Rect(tab.bbox) == r:
                    # output table in MD format
                    out.write(tab.to_markdown())
                    break  # each rectangle belongs to exactly one table

We are in the process of creating a "standard" script for this purpose, so you will soon find it in one of our pymupdf repositories.
for block in page.get_text("dict", clip=r, sort=True)["blocks"]:
    if block["type"] == 1:  # this is an image block
        pass  # process the image in some way
    else:  # this is a text block
        pass  # extract / process the text

Details on the structure of each of the above
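To make the ordering question concrete, here is a minimal sketch of how the block list could be turned into a text/image reading order. It works on plain dictionaries shaped like the entries of `page.get_text("dict")["blocks"]` (each has a `"type"` of 0 for text or 1 for image, and a `"bbox"`). The helper name `reading_order` and the sample coordinates are made up for illustration; they are not part of PyMuPDF.

```python
from typing import Any, Dict, List, Tuple


def reading_order(blocks: List[Dict[str, Any]]) -> List[Tuple[str, Tuple[float, ...]]]:
    """Label each block as "text" or "image" and sort top-to-bottom, left-to-right."""
    labelled = [
        ("image" if b["type"] == 1 else "text", tuple(b["bbox"]))
        for b in blocks
    ]
    # bbox is (x0, y0, x1, y1); sort by the top edge first, then by the left edge
    return sorted(labelled, key=lambda item: (item[1][1], item[1][0]))


# Hand-made stand-ins for page.get_text("dict")["blocks"] (hypothetical values).
sample_blocks = [
    {"type": 1, "bbox": (72, 300, 400, 500)},
    {"type": 0, "bbox": (72, 80, 500, 280)},
    {"type": 0, "bbox": (72, 520, 500, 700)},
]
order = reading_order(sample_blocks)
print([kind for kind, _ in order])  # prints ['text', 'image', 'text']
```

With a real page you would pass `page.get_text("dict")["blocks"]` instead of `sample_blocks`. Sorting by the top-left corner is only a heuristic and may not match the visual reading order on multi-column pages.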
-
Hm - I think it does work.
Please let me have a problem page.
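As a rough illustration of why text between two tables is picked up, here is a simplified, PyMuPDF-free sketch of the rectangle-splitting idea. It is not the script itself: `split_page`, the tuple-based `Rect` and the sample coordinates are assumptions made for this example, and it assumes that tables are stacked vertically rather than placed side by side.

```python
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1), y grows downward


def split_page(page_rect: Rect, table_rects: List[Rect]) -> List[Tuple[str, Rect]]:
    """Return ("text" | "table", rect) pairs in top-to-bottom order."""
    x0, top, x1, bottom = page_rect
    result: List[Tuple[str, Rect]] = []
    cursor = top  # bottom edge of the area emitted so far
    for tab in sorted(table_rects, key=lambda r: (r[1], r[0])):
        if tab[1] > cursor:  # there is room for text above this table
            result.append(("text", (x0, cursor, x1, tab[1])))
        result.append(("table", tab))
        cursor = max(cursor, tab[3])
    if bottom > cursor:  # text below the last table, or a page without tables
        result.append(("text", (x0, cursor, x1, bottom)))
    return result


page = (0, 0, 600, 800)
tables = [(50, 500, 550, 650), (50, 100, 550, 300)]  # deliberately unsorted
layout = split_page(page, tables)
for kind, rect in layout:
    print(kind, rect)
```

For this sample page the result alternates text, table, text, table, text, so the strip between the two tables gets its own text rectangle.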
________________________________
From: AniketModi ***@***.***>
Sent: Friday, March 22, 2024 07:07
To: pymupdf/PyMuPDF ***@***.***>
Cc: Jorj X. McKie ***@***.***>; Comment ***@***.***>
Subject: Re: [pymupdf/PyMuPDF] Page content conversion suitable as input for LLM / RAG (Discussion #3293)
The script will not handle the use case below (where there is more than one table and there is text in between the tables), right?
The page has content in the following order:
1. text
2. table
3. text
4. table
-
I tested the entire flow before putting it into production, but find_tables() is very CPU-intensive. Here is a sample PDF to try. Can you suggest approaches that would reduce the CPU usage?
-
Is your feature request related to a problem? Please describe.
I have a PDF containing text and images. I want to parse it so that I can extract the text. For each image, I will generate a summary using an LLM and store it in our database. If a page contains both text and an image, I want to know their order, that is, whether the text or the image comes first on the page.
Describe the solution you'd like
Maybe we can use coordinates or something similar to determine the position of the text and the image on the PDF page.
Describe alternatives you've considered
Are there several options for how your request could be met?
Additional context
Add any other context or screenshots about the feature request here.