Page content conversion suitable as input for LLM / RAG #3293

AniketModi · 2024-03-22T05:25:11Z

AniketModi
Mar 22, 2024

Is your feature request related to a problem? Please describe.
I have a PDF containing text and images. I want to parse the PDF and extract its text. For each image, I will generate a summary using an LLM and store it in our database. If a page has both images and text, I want to preserve their order, i.e. know whether the text or the image comes first on the page.

Describe the solution you'd like
Maybe we can use coordinates or something similar to determine the position of the text and images on the PDF page.


JorjMcKie · 2024-03-22T09:02:26Z

JorjMcKie
Mar 22, 2024
Maintainer

We already have a solution for your request. Let me transfer your post to the "Discussions".

2 replies

JorjMcKie Mar 22, 2024
Maintainer

BTW I am also taking the liberty to rename the title of your post to clarify its purpose.

AniketModi Mar 22, 2024
Author

Can you please suggest a solution for it? I couldn't find any valid solution.

JorjMcKie · 2024-03-22T10:10:50Z

JorjMcKie
Mar 22, 2024
Maintainer

As I understand it, you need to extract page content in a way that

Maintains natural reading sequence
Segments by text, images, and tables

Tables also contain text. We want to process them separately and prevent their content from being picked up by standard text extraction.
So we extract them first and avoid their areas (bboxes) when extracting standard text later.
Assuming that standard text (and images) only occurs above and below any table (and not to the table's left or right), we can build a list of page rectangles with the following properties:

Each rectangle has the same width as the page.
Rectangles are sorted from top to bottom.
Each rectangle is either of type "table" or "text/image".

Note: if your page can have text to the left or right of a table, a bit more logic is required to calculate the rectangle list above. No show stopper.
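The three properties of the rectangle list can be illustrated with a minimal sketch on plain (x0, y0, x1, y1) tuples, independent of PyMuPDF. The function name page_bands and the tuple representation are assumptions made for illustration only. The sketch assumes tables do not overlap vertically and widens each table band to the page width, so a real script would still need to keep a reference to the table object behind each band.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def page_bands(page_box: Box, table_boxes: List[Box]) -> List[Tuple[str, Box]]:
    """Split a page into full-width bands, sorted from top to bottom.

    Each band is tagged "table" or "text" ("text" also covers images).
    Hypothetical helper: assumes tables do not overlap vertically.
    """
    px0, py0, px1, py1 = page_box
    bands = []
    cursor = py0  # bottom edge of the last band emitted so far
    for box in sorted(table_boxes, key=lambda b: (b[1], b[0])):
        top, bottom = box[1], box[3]
        if top > cursor:  # there is room for text / images above this table
            bands.append(("text", (px0, cursor, px1, top)))
        bands.append(("table", (px0, top, px1, bottom)))
        cursor = max(cursor, bottom)
    if cursor < py1:  # room below the last table, or a page without tables
        bands.append(("text", (px0, cursor, px1, py1)))
    return bands


if __name__ == "__main__":
    page = (0, 0, 100, 200)
    tables = [(10, 120, 90, 150), (10, 40, 90, 80)]
    for kind, band in page_bands(page, tables):
        print(kind, band)
```

A page with content in the order text, table, text, table, text yields five bands in exactly that order, and a page without tables yields a single "text" band covering the whole page.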

Starting with an empty string, we iterate over this list of rectangles and, depending on rectangle type, we either

extract the table content as a string using the new (version 1.24.0) Table.to_markdown() method, or
extract the text (including any images).

Here is some suggested code to loop over the pages:

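import fitz  # PyMuPDF

# Assumed to exist already: `doc` is an open fitz.Document and `out` is a
# text file opened for writing.
for page in doc: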
    # 1. Locate all tables on page
    tabs = page.find_tables()

    # 2. Make a list of table boundary boxes, sort vertical by top-left corner
    tab_rects = sorted(
        [fitz.Rect(t.bbox) for t in tabs],
        key=lambda r: (r.y0, r.x0),
    )

    # 3. Compute final list of all text and table rectangles
    text_rects = []
    # Compute non-table rectangles and fill final rect list
    for i, r in enumerate(tab_rects):
        if i == 0:  # compute rect above all tables
            tr = page.rect  # start with full page rect as template
            tr.y1 = r.y0  # set bottom to top of table rect
            if not tr.is_empty:  # there is room above 1st table
                text_rects.append(("text", tr))
            text_rects.append(("table", r))  # append 1st table rect
            continue
        # read previous rectangle in final list: always a table!
        _, r0 = text_rects[-1]

        # check if a non-empty text rect is fitting in between tables
        tr = page.rect  # page rect as template
        tr.y0 = r0.y1  # modify top
        tr.y1 = r.y0  # ... and bottom
        if not tr.is_empty:  # may be empty!
            text_rects.append(("text", tr))

        text_rects.append(("table", r))

        # Don't forget text that may be below all tables
        if i == len(tab_rects) - 1:
            tr = page.rect
            tr.y0 = r.y1
            if not tr.is_empty:
                text_rects.append(("text", tr))
    if text_rects == []:  # this happens if page has no tables
        text_rects.append(("text", page.rect))
    else:
        rtype, r = text_rects[-1]
        if rtype == "table":
            tr = page.rect
            tr.y0 = r.y1
            if not tr.is_empty:
                text_rects.append(("text", tr))

    # we have all rectangles and can start the output
    for rtype, r in text_rects:
        if rtype == "text":  # a text rectangle
            out.write(write_text(page, r))  # write MD content
        else:  # a table rect
            for tab in tabs:
                # initial sort may have changed sequence, so we need
                # to look up the right table for this rectangle.
                if fitz.Rect(tab.bbox) == r:
                    # output table in MD format
                    out.write(tab.to_markdown())

We are in the process of creating a "standard" script for this purpose, so you will soon find it in one of our pymupdf repositories.

The loop above writes the page content to a text file in Markdown format; note the references to the file out in the code.
The function write_text(page, r) used above is basically the following:

for block in page.get_text("dict", clip=r, sort=True)["blocks"]:
    if block["type"] == 1:  # this is an image block
        pass  # process the image in some way
    else:  # this is a text block
        pass  # extract / process the text

Details on the structure of each of the above blocks can be found here.
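As a rough illustration of what such a function might do with the blocks, here is a minimal sketch. The name blocks_to_text and the image placeholder format are assumptions, not part of PyMuPDF. It relies only on the documented layout of the "dict" output: text blocks hold "lines", each line holds "spans", and each span has a "text" string, while image blocks carry "bbox", "ext" and the raw "image" bytes.

```python
def blocks_to_text(blocks):
    """Render the "blocks" list of page.get_text("dict") as plain text.

    Hypothetical helper: images are replaced by a one-line placeholder.
    """
    parts = []
    for block in blocks:
        if block["type"] == 1:  # image block
            x0, y0, x1, y1 = block["bbox"]
            # Placeholder only. A real pipeline might instead send
            # block["image"] (the raw image bytes) to an LLM for a summary.
            parts.append(
                f"[image {block.get('ext', '?')} at "
                f"({x0:.0f}, {y0:.0f}, {x1:.0f}, {y1:.0f})]"
            )
        else:  # text block: lines -> spans -> text
            for line in block["lines"]:
                parts.append("".join(span["text"] for span in line["spans"]))
    return "\n".join(parts) + "\n"


# Intended use inside the page loop (requires PyMuPDF):
#   blocks = page.get_text("dict", clip=r, sort=True)["blocks"]
#   out.write(blocks_to_text(blocks))
```

Taking the block list as an argument, rather than the page, keeps the rendering logic separate from the extraction call and easy to try out on hand-made data.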

2 replies

AniketModi Mar 22, 2024
Author

The code above looks a little complex. Let me understand it and try it.

JorjMcKie Mar 22, 2024
Maintainer

Sure. Later / official versions of the script will hide most of it behind the curtain.

Then you can take the position "no details please".

JorjMcKie · 2024-03-22T11:19:24Z

JorjMcKie
Mar 22, 2024
Maintainer

Hm - I think it does work. Please let me have a problem page.

In reply to the email from AniketModi, sent Friday, March 22, 2024, 07:07, subject "Re: [pymupdf/PyMuPDF] Page content conversion suitable as input for LLM / RAG (Discussion #3293)":

The script will not handle the use case below (where there is more than one table and there is text between the tables), right? The page has content in the following order:

1. text
2. table
3. text
4. table

5 replies

AniketModi Mar 25, 2024
Author

Hi @JorjMcKie ,
the extraction is working. I have just one question: if there is a single image that is made up of several separate blocks, is it possible to extract the whole image at once? I have attached a screenshot for your reference:

AniketModi Mar 25, 2024
Author

In the attached screenshot, when I tried to select the entire image, the selection turned out to be a union of multiple blocks. PyMuPDF likewise returns each block as a separate image. So, is it possible to extract the whole image at once?

JorjMcKie Mar 25, 2024
Maintainer

While not having your PDF at hand, this looks like multiple images and / or vector graphics (a different animal!) are present.
You can either analyze the details of the situation (which is what a programmer is typically interested in) or otherwise take an image of the full page, i.e. render it via page.get_pixmap() and make a PNG or JPEG or whatever from it.

You can easily detect whether the full page is covered by images by joining the rectangles of their bboxes and looking at the size of the intersection of this union rectangle with the page rectangle.

Rectangle area size is computed via the abs() function, so you could implement a check like if abs(page.rect & union_rect) >= abs(page.rect) * 0.95: ... to take the pixmap action path when 95% of the page is covered by images / graphics.
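The check can be sketched without PyMuPDF on plain (x0, y0, x1, y1) tuples. The function names below are assumptions for illustration. With fitz.Rect objects, the union, the intersection and the area correspond to the |, & and abs() operations mentioned above.

```python
def union(boxes):
    """Smallest box containing all given boxes, or None for an empty list."""
    if not boxes:
        return None
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


def intersection_area(a, b):
    """Area of the overlap of two boxes (0 if they do not overlap)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(width, 0) * max(height, 0)


def mostly_covered(page_box, image_boxes, threshold=0.95):
    """True if the joined image boxes cover at least `threshold` of the page."""
    joined = union(image_boxes)
    if joined is None:  # no images at all
        return False
    page_area = intersection_area(page_box, page_box)
    return intersection_area(page_box, joined) >= page_area * threshold


# In PyMuPDF, one way to collect the image boxes (shown as a comment because
# it needs an open page) is:
#   image_boxes = [tuple(info["bbox"]) for info in page.get_image_info()]
```

Note that the joined rectangle can overstate coverage, for example when two small images sit in opposite corners of the page, so the threshold is a heuristic rather than an exact measure.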

AniketModi Apr 5, 2024
Author

The approach has worked for me. I have one more question: I need to do this for multi-column PDFs as well. Can the same code handle multi-column PDFs?

JorjMcKie Apr 5, 2024
Maintainer

Neither PDF nor anything else knows about how many columns a page has.
If you know (or can find out) where the column borders are, you can make corresponding sub-rectangles of the page ("clips") and then make separate pixmaps of the clip areas.
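A minimal sketch of building such clips, assuming the column borders are already known as x-coordinates. The helper name column_clips is hypothetical, and finding the borders is outside its scope.

```python
def column_clips(page_box, borders):
    """Split a page box (x0, y0, x1, y1) into column clips.

    `borders` holds the x-coordinates of the known column borders.
    Clips are returned from left to right and span the full page height.
    """
    x0, y0, x1, y1 = page_box
    edges = [x0] + sorted(borders) + [x1]
    return [(left, y0, right, y1) for left, right in zip(edges, edges[1:])]


if __name__ == "__main__":
    for clip in column_clips((0, 0, 600, 800), [300]):
        print(clip)
    # With PyMuPDF, each clip could then be used as, for example:
    #   pix = page.get_pixmap(clip=fitz.Rect(clip))
    #   text = page.get_text(clip=fitz.Rect(clip), sort=True)
```

Processing the clips from left to right gives the usual reading order for a multi-column page.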

AniketModi · 2024-04-21T19:19:21Z

AniketModi
Apr 21, 2024
Author

I tested the entire flow before putting it into production, but find_tables() is very CPU-intensive. Here is a sample PDF to try. Can you suggest approaches that would reduce the CPU usage?
annual-report-2022-2023.pdf

1 reply

JorjMcKie Apr 21, 2024
Maintainer

You can try and do simple text extraction. Tables are not the predominant object types on the pages.
The document is highly complex, lots and lots of images and vector graphics.
You simply have to expect that processing it will cost a corresponding amount of resources.
As mentioned under issues, other packages either are fast and recognize no tables at all, or recognize a lot and consume even more CPU.
