Column Boundaries do not match when drawn using OpenCV #2867

rudra0713 · 2023-12-04T22:24:14Z

rudra0713
Dec 4, 2023

I have a sample PDF that I converted to images using the pdf2image tool. I used the script here,
https://github.com/pymupdf/PyMuPDF-Utilities/blob/master/text-extraction/multi_column.py
to obtain the column boundaries for that PDF. When I draw the column boundaries onto the PDF following the instructions in the "main" function of the multi_column.py script, they look fine. However, when I use OpenCV to draw the same boundaries on the PDF page images, they no longer match.

This compatibility is important to me because I already use tools like Doctr and Tesseract to get token-level annotations, and the token boundaries I get from those are perfectly fine when drawn using OpenCV.
Is there a conversion ratio for the bounding boxes from PyMuPDF so that they can be adapted for use with OpenCV?

Alternatively, is there a way to use an image as input for PyMuPDF? I tried doing that, but the computed column boundaries are empty when an image is given as input.


JorjMcKie · 2023-12-05T09:16:56Z

JorjMcKie
Dec 5, 2023
Maintainer

This is no issue but a typical Discussions post - converting ...

0 replies

JorjMcKie · 2023-12-05T09:18:37Z

JorjMcKie
Dec 5, 2023
Maintainer

You did not include the PDF, but from looking at the picture, probably all you need to do is page.clean_contents() or page.wrap_contents() before you draw those rectangles.

7 replies

JorjMcKie Dec 5, 2023
Maintainer

On another note, PyMuPDF/MuPDF use a page geometry where point (0,0) is the top-left of the page.
In PDF this is the bottom-left of a page.
I don't know what these other packages assume, but chances are they also use PDF geometry. In which case you must transform the rectangles produced by PyMuPDF back to PDF's coordinate system.

JorjMcKie Dec 5, 2023
Maintainer

This can be done by multiplying all PyMuPDF rectangles with the inverse of the page transformation matrix, like rect * ~page.transformation_matrix.
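For an unrotated page whose origin is at (0, 0), the flip between the two coordinate systems amounts to mirroring the y values against the page height. The following is a minimal sketch of that arithmetic only; the helper name `top_left_to_pdf` is hypothetical, and it assumes no page rotation or crop-box offset (which `page.transformation_matrix` would handle). Note that OpenCV image arrays also place (0, 0) at the top-left, so this flip matters only for tools that expect PDF's native bottom-left origin.

```python
def top_left_to_pdf(rect, page_height):
    """Convert (x0, y0, x1, y1) with a top-left origin to a bottom-left origin.

    Assumes an unrotated page whose origin is at (0, 0); hypothetical helper.
    """
    x0, y0, x1, y1 = rect
    # The top edge (y0) becomes the upper y value measured from the bottom.
    return (x0, page_height - y1, x1, page_height - y0)


if __name__ == "__main__":
    # A box 100 points below the top of an A4 page (height 842 points).
    print(top_left_to_pdf((72, 100, 300, 200), 842))
```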

rudra0713 Dec 5, 2023
Author

Hi @JorjMcKie,
The green rectangles were drawn with OpenCV, not PyMuPDF.
Thanks for your suggestion, but it did not solve my issue.
Here's my sample code:

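import cv2
import fitz  # PyMuPDF
from multi_column import column_boxes  # multi_column.py from PyMuPDF-Utilities, saved next to this script

pdf_file_path = "sample.pdf"
image_path = "sample_pdf_image.jpg"  # path for the image of a specific page in the sample.pdf
page_number = 0  # index of that page (assumed here; not given in the original post)

image = cv2.imread(image_path)
doc = fitz.open(pdf_file_path)

page = doc[page_number]
page.wrap_contents()

bboxes = column_boxes(page, footer_margin=50, header_margin=50)

for rect_org in bboxes:
    # Attempted fix: map PyMuPDF coordinates back to PDF coordinates
    rect = rect_org * ~page.transformation_matrix
    # cv2.rectangle needs integer pixel coordinates
    top_left = (int(rect[0]), int(rect[1]))
    bottom_right = (int(rect[2]), int(rect[3]))
    cv2.rectangle(image, top_left, bottom_right, (0, 255, 0), 2)

filename_write = "sample_pdf_image_with_column_boundaries.jpg"
cv2.imwrite(filename_write, image)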

Even after using the transformation matrix, the rectangles remain in the same position as before (all on the left, as shown by the green boxes in the attached image above).

I did some additional experiments. I create the PDF images with the pdf2image tool, where I had set the dpi value to 300. I then tried dpi values of 50, 60, 70, ..., 100, 200, and 300. When the dpi value is set to 70, the rectangles from PyMuPDF are closer to the column boundaries, but still not completely aligned. I have attached the picture where I generated the image at 70 dpi and drew the rectangles from PyMuPDF using OpenCV (before using the transformation matrix).

I am still looking for a better solution.
Is there a way to give PyMuPDF the image directly instead of the PDF? When I tried that, it did not throw any error, but the tool returned no rectangles.

JorjMcKie Dec 6, 2023
Maintainer

Your last comment sheds some more light on the problem.

You are determining some bounding boxes (bboxes) for text on the page.
Your downstream tools (such as OpenCV) work on the page converted to an image at a certain resolution. This introduces the following complications:
- An image has integer dimensions, so everything on the image is addressed by integer coordinates. A PDF page has float dimensions: width and height need not be integers, and the text bboxes have float coordinates.
- The chosen image resolution (DPI) also changes the dimensions: an A4 PDF page (width 595.0, height 842.0) will be turned into a 1240x1755 image when rendered with a DPI of 150.
- You therefore need to convert all your previously generated bboxes to image coordinates, which you then hand down to OpenCV.
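The arithmetic behind these points can be sketched in plain Python. PDF page units are points (1/72 inch), so rendering at a given DPI scales coordinates by DPI/72. The function names here are hypothetical, and rounding the page size up is an assumption chosen so that the result agrees with the 1240x1755 figure quoted above; an actual renderer may round differently.

```python
import math

POINTS_PER_INCH = 72  # PDF page units: 1 point = 1/72 inch


def page_size_in_pixels(width_pt, height_pt, dpi):
    """Image size for a page rendered at `dpi` (rounded up; an assumption)."""
    return (
        math.ceil(width_pt * dpi / POINTS_PER_INCH),
        math.ceil(height_pt * dpi / POINTS_PER_INCH),
    )


def bbox_to_pixels(bbox, dpi):
    """Scale a float (x0, y0, x1, y1) bbox in points to integer pixel coordinates."""
    return tuple(round(v * dpi / POINTS_PER_INCH) for v in bbox)


if __name__ == "__main__":
    print(page_size_in_pixels(595.0, 842.0, 150))
    print(bbox_to_pixels((72.0, 72.0, 288.0, 144.0), 150))
```

At 72 DPI the scale factor is 1, which is consistent with the observation earlier in the thread that rectangles drawn on a 70 DPI image were nearly, but not exactly, aligned.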

To do this, take the DPI that was used to render the page image handed to OpenCV and make a Pixmap at that resolution: pix = page.get_pixmap(dpi=DPI). Then compute a Matrix that converts page coordinates to Pixmap coordinates: mat = page.rect.torect(pix.irect).

Every rectangle in PyMuPDF has method torect as shown.

Then, for every bbox from above do bbox * mat to obtain its image coordinate version. If you multiply a rectangle with a Matrix like this, the result will again be a rectangle.

Answer selected by rudra0713

rudra0713 Dec 7, 2023
Author

Thanks a lot for your suggestion. It worked.
