Automating PDF to JPG and text recognition #2473

maimaik05 · 2023-06-15T23:05:14Z

maimaik05
Jun 15, 2023

Hello!

I'm trying to automate a process that takes a PDF, converts it to a JPG, finds the text on the PDF, and uses the JPG together with the location of the text in Tableau to plot a coordinate. When doing this manually through Adobe, I would take the PDF, export it to a PostScript file, open that as a PDF, then export to a JPG. I want to automate this process, so I'm currently using PyMuPDF to convert the PDF to a JPG.

I have the script working but the coordinates are not in line with the JPG, and I'm not sure what the problem is. I use PDFMiner to do the text recognition on the PDF; does that have something to do with it? I realize PyMuPDF has text recognition abilities, but I am building off of someone else's work so I would prefer not to change too much of it if possible!

I'm not sure if it could be that the PostScript step flattened the PDF and without that step it's not sizing up correctly? I did try doc.save('filename', garbage = 4,deflate = True) but that didn't change anything. It's been difficult to pinpoint the issue here, as you can see.

I can send the code privately if needed, if that's ok.
Also this is my first time posting anything here, let me know if you need more info/detail.

JorjMcKie · 2023-06-15T23:53:06Z

JorjMcKie
Jun 15, 2023
Maintainer

but the coordinates are not in line with the JPG

Maybe we end up looking at a file example, but let me first ask what the problem is specifically here:
Not matching at all (completely some place else) or just a bit mispositioned?

I can't recall right now which coordinate system pdfminer is actually using (PDF?).

When you extract text together with its coordinates on the PDF page, then you will be dealing with points (1 inch = 72 points) and these are floats. The PDF standard coordinate system uses the page's bottom-left point as (0, 0). And maybe pdfminer is doing this too.
Image addressing in contrast works with pixels using integers, and point (0, 0) is the top-left corner.

PyMuPDF has several "geometry" objects like rectangles and matrices to deal with all that. Among other things, there is a function that computes a coordinate system transformation matrix to convert e.g. PDF page coordinates to corresponding coordinates inside an image that was rendered from that page.
Potentially that might help you.
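To make the two coordinate systems concrete, here is a minimal sketch in plain Python (no PyMuPDF or pdfminer calls). The function name `pdf_point_to_pixel` and its parameters are made up for illustration, and it assumes the image was rendered from the whole page at a uniform resolution, with no rotation or cropping.

```python
def pdf_point_to_pixel(x_pt, y_pt, page_height_pt, dpi=72):
    """Map a PDF-style point (origin bottom-left, y up, units of 1/72 inch)
    to an image pixel (origin top-left, y down, integer coordinates).

    Hypothetical helper: assumes the image covers the full page.
    """
    zoom = dpi / 72.0                     # pixels per point
    px = x_pt * zoom                      # x keeps its direction
    py = (page_height_pt - y_pt) * zoom   # y is flipped top-to-bottom
    return int(round(px)), int(round(py))


# A point one inch from the left and one inch below the top edge
# of a 612 x 792 pt page, rendered at 144 dpi:
print(pdf_point_to_pixel(72, 720, page_height_pt=792, dpi=144))  # (144, 144)
```

If the text coordinates already use a top-left origin, the flip in `py` must be dropped; applying it twice is a typical source of vertically mirrored positions.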

8 replies

JorjMcKie Jun 19, 2023
Maintainer

Won't work like that.
I suggest you look at the source code of method .torect() in the Rect class and try to replicate that logic in whatever your preferred environment is.
It is Python code, it should be possible to do that with reasonable effort.
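For readers without the source at hand, the following is a rough sketch of what such a rectangle-to-rectangle matrix computation could look like in plain Python. It is a reimplementation under my own assumptions, not a copy of the PyMuPDF code: rectangles are plain `(x0, y0, x1, y1)` tuples, the matrix is a 6-tuple `(a, b, c, d, e, f)`, and both rectangles use the same axis orientation. The helper names `torect` and `apply_matrix` are chosen here for illustration.

```python
def torect(src, dst):
    """Return (a, b, c, d, e, f) mapping rectangle src onto rectangle dst.

    Sketch only: shift src to the origin, scale it to dst's size,
    then shift it to dst's top-left corner.
    """
    sx0, sy0, sx1, sy1 = src
    dx0, dy0, dx1, dy1 = dst
    sw, sh = sx1 - sx0, sy1 - sy0
    dw, dh = dx1 - dx0, dy1 - dy0
    if sw <= 0 or sh <= 0 or dw <= 0 or dh <= 0:
        raise ValueError("rectangles must be finite and not empty")
    a = dw / sw            # horizontal scale
    d = dh / sh            # vertical scale
    e = dx0 - sx0 * a      # horizontal shift after scaling
    f = dy0 - sy0 * d      # vertical shift after scaling
    return (a, 0.0, 0.0, d, e, f)


def apply_matrix(m, x, y):
    """Transform point (x, y) with the 6-value matrix m."""
    a, b, c, d, e, f = m
    return (a * x + c * y + e, b * x + d * y + f)


page = (0, 0, 612, 792)      # page rectangle in points
image = (0, 0, 1224, 1584)   # image rectangle in pixels (rendered at 2x)
m = torect(page, image)
print(apply_matrix(m, 306, 396))  # page center -> (612.0, 792.0), image center
```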

maimaik05 Jun 20, 2023
Author

I'm trying to find the source code, but can't seem to locate it. Which folder is it in?

JorjMcKie Jun 20, 2023
Maintainer

Look here, line 512.

maimaik05 Jun 20, 2023
Author

I'm sorry, I'm having trouble seeing how I can use the returned matrix to save it as a pdf. Can it be used in one of the parameters when using Document.save()?

JorjMcKie Jun 20, 2023
Maintainer

No. But you can compute it as described and store its 6 values in a JSON file.
As I wrote, that matrix transforms document page coordinates to image coordinates. If all pages of the document have the same width/height, then so do the produced images, and you can use the same matrix for all pages of the document.
In your application, read the JSON file (which will contain 6 floats) and use it for coordinate computation.
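As an illustration of that round trip, here is a small sketch using only the standard library. The file name `page_to_image.json`, the example matrix values, and the helper `to_image` are placeholders chosen for this example; the matrix is assumed to be stored as a flat list in the order `(a, b, c, d, e, f)`.

```python
import json
import os
import tempfile

# Example matrix: a pure 2x scale (placeholder values for illustration).
matrix = (2.0, 0.0, 0.0, 2.0, 0.0, 0.0)

# Producer side: store the six values as a JSON list.
path = os.path.join(tempfile.mkdtemp(), "page_to_image.json")
with open(path, "w") as fh:
    json.dump(list(matrix), fh)

# Consumer side: read the six floats back and use them for computation.
with open(path) as fh:
    a, b, c, d, e, f = json.load(fh)


def to_image(x, y):
    """Transform a page coordinate into an image coordinate."""
    return (a * x + c * y + e, b * x + d * y + f)


print(to_image(100, 50))  # (200.0, 100.0)
```

The same six numbers can be applied in any environment that reads JSON, which is the point of storing them separately from the PDF.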

maimaik05 · 2023-06-21T14:19:43Z

maimaik05
Jun 21, 2023
Author

I figured out what the issue was! In PDFMiner, the X and Y dimensions of the page size are the reverse of what PyMuPDF uses: the page.rect with PyMuPDF is (0,0,3168,2448), but with PDFMiner the page.mediabox is (0,0,2448,3168). Thank you for all the help, I really appreciate it.
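A quick sanity check along these lines can catch this kind of mismatch early. The sketch below is plain Python with a hypothetical helper, `axes_swapped`; it only compares two reported page sizes and makes no library calls. One possible cause of such a swap is a page rotation of 90 or 270 degrees that one tool applies and the other does not, but the thread does not confirm that this was the reason here.

```python
def axes_swapped(size_a, size_b, tol=0.5):
    """Return True if size_b equals size_a with width and height exchanged.

    Sizes are (width, height) pairs; tol absorbs small rounding differences.
    Hypothetical helper for illustration.
    """
    (wa, ha), (wb, hb) = size_a, size_b
    same = abs(wa - wb) <= tol and abs(ha - hb) <= tol
    swapped = abs(wa - hb) <= tol and abs(ha - wb) <= tol
    return swapped and not same


# Page sizes taken from the rectangles reported above.
pymupdf_size = (3168, 2448)    # from page.rect
pdfminer_size = (2448, 3168)   # from page.mediabox

print(axes_swapped(pymupdf_size, pdfminer_size))  # True
```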
