Automating PDF to JPG and text recognition #2473
Replies: 2 comments 8 replies
-
We may end up needing to look at an example file, but let me first ask what specifically the problem is. I can't recall right now which coordinate system pdfminer actually uses (PDF's?). When you extract text together with its coordinates on a PDF page, you are dealing with points (1 inch = 72 points), and these are floats. The standard PDF coordinate system uses the page's bottom-left corner as (0, 0), and pdfminer may be doing the same. PyMuPDF has several "geometry" objects, such as rectangles and matrices, to deal with all of that. Among other things, there is a function that computes a coordinate system transformation matrix to convert, for example, PDF page coordinates to the corresponding coordinates inside an image that was rendered from that page.
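To make the conversion described above concrete, here is a minimal sketch in plain Python. It assumes the text boxes come in PDF-style coordinates (points, origin at the bottom-left, y growing upward), that the page is not rotated, that its MediaBox starts at (0, 0), and that the image was rendered from the full page at a known DPI. The function name `pdf_bbox_to_pixels` and its parameters are made up for illustration and are not part of pdfminer or PyMuPDF.

```python
def pdf_bbox_to_pixels(bbox, page_height_pt, dpi=150):
    """Map a bottom-left-origin bbox in points to top-left-origin pixels.

    bbox is (x0, y0, x1, y1) in points with y growing upward (assumed
    PDF convention). The result is (left, top, right, bottom) in pixels
    with y growing downward, as in a typical raster image.
    """
    x0, y0, x1, y1 = bbox
    scale = dpi / 72.0  # 72 points per inch

    left = x0 * scale
    right = x1 * scale
    # Flip the y axis: the upper edge (y1) becomes the smaller pixel row.
    top = (page_height_pt - y1) * scale
    bottom = (page_height_pt - y0) * scale
    return (left, top, right, bottom)


# A 1-inch square located 1 inch from the left and bottom edges of a
# page that is 792 points high, in an image rendered at 144 dpi.
print(pdf_bbox_to_pixels((72, 72, 144, 144), page_height_pt=792, dpi=144))
```

If the plotted points are mirrored vertically or off by a constant factor, the y flip or the DPI scale in this sketch is the first thing to check.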
-
I figured out what the issue was! In PDFMiner, the X and Y dimensions of the page size are the reverse of what PyMuPDF uses: with PyMuPDF the page.rect is (0, 0, 3168, 2448), but with PDFMiner the page.mediabox is (0, 0, 2448, 3168). Thank you for all the help, I really appreciate it.
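One possible cause of such a swap, offered here as an assumption rather than something confirmed in this thread, is a page rotation of 90 or 270 degrees: PyMuPDF's page.rect describes the page as displayed, while a raw MediaBox describes the unrotated page. The sketch below maps a point from unrotated, bottom-left-origin PDF coordinates to top-left-origin coordinates on the displayed page. The function name `pdf_point_to_display` is hypothetical, and the MediaBox is assumed to start at (0, 0).

```python
def pdf_point_to_display(x, y, mediabox_w, mediabox_h, rotation=0):
    """Map an unrotated PDF point to displayed-page coordinates.

    (x, y) is in points with the origin at the bottom-left of the
    unrotated page. rotation is the clockwise page rotation in degrees
    (0, 90, 180 or 270). The result has its origin at the top-left of
    the page as displayed, with y growing downward.
    """
    # First flip y so the origin is the top-left of the unrotated page.
    u, v = x, mediabox_h - y

    if rotation == 0:
        return (u, v)
    if rotation == 90:
        # Displayed page is mediabox_h wide and mediabox_w high.
        return (mediabox_h - v, u)
    if rotation == 180:
        return (mediabox_w - u, mediabox_h - v)
    if rotation == 270:
        return (v, mediabox_w - u)
    raise ValueError("rotation must be 0, 90, 180 or 270")


# Page size from this thread, assuming (hypothetically) a 90 degree rotation:
# the unrotated top-right corner lands at the displayed bottom-right corner.
print(pdf_point_to_display(2448, 3168, 2448, 3168, rotation=90))
```

With a 90 degree rotation the mapping reduces to swapping x and y, which would match the reversed width and height reported above.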
-
Hello!
I'm trying to automate a process that takes a PDF, converts it to a JPG, finds and reads the text on the PDF, and then uses the JPG together with the location of the text in Tableau to plot a coordinate. When doing this manually through Adobe, I would take the PDF, export it to a PostScript file, open that as a PDF, and then export it to a JPG. I want to automate this process, so I'm currently using PyMuPDF to convert the PDF to a JPG.
I have the script working, but the coordinates are not in line with the JPG, and I'm not sure what the problem is. I use PDFMiner to do the text recognition on the PDF; does that have something to do with it? I realize PyMuPDF has text recognition abilities, but I am building off of someone else's work, so I would prefer not to change too much of it if possible!
I'm not sure whether the PostScript step flattened the PDF, so that without it the output isn't sized correctly. I did try doc.save('filename', garbage=4, deflate=True), but that didn't change anything. As you can see, it's been difficult to pinpoint the issue.
I can send the code privately if needed, if that's ok.
Also, this is my first time posting anything here, so let me know if you need more info or detail.