Replies: 13 comments 2 replies
-
Looks like this is more a question than an issue. |
-
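Ok, now.
I am still confused as to what you are trying to achieve:
You have some PDF page that you don't want to show fully? |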
-
If this is the case, you could simply decrease the page's cropbox. E.g. do this:
r = page.rect + (72, 72, -72, -72)
# the above rect omits 1 inch (= 72 points) from all the borders
page.set_cropbox(r)
# done |
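The rectangle arithmetic in the snippet above can be checked without opening a PDF. The following is a minimal plain-Python sketch, not PyMuPDF code: `shrink_rect` is a hypothetical helper, and the US Letter page size (612 x 792 points) is an assumed example. It adds the four offsets to (x0, y0, x1, y1) element by element, the same way `page.rect + (72, 72, -72, -72)` does.

```python
def shrink_rect(rect, margin):
    """Return rect with `margin` points removed from every border.

    rect is a tuple (x0, y0, x1, y1) in PDF points (72 points = 1 inch).
    """
    x0, y0, x1, y1 = rect
    return (x0 + margin, y0 + margin, x1 - margin, y1 - margin)


# Assumed example: a US Letter page, 8.5 x 11 inches
letter = (0, 0, 612, 792)
cropped = shrink_rect(letter, 72)
print(cropped)
```

Note that the top-left corner moves inward (positive offsets) while the bottom-right corner moves inward too (negative offsets), so the cropped area stays centered on the page.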
-
I have a pdf scanned image that is small in the right hand corner of the
page. When I zoom in locally using os.system(f'pdf-crop-margins {file_pdf}
-o {outputpdf}') it removes margins and the OCR works much better. I want
to do something similar using fitz (as it is easier to import into a
lambda).
Thank you!
…On Thu, Feb 17, 2022 at 7:44 AM Jorj X. McKie ***@***.***> wrote:
Ok, now.
I am still confused as to what you are trying to achieve:
You have some PDF page that you don't want to show fully?
|
-
I tried the code, and I think I need something that zooms in to get an
effect similar to that of the other package.
r = page.rect + (72, 72, -72, -72)
# the above rect omits 1 inch (= 72 points) from all the borders
page.set_cropbox(r)
…On Thu, Feb 17, 2022 at 7:49 AM Jorj X. McKie ***@***.***> wrote:
Ah, ok. Would my previous post help with this?
|
-
What you could also do is take a (high-resolution) RGB Pixmap of the wanted part of the page and run OCR on it:
r = page.rect + (....)
pix = page.get_pixmap(dpi=300, clip=r)
pdfbytes = pix.pdfocr_tobytes()
ocrpdf = fitz.open("pdf", pdfbytes)
ocrpage = ocrpdf[0]
# now extract your text
text = ocrpage.get_text()
# or any of the other get_text() variants |
-
If you need text position coordinates with respect to the original scanned page, you can also compute them fairly easily ... let me know. |
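One way to picture that computation, as a rough sketch only: assuming positions are measured in pixels of a pixmap rendered with `dpi` and `clip` as above, a pixel offset converts to points by the factor 72 / dpi, and adding the clip's top-left corner gives coordinates on the original page. `pixel_to_page` is a hypothetical helper, not a PyMuPDF function, and the intermediate OCR PDF may use different units, so check its page size before relying on this.

```python
def pixel_to_page(px, py, clip_x0, clip_y0, dpi):
    """Map a pixel position inside a clipped rendering to page points."""
    # 72 / dpi is the size of one pixel in points
    return (clip_x0 + px * 72 / dpi, clip_y0 + py * 72 / dpi)


# Assumed example: clip starts 1 inch in from the top-left, rendered at 300 dpi
point = pixel_to_page(300, 600, clip_x0=72, clip_y0=72, dpi=300)
print(point)
```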
-
Essentially I am a bit stuck on using this within a lambda too. I need the
zoomed image to be a pdf in order to leverage textract.
I must also have deleted the code that made the save via doc.write work.
def lambda_handler(event, context):
    textract = boto3.client("textract")
    # if event:
    file_obj = event["Records"][0]
    bucketname = str(file_obj["s3"]["bucket"]["name"])
    filename = unquote_plus(str(file_obj["s3"]["object"]["key"]))
    doc = fitz.open()
    s3 = boto3.resource('s3')
    obj = s3.Object(bucketname, filename)
    fs = obj.get()['Body'].read()
    pdf = fitz.open("pdf", stream=BytesIO(fs))
    # open stream as PDF
    page = pdf[0]
    # sizeA4 = fitz.paper_size("A4")
    r = page.rect + (72, 72, -72, -72)
    # the above rect omits 1 inch (= 72 points) from all the borders
    page.set_cropbox(r)
    # page.show_pdf_page(rect, pdf, 0)
    new_bytes = doc.write()
…On Thu, Feb 17, 2022 at 8:07 AM Jorj X. McKie ***@***.***> wrote:
The dpi parameter is equivalent to zooming.
I am not sure whether you indicated that you also need to distort the scan,
because the original was not positioned completely flat on the scanning
machine.
If so, then you could perhaps use the new Pixmap method warp() to make a
rectangle again from something that looks more like a trapezoid ...
|
-
But my snippet using this pixmap approach does produce an (intermediate) PDF, with one page that contains the original page image (with cropped border) and an underlying OCRed text layer. You could also stick with your original PDF page. Set the page's cropbox as described and then do this to OCR and extract the text:
page.set_cropbox(r)
tp = page.get_textpage_ocr(dpi=300, full=True)
# the above performs the OCR, then makes a textpage to extract from:
text = page.get_text("text", textpage=tp)
# or other text output formats
blocks = page.get_text("dict", textpage=tp)["blocks"]
# etc.
This produces the same result as working with pixmaps. |
-
I am using textract so I need to have just a zoomed in version of the pdf.
I have tried this with other image formats and it did not work as well. I am
able to crop with code, but the result is not zoomed in.
…On Thu, Feb 17, 2022 at 11:27 PM Jorj X. McKie ***@***.***> wrote:
I need the zoomed image to be a pdf in order to leverage textract.
But my snippet using this pixmap approach, does produce an (intermediate)
PDF! With one page that contains the original page image (with cropped
border) and an underlying OCRed text layer.
You also could stick with your original PDF page, too. Set the page's
cropbox as described and then do this to OCR and extract the text:
page.set_cropbox(r)
tp = page.get_textpage_ocr(dpi=300, full=True)
# the above performs the OCR, then makes a textpage to extract from:
text = page.get_text("text", textpage=tp)
# or other text output formats
blocks = page.get_text("dict", textpage=tp)["blocks"]
# etc.
This produces the same result as working with pixmaps.
|
-
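I am beginning to suspect that what you refer to as textract is something outside PyMuPDF - correct?
Else I don't understand your problem: if you use the dpi parameter in one of the above ways, together with cropping/clipping the page area, you are zooming. |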
-
My reason for using fitz is to just take in a pdf, zoom in, and then resave
it zoomed in, so I can later use other OCR packages.
When would I use convert_to_pdf on a pixmap? That is the step I am missing,
I believe.
pdfbytes = doc.convert_to_pdf()  # this is a bytes object
I would prefer not to have to save to S3 and then read it back in.
…On Sun, Feb 20, 2022 at 2:28 AM Jorj X. McKie ***@***.***> wrote:
I am beginning to suspect, that what you refer to as textract is
something outside PyMuPDF - correct?
Else I don't understand your problem: if you use the dpi parameter in one
of the above ways, together with cropping/clipping the page area, you *are
zooming*.
|
-
Ah, this is a completely different thing!
outpdf = fitz.open()
for page in inpdf:
    # make a zoomed-in version of the relevant part of the page
    pix = page.get_pixmap(dpi=300, clip=...)
    outpage = outpdf.new_page(width=pix.width, height=pix.height)
    outpage.insert_image(outpage.rect, pixmap=pix)
# close() takes no arguments; garbage and deflate are options of save()
outpdf.save(..., garbage=4, deflate=True) |
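To see why this gives a zoomed-in page, here is a small arithmetic sketch in plain Python (the helper names and the sample clip size are assumptions, not PyMuPDF API). Rendering at `dpi` turns each point into dpi / 72 pixels, and because the new page takes the pixmap's pixel width and height as its size in points, the clipped area ends up enlarged by that same factor. Real pixmap sizes may differ by a pixel because of rounding.

```python
def zoom_factor(dpi):
    """Pixels produced per PDF point when rendering at `dpi`."""
    return dpi / 72


def output_page_size(clip_width, clip_height, dpi):
    """Approximate size of the new page when pixel counts are reused as points."""
    return (round(clip_width * dpi / 72), round(clip_height * dpi / 72))


# Assumed example: a Letter page with 1-inch margins clipped off (468 x 648 points)
print(zoom_factor(300))
print(output_page_size(468, 648, 300))
```

With dpi=72 the factor is 1, so the output page would be the same size as the clipped area; higher dpi values give a proportionally larger page and a sharper image for downstream OCR.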
-
Essentially the copy I create is exactly the same as the original copy. I really just want to at least zoom in such that the border margins would go away.
file_obj = event["Records"][0]
bucketname = str(file_obj["s3"]["bucket"]["name"])
filename = unquote_plus(str(file_obj["s3"]["object"]["key"]))