Zoom into a pdf within lambda and having issues. #1598

meggievh · 2022-02-17T15:34:35Z

meggievh
Feb 17, 2022

Essentially the copy I create is exactly the same as the original copy. I really just want to at least zoom in such that the border margins would go away.

import boto3
import fitz  # PyMuPDF
from io import BytesIO
from urllib.parse import unquote_plus

file_obj = event["Records"][0]
bucketname = str(file_obj["s3"]["bucket"]["name"])
filename = unquote_plus(str(file_obj["s3"]["object"]["key"]))

doc = fitz.open()  # new, empty output PDF
s3 = boto3.resource('s3')
obj = s3.Object(bucketname, filename)
fs = obj.get()['Body'].read()
pdf = fitz.open("pdf", stream=BytesIO(fs))  # open stream as PDF
rect = pdf[0].rect
sizeA4 = fitz.paper_size("A4")
page = doc.new_page(width=sizeA4[0], height=sizeA4[1])
page.show_pdf_page(rect, pdf, 0)  # places the full source page, so nothing is cropped

new_bytes = doc.write()

bucketname2 = 'modified'
s3.Bucket(bucketname2).put_object(Key=filename, Body=new_bytes)

JorjMcKie · 2022-02-17T15:40:52Z

JorjMcKie
Feb 17, 2022
Maintainer

Looks like this is more a question than an issue.
So I will transfer this to the Discussions tab.

JorjMcKie · 2022-02-17T15:44:01Z

JorjMcKie
Feb 17, 2022
Maintainer

Ok, now.
I am still confused as to what you are trying to achieve:
You have some PDF page that you don't want to show fully?

JorjMcKie · 2022-02-17T15:47:29Z

JorjMcKie
Feb 17, 2022
Maintainer

If this is the case, you could simply decrease the page's cropbox. E.g. do this:

r = page.rect + (72, 72, -72, -72)
# the above rect omits 1 inch (= 72 points) from all the borders
page.set_cropbox(r)
# done
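
The rectangle arithmetic above can be sketched with plain tuples, so it runs without PyMuPDF. This is only an illustration: `shrink_rect` is a hypothetical helper name, and the A4 size of 595 x 842 points is assumed for the example.

```python
def shrink_rect(rect, margin):
    """Move every edge of (x0, y0, x1, y1) inward by `margin` points."""
    delta = (margin, margin, -margin, -margin)
    # element-wise addition, mirroring page.rect + (72, 72, -72, -72)
    return tuple(a + b for a, b in zip(rect, delta))


a4 = (0, 0, 595, 842)          # assumed page rectangle in points
cropped = shrink_rect(a4, 72)  # 72 points = 1 inch
print(cropped)
```

The top-left corner moves in by the margin and the bottom-right corner moves back by it, which is why the last two values are negative.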

meggievh · 2022-02-17T15:48:16Z

meggievh
Feb 17, 2022
Author

I have a pdf scanned image that is small in the right hand corner of the page. When I zoom in locally using os.system(f'pdf-crop-margins {file_pdf} -o {outputpdf}') it removes margins and the OCR works much better. I want to do something similar using fitz (as it is easier to import into a lambda). Thank you!

1 reply

JorjMcKie Feb 17, 2022
Maintainer

Ah, ok. Would my previous post help with this?

meggievh · 2022-02-17T15:54:02Z

meggievh
Feb 17, 2022
Author

I tried the code, and I think I need something that zooms in to get an effect similar to the other package.

r = page.rect + (72, 72, -72, -72)
# the above rect omits 1 inch (= 72 points) from all the borders
page.set_cropbox(r)

JorjMcKie · 2022-02-17T15:59:13Z

JorjMcKie
Feb 17, 2022
Maintainer

What you could also do is take a (high-resolution) RGB Pixmap of the wanted part of the page and OCR that:

r= page.rect + (....)
pix = page.get_pixmap(dpi=300, clip=r)
pdfbytes = pix.pdfocr_tobytes()
ocrpdf = fitz.open("pdf", pdfbytes)
ocrpage=ocrpdf[0]
# now extract your text
text = ocrpage.get_text()
# or any of the other get_text() variants

1 reply

JorjMcKie Feb 17, 2022
Maintainer

The dpi parameter is equivalent to zooming.
I am not sure whether you indicated that you also need to correct distortion in the scan because the original was not lying completely flat on the scanner.
If so, you could perhaps use the new Pixmap method warp() to turn something that looks more like a trapezoid back into a rectangle ...
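
To make the "dpi is zooming" point concrete: PDF coordinates use 72 points per inch, so rendering at a given dpi scales the clip by dpi / 72. The sketch below is plain Python; the helper names and the example clip are assumptions, and the exact pixel rounding PyMuPDF applies may differ by a pixel.

```python
def zoom_factor(dpi):
    """Scale factor relative to the PDF default of 72 points per inch."""
    return dpi / 72


def approx_pixel_size(clip, dpi):
    """Approximate pixmap size for a clip (x0, y0, x1, y1) given in points."""
    x0, y0, x1, y1 = clip
    z = zoom_factor(dpi)
    return round((x1 - x0) * z), round((y1 - y0) * z)


clip = (72, 72, 523, 770)            # assumed: A4 minus a 1-inch border
print(zoom_factor(300))              # a little over 4x
print(approx_pixel_size(clip, 300))  # pixel width and height of the rendering
```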

JorjMcKie · 2022-02-17T16:01:58Z

JorjMcKie
Feb 17, 2022
Maintainer

If you need text position coordinates with respect to the original scanned page, you can also compute them fairly easily ... let me know.
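
One plausible way to do that mapping, assuming the positions are measured in pixels of a pixmap rendered with a known `clip` and `dpi`, is to scale back by 72 / dpi and add the clip's top-left corner. `pixel_to_page` is a hypothetical helper; positions taken from a PDF page rather than raw pixels would need that page's own scale instead.

```python
def pixel_to_page(px, py, clip, dpi):
    """Map a pixel position in a clipped rendering back to page points.

    clip is (x0, y0, x1, y1) in points on the original page.
    """
    x = clip[0] + px * 72 / dpi
    y = clip[1] + py * 72 / dpi
    return x, y


clip = (72, 72, 523, 770)  # assumed clip used for the rendering
print(pixel_to_page(0, 0, clip, 300))      # top-left pixel maps to the clip corner
print(pixel_to_page(300, 600, clip, 300))  # 300 px is one inch at 300 dpi
```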

meggievh · 2022-02-18T00:52:19Z

meggievh
Feb 18, 2022
Author

Essentially I am a bit stuck on using this within a lambda too. I need the zoomed image to be a pdf in order to leverage textract. I must have also deleted the code that made the save via doc.write work.

def lambda_handler(event, context):
    textract = boto3.client("textract")
    #if event:
    file_obj = event["Records"][0]
    bucketname = str(file_obj["s3"]["bucket"]["name"])
    filename = unquote_plus(str(file_obj["s3"]["object"]["key"]))
    doc = fitz.open()
    s3 = boto3.resource('s3')
    obj = s3.Object(bucketname, filename)
    fs = obj.get()['Body'].read()
    pdf = fitz.open("pdf", stream=BytesIO(fs))  # open stream as PDF
    page = pdf[0]
    #sizeA4 = fitz.paper_size("A4")
    r = page.rect + (72, 72, -72, -72)
    #the above rect omits 1 inch (= 72 points) from all the borders
    page.set_cropbox(r)
    #page.show_pdf_page(rect, pdf, 0)
    new_bytes = doc.write()

JorjMcKie · 2022-02-18T07:26:51Z

JorjMcKie
Feb 18, 2022
Maintainer

> I need the zoomed image to be a pdf in order to leverage textract.

But my snippet using the pixmap approach does produce an (intermediate) PDF, with one page that contains the original page image (with cropped border) and an underlying OCRed text layer.

You could also stick with your original PDF page. Set the page's cropbox as described and then do this to OCR and extract the text:

page.set_cropbox(r)
tp = page.get_textpage_ocr(dpi=300, full=True)
# the above performs the OCR, then makes a textpage to extract from:
text = page.get_text("text", textpage=tp)
# or other text output formats
blocks = page.get_text("dict", textpage=tp)["blocks"]
# etc.

This produces the same result as working with pixmaps.

meggievh · 2022-02-19T12:13:24Z

meggievh
Feb 19, 2022
Author

I am using textract so I need to have just a zoomed in version of the pdf. I have tried this with other image formats and did not work as well. I am able to crop with code, but it is not zoomed in.

JorjMcKie · 2022-02-20T10:28:09Z

JorjMcKie
Feb 20, 2022
Maintainer

I am beginning to suspect that what you refer to as textract is something outside PyMuPDF, correct?
Otherwise I don't understand your problem: if you use the dpi parameter in one of the above ways, together with cropping/clipping the page area, you are zooming.

meggievh · 2022-02-21T14:11:02Z

meggievh
Feb 21, 2022
Author

My reason for using fitz is just to take in a pdf, zoom in, and then re-save it zoomed in, so I can later use other OCR packages. When would I use convert_to_pdf on a pixmap? I believe that is the step I am missing:

pdfbytes = doc.convert_to_pdf()  # this is a bytes object

I would prefer not to have to save to s3 and then re-read it.

JorjMcKie · 2022-02-21T14:26:04Z

JorjMcKie
Feb 21, 2022
Maintainer

Ah, this is a completely different thing!
If you have a scanned PDF (i.e. every page is an image), you can make a zoomed version of each page and save the modified pages to a new PDF.

outpdf = fitz.open()
for page in inpdf:
    pix = page.get_pixmap(dpi=300, clip=...)  # make a zoomed in version of the relevant page part
    outpage = outpdf.new_page(width=pix.width, height=pix.height)
    outpage.insert_image(outpage.rect, pixmap=pix)
outpdf.save(..., garbage=4, deflate=True)  # save() writes the file; close() takes no arguments
