Annotating signature blocks #2409

djoltes · 2023-05-17T16:03:20Z

djoltes
May 17, 2023

Wondering if anyone has tried this... I'm working on a project to automatically redact PDFs, and the next step is to try removing handwritten signatures. I'm testing AWS Textract's signature detection capability; so far it seems to be working, but I'm stuck on how to apply an annotation using the returned data.

Textract requires an image, so I'm grabbing the pixmap of each page and converting it to a jpg:

# Assumes `doc` is an open fitz.Document and `textract_client` is a boto3 Textract client.
for page in doc:
    # wrap_contents() is needed for fixing alignment issues with rect boxes
    page.wrap_contents()
    pix = page.get_pixmap()
    page_jpg = pix.tobytes(output="jpg")
    sigs = textract_client.analyze_document(
        Document={"Bytes": page_jpg}, FeatureTypes=["SIGNATURES"]
    )
    # Print the id and geometry of every detected signature block
    for item in sigs["Blocks"]:
        if item["BlockType"] == "SIGNATURE":
            print("Item: {a}, geometry {b}".format(a=item["Id"], b=item["Geometry"]))

Example output:

Item: 47ec04b6-a00f-4263-b540-db8fda022c18, geometry {'BoundingBox': {'Width': 0.310680091381073, 'Height': 0.05060103163123131, 'Left': 0.11123588681221008, 'Top': 0.6378791332244873}, 'Polygon': [{'X': 0.11123588681221008, 'Y': 0.6380218267440796}, {'X': 0.4218907356262207, 'Y': 0.6378791332244873}, {'X': 0.4219159781932831, 'Y': 0.6883060932159424}, {'X': 0.11127414554357529, 'Y': 0.6884801983833313}]}

Has anyone tried converting Textract coordinates into a set of fitz x/y values that will redact the identified signature block? It seems like it could be done with a redaction annotation built from a fitz.Rect, but I'm not sure whether what Textract provides will match what PyMuPDF thinks are the coordinates.

Thx for any ideas...

JorjMcKie · 2023-05-17T21:44:12Z

JorjMcKie
May 17, 2023
Maintainer

I have worked with Textract. The main thing is that, as far as I remember, it does not tell you the original page's dimensions.
The coordinates of everything (blocks, lines, words, ...) are floats in the range 0 to 1, so every coordinate is given relative to a page of size fitz.Rect(0, 0, 1, 1).
So you have to supply the real page dimensions from somewhere outside.
If, however, you magically know the page rectangle (let's call it page_rect), things are very simple:

Compute the transformation matrix mat = fitz.Rect(0, 0, 1, 1).torect(page_rect).
Then convert every coordinate in the Textract geometry to PyMuPDF coordinates by multiplying it with mat.
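The two steps above can be sketched in plain Python, without PyMuPDF, to show the arithmetic that the matrix performs. This is only a sketch under assumptions: the helper name `textract_bbox_to_rect` is hypothetical, the page rectangle is passed as an `(x0, y0, x1, y1)` tuple, and the example values are made up. Since the image in the question was rendered from the page itself with `get_pixmap()`, `page.rect` would presumably serve as `page_rect`, at least for unrotated pages.

```python
# Sketch: map a Textract BoundingBox (fractions of the page, 0..1) onto a
# page rectangle given as (x0, y0, x1, y1) in page coordinates.
# The function name and the sample values are illustrative assumptions.


def textract_bbox_to_rect(bbox, page_rect):
    """Return (x0, y0, x1, y1) in page coordinates for a Textract BoundingBox."""
    px0, py0, px1, py1 = page_rect
    width = px1 - px0
    height = py1 - py0
    # Scale the relative values by the page size, then shift by the page origin.
    x0 = px0 + bbox["Left"] * width
    y0 = py0 + bbox["Top"] * height
    x1 = px0 + (bbox["Left"] + bbox["Width"]) * width
    y1 = py0 + (bbox["Top"] + bbox["Height"]) * height
    return (x0, y0, x1, y1)


# Made-up example: a 600 x 800 page and a box covering its lower middle part.
example_bbox = {"Left": 0.25, "Top": 0.5, "Width": 0.5, "Height": 0.25}
example_rect = textract_bbox_to_rect(example_bbox, (0, 0, 600, 800))
print(example_rect)

# With PyMuPDF, the tuple could then be used roughly like this (not run here):
#   rect = fitz.Rect(*example_rect)
#   page.add_redact_annot(rect)
#   page.apply_redactions()
```

The same result should come out of `fitz.Rect(0, 0, 1, 1).torect(page_rect)` applied to the two corner points; the manual version is shown only because it makes the scaling and offset explicit.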

8 replies

djoltes May 18, 2023
Author

I know -- I'm being asked to see whether Tesseract or Textract works better in our context, so I was hoping there might be a relatively simple method for using the latter. I doubt Textract is significantly better, but I still need to look into it. I may need to read the whole returned dictionary for detected text, and assemble it into an object that can be passed to a fitz.open() call.

JorjMcKie May 18, 2023
Maintainer

> I may need to read the whole returned dictionary for detected text, and assemble it into an object that can be passed to a fitz.open() call.

I don't understand this part at all. It is a dictionary (of stacked, interconnected dictionaries), not a PDF. The only way you could make use of it is on the page level. So whenever you want text from the page, do not call page.get_text("dict"), but use the Textract dictionary instead.

Maybe it makes sense to first convert it to the dict format delivered by page.get_text("dict"), but that's a detail.

djoltes May 18, 2023
Author

I probably said that badly, and was in a hurry. I was thinking I could just extract all the "Text" elements from the returned dictionary and use them to build a 1-page PDF, but that would obviously lose all the layout and formatting so it's a non-starter.

What might be interesting is if I could use Tesseract to build the PDF page as normal, but somehow grab the text from both its output and Textract's in order to compare results and maybe replace some poorly rendered words. At this point I'm really trying to evaluate which OCR option provides the best results (and am secretly hoping for Tesseract since I prefer open source).

JorjMcKie May 18, 2023
Maintainer

> What might be interesting is if I could use Tesseract to build the PDF page as normal, but somehow grab the text from both its output and Textract's in order to compare results and maybe replace some poorly rendered words. At this point I'm really trying to evaluate which OCR option provides the best results (and am secretly hoping for Tesseract since I prefer open source).

That sounds doable. Take a scanned page and have it OCRed twice: once with Tesseract and once with Textract.
Then extract the text from the Tesseract version and draw all word rectangles on the page.
Then draw the WORD rectangles from the Textract JSON file on (a second version of) the same page.
Save both outputs and compare the rectangles.

To keep the effort low, it is best to run the Tesseract OCR via the OCRmyPDF package.

djoltes May 18, 2023
Author

I'm even wondering if I can use something like Tika (which I already use) to grab the text from the Tesseract OCR page, and compare it with a string built by sequentially extracting all the 'Text' elements from the Textract output. Then it's a matter of string comparison and maybe some analysis of words in each to see how many bogus words are found.
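The string comparison described above might look something like the following rough sketch. It assumes the Textract response has already been fetched and that the Tesseract text has been extracted by some other tool; the helper names (`textract_words`, `word_similarity`, `words_only_in`) and the sample data are hypothetical, and `difflib` is just one convenient standard-library option, not something the thread settled on.

```python
import difflib


def textract_words(response):
    """Collect the 'Text' of every WORD block, in the order Textract returned them."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "WORD" and "Text" in block
    ]


def word_similarity(words_a, words_b):
    """Return a similarity ratio between 0 and 1 for two word sequences."""
    return difflib.SequenceMatcher(None, words_a, words_b).ratio()


def words_only_in(words_a, words_b):
    """Return the words of words_a that never occur in words_b (case-insensitive)."""
    seen = {w.lower() for w in words_b}
    return [w for w in words_a if w.lower() not in seen]


# Made-up data standing in for real OCR output.
fake_response = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Text": "Signed by John"},
        {"BlockType": "WORD", "Text": "Signed"},
        {"BlockType": "WORD", "Text": "by"},
        {"BlockType": "WORD", "Text": "John"},
    ]
}
tesseract_text = "Signed bv John"  # stand-in for text pulled from the Tesseract page

textract_list = textract_words(fake_response)
tesseract_list = tesseract_text.split()
print(word_similarity(textract_list, tesseract_list))
print(words_only_in(tesseract_list, textract_list))
```

Words that appear in only one engine's output are candidates for the "bogus words" check; deciding which engine is actually right would still need a dictionary lookup or a manual review.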

Interestingly, it does appear Amazon has a Java-based Textract-to-PDF prototype. Too bad there's not one in Python! https://github.com/aws-samples/amazon-textract-searchable-pdf/blob/master/src/SearchablePDF/src/main/java/com/amazon/textract/pdf/PDFDocument.java
