-
A bit hard to understand. This is what I am getting:
If this is correct so far (i.e. nothing else has touched the image-only PDF), then the only logical conclusion is that those multiple rectangle copies come from the OCRed text - i.e. they were generated by Textract for whatever reason ... maybe in an effort to simulate text boldness? One way may be to simply join rectangles that are almost the same:
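That joining step could look something like the sketch below. This is a hypothetical helper, not code from either post: it uses plain `(x0, y0, x1, y1)` tuples rather than `fitz.Rect`, and the 3-point tolerance is just a guess to be tuned against real Textract output.

```python
def join_near_duplicates(rects, tol=3.0):
    """Collapse axis-aligned rects (x0, y0, x1, y1) whose corners
    all differ by less than `tol` points, keeping their union."""
    merged = []
    for r in rects:
        for i, m in enumerate(merged):
            if all(abs(a - b) < tol for a, b in zip(r, m)):
                # Near-duplicate found: replace it with the union rect.
                merged[i] = (min(r[0], m[0]), min(r[1], m[1]),
                             max(r[2], m[2]), max(r[3], m[3]))
                break
        else:
            merged.append(r)
    return merged
```

Running `search_for()` results through a filter like this before highlighting would leave one annotation per hit even when Textract emits two or three overlapping word boxes.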
-
I'm working on a new solution to highlight (rather than annotate) existing PDFs; everything is working as desired except that the output file has more than one layer of highlight annotations over the marked text.
The input files were already OCR'd, but by unknown software, so we're passing everything through Textract and then adding a new OCR layer using its output. To remove the existing OCR, I've been making an initial pass through the input document, grabbing each page as an image and writing it to a new output document, e.g.:
Then I iterate over wDoc one page at a time. This strategy removed one layer of highlights, but others remain, often at nearly the exact same bbox location. The code performing the highlighting is below ("piiType" is the entity type returned by Comprehend, so I can label the box appropriately).
Logging output:
Highlighting: 9/18/20, type: DATE_TIME, location: Rect(381.8072204589844, 342.2568054199219, 402.019287109375, 350.5815734863281)
Highlighting: 9/18/20, type: DATE_TIME, location: Rect(380.0889892578125, 339.8741760253906, 402.8880310058594, 349.26446533203125)
Flagged PII entity: 12:26:13 AM. Score: 0.999984860420227
Highlighting: 12:26:13 AM, type: DATE_TIME, location: Rect(410.4409484863281, 342.2568054199219, 444.794189453125, 350.5815734863281)
Highlighting: 12:26:13 AM, type: DATE_TIME, location: Rect(411.2460021972656, 340.730224609375, 434.4125671386719, 348.90875244140625)
Highlighting: 12:26:13 AM, type: DATE_TIME, location: Rect(434.4125671386719, 339.89935302734375, 445.86700439453125, 349.09423828125)
Flagged PII entity: 2015. Score: 0.9998883008956909
Highlighting: 2015, type: DATE_TIME, location: Rect(285.7590026855469, 395.2548522949219, 300.1438903808594, 404.1419372558594)
Flagged PII entity: 2019. Score: 0.9972389936447144
Highlighting: 2019, type: DATE_TIME, location: Rect(425.68499755859375, 395.1988525390625, 440.02020263671875, 404.0552673339844)
Flagged PII entity: /18/20. Score: 0.9999856948852539
Highlighting: /18/20, type: DATE_TIME, location: Rect(385.1759033203125, 342.2568054199219, 402.019287109375, 350.5815734863281)
Highlighting: /18/20, type: DATE_TIME, location: Rect(383.8888244628906, 339.8741760253906, 402.8880310058594, 349.26446533203125)
The question is: why? As far as I can tell, at this point a given page should consist only of the image and the OCR text laid down by Textract, so where are all the extra markups coming from?
Conversely, is there a way to limit the page.search_for() results to a single 'layer'? Or is that just plain ridiculous?