-
A bit hard to understand. This is what I am getting:
If this is correct so far (i.e. nothing else has touched the image-only PDF), then the only logical conclusion is that those multiple rectangle copies come from the OCRed text - i.e. they were generated by Textract for whatever reason ... maybe in an effort to simulate text boldness? One way may be to simply join rectangles that are almost the same:
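That joining step could look something like the sketch below. This is a hypothetical helper, not code from either post: it uses plain `(x0, y0, x1, y1)` tuples rather than `fitz.Rect`, and the 3-point tolerance is just a guess to be tuned against real Textract output.

```python
def join_near_duplicates(rects, tol=3.0):
    """Collapse axis-aligned rects (x0, y0, x1, y1) whose corners
    all differ by less than `tol` points, keeping their union."""
    merged = []
    for r in rects:
        for i, m in enumerate(merged):
            if all(abs(a - b) < tol for a, b in zip(r, m)):
                # Near-duplicate found: replace it with the union rect.
                merged[i] = (min(r[0], m[0]), min(r[1], m[1]),
                             max(r[2], m[2]), max(r[3], m[3]))
                break
        else:
            merged.append(r)
    return merged
```

Running `search_for()` results through a filter like this before highlighting would leave one annotation per hit even when Textract emits two or three overlapping word boxes.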
-
I'm working on a new solution to highlight (rather than annotate) existing PDFs; everything is working as desired except that the output file has more than one layer of highlight annotations over the marked text.
The input files were already OCR'd, but by unknown software, so we're passing everything through Textract and then adding a new OCR layer using its output. To remove the existing OCR, I've been making an initial pass through the input document, grabbing each page as an image and writing it to a new output document, e.g.:
Then I iterate over wDoc one page at a time. This strategy removed one layer of highlights, but others remain, often at nearly the exact same bbox location. The code performing the highlighting is below ("piiType" is the entity type returned by Comprehend, so I can label the box appropriately).
Logging output:
Highlighting: 9/18/20, type: DATE_TIME, location: Rect(381.8072204589844, 342.2568054199219, 402.019287109375, 350.5815734863281)
Highlighting: 9/18/20, type: DATE_TIME, location: Rect(380.0889892578125, 339.8741760253906, 402.8880310058594, 349.26446533203125)
Flagged PII entity: 12:26:13 AM. Score: 0.999984860420227
Highlighting: 12:26:13 AM, type: DATE_TIME, location: Rect(410.4409484863281, 342.2568054199219, 444.794189453125, 350.5815734863281)
Highlighting: 12:26:13 AM, type: DATE_TIME, location: Rect(411.2460021972656, 340.730224609375, 434.4125671386719, 348.90875244140625)
Highlighting: 12:26:13 AM, type: DATE_TIME, location: Rect(434.4125671386719, 339.89935302734375, 445.86700439453125, 349.09423828125)
Flagged PII entity: 2015. Score: 0.9998883008956909
Highlighting: 2015, type: DATE_TIME, location: Rect(285.7590026855469, 395.2548522949219, 300.1438903808594, 404.1419372558594)
Flagged PII entity: 2019. Score: 0.9972389936447144
Highlighting: 2019, type: DATE_TIME, location: Rect(425.68499755859375, 395.1988525390625, 440.02020263671875, 404.0552673339844)
Flagged PII entity: /18/20. Score: 0.9999856948852539
Highlighting: /18/20, type: DATE_TIME, location: Rect(385.1759033203125, 342.2568054199219, 402.019287109375, 350.5815734863281)
Highlighting: /18/20, type: DATE_TIME, location: Rect(383.8888244628906, 339.8741760253906, 402.8880310058594, 349.26446533203125)
The question is: why? As far as I can tell, at this point a given page should consist only of the image and the OCR text laid down by Textract, so where are all the extra markups coming from?
Conversely, is there a way to limit the page.search_for() results to a single 'layer'? Or is that just plain ridiculous?