Alignment of OCR text when using Textract or Rekognition #2552

djoltes · 2023-07-20T15:31:39Z


I'm having a problem with the alignment of redactions when using AWS Rekognition (and occasionally Textract), and I wonder whether I need to change how I compute page locations for placing the OCR text (and therefore the redactions). With normal text everything is fine, and the word alignment is good. The problem shows up with larger font sizes, which matters because we are required to redact things like license plate numbers and other oddly sized text.

The setup code for OCR is as follows:

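import cv2
import fitz  # PyMuPDF
import numpy as np

font = fitz.Font("helv")
[...]
pix = page.get_pixmap()  # rendered at the default resolution
page_jpg = pix.tobytes(output="jpg")
# decode the JPEG bytes as a grayscale image to read its pixel size
img = np.asarray(bytearray(page_jpg), dtype="uint8")
img = cv2.imdecode(img, 0)
iHeight, iWidth = img.shape[:2]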

While iterating over the returned OCR results, I use the following to compute the bbox for the invisible text:

geo = item["Geometry"]
box = geo["BoundingBox"]
x0 = box["Left"] * iWidth  # left edge
y0 = box["Top"] * iHeight  # top edge
height = box["Height"] * iHeight
width = box["Width"] * iWidth
x1 = x0 + width  # right edge
y1 = y0 + height  # bottom edge

matrix = fitz.Rect(0, 0, 1, 1).torect(page.rect)
ocrRect = fitz.Rect(x0, y0, x1, y1)
bbox = ocrRect * matrix

(fontSize = bbox.width / textLen was also tried, but occasionally causes near full-page redactions)

textLen = font.text_length(ocrText, fontsize=1)
fontSize = ocrRect.width / textLen
               
page.insert_text(ocrRect.bl,
                 ocrText,
                 fontsize=fontSize,
                 fontname="helv",
                 render_mode=3)  # render mode 3 = invisible text

But when the results appear, they're offset in some way -- usually like one of these examples:

This has to be somehow related to how coordinates are being computed for non-standard font sizes, but I'm not sure how to overcome it, since we have no way of knowing what font size the detected text in an image is. Has anyone tried this and had success? Any hints or ideas appreciated.

JorjMcKie · 2023-07-22T15:00:53Z

JorjMcKie
Jul 22, 2023
Maintainer

I'm still traveling and only have computer access on Tuesday.
The pivotal thing with Textract is to feed in the page size, because its coordinates are relative to the unit rectangle. So compute the matrix m = fitz.Rect(0,0,1,1).torect(page.rect) for the given page. Then build a fitz.Rect "rect" from the relative Textract coordinates.
rect * m will then be the correct PyMuPDF coordinates to work with.
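
A minimal sketch of the arithmetic behind that matrix, in plain Python without PyMuPDF. The function name and the sample values are hypothetical, and it assumes the page rectangle starts at (0, 0), in which case the matrix reduces to a scale by the page width and height. The relative values go in as returned by the OCR service and are scaled exactly once:

```python
def unit_bbox_to_page(box, page_width, page_height):
    """Map a relative BoundingBox (values in 0..1) to page coordinates.

    Assumes the page rectangle starts at (0, 0), so the mapping is a pure scale.
    """
    x0 = box["Left"] * page_width
    y0 = box["Top"] * page_height
    x1 = (box["Left"] + box["Width"]) * page_width
    y1 = (box["Top"] + box["Height"]) * page_height
    return (x0, y0, x1, y1)


# Hypothetical OCR result on a 612 x 792 point page.
box = {"Left": 0.25, "Top": 0.5, "Width": 0.5, "Height": 0.125}
rect = unit_bbox_to_page(box, 612, 792)
print(rect)
```

The resulting tuple could be passed to fitz.Rect(*rect). Multiplying the relative values by pixel dimensions first and then applying the matrix as well would scale them twice.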

2 replies

djoltes Jul 24, 2023
Author

Yep, I'm using that method and it works for normal text but not when an image (see the photos when you get back and have time) is involved. Same maths, but the placement of the OCR and thereby the redaction blocks is offset. There may be something wrong with my logic, but that the OCR works on normal text and fails elsewhere seems odd.

JorjMcKie Jul 26, 2023
Maintainer

Reconstructing the original font size of OCRed text is a highly error-prone business. For example, if the original text was "area between edges", you will get 3 different bbox heights, one for each of the 3 words, even though the text was written with the same font and font size.
In lucky circumstances, Textract will recognize that the words belong to the same LINE and provide a common bbox.
You still don't know the original font, so you have to assume one. Then take that hypothetical font's "ascender" and "descender" properties, and check for characters reaching below the baseline (like "g") or above the so-called "x-height" (like "b", "t", "d" above). With all this information, compute the insertion point ("origin") of the words or the line, and the font size.
If your assumed font differs from the original, the computed results will still be wrong, or will only approximate the correct ones.
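
The heuristic described above could be sketched roughly as follows. This is plain Python under stated assumptions, not a definitive implementation: the function name, the character sets, and the metric values are all hypothetical placeholders. In PyMuPDF, an assumed font's ascender and descender can be read from a fitz.Font object, while the x-height would have to come from elsewhere. The bbox is taken to be in page coordinates with y growing downwards.

```python
DESCENDING = set("gjpqy")   # letters reaching below the baseline
TALL = set("bdfhklt")       # letters reaching above the x-height


def estimate_origin_and_size(text, bbox, ascender, descender, x_height):
    """Estimate the baseline origin and font size from a word's bbox.

    ascender, descender and x_height are per unit font size;
    descender is negative, as is usual for font metrics.
    """
    x0, y0, x1, y1 = bbox
    has_tall = any(c in TALL or c.isupper() or c.isdigit() for c in text)
    has_desc = any(c in DESCENDING for c in text)
    above = ascender if has_tall else x_height   # extent above the baseline
    below = -descender if has_desc else 0.0      # extent below the baseline
    fontsize = (y1 - y0) / (above + below)
    baseline_y = y1 - below * fontsize
    return (x0, baseline_y), fontsize


# Illustrative metrics only, not read from any real font.
metrics = dict(ascender=0.7, descender=-0.2, x_height=0.5)
for word, box in [("area", (0, 100, 50, 110)),
                  ("between", (60, 96, 140, 110)),
                  ("edges", (150, 96, 200, 114))]:
    origin, size = estimate_origin_and_size(word, box, **metrics)
    print(word, origin, round(size, 2))
```

With these placeholder metrics, three bboxes of different heights resolve to about the same font size and the same baseline, which is the point of checking for ascenders and descenders. The dots on "i" and "j", punctuation, and accents are ignored here, so the result remains an approximation.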

As explained before, the whole computation is based on using exactly the same page dimension as was used by Textract. Your pictures show large location differences, suggesting that this was not the case.
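
One cheap sanity check for this, sketched under the assumption that the image sent to the OCR service is a rendering of the whole page: compare the aspect ratio of the image with that of the page. The function name and the tolerance are hypothetical.

```python
def dimensions_consistent(img_width, img_height, page_width, page_height, tol=0.01):
    """Return True if image and page have the same aspect ratio within tol."""
    img_ratio = img_width / img_height
    page_ratio = page_width / page_height
    return abs(img_ratio - page_ratio) <= tol * page_ratio


# A 612 x 792 point page rendered at twice the size: consistent.
print(dimensions_consistent(1224, 1584, 612, 792))
# The same image rotated by 90 degrees: not consistent.
print(dimensions_consistent(1584, 1224, 612, 792))
```

A failed check (for example a rotated page or a cropped image) means the relative coordinates cannot be mapped onto the page by scaling alone. A passed check does not prove that the mapping is right.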
