Annotation redaction creating unwanted white rectangles #2471

luclemot · 2023-06-14T14:54:20Z

luclemot
Jun 14, 2023

Bug description

Hi !
I'm running into an issue similar to the one in this Stack Overflow question.
While redacting some text from a relatively dense PDF, the redaction itself works, but each redaction also adds an extra white rectangle that covers text I don't want redacted.

How it happened

I simply ran the following lines for each rectangle I wanted redacted:

page.add_redact_annot(rect, fill=(1, 1, 1))  # white fill
page.apply_redactions()

From my attempts at debugging, add_redact_annot seems to work fine (it crosses out the right rectangle when I comment out the next line), so the issue appears to come from the apply_redactions method.

Outputs

Below is the resulting redaction. The text in red is replacement text I added. In the marked-up screenshot, you can see the unwanted white rectangle.

Your configuration

Ubuntu
Python 3.10
PyMuPDF==1.22.3

JorjMcKie · 2023-06-14T15:13:53Z

JorjMcKie
Jun 14, 2023
Maintainer

This is not a bug, but a frequently seen complication. Let me transfer this to "Discussions" first.

6 replies

luclemot Jun 14, 2023
Author

Oops, sorry ! Let me know if you have an idea how I can fix this.

JorjMcKie Jun 14, 2023
Maintainer

No problem!
Here is the background:
Fonts have their built-in, "natural" line height. More often than not, this is larger than the font size of the text. For Helvetica, for example, the line height is 37.4% larger than the font size.
If the PDF creator has written the lines with a smaller distance than that, the bboxes of lines (words, ...) and the hit rectangles will overlap parts of the preceding or following lines.
And because the redaction logic removes every character that overlaps the redact rectangle, you see the undesired effect.
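
A minimal sketch of that arithmetic, assuming a simplified model in which a line's bbox height is the font size times the line height factor and lines are stacked at a fixed distance. The 1.374 factor is the Helvetica figure quoted above; the function name and the sample numbers are hypothetical.

```python
# Simplified model (an assumption): bbox height = fontsize * factor,
# and consecutive lines are placed `line_distance` points apart.
HELVETICA_FACTOR = 1.374  # line height is 37.4% larger than the font size


def bbox_overlap(fontsize, line_distance, factor=HELVETICA_FACTOR):
    """Return by how many points a line's bbox is taller than the line distance.

    A positive result means the bbox reaches into neighbouring lines, so a
    redaction built from that bbox can also hit characters of those lines.
    """
    return max(0.0, fontsize * factor - line_distance)


# 10 pt text set 12 pt apart: the bbox is taller than the line distance.
tight = bbox_overlap(10, 12)
# 10 pt text set 14 pt apart: the bbox fits, no overlap.
loose = bbox_overlap(10, 14)
```

With these sample numbers, the tightly set text overlaps its neighbours and the loosely set text does not, which matches the behaviour described in the report.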

This can be done to solve it:
Before searching or extracting text, set a global PyMuPDF parameter such that only text (line) heights equal to the font size are generated: fitz.TOOLS.set_small_glyph_heights(True).
There is a good chance that your problem will be gone afterwards.
If not, then the PDF creator was even stingier and you must get creative:
Define a sub-bbox of the to-be-removed bbox, for example one with the same width but only 20% of its height: your text characters will still overlap it, but hopefully nothing else.
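
Both suggestions could be combined roughly as follows. This is a sketch, not a definitive implementation: shrink_rect and redact_matches are hypothetical helper names, and centring the sub-bbox vertically is an assumption (the advice above only fixes the width and the 20% height).

```python
def shrink_rect(rect, keep=0.2):
    """Return a rect-like tuple with the same width as `rect`, but only
    `keep` of its height, centred vertically (the centring is an assumption)."""
    x0, y0, x1, y1 = rect
    mid = (y0 + y1) / 2
    half = (y1 - y0) * keep / 2
    return (x0, mid - half, x1, mid + half)


def redact_matches(page, needle, keep=0.2):
    """Redact every occurrence of `needle` on `page` using thin sub-bboxes."""
    import fitz  # PyMuPDF; imported here so shrink_rect() has no dependency

    # Must be set before searching, so that hit rectangles are only as
    # high as the font size.
    fitz.TOOLS.set_small_glyph_heights(True)
    for rect in page.search_for(needle):
        page.add_redact_annot(shrink_rect(rect, keep), fill=(1, 1, 1))
    page.apply_redactions()
```

If the small glyph heights alone solve the problem, calling redact_matches(page, needle, keep=1.0) leaves the hit rectangles unchanged.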

Answer selected by luclemot

luclemot Jun 14, 2023
Author

This worked ! Thanks so much.
Is there a quick way to set the fontsize to the maximum size that will fit ?

JorjMcKie Jun 14, 2023
Maintainer

What do you mean: "fontsize"?
When inserting text, or replacing text that was redacted away?

luclemot Jun 14, 2023
Author

I was talking about inserting text over the redacted textbox (but I could also just add the text during the redaction).

JorjMcKie Jun 14, 2023
Maintainer

Ok, I understand.

The general problem with letting the redaction process also insert the new text right away is that things are not under your direct control.
You can avoid this by doing the insertion yourself. Once you know the bbox of the old text:
Extract the text from this bbox only, once more, this time with all available text properties:

block = page.get_text("dict", clip=bbox)["blocks"][0]  # should only be 1 block anyway!
span = block["lines"][0]["spans"][0]  # should work under the circumstances!
origin = span["origin"]  # original insertion point
fsize = span["size"]  # orig. fontsize
font = span["font"]  # font basename
bbox = fitz.Rect(span["bbox"])  # bbox (again - you should have it already)

Then choose your output font, e.g. font = fitz.Font("helv") for Helvetica, but you are free to pick any font here.
Then compute the length of your desired new text, best with fontsize = 1 for easier computation: tl = font.text_length(newtext, fontsize=1).
Now use the bbox width to compute a new fontsize:
newfontsize = bbox.width / tl

Finally add the redaction and apply it.
Then do page.insert_text(origin, newtext, fontsize=newfontsize, fontname="helv").
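
Put together, the steps above might look like the sketch below. fit_fontsize and replace_text are hypothetical helper names. Capping the result at the original font size is an added assumption, so that a short replacement text does not grow taller than the text it replaces; the steps above only divide the bbox width by the text length.

```python
def fit_fontsize(box_width, unit_length, max_size=None):
    """Largest font size at which a text still fits into `box_width`.

    `unit_length` is the text length measured at fontsize=1.
    `max_size` is an optional cap (an assumption, not part of the steps above).
    """
    if unit_length <= 0:
        raise ValueError("text has no measurable width")
    size = box_width / unit_length
    return size if max_size is None else min(size, max_size)


def replace_text(page, bbox, newtext, fontname="helv"):
    """Redact the text inside `bbox` and write `newtext` at the old origin."""
    import fitz  # PyMuPDF; imported here so fit_fontsize() has no dependency

    # Read the old text's properties before it is redacted away.
    block = page.get_text("dict", clip=bbox)["blocks"][0]
    span = block["lines"][0]["spans"][0]
    origin = fitz.Point(span["origin"])  # original insertion point
    span_box = fitz.Rect(span["bbox"])

    # Measure the new text at fontsize=1, then scale it to the bbox width.
    font = fitz.Font(fontname)
    unit_length = font.text_length(newtext, fontsize=1)
    newsize = fit_fontsize(span_box.width, unit_length, max_size=span["size"])

    page.add_redact_annot(bbox, fill=(1, 1, 1))
    page.apply_redactions()
    page.insert_text(origin, newtext, fontsize=newsize, fontname=fontname)
    return newsize
```

Like the snippet above, this assumes the clip rectangle contains exactly one block with one line and one span; text spread over several spans would need a loop.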
