Replacing text in pdf using regex #2808

danilyef · 2023-11-15T12:59:28Z

I want to substitute text in the pdf files using the following regex:

import re

def remove_passwords(input_string):
    # Match a keyword, a separator (space, ':' or '='), and the token that follows
    pattern = re.compile(r'\b(?:pwd|password|passwort|kennwort|pw)\s*[ :=]\s*\S+', flags=re.IGNORECASE)

    # Replace each match with the literal string 'password'
    result_string = pattern.sub('password', input_string)

    return result_string
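
To make the intended behaviour concrete, here is a small sketch of what this pattern does on plain strings. The sample strings are made up for illustration; only the pattern comes from the post above.

```python
import re

# Same pattern as above: keyword, separator (space, ':' or '='), then one token.
PASSWORD_PATTERN = re.compile(
    r'\b(?:pwd|password|passwort|kennwort|pw)\s*[ :=]\s*\S+',
    flags=re.IGNORECASE,
)


def remove_passwords(input_string):
    """Replace every 'keyword + separator + token' match with 'password'."""
    return PASSWORD_PATTERN.sub('password', input_string)


# Hypothetical sample inputs, not taken from any real document.
samples = [
    "login pwd: secret123 ok",
    "Kennwort=geheim",
    "password is required",  # ordinary prose: a bare space counts as a separator
]
for s in samples:
    print(repr(s), "->", repr(remove_passwords(s)))
```

Note that the third sample is also rewritten, because a single space satisfies the separator class `[ :=]`. If that is not wanted, the separator class would have to be tightened.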

I know that the byte string returned by xref_stream has a replace method:

import fitz

fname = r"original_doc.pdf"
doc = fitz.open(fname)
page = doc.load_page(0)

for page in doc:
    for xref in page.get_contents():
        stream = doc.xref_stream(xref).replace(b'Password', b'!!!')
        doc.update_stream(xref, stream)

How can I get the same functionality as replace, but with the regex given above? If that is not possible, what workaround can you suggest?

JorjMcKie · 2023-11-15T14:15:41Z

Maintainer

This will not work as straightforwardly as you hope, but there is something close:
You need to use redaction annotations. This requires that you know the bbox (rectangle) of the text to be replaced. Then do:

rl = page.search_for("password")  # case-insensitive, potentially multiple occurrences
for bbox in rl:
    page.add_redact_annot(bbox, "!!!")  # mark this bbox to be erased; "!!!" is the replacement text
page.apply_redactions()  # this "executes" all redaction annots

I noticed that you also want to include the German versions of "password".
Simply do this:

rl = page.search_for("password")
rl.extend(page.search_for("kennwort"))
rl.extend(page.search_for("passwort"))
# now the redaction loop
for bbox in rl:
    page.add_redact_annot(bbox, "!!!")  # mark this bbox to be erased; "!!!" is the replacement text
page.apply_redactions()
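
The snippets above work on a single `page`. A minimal sketch of how they might be wrapped for a whole document follows. The function names `redact_keywords` and `redact_document` are hypothetical; the PyMuPDF calls (`search_for`, `add_redact_annot`, `apply_redactions`, `save`) are the ones used in this thread.

```python
def redact_keywords(page, keywords, replacement="!!!"):
    """Mark every hit of every keyword for redaction, then apply once per page.

    Returns the number of hit rectangles found on this page.
    """
    hits = []
    for keyword in keywords:
        hits.extend(page.search_for(keyword))  # search_for is case-insensitive
    for bbox in hits:
        page.add_redact_annot(bbox, replacement)
    if hits:
        page.apply_redactions()  # executes all redaction annots on this page
    return len(hits)


def redact_document(in_path, out_path, keywords):
    """Run redact_keywords on every page and save the result to out_path."""
    import fitz  # PyMuPDF; imported lazily so redact_keywords has no hard dependency

    doc = fitz.open(in_path)
    total = sum(redact_keywords(page, keywords) for page in doc)
    doc.save(out_path, garbage=4, deflate=True)
    doc.close()
    return total
```

A call could look like `redact_document("original_doc.pdf", "final.pdf", ["password", "kennwort", "passwort"])`, with file names chosen only for illustration.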

danilyef · 2023-11-15T16:14:45Z

Author

@JorjMcKie Thank you for your answer! I tried your solution and it works OK, but unfortunately the redaction overlaps other rectangles as well (and hence changes neighbouring text).

My idea was to use redaction annotations and extend each one a little along the x-axis (without touching other rectangles), so that it deletes the word "password" and the text after it (the actual password).

Something like this:

rl = page.search_for("password")
rl.extend(page.search_for("kennwort"))
rl.extend(page.search_for("passwort"))
# now the redaction loop
for bbox in rl:
    bbox[2] += 5  # extend the right edge (x1) by 5 points
    page.add_redact_annot(bbox, "!!!")  # mark this bbox to be erased; "!!!" is the replacement text
page.apply_redactions()

out_fname = r"final.pdf"
doc.save(out_fname, garbage=4, deflate=True)

But unfortunately it overlaps other rectangles and changes them as well (with and without bbox[2] += 5).
Do you have any further advice/recommendations?

3 replies

JorjMcKie Nov 15, 2023
Maintainer

The redaction process deletes everything overlapping any redaction rectangle. So you have the following options:

1. Let PyMuPDF compute with minimized text rectangles by globally setting fitz.TOOLS.set_small_glyph_heights(True) before doing anything else (including search). This often already solves the problem.
2. The PDF creator may have decided to pack text even closer together. In this case, you must shrink your search rectangles even more, e.g. only take the bottom 25% of each when you define the redact annot. Instead of rect, take rect + (0, 0.75*rect.height, 0, 0). This gives a thinner rect that still overlaps each of the "needle" characters, so they will be deleted, but hopefully nothing else:

In [33]: rect = fitz.Rect(100, 100, 300, 200)
In [34]: rect + (0, 0.75 * rect.height, 0, 0)
Out[34]: Rect(100.0, 175.0, 300.0, 200.0)
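
The same arithmetic, written as a small helper that works on plain `(x0, y0, x1, y1)` tuples so it can be tried without PyMuPDF. The name `bottom_strip` and the `keep` parameter are made up for this sketch.

```python
def bottom_strip(rect, keep=0.25):
    """Return the bottom `keep` fraction of a rectangle.

    `rect` is any 4-item sequence (x0, y0, x1, y1) with y growing downwards,
    as in PyMuPDF page coordinates. Only y0 is moved; x0, x1 and y1 stay put.
    """
    x0, y0, x1, y1 = rect
    height = y1 - y0
    return (x0, y0 + (1.0 - keep) * height, x1, y1)


print(bottom_strip((100, 100, 300, 200)))  # same numbers as the session above
```

With `keep=0.25` this matches `rect + (0, 0.75*rect.height, 0, 0)`. When used on real hits, the result would presumably be wrapped as `fitz.Rect(bottom_strip(hit))` before being passed to `add_redact_annot`.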

JorjMcKie Nov 15, 2023
Maintainer

To also erase a word following one of the found keywords, you could take a hit rect, enlarge it to the right page border, and extract the words inside the result.
This should give you a list of words, where the first is the "password" literal and the second is (hopefully) the password itself.
Then take the rectangle of that second item and join it with the hit rectangle to make a common redaction for both, or simply create another redact annot for the second item.
Like this (r being a hit rect of the search):

temp = +r  # copy of r
temp.x1 = page.rect.width  # extend to page border
words = page.get_text("words", clip=temp)
pw = words[1]  # the password item itself like (x0, y0, x1, y1, "secret",...)
# make another redact from pw[:4], the secret rectangle
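
A variation on this idea that brings the regex from the question back in: rebuild each text line from the `"words"` output, run the pattern on the line, and map each match back to the word rectangles it touches. This is only a sketch under assumptions. It relies on the word tuple layout `(x0, y0, x1, y1, text, block_no, line_no, word_no)`, assumes words of a line are separated by single spaces, and `password_rects` is a hypothetical helper name.

```python
import re
from collections import defaultdict

PATTERN = re.compile(
    r'\b(?:pwd|password|passwort|kennwort|pw)\s*[ :=]\s*\S+',
    flags=re.IGNORECASE,
)


def password_rects(words, pattern=PATTERN):
    """Return one (x0, y0, x1, y1) tuple per regex match.

    `words` is a list of tuples as returned by page.get_text("words").
    Each returned rectangle is the union of all words the match overlaps.
    """
    lines = defaultdict(list)
    for w in words:
        lines[(w[5], w[6])].append(w)  # group by (block_no, line_no)

    rects = []
    for line_words in lines.values():
        line_words.sort(key=lambda w: w[7])  # order by word_no
        # Rebuild the line text and remember each word's character span.
        spans, pos = [], 0
        for w in line_words:
            spans.append((pos, pos + len(w[4]), w))
            pos += len(w[4]) + 1  # +1 for the joining space
        text = " ".join(w[4] for w in line_words)

        for m in pattern.finditer(text):
            touched = [w for start, end, w in spans
                       if start < m.end() and end > m.start()]
            rects.append((
                min(w[0] for w in touched), min(w[1] for w in touched),
                max(w[2] for w in touched), max(w[3] for w in touched),
            ))
    return rects
```

On a real page this would be fed with `page.get_text("words")`, and each returned tuple could then be used for a redact annot. Matches that span two lines are not handled here.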

Answer selected by danilyef

danilyef Nov 16, 2023
Author

Thank you for your answer! First option worked great!

One small question: is it possible to make inserted "words" with the same style as neighbour words?

JorjMcKie · 2023-11-16T12:33:03Z

Maintainer

Yes - with some more effort.

1. Determine the styles of the neighbors and cover cases where these are not equal. Preferably, determine the "span" in page.get_text("dict") that contains the word to be replaced: the properties of that span dict are the ones you want.
2. When defining the redaction that erases the needle, do not specify a replacement text, so that an empty space results.
3. After having applied the redaction(s), insert the replacement text in those areas. For each span identified above, insert the new text using the font, font size and color found in the span dict. The insertion point should be derived from span["origin"]: use the y-value of the origin, but set the x-value to the x-value found in the needle rectangle.
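
The steps above could be sketched roughly as follows. All helper names are hypothetical. The span must be looked up before the redactions are applied, since the text is gone afterwards. The font is an assumption: the built-in "helv" is used as a stand-in, because the span's own font name usually cannot be reused directly.

```python
def find_span(page_dict, rect):
    """Return the first text span whose bbox intersects rect, or None.

    `page_dict` is the result of page.get_text("dict"); `rect` is any
    4-item sequence (x0, y0, x1, y1).
    """
    rx0, ry0, rx1, ry1 = rect
    for block in page_dict.get("blocks", []):
        for line in block.get("lines", []):  # image blocks have no "lines"
            for span in line["spans"]:
                sx0, sy0, sx1, sy1 = span["bbox"]
                if sx0 < rx1 and sx1 > rx0 and sy0 < ry1 and sy1 > ry0:
                    return span
    return None


def srgb_to_rgb(color_int):
    """Convert the sRGB integer in span["color"] to floats in 0..1."""
    return (
        ((color_int >> 16) & 255) / 255,
        ((color_int >> 8) & 255) / 255,
        (color_int & 255) / 255,
    )


def insert_like_span(page, rect, span, new_text, fontname="helv"):
    """Insert new_text at the needle's x and the span's baseline y.

    Call this after page.apply_redactions(); `span` must have been
    collected with find_span() before the redaction removed the text.
    """
    point = (rect[0], span["origin"][1])
    page.insert_text(
        point,
        new_text,
        fontsize=span["size"],
        fontname=fontname,  # assumption: stand-in font, not the original
        color=srgb_to_rgb(span["color"]),
    )
```

The intended order is: collect `(rect, span)` pairs for all hits, add redact annots without replacement text, apply the redactions, then call `insert_like_span` for each pair.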

BTW finding the font of the original text usually is a pain in the neck:

- Fonts in PDFs are usually subset fonts: they do not contain all characters. Your new text will likely contain characters that are not in that subset.
- This means that you will usually have to hunt down the original font file and use it in your text insertion. But where and how to find it? This is clearly not something that can be automated.
