Editing text in PDF #2269

mikkelee · 2019-02-28T05:43:07Z

mikkelee
Feb 28, 2019

Is it possible to modify and/or delete text in a PDF? For example change case of certain words, or delete certain names.

Also, is it possible to insert invisible text? (reportlab does this via text.setTextRenderMode(3))

I've been going through the documentation without any luck.

JorjMcKie · 2019-02-28T10:48:13Z

JorjMcKie
Feb 28, 2019
Maintainer

These are two completely different things:

Modifying text of a previously existing PDF
Creating new text on a (new or existing) page.

Text Modification

Fairly complicated, but I have done that for text coded in ASCII. It could probably be extended to the full Latin range (char codes < 256). Be prepared to write a little string analyzer!
You must write code that interprets the page's /Contents object(s) (there could be more than one). You can read those objects as bytes objects. Any text of the page should be found between the markers b"BT" and b"ET" (begin text / end text).
Then, within those BT/ET substrings, the actual text is enclosed in "( ...)" or "[ ...]" brackets, depending on which of the operators "Tj" or "TJ", respectively, follows the brackets.
Examples for an encoded text "Jorj X. McKie":

(Jorj X. McKie) Tj or <4a6f726a20582e204d634b6965> Tj if that text is given as a hex string (then the round brackets are replaced by "<>").

or

[(Jorj) 3.4 (X.) 3.4 (McKie)] TJ or [<4a6f726a> 3.4 <582e> 3.4 <4d634b6965>] TJ.

Characters outside the ASCII range are coded as \nnn where the three digits nnn are the octal oct(ord(c)), so you would see \344 for the German umlaut "ä". Characters "(", ")" and backslash are escaped by prefixing them with an additional backslash. Those escapes are of course unnecessary if the text is given in hex.
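The two string forms described above can be decoded with a few lines of standard-library Python. This is only a minimal sketch, assuming single-byte (Latin-1) text as in the examples; the helper names `decode_literal` and `decode_hex` are made up for illustration and are not part of PyMuPDF. Fonts with their own encoding would need an extra mapping step that is not shown here.

```python
import re

# Hypothetical helper names, for illustration only (not a PyMuPDF API).
_NAMED = {b"n": b"\n", b"r": b"\r", b"t": b"\t", b"b": b"\b", b"f": b"\f"}
_ESCAPE = re.compile(rb"\\([0-7]{1,3}|.)", re.DOTALL)


def decode_literal(body: bytes) -> str:
    """Decode the inside of a "( ... )" string, assuming Latin-1 text."""
    def unescape(match):
        esc = match.group(1)
        if esc[:1] in b"01234567":
            return bytes([int(esc, 8) & 0xFF])  # \nnn octal character code
        return _NAMED.get(esc, esc)  # \n, \t, ... or an escaped ( ) \
    return _ESCAPE.sub(unescape, body).decode("latin-1")


def decode_hex(body: bytes) -> str:
    """Decode the inside of a "< ... >" hex string, assuming Latin-1 text."""
    digits = "".join(body.decode("ascii").split())  # whitespace is allowed
    if len(digits) % 2:
        digits += "0"  # an odd number of digits is padded with a final 0
    return bytes.fromhex(digits).decode("latin-1")
```

Both helpers take the string body without its delimiters, so `decode_hex(b"4a6f726a20582e204d634b6965")` and `decode_literal(b"Jorj X. McKie")` should give the same text.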
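The difference between the variants "Tj" and "TJ" is the ability of "TJ" to control the spacing, down to each single character: the numbers (3.4 above) shift the position of the text piece that follows them. According to the PDF specification they are given in thousandths of a text space unit and are subtracted from the current position, so positive values tighten the spacing and negative values widen it. Large values can be used to e.g. achieve overprinting effects.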
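To work with a "TJ" operand, it helps to split the array into its string pieces and its numbers first. The following is a rough sketch using a regular expression; `split_tj_array` is a hypothetical name, and the pattern assumes literal strings without nested, unescaped parentheses.

```python
import re

# One token is a "( ... )" literal (backslash escapes allowed),
# a "< ... >" hex string, or a number.
_TJ_TOKEN = re.compile(
    rb"\((?:\\.|[^\\()])*\)"          # literal string
    rb"|<[0-9A-Fa-f\s]*>"             # hex string
    rb"|[-+]?(?:\d+\.?\d*|\.\d+)",    # spacing adjustment
    re.DOTALL,
)


def split_tj_array(array: bytes) -> list:
    """Return the array items: bytes for strings (still encoded, with their
    delimiters) and float for the numbers between them."""
    items = []
    for token in _TJ_TOKEN.findall(array):
        if token[:1] in (b"(", b"<"):
            items.append(token)
        else:
            items.append(float(token.decode("ascii")))
    return items
```

The string items are returned undecoded on purpose, so that a modified array can be written back with the same delimiters and numbers it had before.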

Ok, so much for the background.

To modify your text, you must hence first find it, potentially character by character, within something like the above, and replace each single one, without changing anything else.

As I said: complicated but possible. Reading and rewriting the /Contents objects is supported by PyMuPDF, and I once wrote a program that replaced given text by "xxx" of the same length (for confidentiality reasons).
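The program mentioned above is not shown in this thread. As a rough illustration of the idea only, the sketch below masks every literal string that is followed by "Tj" with the same number of "x" characters. The function name `mask_tj_strings` is hypothetical, it works on a contents stream that has already been read as bytes, and it assumes a font in which the code for "x" actually maps to a glyph (which may not hold for subset fonts). "TJ" arrays and hex strings are left untouched.

```python
import re

# A "( ... )" literal followed by the Tj operator.
_TJ_SHOW = re.compile(rb"\(((?:\\.|[^\\()])*)\)(\s*Tj\b)", re.DOTALL)
# One character inside a literal: octal escape, other escape, or plain byte.
_ONE_CHAR = re.compile(rb"\\[0-7]{1,3}|\\.|.", re.DOTALL)


def mask_tj_strings(stream: bytes) -> bytes:
    """Replace the text of each "( ... ) Tj" with x's, one per character."""
    def mask(match):
        n_chars = len(_ONE_CHAR.findall(match.group(1)))
        return b"(" + b"x" * n_chars + b")" + match.group(2)
    return _TJ_SHOW.sub(mask, stream)
```

Counting an escape such as \344 as one character keeps the number of glyphs unchanged, although the byte length of the stream may shrink.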

Invisible New Text

Not supported yet. The only approximation I can think of is choosing white text color ...

mikkelee · 2019-02-28T12:32:31Z

mikkelee
Feb 28, 2019
Author

Thanks for the detailed answer!

Perhaps the invisible text could be done via modifying contents as well? It seems that setting the text render mode amounts to appending "n Tr", where n is the mode (0-7, 3 being invisible); see the function code:
https://github.com/Distrotech/reportlab/blob/48cafb6d64ff92fd9d4f9a4dd888be6f7d55b765/src/reportlab/pdfgen/textobject.py#L345 (line 366 in particular)

JorjMcKie · 2019-02-28T12:40:54Z

JorjMcKie
Feb 28, 2019
Maintainer

yes you are right: invisible text could easily be supported by exactly what you suggest.
This is a f*ing parameter actually: a typical PDF invention making things unnecessarily complex. Some applications even use this command to simulate bold text (I believe that was 2 Tr), etc.

Whatever: I will put it on my list of enhancements for the next version.
In the meantime you can of course also insert this command manually if you want ...

Thanks for submitting this! :-)

mikkelee · 2019-02-28T12:45:06Z

mikkelee
Feb 28, 2019
Author

Hehe! So, I would add n Tr after the TJ/Tj, correct?

JorjMcKie · 2019-02-28T12:51:57Z

JorjMcKie
Feb 28, 2019
Maintainer

I think so, but safer would be right after "ET"

mikkelee · 2019-02-28T12:53:51Z

mikkelee
Feb 28, 2019
Author

Thanks, I'll experiment a bit and see 😊👍

mikkelee · 2019-02-28T13:43:34Z

mikkelee
Feb 28, 2019
Author

I was unable to get it to work at the end (before/after ET), but after inspecting some PDFs with invisible text, I got it to work by inserting Tr after Tm:

for page in pdf:
    for xref in page._getContents():
        stream = pdf._getXrefStream(xref).replace(b'Tm', b'Tm\n3 Tr')
        pdf._updateStream(xref, stream)

This is, of course, extremely hacky, but it works for my purposes...
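One risk of a plain `bytes.replace` is that it also hits the two bytes "Tm" when they occur inside a string or a name. A slightly more careful variant, sketched here on a bytes stream only, matches "Tm" as a standalone operator. The function name `make_text_invisible` is made up for this example, and the check is still heuristic: a " Tm " surrounded by whitespace inside a text string would be matched as well.

```python
import re

# "Tm" as a standalone operator: whitespace before it,
# whitespace or the end of the data after it.
_TM_OPERATOR = re.compile(rb"(?<=\s)Tm(?=\s|$)")


def make_text_invisible(stream: bytes) -> bytes:
    """Insert "3 Tr" (invisible render mode) after every Tm operator."""
    return _TM_OPERATOR.sub(b"Tm\n3 Tr", stream)
```

The result could then be written back with the same update call as in the loop above.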

Thanks again!

JorjMcKie · 2019-02-28T15:04:31Z

JorjMcKie
Feb 28, 2019
Maintainer

Ah okay. I stand corrected.
Thank you!

sunshinewithmoonlight · 2020-06-20T02:21:13Z

sunshinewithmoonlight
Jun 20, 2020

You can also take a look at this.
https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/pdf_reference_archives/PDFReference.pdf
It's near page 302.
For example :

...replace(b'Td', b' 3 Tr Td')

This method is useful to me.

LucasGrasso · 2023-02-26T18:28:30Z

LucasGrasso
Feb 26, 2023

I've stumbled upon this issue while trying to erase some text that is in a set of strings. I've built some helper functions and they worked for me. Luckily my PDF file had only TJ operators.

[(Jorj) 3.4 (X.) 3.4 (McKie)] TJ or [<4a6f726a> 3.4 <582e> 3.4 <4d634b6965>] TJ.

But I think it is rather easy to modify the code to support both.

from typing import List


def get_sub_bytestrings_between(string: bytes, start: bytes, end: bytes) -> List[bytes]:
    """Get all the substrings between two strings.
    Args:
        string (bytes): Byte to search in.
        start  (bytes): Byte to start searching.
        end  (bytes): Byte to end searching.
    Returns:
        List[bytes]: List of substrings(bytes) between start and end.
    """

    substrings: List[bytes] = list()
    string_len: int = len(string)

    start_index = 0

    while start_index < string_len:
        start_index = string.find(start, start_index)
        if start_index == -1:
            break
        start_index += len(start)
        end_index = string.find(end, start_index)
        if end_index == -1:
            break
        substrings.append(string[start_index:end_index])
        start_index = end_index + len(end)

    return substrings


def get_substrings_between(string: str, start: str, end: str) -> List[str]:
    """Get all the substrings between two strings.
    Args:
        string (str): String to search in.
        start (str): String to start searching.
        end (str): String to end searching.
    Returns:
        List[str]: List of substrings between start and end.
    """

    substrings: List[str] = list()
    string_len: int = len(string)

    start_index = 0

    while start_index < string_len:
        start_index = string.find(start, start_index)
        if start_index == -1:
            break
        start_index += len(start)
        end_index = string.find(end, start_index)
        if end_index == -1:
            break
        substrings.append(string[start_index:end_index])
        start_index = end_index + len(end)

    return substrings


def decode_TJ_text(text: str) -> str:
    """Decode the text in TJ operator.
    Args:
        text (str): Text to decode.
    Returns:
        str: Decoded text.
    """

    substrings: List[str] = get_substrings_between(
        text,
        "(",
        ")",
    )
    # check whether a substring is a single octal escape \nnn (e.g. \361 for "ñ") and decode it (the encoding is oct(ord(c)))
    for i, substring in enumerate(substrings):
        if substring.startswith("\\") and len(substring) == 4:
            substrings[i] = chr(int("0o" + substring[1:], 8))
        elif substring.startswith("\\"):
            substrings[i] = substring.replace("\\", "")

    return "".join(substrings)

And then I use them like this:

# search_strings holds the strings to be "erased" (replaced by an empty string).

for page in doc:
    for xref in page.get_contents():
        stream = doc.xref_stream(xref)
        bytestring_array: List[bytes] = get_sub_bytestrings_between(
            stream, b"BT", b"ET"
        )
        for bytestring in bytestring_array:
            try:
                decoded_text = decode_TJ_text(
                    b"".join(bytestring.split(b"TJ")).decode("utf-8")
                )
                if decoded_text in search_strings:
                    # Show an empty string instead. The surrounding newlines keep
                    # the replacement separated from the BT / ET operators.
                    stream = stream.replace(bytestring, b"\n[()] TJ\n")
                    doc.update_stream(xref, stream)
            except UnicodeDecodeError:
                # Left like this, as it didn't modify the PDF output visually.
                continue
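As for supporting both operators: one possible direction, sketched here under simplifying assumptions, is to collect the shown text of a BT/ET block with a single regular expression that covers "( ... ) Tj" as well as "[ ... ] TJ". The name `shown_text` is hypothetical, hex strings are ignored, escapes are left undecoded, and a "]" inside a string of a TJ array would confuse the pattern.

```python
import re

_SHOW = re.compile(
    rb"\[((?:\\.|[^\\\]])*)\]\s*TJ\b"      # group 1: array body before TJ
    rb"|\(((?:\\.|[^\\()])*)\)\s*Tj\b",    # group 2: literal string before Tj
    re.DOTALL,
)
_LITERAL = re.compile(rb"\(((?:\\.|[^\\()])*)\)", re.DOTALL)


def shown_text(block: bytes) -> str:
    """Concatenate the literal strings shown by Tj and TJ in one text block."""
    parts = []
    for match in _SHOW.finditer(block):
        if match.group(1) is not None:
            # TJ: keep the string pieces, drop the numbers between them.
            parts.extend(_LITERAL.findall(match.group(1)))
        else:
            parts.append(match.group(2))
    return b"".join(parts).decode("latin-1")
```

The returned text could be compared against the search strings in the same way as `decoded_text` in the loop above.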

@JorjMcKie It would be super nice if the editing feature (with no annots) was supported natively. I now know it would be a mess to implement it tho 🤣.

If you find some issues or a better solution please leave your comment!

2 replies

JorjMcKie Feb 27, 2023
Maintainer

I understand your desire!
And I hate to disappoint your expectations: we will not create a text editing feature on this (low) level. You can still use redaction annotations to make text changes - and achieve considerable results with this if you want to invest the time.
There may also be other future extensions like exporting to a Word format. This however is not in any immediate plans.

LucasGrasso Feb 27, 2023

Regardless of this, excellent package! It works really well.
