-
these are two completely different things
Text Modification

Fairly complicated, but I have done that for text coded in ASCII. It could probably be extended to the full range of Latin (char codes < 256). Be prepared to write a little string analyzer!
or
Characters outside the ASCII range are coded as

Ok, so much for the background. To modify your text, you must hence first find it, potentially character by character, within something like the above, and replace each single one, without changing anything else. As I said: complicated but possible.

Reading and rewriting the

Invisible New Text

Not supported yet. The only approximation I can think of is choosing white text color ...
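To make the text-modification idea above a bit more concrete, here is a minimal sketch of such a "string analyzer". It assumes the text is stored as plain literal strings shown with the Tj operator, one byte per character, and it works on the raw bytes of a contents stream. The function name replace_in_tj_strings is made up for illustration and is not part of PyMuPDF; hex strings, TJ arrays and fonts with custom encodings are not handled.

```python
import re

# A literal string operand followed by the Tj operator, e.g. "(Hello) Tj".
# Escaped characters such as \( or \) inside the string are skipped over.
_TJ_LITERAL = re.compile(rb"\(((?:\\.|[^\\()])*)\)(\s*)Tj")


def replace_in_tj_strings(stream: bytes, old: bytes, new: bytes) -> bytes:
    """Replace `old` by `new` inside Tj string operands only (hypothetical helper)."""

    def _sub(match):
        # Only the string body is changed; operator and whitespace are kept as-is.
        text = match.group(1).replace(old, new)
        return b"(" + text + b")" + match.group(2) + b"Tj"

    return _TJ_LITERAL.sub(_sub, stream)


sample = b"BT /Hello 12 Tf 72 720 Td (Hello World) Tj ET"
result = replace_in_tj_strings(sample, b"Hello", b"Howdy")
```

In this sample only the string operand changes; the font name /Hello is left alone because it is not inside a Tj string. The bytes would come from a page's contents stream and would have to be written back afterwards.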
-
Thanks for the detailed answer! Perhaps the invisible text could be done via modifying contents as well? It seems that setting the text render mode just means appending "n Tr", where n is the mode (0-7, 3 being invisible); see function code:
-
Yes, you are right: invisible text could easily be supported by exactly what you suggest. Anyway, I will put it on my list of enhancements for the next version. Thanks for submitting this! :-)
-
Hehe! So, I would add
-
I think so, but it would be safer right after "ET".
-
Thanks, I'll experiment a bit and see 😊👍
-
I was unable to get it to work at the end (before/after ET), but after inspecting some PDFs with invisible text, I got it to work by inserting "3 Tr" after each "Tm":

    for page in pdf:
        for xref in page._getContents():
            stream = pdf._getXrefStream(xref).replace(b'Tm', b'Tm\n3 Tr')
            pdf._updateStream(xref, stream)

This is, of course, extremely hacky, but it works for my purposes... Thanks again!
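One caveat with a plain bytes replace is that it also hits any "Tm" that happens to sit inside other tokens. A slightly more careful sketch of the same hack, assuming operators in the stream are separated by whitespace, could match "Tm" only as a standalone token. The name make_text_invisible is hypothetical, and this is still not a real content-stream parser (a " Tm " inside a string operand would be matched too).

```python
import re

# "Tm" as a whole token: preceded by whitespace, followed by whitespace or end of stream.
_TM_TOKEN = re.compile(rb"(?<=\s)Tm(?=\s|$)")


def make_text_invisible(stream: bytes, mode: int = 3) -> bytes:
    """Append "<mode> Tr" after every standalone Tm operator (hypothetical helper)."""
    if not 0 <= mode <= 7:
        raise ValueError("text render mode must be in 0..7")
    return _TM_TOKEN.sub(b"Tm\n%d Tr" % mode, stream)


sample = b"BT /F1 12 Tf 1 0 0 1 72 720 Tm (Tmp file) Tj ET"
result = make_text_invisible(sample)
```

Here the "Tm" in "(Tmp file)" is not touched, because it is preceded by "(" rather than whitespace. The returned bytes would be written back with the same stream update call used in the loop above.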
-
Ah okay. I stand corrected.
-
You can also take a look at this.
This method is useful to me.
-
I've stumbled upon this issue while trying to erase some text that is in a set of strs. I've built some helper functions and it worked for me. Luckily my PDF file had only TJ operators.
But I think it is rather easy to modify the code to support both.

    from typing import List


    def get_sub_bytestrings_between(string: bytes, start: bytes, end: bytes) -> List[bytes]:
        """Get all the substrings between two byte strings.

        Args:
            string (bytes): Bytes to search in.
            start (bytes): Bytes marking the start of a match.
            end (bytes): Bytes marking the end of a match.

        Returns:
            List[bytes]: List of substrings (bytes) between start and end.
        """
        substrings: List[bytes] = []
        string_len: int = len(string)
        start_index = 0
        while start_index < string_len:
            start_index = string.find(start, start_index)
            if start_index == -1:
                break
            # Skip past the start marker, then look for the end marker.
            start_index += len(start)
            end_index = string.find(end, start_index)
            if end_index == -1:
                break
            substrings.append(string[start_index:end_index])
            start_index = end_index + len(end)
        return substrings

    def get_substrings_between(string: str, start: str, end: str) -> List[str]:
        """Get all the substrings between two strings.

        Args:
            string (str): String to search in.
            start (str): String marking the start of a match.
            end (str): String marking the end of a match.

        Returns:
            List[str]: List of substrings between start and end.
        """
        substrings: List[str] = []
        string_len: int = len(string)
        start_index = 0
        while start_index < string_len:
            start_index = string.find(start, start_index)
            if start_index == -1:
                break
            start_index += len(start)
            end_index = string.find(end, start_index)
            if end_index == -1:
                break
            substrings.append(string[start_index:end_index])
            start_index = end_index + len(end)
        return substrings

    def decode_TJ_text(text: str) -> str:
        """Decode the text in a TJ operator.

        Args:
            text (str): Text to decode.

        Returns:
            str: Decoded text.
        """
        substrings: List[str] = get_substrings_between(
            text,
            "(",
            ")",
        )
        # Check if any substring is an octal escape \nnn (e.g. \361 for \xf1)
        # and decode it (the encoding is oct(ord(c))).
        for i, substring in enumerate(substrings):
            if substring.startswith("\\") and len(substring) == 4:
                substrings[i] = chr(int("0o" + substring[1:], 8))
            elif substring.startswith("\\"):
                substrings[i] = substring.replace("\\", "")
        return "".join(substrings)

And then I use them like this:

    # search_strings are the strings to be "erased" (replaced by "").
    for page in doc:
        for xref in page.get_contents():
            stream = doc.xref_stream(xref)
            bytestring_array: List[bytes] = get_sub_bytestrings_between(
                stream, b"BT", b"ET"
            )
            for bytestring in bytestring_array:
                try:
                    decoded_text = decode_TJ_text(
                        b"".join(bytestring.split(b"TJ")).decode("utf-8")
                    )
                    if decoded_text in search_strings:
                        stream = stream.replace(bytestring, b"() 4.0 TJ")
                        doc.update_stream(xref, stream)
                except UnicodeDecodeError:
                    # Left like this as it didn't modify the PDF output visually.
                    continue

@JorjMcKie It would be super nice if the editing feature (with no annots) was supported natively. I now know it would be a mess to implement it tho 🤣. If you find some issues or a better solution please leave your comment!
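A note on the decoding helper above: it only treats a string as an octal escape when the whole string is a single \nnn sequence. As a possible refinement, here is a sketch that decodes escapes anywhere inside the strings of a TJ array. It assumes a single-byte encoding where the byte value equals the code point (Latin-1 style), which will not hold for every font, and the name decode_tj_array is made up for illustration.

```python
import re

# Literal strings "(...)" in a TJ array; backslash-escaped characters are skipped over.
_LITERAL = re.compile(r"\(((?:\\.|[^\\()])*)\)")
# A backslash followed by 1-3 octal digits, or by any single character.
_ESCAPE = re.compile(r"\\([0-7]{1,3}|.)")
_SIMPLE = {"n": "\n", "r": "\r", "t": "\t", "b": "\b", "f": "\f"}


def decode_tj_array(text: str) -> str:
    """Join and unescape the string operands of a TJ array (hypothetical helper)."""

    def _unescape(match):
        token = match.group(1)
        if token[0] in "01234567":
            # Octal escape: assumes byte value == code point.
            return chr(int(token, 8))
        # \n, \r, ... map to control characters; \( \) \\ map to themselves.
        return _SIMPLE.get(token, token)

    parts = _LITERAL.findall(text)
    return "".join(_ESCAPE.sub(_unescape, part) for part in parts)


decoded = decode_tj_array("[(Espa) -20 (\\361a)] TJ")
```

The kerning numbers between the strings are simply ignored, as in the helper above, so the result is the concatenated text of the array.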
-
Is it possible to modify and/or delete text in a PDF? For example change case of certain words, or delete certain names.
Also, is it possible to insert invisible text? (reportlab does this via text.setTextRenderMode(3).)
I've been going through the documentation without any luck.