Cannot find text in Page.read_contents() #3019

jcgeo9 · 2024-01-12T07:06:47Z

Description of the bug

I am a bit confused. I have a PDF document that was processed with Java (the pdfbox package) to add a watermark as an overlay on the background (this is for context).
When I open the file and use get_text(), I can see the block with the watermark text, but when I use read_contents() in order to remove it, the text is missing. Can you please explain what I am missing here?
Also, is there another way to identify this overlaid text? Is there a way to remove a block returned by get_text() without having to access the contents directly?

Thanks in advance

How to reproduce the bug

Unfortunately I cannot provide the document because of sensitive data but what I am doing is:

import math

# doc is assumed to be an already opened PyMuPDF document
diagonal_blocks_in_pages = {}
for pg_id, fitzp in enumerate(doc):
    diagonal_blocks_in_pages[pg_id] = []
    for bl in fitzp.get_text("dict")["blocks"]:
        if bl["type"] != 0 or "lines" not in bl:  # text blocks only
            continue
        for line in bl["lines"]:
            # angle of the line bbox diagonal, in degrees
            x1, y1, x2, y2 = line["bbox"]
            angle = math.atan2(y2 - y1, x2 - x1) * 180 / math.pi
            if 35 < angle < 55 or 125 < angle < 145 or 215 < angle < 235 or 305 < angle < 325:
                for span in line.get("spans", []):
                    diagonal_blocks_in_pages[pg_id].append(span["text"])

if diagonal_blocks_in_pages:
    # texts that occur in diagonal lines on every page
    d = list(diagonal_blocks_in_pages.values())
    common_elements = list(set.intersection(*[set(x) for x in d]))
    for page in doc:
        for com in common_elements:
            for xref in page.get_contents():
                stream = doc.xref_stream(xref).replace(com.encode(), b" ")
                doc.update_stream(xref, stream)

PyMuPDF version

1.23.9

Operating system

Linux

Python version

3.10

JorjMcKie · 2024-01-12T07:22:27Z

Maintainer

This is not a bug, but a Discussions item.

JorjMcKie · 2024-01-12T07:32:49Z

Maintainer

Things mostly look very different from what you expect once they have landed in the /Contents. Consult one of the PDF manuals to learn the syntax of the language used there.
Text in particular will usually not be recognizable in the contents, because every character has been converted to a glyph number, which in turn points to a handful of stroking commands stored in the font file.
Text extraction can perform the required back-conversion to Unicode by a number of methods, each of which depends on the font type.

So in the end, your text is not missing - you just do not recognize it.

7 replies

JorjMcKie Jan 12, 2024
Maintainer

You can always remove text objects inside a /Contents by walking through it and deleting the sections between the markers b"BT" and b"ET". But you (mostly) cannot know what it is that you are removing.
Brute-force / trial-and-error approaches may eventually let you find your watermark this way.
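
The trial-and-error idea can be sketched on plain bytes. This is only an illustration, assuming the stream is already decompressed and that BT and ET appear as whitespace-delimited operators; the helper names and the sample stream are made up, and the pattern could misfire if those two letters occur inside string or image data.

```python
import re

# Assumption: BT and ET are whitespace-delimited operators in the stream.
TEXT_OBJECT = re.compile(rb"(?<!\S)BT(?!\S).*?(?<!\S)ET(?!\S)", re.DOTALL)


def text_object_spans(contents: bytes) -> list[tuple[int, int]]:
    """Return (start, end) byte offsets of every BT ... ET section."""
    return [m.span() for m in TEXT_OBJECT.finditer(contents)]


def without_span(contents: bytes, span: tuple[int, int]) -> bytes:
    """Return a copy of the stream with one BT ... ET section cut out."""
    start, end = span
    return contents[:start] + contents[end:]


# Hypothetical stream with two text objects.
sample = b"q\nBT\n/F0 11 Tf <0073> Tj\nET\nQ\nq\nBT\n/F1 40 Tf <0133> Tj\nET\nQ\n"
spans = text_object_spans(sample)

# One candidate stream per text object, each with a different section removed.
candidates = [without_span(sample, s) for s in spans]
```

Each candidate could then be written to a copy of the document with doc.update_stream(xref, candidate) and inspected to see whether the watermark is gone.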

jcgeo9 Jan 18, 2024
Author

Okay, then is there a way to find the text through get_text("dict") and remove it from there?
Or to change the background, either by putting in a new image or by changing its color?
Or to copy the entire text of the page line by line without the watermark text?
Any of these will do, I guess. Is any of this possible?

JorjMcKie Jan 18, 2024
Maintainer

Once text has been written on a page, with whatever coordinates and font size, it will (normally) be appended to the /Contents object.
Or a new such object is created and appended to the page's /Contents array.

One could therefore walk through the contents objects and locate the last b"BT" string (in the last such object).
Then delete the part of the string up to and including b"ET". Finally, update the respective /Contents object using its xref via doc.update_stream(xref, new_cont).

That is about the safest approach.

The condition is that your watermark was indeed created by writing text across the page ... there are multiple possible alternatives.
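
A minimal sketch of that last step, again on plain bytes. The function name and the sample stream are hypothetical, and the pattern assumes BT and ET are whitespace-delimited operators in a decompressed stream.

```python
import re

# Assumption: BT and ET are whitespace-delimited operators in the stream.
TEXT_OBJECT = re.compile(rb"(?<!\S)BT(?!\S).*?(?<!\S)ET(?!\S)", re.DOTALL)


def remove_last_text_object(contents: bytes) -> bytes:
    """Cut the last BT ... ET section out of a contents stream."""
    matches = list(TEXT_OBJECT.finditer(contents))
    if not matches:
        return contents  # nothing to remove in this stream
    start, end = matches[-1].span()
    return contents[:start] + contents[end:]


# Hypothetical stream: regular page text first, watermark appended last.
sample = b"q\nBT\n/F0 11 Tf (page text) Tj\nET\nQ\nq\nBT\n/F1 40 Tf (mark) Tj\nET\nQ\n"
cleaned = remove_last_text_object(sample)

# With PyMuPDF this might be applied roughly as follows (not executed here):
#   xref = page.get_contents()[-1]
#   new_cont = remove_last_text_object(doc.xref_stream(xref))
#   doc.update_stream(xref, new_cont)
```

If the last contents object holds no text object at all, the same function would have to be tried on the preceding xrefs.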

jcgeo9 Jan 18, 2024
Author

Then the question that comes to mind is this: if everything displayed on the page is returned by the get_text() function, how exactly does this function decode the contents? Let me elaborate.
For example, say the watermark on a page is "ABCDEFG". It cannot be found as such in the contents of the page, but the text is returned by get_text() (which is what is happening to me). How does get_text() decode the bytes from the contents into a string in order to return the "ABCDEFG" watermark?
Maybe if I understand this better, I will be able to do what needs to be done.
By the way, thanks for your quick responses so far, I appreciate it!

JorjMcKie Jan 18, 2024
Maintainer

> Maybe if I understand this better, I will be able to do what needs to be done.

The sad news is this: you won't.
The following contents write "Hello, World!" on an ISO A4 page, starting at insertion point (100, 100):

q
BT
1 0 0 1 100 742 Tm
/F0 11 Tf [<007301a9021b021b02440e85000301330244027c021b01960e89>]TJ
ET
Q
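
A side note on the "100 742" in the Tm line: it matches the insertion point (100, 100) if one assumes that the insertion point is measured from the top of the page, that the PDF operator counts from the bottom, and that the A4 page is 842 points high and unrotated. The helper below is hypothetical and only spells out that arithmetic.

```python
A4_HEIGHT_PT = 842  # assumed A4 page height in points


def to_pdf_y(y_from_top: float, page_height: float = A4_HEIGHT_PT) -> float:
    """Convert a y coordinate measured from the page top to one measured from the bottom."""
    return page_height - y_from_top


pdf_y = to_pdf_y(100)  # 842 - 100 = 742
```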

The text is contained in brackets "[< ... >]" as 4-digit hex numbers (and this is a very simple case!).
Each number does not represent a character (Unicode) but the number of a glyph: 0x0073 is a pointer to a small program inside the font's binary that produces the visual appearance of "H".
The fact that it represents the character "H" is buried in a special table for that font, the character map (CMAP), which consists of almost 2000 lines in this case.
After subsetting the font (doc.subset_fonts()), that CMAP shrinks to this:

doc.subset_fonts()
777252
page.get_fonts()
[(5, 'ttf', 'Type0', 'WYXCMS+FiraGO Regular', 'F0', 'Identity-H')]
print(doc.xref_object(5))
<<
  /Type /Font
  /Subtype /Type0
  /BaseFont /WYXCMS+FiraGO#20Regular
  /Encoding /Identity-H
  /ToUnicode 13 0 R    # a new CMAP was made by font sub-setting
  /DescendantFonts [ 17 0 R ]
>>
print(doc.xref_stream(13).decode())  # CMAP content
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo <</Registry(Adobe)/Ordering(UCS)/Supplement 0>> def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
11 beginbfchar
<0003> <0020>
<0073> <0048>   # glyph 0x73 corresponds to Unicode 0x48, chr(0x48)="H"
<0133> <0057>
<0196> <0064>
<01a9> <0065>
<021b> <006c>
<0244> <006f>
<027c> <0072>
<02e9> <00ff>
<0e85> <002c>
<0e89> <0021>
endbfchar
endcmap
CMapName currentdict /CMap defineresource pop
end
end
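
For this one simple case, the back-conversion can be sketched in a few lines. It only handles single-code "bfchar" entries with 4-digit codes, as in the CMAP above; bfrange sections, longer destination strings and fonts without a CMAP are not covered, and the function names are made up for illustration.

```python
import re

# The bfchar section of the CMAP shown above.
CMAP_TEXT = """
11 beginbfchar
<0003> <0020>
<0073> <0048>
<0133> <0057>
<0196> <0064>
<01a9> <0065>
<021b> <006c>
<0244> <006f>
<027c> <0072>
<02e9> <00ff>
<0e85> <002c>
<0e89> <0021>
endbfchar
"""


def parse_bfchar(cmap_text: str) -> dict[int, str]:
    """Map glyph numbers to characters, reading bfchar sections only."""
    mapping = {}
    for section in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.DOTALL):
        for glyph, uni in re.findall(r"<([0-9A-Fa-f]{4})>\s*<([0-9A-Fa-f]{4})>", section):
            mapping[int(glyph, 16)] = chr(int(uni, 16))
    return mapping


def decode_hex_string(hex_digits: str, glyph_to_char: dict[int, str]) -> str:
    """Split the hex string into 4-digit glyph numbers and look each one up."""
    glyphs = [int(hex_digits[i:i + 4], 16) for i in range(0, len(hex_digits), 4)]
    return "".join(glyph_to_char.get(g, "\ufffd") for g in glyphs)


glyph_to_char = parse_bfchar(CMAP_TEXT)
text = decode_hex_string("007301a9021b021b02440e85000301330244027c021b01960e89", glyph_to_char)
print(text)  # Hello, World!
```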

The problem is that the [...] TJ instruction does not always look this simple, and not all fonts have a CMAP. A CMAP may also look much more complicated. The font selected here is a TTF font; other fonts are built very differently.

Instead of a single run of hex glyph numbers inside the TJ array, you may find multiple glyph number sections interrupted by spacing instructions.
Instead of the TJ instruction, you may find the Tj instruction, which has a different format.
Instead of having a CMAP, a font may refer to a standard encoding like "WinAnsiEncoding", which needs to be deciphered in a different way. There are a handful of such encoding tables.

Please look up the details about all that in one of the Adobe manuals.
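
Two of these variations can be hinted at with a rough sketch. The TJ array and the byte string below are invented examples, the function name is hypothetical, and Python's cp1252 codec is used only as a close stand-in for WinAnsiEncoding, not as an exact equivalent.

```python
import re


def tj_hex_chunks(tj_array: bytes) -> list[str]:
    """Return the hex strings of a TJ array, skipping the spacing numbers."""
    return [h.decode("ascii") for h in re.findall(rb"<([0-9A-Fa-f]+)>", tj_array)]


# Invented TJ array: two glyph sections separated by a spacing value.
chunks = tj_hex_chunks(b"[<007301a9> -15 <021b021b0244>] TJ")

# Invented single-byte string from a simple font without a CMAP.
# Assumption: cp1252 approximates WinAnsiEncoding for these codes.
ansi_text = b"caf\xe9".decode("cp1252")
```

Literal strings in parentheses, escape sequences and the other standard encodings would each need their own handling.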
