Matching the width of existing text on PDF #3267

gokhanbektas · 2024-03-15T14:44:52Z

I am developing a tool for changing the color theme of a PDF with text and diagrams. I want to find all shapes and text on a PDF, then change "only the color" of everything.

I am new to PyMuPDF, but I managed to iterate over all shapes and redraw them in another color easily.

Now, I want to do the same with text. The PDF has embedded fonts. I can extract fonts, find font sizes, find bounding boxes of the existing text and replace them with my color preference.

There is one problem I couldn't fix effectively yet.

The horizontal spacing of the characters in the original PDF is not always the same.

When I iterate over the pages and the text, I can get the bounding box of the original text with:
span["bbox"]

I am drawing a rect at the bounding box and I can see that it perfectly matches the original text. I can also find the embedded font and the size. This is how I insert the same text with a different color (cyan):

page.insert_text(span["origin"], span["text"], fontname=fontname, fontsize=span["size"], color=(0, 1, 1), rotate=rotate)

This works fine in most cases: the inserted text matches almost 80% of the original text in the PDF document perfectly.
So I believe I am finding the original text position, font, and font size correctly.

But in many cases, the inserted text doesn't match the original text width.

Here, the inserted text is shorter. The original "red" text has wider character spacing.

In this example, it is the opposite: the inserted text exceeds the original text width.

I tried page.insert_textbox and textwriter.append methods too. But the result is almost the same. For the first case, insert_textbox doesn't insert the text at all since it doesn't fit into the given rect.

To me it doesn't look like a wrong font or wrong font size or a "scaling" error. I believe it is the spacing between the characters of a string.

So far I couldn't find anything related to this in the documentation. I decided to place characters one-by-one with the calculated spacing to fit text into the rect.

# Stretch factor that makes the inserted text as wide as the original span
text_width = myfont.text_length(span["text"], fontsize=span["size"])
bbox_width = span["bbox"][2] - span["bbox"][0]
stretch = bbox_width / text_width
x, y = span["origin"]
for c in span["text"]:
    page.insert_text((x, y), c, fontname=fontname, fontsize=span["size"], color=(0, 1, 1), rotate=rotate)
    # Advance by the glyph width, scaled to fill the bounding box
    x += myfont.text_length(c, fontsize=span["size"]) * stretch

This actually works much better:

But this doubles the PDF size, and both processing the file and even viewing it become very slow.

Is there a more elegant way to do this?

Answered by JorjMcKie

JorjMcKie · 2024-03-15T15:15:10Z

JorjMcKie
Mar 15, 2024
Maintainer

Your basic problem is that our text extraction does not deliver the inter-character spacing information.
In PDF, character positions can be set individually: sometimes the characters themselves "decide" their distance to the predecessor, sometimes a modifier shifts the current character just a bit, left or right.
Other differences stem from how justified text is implemented: if words are significantly apart from each other, they form separate spans; in other cases (distance not large enough), MuPDF leaves them in the same span.
Etc.
Whichever algorithm you choose, it won't be perfect or failsafe this way.
You probably have to resort to extracting with "rawdict", which gives you the position of every single character, and accordingly also output single characters, as you indicated.
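
As a rough illustration of the "rawdict" route, the sketch below walks the dictionary returned by page.get_text("rawdict") and re-inserts each character at its own extracted origin. The helper names iter_chars and recolor_chars are hypothetical, and the sketch assumes that a single fontname, already usable on the page (for example registered via page.insert_font), applies to every span, which a real document may well violate.

```python
def iter_chars(raw):
    """Yield (char, origin, fontsize, fontname) for each character
    in a dictionary shaped like page.get_text("rawdict")."""
    for block in raw.get("blocks", []):
        if block.get("type") != 0:  # only text blocks carry "lines"
            continue
        for line in block["lines"]:
            for span in line["spans"]:
                for ch in span["chars"]:
                    yield ch["c"], tuple(ch["origin"]), span["size"], span["font"]


def recolor_chars(page, fontname, color=(0, 1, 1)):
    """Overlay every extracted character at its own origin in a new color.

    Assumption: `page` is a PyMuPDF page and `fontname` is valid for it.
    The original text is not removed, only drawn over.
    """
    raw = page.get_text("rawdict")
    for char, origin, size, _font in iter_chars(raw):
        if char.isspace():
            continue  # nothing visible to draw
        page.insert_text(origin, char, fontname=fontname,
                         fontsize=size, color=color)
```

This still issues one insert_text call per character, so the file size and speed concerns raised in the question would remain.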

Another option is the morph parameter, which is a tuple (point, matrix).
Choose point to be the insertion point of the span, and matrix to be a horizontal scale matrix, mat = fitz.Matrix(scale, 1).
Compute scale such that the output text length exactly equals the bbox width of the span:
first compute the text length ("tl") of the text, then compute scale = bbox.width / tl, where bbox is the span's bbox as a fitz.Rect.
This will stretch or shrink the text without changing its height.
Of course, this still gives no perfect match for single characters, but at least the bbox width matches.
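
The morph approach could look roughly like the following sketch. The function names horizontal_scale and insert_scaled are hypothetical, font is assumed to be a fitz.Font matching the embedded font, and the span is assumed to be unrotated horizontal text, since only then does the bbox width correspond to the text length.

```python
def horizontal_scale(bbox, text_length):
    """Return the x-scale that stretches text_length to the bbox width."""
    x0, _y0, x1, _y1 = bbox
    if text_length <= 0:
        raise ValueError("text length must be positive")
    return (x1 - x0) / text_length


def insert_scaled(page, span, font, fontname, color=(0, 1, 1)):
    """Insert the span text in one call, scaled to the original bbox width.

    Assumptions: `page` is a PyMuPDF page, `span` comes from
    page.get_text("dict"), and `fontname` is valid for the page.
    """
    import fitz  # PyMuPDF; imported here so horizontal_scale works without it

    tl = font.text_length(span["text"], fontsize=span["size"])
    scale = horizontal_scale(span["bbox"], tl)
    origin = fitz.Point(span["origin"])
    page.insert_text(origin, span["text"], fontname=fontname,
                     fontsize=span["size"], color=color,
                     morph=(origin, fitz.Matrix(scale, 1)))
```

For instance, a span whose bbox is 100 points wide and whose unscaled text length is 80 points gets a scale of 1.25. Because the whole span goes out in a single insert_text call, this avoids the per-character overhead, at the cost of slightly stretched or squeezed glyphs.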
