Original text visible beneath translated text. #2317

chad130 · 2023-04-02T23:16:27Z

chad130
Apr 2, 2023

I'm having trouble figuring out how to identify and remove the original non-English text from a translated PDF. Is this possible? I can pull the English text using "get_text()", so maybe I have to construct a new PDF somehow?

This occurs after converting a non-searchable PDF to an image (PyMuPDF), running OCR on it (Tesseract), and converting the result back to PDF. I then run that PDF through Google Translate, and the output shows the translated text written over the original text.

pdf before translation:
page_4.pdf

pdf after translation:
page_4_tr.pdf

Answered by JorjMcKie

JorjMcKie · 2023-04-03T08:40:20Z

JorjMcKie
Apr 3, 2023
Maintainer

The translation has converted the original page to an image, upon which the translated text has been written.
So the easiest way probably is to just remove that image:

import fitz  # PyMuPDF

doc = fitz.open("page_4_tr.pdf")
page = doc[0]
print(page.get_images())
# prints: [(13, 0, 1681, 2378, 8, 'DeviceRGB', '', 'FXX1', 'DCTDecode')]
page.delete_image(13)  # 13 is the image's xref, the first item of the tuple above
doc.save("english.pdf", garbage=3, deflate=True)

This gives a page with only the English text - nothing else. Note the garbage collection and compression options.
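The snippet above hard-codes xref 13 for a single page. For a multi-page file, a minimal sketch of the same idea could loop over all pages and read each xref from the get_images() output. The helper names image_xrefs and remove_all_images are made up for illustration, and the sketch assumes every image in the file is an unwanted background; it would also delete legitimate pictures.

```python
def image_xrefs(image_list):
    """Return the unique xrefs from a Page.get_images()-style list, in order."""
    seen = []
    for item in image_list:
        xref = item[0]  # the first tuple entry is the image's xref
        if xref not in seen:
            seen.append(xref)
    return seen


def remove_all_images(src_path, dst_path):
    """Delete every image on every page, then save to dst_path.

    Assumption: all images are page backgrounds that should go away.
    """
    import fitz  # PyMuPDF; imported here so image_xrefs() works without it

    doc = fitz.open(src_path)
    for page in doc:
        for xref in image_xrefs(page.get_images()):
            page.delete_image(xref)
    # Same save options as in the answer above.
    doc.save(dst_path, garbage=3, deflate=True)
    doc.close()
```

For the file in this thread, the call would be remove_all_images("page_4_tr.pdf", "english.pdf").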

2 replies

chad130 Apr 3, 2023
Author

Maybe I'm doing something wrong, but whenever I try this method, the "page.delete_image(13)" call raises an AttributeError:
AttributeError: 'Document' object has no attribute 'is_image'

Otherwise, I get the same response from get_images() as you. Knowing it's an image is new info, so I'll play around with that and maybe I'll find another way. Any additional help you could provide would be appreciated.

JorjMcKie Apr 3, 2023
Maintainer

There was an error in an earlier release, which used a wrong method name.
Either upgrade or do Document.is_image = Document.xref_is_image.
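If upgrading is not an option, the aliasing workaround can be written defensively so that it does nothing on releases that do not need it. This is only a sketch: alias_missing_method is a hypothetical helper, not part of PyMuPDF.

```python
def alias_missing_method(cls, missing, existing):
    """Make cls.<missing> point at cls.<existing> if <missing> is absent.

    Returns True if the alias was created, False if nothing was changed.
    """
    if hasattr(cls, missing) or not hasattr(cls, existing):
        return False
    setattr(cls, missing, getattr(cls, existing))
    return True
```

With PyMuPDF imported as fitz, calling alias_missing_method(fitz.Document, "is_image", "xref_is_image") once before page.delete_image(...) has the same effect as the assignment above on an affected release.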

chad130 · 2023-04-04T03:56:23Z

chad130
Apr 4, 2023
Author

I think I'm using the latest version (1.21.1). I uninstalled, purged the cache, and reinstalled to verify. The problem seems to be with the fitz.py file. It contains the old method you referenced.

Is it possible I'm not getting the latest version of fitz.py with my pip install? I receive version 4.0.2.

Either way, the fix you suggested did the job. Alternatively, I had success changing the xref_is_image method to is_image within fitz.py.
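One way to check what pip actually installed is to ask the standard library for the distribution's version. This is a sketch; installed_version is a made-up helper name, and "PyMuPDF" is the distribution name assumed here.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


print("PyMuPDF:", installed_version("PyMuPDF"))
```

After "import fitz", printing fitz.__file__ additionally shows which file Python loaded, which helps confirm that the fitz.py being inspected is the one in use.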

0 replies
