Closed
Description of the bug
Hello. A single test page from a PDF with a text layer was translated successfully, using the script from https://medium.com/@pymupdf/translating-pdfs-a-practical-pymupdf-guide-c1c54b024042 as a template.
When processing the full 200-page document with python3 -v translator.py test.pdf, the following ValueError is raised:
Traceback (most recent call last):
  File "/home/user/pymupdf/translator.py", line 37, in <module>
    page.insert_htmlbox(
  File "/home/user/pymupdf/lib/python3.12/site-packages/pymupdf/__init__.py", line 12376, in insert_htmlbox
    raise ValueError("'text' must be a string or a Story")
ValueError: 'text' must be a string or a Story
How to reproduce the bug
Using Python 3.12.3 and PyMuPDF 1.26.6 in a venv set up with:
python3 -m venv pymupdf
source pymupdf/bin/activate
python3 -m pip install --upgrade pip
python3 -m pip install deep-translator
python3 -m pip install pymupdf
Then python3 -v translator.py test.pdf is run with this translator.py:
import pymupdf
from deep_translator import GoogleTranslator
# Define color "white"
WHITE = pymupdf.pdfcolor["white"]
# This flag ensures that text will be dehyphenated after extraction.
textflags = pymupdf.TEXT_DEHYPHENATE
# Configure the desired translator
to_english = GoogleTranslator(source="da", target="en")
# Open the document
doc = pymupdf.open("test.pdf")
# Define an Optional Content layer in the document named "English".
# Activate it by default.
ocg_xref = doc.add_ocg("English", on=True)
# Iterate over all pages
for page in doc:
    # Extract text grouped like lines in a paragraph.
    blocks = page.get_text("blocks", flags=textflags)
    # Every block of text is contained in a rectangle ("bbox")
    for block in blocks:
        bbox = block[:4]  # area containing the text
        text = block[4]  # the text of this block
        # Invoke the actual translation to deliver us an English string
        english = to_english.translate(text)
        # Cover the Danish text with a white rectangle.
        page.draw_rect(bbox, color=None, fill=WHITE, oc=ocg_xref)
        # Write the English text into the original rectangle
        page.insert_htmlbox(
            bbox, english, css="* {font-family: sans-serif;}", oc=ocg_xref
        )
doc.subset_fonts()
doc.ez_save("test_english.pdf")
After a minute or two the process crashes with the same ValueError traceback as shown in the description of the bug.
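The error message says that the value passed as the text argument of insert_htmlbox was neither a string nor a Story, so presumably to_english.translate(text) returned something other than a str (for example None) for at least one block in the full document. This has not been confirmed here. Below is a minimal sketch of a guard, under that assumption; the helper name translate_or_fallback is hypothetical and is not part of PyMuPDF or deep-translator:

```python
from typing import Callable, Optional


def translate_or_fallback(
    translate: Callable[[str], object], text: str
) -> Optional[str]:
    """Return a usable string for one text block, or None to skip it.

    Hypothetical helper: `translate` is any callable taking a string,
    e.g. the bound method to_english.translate from the script above.
    """
    # Nothing to translate in an empty or whitespace-only block.
    if not text.strip():
        return None
    result = translate(text)
    # Assumption: a failed translation comes back as a non-string
    # (such as None) or as an empty string instead of raising.
    if isinstance(result, str) and result.strip():
        return result
    # Fall back to the untranslated text so the block is not lost.
    return text
```

In the loop above this would be used as english = translate_or_fallback(to_english.translate, text), followed by "if english is None: continue" before the draw_rect and insert_htmlbox calls. Printing page.number and repr(text) whenever the fallback is taken would show which blocks trigger the problem.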
PyMuPDF version
1.26.6
Operating system
Linux
Python version
3.12