page.insert_htmlbox raises ValueError: 'text' must be a string or a Story #4794

@0nobody0

Description of the bug

Hello. A single test page from a PDF with a text layer was translated successfully, using https://medium.com/@pymupdf/translating-pdfs-a-practical-pymupdf-guide-c1c54b024042 as a template.

When the full 200-page document is processed with python3 -v translator.py test.pdf, the following ValueError is raised:

Traceback (most recent call last):
  File "/home/user/pymupdf/translator.py", line 37, in <module>
    page.insert_htmlbox(
  File "/home/user/pymupdf/lib/python3.12/site-packages/pymupdf/__init__.py", line 12376, in insert_htmlbox
    raise ValueError("'text' must be a string or a Story")
ValueError: 'text' must be a string or a Story

How to reproduce the bug

Using Python 3.12.3 and PyMuPDF 1.26.6 in a venv created with:

python3 -m venv pymupdf
source pymupdf/bin/activate
python3 -m pip install --upgrade pip
python3 -m pip install deep-translator
python3 -m pip install pymupdf

Then python3 -v translator.py test.pdf is run with this translator.py:

import pymupdf
from deep_translator import GoogleTranslator

# Define color "white"
WHITE = pymupdf.pdfcolor["white"]

# This flag ensures that text will be dehyphenated after extraction.
textflags = pymupdf.TEXT_DEHYPHENATE

# Configure the desired translator
to_english = GoogleTranslator(source="da", target="en")

# Open the document
doc = pymupdf.open("test.pdf")

# Define an Optional Content layer in the document named "English".
# Activate it by default.
ocg_xref = doc.add_ocg("English", on=True)

# Iterate over all pages
for page in doc:
    # Extract text grouped like lines in a paragraph.
    blocks = page.get_text("blocks", flags=textflags)

    # Every block of text is contained in a rectangle ("bbox")
    for block in blocks:
        bbox = block[:4]  # area containing the text
        text = block[4]  # the text of this block

        # Invoke the actual translation to deliver us an English string
        english = to_english.translate(text)

        # Cover the Danish text with a white rectangle.
        page.draw_rect(bbox, color=None, fill=WHITE, oc=ocg_xref)

        # Write the English text into the original rectangle
        page.insert_htmlbox(
            bbox, english, css="* {font-family: sans-serif;}", oc=ocg_xref
        )

doc.subset_fonts()
doc.ez_save("test_english.pdf")

After a minute or two the process crashes with:

Traceback (most recent call last):
  File "/home/user/pymupdf/translator.py", line 37, in <module>
    page.insert_htmlbox(
  File "/home/user/pymupdf/lib/python3.12/site-packages/pymupdf/__init__.py", line 12376, in insert_htmlbox
    raise ValueError("'text' must be a string or a Story")
ValueError: 'text' must be a string or a Story
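
The message indicates that the second argument passed to insert_htmlbox was neither a string nor a Story. One thing worth checking, as an untested assumption rather than a confirmed cause, is whether the translate call returned something other than a string (for example None) for some block in the larger document. A minimal sketch of a guard follows; translate_or_fallback is a hypothetical helper name, not part of PyMuPDF or deep-translator, and it accepts any callable so it can be tried without network access:

```python
from typing import Callable, Optional


def translate_or_fallback(
    translate: Callable[[str], object], text: str
) -> Optional[str]:
    """Return a translated string, the original text, or None to skip a block.

    `translate` is any callable taking a string, e.g. the bound method
    `to_english.translate` from the script above (hypothetical usage).
    """
    # Blocks containing only whitespace have nothing to translate.
    if not text.strip():
        return None

    result = translate(text)

    # Assumption: the translator may hand back a non-string (e.g. None).
    # Fall back to the untranslated text so the caller always gets a str.
    if not isinstance(result, str):
        print(
            f"translator returned {type(result).__name__} "
            f"for {text!r}; keeping original text"
        )
        return text

    return result
```

In the loop this would replace the direct call, e.g. english = translate_or_fallback(to_english.translate, text), followed by skipping the block with continue when english is None. The printed message would also show which block text triggers the non-string result, if that is indeed what happens.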

PyMuPDF version

1.26.6

Operating system

Linux

Python version

3.12
