Translation fails for Markdown cells containing large Base64-encoded image data #17

@jiange1236

Issue Description

Jupyter Translate fails to translate Markdown cells that contain large Base64-encoded image data within <img> tags, specifically when the src attribute uses the data: URI scheme. The error message indicates that the "Text length need to be between 0 and 5000 characters," suggesting that the Base64 data itself is being sent to the translation API, exceeding its character limit.

Steps to Reproduce

  1. Start from an existing notebook, e.g. https://github.com/unslothai/notebooks/nb/Qwen2.5_Coder_(1.5B)-Tool_Calling.ipynb
  2. Add a Markdown cell containing an <img> tag with a large Base64-encoded image in its src attribute. (You can generate a large Base64 image using online tools or by encoding a large image file).
    Example Markdown Cell Content:
### Example Tool Calling Diagram
<img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAoAAAAKAAQMAAACc+H/0AAAAA1BMVEUAAACIAj+pAAAAAXNSR0IArs4c6QAAAARnQU... (very long base64 string) ...AAAASUVORK5CYII=" alt="Tool Calling Diagram">
  3. Run jupyter_translate on this notebook, targeting a language (e.g., Chinese, passed as chinese).
jupyter_translate translate --input-notebook /path/to/your/notebook.ipynb --output-notebook /path/to/translated_notebook.ipynb --target-language chinese

Expected Behavior

The Markdown cell should be translated, with the <img> tag (including its src attribute) being preserved as-is, or at least the translation process should gracefully handle the large data URI without failing. Ideally, the translation service should only translate the alt text or surrounding text, not the image data itself.

Actual Behavior

The jupyter_translate command fails with an Exception: Failed to translate after 3 attempts. The console output shows repeated errors like:
Error translating: ### Tool Calling <img src="data:image/png;base64,...
Image from xx_markdown_link_xx --> Text length need to be between 0 and 5000 characters. Trying again (1/3)...

This indicates that the Base64 image data is being included in the text sent to the translation API, leading to a length limit violation.
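To check which cells are affected before running a translation, something like the following could help. It is only a diagnostic sketch, not part of jupyter_translate: the function names and the regular expression are hypothetical, and the 5000-character limit is taken from the error message above. It relies only on the standard notebook JSON layout (a top-level "cells" list whose entries have "cell_type" and "source").

```python
import json
import re

# Hypothetical pattern: a Base64 data: URI such as data:image/png;base64,AAAA...
DATA_URI_RE = re.compile(r"data:[\w/+.-]+;base64,[A-Za-z0-9+/=]+")

# Limit quoted in the error message reported above.
API_CHAR_LIMIT = 5000


def load_notebook(path):
    """Read an .ipynb file (plain JSON) into a dict."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def find_oversized_markdown_cells(notebook, limit=API_CHAR_LIMIT):
    """Return (cell_index, total_chars, data_uri_chars) for Markdown cells over the limit."""
    results = []
    for index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "markdown":
            continue
        source = cell.get("source", "")
        # "source" may be stored as a single string or a list of lines.
        text = "".join(source) if isinstance(source, list) else source
        uri_chars = sum(len(m.group(0)) for m in DATA_URI_RE.finditer(text))
        if len(text) > limit:
            results.append((index, len(text), uri_chars))
    return results
```

Calling find_oversized_markdown_cells(load_notebook("/path/to/your/notebook.ipynb")) lists the cells that would exceed the limit and how many of their characters come from data: URIs, which makes it easy to confirm that the image data, not the prose, is what pushes the cell over.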

Environment

  • Python Version: Python 3.11.13

Possible Solutions/Suggestions

  • Implement a mechanism to identify and ignore data: URIs within <img> tags during the text extraction phase for translation.
  • Provide a configuration option to exclude specific HTML attributes (like src in <img> tags if it's a data: URI) from being sent to the translation service.
  • Document this limitation and recommend users convert Base64-encoded images to external image files when using jupyter_translate.
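The first suggestion could be sketched roughly as follows. This is a minimal illustration under assumptions, not the tool's actual code: mask_data_uris, restore_data_uris, and the xx_data_uri_N_xx placeholder format are all hypothetical names, loosely modelled on the xx_markdown_link_xx token visible in the log output. The idea is to swap each data: URI for a short placeholder before the text is sent for translation, then put the original URI back afterwards.

```python
import re

# Hypothetical pattern: a Base64 data: URI such as data:image/png;base64,AAAA...
DATA_URI_RE = re.compile(r"data:[\w/+.-]+;base64,[A-Za-z0-9+/=]+")


def mask_data_uris(text):
    """Replace each data: URI with a short placeholder; return (masked_text, stash)."""
    stash = []

    def _stash(match):
        stash.append(match.group(0))
        # Placeholder format is an assumption made for this sketch.
        return f"xx_data_uri_{len(stash) - 1}_xx"

    masked = DATA_URI_RE.sub(_stash, text)
    return masked, stash


def restore_data_uris(text, stash):
    """Put the original data: URIs back in place of their placeholders."""
    for index, uri in enumerate(stash):
        text = text.replace(f"xx_data_uri_{index}_xx", uri)
    return text
```

The masked text is what would be passed to the translation call, and restore_data_uris would be applied to the result. One caveat is that a translation service may alter or drop placeholder tokens, so a real implementation would probably want to verify that every placeholder survived before restoring.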

Thank you for your time and consideration.
