Issue Description
Jupyter Translate fails to translate Markdown cells that contain large Base64-encoded image data within <img> tags, specifically when the src attribute uses the data: URI scheme. The error message indicates that the "Text length need to be between 0 and 5000 characters," suggesting that the Base64 data itself is being sent to the translation API, exceeding its character limit.
Steps to Reproduce
- https://github.com/unslothai/notebooks/nb/Qwen2.5_Coder_(1.5B)-Tool_Calling.ipynb
- Add a Markdown cell containing an <img> tag with a large Base64-encoded image in its src attribute. (You can generate a large Base64 image using online tools or by encoding a large image file.)
Example Markdown Cell Content:
### Example Tool Calling Diagram
<img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAoAAAAKAAQMAAACc+H/0AAAAA1BMVEUAAACIAj+pAAAAAXNSR0IArs4c6QAAAARnQU... (very long base64 string) ...AAAASUVORK5CYII=" alt="Tool Calling Diagram">
- Run jupyter_translate on this notebook, targeting a language (e.g., Chinese, chinese):
jupyter_translate translate --input-notebook /path/to/your/notebook.ipynb --output-notebook /path/to/translated_notebook.ipynb --target-language chinese
Expected Behavior
The Markdown cell should be translated, with the <img> tag (including its src attribute) being preserved as-is, or at least the translation process should gracefully handle the large data URI without failing. Ideally, the translation service should only translate the alt text or surrounding text, not the image data itself.
Actual Behavior
The jupyter_translate command fails with an Exception: Failed to translate after 3 attempts. The console output shows repeated errors like:
Error translating: ### Tool Calling <img src="data:image/png;base64,...
Image from xx_markdown_link_xx --> Text length need to be between 0 and 5000 characters. Trying again (1/3)...
This indicates that the Base64 image data is being included in the text sent to the translation API, leading to a length limit violation.
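A rough way to confirm which cells are affected is to scan the notebook JSON for Markdown cells that embed a data: URI in an <img> tag and exceed the limit. The sketch below is only a diagnostic: the 5000-character threshold is taken from the error message above, the function name is hypothetical, and it assumes each Markdown cell is sent to the API as a single piece of text, which may not match how jupyter_translate actually splits its input.

```python
import json
import re

# Assumed limit, taken from the error message quoted in this issue.
MAX_CHARS = 5000

# Matches an <img> tag whose src attribute starts with "data:".
DATA_URI_IMG = re.compile(
    r'<img\b[^>]*\bsrc\s*=\s*["\']data:[^"\']*["\'][^>]*>',
    re.IGNORECASE,
)


def find_oversized_markdown_cells(notebook: dict, max_chars: int = MAX_CHARS) -> list:
    """Return indices of Markdown cells that are too long and embed a data: URI image."""
    hits = []
    for index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "markdown":
            continue
        source = cell.get("source", "")
        # In .ipynb files, "source" may be a string or a list of lines.
        if isinstance(source, list):
            source = "".join(source)
        if len(source) > max_chars and DATA_URI_IMG.search(source):
            hits.append(index)
    return hits


if __name__ == "__main__":
    import sys

    # Usage: python find_cells.py /path/to/your/notebook.ipynb
    if len(sys.argv) > 1:
        with open(sys.argv[1], encoding="utf-8") as handle:
            print(find_oversized_markdown_cells(json.load(handle)))
```

Any index it prints points at a cell that is likely to trigger the length error.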
Environment
- Python Version: Python 3.11.13
Possible Solutions/Suggestions
- Implement a mechanism to identify and ignore data: URIs within <img> tags during the text extraction phase for translation.
- Provide a configuration option to exclude specific HTML attributes (like src in <img> tags if it's a data: URI) from being sent to the translation service.
- Document this limitation and recommend users convert Base64-encoded images to external image files when using jupyter_translate.
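To illustrate the first suggestion, here is a minimal sketch of masking data: URIs with short placeholders before translation and restoring them afterwards. It is not taken from the jupyter_translate code base: the function names and the xx_data_uri_N_xx placeholder format are hypothetical (loosely modeled on the xx_markdown_link_xx token visible in the log above), and it assumes the translation service returns such placeholders unchanged.

```python
import re

# Base64 data: URI, e.g. data:image/png;base64,iVBORw0...
DATA_URI = re.compile(r"data:[\w.+-]+/[\w.+-]+;base64,[A-Za-z0-9+/=]+")


def mask_data_uris(text: str):
    """Replace each Base64 data: URI with a short placeholder.

    Returns the masked text and the list of original URIs, in order.
    """
    stash = []

    def _swap(match):
        stash.append(match.group(0))
        # Hypothetical placeholder format; assumed to survive translation.
        return f"xx_data_uri_{len(stash) - 1}_xx"

    return DATA_URI.sub(_swap, text), stash


def restore_data_uris(text: str, stash: list) -> str:
    """Put the original data: URIs back in place of their placeholders."""
    for index, uri in enumerate(stash):
        text = text.replace(f"xx_data_uri_{index}_xx", uri)
    return text
```

With this approach only the heading, alt text, and surrounding prose would reach the translation API, so a cell like the example above would stay well under the 5000-character limit. If the service alters the placeholder, the restore step silently leaves it in the output, so a real implementation would likely need to check for that.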
Thank you for your time and consideration.