
Conversation


@larryliu0820 larryliu0820 commented Sep 12, 2025

This pull request introduces Python bindings for the ExecuTorch MultimodalRunner, enabling Python users to run multimodal LLM inference (supporting text, image, and audio inputs) and generate text outputs. The changes include new build system integration, a detailed implementation plan and documentation, and a high-level Python API with robust input handling and error management.

Python Bindings Implementation:

  • Added a new high-level Python API in __init__.py for the MultimodalRunner, providing user-friendly methods for text and image input creation, text generation (with or without streaming callbacks), and resource management. The API includes comprehensive input validation, support for multiple image formats (file path, NumPy array, PIL), and fallback mechanisms if dependencies are missing (see the input-creation sketch after this list).
  • Implemented robust error handling: if the C++ extension is not built, placeholder classes and functions raise informative exceptions, guiding users to rebuild with Python bindings enabled.
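
A minimal sketch of building inputs through this API, assuming make_image_input accepts NumPy arrays and PIL images directly and that generate takes an optional streaming callback; the exact accepted argument types and the token_callback keyword are assumptions, not the confirmed bindings.

import numpy as np
from PIL import Image

from executorch.extension.llm.runner import make_image_input, make_text_input

# A raw HWC uint8 array stands in for real pixel data.
pixels = np.zeros((336, 336, 3), dtype=np.uint8)

# Image inputs from the formats listed above; exact accepted types are assumed.
img_from_numpy = make_image_input(pixels)
img_from_pil = make_image_input(Image.fromarray(pixels))
# img_from_path = make_image_input("view.jpg")  # needs a real file on disk

txt = make_text_input("Describe the image.")

# Streaming generation; the token_callback keyword is hypothetical:
# runner.generate([img_from_numpy, txt], config, token_callback=lambda t: print(t, end=""))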

Build System Integration:

  • Updated CMakeLists.txt to add a pybind11-based Python extension module (_llm_runner) when EXECUTORCH_BUILD_PYBIND is set, linking all necessary dependencies and setting up include paths (a quick import sanity check follows below).
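
As a quick sanity check after building with EXECUTORCH_BUILD_PYBIND enabled, the compiled extension can be imported directly; the full module path below (executorch.extension.llm.runner._llm_runner) is an assumption based on the package and target names mentioned in this PR.

import importlib

# Try to load the compiled pybind11 extension; the module path is assumed
# from the _llm_runner target name and the Python package used in the example below.
try:
    importlib.import_module("executorch.extension.llm.runner._llm_runner")
    print("pybind extension is available")
except ImportError as err:
    print(f"pybind extension missing; rebuild with EXECUTORCH_BUILD_PYBIND enabled: {err}")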

Documentation and Planning:

  • Added a Python API section to README.md.

Utility and Extensibility:

  • Exposed utility functions (load_image_from_file, preprocess_image, create_generation_config) for easier input preprocessing and configuration from Python (a usage sketch follows below).
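
A minimal usage sketch for these utilities; the function names come from this PR, but the parameters and the placeholder path are assumptions and may differ from the actual API.

from executorch.extension.llm.runner import (
    create_generation_config,
    load_image_from_file,
    preprocess_image,
)

# "view.jpg" is a placeholder path for illustration only.
image = load_image_from_file("view.jpg")

# Resize/normalize the raw image for the model; any extra options are assumed.
image = preprocess_image(image)

# Build a GenerationConfig; max_new_tokens mirrors the example below,
# and any other keyword arguments are assumptions.
config = create_generation_config(max_new_tokens=100)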

Testing and Examples (Planned):

  • Added test_runner_pybindings.py (a sketch of one such check follows below).
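
The test file's contents are not shown in this description; the following is only a sketch of the kind of round-trip check it could contain, using the is_text, is_audio, and get_text bindings quoted in the review thread below.

from executorch.extension.llm.runner import make_text_input

def test_text_input_roundtrip():
    # A text input should report is_text() and return the original string.
    inp = make_text_input("hello")
    assert inp.is_text()
    assert not inp.is_audio()
    assert inp.get_text() == "hello"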

Code Snippet of How to Use:

from executorch.extension.llm.runner import (
    GenerationConfig,
    MultimodalRunner,
    make_image_input,
    make_text_input,
)
from transformers import AutoProcessor
model_id = "google/gemma-3-4b-it"
processor = AutoProcessor.from_pretrained(model_id)
image_url = "https://llava-vl.github.io/static/images/view.jpg"
conversation = [
    {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]},
    {
        "role": "user",
        "content": [
            {"type": "image", "url": image_url},
            {
                "type": "text",
                "text": "What are the things I should be cautious about when I visit here?",
            },
        ],
    },
]
inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
)

# Combine text and image segments in conversation order.
inputs_combined = [
    make_text_input("<bos><start_of_turn>user\nYou are a helpful assistant.\n\n"),
    make_image_input(inputs["pixel_values"]),
    make_text_input("What are the things I should be cautious about when I visit here?<end_of_turn>\n"),
]

runner = MultimodalRunner(
    "/Volumes/larryliu/work/optimum-executorch/model/model.pte",
    "/Volumes/larryliu/work/optimum-executorch/model/tokenizer.model",
    None,
)
config = GenerationConfig()
config.max_new_tokens = 100
runner.generate(inputs_combined, config)

Output from console:

[multimodal_runner.cpp:88] RSS after loading model: 0.000000 MiB (0 if unsupported)
[multimodal_runner.cpp:109] Prefilling input 0/3, type: text
[util.h:125] second_input_sizes[0] = 1023
[multimodal_runner.cpp:109] Prefilling input 1/3, type: image
[multimodal_prefiller.cpp:87] Image tensor dim: 4, dtype: Float
[util.h:125] second_input_sizes[0] = 1023
[multimodal_runner.cpp:109] Prefilling input 2/3, type: text
[util.h:125] second_input_sizes[0] = 1023
What are the things I should be cautious about when I visit here?<end_of_turn>


You'
[multimodal_runner.cpp:127] RSS after multimodal input processing: 0.000000 MiB (0 if unsupported)
[multimodal_runner.cpp:139] Max new tokens resolved: 100, pos_ 669, max_context_len 2048
re absolutely right to focus on the weather – it's the key factor here! Let’s delve deeper into what you should be cautious about when visiting this location, and how to prepare.

**1. Weather & Terrain – Expanded:**

*   **Snow & Ice:** As we discussed, there’s a significant risk of heavy snowfall and ice formation. This can make trails treacherous, and create hazardous conditions on the pier itself.
*   **Terrain Stability:** The
PyTorchObserver {"prompt_tokens":669,"generated_tokens":99,"model_load_start_ms":1758178599491,"model_load_end_ms":1758178601788,"inference_start_ms":1758178629348,"inference_end_ms":1758178649749,"prompt_eval_end_ms":1758178642009,"first_token_ms":1758178642009,"aggregate_sampling_time_ms":117,"SCALING_FACTOR_UNITS_PER_SECOND":1000}
[stats.h:108] 	Prompt Tokens: 669    Generated Tokens: 99
[stats.h:114] 	Model Load Time:		2.297000 (seconds)
[stats.h:124] 	Total inference time:		20.401000 (seconds)		 Rate: 	4.852703 (tokens/second)
[stats.h:132] 		Prompt evaluation:	12.661000 (seconds)		 Rate: 	52.839428 (tokens/second)
[stats.h:143] 		Generated 99 tokens:	7.740000 (seconds)		 Rate: 	12.790698 (tokens/second)
[stats.h:151] 	Time to first generated token:	12.661000 (seconds)
[stats.h:158] 	Sampling time over 768 tokens:	0.117000 (seconds)

cc @mergennachin @cccclai @helunwencser @jackzhxng


pytorch-bot bot commented Sep 12, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/14285

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 3 Pending, 1 Unrelated Failure

As of commit f8ace7d with merge base c9f46e2:

NEW FAILURE - The following job has failed:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the CLA Signed label on Sep 12, 2025
@larryliu0820 larryliu0820 added the module: llm and release notes: llm labels on Sep 13, 2025
@larryliu0820 larryliu0820 changed the title Add pybindings for LLM runners Add pybindings for multimodal LLM runners Sep 15, 2025
@larryliu0820 larryliu0820 changed the title Add pybindings for multimodal LLM runners Add pybindings for multimodal LLM runner Sep 15, 2025
ValueError: If the image format is not supported
FileNotFoundError: If the image file doesn't exist
"""
if isinstance(image, (str, Path)):

Contributor:

shouldn't you use the CV preprocessing utils function?

Contributor Author:

Yeah, let me fix that. Recent updates made sure it works with Gemma3, exported using optimum-et.

@larryliu0820 larryliu0820 force-pushed the mm_runner_ext branch 2 times, most recently from a23bee6 to f68cc69 on September 19, 2025 at 21:57

@JacobSzwejbka JacobSzwejbka left a comment:

.

.def("is_audio", &MultimodalInput::is_audio)
.def("is_raw_audio", &MultimodalInput::is_raw_audio)
.def(
"get_text",

Contributor:

Not totally convinced all these getter implementations are correct

print(f"Image: {image.width}x{image.height}x{image.channels}")

# Check input types safely
if text_input.is_text():

Contributor:

Would a user ever need to do this?

"""Reset the conversation state"""
self.runner.reset()

# Usage

Contributor:

Should this section just be a demo.py?

Contributor Author:

Yeah, it would be good to have a notebook, but I'll leave it here for now.

@larryliu0820 larryliu0820 merged commit c7596ba into main Sep 23, 2025
412 of 419 checks passed
@larryliu0820 larryliu0820 deleted the mm_runner_ext branch September 23, 2025 21:08
@jackzhxng

🎉
