Add pybindings for multimodal LLM runner (pytorch#14285)
This pull request introduces Python bindings for the ExecuTorch
MultimodalRunner, enabling Python users to run multimodal LLM inference
(supporting text, image, and audio inputs) and generate text outputs.
The changes include new build system integration, a detailed
implementation plan and documentation, and a high-level Python API with
robust input handling and error management.
**Python Bindings Implementation:**
* Added a new high-level Python API in `__init__.py` for the
MultimodalRunner, providing user-friendly methods for creating text and
image inputs, generating text (with or without streaming callbacks), and
managing resources. The API includes comprehensive input validation,
support for multiple image formats (file path, NumPy array, PIL; see the
first sketch after this list), and fallback mechanisms if dependencies
are missing.
* Implemented robust error handling: if the C++ extension is not built,
placeholder classes and functions raise informative exceptions that guide
users to rebuild with Python bindings enabled (see the second sketch
after this list).
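A hedged sketch of the multi-format image handling described above; only the `make_image_input` name comes from this PR, and the accepted types, dtypes, and layouts shown are assumptions:
```python
# Hedged sketch: the PR describes make_image_input as accepting a file path,
# a NumPy array, or a PIL image; the exact dtypes/layouts are assumptions.
import numpy as np
from PIL import Image

from executorch.extension.llm.runner import make_image_input

img_from_path = make_image_input("view.jpg")               # file path (assumed form)
img_from_array = make_image_input(
    np.zeros((224, 224, 3), dtype=np.uint8)                # HWC uint8 array (assumed layout)
)
img_from_pil = make_image_input(Image.open("view.jpg"))    # PIL image (assumed)
```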
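And a minimal sketch of the fallback behavior, assuming a conventional try/except import pattern (the module layout is an assumption; the `_llm_runner` extension name and `EXECUTORCH_BUILD_PYBIND` flag appear elsewhere in this PR):
```python
# Minimal sketch of the fallback pattern; the exact module layout is assumed.
try:
    from executorch.extension.llm.runner._llm_runner import MultimodalRunner
except ImportError:
    class MultimodalRunner:  # placeholder that fails with actionable guidance
        def __init__(self, *args, **kwargs):
            raise RuntimeError(
                "The _llm_runner C++ extension is not built. Rebuild ExecuTorch "
                "with EXECUTORCH_BUILD_PYBIND enabled to use the LLM runner bindings."
            )
```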
**Build System Integration:**
* Updated `CMakeLists.txt` to add a `pybind11`-based Python extension
module (`_llm_runner`) when `EXECUTORCH_BUILD_PYBIND` is set, linking
all necessary dependencies and setting up include paths.
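A hedged sketch of what such wiring could look like; the source file and linked targets are assumptions, with only `_llm_runner`, `EXECUTORCH_BUILD_PYBIND`, and `pybind11` taken from this PR:
```cmake
# Hedged sketch of the described integration; source file and linked target
# names are assumptions, not the PR's actual CMakeLists.txt contents.
if(EXECUTORCH_BUILD_PYBIND)
  pybind11_add_module(_llm_runner extension/llm/runner/pybindings.cpp)
  target_link_libraries(_llm_runner PRIVATE extension_llm_runner)
endif()
```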
**Documentation and Planning:**
* Added a Python API section to `README.md`.
**Utility and Extensibility:**
* Exposed utility functions (`load_image_from_file`, `preprocess_image`,
`create_generation_config`) for easier input preprocessing and
configuration from Python.
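A hedged usage sketch for these helpers; the three names come from this PR, but the signatures and return types shown are assumptions:
```python
# Hedged usage sketch; signatures and return types below are assumptions.
from executorch.extension.llm.runner import (
    create_generation_config,
    load_image_from_file,
    preprocess_image,
)

image = load_image_from_file("view.jpg")                # assumed: decodes the image file
image = preprocess_image(image)                         # assumed: resizes/normalizes for the model
config = create_generation_config(max_new_tokens=100)   # assumed keyword argument
```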
**Testing and Examples (Planned):**
* Added `test_runner_pybindings.py`.
**Code Snippet of How to Use:**
```python
from executorch.extension.llm.runner import (
    GenerationConfig,
    MultimodalRunner,
    make_image_input,
    make_text_input,
)
from transformers import AutoProcessor
model_id = "google/gemma-3-4b-it"
processor = AutoProcessor.from_pretrained(model_id)
image_url = "https://llava-vl.github.io/static/images/view.jpg"
conversation = [
    {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]},
    {
        "role": "user",
        "content": [
            {"type": "image", "url": image_url},
            {
                "type": "text",
                "text": "What are the things I should be cautious about when I visit here?",
            },
        ],
    },
]
# Apply the chat template; the processor output is used below for its
# preprocessed image tensor (pixel_values).
inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
)
# Build the multimodal prompt as an ordered list of text and image inputs.
inputs_combined = [
    make_text_input("<bos><start_of_turn>user\nYou are a helpful assistant.\n\n"),
    make_image_input(inputs["pixel_values"]),
    make_text_input("What are the things I should be cautious about when I visit here?<end_of_turn>\n"),
]
# Paths to the exported .pte model and its tokenizer.
runner = MultimodalRunner(
    "/Volumes/larryliu/work/optimum-executorch/model/model.pte",
    "/Volumes/larryliu/work/optimum-executorch/model/tokenizer.model",
    None,
)
config = GenerationConfig()
config.max_new_tokens = 100
runner.generate(inputs_combined, config)
```
Output from console:
```
[multimodal_runner.cpp:88] RSS after loading model: 0.000000 MiB (0 if unsupported)
[multimodal_runner.cpp:109] Prefilling input 0/3, type: text
[util.h:125] second_input_sizes[0] = 1023
[multimodal_runner.cpp:109] Prefilling input 1/3, type: image
[multimodal_prefiller.cpp:87] Image tensor dim: 4, dtype: Float
[util.h:125] second_input_sizes[0] = 1023
[multimodal_runner.cpp:109] Prefilling input 2/3, type: text
[util.h:125] second_input_sizes[0] = 1023
What are the things I should be cautious about when I visit here?<end_of_turn>
You'
[multimodal_runner.cpp:127] RSS after multimodal input processing: 0.000000 MiB (0 if unsupported)
[multimodal_runner.cpp:139] Max new tokens resolved: 100, pos_ 669, max_context_len 2048
re absolutely right to focus on the weather – it's the key factor here! Let’s delve deeper into what you should be cautious about when visiting this location, and how to prepare.
**1. Weather & Terrain – Expanded:**
* **Snow & Ice:** As we discussed, there’s a significant risk of heavy snowfall and ice formation. This can make trails treacherous, and create hazardous conditions on the pier itself.
* **Terrain Stability:** The
PyTorchObserver {"prompt_tokens":669,"generated_tokens":99,"model_load_start_ms":1758178599491,"model_load_end_ms":1758178601788,"inference_start_ms":1758178629348,"inference_end_ms":1758178649749,"prompt_eval_end_ms":1758178642009,"first_token_ms":1758178642009,"aggregate_sampling_time_ms":117,"SCALING_FACTOR_UNITS_PER_SECOND":1000}
[stats.h:108] Prompt Tokens: 669 Generated Tokens: 99
[stats.h:114] Model Load Time: 2.297000 (seconds)
[stats.h:124] Total inference time: 20.401000 (seconds) Rate: 4.852703 (tokens/second)
[stats.h:132] Prompt evaluation: 12.661000 (seconds) Rate: 52.839428 (tokens/second)
[stats.h:143] Generated 99 tokens: 7.740000 (seconds) Rate: 12.790698 (tokens/second)
[stats.h:151] Time to first generated token: 12.661000 (seconds)
[stats.h:158] Sampling time over 768 tokens: 0.117000 (seconds)
```
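The API description above also mentions streaming callbacks; a hypothetical variant of the final `generate` call (the `token_callback` parameter name is an assumption, not confirmed by this PR):
```python
# Hypothetical streaming usage; `token_callback` is an assumed parameter name.
def on_token(token: str) -> None:
    print(token, end="", flush=True)

runner.generate(inputs_combined, config, token_callback=on_token)
```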
cc @mergennachin @cccclai @helunwencser @jackzhxng
"text": "A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.",
211
+
}
212
+
],
213
+
},
214
+
{
215
+
"role": "user",
216
+
"content": [
217
+
{"type": "image", "url": image_url},
218
+
{
219
+
"type": "text",
220
+
"text": "What are the things I should be cautious about when I visit here?",
0 commit comments