Commit c7596ba
Add pybindings for multimodal LLM runner (pytorch#14285)
This pull request introduces Python bindings for the ExecuTorch MultimodalRunner, enabling Python users to run multimodal LLM inference (supporting text, image, and audio inputs) and generate text outputs. The changes include new build system integration, a detailed implementation plan and documentation, and a high-level Python API with robust input handling and error management.

**Python Bindings Implementation:**

* Added a new high-level Python API in `__init__.py` for the MultimodalRunner, providing user-friendly methods for text and image input creation, text generation (with or without streaming callbacks), and resource management. The API includes comprehensive input validation, support for multiple image formats (file path, NumPy array, PIL), and fallback mechanisms if dependencies are missing.
* Implemented robust error handling: if the C++ extension is not built, placeholder classes and functions raise informative exceptions, guiding users to rebuild with Python bindings enabled.

**Build System Integration:**

* Updated `CMakeLists.txt` to add a `pybind11`-based Python extension module (`_llm_runner`) when `EXECUTORCH_BUILD_PYBIND` is set, linking all necessary dependencies and setting up include paths.

**Documentation and Planning:**

* Added a Python API section to `README.md`.

**Utility and Extensibility:**

* Exposed utility functions (`load_image_from_file`, `preprocess_image`, `create_generation_config`) for easier input preprocessing and configuration from Python.

**Testing and Examples (Planned):**

* Added `test_runner_pybindings.py`.

**Code Snippet of How to Use:**

```python
from executorch.extension.llm.runner import MultimodalRunner, GenerationConfig, make_image_input, make_text_input
from transformers import AutoProcessor

model_id = "google/gemma-3-4b-it"
processor = AutoProcessor.from_pretrained(model_id)
image_url = "https://llava-vl.github.io/static/images/view.jpg"
conversation = [
    {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]},
    {
        "role": "user",
        "content": [
            {"type": "image", "url": image_url},
            {
                "type": "text",
                "text": "What are the things I should be cautious about when I visit here?",
            },
        ],
    },
]
inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
)

inputs_combined = [
    make_text_input("<bos><start_of_turn>user\nYou are a helpful assistant.\n\n"),
    make_image_input(inputs["pixel_values"]),
    make_text_input("What are the things I should be cautious about when I visit here?<end_of_turn>\n"),
]

runner = MultimodalRunner(
    "/Volumes/larryliu/work/optimum-executorch/model/model.pte",
    "/Volumes/larryliu/work/optimum-executorch/model/tokenizer.model",
    None,
)
config = GenerationConfig()
config.max_new_tokens = 100
runner.generate(inputs_combined, config)
```

Output from console:

```
[multimodal_runner.cpp:88] RSS after loading model: 0.000000 MiB (0 if unsupported)
[multimodal_runner.cpp:109] Prefilling input 0/3, type: text
[util.h:125] second_input_sizes[0] = 1023
[multimodal_runner.cpp:109] Prefilling input 1/3, type: image
[multimodal_prefiller.cpp:87] Image tensor dim: 4, dtype: Float
[util.h:125] second_input_sizes[0] = 1023
[multimodal_runner.cpp:109] Prefilling input 2/3, type: text
[util.h:125] second_input_sizes[0] = 1023
What are the things I should be cautious about when I visit here?<end_of_turn>
You'
[multimodal_runner.cpp:127] RSS after multimodal input processing: 0.000000 MiB (0 if unsupported)
[multimodal_runner.cpp:139] Max new tokens resolved: 100, pos_ 669, max_context_len 2048
re absolutely right to focus on the weather – it's the key factor here! Let’s delve deeper into what you should be cautious about when visiting this location, and how to prepare.

**1. Weather & Terrain – Expanded:**

* **Snow & Ice:** As we discussed, there’s a significant risk of heavy snowfall and ice formation. This can make trails treacherous, and create hazardous conditions on the pier itself.
* **Terrain Stability:** The
PyTorchObserver {"prompt_tokens":669,"generated_tokens":99,"model_load_start_ms":1758178599491,"model_load_end_ms":1758178601788,"inference_start_ms":1758178629348,"inference_end_ms":1758178649749,"prompt_eval_end_ms":1758178642009,"first_token_ms":1758178642009,"aggregate_sampling_time_ms":117,"SCALING_FACTOR_UNITS_PER_SECOND":1000}
[stats.h:108] Prompt Tokens: 669    Generated Tokens: 99
[stats.h:114] Model Load Time: 2.297000 (seconds)
[stats.h:124] Total inference time: 20.401000 (seconds)  Rate: 4.852703 (tokens/second)
[stats.h:132] Prompt evaluation: 12.661000 (seconds)  Rate: 52.839428 (tokens/second)
[stats.h:143] Generated 99 tokens: 7.740000 (seconds)  Rate: 12.790698 (tokens/second)
[stats.h:151] Time to first generated token: 12.661000 (seconds)
[stats.h:158] Sampling time over 768 tokens: 0.117000 (seconds)
```

cc @mergennachin @cccclai @helunwencser @jackzhxng
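As a rough illustration of the fallback behavior described in the summary above (this is not the actual `__init__.py`; the import path and placeholder shapes are assumptions based on the description), the high-level package can try to import the compiled `_llm_runner` module and fall back to informative placeholders when it is missing:

```python
# Hypothetical sketch of the fallback described in the PR summary: if the compiled
# pybind11 extension (_llm_runner) is absent, expose placeholders that raise a clear
# error telling users to rebuild with Python bindings enabled.
try:
    from executorch.extension.llm.runner._llm_runner import (  # compiled extension (path assumed)
        GenerationConfig,
        MultimodalRunner,
    )
except ImportError:

    class MultimodalRunner:  # placeholder when the bindings are not built
        def __init__(self, *args, **kwargs):
            raise RuntimeError(
                "ExecuTorch LLM runner Python bindings are not available; "
                "rebuild with EXECUTORCH_BUILD_PYBIND enabled."
            )

    class GenerationConfig:  # placeholder when the bindings are not built
        def __init__(self, *args, **kwargs):
            raise RuntimeError(
                "ExecuTorch LLM runner Python bindings are not available; "
                "rebuild with EXECUTORCH_BUILD_PYBIND enabled."
            )
```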
1 parent d973635 commit c7596ba

File tree

16 files changed: +2198 −67 lines changed
Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-828ae02053a6e0e20a2dfd6e737ba10c6f4dee6b
+bd06b54e627fbfd354a2cffa4c80fb21883209a9

.ci/scripts/test_huggingface_optimum_model.py

Lines changed: 114 additions & 8 deletions
@@ -43,7 +43,9 @@ def cli_export(command, model_dir):
 
 
 def check_causal_lm_output_quality(
-    model_id: str, generated_tokens: List[int], max_perplexity_threshold: float = 100.0
+    model_id: str,
+    generated_tokens: List[int],
+    max_perplexity_threshold: float = 100.0,
 ):
     """
     Evaluates the quality of text generated by a causal language model by calculating its perplexity.
@@ -58,12 +60,24 @@ def check_causal_lm_output_quality(
     """
     logging.info(f"Starting perplexity check with model '{model_id}' ...")
     # Load model
-    model = AutoModelForCausalLM.from_pretrained(
-        model_id,
-        low_cpu_mem_usage=True,
-        use_cache=False,
-        torch_dtype=torch.bfloat16,
-    )
+    cls_name = AutoModelForCausalLM
+    if "llava" in model_id:
+        from transformers import LlavaForConditionalGeneration
+
+        cls_name = LlavaForConditionalGeneration
+    try:
+        model = cls_name.from_pretrained(
+            model_id,
+            low_cpu_mem_usage=True,
+            use_cache=False,
+            torch_dtype=torch.bfloat16,
+        )
+    except TypeError:
+        model = cls_name.from_pretrained(
+            model_id,
+            low_cpu_mem_usage=True,
+            torch_dtype=torch.bfloat16,
+        )
 
     with torch.no_grad():
         outputs = model(input_ids=generated_tokens, labels=generated_tokens)
@@ -156,6 +170,86 @@ def test_text_generation(model_id, model_dir, recipe, *, quantize=True, run_only
     assert check_causal_lm_output_quality(model_id, generated_tokens) is True
 
 
+def test_llm_with_image_modality(
+    model_id, model_dir, recipe, *, quantize=True, run_only=False
+):
+    command = [
+        "optimum-cli",
+        "export",
+        "executorch",
+        "--model",
+        model_id,
+        "--task",
+        "multimodal-text-to-text",
+        "--recipe",
+        recipe,
+        "--output_dir",
+        model_dir,
+        "--use_custom_sdpa",
+        "--use_custom_kv_cache",
+        "--qlinear",
+        "8da4w",
+        "--qembedding",
+        "8w",
+    ]
+    if not run_only:
+        cli_export(command, model_dir)
+
+    tokenizer = AutoTokenizer.from_pretrained(model_id)
+    tokenizer.save_pretrained(model_dir)
+
+    # input
+    processor = AutoProcessor.from_pretrained(model_id)
+    image_url = "https://llava-vl.github.io/static/images/view.jpg"
+    conversation = [
+        {
+            "role": "system",
+            "content": [
+                {
+                    "type": "text",
+                    "text": "A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.",
+                }
+            ],
+        },
+        {
+            "role": "user",
+            "content": [
+                {"type": "image", "url": image_url},
+                {
+                    "type": "text",
+                    "text": "What are the things I should be cautious about when I visit here?",
+                },
+            ],
+        },
+    ]
+    inputs = processor.apply_chat_template(
+        conversation,
+        add_generation_prompt=True,
+        tokenize=True,
+        return_dict=True,
+        return_tensors="pt",
+    )
+
+    from executorch.extension.llm.runner import GenerationConfig, MultimodalRunner
+
+    runner = MultimodalRunner(f"{model_dir}/model.pte", f"{model_dir}/tokenizer.model")
+    generated_text = runner.generate_text_hf(
+        inputs,
+        GenerationConfig(max_new_tokens=128, temperature=0, echo=False),
+        processor.image_token_id,
+    )
+    print(f"\nGenerated text:\n\t{generated_text}")
+    # Free memory before loading eager for quality check
+    del runner
+    gc.collect()
+    assert (
+        check_causal_lm_output_quality(
+            model_id, tokenizer.encode(generated_text, return_tensors="pt")
+        )
+        is True
+    )
+
+
 def test_fill_mask(model_id, model_dir, recipe, *, quantize=True, run_only=False):
     command = [
         "optimum-cli",
@@ -353,6 +447,9 @@ def test_vit(model_id, model_dir, recipe, *, quantize=False, run_only=False):
         required=False,
         help="When provided, write the pte file to this directory. Otherwise, a temporary directory is created for the test.",
     )
+    parser.add_argument(
+        "--run_only", action="store_true", help="Skip export and only run the test"
+    )
     args = parser.parse_args()
 
     _text_generation_mapping = {
@@ -384,8 +481,16 @@ def test_vit(model_id, model_dir, recipe, *, quantize=False, run_only=False):
         "vit": ("google/vit-base-patch16-224", test_vit),
     }
 
+    _multimodal_model_mapping = {
+        "gemma3-4b": ("google/gemma-3-4b-it", test_llm_with_image_modality),
+        "llava": ("llava-hf/llava-1.5-7b-hf", test_llm_with_image_modality),
+    }
+
     model_to_model_id_and_test_function = (
-        _text_generation_mapping | _mask_fill_mapping | _misc_model_mapping
+        _text_generation_mapping
+        | _mask_fill_mapping
+        | _misc_model_mapping
+        | _multimodal_model_mapping
    )
 
     if args.model not in model_to_model_id_and_test_function:
@@ -400,4 +505,5 @@ def test_vit(model_id, model_dir, recipe, *, quantize=False, run_only=False):
         model_dir=tmp_dir if args.model_dir is None else args.model_dir,
         recipe=args.recipe,
         quantize=args.quantize,
+        run_only=args.run_only,
     )
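The new multimodal test reuses the same quality gate as the text-generation tests: reload the eager Hugging Face model and score the runner's generated tokens by perplexity. A simplified standalone sketch of that idea (the helper name and the structure are illustrative; the real check is `check_causal_lm_output_quality` above, with a default threshold of 100.0):

```python
import torch
from transformers import AutoModelForCausalLM


def perplexity_below_threshold(model_id: str, generated_tokens, threshold: float = 100.0) -> bool:
    # Reload the eager model and score the tokens produced by the ExecuTorch runner.
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
    with torch.no_grad():
        loss = model(input_ids=generated_tokens, labels=generated_tokens).loss
    # Perplexity is exp(average negative log-likelihood); lower means more fluent text.
    return torch.exp(loss).item() < threshold
```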

.github/workflows/pull.yml

Lines changed: 20 additions & 12 deletions
@@ -286,15 +286,20 @@ jobs:
       # Test selective build
       PYTHON_EXECUTABLE=python bash examples/selective_build/test_selective_build.sh "${BUILD_TOOL}"
 
-  test-llava-runner-linux:
-    name: test-llava-runner-linux
+  test-multimodal-linux:
+    if: ${{ !github.event.pull_request.head.repo.fork }}
+    name: test-multimodal-linux
     uses: pytorch/test-infra/.github/workflows/linux_job_v2.yml@main
     permissions:
       id-token: write
       contents: read
+    secrets: inherit
     strategy:
       fail-fast: false
+      matrix:
+        model: ["gemma3-4b"] # llava gives segfault so not covering.
     with:
+      secrets-env: EXECUTORCH_HF_TOKEN
       runner: linux.24xlarge
       docker-image: ci-image:executorch-ubuntu-22.04-clang12
       submodules: 'recursive'
@@ -305,17 +310,20 @@ jobs:
       CONDA_ENV=$(conda env list --json | jq -r ".envs | .[-1]")
       conda activate "${CONDA_ENV}"
 
+      echo "::group::Setup ExecuTorch"
       PYTHON_EXECUTABLE=python bash .ci/scripts/setup-linux.sh --build-tool "cmake"
-
-      # install Llava requirements
-      bash examples/models/llama/install_requirements.sh
-      bash examples/models/llava/install_requirements.sh
-
-      # run python unittest
-      python -m unittest examples.models.llava.test.test_llava
-
-      # run e2e (export, tokenizer and runner)
-      PYTHON_EXECUTABLE=python bash .ci/scripts/test_llava.sh
+      echo "::endgroup::"
+
+      echo "::group::Setup Huggingface"
+      pip install -U "huggingface_hub[cli]" accelerate
+      huggingface-cli login --token $SECRET_EXECUTORCH_HF_TOKEN
+      OPTIMUM_ET_VERSION=$(cat .ci/docker/ci_commit_pins/optimum-executorch.txt)
+      pip install git+https://github.com/huggingface/optimum-executorch.git@${OPTIMUM_ET_VERSION}
+      echo "::endgroup::"
+
+      echo "::group::Test ${{ matrix.model }}"
+      python .ci/scripts/test_huggingface_optimum_model.py --model ${{ matrix.model }} --quantize --recipe xnnpack
+      echo "::endgroup::"
 
   test-moshi-linux:
     name: test-moshi-linux

.github/workflows/trunk.yml

Lines changed: 39 additions & 28 deletions
@@ -616,34 +616,45 @@ jobs:
 
         bash .ci/scripts/test_torchao_huggingface_checkpoints.sh ${{ matrix.model }} ${{ matrix.test_with_runner && '--test_with_runner' || '' }}
 
-  # # TODO(jackzhxng): Runner consistently runs out of memory before test finishes. Try to find a more powerful runner.
-  # test-llava-runner-macos:
-  #   name: test-llava-runner-macos
-  #   uses: pytorch/test-infra/.github/workflows/macos_job.yml@main
-  #   strategy:
-  #     fail-fast: false
-  #   with:
-  #     runner: macos-14-xlarge
-  #     python-version: '3.11'
-  #     submodules: 'recursive'
-  #     ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
-  #     timeout: 900
-  #   script: |
-  #     BUILD_TOOL=cmake
-
-  #     bash .ci/scripts/setup-conda.sh
-  #     # Setup MacOS dependencies as there is no Docker support on MacOS atm
-  #     GITHUB_RUNNER=1 PYTHON_EXECUTABLE=python ${CONDA_RUN} bash .ci/scripts/setup-macos.sh --build-tool "${BUILD_TOOL}"
-
-  #     # install Llava requirements
-  #     ${CONDA_RUN} bash examples/models/llama/install_requirements.sh
-  #     ${CONDA_RUN} bash examples/models/llava/install_requirements.sh
-
-  #     # run python unittest
-  #     ${CONDA_RUN} python -m unittest examples.models.llava.test.test_llava
-
-  #     # run e2e (export, tokenizer and runner)
-  #     PYTHON_EXECUTABLE=python ${CONDA_RUN} bash .ci/scripts/test_llava.sh
+  test-multimodal-macos:
+    if: ${{ !github.event.pull_request.head.repo.fork }}
+    name: test-multimodal-macos
+    uses: pytorch/test-infra/.github/workflows/macos_job.yml@main
+    permissions:
+      id-token: write
+      contents: read
+    secrets: inherit
+    strategy:
+      fail-fast: false
+      matrix:
+        model: ["gemma3-4b"] # llava gives segfault so not covering.
+    with:
+      secrets-env: EXECUTORCH_HF_TOKEN
+      runner: macos-15-xlarge
+      python-version: '3.11'
+      submodules: 'recursive'
+      ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
+      timeout: 90
+    script: |
+      echo "::group::Set up ExecuTorch"
+      bash .ci/scripts/setup-conda.sh
+      eval "$(conda shell.bash hook)"
+
+      # Install requirements
+      ${CONDA_RUN} python install_executorch.py
+      echo "::endgroup::"
+
+      echo "::group::Set up Huggingface"
+      ${CONDA_RUN} pip install -U "huggingface_hub[cli]" accelerate
+      ${CONDA_RUN} huggingface-cli login --token $SECRET_EXECUTORCH_HF_TOKEN
+      OPTIMUM_ET_VERSION=$(cat .ci/docker/ci_commit_pins/optimum-executorch.txt)
+      ${CONDA_RUN} pip install git+https://github.com/huggingface/optimum-executorch.git@${OPTIMUM_ET_VERSION}
+      ${CONDA_RUN} pip list
+      echo "::endgroup::"
+
+      echo "::group::Test ${{ matrix.model }}"
+      ${CONDA_RUN} python .ci/scripts/test_huggingface_optimum_model.py --model ${{ matrix.model }} --quantize --recipe xnnpack
+      echo "::endgroup::"
 
   test-qnn-model:
     name: test-qnn-model

CMakeLists.txt

Lines changed: 9 additions & 9 deletions
@@ -650,15 +650,6 @@ if(EXECUTORCH_BUILD_EXTENSION_LLM)
   list(APPEND _executorch_extensions tokenizers)
 endif()
 
-if(EXECUTORCH_BUILD_EXTENSION_LLM_RUNNER)
-  add_subdirectory(${CMAKE_CURRENT_SOURCE_DIR}/extension/llm/runner)
-  list(APPEND _executorch_extensions extension_llm_runner)
-endif()
-
-if(EXECUTORCH_BUILD_EXTENSION_LLM_APPLE)
-  add_subdirectory(${CMAKE_CURRENT_SOURCE_DIR}/extension/llm/apple)
-endif()
-
 if(EXECUTORCH_BUILD_EXTENSION_RUNNER_UTIL)
   add_subdirectory(${CMAKE_CURRENT_SOURCE_DIR}/extension/runner_util)
   install(
@@ -904,6 +895,15 @@ if(EXECUTORCH_BUILD_EXTENSION_TRAINING)
   list(APPEND _executorch_extensions extension_training)
 endif()
 
+if(EXECUTORCH_BUILD_EXTENSION_LLM_RUNNER)
+  add_subdirectory(${CMAKE_CURRENT_SOURCE_DIR}/extension/llm/runner)
+  list(APPEND _executorch_extensions extension_llm_runner)
+endif()
+
+if(EXECUTORCH_BUILD_EXTENSION_LLM_APPLE)
+  add_subdirectory(${CMAKE_CURRENT_SOURCE_DIR}/extension/llm/apple)
+endif()
+
 if(EXECUTORCH_BUILD_KERNELS_LLM)
   # TODO: move all custom kernels to ${CMAKE_CURRENT_SOURCE_DIR}/kernels/custom
   add_subdirectory(${CMAKE_CURRENT_SOURCE_DIR}/extension/llm/custom_ops)

examples/models/llava/install_requirements.sh

Lines changed: 1 addition & 6 deletions
@@ -7,9 +7,4 @@
 
 set -x
 
-pip install transformers accelerate sentencepiece tiktoken
-
-# Run llama2/install requirements for torchao deps
-SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
-
-bash "$SCRIPT_DIR"/../llama/install_requirements.sh
+pip install git+https://github.com/huggingface/optimum-executorch.git@d4d3046738ca31b5542506aaa76a28d540600227

examples/models/llava/main.cpp

Lines changed: 1 addition & 2 deletions
@@ -131,8 +131,7 @@ int32_t main(int32_t argc, char** argv) {
 #endif
   // Load tokenizer
   std::unique_ptr<::tokenizers::Tokenizer> tokenizer =
-      std::make_unique<tokenizers::Llama2cTokenizer>();
-  tokenizer->load(tokenizer_path);
+      ::executorch::extension::llm::load_tokenizer(tokenizer_path);
   if (tokenizer == nullptr) {
     ET_LOG(Error, "Failed to load tokenizer from: %s", tokenizer_path);
     return 1;

extension/llm/runner/CMakeLists.txt

Lines changed: 40 additions & 0 deletions
@@ -79,3 +79,43 @@ install(
 if(BUILD_TESTING)
   add_subdirectory(test)
 endif()
+
+# Python bindings for MultimodalRunner
+if(EXECUTORCH_BUILD_PYBIND)
+  # Create the Python extension module for LLM runners
+  pybind11_add_module(
+    _llm_runner SHARED ${CMAKE_CURRENT_SOURCE_DIR}/pybindings.cpp
+  )
+
+  find_package_torch()
+  find_library(
+    TORCH_PYTHON_LIBRARY torch_python PATHS "${TORCH_INSTALL_PREFIX}/lib"
+  )
+  # Link with the extension_llm_runner library and its dependencies
+  target_link_libraries(
+    _llm_runner PRIVATE extension_llm_runner tokenizers::tokenizers
+                        portable_lib ${TORCH_PYTHON_LIBRARY} ${TORCH_LIBRARIES}
+  )
+
+  # Set properties for the Python extension
+  set_target_properties(
+    _llm_runner
+    PROPERTIES POSITION_INDEPENDENT_CODE ON
+               CXX_VISIBILITY_PRESET "hidden"
+               INTERPROCEDURAL_OPTIMIZATION TRUE
+  )
+  if(APPLE)
+    set(RPATH "@loader_path/../../pybindings")
+  else()
+    set(RPATH "$ORIGIN/../../pybindings")
+  endif()
+  set_target_properties(_llm_runner PROPERTIES INSTALL_RPATH ${RPATH})
+  # Add include directories
+  target_include_directories(
+    _llm_runner PRIVATE ${_common_include_directories} ${TORCH_INCLUDE_DIRS}
+  )
+
+  install(TARGETS _llm_runner
+          LIBRARY DESTINATION executorch/extension/llm/runner
+  )
+endif()
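With `EXECUTORCH_BUILD_PYBIND` enabled, the compiled `_llm_runner` module installs under `executorch/extension/llm/runner`. A minimal smoke test, mirroring the PR's own usage snippet (the model and tokenizer paths are placeholders, and it assumes the high-level package re-exports the compiled symbols), could look like:

```python
# Minimal smoke test for the installed bindings; model/tokenizer paths are placeholders.
from executorch.extension.llm.runner import (
    GenerationConfig,
    MultimodalRunner,
    make_text_input,
)

runner = MultimodalRunner("model.pte", "tokenizer.model", None)
config = GenerationConfig()
config.max_new_tokens = 16
runner.generate([make_text_input("Hello, world!\n")], config)
```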
