Fix runtime bugs with multi-modal models (microsoft#1701)
### Description
This PR fixes runtime bugs with Phi-4 multi-modal and Gemma-3 vision. It
also fixes [this
issue](microsoft#1698).
### Motivation and Context
Bug 1: The `num_audio_tokens` value contains the correct number of audio
tokens, but it was not being assigned to `num_audio_tokens_`, which is
used to initialize `audio_features_`.
https://github.com/microsoft/onnxruntime-genai/blob/b3ddb21fd5a583ca1a45bf416e8f70ff7df2f4ba/src/models/multi_modal.cpp#L121-L124
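The pattern behind Bug 1 can be sketched in Python as a rough analogy to the C++ code. The class and method names below are hypothetical and are not part of onnxruntime-genai; `self.num_audio_tokens` stands in for the `num_audio_tokens_` member and `allocate_features` for the initialization of `audio_features_`.

```python
class AudioState:
    """Hypothetical stand-in for the C++ state object (not the real class)."""

    def __init__(self) -> None:
        # Plays the role of the member `num_audio_tokens_`.
        self.num_audio_tokens = 0

    def set_count_buggy(self, count: int) -> None:
        # The correct value is computed into a local and then dropped,
        # so the attribute keeps its stale value.
        num_audio_tokens = count  # noqa: F841 (intentionally unused)

    def set_count_fixed(self, count: int) -> None:
        # The fix: store the value on the object.
        self.num_audio_tokens = count

    def allocate_features(self) -> list[float]:
        # Plays the role of initializing `audio_features_` from the member.
        return [0.0] * self.num_audio_tokens


buggy = AudioState()
buggy.set_count_buggy(4)
buggy_features = buggy.allocate_features()

fixed = AudioState()
fixed.set_count_fixed(4)
fixed_features = fixed.allocate_features()
```

In this sketch the buggy path sizes the buffer from a stale zero. In C++ an unassigned member may instead hold an arbitrary value.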
Bug 2: The image processor for Gemma-3 vision returns `num_img_tokens`
as `int32_t`.
https://github.com/microsoft/onnxruntime-genai/blob/b3ddb21fd5a583ca1a45bf416e8f70ff7df2f4ba/src/models/gemma_image_processor.cpp#L73-L74
However, the subsequent code that uses `num_img_tokens` is interpreting
it as `int64_t`.
https://github.com/microsoft/onnxruntime-genai/blob/b3ddb21fd5a583ca1a45bf416e8f70ff7df2f4ba/src/models/multi_modal.cpp#L15-L18
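A minimal Python sketch of why this width mismatch is harmful, using the standard `struct` module to mimic reinterpreting a buffer. The token counts are illustrative values, not taken from the Gemma-3 processor, and the layout assumes little-endian storage.

```python
import struct

# Two 32-bit token counts, written the way an int32 tensor would store them.
# The values are made up for illustration.
counts_int32 = (256, 256)
raw = struct.pack("<2i", *counts_int32)  # 8 bytes in total

# Correct: read the buffer with the same element width it was written with.
as_int32 = struct.unpack("<2i", raw)

# Incorrect: treat the same 8 bytes as a single int64 element.
(as_int64,) = struct.unpack("<q", raw)

print(as_int32)
print(as_int64)  # 256 + (256 << 32), far larger than either real count
```

Two adjacent 32-bit values fuse into one very large 64-bit number. A size derived from such a value would be consistent with the huge allocation requests and overflow errors listed under Errors. In C++, reading a single `int32_t` as `int64_t` would additionally read past the end of the element.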
Bug 3: The `tokenizer.apply_chat_template` API does not appear to
integrate well with Phi-4 multi-modal. The result from applying the chat
template differs from the result when manually constructing the string
to tokenize.
When manually constructing the string to tokenize (input is 1 image + 1
prompt):
```
<|user|>
<|image_1|>
describe this<|end|>
<|assistant|>
```
When applying the chat template (input is 1 image + 1 prompt):
```
<|user|>[{'type': 'image'}, {'type': 'text', 'text': 'describe this'}]<|end|><|assistant|>
```
Because the `<|image_1|>` token is missing, a runtime error is raised
stating that the number of image tokens does not match the number of images.
For now, the chat template changes to `phi4-mm.py` have been reverted to
avoid this error.
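The manual construction and the mismatch can be sketched as follows. `build_prompt` and `count_image_tags` are hypothetical helpers written for illustration, not functions from `phi4-mm.py` or the onnxruntime-genai API. The sketch reproduces exactly the lines shown above; whether a trailing newline is also expected is not stated here.

```python
import re

# Matches image placeholder tags such as <|image_1|>.
IMAGE_TAG = re.compile(r"<\|image_\d+\|>")


def build_prompt(text: str, num_images: int) -> str:
    """Hypothetical helper: build the Phi-4 multi-modal prompt by hand."""
    lines = ["<|user|>"]
    # One numbered placeholder per image, starting at 1.
    lines += [f"<|image_{i}|>" for i in range(1, num_images + 1)]
    lines.append(f"{text}<|end|>")
    lines.append("<|assistant|>")
    return "\n".join(lines)


def count_image_tags(prompt: str) -> int:
    """Hypothetical check: how many image placeholders the prompt contains."""
    return len(IMAGE_TAG.findall(prompt))


manual = build_prompt("describe this", num_images=1)

# The chat-template output quoted above, copied verbatim.
templated = (
    "<|user|>[{'type': 'image'}, {'type': 'text', 'text': 'describe this'}]"
    "<|end|><|assistant|>"
)

print(count_image_tags(manual))
print(count_image_tags(templated))
```

With one image supplied, the manual prompt contains one placeholder and the templated string contains none, which is the mismatch the pre-processing error reports.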
### Errors
The following errors occurred as a result of these bugs.
1. Out-of-memory allocations
CPU:
```python
Traceback (most recent call last):
File "/home/username/onnxruntime-genai/examples/python/model-vision.py", line 155, in <module>
run(args)
File "/home/username/onnxruntime-genai/examples/python/model-vision.py", line 114, in run
generator.set_inputs(inputs)
RuntimeError: std::bad_alloc
```
CUDA:
```python
RuntimeError: /onnxruntime_src/onnxruntime/core/framework/bfc_arena.cc:376 void* onnxruntime::BFCArena::AllocateRawInternal(size_t, bool, onnxruntime::Stream*, bool, onnxruntime::WaitNotificationFn) Failed to allocate memory for requested buffer of size 6755300567900160
```
2. C++ internal errors
```cpp
/opt/rh/gcc-toolset-12/root/usr/include/c++/12/string_view:239: constexpr const std::basic_string_view<_CharT, _Traits>::value_type& std::basic_string_view<_CharT, _Traits>::operator[](size_type) const [with _CharT = char32_t; _Traits = std::char_traits<char32_t>; const_reference = const char32_t&; size_type = long unsigned int]: Assertion '__pos < this->_M_len' failed.
```
3. Integer overflow errors
```python
Traceback (most recent call last):
File "C:\Users\username\Downloads\gemma-3-vision-it\run_vision.py", line 179, in <module>
run(args)
File "C:\Users\username\Downloads\gemma-3-vision-it\run_vision.py", line 137, in run
generator.set_inputs(inputs)
RuntimeError: D:\a\_work\1\s\onnxruntime\core/common/safeint.h:17 SafeIntExceptionHandler<class onnxruntime::OnnxRuntimeException>::SafeIntOnOverflow Integer overflow
```
4. Pre-processing errors
```python
Traceback (most recent call last):
File "/home/username/onnxruntime-genai/examples/python/phi4-mm.py", line 181, in <module>
run(args)
File "/home/username/onnxruntime-genai/examples/python/phi4-mm.py", line 129, in run
inputs = processor(prompt, images=images, audios=audios)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Number of image tokens does not match the number of images. Please fix the prompt.
```