
Commit ad2a6ed (parent 79d5fda)

Fix runtime bugs with multi-modal models (microsoft#1701)

### Description

This PR fixes runtime bugs with Phi-4 multi-modal and Gemma-3 vision. It also fixes [this issue](microsoft#1698).

### Motivation and Context

Bug 1: The `num_audio_tokens` value contains the correct number of audio tokens, but it was not being assigned to `num_audio_tokens_`, which is used to initialize `audio_features_`.

https://github.com/microsoft/onnxruntime-genai/blob/b3ddb21fd5a583ca1a45bf416e8f70ff7df2f4ba/src/models/multi_modal.cpp#L121-L124

Bug 2: The image processor for Gemma-3 vision returns `num_img_tokens` as `int32_t`.

https://github.com/microsoft/onnxruntime-genai/blob/b3ddb21fd5a583ca1a45bf416e8f70ff7df2f4ba/src/models/gemma_image_processor.cpp#L73-L74

However, the subsequent code that uses `num_img_tokens` interprets it as `int64_t`.

https://github.com/microsoft/onnxruntime-genai/blob/b3ddb21fd5a583ca1a45bf416e8f70ff7df2f4ba/src/models/multi_modal.cpp#L15-L18

Bug 3: The `tokenizer.apply_chat_template` API does not appear to integrate well with Phi-4 multi-modal. The result from applying the chat template differs from the result of manually constructing the string to tokenize.

When manually constructing the string to tokenize (input is 1 image + 1 prompt):

```
<|user|>
<|image_1|>
describe this<|end|>
<|assistant|>
```

When applying the chat template (input is 1 image + 1 prompt):

```
<|user|>[{'type': 'image'}, {'type': 'text', 'text': 'describe this'}]<|end|><|assistant|>
```

Because the `<|image_1|>` token is missing, a runtime error is raised stating that the number of image tokens does not match the number of images. For now, the chat template changes to `phi4-mm.py` have been reverted to avoid this error.

### Errors

The following errors occurred as a result of these bugs.

1. Out-of-memory allocations

CPU:

```python
Traceback (most recent call last):
  File "/home/username/onnxruntime-genai/examples/python/model-vision.py", line 155, in <module>
    run(args)
  File "/home/username/onnxruntime-genai/examples/python/model-vision.py", line 114, in run
    generator.set_inputs(inputs)
RuntimeError: std::bad_alloc
```

CUDA:

```python
RuntimeError: /onnxruntime_src/onnxruntime/core/framework/bfc_arena.cc:376 void* onnxruntime::BFCArena::AllocateRawInternal(size_t, bool, onnxruntime::Stream*, bool, onnxruntime::WaitNotificationFn) Failed to allocate memory for requested buffer of size 6755300567900160
```

2. C++ internal errors

```cpp
/opt/rh/gcc-toolset-12/root/usr/include/c++/12/string_view:239: constexpr const std::basic_string_view<_CharT, _Traits>::value_type& std::basic_string_view<_CharT, _Traits>::operator[](size_type) const [with _CharT = char32_t; _Traits = std::char_traits<char32_t>; const_reference = const char32_t&; size_type = long unsigned int]: Assertion '__pos < this->_M_len' failed.
```

3. Integer overflow errors

```python
Traceback (most recent call last):
  File "C:\Users\username\Downloads\gemma-3-vision-it\run_vision.py", line 179, in <module>
    run(args)
  File "C:\Users\username\Downloads\gemma-3-vision-it\run_vision.py", line 137, in run
    generator.set_inputs(inputs)
RuntimeError: D:\a\_work\1\s\onnxruntime\core/common/safeint.h:17 SafeIntExceptionHandler<class onnxruntime::OnnxRuntimeException>::SafeIntOnOverflow Integer overflow
```

4. Pre-processing errors

```python
Traceback (most recent call last):
  File "/home/username/onnxruntime-genai/examples/python/phi4-mm.py", line 181, in <module>
    run(args)
  File "/home/username/onnxruntime-genai/examples/python/phi4-mm.py", line 129, in run
    inputs = processor(prompt, images=images, audios=audios)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Number of image tokens does not match the number of images. Please fix the prompt.
```
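To make the Bug 2 failure mode concrete, here is a minimal sketch, not code from this repository: numpy stands in for the raw tensor buffer, and the snippet only models what happens when a count written as a 32-bit integer is read back as a 64-bit integer, so the upper four bytes come from adjacent, unrelated memory.

```python
import numpy as np

# Illustrative sketch of the Bug 2 mechanism (not repository code).
# Pretend tensor storage: the real count (256) sits next to unrelated bytes.
storage = np.array([256, 0x1234], dtype=np.int32)

# Producer writes the count as int32; consumer reads the same bytes as int64.
as_int64 = storage.view(np.int64)[0]

print(as_int64)  # roughly 2.0e13 on a little-endian machine, instead of 256
# A bogus count of this size then drives tensor allocation, which surfaces as
# std::bad_alloc, the BFCArena failure, or the SafeInt overflow shown above.
```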

File tree: 3 files changed (+21 / -25 lines)

examples/python/phi4-mm.py
Lines changed: 17 additions & 23 deletions

@@ -5,7 +5,6 @@
 import os
 import glob
 import time
-import json
 from pathlib import Path

 import onnxruntime_genai as og
@@ -64,9 +63,8 @@ def run(args: argparse.Namespace):
     model = og.Model(config)
     print("Model loaded")

-    tokenizer = og.Tokenizer(model)
     processor = model.create_multimodal_processor()
-    stream = processor.create_stream()
+    tokenizer_stream = processor.create_stream()

     interactive = not args.non_interactive

@@ -86,44 +84,40 @@ def run(args: argparse.Namespace):

         images = None
         audios = None
+        prompt = "<|user|>\n"

-        # Validate and open image paths
+        # Get images
         if len(image_paths) == 0:
             print("No image provided")
         else:
-            for image_path in image_paths:
+            for i, image_path in enumerate(image_paths):
                 if not os.path.exists(image_path):
                     raise FileNotFoundError(f"Image file not found: {image_path}")
                 print(f"Using image: {image_path}")
+                prompt += f"<|image_{i+1}|>\n"
             images = og.Images.open(*image_paths)

-        # Validate and open audio paths
+        # Get audios
         if len(audio_paths) == 0:
             print("No audio provided")
         else:
-            for audio_path in audio_paths:
+            for i, audio_path in enumerate(audio_paths):
                 if not os.path.exists(audio_path):
                     raise FileNotFoundError(f"Audio file not found: {audio_path}")
                 print(f"Using audio: {audio_path}")
+                prompt += f"<|audio_{i+1}|>\n"
             audios = og.Audios.open(*audio_paths)

-        # Get prompt text
+
         if interactive:
             text = input("Prompt: ")
         else:
-            text = args.prompt or "Does the audio summarize what is shown in the image? If not, what is different?"
-
-        # Build multimodal content list
-        content_list = []
-        content_list.extend([{"type": "image"} for _ in image_paths])
-        content_list.extend([{"type": "audio"} for _ in audio_paths])
-        content_list.append({"type": "text", "text": text})
-
-        # Construct messages and apply template
-        messages = [{"role": "user", "content": content_list}]
-        message_json = json.dumps(messages)
-        prompt = tokenizer.apply_chat_template(message_json, add_generation_prompt=True)
-
+            if args.prompt:
+                text = args.prompt
+            else:
+                text = "Does the audio summarize what is shown in the image? If not, what is different?"
+        prompt += f"{text}<|end|>\n<|assistant|>\n"
+
         print("Processing inputs...")
         inputs = processor(prompt, images=images, audios=audios)
         print("Processor complete.")
@@ -140,7 +134,7 @@ def run(args: argparse.Namespace):
             generator.generate_next_token()

             new_token = generator.get_next_tokens()[0]
-            print(stream.decode(new_token), end="", flush=True)
+            print(tokenizer_stream.decode(new_token), end="", flush=True)

         print()
         total_run_time = time.time() - start_time
@@ -177,4 +171,4 @@ def run(args: argparse.Namespace):
         '--non-interactive', action=argparse.BooleanOptionalAction, required=False, help='Non-interactive mode, mainly for CI usage'
     )
     args = parser.parse_args()
-    run(args)
+    run(args)
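For reference, here is a minimal standalone sketch of the prompt string the reverted example logic assembles for one image plus one text prompt; the inputs below are hypothetical placeholders, and the printed output matches the working format quoted in the description.

```python
# Sketch of the manual prompt construction restored above (hypothetical inputs).
image_paths = ["example.png"]   # placeholder image path
text = "describe this"          # placeholder user prompt

prompt = "<|user|>\n"
for i, _ in enumerate(image_paths):
    prompt += f"<|image_{i+1}|>\n"   # one <|image_N|> tag per image
prompt += f"{text}<|end|>\n<|assistant|>\n"

print(prompt, end="")
# <|user|>
# <|image_1|>
# describe this<|end|>
# <|assistant|>
```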

src/models/gemma_image_processor.cpp
Lines changed: 2 additions & 2 deletions

@@ -70,8 +70,8 @@ ProcessImagePrompt(const Generators::Tokenizer& tokenizer, const std::string& pr
     }
   }

-  std::unique_ptr<OrtValue> num_img_tokens = OrtValue::CreateTensor<int32_t>(allocator, std::vector<int64_t>{1});
-  num_img_tokens->GetTensorMutableData<int32_t>()[0] = static_cast<int32_t>(image_seq_length);
+  std::unique_ptr<OrtValue> num_img_tokens = OrtValue::CreateTensor<int64_t>(allocator, std::vector<int64_t>{1});
+  num_img_tokens->GetTensorMutableData<int64_t>()[0] = static_cast<int64_t>(image_seq_length);

   return {std::move(input_ids_value), std::move(token_type_ids), std::move(num_img_tokens)};
 }

src/models/multi_modal.cpp
Lines changed: 2 additions & 0 deletions

@@ -119,6 +119,8 @@ SpeechState::SpeechState(const MultiModalLanguageModel& model, const GeneratorPa
       model_{model} {}

 void SpeechState::SetExtraInputs(const std::vector<ExtraInput>& extra_inputs, const int64_t num_audio_tokens) {
+  num_audio_tokens_ = num_audio_tokens;
+
   audio_features_ = std::make_unique<MultiModalFeatures>(*this, MultiModalFeatures::Mode::Output, // Model output
                                                          model_.config_->model.speech.outputs.audio_features,
                                                          -1, num_audio_tokens_);
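As a plain illustration of the pattern behind Bug 1 (all names below are hypothetical, not the C++ types above): the setter previously accepted the token count but never stored it, so the member later used to size the audio features kept a stale value; the two added lines store it before `audio_features_` is constructed.

```python
# Python analogue of the Bug 1 pattern (hypothetical names, illustrative only).
class SpeechStateSketch:
    def __init__(self):
        self.num_audio_tokens = 0  # stale default used to size audio features

    def set_extra_inputs_broken(self, num_audio_tokens):
        pass  # bug: the parameter is accepted but never stored

    def set_extra_inputs_fixed(self, num_audio_tokens):
        self.num_audio_tokens = num_audio_tokens  # mirrors the one-line fix above

state = SpeechStateSketch()
state.set_extra_inputs_broken(128)
print(state.num_audio_tokens)  # 0   -> features sized from the wrong count
state.set_extra_inputs_fixed(128)
print(state.num_audio_tokens)  # 128 -> features sized correctly
```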
