
Conversation


@ayushsatyam146 (Contributor) commented Oct 5, 2025

Part of #26149

@gemini-code-assist bot left a comment


Code Review

This pull request refactors the Ultravox, Voxtral, and Whisper models to use the new merge_by_field_config multimodal interface. The changes involve setting this flag to True in the model classes and removing the flatten_bn utility, as its functionality is now handled by the new data processing pipeline. The code modifications are consistent and appear correct for this refactoring. I did not identify any issues of high or critical severity.
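For context, a minimal pure-Python sketch of what a flatten_bn-style helper did (a hypothetical standalone version; the real vLLM utility operates on tensors and handles more cases): it collapses a per-batch list of per-sample item lists into one flat list, which the merge_by_field_config pipeline now does upstream before the model sees the inputs.

```python
def flatten_bn_sketch(batched):
    """Flatten a batch (B) of per-sample item lists (N) into one flat list.

    batched: list of lists, e.g. [[a0, a1], [b0]] for B=2 samples.
    Returns one entry per item across the whole batch, in order.
    """
    return [item for sample in batched for item in sample]

print(flatten_bn_sketch([[1, 2], [3]]))  # [1, 2, 3]
```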

@ayushsatyam146 force-pushed the merge_by_field_config-X branch from a7615f9 to 81fe190 on October 5, 2025 at 17:32

💡 Codex Review

```python
# Audio lens and token_len are already in the correct shape
audio_lens = audio_input["lens"]
audio_token_len = audio_input["token_len"]
embeddings = self._audio_features_to_embeddings(audio_features, audio_lens)

# We should flatten and concatenate embeddings based on token lengths
# For example, with token_len = [4, 2, 3], flattened_embeddings will be
# concat(embeddings[0][:4], embeddings[1][:2], embeddings[2][:3])

# Create a mask of valid indices based on token lengths
max_len = embeddings.shape[1]
indices = torch.arange(max_len, device=embeddings.device).expand(
    embeddings.shape[0], -1
)
mask = indices < audio_token_len[:, None]

# Apply mask and flatten
flattened_embeddings = embeddings[mask]

# Return one tensor per input audio
embed_lens = [
    token_len_item.sum().item() for token_len_item in audio_input["token_len"]
]
return flattened_embeddings.split(embed_lens)
```
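The mask-and-flatten step above is equivalent to keeping the first token_len[i] positions of each padded row and concatenating the results. A pure-Python sketch of that selection (hypothetical helper, plain lists standing in for tensors):

```python
def flatten_by_token_len(embeddings, token_len):
    """Keep the first token_len[i] positions of row i, concatenated.

    embeddings: list of equal-length rows, padded out to max_len.
    token_len: number of valid positions in each row.
    """
    return [x for row, n in zip(embeddings, token_len) for x in row[:n]]

# With token_len = [4, 2, 3] this matches
# concat(embeddings[0][:4], embeddings[1][:2], embeddings[2][:3]).
rows = [[10, 11, 12, 13], [20, 21, 0, 0], [30, 31, 32, 0]]
print(flatten_by_token_len(rows, [4, 2, 3]))
# [10, 11, 12, 13, 20, 21, 30, 31, 32]
```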

P1: Splitting Ultravox embeddings per chunk instead of per audio

In _process_audio_input the lens and token lengths are now taken as-is (audio_token_len = audio_input["token_len"]) because the new merge_by_field_config path already flattens the B×N dimensions. However, embed_lens is still built by iterating over audio_input["token_len"] and summing each element. With the flattened inputs this loop now produces one entry per chunk rather than one entry per audio item, so flattened_embeddings.split(embed_lens) returns a list of tensors per chunk while _get_prompt_updates still emits only one PromptReplacement per audio based on audio_num_chunks. The number and ordering of multimodal embeddings therefore no longer matches the placeholders inserted into the prompt, which will misalign audio embeddings with their token replacements (and can raise when counts differ). Consider rebuilding embed_lens using the original per-audio grouping (e.g. via audio_num_chunks) so the function continues to return one tensor per audio item.
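A pure-Python sketch of the grouping the review suggests, with plain lists standing in for tensors (in the model this would operate on the flattened token_len tensor and a per-audio chunk-count field like audio_num_chunks; names here mirror the review, not a confirmed API):

```python
def per_audio_embed_lens(token_len, audio_num_chunks):
    """Group flat per-chunk token counts back into per-audio totals.

    token_len: flat list of token counts, one entry per chunk across
        all audio items (the merge_by_field_config layout).
    audio_num_chunks: list of chunk counts, one entry per audio item.
    Returns one embedding length per audio, so a later split() yields
    one tensor per audio item rather than one per chunk.
    """
    lens, start = [], 0
    for n in audio_num_chunks:
        lens.append(sum(token_len[start:start + n]))
        start += n
    return lens

# Example: two audios, the first split into 2 chunks, the second into 1.
print(per_audio_embed_lens([4, 2, 3], [2, 1]))  # [6, 3]
```

Splitting the flattened embeddings with these per-audio lengths keeps the output count aligned with the single PromptReplacement emitted per audio.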


@DarkLight1337 (Member) commented:

Have you tested each model using the example script?

@ayushsatyam146 force-pushed the merge_by_field_config-X branch from 81fe190 to 730e6e7 on October 6, 2025 at 05:44
@ayushsatyam146 (Contributor, Author) commented:

@DarkLight1337 I have made the required changes, but I'm sorry I couldn't run the example scripts due to GPU constraints.

Signed-off-by: DarkLight1337 <[email protected]>
@DarkLight1337 (Member) left a comment


Voxtral and Whisper are both failing locally, I have fixed Voxtral and will fix Whisper later

Signed-off-by: DarkLight1337 <[email protected]>
@DarkLight1337 (Member) left a comment


Fixed Whisper as well, let's merge this

@DarkLight1337 enabled auto-merge (squash) on October 7, 2025 at 03:47
@github-actions bot added the ready label (ONLY add when PR is ready to merge/full CI is needed) on Oct 7, 2025
Signed-off-by: DarkLight1337 <[email protected]>
@mergify bot added the multi-modality label (Related to multi-modality, #4194) on Oct 7, 2025
@DarkLight1337 merged commit 5f7e8a9 into vllm-project:main on Oct 7, 2025
55 checks passed
southfreebird pushed a commit to southfreebird/vllm that referenced this pull request Oct 7, 2025
…#26261)

Signed-off-by: Ayush Satyam <[email protected]>
Signed-off-by: DarkLight1337 <[email protected]>
Co-authored-by: DarkLight1337 <[email protected]>