
Conversation


@ayushsatyam146 (Contributor) commented Oct 5, 2025

Part of #26149

@gemini-code-assist bot left a comment


Code Review

This pull request refactors the Ultravox, Voxtral, and Whisper models to use the new merge_by_field_config multimodal interface. The changes involve setting this flag to True in the model classes and removing the flatten_bn utility, as its functionality is now handled by the new data processing pipeline. The code modifications are consistent and appear correct for this refactoring. I did not identify any issues of high or critical severity.
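For context, a minimal pure-Python sketch of what a flatten_bn-style helper did (a hypothetical standalone version; the real vLLM utility operates on tensors and handles more cases): it collapses a per-batch list of per-sample item lists into one flat list, which the merge_by_field_config pipeline now does upstream before the model sees the inputs.

```python
def flatten_bn_sketch(batched):
    """Flatten a batch (B) of per-sample item lists (N) into one flat list.

    batched: list of lists, e.g. [[a0, a1], [b0]] for B=2 samples.
    Returns one entry per item across the whole batch, in order.
    """
    return [item for sample in batched for item in sample]

print(flatten_bn_sketch([[1, 2], [3]]))  # [1, 2, 3]
```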

@ayushsatyam146 force-pushed the merge_by_field_config-X branch from a7615f9 to 81fe190 on October 5, 2025 at 17:32

💡 Codex Review

```python
# Audio lens and token_len are already in the correct shape
audio_lens = audio_input["lens"]
audio_token_len = audio_input["token_len"]
embeddings = self._audio_features_to_embeddings(audio_features, audio_lens)

# We should flatten and concatenate embeddings based on token lengths
# For example, with token_len = [4, 2, 3], flattened_embeddings will be
# concat(embeddings[0][:4], embeddings[1][:2], embeddings[2][:3])

# Create a mask of valid indices based on token lengths
max_len = embeddings.shape[1]
indices = torch.arange(max_len, device=embeddings.device).expand(
    embeddings.shape[0], -1
)
mask = indices < audio_token_len[:, None]

# Apply mask and flatten
flattened_embeddings = embeddings[mask]

# Return one tensor per input audio
embed_lens = [
    token_len_item.sum().item() for token_len_item in audio_input["token_len"]
]
return flattened_embeddings.split(embed_lens)
```
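The mask-and-flatten step above is equivalent to keeping the first token_len[i] positions of each padded row and concatenating the results. A pure-Python sketch of that selection (hypothetical helper, plain lists standing in for tensors):

```python
def flatten_by_token_len(embeddings, token_len):
    """Keep the first token_len[i] positions of row i, concatenated.

    embeddings: list of equal-length rows, padded out to max_len.
    token_len: number of valid positions in each row.
    """
    return [x for row, n in zip(embeddings, token_len) for x in row[:n]]

# With token_len = [4, 2, 3] this matches
# concat(embeddings[0][:4], embeddings[1][:2], embeddings[2][:3]).
rows = [[10, 11, 12, 13], [20, 21, 0, 0], [30, 31, 32, 0]]
print(flatten_by_token_len(rows, [4, 2, 3]))
# [10, 11, 12, 13, 20, 21, 30, 31, 32]
```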

P1: Splitting Ultravox embeddings per chunk instead of per audio

In _process_audio_input the lens and token lengths are now taken as-is (audio_token_len = audio_input["token_len"]) because the new merge_by_field_config path already flattens the B×N dimensions. However, embed_lens is still built by iterating over audio_input["token_len"] and summing each element. With the flattened inputs this loop now produces one entry per chunk rather than one entry per audio item, so flattened_embeddings.split(embed_lens) returns a list of tensors per chunk while _get_prompt_updates still emits only one PromptReplacement per audio based on audio_num_chunks. The number and ordering of multimodal embeddings therefore no longer matches the placeholders inserted into the prompt, which will misalign audio embeddings with their token replacements (and can raise when counts differ). Consider rebuilding embed_lens using the original per-audio grouping (e.g. via audio_num_chunks) so the function continues to return one tensor per audio item.
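A pure-Python sketch of the grouping the review suggests, with plain lists standing in for tensors (in the model this would operate on the flattened token_len tensor and a per-audio chunk-count field like audio_num_chunks; names here mirror the review, not a confirmed API):

```python
def per_audio_embed_lens(token_len, audio_num_chunks):
    """Group flat per-chunk token counts back into per-audio totals.

    token_len: flat list of token counts, one entry per chunk across
        all audio items (the merge_by_field_config layout).
    audio_num_chunks: list of chunk counts, one entry per audio item.
    Returns one embedding length per audio, so a later split() yields
    one tensor per audio item rather than one per chunk.
    """
    lens, start = [], 0
    for n in audio_num_chunks:
        lens.append(sum(token_len[start:start + n]))
        start += n
    return lens

# Example: two audios, the first split into 2 chunks, the second into 1.
print(per_audio_embed_lens([4, 2, 3], [2, 1]))  # [6, 3]
```

Splitting the flattened embeddings with these per-audio lengths keeps the output count aligned with the single PromptReplacement emitted per audio.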


@DarkLight1337 (Member) commented:

Have you tested each model using the example script?

@ayushsatyam146 force-pushed the merge_by_field_config-X branch from 81fe190 to 730e6e7 on October 6, 2025 at 05:44
@ayushsatyam146 (Contributor, Author) commented:

@DarkLight1337 I have made the required changes, but I'm sorry I couldn't run the example scripts due to GPU constraints.

Signed-off-by: DarkLight1337 <[email protected]>
@DarkLight1337 (Member) left a comment


Voxtral and Whisper are both failing locally, I have fixed Voxtral and will fix Whisper later

Signed-off-by: DarkLight1337 <[email protected]>
@DarkLight1337 (Member) left a comment


Fixed Whisper as well, let's merge this

@DarkLight1337 enabled auto-merge (squash) on October 7, 2025 at 03:47
@github-actions bot added the ready label (ONLY add when PR is ready to merge/full CI is needed) on Oct 7, 2025
Signed-off-by: DarkLight1337 <[email protected]>
@mergify bot added the multi-modality label (Related to multi-modality, #4194) on Oct 7, 2025
@DarkLight1337 merged commit 5f7e8a9 into vllm-project:main on Oct 7, 2025
55 checks passed
southfreebird pushed a commit to southfreebird/vllm that referenced this pull request Oct 7, 2025
…#26261)

Signed-off-by: Ayush Satyam <[email protected]>
Signed-off-by: DarkLight1337 <[email protected]>
Co-authored-by: DarkLight1337 <[email protected]>