
Conversation

@NickLucche (Collaborator) commented Aug 27, 2025

This PR enables Gemma3n for use with the audio-specific endpoints (transcriptions/translations).

I've also added a "soft" interface change: a to_language parameter on the API, which I found helps somewhat with translation.
The rationale is to keep these changes lightweight for now, since we're only slightly steering away from the original OAI whisper-only spec, and to see where the broader audio community wants this to go.

No chunking for now, as I believe a long-audio capability assessment is in order for this model.
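
For reference, here is a minimal sketch of how the new parameter can be exercised through the official OpenAI client (the model name, port, and audio file are placeholders; extra_body is used because to_language is not part of the upstream client spec):

```python
# Minimal sketch: calling the translations endpoint with the new
# `to_language` parameter. Model name, port, and file are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("sample_audio.wav", "rb") as f:
    result = client.audio.translations.create(
        model="google/gemma-3n-E2B-it",
        file=f,
        # Not in the OAI spec; forwarded via extra_body since the
        # official client has no dedicated argument for it.
        extra_body={"to_language": "en"},
    )
print(result.text)
```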

A list of additional minor changes:

  • conftest.py for audio entrypoints tests
  • seed params for translations
  • whisper+gemma3n audio tests with a module-level server fixture (see the sketch after this list)
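
For illustration, the module-level fixture is roughly of this shape (a sketch only: RemoteOpenAIServer mirrors the helper used in vLLM's entrypoint tests, and the model list and server args here are assumptions, not the exact test code):

```python
# Sketch of a module-scoped server fixture parameterized over both
# models; names and arguments are illustrative, not the exact test code.
import pytest

from tests.utils import RemoteOpenAIServer  # vLLM test helper

MODELS = ["openai/whisper-large-v3", "google/gemma-3n-E2B-it"]

@pytest.fixture(scope="module", params=MODELS)
def server(request):
    # One server per model for the whole module rather than per test,
    # which keeps the added Gemma3n coverage from multiplying startup cost.
    with RemoteOpenAIServer(request.param, ["--enforce-eager"]) as s:
        yield s
```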

I also plan to follow up with revamped benchmark+evaluation scripts to better cover these models.

# pre 
python -m pytest tests/entrypoints/openai/test_translation_validation.py  146.99s user 22.77s system 154% cpu 1:50.18 total

# post
python -m pytest tests/entrypoints/openai/test_translation_validation.py  243.81s user 31.39s system 146% cpu 3:07.72 total

Signed-off-by: NickLucche <[email protected]>
@NickLucche (Collaborator, Author)

cc @DarkLight1337

@gemini-code-assist (bot) left a comment:

Code Review

This pull request enables Gemma3n for audio transcription and translation endpoints, which is a great addition. The changes include a soft API modification to add a to_language parameter, which will be useful for future enhancements. The tests have been updated to cover Gemma3n, including parameterization over different models, which is good practice. I've found one issue regarding input validation for the new model implementation that should be addressed.

if task_type == "transcribe" and full_lang_name:
    prompt += f" into {full_lang_name}"
elif task_type == "translate":
    if full_lang_name:
Member:

We should validate that both languages are valid when doing translation

Collaborator (Author):

I am assuming languages are validated beforehand, here: https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/openai/speech_to_text.py#L91.
Do you have some extra checks in mind?
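
For readers following along, the upstream check referenced above is roughly of this shape (illustrative only; the actual code lives in speech_to_text.py at the linked line and may differ):

```python
# Illustrative shape of the upstream language validation referenced
# above; the real check is in speech_to_text.py and may differ.
ISO639_1_SUPPORTED_LANGS = {"en": "English", "it": "Italian"}  # excerpt

def validate_language(lang: str | None) -> str | None:
    # Reject unknown ISO 639-1 codes before any prompt is built.
    if lang and lang not in ISO639_1_SUPPORTED_LANGS:
        raise ValueError(f"Unsupported language: {lang!r}. "
                         f"Must be one of {set(ISO639_1_SUPPORTED_LANGS)}.")
    return lang
```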

Member:

I see, in that case perhaps we should pass the full_lang_name directly into the method?

Member:

I also think that we should have a separate function for each task to reduce branching

Collaborator (Author):

> separate function for each task to reduce branching

I think this would cause duplication for the other models as of now.
The change makes sense; I just wanted to wait until a few more models are supported here before changing the interface.
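
For context, the proposed split would look roughly like this (method names and prompt wording are hypothetical, sketched from the snippet quoted above):

```python
# Hypothetical per-task prompt builders, replacing the single branching
# method discussed above. Names and wording are illustrative.
def get_transcription_prompt(full_lang_name: str | None = None) -> str:
    # Transcription keeps the source language; naming it is optional.
    prompt = "Transcribe this audio"
    if full_lang_name:
        prompt += f" into {full_lang_name}"
    return prompt

def get_translation_prompt(full_lang_name: str | None = None) -> str:
    # Translation always targets a language (validated upstream).
    prompt = "Translate this audio"
    if full_lang_name:
        prompt += f" into {full_lang_name}"
    return prompt
```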

Member:

I see, we can merge this PR for now then, thanks

Signed-off-by: NickLucche <[email protected]>
@pratapyash (Contributor)

@NickLucche Can we expect LoRA support for the text and audio modules for Gemma3n? (relevant issue: #21746)

@NickLucche (Collaborator, Author)

@pratapyash check #24003 out

@pratapyash (Contributor)

I'm facing a bug when running multimodal inference (specifically audio) with Gemma3n; it's relevant to this PR: #24006.

Summary:

  • In gemma3n_mm.py::_process_audio_input we call audio_input["input_features"].squeeze(1).
  • For batched audio requests, input_features arrives as a Python list, so this raises AttributeError: 'list' object has no attribute 'squeeze' and EngineCore dies.
  • Result: repeated HTTP 500s on /v1/chat/completions and an NCCL shutdown warning.
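
A minimal defensive fix along these lines would coerce the list into a tensor before squeezing (a sketch against the reported failure mode, assuming equal-length feature tensors; not the actual patch):

```python
# Sketch of a guard for the reported crash: batched requests can hand
# `input_features` over as a Python list of tensors rather than one
# stacked tensor, so `.squeeze(1)` blows up. Assumes equal shapes.
import torch

def _coerce_input_features(input_features) -> torch.Tensor:
    if isinstance(input_features, list):
        input_features = torch.stack(input_features)
    return input_features.squeeze(1)
```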

@DarkLight1337 DarkLight1337 added the ready ONLY add when PR is ready to merge/full CI is needed label Sep 1, 2025
Signed-off-by: NickLucche <[email protected]>
@NickLucche (Collaborator, Author)

@DarkLight1337 things look green here

@DarkLight1337 DarkLight1337 merged commit d46934b into vllm-project:main Sep 1, 2025
43 checks passed
eicherseiji pushed a commit to eicherseiji/vllm that referenced this pull request Sep 9, 2025

Labels: frontend, ready
