[Frontend] Gemma3n audio transcriptions/translations endpoint
#23735
Conversation
Code Review
This pull request enables Gemma3n for audio transcription and translation endpoints, which is a great addition. The changes include a soft API modification to add a to_language parameter, which will be useful for future enhancements. The tests have been updated to cover Gemma3n, including parameterization over different models, which is good practice. I've found one issue regarding input validation for the new model implementation that should be addressed.
```python
if task_type == "transcribe" and full_lang_name:
    prompt += f" into {full_lang_name}"
elif task_type == "translate":
    if full_lang_name:
```
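For orientation, here is a minimal runnable sketch of the kind of prompt-building helper this branch could live in; the function name, the base prompt strings, and the default-to-English fallback are assumptions for illustration, not the actual vLLM implementation:

```python
def build_speech_to_text_prompt(task_type: str, full_lang_name: str | None) -> str:
    # Base instruction; the real model-specific wording may differ.
    prompt = "Transcribe this audio" if task_type == "transcribe" else "Translate this audio"
    if task_type == "transcribe" and full_lang_name:
        # Optional language hint, e.g. "Transcribe this audio into Italian".
        prompt += f" into {full_lang_name}"
    elif task_type == "translate":
        # Target language for translation; assumed to fall back to English
        # when no to_language was provided.
        prompt += f" into {full_lang_name}" if full_lang_name else " into English"
    return prompt


# Example:
# build_speech_to_text_prompt("translate", "Italian")
# -> "Translate this audio into Italian"
```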
We should validate that both languages are valid when doing translation
I am assuming languages are validated beforehand here https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/openai/speech_to_text.py#L91.
Do you have some extra checks in mind?
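For illustration, the upstream validation could look roughly like the sketch below. This is an assumed helper, not the actual code at the linked `speech_to_text.py` line, and the ISO 639-1 table is abbreviated:

```python
# Assumed helper: map ISO 639-1 codes to full language names and reject
# anything unknown before the model-specific prompt code ever runs.
ISO639_1_TO_NAME = {
    "en": "English",
    "it": "Italian",
    "fr": "French",
    "de": "German",
    # ... remaining codes elided
}

def resolve_language(code: str) -> str:
    try:
        return ISO639_1_TO_NAME[code.lower()]
    except KeyError:
        raise ValueError(f"Unsupported language code: {code!r}") from None
```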
I see, in that case perhaps we should pass the full_lang_name directly into the method?
I also think that we should have a separate function for each task to reduce branching
> separate function for each task to reduce branching

I think this may cause duplication across the other models as of now.
The change makes sense; I just wanted to wait until a few more models are supported here before changing the interface.
I see, we can merge this PR for now then, thanks
@NickLucche Can we expect LoRA support for the text and audio modules of Gemma3n? (relevant issue: #21746)
@pratapyash check #24003 out
Facing a bug when running multimodal inference (specifically audio) with Gemma3n; it would be relevant to this PR. See #24006.
@DarkLight1337 things look green here
This PR enables Gemma3n for use with the audio-specific endpoints (transcriptions/translations).
I've also added a "soft" interface change to add a `to_language` parameter to the API, as I found it helps somewhat with translation. The rationale is that I'd like to keep these changes lightweight for now, since we're only slightly steering away from the original OpenAI Whisper-only spec, and instead see where the broader audio community wants it to go.
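For example, a client could exercise the new field roughly as in the sketch below; the model name and the use of `extra_body` to carry the non-OpenAI `to_language` field are assumptions for illustration:

```python
from openai import OpenAI

# Points at a locally running vLLM OpenAI-compatible server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("sample_it.wav", "rb") as f:
    result = client.audio.translations.create(
        model="google/gemma-3n-E2B-it",  # assumed Gemma3n checkpoint
        file=f,
        # `to_language` is the new vLLM-specific field; the official client
        # doesn't know it, so it is passed through extra_body.
        extra_body={"to_language": "en"},
    )

print(result.text)
```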
No chunking for now, as I believe a long-audio capability assessment is in order for this model.
A list of additional minor changes:
I also plan to follow up with revamped benchmark+evaluation scripts to better cover these models.