[Frontend] Add automatic language detection for Whisper transcription#34342
spacecheck wants to merge 7 commits into vllm-project:main
Conversation
Signed-off-by: space_check <[email protected]>
Signed-off-by: space_check <[email protected]>
Signed-off-by: space_check <[email protected]>
Code Review
This pull request adds automatic language detection for Whisper transcriptions, which is a great feature. The implementation is mostly solid, with new tests covering the functionality. However, I've identified a critical issue related to potential race conditions when handling concurrent language detection requests. Please see the detailed comment for a fix.
Signed-off-by: space_check <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger a full CI run by default. Instead, they only run the fastcheck CI, a small and essential subset of tests that quickly catches errors. You can ask your reviewers to trigger select CI tests on top of fastcheck. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
Hi @spacecheck, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch. For future commits, the installed pre-commit hook will run automatically before each commit.
Hello, currently we're using just the <|startoftranscript|> token in the decoder prompt to trigger language detection. However, this doesn't maintain the typical structure of the task prompt, since it's limited to a single token. In the standard approach (as in the forward method), we usually send a more complete sequence of tokens, which ensures that the language token is included alongside the task-specific token. That preserves the expected format and might also align better with the task pipeline.

That said, I understand the need for simplicity and appreciate the approach taken here. I just wanted to highlight this difference in case we'd like to consider it for a future improvement, to potentially retain the task token while keeping language detection as flexible as possible. Please feel free to let me know your thoughts!
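For concreteness, here is a minimal sketch of the two decoder-prompt shapes being compared, written against Hugging Face's WhisperTokenizer; the exact way vLLM assembles the prompt may differ, and the German language token is just an example:

```python
from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-large-v3")

# Detection-only decoder prompt used in this PR: a single <|startoftranscript|>
# token; the model's next prediction is read back as the language guess.
detect_prompt_ids = tokenizer.convert_tokens_to_ids(["<|startoftranscript|>"])

# Typical full task prompt once the language is known (here German):
# <|startoftranscript|><|de|><|transcribe|><|notimestamps|>
task_prompt_ids = tokenizer.convert_tokens_to_ids(
    ["<|startoftranscript|>", "<|de|>", "<|transcribe|>", "<|notimestamps|>"]
)

print(detect_prompt_ids, task_prompt_ids)
```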
Signed-off-by: space_check <[email protected]>
Thanks for the interest in this. The language detection here aligns closely with the detect_language function in the original whisper repo. (Now that I look back on it, it might make sense to also force the model to produce a language token like they do, but I would first like to get the changes I made outside of whisper.py cleared or discussed with a code owner.) Could you point me to some resource confirming that the "standard approach" would be different, i.e. not a forward pass using a single token?
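For reference, this is roughly how the original whisper repo exposes that detection (a usage sketch against the openai-whisper package, not vLLM code; "sample.wav" is a placeholder file name):

```python
import whisper

model = whisper.load_model("large-v3")

# Load, trim/pad the audio and compute the log-Mel spectrogram the model expects.
audio = whisper.pad_or_trim(whisper.load_audio("sample.wav"))
mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)

# detect_language runs one decoder step from <|startoftranscript|> with the
# logits restricted to language tokens and returns per-language probabilities.
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")
```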
Hello
NickLucche
left a comment
Thanks for contributing @spacecheck !
This is looking mostly good, only left a few comments
from ...utils import RemoteOpenAIServer

-MODEL_NAME = "openai/whisper-large-v3-turbo"
+MODEL_NAME = "openai/whisper-large-v3"
why are we dropping the turbo model, is there some issue with lang token pred for this model?
Yes, turbo really likes to predict the <|transcribe|> task token instead of any language token. The regular v3 guesses language tokens well though. I could try adding constrained generation to force a language token but would need to look into how to do that in vLLM.
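One way to do that constrained generation could be SamplingParams.allowed_token_ids, restricting the single detection step to language-token IDs. A rough sketch under that assumption (not tested with the turbo model, and only a handful of language codes shown):

```python
from transformers import WhisperTokenizer
from vllm import SamplingParams

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-large-v3-turbo")

# For the sketch, only a few language codes; a real implementation would cover
# every language token Whisper knows.
language_token_ids = tokenizer.convert_tokens_to_ids(
    [f"<|{code}|>" for code in ("en", "de", "it", "fr", "es")]
)

# Decode exactly one token and only allow language tokens, so even turbo
# cannot answer with <|transcribe|> instead of a language.
detect_params = SamplingParams(
    max_tokens=1,
    temperature=0.0,
    allowed_token_ids=language_token_ids,
)
```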
async def test_language_auto_detect_english(whisper_client, mary_had_lamb):
    """Auto-detect language as English when no language param is provided."""
    transcription = await whisper_client.audio.transcriptions.create(
        model=MODEL_NAME,
        file=mary_had_lamb,
        response_format="verbose_json",
        temperature=0.0,
    )
    assert transcription.language == "en"
    assert "Mary had a little lamb" in transcription.text


@pytest.mark.asyncio
async def test_language_auto_detect_italian(whisper_client, foscolo):
let's merge these tests into a single one with parametrization
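One possible parametrized version of the two tests (a sketch; the fixture and constant names follow the diff above, and the expected Italian reference text is left out since it isn't shown here):

```python
import pytest


@pytest.mark.asyncio
@pytest.mark.parametrize(
    ("audio_fixture", "expected_language", "expected_text"),
    [
        ("mary_had_lamb", "en", "Mary had a little lamb"),
        ("foscolo", "it", None),  # Italian reference text omitted in this sketch
    ],
)
async def test_language_auto_detect(
    whisper_client, request, audio_fixture, expected_language, expected_text
):
    """Auto-detect the language when no language param is provided."""
    audio_file = request.getfixturevalue(audio_fixture)
    transcription = await whisper_client.audio.transcriptions.create(
        model=MODEL_NAME,
        file=audio_file,
        response_format="verbose_json",
        temperature=0.0,
    )
    assert transcription.language == expected_language
    if expected_text is not None:
        assert expected_text in transcription.text
```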
@runtime_checkable
class SupportsExplicitLanguageDetection(Protocol):
have you considered merging the methods below and making this a class attribute on SupportsTranscription, similar to supports_transcription_only?
I removed the SupportsExplicitLanguageDetection class and created the class attribute supports_explicit_language_detection.

When you wrote "merging methods", did you mean combining get_language_detection_prompt and parse_language_detection_output, or just moving them into the SupportsTranscription class? I have done the latter; I don't think combining them would be a good idea, because the engine_client.generate() call in speech_to_text.py sits between them and the model class shouldn't know about the engine. The current split (the model knows how to build the prompt and parse the output, while the serving layer handles the actual inference call) is a cleaner separation of concerns.
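To summarize the resulting split, here is a sketch of the shape being discussed; detect_language and engine_generate are stand-ins for the real _detect_language and engine_client.generate() in speech_to_text.py, and the exact vLLM signatures may differ:

```python
class SupportsTranscription:
    """Interface sketch; names follow the PR discussion, signatures are assumed."""

    # Mirrors supports_transcription_only: models that can run an explicit
    # language-detection pass (e.g. Whisper) set this to True.
    supports_explicit_language_detection: bool = False

    @classmethod
    def get_language_detection_prompt(cls, audio):
        """Build the model-specific prompt for the detection pass."""
        raise NotImplementedError

    @classmethod
    def parse_language_detection_output(cls, text: str) -> str:
        """Map the generated token(s) to a language code, e.g. '<|de|>' -> 'de'."""
        raise NotImplementedError


async def detect_language(model_cls, engine_generate, audio, default="en"):
    """Serving-layer glue: the model builds the prompt and parses the result,
    while the serving layer owns the engine call that sits in between."""
    if not model_cls.supports_explicit_language_detection:
        return default
    prompt = model_cls.get_language_detection_prompt(audio)
    raw_output = await engine_generate(prompt)
    return model_cls.parse_language_detection_output(raw_output) or default
```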
Signed-off-by: space_check <[email protected]>
…f SupportsTranscription Signed-off-by: space_check <[email protected]>
Adds the feature requested in #14174 and #25750.
Purpose
Add automatic language detection for Whisper transcription when no language parameter is specified.

- Detection runs a short generation with <|startoftranscript|> as the decoder prompt and parses the predicted language token (e.g. <|de|>).
- Falls back to "en" if detection fails or produces an unrecognized token.
- A protocol SupportsExplicitLanguageDetection is added, which Whisper implements to provide the language detection prompt and parse its output.
- The serving-layer code (_detect_language) is model-agnostic, is only called if language is None and isinstance(self.model_cls, SupportsExplicitLanguageDetection), and delegates prompt construction + output parsing to the model class, so other STT models (e.g. Voxtral) are unaffected and future models can implement their own explicit detection strategy if needed.

Test Plan
Test Result
I ran the three tests and they all passed.
Regarding language detection duration, I built vLLM and ran each command 5 times with my own audio files and took the difference of the averages: with a 16-second German audio file this comes out to 325 ms - 294 ms = 31 ms of overhead for language detection (on my 4090).