
[Frontend] Add automatic language detection for Whisper transcription#34342

Open
spacecheck wants to merge 7 commits into vllm-project:main from spacecheck:main

Conversation


@spacecheck spacecheck commented Feb 11, 2026

Adds the feature from #14174 and #25750.

Purpose

Add automatic language detection for Whisper transcription when no language parameter is specified.

  • Whisper auto-detects the spoken language by running a single-token generation with <|startoftranscript|> as the decoder prompt and parsing the predicted language token (e.g. <|de|>)
  • Falls back to "en" if detection fails or produces an unrecognized token
  • To implement the functionality, a new class SupportsExplicitLanguageDetection is added; Whisper implements it to build the language detection prompt and to parse the detection output (see the sketch after this list)
  • The entrypoint orchestrator (_detect_language) is model-agnostic: it is only called when language is None and isinstance(self.model_cls, SupportsExplicitLanguageDetection), and it delegates prompt construction and output parsing to the model class. Other STT models (e.g. Voxtral) are therefore unaffected, and future models can implement their own explicit detection strategy if needed
  • Language detection adds only ~30ms overhead regardless of audio length (only the first 30s chunk is used), measured on an RTX 4090
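
A minimal sketch of the two model-side hooks described above, assuming the names from this description (the exact signatures in the diff may differ, and the real hooks live on the Whisper model class):

import re

# Language tokens look like <|en|>, <|de|>, <|yue|>, ...
_LANG_TOKEN_RE = re.compile(r"<\|([a-z]{2,3})\|>")

def get_language_detection_prompt() -> str:
    # Decoder prompt is only the start-of-transcript token; the encoder
    # still receives the first 30s audio chunk as usual.
    return "<|startoftranscript|>"

def parse_language_detection_output(generated_text: str, default: str = "en") -> str:
    # The single generated token should be a language token such as <|de|>;
    # fall back to the default for anything unrecognized.
    match = _LANG_TOKEN_RE.fullmatch(generated_text.strip())
    return match.group(1) if match else default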

Test Plan

 # Unit test (no GPU)
 pytest tests/models/multimodal/generation/test_whisper.py::test_parse_language_detection_output -v

 # Integration tests (GPU)
 pytest tests/entrypoints/openai/test_transcription_validation_whisper.py -v

Test Result

I ran the tests above and all of them passed:

tests/models/multimodal/generation/test_whisper.py::test_parse_language_detection_output PASSED

tests/entrypoints/openai/test_transcription_validation_whisper.py::test_basic_audio PASSED [  7%]
tests/entrypoints/openai/test_transcription_validation_whisper.py::test_basic_audio_batched PASSED [ 15%]
tests/entrypoints/openai/test_transcription_validation_whisper.py::test_bad_requests PASSED [ 23%]
tests/entrypoints/openai/test_transcription_validation_whisper.py::test_long_audio_request PASSED [ 30%]
tests/entrypoints/openai/test_transcription_validation_whisper.py::test_completion_endpoints PASSED [ 38%]
tests/entrypoints/openai/test_transcription_validation_whisper.py::test_streaming_response PASSED [ 46%]
tests/entrypoints/openai/test_transcription_validation_whisper.py::test_stream_options PASSED [ 53%]
tests/entrypoints/openai/test_transcription_validation_whisper.py::test_sampling_params PASSED [ 61%]
tests/entrypoints/openai/test_transcription_validation_whisper.py::test_audio_prompt PASSED [ 69%]
tests/entrypoints/openai/test_transcription_validation_whisper.py::test_audio_with_timestamp PASSED [ 76%]
tests/entrypoints/openai/test_transcription_validation_whisper.py::test_audio_with_max_tokens PASSED [ 84%]
tests/entrypoints/openai/test_transcription_validation_whisper.py::test_language_auto_detect_english PASSED [ 92%]
tests/entrypoints/openai/test_transcription_validation_whisper.py::test_language_auto_detect_italian PASSED [100%]

Regarding the language detection duration, I built vLLM and ran each command 5 times with my own audio files, then took the difference of the averages:

time curl http://localhost:8000/v1/audio/transcriptions -F "file=@<audio_file>" -F "language=de"
time curl http://localhost:8000/v1/audio/transcriptions -F "file=@<audio_file>"

With a 16-second German audio file this comes out to 325 ms - 294 ms = 31 ms of overhead for language detection (on my 4090).


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds automatic language detection for Whisper transcriptions, which is a great feature. The implementation is mostly solid, with new tests covering the functionality. However, I've identified a critical issue related to potential race conditions when handling concurrent language detection requests. Please see the detailed comment for a fix.

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run fastcheck CI, a small and essential subset of CI tests meant to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

@mergify

mergify bot commented Feb 11, 2026

Hi @spacecheck, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

@warichet

warichet commented Feb 11, 2026

Hello,
I wanted to share a small observation regarding the current implementation of language detection. While this approach works well in terms of predicting the language automatically, it slightly diverges from the standard workflow in the following way:

Currently, we’re using just the <|startoftranscript|> token in the decoder prompt to trigger language detection. However, this method doesn't maintain the typical structure of the task prompt, as it’s limited to a single token for the language.

In the standard approach (as in the forward method), we usually send a more complete sequence of tokens, which ensures that the language token is included alongside the task-specific token (illustrated below). This is useful for preserving the expected format and might also benefit from better alignment with the task pipeline.

That said, I understand the need for simplicity and appreciate the approach taken here. I just wanted to highlight this difference in case we’d like to consider it for future improvements, to potentially retain the task token while keeping language detection as flexible as possible.

Please feel free to let me know your thoughts!
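
For reference, a sketch of the fuller decoder prefix being described here, produced with transformers' get_decoder_prompt_ids (the model name and the German example are assumptions for illustration):

from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-large-v3")
# Pins the tokens that follow <|startoftranscript|>: the language token
# (<|de|>), the task token (<|transcribe|>), and <|notimestamps|>.
forced_ids = processor.get_decoder_prompt_ids(language="de", task="transcribe")
print(forced_ids)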

whisper.py

@spacecheck
Author


Thanks for the interest in this. The language detection here aligns closely with the detect_language function in the original Whisper repo. (Looking back on it, it might make sense to also force a language token as they do, but I would first like to get the changes I made outside of whisper.py cleared or discussed with a code owner.) Could you point me to a resource confirming that the "standard approach" would be different, i.e. would not involve a forward pass using a single token?

@warichet

Hello,
in the OpenAI implementation, a single token is also passed initially, but the task-specific token is retained, since it is difficult to predict. From the tests I've conducted, Whisper doesn't predict this token, and this can affect the results, as these tokens are used during the decoding phase. I believe this is an important point to consider for improving the accuracy of the output.


@NickLucche NickLucche left a comment


Thanks for contributing @spacecheck !
This is looking mostly good, I've only left a few comments.

from ...utils import RemoteOpenAIServer

MODEL_NAME = "openai/whisper-large-v3-turbo"
MODEL_NAME = "openai/whisper-large-v3"
Collaborator


why are we dropping the turbo model, is there some issue with lang token pred for this model?

Author


Yes, turbo really likes to predict the <|transcribe|> task token instead of any language token. The regular v3 guesses language tokens well though. I could try adding constrained generation to force a language token but would need to look into how to do that in vLLM.
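
For what it's worth, a rough sketch of what constraining that single detection step to language tokens could look like with vLLM's allowed_token_ids sampling option; this is an assumption about one possible approach, not something this PR does, and the tokenizer/LANGUAGES imports are only illustrative:

from transformers import AutoTokenizer
from transformers.models.whisper.tokenization_whisper import LANGUAGES
from vllm import SamplingParams

tokenizer = AutoTokenizer.from_pretrained("openai/whisper-large-v3-turbo")
# IDs of all Whisper language tokens, e.g. <|en|>, <|de|>, <|it|>, ...
lang_token_ids = [
    tokenizer.convert_tokens_to_ids(f"<|{code}|>") for code in LANGUAGES
]

# Generate exactly one token, restricted to the language-token vocabulary,
# so turbo cannot emit <|transcribe|> instead.
detect_params = SamplingParams(
    max_tokens=1,
    temperature=0.0,
    allowed_token_ids=lang_token_ids,
)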

Comment on lines 279 to 292
async def test_language_auto_detect_english(whisper_client, mary_had_lamb):
    """Auto-detect language as English when no language param is provided."""
    transcription = await whisper_client.audio.transcriptions.create(
        model=MODEL_NAME,
        file=mary_had_lamb,
        response_format="verbose_json",
        temperature=0.0,
    )
    assert transcription.language == "en"
    assert "Mary had a little lamb" in transcription.text


@pytest.mark.asyncio
async def test_language_auto_detect_italian(whisper_client, foscolo):
Collaborator


let's merge these tests into a single one with parametrization

Author


thanks, I fixed this
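
For context, the parametrized version looks roughly like this (fixture names and expected languages are taken from the original tests; the exact test in the PR may differ slightly):

@pytest.mark.asyncio
@pytest.mark.parametrize(
    "audio_fixture,expected_language",
    [("mary_had_lamb", "en"), ("foscolo", "it")],
)
async def test_language_auto_detect(whisper_client, audio_fixture,
                                    expected_language, request):
    audio = request.getfixturevalue(audio_fixture)
    transcription = await whisper_client.audio.transcriptions.create(
        model=MODEL_NAME,
        file=audio,
        response_format="verbose_json",
        temperature=0.0,
    )
    assert transcription.language == expected_language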



@runtime_checkable
class SupportsExplicitLanguageDetection(Protocol):
Collaborator


Have you considered merging the methods below and making this a class attribute on SupportsTranscription, similar to supports_transcription_only?

Author


I removed the SupportsExplicitLanguageDetection class and created the class attribute supports_explicit_language_detection.

When you wrote "merging methods", did you mean combining get_language_detection_prompt and parse_language_detection_output, or just moving them into the SupportsTranscription class? I have done the latter. I don't think combining them would be a good idea, because the engine_client.generate() call in speech_to_text.py sits between them and the model class shouldn't know about the engine. The current split (the model knows how to build the prompt and parse the output, the serving layer handles the actual inference call) is a cleaner separation of concerns; see the sketch below.
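
A rough sketch of the serving-layer orchestration I mean; the hook and attribute names follow this thread, while the surrounding plumbing is illustrative rather than the actual speech_to_text.py code:

from vllm import SamplingParams

async def _detect_language(self, request_id: str, audio_prompt) -> str:
    model_cls = self.model_cls
    if not getattr(model_cls, "supports_explicit_language_detection", False):
        return "en"

    # The model class knows how to build the decoder-side detection prompt...
    prompt = {
        "encoder_prompt": audio_prompt,  # first 30s audio chunk, already prepared
        "decoder_prompt": model_cls.get_language_detection_prompt(),
    }
    params = SamplingParams(max_tokens=1, temperature=0.0)

    # ...the serving layer owns the actual engine call...
    final_output = None
    async for output in self.engine_client.generate(prompt, params, request_id):
        final_output = output
    generated = final_output.outputs[0].text if final_output else ""

    # ...and the model class knows how to parse what came back.
    return model_cls.parse_language_detection_output(generated)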


Labels

frontend multi-modality Related to multi-modality (#4194)
