[Frontend] Add automatic language detection for Whisper transcription#34342
spacecheck wants to merge 7 commits into vllm-project:main
Conversation
Signed-off-by: space_check <[email protected]>
Signed-off-by: space_check <[email protected]>
Signed-off-by: space_check <[email protected]>
Code Review
This pull request adds automatic language detection for Whisper transcriptions, which is a great feature. The implementation is mostly solid, with new tests covering the functionality. However, I've identified a critical issue related to potential race conditions when handling concurrent language detection requests. Please see the detailed comment for a fix.
Signed-off-by: space_check <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger a full CI run by default. Instead, they only run the fastcheck CI, a small and essential subset of tests that quickly catches errors. You can ask your reviewers to trigger select CI tests on top of fastcheck. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
Hi @spacecheck, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch. For future commits, the installed pre-commit hook will run automatically before each commit.
Hello, currently we're using just the <|startoftranscript|> token in the decoder prompt to trigger language detection. However, this doesn't maintain the typical structure of the task prompt, since it's limited to a single token. In the standard approach (as in the forward method), we usually send a more complete sequence of tokens, which ensures that the language token is included alongside the task-specific token. That preserves the expected format and might also align better with the task pipeline.

That said, I understand the need for simplicity and appreciate the approach taken here. I just wanted to highlight this difference in case we'd like to consider it for a future improvement, to potentially retain the task token while keeping language detection as flexible as possible. Please feel free to let me know your thoughts!
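For concreteness, here is a minimal sketch of the two decoder-prompt shapes being compared, written against Hugging Face's WhisperTokenizer; the exact way vLLM assembles the prompt may differ, and the German language token is just an example:

```python
from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-large-v3")

# Detection-only decoder prompt used in this PR: a single <|startoftranscript|>
# token; the model's next prediction is read back as the language guess.
detect_prompt_ids = tokenizer.convert_tokens_to_ids(["<|startoftranscript|>"])

# Typical full task prompt once the language is known (here German):
# <|startoftranscript|><|de|><|transcribe|><|notimestamps|>
task_prompt_ids = tokenizer.convert_tokens_to_ids(
    ["<|startoftranscript|>", "<|de|>", "<|transcribe|>", "<|notimestamps|>"]
)

print(detect_prompt_ids, task_prompt_ids)
```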
Signed-off-by: space_check <[email protected]>
Thanks for the interest in this. The language detection here aligns closely with the detect_language function in the original whisper repo. (Now that I look back on it, it might make sense to also force the model to produce a language token like they do, but I would first like to get the changes I made outside of whisper.py cleared or discussed with a code owner.) Could you point me to some resource confirming that the "standard approach" would be different, i.e. not a forward pass using a single token?
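For reference, this is roughly how the original whisper repo exposes that detection (a usage sketch against the openai-whisper package, not vLLM code; "sample.wav" is a placeholder file name):

```python
import whisper

model = whisper.load_model("large-v3")

# Load, trim/pad the audio and compute the log-Mel spectrogram the model expects.
audio = whisper.pad_or_trim(whisper.load_audio("sample.wav"))
mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)

# detect_language runs one decoder step from <|startoftranscript|> with the
# logits restricted to language tokens and returns per-language probabilities.
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")
```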
Hello
NickLucche
left a comment
Thanks for contributing @spacecheck !
This is looking mostly good, only left a few comments
from ...utils import RemoteOpenAIServer

-MODEL_NAME = "openai/whisper-large-v3-turbo"
+MODEL_NAME = "openai/whisper-large-v3"
why are we dropping the turbo model, is there some issue with lang token pred for this model?
Yes, turbo really likes to predict the <|transcribe|> task token instead of any language token. The regular v3 guesses language tokens well though. I could try adding constrained generation to force a language token but would need to look into how to do that in vLLM.
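One way to do that constrained generation could be SamplingParams.allowed_token_ids, restricting the single detection step to language-token IDs. A rough sketch under that assumption (not tested with the turbo model, and only a handful of language codes shown):

```python
from transformers import WhisperTokenizer
from vllm import SamplingParams

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-large-v3-turbo")

# For the sketch, only a few language codes; a real implementation would cover
# every language token Whisper knows.
language_token_ids = tokenizer.convert_tokens_to_ids(
    [f"<|{code}|>" for code in ("en", "de", "it", "fr", "es")]
)

# Decode exactly one token and only allow language tokens, so even turbo
# cannot answer with <|transcribe|> instead of a language.
detect_params = SamplingParams(
    max_tokens=1,
    temperature=0.0,
    allowed_token_ids=language_token_ids,
)
```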
async def test_language_auto_detect_english(whisper_client, mary_had_lamb):
    """Auto-detect language as English when no language param is provided."""
    transcription = await whisper_client.audio.transcriptions.create(
        model=MODEL_NAME,
        file=mary_had_lamb,
        response_format="verbose_json",
        temperature=0.0,
    )
    assert transcription.language == "en"
    assert "Mary had a little lamb" in transcription.text


@pytest.mark.asyncio
async def test_language_auto_detect_italian(whisper_client, foscolo):
let's merge these tests into a single one with parametrization
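One possible parametrized version of the two tests (a sketch; the fixture and constant names follow the diff above, and the expected Italian reference text is left out since it isn't shown here):

```python
import pytest


@pytest.mark.asyncio
@pytest.mark.parametrize(
    ("audio_fixture", "expected_language", "expected_text"),
    [
        ("mary_had_lamb", "en", "Mary had a little lamb"),
        ("foscolo", "it", None),  # Italian reference text omitted in this sketch
    ],
)
async def test_language_auto_detect(
    whisper_client, request, audio_fixture, expected_language, expected_text
):
    """Auto-detect the language when no language param is provided."""
    audio_file = request.getfixturevalue(audio_fixture)
    transcription = await whisper_client.audio.transcriptions.create(
        model=MODEL_NAME,
        file=audio_file,
        response_format="verbose_json",
        temperature=0.0,
    )
    assert transcription.language == expected_language
    if expected_text is not None:
        assert expected_text in transcription.text
```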
@runtime_checkable
class SupportsExplicitLanguageDetection(Protocol):
have you considered merging the methods below and making this a class attribute on SupportsTranscription, similar to supports_transcription_only?
I removed the SupportsExplicitLanguageDetection class and created the class attribute supports_explicit_language_detection.

When you wrote "merging methods", did you mean combining get_language_detection_prompt and parse_language_detection_output, or just moving them into the SupportsTranscription class? I have done the latter; I don't think combining them would be a good idea, because the engine_client.generate() call in speech_to_text.py sits between them and the model class shouldn't know about the engine. The current split (the model knows how to build the prompt and parse the output, while the serving layer handles the actual inference call) is a cleaner separation of concerns.
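To summarize the resulting split, here is a sketch of the shape being discussed; detect_language and engine_generate are stand-ins for the real _detect_language and engine_client.generate() in speech_to_text.py, and the exact vLLM signatures may differ:

```python
class SupportsTranscription:
    """Interface sketch; names follow the PR discussion, signatures are assumed."""

    # Mirrors supports_transcription_only: models that can run an explicit
    # language-detection pass (e.g. Whisper) set this to True.
    supports_explicit_language_detection: bool = False

    @classmethod
    def get_language_detection_prompt(cls, audio):
        """Build the model-specific prompt for the detection pass."""
        raise NotImplementedError

    @classmethod
    def parse_language_detection_output(cls, text: str) -> str:
        """Map the generated token(s) to a language code, e.g. '<|de|>' -> 'de'."""
        raise NotImplementedError


async def detect_language(model_cls, engine_generate, audio, default="en"):
    """Serving-layer glue: the model builds the prompt and parses the result,
    while the serving layer owns the engine call that sits in between."""
    if not model_cls.supports_explicit_language_detection:
        return default
    prompt = model_cls.get_language_detection_prompt(audio)
    raw_output = await engine_generate(prompt)
    return model_cls.parse_language_detection_output(raw_output) or default
```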
Signed-off-by: space_check <[email protected]>
…f SupportsTranscription Signed-off-by: space_check <[email protected]>
Adds the feature requested in #14174 and #25750.
Purpose
Add automatic language detection for Whisper transcription when no language parameter is specified.

- Detection runs a short generation with <|startoftranscript|> as the decoder prompt and parses the predicted language token (e.g. <|de|>).
- Falls back to "en" if detection fails or produces an unrecognized token.
- A protocol SupportsExplicitLanguageDetection is added, which Whisper implements to provide the language detection prompt and parse its output.
- The serving-layer code (_detect_language) is model-agnostic, is only called if language is None and isinstance(self.model_cls, SupportsExplicitLanguageDetection), and delegates prompt construction + output parsing to the model class, so other STT models (e.g. Voxtral) are unaffected and future models can implement their own explicit detection strategy if needed.

Test Plan
Test Result
I ran the three tests and they all passed.
Regarding language detection duration, I built vLLM and ran each command 5 times with my own audio files and took the difference of the averages: with a 16-second German audio file this comes out to 325 ms - 294 ms = 31 ms of overhead for language detection (on my 4090).