[feat]: add transcription API endpoint using OpenAI Whisper-small #469
Conversation
I’ve put together the steps I followed [here](https://davidgao7.github.io/posts/vllm-v1-whisper-transcription/) -- would love any feedback if I missed something or got anything wrong! 😊
We have been testing this and it works fine! We're pretty happy with this PR. When building your branch we noticed it's a couple of commits behind the main repo, causing it to fail. When we aligned the files with main, it worked like a charm.
@davidgao7 This is super awesome! I'll take a look
src/vllm_router/run-router.sh
Outdated
```
@@ -35,3 +35,19 @@ python3 -m vllm_router.app --port "$1" \
#    --engine-stats-interval 10 \
#    --log-stats
#
```
Let's not touch this example router running script. Could you create a simple tutorial for this new feature and put it under `tutorials/`?
```
# return the whisper response unmodified
resp = proxied.json()
logger.debug("==== Whisper response payload ====")
```
Can we remove these debugging logs, if possible?
```
resp = proxied.json()
logger.debug("==== Whisper response payload ====")
logger.debug(resp)
logger.debug("==== Whisper response payload ====")
```
Also this one
Hey @davidgao7 this is an awesome feature to have! A general comment: can you remove all the debugging messages in the code?
@YuhanLiu11 Got it! Just to make things clear, I'll …
Force-pushed from ea2074e to 95d12b0
Hi @YuhanLiu11! Just a quick update — the debug logger print you mentioned has been removed.
src/vllm_router/utils.py
Outdated
```
case ModelType.transcription:
    return {
        "file": "",
        "model": "openai/whisper-small",
```
the model should not be included in this payload
logger.debug("==== Total endpoints ====") | ||
|
||
# TODO: right now is skipping label check in code for local testing | ||
endpoints = [ |
we already have the logic implemented here: https://github.com/vllm-project/production-stack/blob/main/src/vllm_router/service_discovery.py#L282
Force-pushed from fbc06f4 to f41d717
src/vllm_router/utils.py
Outdated
```
@@ -75,6 +76,10 @@ def get_test_payload(model_type: str):
            return {"query": "Hello", "documents": ["Test"]}
        case ModelType.score:
            return {"encoding_format": "float", "text_1": "Test", "test_2": "Test2"}
        case ModelType.transcription:
            return {
                "file": "",
```
This is used to test the payload so ideally we would have an example audio file here. Maybe we can generate a very short one on application startup?
or does it also work with an empty file?
> This is used to test the payload so ideally we would have an example audio file here. Maybe we can generate a very short one on application startup?
> or does it also work with an empty file?
Hi Max,
Thanks for the suggestion! I’m planning to go with the empty audio file approach first — the idea is to pass an empty in-memory file (BytesIO()) as the payload to test compatibility with the transcription endpoint. If that works reliably, it’ll keep things clean without needing to generate or manage any example files.
I’ll test it out today and report back on whether it behaves as expected. If it doesn’t work well, we can revisit the idea of generating a short clip at startup.
Let me know if you have any thoughts before I proceed!
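As a rough illustration of that idea, a probe like the following could send a zero-byte upload to the endpoint (the router URL here is a placeholder for this sketch, not a value from the PR):

```python
import io

import requests

# Hypothetical probe: send an empty in-memory file as the multipart payload
# and check whether the transcription endpoint accepts it.
empty_file = io.BytesIO()  # zero bytes, not even a WAV header
response = requests.post(
    "http://localhost:8000/v1/audio/transcriptions",  # placeholder router URL
    files={"file": ("empty.wav", empty_file, "audio/wav")},
    data={"model": "openai/whisper-small"},
)
print(response.status_code, response.text)
```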
If you want to test whether the test payload works, you can use `--static-backend-health-checks`. For this to work, you have to go via the normal endpoint health API though: https://github.com/vllm-project/production-stack/blob/main/src/vllm_router/service_discovery.py#L282
If you have any questions, feel free to just ask and I can try to help as well!
@davidgao7 Hi David, would you mind fixing the pre-commit check issue and the review comment mentioned above?
Hi @zerofishnoodles — thanks for the reminder! Sorry for the delay, I’ve been a bit tied up but finally have some solid time blocks again and can jump back in. I’ll fix the pre-commit issue and address Max’s review comment shortly!
src/vllm_router/utils.py
Outdated
```python
case ModelType.transcription:
    # Generate a 0.1 second silent audio file
    with io.BytesIO() as wav_buffer:
        with wave.open(wav_buffer, "wb") as wf:
            wf.setnchannels(1)  # mono audio channel, standard configuration
            wf.setsampwidth(2)  # 16 bit audio, common bit depth for wav file
            wf.setframerate(16000)  # 16 kHz sample rate
            wf.writeframes(b"\x00\x00" * 1600)  # 0.1 second of silence

        # retrieve the generated wav bytes
        wav_bytes = wav_buffer.getvalue()

    return {
        "file": ("empty.wav", wav_bytes, "audio/wav"),
    }
```
this is very cool!
Minor comment, feel free to ignore: could we add this as a constant somewhere, to avoid the byte buffer being recreated every minute?
@max-wittig Hi Max,
Quick update on the silent WAV bytes. I got it implemented as a module-level constant (`_SILENT_WAV_BYTES`) in `src/vllm_router/utils.py`, so that's all squared away on preventing repeated creation, which is great!
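For reference, a minimal sketch of what that module-level constant could look like, reusing the WAV parameters from the snippet above (the helper name `_generate_silent_wav` is illustrative, not necessarily what the PR merged):

```python
import io
import wave


def _generate_silent_wav() -> bytes:
    """Build 0.1 s of silent 16 kHz mono 16-bit WAV audio for health checks."""
    with io.BytesIO() as wav_buffer:
        with wave.open(wav_buffer, "wb") as wf:
            wf.setnchannels(1)  # mono
            wf.setsampwidth(2)  # 16-bit samples
            wf.setframerate(16000)  # 16 kHz sample rate
            wf.writeframes(b"\x00\x00" * 1600)  # 1600 frames = 0.1 s of silence
        return wav_buffer.getvalue()


# Computed once at import time so periodic health checks reuse the same bytes.
_SILENT_WAV_BYTES = _generate_silent_wav()
```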
While I was digging into the health checks for the `transcription` model, I noticed an interesting log pop up, like this one:

[screenshot of the failing health-check log omitted]

It looks like the `is_model_healthy` function (line 185 in utils.py) is designed to send a JSON payload (around line 190) for health checks, which works perfectly for most models!

The core issue is this: the `transcription` endpoint (`/v1/audio/transcriptions`) on the Whisper backend specifically expects `multipart/form-data` with an audio file – just like how the main transcription endpoint in `main_router.py` is designed to handle user uploads (it uses `UploadFile` and builds `files=files`). So, when the health check sends JSON, the backend doesn't quite know what to do with it and rejects the request, making the health check show up as `unhealthy`.

I was thinking a good way to get this working smoothly for `transcription` models would be to adjust the `is_model_healthy` function to specifically send a minimal `multipart/form-data` request, using our `_SILENT_WAV_BYTES`. This would let us correctly mimic a real audio upload for the health check.

Here’s roughly what that change in `src/vllm_router/utils.py` might look like:
```python
# In is_model_healthy function in utils.py
def is_model_healthy(url: str, model: str, model_type: str) -> bool:
    model_details = ModelType[model_type]
    try:
        if model_type == "transcription":
            # For transcription, the backend expects multipart/form-data with a file.
            # We'll use our pre-generated silent WAV bytes.
            files = {"file": ("silent.wav", _SILENT_WAV_BYTES, "audio/wav")}
            # The model name also needs to be sent as part of the form data
            data = {"model": model}
            response = requests.post(
                f"{url}{model_details.value}",
                files=files,  # This sends multipart/form-data
                data=data,  # And this includes other form fields
            )
        else:
            # Existing logic for other model types (chat, completion, etc.)
            response = requests.post(
                f"{url}{model_details.value}",
                headers={"Content-Type": "application/json"},
                json={"model": model} | ModelType.get_test_payload(model_type),
            )
        response.raise_for_status()  # Throws an error for 4xx/5xx responses
        # For transcription, we just need to confirm a 200 OK.
        # Other models might need to parse JSON.
        if model_type == "transcription":
            return True
        else:
            response.json()  # Verify it's valid JSON for other model types
            return True
    except requests.exceptions.RequestException as e:
        logger.warning(f"{model_type} model {model} at {url} not healthy: {e}")
        return False
    except json.JSONDecodeError as e:
        logger.error(
            f"Failed to decode JSON from {model_type} model {model} at {url}: {e}"
        )
        return False


# And then ModelType.transcription.get_test_payload() could just return {}
# since it wouldn't be used by this new multipart path:

# In ModelType class in utils.py
class ModelType(enum.Enum):
    chat = "/v1/chat/completions"
    completion = "/v1/completions"
    embeddings = "/v1/embeddings"
    rerank = "/v1/rerank"
    score = "/v1/score"
    transcription = "/v1/audio/transcriptions"

    @staticmethod
    def get_test_payload(model_type: str):
        match ModelType[model_type]:
            case ModelType.chat:
                return {
                    "messages": [
                        {
                            "role": "user",
                            "content": "Hello",
                        }
                    ],
                    "temperature": 0.0,
                    "max_tokens": 3,
                    "max_completion_tokens": 3,
                }
            case ModelType.completion:
                return {"prompt": "Hello"}
            case ModelType.embeddings:
                return {"input": "Hello"}
            case ModelType.rerank:
                return {"query": "Hello", "documents": ["Test"]}
            case ModelType.score:
                return {
                    "encoding_format": "float",
                    "text_1": "Test",
                    "test_2": "Test2",
                }
            case ModelType.transcription:
                # This payload is for the JSON part of the request.
                # The file bytes will be handled separately via the 'files='
                # parameter in is_model_healthy for transcription health checks.
                return {}  # Changed back to an empty dictionary

    @staticmethod
    def get_all_fields():
        return [model_type.name for model_type in ModelType]
```
What do you think about this approach? I believe it should get the health checks reliably working for transcription models!
Thanks for asking. This looks like a cool approach. We will have to think of something more general, once we adopt other model types with similar requirements.
Could you please take a little time to fix the GitHub check issue?
```
router = get_routing_logic()

# pick one using the router's configured logic (roundrobin, least-loaded, etc.)
chosen_url = router.route_request(
```
Would it be better to move the routing logic (getting the endpoints, routing the request, and proxying it) into the request service, to keep the main_router concise?
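As one possible shape for that split, here is a hedged sketch: the helper name, its signature, and the injected routing callback are all illustrative, though the PR did later move the endpoint logic into `request.py`. `main_router.py` would then only declare the route and delegate to this function.

```python
from typing import Callable

import aiohttp
from fastapi import UploadFile


# Hypothetical helper living in request.py; names are illustrative.
async def route_transcription_request(
    file: UploadFile,
    form_fields: dict[str, str],
    endpoints: list[str],
    pick_endpoint: Callable[[list[str]], str],
) -> dict:
    """Pick a backend via the injected routing policy, then proxy the upload."""
    chosen_url = pick_endpoint(endpoints)  # e.g. roundrobin or least-loaded

    # Rebuild the multipart/form-data body for the backend request.
    form = aiohttp.FormData()
    form.add_field(
        "file",
        await file.read(),
        filename=file.filename,
        content_type=file.content_type,
    )
    for key, value in form_fields.items():
        form.add_field(key, value)

    async with aiohttp.ClientSession() as session:
        async with session.post(
            f"{chosen_url}/v1/audio/transcriptions", data=form
        ) as resp:
            resp.raise_for_status()
            return await resp.json()
```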
```
    response_format: str | None = Form("json"),
    temperature: float | None = Form(None),
    language: str = Form("en"),
):
```
Would it be better to align the parameters with the other endpoints? And perhaps adding a request ID would also be a good one.
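A hedged sketch of what an aligned signature with a request ID might look like (the parameter set is taken from this PR's description; the header name and UUID fallback are assumptions, not necessarily what the other endpoints use):

```python
import uuid

from fastapi import APIRouter, File, Form, Request, UploadFile

router = APIRouter()


@router.post("/v1/audio/transcriptions")
async def transcription(
    request: Request,
    file: UploadFile = File(...),
    model: str = Form(...),
    prompt: str | None = Form(None),
    response_format: str | None = Form("json"),
    temperature: float | None = Form(None),
    language: str = Form("en"),
):
    # Reuse an upstream request ID if the client sent one, else mint our own,
    # so transcription requests can be traced like the other endpoints.
    request_id = request.headers.get("x-request-id", str(uuid.uuid4()))
    ...  # route, proxy, and attach request_id to logs/response headers
```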
Force-pushed from a11285c to 1cbca9e
Force-pushed from 1cbca9e to d7cc2a3
Hi everyone! Just wanted to give a quick update regarding the commit history on this PR. You might notice a sudden change in the number of commits. This was a deliberate action on my part: I performed an interactive rebase to ensure all my commits are properly signed off (for DCO compliance) and to clean up the branch's history so it only contains the relevant changes for this feature. The actual code changes introduced by this PR remain the same as previously discussed, but the commit history is now much cleaner and compliant. Apologies if this looked like a sudden flurry of activity! Just wanted to ensure everything was in order before final review. Thanks for your understanding!
Hi @max-wittig,
The branch should be ready for another workflow run when you have a moment. I'll keep an eye on the results and am ready for any further feedback. Thanks!
Hi @davidgao7, starting from this MR: #589, httpx was removed, so I'd advise not to reintroduce it.
Replaced httpx with aiohttp for better asynchronous performance and resource utilization. Fixed JSON syntax error in error response handling. Signed-off-by: David Gao <[email protected]>
Force-pushed from d87546d to 12fcb2a
Small comment, but generally I like it
Hi, can you fix the pre-commit issue? Also a suggestion: it would be better to install the git pre-commit hook and also run pre-commit manually before committing.
@YuhanLiu11 Could you take a look at this? This has been open for quite a while now, and it would provide a big benefit for people.
Yeah, I just started running the CI tests. Let's see if they pass.
Thanks for reviewing! All CI checks are now passing. Please let me know if there's anything else I can address.
LGTM! Thanks for fixing all the comments!
Thank you all for this journey! We will test it!
[feat]: add transcription API endpoint using OpenAI Whisper-small (vllm-project#469)

* [feat]: add transcription API endpoint using OpenAI Whisper-small
* remove the whisper payload response log
* [docs]: add tutorial for transcription v1 api
* [chore] align example router running script with main (new script will be mentioned in `tutorials/17-whisper-api-transcription.md`)
* omit model field since backend already knows which model to run
* generate a silent audio file if no audio file appears
* put wav creation at the module level to prevent being recreated every time
* [Test] test frequency of silent audio creation
* send multipart/form-data for transcription model's health check
* fix pre-commit issue
* move the implementation for the `/v1/audio/transcriptions` endpoint from `main_router.py` into `request.py`, aligning the architectural pattern
* add timeout to ensure health check will not hang indefinitely if a backend model becomes unresponsive
* add boolean model health check return for non-transcription model
* remove redundant warning log since handled in outer `StaticServiceDiscovery.get_unhealthy_endpoint_hashes`
* remove redundant JSONDecodeError catch and downgrade RequestException log to debug, aligning with service discovery's warning
* Chore: apply auto-formatting and linting fixes via pre-commit
* refactor: update more meaningful comments for silent wav bytes generation
* refactor: keep the comment to explain the purpose of generating a silent WAV byte
* fix(tests): improve mock in model health check test (the mock for `requests.post` in `test_is_model_healthy` did not correctly simulate an `HTTPError` on a non-200 response; this change configures the mock's `raise_for_status` method to raise the appropriate exception, ensuring the test now accurately validates the function's error handling logic)
* Chore: apply auto-formatting and linting fixes via pre-commit
* chore: remove unused var `in_router_time`
* fix(deps): add httpx as an explicit dependency (the CI/CD workflow was failing with a `ModuleNotFoundError` because `httpx` was not an explicit dependency in `pyproject.toml` and was not being installed in the clean Docker environment)
* chore: dependencies order changes after running pre-commit
* refactor: migration from httpx to aiohttp for improved concurrency (replaced httpx with aiohttp for better asynchronous performance and resource utilization; fixed JSON syntax error in error response handling)
* chore: remove wrong tutorial file
* chore: apply pre-commit
* chore: use debug log print
* chore: change to more specific exception handling for aiohttp

Signed-off-by: David Gao <[email protected]>
Co-authored-by: Yuhan Liu <[email protected]>
Signed-off-by: Ifta Khairul Alam Adil <[email protected]>
[Feat] Add transcription API endpoint using OpenAI Whisper-small

This PR implements a new `/v1/audio/transcriptions` route in the vllm-router so you can upload audio files directly through the router and have them forwarded to a Whisper transcription backend. It includes:

- A new `/v1/audio/transcriptions` route accepting:
  - `file` (UploadFile)
  - `model` (e.g. `openai/whisper-small`)
  - `prompt` (optional)
  - `response_format` (`json`, `text`, `srt`, `verbose_json`, or `vtt`)
  - `temperature` (optional)
  - `language` (default `"en"`)
- Forwarding of the `multipart/form-data` payload to the backend
- An update to `run-router.sh` to spin up a static-mode router pointing at a local Whisper server

Fixes #410
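As a quick usage illustration (the router address and sample file are assumptions for this example, not part of the PR):

```python
import requests

# Upload a local audio file through the router and print the transcription.
with open("sample.wav", "rb") as audio:
    response = requests.post(
        "http://localhost:8000/v1/audio/transcriptions",  # assumed router address
        files={"file": ("sample.wav", audio, "audio/wav")},
        data={
            "model": "openai/whisper-small",
            "response_format": "json",
            "language": "en",
        },
    )
print(response.json())
```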
Testing

- Local / RunPod (static mode)
- Kubernetes (production mode): switch to `--service-discovery k8s` (or dynamic JSON config) so real vLLM pods are picked up automatically instead of the static backend.

Checklist

- The code passes `pre-commit run --all-files`.
- All commits are signed off with `-s`.
- The PR title is prefixed with `[Feat]`.

Detailed Checklist (Click to Expand)
Before submitting, please ensure the PR meets the following:

PR Title and Classification
- The title is prefixed with `[Feat]` since this introduces a new feature.

Code Quality
- The code passes pre-commit (`pre-commit run --all-files`).

DCO & Signed-off
- Every commit carries a `Signed-off-by:` trailer agreeing to the DCO; use `git commit -s` for automatic sign-off.