Add return_token_ids parameter to OpenAI API endpoints #22587
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only the fastcheck subset runs. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Code Review
This pull request introduces the return_token_ids_alongside parameter to the OpenAI-compatible endpoints for chat and text completions. This is a well-motivated feature, particularly for agent-based reinforcement learning scenarios where having direct access to token IDs for both prompts and responses is essential. The implementation correctly adds the new parameter to the request models and populates the corresponding token ID fields in the response models. My main feedback is the absence of tests. While the changes appear correct, adding comprehensive tests is necessary to validate the new functionality and ensure long-term maintainability.
vllm/entrypoints/openai/protocol.py
Outdated
return_token_ids_alongside: Optional[bool] = Field(
    default=False,
    description=(
        "If specified, the result will include both prompt and response "
        "token ids alongside the generated text. "
        "This is useful for debugging or when you "
        "need to map generated text back to input tokens."
    )
)
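For context, the response side of the change (per the commit notes below) mirrors this with optional fields that stay unset unless the flag is enabled. A minimal sketch of what that could look like; the exact models and field placement are assumptions, not copied from the diff:

```python
from typing import Optional

from pydantic import BaseModel


class ChatCompletionResponseChoice(BaseModel):
    index: int
    # ...existing fields (message, finish_reason, ...) omitted...
    # Token IDs of the generated text; populated only when the request
    # enables the flag, otherwise left as None so the payload is unchanged.
    token_ids: Optional[list[int]] = None


class ChatCompletionResponse(BaseModel):
    # ...existing fields (id, model, choices, usage, ...) omitted...
    # Token IDs of the prompt, gated on the same flag.
    prompt_token_ids: Optional[list[int]] = None
```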
This pull request introduces a valuable feature for agent-based scenarios. However, it currently lacks tests. Adding unit and integration tests is crucial to ensure the new return_token_ids_alongside parameter works as expected across all affected endpoints (/v1/chat/completions and /v1/completions) and to prevent future regressions. Please add tests that cover both streaming and non-streaming responses, and verify that the token IDs for both prompts and responses are correctly returned when the flag is enabled, and not returned when disabled.
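For illustration, a test along the requested lines might look like the following sketch. It uses the parameter's eventual name return_token_ids, assumes a locally running vLLM OpenAI-compatible server, and the base URL, model, and response field names are placeholders rather than the tests that were eventually added:

```python
import openai
import pytest

# Placeholders: the real tests would use whatever server fixture the
# vLLM test harness spins up.
BASE_URL = "http://localhost:8000/v1"
MODEL = "Qwen/Qwen2.5-0.5B-Instruct"


@pytest.mark.parametrize("enabled", [True, False])
def test_return_token_ids_flag(enabled: bool):
    client = openai.OpenAI(base_url=BASE_URL, api_key="EMPTY")
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": "Hello"}],
        max_tokens=8,
        extra_body={"return_token_ids": enabled},  # vLLM-specific extension
    )
    choice = resp.choices[0]
    if enabled:
        # Both prompt and response token IDs should be present.
        assert getattr(resp, "prompt_token_ids", None) is not None
        assert getattr(choice, "token_ids", None) is not None
    else:
        # And absent (or null) when the flag is off.
        assert getattr(choice, "token_ids", None) is None
```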
- Add optional return_token_ids_alongside parameter to ChatCompletionRequest and CompletionRequest
- Include token_ids and prompt_token_ids fields in response models when requested
- Implement conditional logic in serving endpoints to return token IDs alongside generated text
- Useful for debugging and agent scenarios where token-level tracing is needed

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <[email protected]>
Signed-off-by: Yuge Zhang <[email protected]>
Signed-off-by: Yuge Zhang <[email protected]>
Force-pushed from 2954f14 to 48dd2f4
Split long comment onto multiple lines for better readability. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]> Signed-off-by: Yuge Zhang <[email protected]>
Improve the formatting of conditional token_ids and prompt_token_ids assignments to be more concise and readable. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]> Signed-off-by: Yuge Zhang <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @njhill
The idea makes sense to me, since we also support other non-OpenAI-API-compatible features like beam search.
The key concern here is whether it adds any overhead when people don't request the token IDs output.
In addition, please add some tests to make sure the behavior is tested.
Also cc @hmellor: do we have any centralized doc to keep track of these non-OpenAI-compatible behaviors?
Not a doc specifically for this, but in https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html each API has separate sections for normal and "extra" params. Although, looking at the src, this actually excludes the OpenAI arguments completely...
@youkaichao Thanks for the review. I don't think it adds any overhead (if you are talking about machine overhead rather than mental overhead), because the variables are already there and I'm just returning them when the flag is set. I'll add tests. It will probably take me some time to set up the test env.
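To make the no-overhead argument concrete, honoring the flag is just a conditional pass-through of values the engine already produced. A hypothetical helper, not vLLM code, using the parameter's eventual name return_token_ids:

```python
def maybe_token_ids(request, final_res):
    """Hypothetical helper: return (prompt_token_ids, output_token_ids) or (None, None).

    The IDs already live on the engine's RequestOutput (final_res), so this
    is a pure pass-through; nothing extra is computed when the flag is off.
    """
    if not getattr(request, "return_token_ids", False):
        return None, None
    return list(final_res.prompt_token_ids), list(final_res.outputs[0].token_ids)
```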
I think this is reasonable/useful. I don't like the parameter name return_token_ids_alongside though. Couple of questions:
Totally agree that we need a simple, raw, token-in-token-out endpoint!
There is a
It adds one more complexity to the API, and I can't see why that's necessary. If that comes as a feature request in the future, we can make
This will simply break all existing agent code and frameworks based on the OpenAI API endpoint. We need to perform rollouts on the OpenAI endpoint while tracing the token IDs in the telemetry for training. If someone is not afraid of refactoring code, they can do it, but I guess that's not part of this PR.
Yes, I guessed that was the reason, but
Sounds reasonable.
Right, I wasn't suggesting this would replace the OpenAI API; it would just be a simpler alternative. And I wasn't suggesting it should be tied to this PR!
Signed-off-by: Yuge Zhang <[email protected]>
Signed-off-by: Yuge Zhang <[email protected]>
@njhill @youkaichao @hmellor The test work is done. I've brought in support for streaming=True. It's a bit tricky. Please help review.
I can't see the full logs of the fastcheck here: https://buildkite.com/vllm/fastcheck/builds/34977/steps/canvas?jid=01989c1a-a53e-4511-b53f-2f4dfb61d9ba Is it related to the changes I've made?
Can you merge from main? It should resolve the CI failure.
…de-feature Signed-off-by: Yuge Zhang <[email protected]>
I think the newly added test went well. Related logs:
I failed to find the failed tests, though. XFAIL doesn't matter, I guess? I got some SUBFAIL like this:
I think it's related to the API schema change? How do I properly update the schema (i.e., openapi.json)? And why do APIs like rerank fail? I didn't touch them at all.
Looks like a connection error, let me retry.
No. Still no luck. The two errors are:
I'm still getting the 16 connection failures.
The failing test about
@DarkLight1337 Thank you. If there is any more action that needs to be done on my side, please let me know!
That issue should be fixed in the latest main, so you can try merging from main again.
…de-feature Signed-off-by: Yuge Zhang <[email protected]>
Still:
A lot of connection errors.
Can you run those tests locally and see if they fail as well?
# has_echoed[i] is reused here to indicate whether
# we have already returned the prompt token IDs.
if not has_echoed[i]:
    prompt_token_ids_to_return = prompt_token_ids
    has_echoed[i] = True
So we always return the prompt_token_ids if return_token_ids is set, even if echo isn't set? Maybe it would be better to only return them if echo is True?
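For what it's worth, the suggestion above would amount to one extra condition in that gate. A hypothetical, simplified helper illustrating it (not code from this PR):

```python
from typing import Optional


def prompt_ids_to_return(request, prompt_token_ids: list[int],
                         already_sent: bool) -> Optional[list[int]]:
    # Hypothetical variant of the gate above: surface the prompt token IDs
    # only on the first chunk, and only when echo is requested in addition
    # to return_token_ids.
    if (getattr(request, "return_token_ids", False)
            and getattr(request, "echo", False)
            and not already_sent):
        return prompt_token_ids
    return None
```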
Essential Elements of an Effective PR Description Checklist
- (Optional) Documentation update, e.g. supported_models.md and examples for a new model.

Purpose
Getting the token IDs (not logprobs) from vLLM inference responses is very important for agent RL-training scenarios, especially when the agent loop relies on the vLLM OpenAI endpoint as a fast server to perform rollouts and collect trajectories like [(prompt_token_ids, response_token_ids, reward), ...]. The agents need the raw prompts and responses as strings. They also need to track the under-the-hood tokens, so that the RL algorithms can use them for optimization.

When I authored agent-lightning in the first place, it was very hard to get the prompt and response token IDs from vLLM, despite the fact that they are set as local variables in the OpenAI-compatible server implementation. This led me to write a monkey patch, which essentially modified OpenAIServingChat.chat_completion_full_generator to capture the input and output token IDs. The code is here and it is not long: https://github.com/microsoft/agent-lightning/blob/24d590f4ea135bd88d8d3c5299526b7d5866b100/agentlightning/instrumentation/vllm.py
Recently, I found that vLLM has added support for prompt_logprobs and return_tokens_as_token_ids as additional parameters to the chat completion API. Though I don't need logprobs, I thought it would be wonderful to get the token IDs from the logprobs. But it turned out differently from what I expected. I tested with Qwen2.5-0.5B-Instruct, and prompt_logprobs gives me different results from prompt_token_ids:

For responses, the returned "token:12345" entries look okay with return_tokens_as_token_ids on. It's a little awkward, though, to parse the integer out of a string like "token:xxxx".

So, this PR adds the token IDs alongside the prompts and responses.
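With the flag from this PR, the same data can be requested through the standard OpenAI client. A usage sketch, assuming the flag is passed via extra_body (how vLLM extensions are usually forwarded) and that the IDs come back as prompt_token_ids on the response and token_ids on each choice:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    messages=[{"role": "user", "content": "Say hi"}],
    max_tokens=16,
    # vLLM-specific parameter added by this PR; unknown fields supplied
    # through extra_body are forwarded to the server as-is.
    extra_body={"return_token_ids": True},
)

choice = resp.choices[0]
trajectory = (
    getattr(resp, "prompt_token_ids", None),   # token IDs of the prompt
    getattr(choice, "token_ids", None),        # token IDs of the response
    0.0,                                       # reward, filled in by the RL loop
)
print(choice.message.content, trajectory)
```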
Update: renamed to return_token_ids.
Test Plan
Unit tests added.
Test Result
Passed locally.
(Optional) Documentation Update
In code descriptions.