Add return_token_ids parameter to OpenAI API endpoints #22587

Open

ultmaster wants to merge 8 commits into main from add-token-ids-alongside-feature

Conversation


@ultmaster ultmaster commented Aug 10, 2025

Essential Elements of an Effective PR Description Checklist

  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Purpose

Getting the token IDs (not logprobs) from vLLM inference responses is very important for agent RL-training scenarios, especially when the agent loop relies on the vLLM OpenAI endpoint as a fast server to perform rollouts and collect trajectories like [(prompt_token_ids, response_token_ids, reward), ...]. The agents need the raw prompts and responses as strings, but they also need to track the under-the-hood tokens so that the RL algorithms can use them for optimization. A small sketch of that per-step record is shown below.
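To make the shape of that data concrete, here is a minimal, hypothetical sketch of the record an agent loop might accumulate per rollout step (the names are illustrative and not part of this PR):

from dataclasses import dataclass

@dataclass
class RolloutStep:
    prompt_token_ids: list[int]     # tokens the model actually consumed
    response_token_ids: list[int]   # tokens the model actually generated
    reward: float                   # assigned by the environment or a judge

# trajectory == [(prompt_token_ids, response_token_ids, reward), ...]
trajectory: list[RolloutStep] = []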

When I first authored agent-lightning, it was very hard to get the prompt and response token IDs out of vLLM, even though they are already sitting there as local variables in the OpenAI-compatible server implementation. This led me to write a monkey patch that essentially modifies OpenAIServingChat.chat_completion_full_generator to capture the input and output token IDs. The code is here and is not long:

https://github.com/microsoft/agent-lightning/blob/24d590f4ea135bd88d8d3c5299526b7d5866b100/agentlightning/instrumentation/vllm.py
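For illustration only, a stripped-down sketch of that patching idea could look like the following. This is not the actual agent-lightning code; it assumes (as in recent vLLM versions) that the method's first arguments are the request, the result generator, and the request id, and it only relies on RequestOutput exposing prompt_token_ids and per-output token_ids.

# Hypothetical sketch of the monkey-patch idea, not the real implementation.
from vllm.entrypoints.openai.serving_chat import OpenAIServingChat

_original = OpenAIServingChat.chat_completion_full_generator
captured: dict[str, tuple[list[int], list[list[int]]]] = {}  # request_id -> ids

async def _patched(self, request, result_generator, request_id, *args, **kwargs):
    async def tee():
        async for res in result_generator:
            # RequestOutput carries the prompt token IDs and, per output,
            # the generated token IDs; stash them in a side channel.
            captured[request_id] = (
                list(res.prompt_token_ids or []),
                [list(out.token_ids) for out in res.outputs],
            )
            yield res

    return await _original(self, request, tee(), request_id, *args, **kwargs)

OpenAIServingChat.chat_completion_full_generator = _patched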

Recently, I found that vLLM now supports prompt_logprobs and return_tokens_as_token_ids as additional parameters to the chat completion API. Though I don't need logprobs, I thought it would be a convenient way to recover the token IDs from the logprobs. It turned out differently from what I expected. Testing with Qwen2.5-0.5B-Instruct, prompt_logprobs gives me different results from prompt_token_ids:

prompt token ids: [151644, 8948, 198, 2610, 525, …
prompt log probs:
[None, // the first token is missing
{8948: Logprob(logprob=-12.825027465820312, rank=12784, decoded_token='system'), 72030: Logprob(logprob=-0.9812774658203125, rank=1, decoded_token='/API')}, // ??? why two tokens here?
{198: Logprob(logprob=-1.8129281997680664, rank=1, decoded_token='\n')},
{2610: Logprob(logprob=-7.753974914550781, rank=273, decoded_token='You'), 2: Logprob(logprob=-2.9414749145507812, rank=1, decoded_token='#')}, // two tokens here too
{525: Logprob(logprob=-0.28957295417785645, rank=1, decoded_token=' are')}, …

For responses, the returned "token:12345" strings look okay with return_tokens_as_token_ids on. It is a little awkward, though, to have to parse the integer back out of a string like "token:xxxx".
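For reference, a small sketch of that parsing step (assuming the tokens come back as strings with the ID after a colon; the exact prefix may differ between vLLM versions):

def parse_token_id(token_str: str) -> int:
    # "token:12345" (or "token_id:12345" in some versions) -> 12345
    return int(token_str.rsplit(":", 1)[-1])

token_ids = [parse_token_id(t) for t in ["token:151644", "token:8948"]]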

So this PR adds the token IDs alongside the prompt and response text.

Update: the parameter has been renamed to return_token_ids.
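A minimal usage sketch, under the assumption (taken from the commit messages in this PR) that the flag is accepted as an extra request field and that the response carries prompt_token_ids plus per-choice token_ids:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
    extra_body={"return_token_ids": True},  # vLLM-specific extension from this PR
)

# Field locations assumed from the commit messages (prompt_token_ids on the
# response, token_ids on each choice); not part of the official OpenAI spec.
print(getattr(response, "prompt_token_ids", None))
print(getattr(response.choices[0], "token_ids", None))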

Test Plan

Unit tests added.

Test Result

Passed locally.

(Optional) Documentation Update

In code descriptions.

@ultmaster ultmaster requested a review from aarnphm as a code owner August 10, 2025 08:39

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the frontend label Aug 10, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces the return_token_ids_alongside parameter to the OpenAI-compatible endpoints for chat and text completions. This is a well-motivated feature, particularly for agent-based reinforcement learning scenarios where having direct access to token IDs for both prompts and responses is essential. The implementation correctly adds the new parameter to the request models and populates the corresponding token ID fields in the response models. My main feedback is the absence of tests. While the changes appear correct, adding comprehensive tests is necessary to validate the new functionality and ensure long-term maintainability.

Comment on lines 570 to 578
    return_token_ids_alongside: Optional[bool] = Field(
        default=False,
        description=(
            "If specified, the result will include both prompt and response "
            "token ids alongside the generated text. "
            "This is useful for debugging or when you "
            "need to map generated text back to input tokens."
        )
    )
Contributor


high

This pull request introduces a valuable feature for agent-based scenarios. However, it currently lacks tests. Adding unit and integration tests is crucial to ensure the new return_token_ids_alongside parameter works as expected across all affected endpoints (/v1/chat/completions and /v1/completions) and to prevent future regressions. Please add tests that cover both streaming and non-streaming responses, and verify that the token IDs for both prompts and responses are correctly returned when the flag is enabled, and not returned when disabled.
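Purely as an illustration of what such a test could assert (this is a hedged sketch against an already-running server, not the test file that was eventually added to this PR):

import openai
import pytest

@pytest.mark.asyncio
async def test_return_token_ids_flag():
    # Assumes a vLLM OpenAI-compatible server is already serving a model locally.
    client = openai.AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    models = await client.models.list()
    model = models.data[0].id

    with_ids = await client.completions.create(
        model=model, prompt="Hello", max_tokens=5,
        extra_body={"return_token_ids": True},
    )
    without_ids = await client.completions.create(
        model=model, prompt="Hello", max_tokens=5,
    )

    # Field name assumed from this PR: present when enabled, absent otherwise.
    assert getattr(with_ids.choices[0], "token_ids", None)
    assert getattr(without_ids.choices[0], "token_ids", None) is None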

ultmaster and others added 2 commits August 10, 2025 16:41
- Add optional return_token_ids_alongside parameter to ChatCompletionRequest and CompletionRequest
- Include token_ids and prompt_token_ids fields in response models when requested
- Implement conditional logic in serving endpoints to return token IDs alongside generated text
- Useful for debugging and agent scenarios where token-level tracing is needed

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Signed-off-by: Yuge Zhang <[email protected]>
Signed-off-by: Yuge Zhang <[email protected]>
@CharlyWNot

unsubscribe

@ultmaster ultmaster force-pushed the add-token-ids-alongside-feature branch from 2954f14 to 48dd2f4 on August 10, 2025 08:42
ultmaster and others added 2 commits August 10, 2025 16:51
Split long comment onto multiple lines for better readability.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Signed-off-by: Yuge Zhang <[email protected]>
Improve the formatting of conditional token_ids and prompt_token_ids
assignments to be more concise and readable.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Signed-off-by: Yuge Zhang <[email protected]>
Member

@youkaichao youkaichao left a comment


cc @njhill

The idea makes sense to me, since we also support other non-OpenAI-API-compatible features like beam search.

The key concern here is whether it adds any overhead when people don't request the token IDs output.

In addition, please add some tests to make sure the behavior is covered.

@youkaichao
Member

Also cc @hmellor: do we have any centralized doc to keep track of these non-OpenAI-compatible behaviors?

@hmellor
Member

hmellor commented Aug 11, 2025

Not a doc specifically for this, but in https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html each API has separate sections for normal and "extra" params.

Although, looking at the source, this actually excludes the OpenAI arguments completely...

@ultmaster
Author

cc @njhill

The idea makes sense to me, since we also support other non-OpenAI-API-compatible features like beam search.

The key concern here is whether it adds any overhead when people don't request the token IDs output.

In addition, please add some tests to make sure the behavior is covered.

@youkaichao Thanks for the review.

I don't think it adds any overhead (if you are talking about machine overhead rather than mental overhead), because the variables are already there and I'm just returning them when the flag is set.

I'll add tests. Probably will take me some time to set up the test env.

@njhill
Member

njhill commented Aug 11, 2025

I think this is reasonable/useful.

I don't like the parameter name return_token_ids_alongside though, perhaps it can be just return_token_ids?

Couple of questions:

  • Should we have a way to request only prompt token ids and/or output token ids?
  • WDYT having an additional simpler "raw" non-OpenAI API endpoint?

@KuntaiDu
Collaborator

I think this is reasonable/useful.

I don't like the parameter name return_token_ids_alongside though, perhaps it can be just return_token_ids?

Couple of questions:

  • Should we have a way to request only prompt token ids and/or output token ids?
  • WDYT having an additional simpler "raw" non-OpenAI API endpoint?

Totally agree that we need a simple, raw, token-in-token-out endpoint!

@ultmaster
Author

@njhill

I don't like the parameter name return_token_ids_alongside though, perhaps it can be just return_token_ids?

There is already a return_tokens_as_token_ids (which relates to the logprobs output), so I used return_token_ids_alongside to distinguish the two. I have no personal preference though, so as you wish: return_token_ids.

Should we have a way to request only prompt token ids and/or output token ids?

It adds more complexity to the API, and I can't see why that's necessary. If that comes up as a feature request in the future, we can make return_token_ids a Union[bool, Literal["prompt", "response"]] to further control the behavior; see the sketch after this paragraph.
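If that extension were ever needed, the field could hypothetically be declared along these lines (a sketch only, not code from this PR):

from typing import Literal, Optional, Union
from pydantic import BaseModel, Field

class ChatCompletionRequestSketch(BaseModel):
    # Hypothetical extension of the flag discussed above.
    return_token_ids: Optional[Union[bool, Literal["prompt", "response"]]] = Field(
        default=False,
        description="True/False, or restrict the returned IDs to 'prompt' or 'response'.",
    )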

WDYT having an additional simpler "raw" non-OpenAI API endpoint?

That would simply break all existing agent code and frameworks built on the OpenAI API endpoint. We need to perform rollouts against the OpenAI endpoint while tracing the token IDs in the telemetry for training. If someone isn't afraid of refactoring their code, they can do it, but I guess that's not part of this PR.

@njhill
Member

njhill commented Aug 12, 2025

I don't like the parameter name return_token_ids_alongside though, perhaps it can be just return_token_ids?

There is already a return_tokens_as_token_ids (which relates to the logprobs output), so I used return_token_ids_alongside to distinguish the two. I have no personal preference though, so as you wish: return_token_ids.

Yes I guessed that was the reason but return_token_ids is different enough imo!

Should we have a way to request only prompt token ids and/or output token ids?

It adds more complexity to the API, and I can't see why that's necessary. If that comes up as a feature request in the future, we can make return_token_ids a Union[bool, Literal["prompt", "response"]] to further control the behavior.

Sounds reasonable

WDYT having an additional simpler "raw" non-OpenAI API endpoint?

That would simply break all existing agent code and frameworks built on the OpenAI API endpoint. We need to perform rollouts against the OpenAI endpoint while tracing the token IDs in the telemetry for training. If someone isn't afraid of refactoring their code, they can do it, but I guess that's not part of this PR.

Right, I wasn't suggesting this would replace the OpenAI API; it would just be a simpler alternative. And I wasn't suggesting it should be tied to this PR!

@ultmaster
Author

@njhill @youkaichao @hmellor The test work is done.

I've brought in support for streaming=True. It's a bit tricky. Please help review.

@ultmaster ultmaster changed the title from "Add return_token_ids_alongside parameter to OpenAI API endpoints" to "Add return_token_ids parameter to OpenAI API endpoints" on Aug 12, 2025
@ultmaster
Author

I can't see the full logs of the fastcheck here: https://buildkite.com/vllm/fastcheck/builds/34977/steps/canvas?jid=01989c1a-a53e-4511-b53f-2f4dfb61d9ba

Is it related to the changes I've made?

@DarkLight1337
Member

Can you merge from main? It should resolve the CI failure

@ultmaster
Author

ultmaster commented Aug 12, 2025

I think the newly added tests went well.

Related logs:
[2025-08-12T08:42:10Z] entrypoints/openai/test_return_token_ids.py::test_basic_completion_with_emoji INFO 08-12 01:42:10 [__init__.py:707] Resolved architecture: Qwen2ForCausalLM
[2025-08-12T08:42:10Z] INFO 08-12 01:42:10 [__init__.py:1735] Using max model len 2048
[2025-08-12T08:42:10Z] INFO 08-12 01:42:10 [weight_utils.py:296] Using model weights format ['*.safetensors']
[2025-08-12T08:42:11Z] INFO 08-12 01:42:11 [weight_utils.py:349] No model.safetensors.index.json found in remote.
[2025-08-12T08:42:15Z] DEBUG 08-12 01:42:15 [__init__.py:28] No plugins for group vllm.platform_plugins found.
[2025-08-12T08:42:15Z] DEBUG 08-12 01:42:15 [__init__.py:34] Checking if TPU platform is available.
[2025-08-12T08:42:15Z] DEBUG 08-12 01:42:15 [__init__.py:52] TPU platform is not available because: No module named 'libtpu'
[2025-08-12T08:42:15Z] DEBUG 08-12 01:42:15 [__init__.py:58] Checking if CUDA platform is available.
[2025-08-12T08:42:15Z] DEBUG 08-12 01:42:15 [__init__.py:78] Confirmed CUDA platform is available.
[2025-08-12T08:42:15Z] DEBUG 08-12 01:42:15 [__init__.py:106] Checking if ROCm platform is available.
[2025-08-12T08:42:15Z] DEBUG 08-12 01:42:15 [__init__.py:120] ROCm platform is not available because: No module named 'amdsmi'
[2025-08-12T08:42:15Z] DEBUG 08-12 01:42:15 [__init__.py:127] Checking if XPU platform is available.
[2025-08-12T08:42:15Z] DEBUG 08-12 01:42:15 [__init__.py:146] XPU platform is not available because: No module named 'intel_extension_for_pytorch'
[2025-08-12T08:42:15Z] DEBUG 08-12 01:42:15 [__init__.py:153] Checking if CPU platform is available.
[2025-08-12T08:42:15Z] DEBUG 08-12 01:42:15 [__init__.py:175] Checking if Neuron platform is available.
[2025-08-12T08:42:15Z] DEBUG 08-12 01:42:15 [__init__.py:58] Checking if CUDA platform is available.
[2025-08-12T08:42:15Z] DEBUG 08-12 01:42:15 [__init__.py:78] Confirmed CUDA platform is available.
[2025-08-12T08:42:15Z] INFO 08-12 01:42:15 [__init__.py:241] Automatically detected platform cuda.
[2025-08-12T08:42:17Z] DEBUG 08-12 01:42:17 [__init__.py:36] Available plugins for group vllm.general_plugins:
[2025-08-12T08:42:17Z] DEBUG 08-12 01:42:17 [__init__.py:38] - lora_filesystem_resolver -> vllm.plugins.lora_resolvers.filesystem_resolver:register_filesystem_resolver
[2025-08-12T08:42:17Z] DEBUG 08-12 01:42:17 [__init__.py:41] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
[2025-08-12T08:42:17Z] (APIServer pid=13457) INFO 08-12 01:42:17 [api_server.py:1805] vLLM API server version 0.10.1.dev566+g8c565a836
[2025-08-12T08:42:17Z] (APIServer pid=13457) INFO 08-12 01:42:17 [utils.py:326] non-default args: {'model_tag': 'Qwen/Qwen2.5-1.5B-Instruct', 'port': 43465, 'enable_auto_tool_choice': True, 'tool_call_parser': 'hermes', 'model': 'Qwen/Qwen2.5-1.5B-Instruct', 'seed': 0, 'max_model_len': 2048, 'enforce_eager': True, 'gpu_memory_utilization': 0.7, 'max_num_seqs': 128}
[2025-08-12T08:42:24Z] (APIServer pid=13457) INFO 08-12 01:42:24 [__init__.py:707] Resolved architecture: Qwen2ForCausalLM
[2025-08-12T08:42:24Z] (APIServer pid=13457) INFO 08-12 01:42:24 [__init__.py:1735] Using max model len 2048
[2025-08-12T08:42:24Z] (APIServer pid=13457) DEBUG 08-12 01:42:24 [arg_utils.py:1714] Setting max_num_batched_tokens to 2048 for OPENAI_API_SERVER usage context.
[2025-08-12T08:42:24Z] (APIServer pid=13457) INFO 08-12 01:42:24 [__init__.py:2035] Chunked prefill is enabled with max_num_batched_tokens=2048.
[2025-08-12T08:42:29Z] DEBUG 08-12 01:42:29 [__init__.py:28] No plugins for group vllm.platform_plugins found.
[2025-08-12T08:42:29Z] DEBUG 08-12 01:42:29 [__init__.py:34] Checking if TPU platform is available.
[2025-08-12T08:42:29Z] DEBUG 08-12 01:42:29 [__init__.py:52] TPU platform is not available because: No module named 'libtpu'
[2025-08-12T08:42:29Z] DEBUG 08-12 01:42:29 [__init__.py:58] Checking if CUDA platform is available.
[2025-08-12T08:42:29Z] DEBUG 08-12 01:42:29 [__init__.py:78] Confirmed CUDA platform is available.
[2025-08-12T08:42:29Z] DEBUG 08-12 01:42:29 [__init__.py:106] Checking if ROCm platform is available.
[2025-08-12T08:42:29Z] DEBUG 08-12 01:42:29 [__init__.py:120] ROCm platform is not available because: No module named 'amdsmi'
[2025-08-12T08:42:29Z] DEBUG 08-12 01:42:29 [__init__.py:127] Checking if XPU platform is available.
[2025-08-12T08:42:29Z] DEBUG 08-12 01:42:29 [__init__.py:146] XPU platform is not available because: No module named 'intel_extension_for_pytorch'
[2025-08-12T08:42:29Z] DEBUG 08-12 01:42:29 [__init__.py:153] Checking if CPU platform is available.
[2025-08-12T08:42:29Z] DEBUG 08-12 01:42:29 [__init__.py:175] Checking if Neuron platform is available.
[2025-08-12T08:42:29Z] DEBUG 08-12 01:42:29 [__init__.py:58] Checking if CUDA platform is available.
[2025-08-12T08:42:29Z] DEBUG 08-12 01:42:29 [__init__.py:78] Confirmed CUDA platform is available.
[2025-08-12T08:42:29Z] INFO 08-12 01:42:29 [__init__.py:241] Automatically detected platform cuda.
[2025-08-12T08:42:31Z] (EngineCore_0 pid=13564) INFO 08-12 01:42:31 [core.py:619] Waiting for init message from front-end.
[2025-08-12T08:42:31Z] (APIServer pid=13457) DEBUG 08-12 01:42:31 [utils.py:831] HELLO from local core engine process 0.
[2025-08-12T08:42:31Z] (EngineCore_0 pid=13564) DEBUG 08-12 01:42:31 [core.py:627] Received init message: EngineHandshakeMetadata(addresses=EngineZmqAddresses(inputs=['ipc:///tmp/8de505d0-7481-4ca3-a091-c654513aed54'], outputs=['ipc:///tmp/6b25092c-58ca-43f7-8e5e-bd05615f1b0e'], coordinator_input=None, coordinator_output=None, frontend_stats_publish_address=None), parallel_config={'data_parallel_master_ip': '127.0.0.1', 'data_parallel_master_port': 0, 'data_parallel_size': 1})
[2025-08-12T08:42:31Z] (EngineCore_0 pid=13564) DEBUG 08-12 01:42:31 [core.py:464] Has DP Coordinator: False, stats publish address: None
[2025-08-12T08:42:31Z] (EngineCore_0 pid=13564) DEBUG 08-12 01:42:31 [__init__.py:36] Available plugins for group vllm.general_plugins:
[2025-08-12T08:42:31Z] (EngineCore_0 pid=13564) DEBUG 08-12 01:42:31 [__init__.py:38] - lora_filesystem_resolver -> vllm.plugins.lora_resolvers.filesystem_resolver:register_filesystem_resolver
[2025-08-12T08:42:31Z] (EngineCore_0 pid=13564) DEBUG 08-12 01:42:31 [__init__.py:41] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
[2025-08-12T08:42:31Z] (EngineCore_0 pid=13564) INFO 08-12 01:42:31 [core.py:72] Initializing a V1 LLM engine (v0.10.1.dev566+g8c565a836) with config: model='Qwen/Qwen2.5-1.5B-Instruct', speculative_config=None, tokenizer='Qwen/Qwen2.5-1.5B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=Qwen/Qwen2.5-1.5B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":0,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":[],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":0,"cudagraph_capture_sizes":[],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"pass_config":{},"max_capture_size":0,"local_cache_dir":null}
[2025-08-12T08:42:31Z] (EngineCore_0 pid=13564) DEBUG 08-12 01:42:31 [decorators.py:139] Inferred dynamic dimensions for forward method of <class 'vllm.model_executor.models.llama.LlamaModel'>: ['input_ids', 'positions', 'intermediate_tensors', 'inputs_embeds']
[2025-08-12T08:42:31Z] (EngineCore_0 pid=13564) DEBUG 08-12 01:42:31 [decorators.py:139] Inferred dynamic dimensions for forward method of <class 'vllm.model_executor.models.llama_eagle3.LlamaModel'>: ['input_ids', 'positions', 'hidden_states']
[2025-08-12T08:42:31Z] (EngineCore_0 pid=13564) WARNING 08-12 01:42:31 [rocm.py:29] Failed to import from amdsmi with ModuleNotFoundError("No module named 'amdsmi'")
[2025-08-12T08:42:31Z] (EngineCore_0 pid=13564) WARNING 08-12 01:42:31 [rocm.py:40] Failed to import from vllm._rocm_C with ModuleNotFoundError("No module named 'vllm._rocm_C'")
[2025-08-12T08:42:32Z] (EngineCore_0 pid=13564) DEBUG 08-12 01:42:32 [__init__.py:3043] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7fa61203e540>
[2025-08-12T08:42:32Z] (EngineCore_0 pid=13564) DEBUG 08-12 01:42:32 [__init__.py:4120] enabled custom ops: Counter()
[2025-08-12T08:42:32Z] (EngineCore_0 pid=13564) DEBUG 08-12 01:42:32 [__init__.py:4122] disabled custom ops: Counter()
[2025-08-12T08:42:32Z] (EngineCore_0 pid=13564) DEBUG 08-12 01:42:32 [parallel_state.py:976] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://172.16.0.2:33891 backend=nccl
[2025-08-12T08:42:32Z] (EngineCore_0 pid=13564) DEBUG 08-12 01:42:32 [parallel_state.py:1027] Detected 1 nodes in the distributed environment
[2025-08-12T08:42:32Z] (EngineCore_0 pid=13564) INFO 08-12 01:42:32 [parallel_state.py:1134] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
[2025-08-12T08:42:32Z] (EngineCore_0 pid=13564) INFO 08-12 01:42:32 [topk_topp_sampler.py:50] Using FlashInfer for top-p & top-k sampling.
[2025-08-12T08:42:32Z] (EngineCore_0 pid=13564) DEBUG 08-12 01:42:32 [__init__.py:4120] enabled custom ops: Counter()
[2025-08-12T08:42:32Z] (EngineCore_0 pid=13564) DEBUG 08-12 01:42:32 [__init__.py:4122] disabled custom ops: Counter()
[2025-08-12T08:42:32Z] (EngineCore_0 pid=13564) INFO 08-12 01:42:32 [gpu_model_runner.py:1936] Starting to load model Qwen/Qwen2.5-1.5B-Instruct...
[2025-08-12T08:42:32Z] (EngineCore_0 pid=13564) INFO 08-12 01:42:32 [gpu_model_runner.py:1968] Loading model from scratch...
[2025-08-12T08:42:32Z] (EngineCore_0 pid=13564) INFO 08-12 01:42:32 [cuda.py:327] Using Flash Attention backend on V1 engine.
[2025-08-12T08:42:32Z] (EngineCore_0 pid=13564) DEBUG 08-12 01:42:32 [__init__.py:4120] enabled custom ops: Counter({'rms_norm': 57, 'silu_and_mul': 28, 'rotary_embedding': 1})
[2025-08-12T08:42:32Z] (EngineCore_0 pid=13564) DEBUG 08-12 01:42:32 [__init__.py:4122] disabled custom ops: Counter()
[2025-08-12T08:42:32Z] (EngineCore_0 pid=13564) DEBUG 08-12 01:42:32 [base_loader.py:47] Loading weights on cuda ...
[2025-08-12T08:42:33Z] (EngineCore_0 pid=13564) INFO 08-12 01:42:33 [weight_utils.py:296] Using model weights format ['*.safetensors']
[2025-08-12T08:42:33Z] (EngineCore_0 pid=13564) INFO 08-12 01:42:33 [weight_utils.py:349] No model.safetensors.index.json found in remote.
[2025-08-12T08:42:38Z] (EngineCore_0 pid=13564)
Loading safetensors checkpoint shards:   0% 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% 1/1 [00:05<00:00,  5.34s/it]
Loading safetensors checkpoint shards: 100% 1/1 [00:05<00:00,  5.34s/it]
[2025-08-12T08:42:38Z] (EngineCore_0 pid=13564) INFO 08-12 01:42:38 [default_loader.py:262] Loading weights took 5.43 seconds
[2025-08-12T08:42:39Z] (EngineCore_0 pid=13564) INFO 08-12 01:42:39 [gpu_model_runner.py:1985] Model loading took 2.8871 GiB and 5.965132 seconds
[2025-08-12T08:42:39Z] (EngineCore_0 pid=13564) /usr/local/lib/python3.12/dist-packages/torch/utils/cpp_extension.py:2356: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
[2025-08-12T08:42:39Z] (EngineCore_0 pid=13564) If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
[2025-08-12T08:42:39Z] (EngineCore_0 pid=13564)   warnings.warn(
[2025-08-12T08:42:40Z] (EngineCore_0 pid=13564) DEBUG 08-12 01:42:40 [gpu_worker.py:262] Initial free memory: 21.58 GiB; Requested memory: 0.70 (util), 15.43 GiB
[2025-08-12T08:42:40Z] (EngineCore_0 pid=13564) DEBUG 08-12 01:42:40 [gpu_worker.py:269] Free memory after profiling: 18.52 GiB (total), 12.37 GiB (within requested)
[2025-08-12T08:42:40Z] (EngineCore_0 pid=13564) DEBUG 08-12 01:42:40 [gpu_worker.py:275] Memory profiling takes 0.77 seconds. Total non KV cache memory: 3.17GiB; torch peak memory increase: 0.27GiB; non-torch forward increase memory: 0.02GiB; weights memory: 2.89GiB.
[2025-08-12T08:42:40Z] (EngineCore_0 pid=13564) INFO 08-12 01:42:40 [gpu_worker.py:276] Available KV cache memory: 12.26 GiB
[2025-08-12T08:42:40Z] (EngineCore_0 pid=13564) INFO 08-12 01:42:40 [kv_cache_utils.py:829] GPU KV cache size: 459,152 tokens
[2025-08-12T08:42:40Z] (EngineCore_0 pid=13564) INFO 08-12 01:42:40 [kv_cache_utils.py:833] Maximum concurrency for 2,048 tokens per request: 224.20x
[2025-08-12T08:42:40Z] (EngineCore_0 pid=13564) DEBUG 08-12 01:42:40 [__init__.py:4120] enabled custom ops: Counter({'rms_norm': 57, 'silu_and_mul': 28, 'rotary_embedding': 1})
[2025-08-12T08:42:40Z] (EngineCore_0 pid=13564) DEBUG 08-12 01:42:40 [__init__.py:4122] disabled custom ops: Counter()
[2025-08-12T08:42:40Z] (EngineCore_0 pid=13564) INFO 08-12 01:42:40 [core.py:199] init engine (profile, create kv cache, warmup model) took 1.26 seconds
[2025-08-12T08:42:41Z] (APIServer pid=13457) DEBUG 08-12 01:42:41 [utils.py:831] READY from local core engine process 0.
[2025-08-12T08:42:41Z] (EngineCore_0 pid=13564) DEBUG 08-12 01:42:41 [core.py:711] EngineCore waiting for work.
[2025-08-12T08:42:41Z] (APIServer pid=13457) INFO 08-12 01:42:41 [loggers.py:142] Engine 000: vllm cache_config_info with initialization after num_gpu_blocks is: 28697
[2025-08-12T08:42:41Z] (EngineCore_0 pid=13564) DEBUG 08-12 01:42:41 [core.py:711] EngineCore waiting for work.
[2025-08-12T08:42:41Z] (EngineCore_0 pid=13564) DEBUG 08-12 01:42:41 [core.py:711] EngineCore waiting for work.
[2025-08-12T08:42:41Z] (APIServer pid=13457) INFO 08-12 01:42:41 [api_server.py:1611] Supported_tasks: ['generate']
[2025-08-12T08:42:41Z] (APIServer pid=13457) WARNING 08-12 01:42:41 [__init__.py:1610] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
[2025-08-12T08:42:41Z] (APIServer pid=13457) INFO 08-12 01:42:41 [serving_responses.py:120] Using default chat sampling params from model: {'repetition_penalty': 1.1, 'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
[2025-08-12T08:42:41Z] (APIServer pid=13457) INFO 08-12 01:42:41 [serving_responses.py:149] "auto" tool choice has been enabled please note that while the parallel_tool_calls client option is preset for compatibility reasons, it will be ignored.
[2025-08-12T08:42:41Z] (APIServer pid=13457) INFO 08-12 01:42:41 [serving_chat.py:93] "auto" tool choice has been enabled please note that while the parallel_tool_calls client option is preset for compatibility reasons, it will be ignored.
[2025-08-12T08:42:41Z] (APIServer pid=13457) INFO 08-12 01:42:41 [serving_chat.py:133] Using default chat sampling params from model: {'repetition_penalty': 1.1, 'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
[2025-08-12T08:42:41Z] (APIServer pid=13457) INFO 08-12 01:42:41 [serving_completion.py:77] Using default completion sampling params from model: {'repetition_penalty': 1.1, 'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
[2025-08-12T08:42:41Z] (APIServer pid=13457) INFO 08-12 01:42:41 [api_server.py:1880] Starting vLLM API server 0 on http://0.0.0.0:43465
[2025-08-12T08:42:41Z] (APIServer pid=13457) INFO 08-12 01:42:41 [launcher.py:29] Available routes are:
[2025-08-12T08:42:41Z] (APIServer pid=13457) INFO 08-12 01:42:41 [launcher.py:37] Route: /openapi.json, Methods: HEAD, GET
[2025-08-12T08:42:41Z] (APIServer pid=13457) INFO 08-12 01:42:41 [launcher.py:37] Route: /docs, Methods: HEAD, GET
[2025-08-12T08:42:41Z] (APIServer pid=13457) INFO 08-12 01:42:41 [launcher.py:37] Route: /docs/oauth2-redirect, Methods: HEAD, GET
[2025-08-12T08:42:41Z] (APIServer pid=13457) INFO 08-12 01:42:41 [launcher.py:37] Route: /redoc, Methods: HEAD, GET
[2025-08-12T08:42:41Z] (APIServer pid=13457) INFO 08-12 01:42:41 [launcher.py:37] Route: /health, Methods: GET
[2025-08-12T08:42:41Z] (APIServer pid=13457) INFO 08-12 01:42:41 [launcher.py:37] Route: /load, Methods: GET
[2025-08-12T08:42:41Z] (APIServer pid=13457) INFO 08-12 01:42:41 [launcher.py:37] Route: /ping, Methods: POST
[2025-08-12T08:42:41Z] (APIServer pid=13457) INFO 08-12 01:42:41 [launcher.py:37] Route: /ping, Methods: GET
[2025-08-12T08:42:41Z] (APIServer pid=13457) INFO 08-12 01:42:41 [launcher.py:37] Route: /tokenize, Methods: POST
[2025-08-12T08:42:41Z] (APIServer pid=13457) INFO 08-12 01:42:41 [launcher.py:37] Route: /detokenize, Methods: POST
[2025-08-12T08:42:41Z] (APIServer pid=13457) INFO 08-12 01:42:41 [launcher.py:37] Route: /v1/models, Methods: GET
[2025-08-12T08:42:41Z] (APIServer pid=13457) INFO 08-12 01:42:41 [launcher.py:37] Route: /version, Methods: GET
[2025-08-12T08:42:41Z] (APIServer pid=13457) INFO 08-12 01:42:41 [launcher.py:37] Route: /v1/responses, Methods: POST
[2025-08-12T08:42:41Z] (APIServer pid=13457) INFO 08-12 01:42:41 [launcher.py:37] Route: /v1/responses/{response_id}, Methods: GET
[2025-08-12T08:42:41Z] (APIServer pid=13457) INFO 08-12 01:42:41 [launcher.py:37] Route: /v1/responses/{response_id}/cancel, Methods: POST
[2025-08-12T08:42:41Z] (APIServer pid=13457) INFO 08-12 01:42:41 [launcher.py:37] Route: /v1/chat/completions, Methods: POST
[2025-08-12T08:42:41Z] (APIServer pid=13457) INFO 08-12 01:42:41 [launcher.py:37] Route: /v1/completions, Methods: POST
[2025-08-12T08:42:41Z] (APIServer pid=13457) INFO 08-12 01:42:41 [launcher.py:37] Route: /v1/embeddings, Methods: POST
[2025-08-12T08:42:41Z] (APIServer pid=13457) INFO 08-12 01:42:41 [launcher.py:37] Route: /pooling, Methods: POST
[2025-08-12T08:42:41Z] (APIServer pid=13457) INFO 08-12 01:42:41 [launcher.py:37] Route: /classify, Methods: POST
[2025-08-12T08:42:41Z] (APIServer pid=13457) INFO 08-12 01:42:41 [launcher.py:37] Route: /score, Methods: POST
[2025-08-12T08:42:41Z] (APIServer pid=13457) INFO 08-12 01:42:41 [launcher.py:37] Route: /v1/score, Methods: POST
[2025-08-12T08:42:41Z] (APIServer pid=13457) INFO 08-12 01:42:41 [launcher.py:37] Route: /v1/audio/transcriptions, Methods: POST
[2025-08-12T08:42:41Z] (APIServer pid=13457) INFO 08-12 01:42:41 [launcher.py:37] Route: /v1/audio/translations, Methods: POST
[2025-08-12T08:42:41Z] (APIServer pid=13457) INFO 08-12 01:42:41 [launcher.py:37] Route: /rerank, Methods: POST
[2025-08-12T08:42:41Z] (APIServer pid=13457) INFO 08-12 01:42:41 [launcher.py:37] Route: /v1/rerank, Methods: POST
[2025-08-12T08:42:41Z] (APIServer pid=13457) INFO 08-12 01:42:41 [launcher.py:37] Route: /v2/rerank, Methods: POST
[2025-08-12T08:42:41Z] (APIServer pid=13457) INFO 08-12 01:42:41 [launcher.py:37] Route: /scale_elastic_ep, Methods: POST
[2025-08-12T08:42:41Z] (APIServer pid=13457) INFO 08-12 01:42:41 [launcher.py:37] Route: /is_scaling_elastic_ep, Methods: POST
[2025-08-12T08:42:41Z] (APIServer pid=13457) INFO 08-12 01:42:41 [launcher.py:37] Route: /invocations, Methods: POST
[2025-08-12T08:42:41Z] (APIServer pid=13457) INFO 08-12 01:42:41 [launcher.py:37] Route: /metrics, Methods: GET
[2025-08-12T08:42:41Z] (APIServer pid=13457) INFO:     Started server process [13457]
[2025-08-12T08:42:41Z] (APIServer pid=13457) INFO:     Waiting for application startup.
[2025-08-12T08:42:41Z] (APIServer pid=13457) INFO:     Application startup complete.
[2025-08-12T08:42:42Z] (APIServer pid=13457) DEBUG 08-12 01:42:42 [async_llm.py:557] Called check_health.
[2025-08-12T08:42:42Z] (APIServer pid=13457) INFO:     127.0.0.1:50620 - "GET /health HTTP/1.1" 200 OK
[2025-08-12T08:42:42Z] (EngineCore_0 pid=13564) DEBUG 08-12 01:42:42 [core.py:717] EngineCore loop active.
[2025-08-12T08:42:42Z] (EngineCore_0 pid=13564) DEBUG 08-12 01:42:42 [core.py:711] EngineCore waiting for work.
[2025-08-12T08:42:42Z] (APIServer pid=13457) INFO:     127.0.0.1:50624 - "POST /v1/completions HTTP/1.1" 200 OK
[2025-08-12T08:42:43Z] (EngineCore_0 pid=13564) DEBUG 08-12 01:42:43 [core.py:717] EngineCore loop active.
[2025-08-12T08:42:43Z] (EngineCore_0 pid=13564) DEBUG 08-12 01:42:43 [core.py:711] EngineCore waiting for work.
[2025-08-12T08:42:43Z] (APIServer pid=13457) INFO:     127.0.0.1:50624 - "POST /v1/completions HTTP/1.1" 200 OK
[2025-08-12T08:42:44Z] PASSED
[2025-08-12T08:42:44Z] entrypoints/openai/test_return_token_ids.py::test_chat_completion_with_tool_use (APIServer pid=13457) INFO 08-12 01:42:44 [chat_utils.py:470] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
[2025-08-12T08:42:44Z] (EngineCore_0 pid=13564) DEBUG 08-12 01:42:44 [core.py:717] EngineCore loop active.
[2025-08-12T08:42:44Z] (EngineCore_0 pid=13564) DEBUG 08-12 01:42:44 [core.py:711] EngineCore waiting for work.
[2025-08-12T08:42:44Z] (APIServer pid=13457) INFO:     127.0.0.1:50626 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-08-12T08:42:45Z] (EngineCore_0 pid=13564) DEBUG 08-12 01:42:45 [core.py:717] EngineCore loop active.
[2025-08-12T08:42:45Z] (EngineCore_0 pid=13564) DEBUG 08-12 01:42:45 [core.py:711] EngineCore waiting for work.
[2025-08-12T08:42:45Z] (APIServer pid=13457) INFO:     127.0.0.1:50626 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-08-12T08:42:46Z] PASSED
[2025-08-12T08:42:46Z] entrypoints/openai/test_return_token_ids.py::test_comparison_with_prompt_logprobs_and_logprobs (EngineCore_0 pid=13564) DEBUG 08-12 01:42:46 [core.py:717] EngineCore loop active.
[2025-08-12T08:42:46Z] (EngineCore_0 pid=13564) DEBUG 08-12 01:42:46 [core.py:711] EngineCore waiting for work.
[2025-08-12T08:42:46Z] (APIServer pid=13457) INFO:     127.0.0.1:39300 - "POST /v1/completions HTTP/1.1" 200 OK
[2025-08-12T08:42:47Z] (APIServer pid=13457) INFO:     127.0.0.1:39300 - "POST /v1/completions HTTP/1.1" 200 OK
[2025-08-12T08:42:47Z] (EngineCore_0 pid=13564) DEBUG 08-12 01:42:47 [core.py:717] EngineCore loop active.
[2025-08-12T08:42:47Z] (EngineCore_0 pid=13564) DEBUG 08-12 01:42:47 [core.py:711] EngineCore waiting for work.
[2025-08-12T08:42:48Z] PASSED
[2025-08-12T08:42:49Z] entrypoints/openai/test_return_token_ids.py::test_chat_completion_with_emoji_and_token_ids (EngineCore_0 pid=13564) DEBUG 08-12 01:42:49 [core.py:717] EngineCore loop active.
[2025-08-12T08:42:49Z] (EngineCore_0 pid=13564) DEBUG 08-12 01:42:49 [core.py:711] EngineCore waiting for work.
[2025-08-12T08:42:49Z] (APIServer pid=13457) INFO:     127.0.0.1:39312 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-08-12T08:42:50Z] (APIServer pid=13457) INFO:     127.0.0.1:39312 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-08-12T08:42:50Z] (EngineCore_0 pid=13564) DEBUG 08-12 01:42:50 [core.py:717] EngineCore loop active.
[2025-08-12T08:42:50Z] (EngineCore_0 pid=13564) DEBUG 08-12 01:42:50 [core.py:711] EngineCore waiting for work.
[2025-08-12T08:42:51Z] PASSED (APIServer pid=13457) DEBUG 08-12 01:42:51 [launcher.py:77] port 43465 is used by process psutil.Process(pid=13457, name='vllm', status='running', started='01:42:10') launched with command:

[2025-08-12T08:42:51Z] (APIServer pid=13457) DEBUG 08-12 01:42:51 [launcher.py:77] /usr/bin/python3 /usr/local/bin/vllm serve Qwen/Qwen2.5-1.5B-Instruct --max-model-len 2048 --max-num-seqs 128 --enable-auto-tool-choice --tool-call-parser hermes --enforce-eager --gpu-memory-utilization 0.7 --port 43465 --seed 0
[2025-08-12T08:42:51Z] (APIServer pid=13457) INFO 08-12 01:42:51 [launcher.py:80] Shutting down FastAPI HTTP server.
[2025-08-12T08:42:51Z] (EngineCore_0 pid=13564) DEBUG 08-12 01:42:51 [core.py:679] EngineCore exiting.
[2025-08-12T08:42:51Z] [rank0]:[W812 01:42:51.638223122 ProcessGroupNCCL.cpp:1479] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
[2025-08-12T08:42:52Z] (APIServer pid=13457) INFO:     Shutting down
[2025-08-12T08:42:52Z] (APIServer pid=13457) INFO 08-12 01:42:52 [loggers.py:123] Engine 000: Avg prompt throughput: 47.1 tokens/s, Avg generation throughput: 12.8 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 43.6%
[2025-08-12T08:42:52Z] (APIServer pid=13457) INFO:     Waiting for application shutdown.
[2025-08-12T08:42:52Z] (APIServer pid=13457) INFO:     Application shutdown complete.
[2025-08-12T08:42:52Z] 

I couldn't find which tests actually failed, though. XFAIL doesn't matter, I guess? I got some SUBFAILs like this:

[2025-08-12T08:36:31Z] entrypoints/openai/test_openai_schema.py::test_openapi_stateless[POST /v1/chat/completions] (verbose_name='POST /v1/chat/completions') SUBFAIL
[2025-08-12T08:36:34Z] entrypoints/openai/test_openai_schema.py::test_openapi_stateless[POST /v1/completions] (verbose_name='POST /v1/completions') SUBFAIL
[2025-08-12T08:36:35Z] entrypoints/openai/test_openai_schema.py::test_openapi_stateless[POST /v1/embeddings] (verbose_name='POST /v1/embeddings') SUBFAIL
[2025-08-12T08:36:37Z] entrypoints/openai/test_openai_schema.py::test_openapi_stateless[POST /pooling] (verbose_name='POST /pooling') SUBFAIL
[2025-08-12T08:36:39Z] entrypoints/openai/test_openai_schema.py::test_openapi_stateless[POST /classify] (verbose_name='POST /classify') SUBFAIL
[2025-08-12T08:36:42Z] entrypoints/openai/test_openai_schema.py::test_openapi_stateless[POST /score] (verbose_name='POST /score') SUBFAIL
[2025-08-12T08:36:43Z] entrypoints/openai/test_openai_schema.py::test_openapi_stateless[POST /v1/score] (verbose_name='POST /v1/score') SUBFAIL
[2025-08-12T08:36:44Z] entrypoints/openai/test_openai_schema.py::test_openapi_stateless[POST /v1/audio/transcriptions] (verbose_name='POST /v1/audio/transcriptions') SUBFAIL
[2025-08-12T08:36:47Z] entrypoints/openai/test_openai_schema.py::test_openapi_stateless[POST /v1/audio/translations] (verbose_name='POST /v1/audio/translations') SUBFAIL
[2025-08-12T08:36:50Z] entrypoints/openai/test_openai_schema.py::test_openapi_stateless[POST /rerank] (verbose_name='POST /rerank') SUBFAIL
[2025-08-12T08:36:53Z] entrypoints/openai/test_openai_schema.py::test_openapi_stateless[POST /v1/rerank] (verbose_name='POST /v1/rerank') SUBFAIL
[2025-08-12T08:36:53Z] entrypoints/openai/test_openai_schema.py::test_openapi_stateless[POST /v2/rerank] (verbose_name='POST /v2/rerank') SUBFAIL
[2025-08-12T08:36:54Z] entrypoints/openai/test_openai_schema.py::test_openapi_stateless[POST /scale_elastic_ep] (verbose_name='POST /scale_elastic_ep') SUBFAIL
[2025-08-12T08:36:54Z] entrypoints/openai/test_openai_schema.py::test_openapi_stateless[POST /is_scaling_elastic_ep] (verbose_name='POST /is_scaling_elastic_ep') SUBFAIL
[2025-08-12T08:36:54Z] entrypoints/openai/test_openai_schema.py::test_openapi_stateless[POST /invocations] (verbose_name='POST /invocations') SUBFAIL
[2025-08-12T08:36:54Z] entrypoints/openai/test_openai_schema.py::test_openapi_stateless[GET /metrics] (verbose_name='GET /metrics') SUBFAIL

I think it's related to the API schema change? How do I properly update the schema (i.e., openapi.json)? And why do APIs like rerank fail? I didn't touch them at all.

@DarkLight1337
Member

Looks like a connection error, let me retry

@ultmaster
Author

No. Still no luck.

The two errors are:

[2025-08-12T11:25:58Z] ==================================== ERRORS ====================================
[2025-08-12T11:25:58Z] _ ERROR at setup of test_single_request[True-christian-pinto/Prithvi-EO-2.0-300M-TL-VLLM] _
[2025-08-12T11:25:58Z]
[2025-08-12T11:25:58Z]     @pytest.fixture(scope="module")
[2025-08-12T11:25:58Z]     def server():
[2025-08-12T11:25:58Z]         args = [
[2025-08-12T11:25:58Z]             "--runner",
[2025-08-12T11:25:58Z]             "pooling",
[2025-08-12T11:25:58Z]             # use half precision for speed and memory savings in CI environment
[2025-08-12T11:25:58Z]             "--dtype",
[2025-08-12T11:25:58Z]             DTYPE,
[2025-08-12T11:25:58Z]             "--enforce-eager",
[2025-08-12T11:25:58Z]             "--trust-remote-code",
[2025-08-12T11:25:58Z]             "--skip-tokenizer-init",
[2025-08-12T11:25:58Z]             "--max-num-seqs",
[2025-08-12T11:25:58Z]             "32"
[2025-08-12T11:25:58Z]         ]
[2025-08-12T11:25:58Z]
[2025-08-12T11:25:58Z] >       with RemoteOpenAIServer(MODEL_NAME, args) as remote_server:
[2025-08-12T11:25:58Z]
[2025-08-12T11:25:58Z] entrypoints/openai/test_skip_tokenizer.py:41:
[2025-08-12T11:25:58Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
[2025-08-12T11:25:58Z] utils.py:144: in __init__
[2025-08-12T11:25:58Z]     self._wait_for_server(url=self.url_for("health"),
[2025-08-12T11:25:58Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
[2025-08-12T11:25:58Z]
[2025-08-12T11:25:58Z] self = <tests.utils.RemoteOpenAIServer object at 0x7f2e0fec34d0>
[2025-08-12T11:25:58Z]
[2025-08-12T11:25:58Z]     def _wait_for_server(self, *, url: str, timeout: float):
[2025-08-12T11:25:58Z]         # run health check
[2025-08-12T11:25:58Z]         start = time.time()
[2025-08-12T11:25:58Z]         client = (httpx.Client(transport=httpx.HTTPTransport(
[2025-08-12T11:25:58Z]             uds=self.uds)) if self.uds else requests)
[2025-08-12T11:25:58Z]         while True:
[2025-08-12T11:25:58Z]             try:
[2025-08-12T11:25:58Z]                 if client.get(url).status_code == 200:
[2025-08-12T11:25:58Z]                     break
[2025-08-12T11:25:58Z]             except Exception:
[2025-08-12T11:25:58Z]                 # this exception can only be raised by requests.get,
[2025-08-12T11:25:58Z]                 # which means the server is not ready yet.
[2025-08-12T11:25:58Z]                 # the stack trace is not useful, so we suppress it
[2025-08-12T11:25:58Z]                 # by using `raise from None`.
[2025-08-12T11:25:58Z]                 result = self.proc.poll()
[2025-08-12T11:25:58Z]                 if result is not None and result != 0:
[2025-08-12T11:25:58Z] >                   raise RuntimeError("Server exited unexpectedly.") from None
[2025-08-12T11:25:58Z] E                   RuntimeError: Server exited unexpectedly.
[2025-08-12T11:25:58Z]
[2025-08-12T11:25:58Z] utils.py:174: RuntimeError
[2025-08-12T11:25:58Z] ------------------------------ Captured log setup ------------------------------
[2025-08-12T11:25:58Z] WARNING  transformers.configuration_utils:configuration_utils.py:684 The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
[2025-08-12T11:25:58Z] _ ERROR at setup of test_single_request[False-christian-pinto/Prithvi-EO-2.0-300M-TL-VLLM] _
[2025-08-12T11:25:58Z]
[2025-08-12T11:25:58Z]     @pytest.fixture(scope="module")
[2025-08-12T11:25:58Z]     def server():
[2025-08-12T11:25:58Z]         args = [
[2025-08-12T11:25:58Z]             "--runner",
[2025-08-12T11:25:58Z]             "pooling",
[2025-08-12T11:25:58Z]             # use half precision for speed and memory savings in CI environment
[2025-08-12T11:25:58Z]             "--dtype",
[2025-08-12T11:25:58Z]             DTYPE,
[2025-08-12T11:25:58Z]             "--enforce-eager",
[2025-08-12T11:25:58Z]             "--trust-remote-code",
[2025-08-12T11:25:58Z]             "--skip-tokenizer-init",
[2025-08-12T11:25:58Z]             "--max-num-seqs",
[2025-08-12T11:25:58Z]             "32"
[2025-08-12T11:25:58Z]         ]
[2025-08-12T11:25:58Z]
[2025-08-12T11:25:58Z] >       with RemoteOpenAIServer(MODEL_NAME, args) as remote_server:
[2025-08-12T11:25:58Z]
[2025-08-12T11:25:58Z] entrypoints/openai/test_skip_tokenizer.py:41:
[2025-08-12T11:25:58Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
[2025-08-12T11:25:58Z] utils.py:144: in __init__
[2025-08-12T11:25:58Z]     self._wait_for_server(url=self.url_for("health"),
[2025-08-12T11:25:58Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
[2025-08-12T11:25:58Z]
[2025-08-12T11:25:58Z] self = <tests.utils.RemoteOpenAIServer object at 0x7f2e0fec34d0>
[2025-08-12T11:25:58Z]
[2025-08-12T11:25:58Z]     def _wait_for_server(self, *, url: str, timeout: float):
[2025-08-12T11:25:58Z]         # run health check
[2025-08-12T11:25:58Z]         start = time.time()
[2025-08-12T11:25:58Z]         client = (httpx.Client(transport=httpx.HTTPTransport(
[2025-08-12T11:25:58Z]             uds=self.uds)) if self.uds else requests)
[2025-08-12T11:25:58Z]         while True:
[2025-08-12T11:25:58Z]             try:
[2025-08-12T11:25:58Z]                 if client.get(url).status_code == 200:
[2025-08-12T11:25:58Z]                     break
[2025-08-12T11:25:58Z]             except Exception:
[2025-08-12T11:25:58Z]                 # this exception can only be raised by requests.get,
[2025-08-12T11:25:58Z]                 # which means the server is not ready yet.
[2025-08-12T11:25:58Z]                 # the stack trace is not useful, so we suppress it
[2025-08-12T11:25:58Z]                 # by using `raise from None`.
[2025-08-12T11:25:58Z]                 result = self.proc.poll()
[2025-08-12T11:25:58Z]                 if result is not None and result != 0:
[2025-08-12T11:25:58Z] >                   raise RuntimeError("Server exited unexpectedly.") from None
[2025-08-12T11:25:58Z] E                   RuntimeError: Server exited unexpectedly.
[2025-08-12T11:25:58Z]
[2025-08-12T11:25:58Z] utils.py:174: RuntimeError
[2025-08-12T11:25:58Z] =================================== FAILURES ===================================
[2025-08-12T11:25:58Z] ______ test_openapi_stateless (verbose_name='POST /v1/chat/completions') _______
[2025-08-12T11:25:58Z]
[2025-08-12T11:25:58Z]     @wraps(test)
[2025-08-12T11:25:58Z] >   def test_function(*args: Any, **kwargs: Any) -> Any:
[2025-08-12T11:25:58Z] E   schemathesis.exceptions.CheckFailed: Schemathesis found 2 distinct sets of failures.
[2025-08-12T11:25:58Z] E   ====================
[2025-08-12T11:25:58Z] E   
[2025-08-12T11:25:58Z] E   self = <OpenApi30 for FastAPI 0.1.0>
[2025-08-12T11:25:58Z] E   operation = APIOperation(path='/v1/chat/completions', method='post', schema=<OpenApi30 for FastAPI 0.1.0>, verbose_name='POST /v1/...est'}}, media_type='application/json', required=True, description=None)]), case_cls=<class 'schemathesis.models.Case'>)
[2025-08-12T11:25:58Z] E   response = <Response [500]>
[2025-08-12T11:25:58Z] E   
[2025-08-12T11:25:58Z] E       def validate_response(self, operation: APIOperation, response: GenericResponse) -> bool | None:
[2025-08-12T11:25:58Z] E           responses = {str(key): value for key, value in operation.definition.raw.get("responses", {}).items()}
[2025-08-12T11:25:58Z] E           status_code = str(response.status_code)
[2025-08-12T11:25:58Z] E           if status_code in responses:
[2025-08-12T11:25:58Z] E               definition = responses[status_code]
[2025-08-12T11:25:58Z] E           elif "default" in responses:

I'm still getting the 16 connection failures.

@DarkLight1337
Member

The failing test about christian-pinto/Prithvi-EO-2.0-300M-TL-VLLM is not caused by this PR. Maybe the rest failed because the memory didn't get cleared after the first failure

@ultmaster
Author

@DarkLight1337 Thank you. If there is any more action that needs to be done on my side, please let me know!

@DarkLight1337
Member

That issue should be fixed in latest main so you can try merging from main again

@ultmaster
Author

ultmaster commented Aug 12, 2025

Still:

[2025-08-12T15:07:59Z] = 16 failed, 523 passed, 32 skipped, 1 xfailed, 48 warnings, 11 subtests passed in 5579.35s (1:32:59) =

A lot of connection errors.

christian-pinto/Prithvi-EO-2.0-300M-TL-VLLM seems to be fixed.

@DarkLight1337
Member

Can you run those tests locally and see if they fail as well?

Comment on lines +401 to +405
# has_echoed[i] is reused here to indicate whether
# we have already returned the prompt token IDs.
if not has_echoed[i]:
    prompt_token_ids_to_return = prompt_token_ids
    has_echoed[i] = True
Member


So we always return the prompt_token_ids if return_token_ids is set, even if echo isn't set?

Maybe it would be better to only return them if echo is True?
