Add return_token_ids parameter to OpenAI API endpoints #22587

Open

ultmaster wants to merge 8 commits into main from add-token-ids-alongside-feature

Conversation


@ultmaster ultmaster commented Aug 10, 2025

Essential Elements of an Effective PR Description Checklist

  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Purpose

Getting the token IDs (not logprobs) from vLLM inference responses is very important for agent RL-training scenarios, especially when the agent loop relies on the vLLM OpenAI endpoint as a fast server to perform rollouts and collect trajectories like [(prompt_token_ids, response_token_ids, reward), ...]. The agents need the raw prompts and responses as strings, but they also need to track the under-the-hood tokens so that the RL algorithms can use them for optimization. A small sketch of that per-step record is shown below.
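To make the shape of that data concrete, here is a minimal, hypothetical sketch of the record an agent loop might accumulate per rollout step (the names are illustrative and not part of this PR):

from dataclasses import dataclass

@dataclass
class RolloutStep:
    prompt_token_ids: list[int]     # tokens the model actually consumed
    response_token_ids: list[int]   # tokens the model actually generated
    reward: float                   # assigned by the environment or a judge

# trajectory == [(prompt_token_ids, response_token_ids, reward), ...]
trajectory: list[RolloutStep] = []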

When I first authored agent-lightning, it was very hard to get the prompt and response token IDs out of vLLM, even though they are already sitting there as local variables in the OpenAI-compatible server implementation. This led me to write a monkey patch that essentially modifies OpenAIServingChat.chat_completion_full_generator to capture the input and output token IDs. The code is here and is not long:

https://github.com/microsoft/agent-lightning/blob/24d590f4ea135bd88d8d3c5299526b7d5866b100/agentlightning/instrumentation/vllm.py
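For illustration only, a stripped-down sketch of that patching idea could look like the following. This is not the actual agent-lightning code; it assumes (as in recent vLLM versions) that the method's first arguments are the request, the result generator, and the request id, and it only relies on RequestOutput exposing prompt_token_ids and per-output token_ids.

# Hypothetical sketch of the monkey-patch idea, not the real implementation.
from vllm.entrypoints.openai.serving_chat import OpenAIServingChat

_original = OpenAIServingChat.chat_completion_full_generator
captured: dict[str, tuple[list[int], list[list[int]]]] = {}  # request_id -> ids

async def _patched(self, request, result_generator, request_id, *args, **kwargs):
    async def tee():
        async for res in result_generator:
            # RequestOutput carries the prompt token IDs and, per output,
            # the generated token IDs; stash them in a side channel.
            captured[request_id] = (
                list(res.prompt_token_ids or []),
                [list(out.token_ids) for out in res.outputs],
            )
            yield res

    return await _original(self, request, tee(), request_id, *args, **kwargs)

OpenAIServingChat.chat_completion_full_generator = _patched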

Recently, I found that vLLM now supports prompt_logprobs and return_tokens_as_token_ids as additional parameters to the chat completion API. Though I don't need logprobs, I thought it would be a convenient way to recover the token IDs from the logprobs. It turned out differently from what I expected. Testing with Qwen2.5-0.5B-Instruct, prompt_logprobs gives me different results from prompt_token_ids:

prompt token ids: [151644, 8948, 198, 2610, 525, …
prompt log probs:
[None, // the first token is missing
{8948: Logprob(logprob=-12.825027465820312, rank=12784, decoded_token='system'), 72030: Logprob(logprob=-0.9812774658203125, rank=1, decoded_token='/API')}, // ??? why two tokens here?
{198: Logprob(logprob=-1.8129281997680664, rank=1, decoded_token='\n')},
{2610: Logprob(logprob=-7.753974914550781, rank=273, decoded_token='You'), 2: Logprob(logprob=-2.9414749145507812, rank=1, decoded_token='#')}, // two tokens here too
{525: Logprob(logprob=-0.28957295417785645, rank=1, decoded_token=' are')}, …

For responses, the returned "token:12345" strings look okay with return_tokens_as_token_ids on. It is a little awkward, though, to have to parse the integer back out of a string like "token:xxxx".
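For reference, a small sketch of that parsing step (assuming the tokens come back as strings with the ID after a colon; the exact prefix may differ between vLLM versions):

def parse_token_id(token_str: str) -> int:
    # "token:12345" (or "token_id:12345" in some versions) -> 12345
    return int(token_str.rsplit(":", 1)[-1])

token_ids = [parse_token_id(t) for t in ["token:151644", "token:8948"]]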

So this PR adds the token IDs alongside the prompt and response text.

Update: the parameter has been renamed to return_token_ids.
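A minimal usage sketch, under the assumption (taken from the commit messages in this PR) that the flag is accepted as an extra request field and that the response carries prompt_token_ids plus per-choice token_ids:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
    extra_body={"return_token_ids": True},  # vLLM-specific extension from this PR
)

# Field locations assumed from the commit messages (prompt_token_ids on the
# response, token_ids on each choice); not part of the official OpenAI spec.
print(getattr(response, "prompt_token_ids", None))
print(getattr(response.choices[0], "token_ids", None))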

Test Plan

Unit tests added.

Test Result

Passed locally.

(Optional) Documentation Update

In code descriptions.

@ultmaster ultmaster requested a review from aarnphm as a code owner August 10, 2025 08:39

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the frontend label Aug 10, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces the return_token_ids_alongside parameter to the OpenAI-compatible endpoints for chat and text completions. This is a well-motivated feature, particularly for agent-based reinforcement learning scenarios where having direct access to token IDs for both prompts and responses is essential. The implementation correctly adds the new parameter to the request models and populates the corresponding token ID fields in the response models. My main feedback is the absence of tests. While the changes appear correct, adding comprehensive tests is necessary to validate the new functionality and ensure long-term maintainability.

Comment on lines 570 to 578
    return_token_ids_alongside: Optional[bool] = Field(
        default=False,
        description=(
            "If specified, the result will include both prompt and response "
            "token ids alongside the generated text. "
            "This is useful for debugging or when you "
            "need to map generated text back to input tokens."
        )
    )
Contributor


high

This pull request introduces a valuable feature for agent-based scenarios. However, it currently lacks tests. Adding unit and integration tests is crucial to ensure the new return_token_ids_alongside parameter works as expected across all affected endpoints (/v1/chat/completions and /v1/completions) and to prevent future regressions. Please add tests that cover both streaming and non-streaming responses, and verify that the token IDs for both prompts and responses are correctly returned when the flag is enabled, and not returned when disabled.
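Purely as an illustration of what such a test could assert (this is a hedged sketch against an already-running server, not the test file that was eventually added to this PR):

import openai
import pytest

@pytest.mark.asyncio
async def test_return_token_ids_flag():
    # Assumes a vLLM OpenAI-compatible server is already serving a model locally.
    client = openai.AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    models = await client.models.list()
    model = models.data[0].id

    with_ids = await client.completions.create(
        model=model, prompt="Hello", max_tokens=5,
        extra_body={"return_token_ids": True},
    )
    without_ids = await client.completions.create(
        model=model, prompt="Hello", max_tokens=5,
    )

    # Field name assumed from this PR: present when enabled, absent otherwise.
    assert getattr(with_ids.choices[0], "token_ids", None)
    assert getattr(without_ids.choices[0], "token_ids", None) is None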

ultmaster and others added 2 commits August 10, 2025 16:41
- Add optional return_token_ids_alongside parameter to ChatCompletionRequest and CompletionRequest
- Include token_ids and prompt_token_ids fields in response models when requested
- Implement conditional logic in serving endpoints to return token IDs alongside generated text
- Useful for debugging and agent scenarios where token-level tracing is needed

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Signed-off-by: Yuge Zhang <[email protected]>
Signed-off-by: Yuge Zhang <[email protected]>
@CharlyWNot

unsubscribe

@ultmaster ultmaster force-pushed the add-token-ids-alongside-feature branch from 2954f14 to 48dd2f4 on August 10, 2025 08:42
ultmaster and others added 2 commits August 10, 2025 16:51
Split long comment onto multiple lines for better readability.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Signed-off-by: Yuge Zhang <[email protected]>
Improve the formatting of conditional token_ids and prompt_token_ids
assignments to be more concise and readable.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Signed-off-by: Yuge Zhang <[email protected]>
Member

@youkaichao youkaichao left a comment


cc @njhill

The idea makes sense to me, since we also support other non-OpenAI-API-compatible features like beam search.

The key concern here is whether it adds any overhead when people don't request the token IDs output.

In addition, please add some tests to make sure the behavior is covered.

@youkaichao
Member

Also cc @hmellor: do we have any centralized doc to keep track of these non-OpenAI-compatible behaviors?

@hmellor
Member

hmellor commented Aug 11, 2025

Not a doc specifically for this, but in https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html each API has separate sections for normal and "extra" params.

Although, looking at the source, this actually excludes the OpenAI arguments completely...

@ultmaster
Author

cc @njhill

The idea makes sense to me, since we also support other non-OpenAI-API-compatible features like beam search.

The key concern here is whether it adds any overhead when people don't request the token IDs output.

In addition, please add some tests to make sure the behavior is covered.

@youkaichao Thanks for the review.

I don't think it adds any overhead (if you are talking about machine overhead rather than mental overhead), because the variables are already there and I'm just returning them when the flag is set.

I'll add tests. Probably will take me some time to set up the test env.

@njhill
Member

njhill commented Aug 11, 2025

I think this is reasonable/useful.

I don't like the parameter name return_token_ids_alongside though, perhaps it can be just return_token_ids?

Couple of questions:

  • Should we have a way to request only prompt token ids and/or output token ids?
  • WDYT having an additional simpler "raw" non-OpenAI API endpoint?

@KuntaiDu
Collaborator

I think this is reasonable/useful.

I don't like the parameter name return_token_ids_alongside though, perhaps it can be just return_token_ids?

Couple of questions:

  • Should we have a way to request only prompt token ids and/or output token ids?
  • WDYT having an additional simpler "raw" non-OpenAI API endpoint?

Totally agree that we need a simple, raw, token-in-token-out endpoint!

@ultmaster
Author

@njhill

I don't like the parameter name return_token_ids_alongside though, perhaps it can be just return_token_ids?

There is already a return_tokens_as_token_ids (which relates to the logprobs output), so I used return_token_ids_alongside to distinguish the two. I have no personal preference though, so as you wish: return_token_ids.

Should we have a way to request only prompt token ids and/or output token ids?

It adds more complexity to the API, and I can't see why that's necessary. If that comes up as a feature request in the future, we can make return_token_ids a Union[bool, Literal["prompt", "response"]] to further control the behavior; see the sketch after this paragraph.
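If that extension were ever needed, the field could hypothetically be declared along these lines (a sketch only, not code from this PR):

from typing import Literal, Optional, Union
from pydantic import BaseModel, Field

class ChatCompletionRequestSketch(BaseModel):
    # Hypothetical extension of the flag discussed above.
    return_token_ids: Optional[Union[bool, Literal["prompt", "response"]]] = Field(
        default=False,
        description="True/False, or restrict the returned IDs to 'prompt' or 'response'.",
    )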

WDYT having an additional simpler "raw" non-OpenAI API endpoint?

That would simply break all existing agent code and frameworks built on the OpenAI API endpoint. We need to perform rollouts against the OpenAI endpoint while tracing the token IDs in the telemetry for training. If someone isn't afraid of refactoring their code, they can do it, but I guess that's not part of this PR.

@njhill
Member

njhill commented Aug 12, 2025

I don't like the parameter name return_token_ids_alongside though, perhaps it can be just return_token_ids?

There is already a return_tokens_as_token_ids (which relates to the logprobs output), so I used return_token_ids_alongside to distinguish the two. I have no personal preference though, so as you wish: return_token_ids.

Yes I guessed that was the reason but return_token_ids is different enough imo!

Should we have a way to request only prompt token ids and/or output token ids?

It adds more complexity to the API, and I can't see why that's necessary. If that comes up as a feature request in the future, we can make return_token_ids a Union[bool, Literal["prompt", "response"]] to further control the behavior.

Sounds reasonable

WDYT having an additional simpler "raw" non-OpenAI API endpoint?

That would simply break all existing agent code and frameworks built on the OpenAI API endpoint. We need to perform rollouts against the OpenAI endpoint while tracing the token IDs in the telemetry for training. If someone isn't afraid of refactoring their code, they can do it, but I guess that's not part of this PR.

Right, I wasn't suggesting this would replace the OpenAI API; it would just be a simpler alternative. And I wasn't suggesting it should be tied to this PR!

@ultmaster
Author

@njhill @youkaichao @hmellor The test work is done.

I've brought in support for streaming=True. It's a bit tricky. Please help review.

@ultmaster ultmaster changed the title from "Add return_token_ids_alongside parameter to OpenAI API endpoints" to "Add return_token_ids parameter to OpenAI API endpoints" on Aug 12, 2025
@ultmaster
Author

I can't see the full logs of the fastcheck here: https://buildkite.com/vllm/fastcheck/builds/34977/steps/canvas?jid=01989c1a-a53e-4511-b53f-2f4dfb61d9ba

Is it related to the changes I've made?

@DarkLight1337
Member

Can you merge from main? It should resolve the CI failure

@ultmaster
Author

ultmaster commented Aug 12, 2025

I think the newly added tests went well.

Related logs:
[2025-08-12T08:42:10Z] entrypoints/openai/test_return_token_ids.py::test_basic_completion_with_emoji INFO 08-12 01:42:10 [__init__.py:707] Resolved architecture: Qwen2ForCausalLM
[2025-08-12T08:42:10Z] INFO 08-12 01:42:10 [__init__.py:1735] Using max model len 2048
[2025-08-12T08:42:10Z] INFO 08-12 01:42:10 [weight_utils.py:296] Using model weights format ['*.safetensors']
[2025-08-12T08:42:11Z] INFO 08-12 01:42:11 [weight_utils.py:349] No model.safetensors.index.json found in remote.
[2025-08-12T08:42:15Z] DEBUG 08-12 01:42:15 [__init__.py:28] No plugins for group vllm.platform_plugins found.
[2025-08-12T08:42:15Z] DEBUG 08-12 01:42:15 [__init__.py:34] Checking if TPU platform is available.
[2025-08-12T08:42:15Z] DEBUG 08-12 01:42:15 [__init__.py:52] TPU platform is not available because: No module named 'libtpu'
[2025-08-12T08:42:15Z] DEBUG 08-12 01:42:15 [__init__.py:58] Checking if CUDA platform is available.
[2025-08-12T08:42:15Z] DEBUG 08-12 01:42:15 [__init__.py:78] Confirmed CUDA platform is available.
[2025-08-12T08:42:15Z] DEBUG 08-12 01:42:15 [__init__.py:106] Checking if ROCm platform is available.
[2025-08-12T08:42:15Z] DEBUG 08-12 01:42:15 [__init__.py:120] ROCm platform is not available because: No module named 'amdsmi'
[2025-08-12T08:42:15Z] DEBUG 08-12 01:42:15 [__init__.py:127] Checking if XPU platform is available.
[2025-08-12T08:42:15Z] DEBUG 08-12 01:42:15 [__init__.py:146] XPU platform is not available because: No module named 'intel_extension_for_pytorch'
[2025-08-12T08:42:15Z] DEBUG 08-12 01:42:15 [__init__.py:153] Checking if CPU platform is available.
[2025-08-12T08:42:15Z] DEBUG 08-12 01:42:15 [__init__.py:175] Checking if Neuron platform is available.
[2025-08-12T08:42:15Z] DEBUG 08-12 01:42:15 [__init__.py:58] Checking if CUDA platform is available.
[2025-08-12T08:42:15Z] DEBUG 08-12 01:42:15 [__init__.py:78] Confirmed CUDA platform is available.
[2025-08-12T08:42:15Z] INFO 08-12 01:42:15 [__init__.py:241] Automatically detected platform cuda.
[2025-08-12T08:42:17Z] DEBUG 08-12 01:42:17 [__init__.py:36] Available plugins for group vllm.general_plugins:
[2025-08-12T08:42:17Z] DEBUG 08-12 01:42:17 [__init__.py:38] - lora_filesystem_resolver -> vllm.plugins.lora_resolvers.filesystem_resolver:register_filesystem_resolver
[2025-08-12T08:42:17Z] DEBUG 08-12 01:42:17 [__init__.py:41] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
[2025-08-12T08:42:17Z] (APIServer pid=13457) INFO 08-12 01:42:17 [api_server.py:1805] vLLM API server version 0.10.1.dev566+g8c565a836
[2025-08-12T08:42:17Z] (APIServer pid=13457) INFO 08-12 01:42:17 [utils.py:326] non-default args: {'model_tag': 'Qwen/Qwen2.5-1.5B-Instruct', 'port': 43465, 'enable_auto_tool_choice': True, 'tool_call_parser': 'hermes', 'model': 'Qwen/Qwen2.5-1.5B-Instruct', 'seed': 0, 'max_model_len': 2048, 'enforce_eager': True, 'gpu_memory_utilization': 0.7, 'max_num_seqs': 128}
[2025-08-12T08:42:24Z] (APIServer pid=13457) INFO 08-12 01:42:24 [__init__.py:707] Resolved architecture: Qwen2ForCausalLM
[2025-08-12T08:42:24Z] (APIServer pid=13457) INFO 08-12 01:42:24 [__init__.py:1735] Using max model len 2048
[2025-08-12T08:42:24Z] (APIServer pid=13457) DEBUG 08-12 01:42:24 [arg_utils.py:1714] Setting max_num_batched_tokens to 2048 for OPENAI_API_SERVER usage context.
[2025-08-12T08:42:24Z] (APIServer pid=13457) INFO 08-12 01:42:24 [__init__.py:2035] Chunked prefill is enabled with max_num_batched_tokens=2048.
[2025-08-12T08:42:29Z] DEBUG 08-12 01:42:29 [__init__.py:28] No plugins for group vllm.platform_plugins found.
[2025-08-12T08:42:29Z] DEBUG 08-12 01:42:29 [__init__.py:34] Checking if TPU platform is available.
[2025-08-12T08:42:29Z] DEBUG 08-12 01:42:29 [__init__.py:52] TPU platform is not available because: No module named 'libtpu'
[2025-08-12T08:42:29Z] DEBUG 08-12 01:42:29 [__init__.py:58] Checking if CUDA platform is available.
[2025-08-12T08:42:29Z] DEBUG 08-12 01:42:29 [__init__.py:78] Confirmed CUDA platform is available.
[2025-08-12T08:42:29Z] DEBUG 08-12 01:42:29 [__init__.py:106] Checking if ROCm platform is available.
[2025-08-12T08:42:29Z] DEBUG 08-12 01:42:29 [__init__.py:120] ROCm platform is not available because: No module named 'amdsmi'
[2025-08-12T08:42:29Z] DEBUG 08-12 01:42:29 [__init__.py:127] Checking if XPU platform is available.
[2025-08-12T08:42:29Z] DEBUG 08-12 01:42:29 [__init__.py:146] XPU platform is not available because: No module named 'intel_extension_for_pytorch'
[2025-08-12T08:42:29Z] DEBUG 08-12 01:42:29 [__init__.py:153] Checking if CPU platform is available.
[2025-08-12T08:42:29Z] DEBUG 08-12 01:42:29 [__init__.py:175] Checking if Neuron platform is available.
[2025-08-12T08:42:29Z] DEBUG 08-12 01:42:29 [__init__.py:58] Checking if CUDA platform is available.
[2025-08-12T08:42:29Z] DEBUG 08-12 01:42:29 [__init__.py:78] Confirmed CUDA platform is available.
[2025-08-12T08:42:29Z] INFO 08-12 01:42:29 [__init__.py:241] Automatically detected platform cuda.
[2025-08-12T08:42:31Z] (EngineCore_0 pid=13564) INFO 08-12 01:42:31 [core.py:619] Waiting for init message from front-end.
[2025-08-12T08:42:31Z] (APIServer pid=13457) DEBUG 08-12 01:42:31 [utils.py:831] HELLO from local core engine process 0.
[2025-08-12T08:42:31Z] (EngineCore_0 pid=13564) DEBUG 08-12 01:42:31 [core.py:627] Received init message: EngineHandshakeMetadata(addresses=EngineZmqAddresses(inputs=['ipc:///tmp/8de505d0-7481-4ca3-a091-c654513aed54'], outputs=['ipc:///tmp/6b25092c-58ca-43f7-8e5e-bd05615f1b0e'], coordinator_input=None, coordinator_output=None, frontend_stats_publish_address=None), parallel_config={'data_parallel_master_ip': '127.0.0.1', 'data_parallel_master_port': 0, 'data_parallel_size': 1})
[2025-08-12T08:42:31Z] (EngineCore_0 pid=13564) DEBUG 08-12 01:42:31 [core.py:464] Has DP Coordinator: False, stats publish address: None
[2025-08-12T08:42:31Z] (EngineCore_0 pid=13564) DEBUG 08-12 01:42:31 [__init__.py:36] Available plugins for group vllm.general_plugins:
[2025-08-12T08:42:31Z] (EngineCore_0 pid=13564) DEBUG 08-12 01:42:31 [__init__.py:38] - lora_filesystem_resolver -> vllm.plugins.lora_resolvers.filesystem_resolver:register_filesystem_resolver
[2025-08-12T08:42:31Z] (EngineCore_0 pid=13564) DEBUG 08-12 01:42:31 [__init__.py:41] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
[2025-08-12T08:42:31Z] (EngineCore_0 pid=13564) INFO 08-12 01:42:31 [core.py:72] Initializing a V1 LLM engine (v0.10.1.dev566+g8c565a836) with config: model='Qwen/Qwen2.5-1.5B-Instruct', speculative_config=None, tokenizer='Qwen/Qwen2.5-1.5B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=Qwen/Qwen2.5-1.5B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":0,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":[],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":0,"cudagraph_capture_sizes":[],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"pass_config":{},"max_capture_size":0,"local_cache_dir":null}
[2025-08-12T08:42:31Z] (EngineCore_0 pid=13564) DEBUG 08-12 01:42:31 [decorators.py:139] Inferred dynamic dimensions for forward method of <class 'vllm.model_executor.models.llama.LlamaModel'>: ['input_ids', 'positions', 'intermediate_tensors', 'inputs_embeds']
[2025-08-12T08:42:31Z] (EngineCore_0 pid=13564) DEBUG 08-12 01:42:31 [decorators.py:139] Inferred dynamic dimensions for forward method of <class 'vllm.model_executor.models.llama_eagle3.LlamaModel'>: ['input_ids', 'positions', 'hidden_states']
[2025-08-12T08:42:31Z] (EngineCore_0 pid=13564) WARNING 08-12 01:42:31 [rocm.py:29] Failed to import from amdsmi with ModuleNotFoundError("No module named 'amdsmi'")
[2025-08-12T08:42:31Z] (EngineCore_0 pid=13564) WARNING 08-12 01:42:31 [rocm.py:40] Failed to import from vllm._rocm_C with ModuleNotFoundError("No module named 'vllm._rocm_C'")
[2025-08-12T08:42:32Z] (EngineCore_0 pid=13564) DEBUG 08-12 01:42:32 [__init__.py:3043] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7fa61203e540>
[2025-08-12T08:42:32Z] (EngineCore_0 pid=13564) DEBUG 08-12 01:42:32 [__init__.py:4120] enabled custom ops: Counter()
[2025-08-12T08:42:32Z] (EngineCore_0 pid=13564) DEBUG 08-12 01:42:32 [__init__.py:4122] disabled custom ops: Counter()
[2025-08-12T08:42:32Z] (EngineCore_0 pid=13564) DEBUG 08-12 01:42:32 [parallel_state.py:976] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://172.16.0.2:33891 backend=nccl
[2025-08-12T08:42:32Z] (EngineCore_0 pid=13564) DEBUG 08-12 01:42:32 [parallel_state.py:1027] Detected 1 nodes in the distributed environment
[2025-08-12T08:42:32Z] (EngineCore_0 pid=13564) INFO 08-12 01:42:32 [parallel_state.py:1134] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
[2025-08-12T08:42:32Z] (EngineCore_0 pid=13564) INFO 08-12 01:42:32 [topk_topp_sampler.py:50] Using FlashInfer for top-p & top-k sampling.
[2025-08-12T08:42:32Z] (EngineCore_0 pid=13564) DEBUG 08-12 01:42:32 [__init__.py:4120] enabled custom ops: Counter()
[2025-08-12T08:42:32Z] (EngineCore_0 pid=13564) DEBUG 08-12 01:42:32 [__init__.py:4122] disabled custom ops: Counter()
[2025-08-12T08:42:32Z] (EngineCore_0 pid=13564) INFO 08-12 01:42:32 [gpu_model_runner.py:1936] Starting to load model Qwen/Qwen2.5-1.5B-Instruct...
[2025-08-12T08:42:32Z] (EngineCore_0 pid=13564) INFO 08-12 01:42:32 [gpu_model_runner.py:1968] Loading model from scratch...
[2025-08-12T08:42:32Z] (EngineCore_0 pid=13564) INFO 08-12 01:42:32 [cuda.py:327] Using Flash Attention backend on V1 engine.
[2025-08-12T08:42:32Z] (EngineCore_0 pid=13564) DEBUG 08-12 01:42:32 [__init__.py:4120] enabled custom ops: Counter({'rms_norm': 57, 'silu_and_mul': 28, 'rotary_embedding': 1})
[2025-08-12T08:42:32Z] (EngineCore_0 pid=13564) DEBUG 08-12 01:42:32 [__init__.py:4122] disabled custom ops: Counter()
[2025-08-12T08:42:32Z] (EngineCore_0 pid=13564) DEBUG 08-12 01:42:32 [base_loader.py:47] Loading weights on cuda ...
[2025-08-12T08:42:33Z] (EngineCore_0 pid=13564) INFO 08-12 01:42:33 [weight_utils.py:296] Using model weights format ['*.safetensors']
[2025-08-12T08:42:33Z] (EngineCore_0 pid=13564) INFO 08-12 01:42:33 [weight_utils.py:349] No model.safetensors.index.json found in remote.
[2025-08-12T08:42:38Z] (EngineCore_0 pid=13564)
Loading safetensors checkpoint shards:   0% 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% 1/1 [00:05<00:00,  5.34s/it]
Loading safetensors checkpoint shards: 100% 1/1 [00:05<00:00,  5.34s/it]
[2025-08-12T08:42:38Z] (EngineCore_0 pid=13564) INFO 08-12 01:42:38 [default_loader.py:262] Loading weights took 5.43 seconds
[2025-08-12T08:42:39Z] (EngineCore_0 pid=13564) INFO 08-12 01:42:39 [gpu_model_runner.py:1985] Model loading took 2.8871 GiB and 5.965132 seconds
[2025-08-12T08:42:39Z] (EngineCore_0 pid=13564) /usr/local/lib/python3.12/dist-packages/torch/utils/cpp_extension.py:2356: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
[2025-08-12T08:42:39Z] (EngineCore_0 pid=13564) If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
[2025-08-12T08:42:39Z] (EngineCore_0 pid=13564)   warnings.warn(
[2025-08-12T08:42:40Z] (EngineCore_0 pid=13564) DEBUG 08-12 01:42:40 [gpu_worker.py:262] Initial free memory: 21.58 GiB; Requested memory: 0.70 (util), 15.43 GiB
[2025-08-12T08:42:40Z] (EngineCore_0 pid=13564) DEBUG 08-12 01:42:40 [gpu_worker.py:269] Free memory after profiling: 18.52 GiB (total), 12.37 GiB (within requested)
[2025-08-12T08:42:40Z] (EngineCore_0 pid=13564) DEBUG 08-12 01:42:40 [gpu_worker.py:275] Memory profiling takes 0.77 seconds. Total non KV cache memory: 3.17GiB; torch peak memory increase: 0.27GiB; non-torch forward increase memory: 0.02GiB; weights memory: 2.89GiB.
[2025-08-12T08:42:40Z] (EngineCore_0 pid=13564) INFO 08-12 01:42:40 [gpu_worker.py:276] Available KV cache memory: 12.26 GiB
[2025-08-12T08:42:40Z] (EngineCore_0 pid=13564) INFO 08-12 01:42:40 [kv_cache_utils.py:829] GPU KV cache size: 459,152 tokens
[2025-08-12T08:42:40Z] (EngineCore_0 pid=13564) INFO 08-12 01:42:40 [kv_cache_utils.py:833] Maximum concurrency for 2,048 tokens per request: 224.20x
[2025-08-12T08:42:40Z] (EngineCore_0 pid=13564) DEBUG 08-12 01:42:40 [__init__.py:4120] enabled custom ops: Counter({'rms_norm': 57, 'silu_and_mul': 28, 'rotary_embedding': 1})
[2025-08-12T08:42:40Z] (EngineCore_0 pid=13564) DEBUG 08-12 01:42:40 [__init__.py:4122] disabled custom ops: Counter()
[2025-08-12T08:42:40Z] (EngineCore_0 pid=13564) INFO 08-12 01:42:40 [core.py:199] init engine (profile, create kv cache, warmup model) took 1.26 seconds
[2025-08-12T08:42:41Z] (APIServer pid=13457) DEBUG 08-12 01:42:41 [utils.py:831] READY from local core engine process 0.
[2025-08-12T08:42:41Z] (EngineCore_0 pid=13564) DEBUG 08-12 01:42:41 [core.py:711] EngineCore waiting for work.
[2025-08-12T08:42:41Z] (APIServer pid=13457) INFO 08-12 01:42:41 [loggers.py:142] Engine 000: vllm cache_config_info with initialization after num_gpu_blocks is: 28697
[2025-08-12T08:42:41Z] (EngineCore_0 pid=13564) DEBUG 08-12 01:42:41 [core.py:711] EngineCore waiting for work.
[2025-08-12T08:42:41Z] (EngineCore_0 pid=13564) DEBUG 08-12 01:42:41 [core.py:711] EngineCore waiting for work.
[2025-08-12T08:42:41Z] (APIServer pid=13457) INFO 08-12 01:42:41 [api_server.py:1611] Supported_tasks: ['generate']
[2025-08-12T08:42:41Z] (APIServer pid=13457) WARNING 08-12 01:42:41 [__init__.py:1610] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
[2025-08-12T08:42:41Z] (APIServer pid=13457) INFO 08-12 01:42:41 [serving_responses.py:120] Using default chat sampling params from model: {'repetition_penalty': 1.1, 'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
[2025-08-12T08:42:41Z] (APIServer pid=13457) INFO 08-12 01:42:41 [serving_responses.py:149] "auto" tool choice has been enabled please note that while the parallel_tool_calls client option is preset for compatibility reasons, it will be ignored.
[2025-08-12T08:42:41Z] (APIServer pid=13457) INFO 08-12 01:42:41 [serving_chat.py:93] "auto" tool choice has been enabled please note that while the parallel_tool_calls client option is preset for compatibility reasons, it will be ignored.
[2025-08-12T08:42:41Z] (APIServer pid=13457) INFO 08-12 01:42:41 [serving_chat.py:133] Using default chat sampling params from model: {'repetition_penalty': 1.1, 'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
[2025-08-12T08:42:41Z] (APIServer pid=13457) INFO 08-12 01:42:41 [serving_completion.py:77] Using default completion sampling params from model: {'repetition_penalty': 1.1, 'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
[2025-08-12T08:42:41Z] (APIServer pid=13457) INFO 08-12 01:42:41 [api_server.py:1880] Starting vLLM API server 0 on http://0.0.0.0:43465
[2025-08-12T08:42:41Z] (APIServer pid=13457) INFO 08-12 01:42:41 [launcher.py:29] Available routes are:
[2025-08-12T08:42:41Z] (APIServer pid=13457) INFO 08-12 01:42:41 [launcher.py:37] Route: /openapi.json, Methods: HEAD, GET
[2025-08-12T08:42:41Z] (APIServer pid=13457) INFO 08-12 01:42:41 [launcher.py:37] Route: /docs, Methods: HEAD, GET
[2025-08-12T08:42:41Z] (APIServer pid=13457) INFO 08-12 01:42:41 [launcher.py:37] Route: /docs/oauth2-redirect, Methods: HEAD, GET
[2025-08-12T08:42:41Z] (APIServer pid=13457) INFO 08-12 01:42:41 [launcher.py:37] Route: /redoc, Methods: HEAD, GET
[2025-08-12T08:42:41Z] (APIServer pid=13457) INFO 08-12 01:42:41 [launcher.py:37] Route: /health, Methods: GET
[2025-08-12T08:42:41Z] (APIServer pid=13457) INFO 08-12 01:42:41 [launcher.py:37] Route: /load, Methods: GET
[2025-08-12T08:42:41Z] (APIServer pid=13457) INFO 08-12 01:42:41 [launcher.py:37] Route: /ping, Methods: POST
[2025-08-12T08:42:41Z] (APIServer pid=13457) INFO 08-12 01:42:41 [launcher.py:37] Route: /ping, Methods: GET
[2025-08-12T08:42:41Z] (APIServer pid=13457) INFO 08-12 01:42:41 [launcher.py:37] Route: /tokenize, Methods: POST
[2025-08-12T08:42:41Z] (APIServer pid=13457) INFO 08-12 01:42:41 [launcher.py:37] Route: /detokenize, Methods: POST
[2025-08-12T08:42:41Z] (APIServer pid=13457) INFO 08-12 01:42:41 [launcher.py:37] Route: /v1/models, Methods: GET
[2025-08-12T08:42:41Z] (APIServer pid=13457) INFO 08-12 01:42:41 [launcher.py:37] Route: /version, Methods: GET
[2025-08-12T08:42:41Z] (APIServer pid=13457) INFO 08-12 01:42:41 [launcher.py:37] Route: /v1/responses, Methods: POST
[2025-08-12T08:42:41Z] (APIServer pid=13457) INFO 08-12 01:42:41 [launcher.py:37] Route: /v1/responses/{response_id}, Methods: GET
[2025-08-12T08:42:41Z] (APIServer pid=13457) INFO 08-12 01:42:41 [launcher.py:37] Route: /v1/responses/{response_id}/cancel, Methods: POST
[2025-08-12T08:42:41Z] (APIServer pid=13457) INFO 08-12 01:42:41 [launcher.py:37] Route: /v1/chat/completions, Methods: POST
[2025-08-12T08:42:41Z] (APIServer pid=13457) INFO 08-12 01:42:41 [launcher.py:37] Route: /v1/completions, Methods: POST
[2025-08-12T08:42:41Z] (APIServer pid=13457) INFO 08-12 01:42:41 [launcher.py:37] Route: /v1/embeddings, Methods: POST
[2025-08-12T08:42:41Z] (APIServer pid=13457) INFO 08-12 01:42:41 [launcher.py:37] Route: /pooling, Methods: POST
[2025-08-12T08:42:41Z] (APIServer pid=13457) INFO 08-12 01:42:41 [launcher.py:37] Route: /classify, Methods: POST
[2025-08-12T08:42:41Z] (APIServer pid=13457) INFO 08-12 01:42:41 [launcher.py:37] Route: /score, Methods: POST
[2025-08-12T08:42:41Z] (APIServer pid=13457) INFO 08-12 01:42:41 [launcher.py:37] Route: /v1/score, Methods: POST
[2025-08-12T08:42:41Z] (APIServer pid=13457) INFO 08-12 01:42:41 [launcher.py:37] Route: /v1/audio/transcriptions, Methods: POST
[2025-08-12T08:42:41Z] (APIServer pid=13457) INFO 08-12 01:42:41 [launcher.py:37] Route: /v1/audio/translations, Methods: POST
[2025-08-12T08:42:41Z] (APIServer pid=13457) INFO 08-12 01:42:41 [launcher.py:37] Route: /rerank, Methods: POST
[2025-08-12T08:42:41Z] (APIServer pid=13457) INFO 08-12 01:42:41 [launcher.py:37] Route: /v1/rerank, Methods: POST
[2025-08-12T08:42:41Z] (APIServer pid=13457) INFO 08-12 01:42:41 [launcher.py:37] Route: /v2/rerank, Methods: POST
[2025-08-12T08:42:41Z] (APIServer pid=13457) INFO 08-12 01:42:41 [launcher.py:37] Route: /scale_elastic_ep, Methods: POST
[2025-08-12T08:42:41Z] (APIServer pid=13457) INFO 08-12 01:42:41 [launcher.py:37] Route: /is_scaling_elastic_ep, Methods: POST
[2025-08-12T08:42:41Z] (APIServer pid=13457) INFO 08-12 01:42:41 [launcher.py:37] Route: /invocations, Methods: POST
[2025-08-12T08:42:41Z] (APIServer pid=13457) INFO 08-12 01:42:41 [launcher.py:37] Route: /metrics, Methods: GET
[2025-08-12T08:42:41Z] (APIServer pid=13457) INFO:     Started server process [13457]
[2025-08-12T08:42:41Z] (APIServer pid=13457) INFO:     Waiting for application startup.
[2025-08-12T08:42:41Z] (APIServer pid=13457) INFO:     Application startup complete.
[2025-08-12T08:42:42Z] (APIServer pid=13457) DEBUG 08-12 01:42:42 [async_llm.py:557] Called check_health.
[2025-08-12T08:42:42Z] (APIServer pid=13457) INFO:     127.0.0.1:50620 - "GET /health HTTP/1.1" 200 OK
[2025-08-12T08:42:42Z] (EngineCore_0 pid=13564) DEBUG 08-12 01:42:42 [core.py:717] EngineCore loop active.
[2025-08-12T08:42:42Z] (EngineCore_0 pid=13564) DEBUG 08-12 01:42:42 [core.py:711] EngineCore waiting for work.
[2025-08-12T08:42:42Z] (APIServer pid=13457) INFO:     127.0.0.1:50624 - "POST /v1/completions HTTP/1.1" 200 OK
[2025-08-12T08:42:43Z] (EngineCore_0 pid=13564) DEBUG 08-12 01:42:43 [core.py:717] EngineCore loop active.
[2025-08-12T08:42:43Z] (EngineCore_0 pid=13564) DEBUG 08-12 01:42:43 [core.py:711] EngineCore waiting for work.
[2025-08-12T08:42:43Z] (APIServer pid=13457) INFO:     127.0.0.1:50624 - "POST /v1/completions HTTP/1.1" 200 OK
[2025-08-12T08:42:44Z] PASSED
[2025-08-12T08:42:44Z] entrypoints/openai/test_return_token_ids.py::test_chat_completion_with_tool_use (APIServer pid=13457) INFO 08-12 01:42:44 [chat_utils.py:470] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
[2025-08-12T08:42:44Z] (EngineCore_0 pid=13564) DEBUG 08-12 01:42:44 [core.py:717] EngineCore loop active.
[2025-08-12T08:42:44Z] (EngineCore_0 pid=13564) DEBUG 08-12 01:42:44 [core.py:711] EngineCore waiting for work.
[2025-08-12T08:42:44Z] (APIServer pid=13457) INFO:     127.0.0.1:50626 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-08-12T08:42:45Z] (EngineCore_0 pid=13564) DEBUG 08-12 01:42:45 [core.py:717] EngineCore loop active.
[2025-08-12T08:42:45Z] (EngineCore_0 pid=13564) DEBUG 08-12 01:42:45 [core.py:711] EngineCore waiting for work.
[2025-08-12T08:42:45Z] (APIServer pid=13457) INFO:     127.0.0.1:50626 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-08-12T08:42:46Z] PASSED
[2025-08-12T08:42:46Z] entrypoints/openai/test_return_token_ids.py::test_comparison_with_prompt_logprobs_and_logprobs (EngineCore_0 pid=13564) DEBUG 08-12 01:42:46 [core.py:717] EngineCore loop active.
[2025-08-12T08:42:46Z] (EngineCore_0 pid=13564) DEBUG 08-12 01:42:46 [core.py:711] EngineCore waiting for work.
[2025-08-12T08:42:46Z] (APIServer pid=13457) INFO:     127.0.0.1:39300 - "POST /v1/completions HTTP/1.1" 200 OK
[2025-08-12T08:42:47Z] (APIServer pid=13457) INFO:     127.0.0.1:39300 - "POST /v1/completions HTTP/1.1" 200 OK
[2025-08-12T08:42:47Z] (EngineCore_0 pid=13564) DEBUG 08-12 01:42:47 [core.py:717] EngineCore loop active.
[2025-08-12T08:42:47Z] (EngineCore_0 pid=13564) DEBUG 08-12 01:42:47 [core.py:711] EngineCore waiting for work.
[2025-08-12T08:42:48Z] PASSED
[2025-08-12T08:42:49Z] entrypoints/openai/test_return_token_ids.py::test_chat_completion_with_emoji_and_token_ids (EngineCore_0 pid=13564) DEBUG 08-12 01:42:49 [core.py:717] EngineCore loop active.
[2025-08-12T08:42:49Z] (EngineCore_0 pid=13564) DEBUG 08-12 01:42:49 [core.py:711] EngineCore waiting for work.
[2025-08-12T08:42:49Z] (APIServer pid=13457) INFO:     127.0.0.1:39312 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-08-12T08:42:50Z] (APIServer pid=13457) INFO:     127.0.0.1:39312 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-08-12T08:42:50Z] (EngineCore_0 pid=13564) DEBUG 08-12 01:42:50 [core.py:717] EngineCore loop active.
[2025-08-12T08:42:50Z] (EngineCore_0 pid=13564) DEBUG 08-12 01:42:50 [core.py:711] EngineCore waiting for work.
[2025-08-12T08:42:51Z] PASSED (APIServer pid=13457) DEBUG 08-12 01:42:51 [launcher.py:77] port 43465 is used by process psutil.Process(pid=13457, name='vllm', status='running', started='01:42:10') launched with command:

[2025-08-12T08:42:51Z] (APIServer pid=13457) DEBUG 08-12 01:42:51 [launcher.py:77] /usr/bin/python3 /usr/local/bin/vllm serve Qwen/Qwen2.5-1.5B-Instruct --max-model-len 2048 --max-num-seqs 128 --enable-auto-tool-choice --tool-call-parser hermes --enforce-eager --gpu-memory-utilization 0.7 --port 43465 --seed 0
[2025-08-12T08:42:51Z] (APIServer pid=13457) INFO 08-12 01:42:51 [launcher.py:80] Shutting down FastAPI HTTP server.
[2025-08-12T08:42:51Z] (EngineCore_0 pid=13564) DEBUG 08-12 01:42:51 [core.py:679] EngineCore exiting.
[2025-08-12T08:42:51Z] [rank0]:[W812 01:42:51.638223122 ProcessGroupNCCL.cpp:1479] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
[2025-08-12T08:42:52Z] (APIServer pid=13457) INFO:     Shutting down
[2025-08-12T08:42:52Z] (APIServer pid=13457) INFO 08-12 01:42:52 [loggers.py:123] Engine 000: Avg prompt throughput: 47.1 tokens/s, Avg generation throughput: 12.8 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 43.6%
[2025-08-12T08:42:52Z] (APIServer pid=13457) INFO:     Waiting for application shutdown.
[2025-08-12T08:42:52Z] (APIServer pid=13457) INFO:     Application shutdown complete.
[2025-08-12T08:42:52Z] 

I couldn't find which tests actually failed, though. XFAIL doesn't matter, I guess? I got some SUBFAILs like this:

[2025-08-12T08:36:31Z] entrypoints/openai/test_openai_schema.py::test_openapi_stateless[POST /v1/chat/completions] (verbose_name='POST /v1/chat/completions') SUBFAIL
[2025-08-12T08:36:34Z] entrypoints/openai/test_openai_schema.py::test_openapi_stateless[POST /v1/completions] (verbose_name='POST /v1/completions') SUBFAIL
[2025-08-12T08:36:35Z] entrypoints/openai/test_openai_schema.py::test_openapi_stateless[POST /v1/embeddings] (verbose_name='POST /v1/embeddings') SUBFAIL
[2025-08-12T08:36:37Z] entrypoints/openai/test_openai_schema.py::test_openapi_stateless[POST /pooling] (verbose_name='POST /pooling') SUBFAIL
[2025-08-12T08:36:39Z] entrypoints/openai/test_openai_schema.py::test_openapi_stateless[POST /classify] (verbose_name='POST /classify') SUBFAIL
[2025-08-12T08:36:42Z] entrypoints/openai/test_openai_schema.py::test_openapi_stateless[POST /score] (verbose_name='POST /score') SUBFAIL
[2025-08-12T08:36:43Z] entrypoints/openai/test_openai_schema.py::test_openapi_stateless[POST /v1/score] (verbose_name='POST /v1/score') SUBFAIL
[2025-08-12T08:36:44Z] entrypoints/openai/test_openai_schema.py::test_openapi_stateless[POST /v1/audio/transcriptions] (verbose_name='POST /v1/audio/transcriptions') SUBFAIL
[2025-08-12T08:36:47Z] entrypoints/openai/test_openai_schema.py::test_openapi_stateless[POST /v1/audio/translations] (verbose_name='POST /v1/audio/translations') SUBFAIL
[2025-08-12T08:36:50Z] entrypoints/openai/test_openai_schema.py::test_openapi_stateless[POST /rerank] (verbose_name='POST /rerank') SUBFAIL
[2025-08-12T08:36:53Z] entrypoints/openai/test_openai_schema.py::test_openapi_stateless[POST /v1/rerank] (verbose_name='POST /v1/rerank') SUBFAIL
[2025-08-12T08:36:53Z] entrypoints/openai/test_openai_schema.py::test_openapi_stateless[POST /v2/rerank] (verbose_name='POST /v2/rerank') SUBFAIL
[2025-08-12T08:36:54Z] entrypoints/openai/test_openai_schema.py::test_openapi_stateless[POST /scale_elastic_ep] (verbose_name='POST /scale_elastic_ep') SUBFAIL
[2025-08-12T08:36:54Z] entrypoints/openai/test_openai_schema.py::test_openapi_stateless[POST /is_scaling_elastic_ep] (verbose_name='POST /is_scaling_elastic_ep') SUBFAIL
[2025-08-12T08:36:54Z] entrypoints/openai/test_openai_schema.py::test_openapi_stateless[POST /invocations] (verbose_name='POST /invocations') SUBFAIL
[2025-08-12T08:36:54Z] entrypoints/openai/test_openai_schema.py::test_openapi_stateless[GET /metrics] (verbose_name='GET /metrics') SUBFAIL

I think it's related to the API schema change? How do I properly update the schema (i.e., openapi.json)? And why do APIs like rerank fail? I didn't touch them at all.

@DarkLight1337
Member

Looks like a connection error, let me retry

@ultmaster
Author

No. Still no luck.

The two errors are:

[2025-08-12T11:25:58Z] ==================================== ERRORS ====================================
[2025-08-12T11:25:58Z] _ ERROR at setup of test_single_request[True-christian-pinto/Prithvi-EO-2.0-300M-TL-VLLM] _
[2025-08-12T11:25:58Z]
[2025-08-12T11:25:58Z]     @pytest.fixture(scope="module")
[2025-08-12T11:25:58Z]     def server():
[2025-08-12T11:25:58Z]         args = [
[2025-08-12T11:25:58Z]             "--runner",
[2025-08-12T11:25:58Z]             "pooling",
[2025-08-12T11:25:58Z]             # use half precision for speed and memory savings in CI environment
[2025-08-12T11:25:58Z]             "--dtype",
[2025-08-12T11:25:58Z]             DTYPE,
[2025-08-12T11:25:58Z]             "--enforce-eager",
[2025-08-12T11:25:58Z]             "--trust-remote-code",
[2025-08-12T11:25:58Z]             "--skip-tokenizer-init",
[2025-08-12T11:25:58Z]             "--max-num-seqs",
[2025-08-12T11:25:58Z]             "32"
[2025-08-12T11:25:58Z]         ]
[2025-08-12T11:25:58Z]
[2025-08-12T11:25:58Z] >       with RemoteOpenAIServer(MODEL_NAME, args) as remote_server:
[2025-08-12T11:25:58Z]
[2025-08-12T11:25:58Z] entrypoints/openai/test_skip_tokenizer.py:41:
[2025-08-12T11:25:58Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
[2025-08-12T11:25:58Z] utils.py:144: in __init__
[2025-08-12T11:25:58Z]     self._wait_for_server(url=self.url_for("health"),
[2025-08-12T11:25:58Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
[2025-08-12T11:25:58Z]
[2025-08-12T11:25:58Z] self = <tests.utils.RemoteOpenAIServer object at 0x7f2e0fec34d0>
[2025-08-12T11:25:58Z]
[2025-08-12T11:25:58Z]     def _wait_for_server(self, *, url: str, timeout: float):
[2025-08-12T11:25:58Z]         # run health check
[2025-08-12T11:25:58Z]         start = time.time()
[2025-08-12T11:25:58Z]         client = (httpx.Client(transport=httpx.HTTPTransport(
[2025-08-12T11:25:58Z]             uds=self.uds)) if self.uds else requests)
[2025-08-12T11:25:58Z]         while True:
[2025-08-12T11:25:58Z]             try:
[2025-08-12T11:25:58Z]                 if client.get(url).status_code == 200:
[2025-08-12T11:25:58Z]                     break
[2025-08-12T11:25:58Z]             except Exception:
[2025-08-12T11:25:58Z]                 # this exception can only be raised by requests.get,
[2025-08-12T11:25:58Z]                 # which means the server is not ready yet.
[2025-08-12T11:25:58Z]                 # the stack trace is not useful, so we suppress it
[2025-08-12T11:25:58Z]                 # by using `raise from None`.
[2025-08-12T11:25:58Z]                 result = self.proc.poll()
[2025-08-12T11:25:58Z]                 if result is not None and result != 0:
[2025-08-12T11:25:58Z] >                   raise RuntimeError("Server exited unexpectedly.") from None
[2025-08-12T11:25:58Z] E                   RuntimeError: Server exited unexpectedly.
[2025-08-12T11:25:58Z]
[2025-08-12T11:25:58Z] utils.py:174: RuntimeError
[2025-08-12T11:25:58Z] ------------------------------ Captured log setup ------------------------------
[2025-08-12T11:25:58Z] WARNING  transformers.configuration_utils:configuration_utils.py:684 The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
[2025-08-12T11:25:58Z] _ ERROR at setup of test_single_request[False-christian-pinto/Prithvi-EO-2.0-300M-TL-VLLM] _
[2025-08-12T11:25:58Z]
[2025-08-12T11:25:58Z]     @pytest.fixture(scope="module")
[2025-08-12T11:25:58Z]     def server():
[2025-08-12T11:25:58Z]         args = [
[2025-08-12T11:25:58Z]             "--runner",
[2025-08-12T11:25:58Z]             "pooling",
[2025-08-12T11:25:58Z]             # use half precision for speed and memory savings in CI environment
[2025-08-12T11:25:58Z]             "--dtype",
[2025-08-12T11:25:58Z]             DTYPE,
[2025-08-12T11:25:58Z]             "--enforce-eager",
[2025-08-12T11:25:58Z]             "--trust-remote-code",
[2025-08-12T11:25:58Z]             "--skip-tokenizer-init",
[2025-08-12T11:25:58Z]             "--max-num-seqs",
[2025-08-12T11:25:58Z]             "32"
[2025-08-12T11:25:58Z]         ]
[2025-08-12T11:25:58Z]
[2025-08-12T11:25:58Z] >       with RemoteOpenAIServer(MODEL_NAME, args) as remote_server:
[2025-08-12T11:25:58Z]
[2025-08-12T11:25:58Z] entrypoints/openai/test_skip_tokenizer.py:41:
[2025-08-12T11:25:58Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
[2025-08-12T11:25:58Z] utils.py:144: in __init__
[2025-08-12T11:25:58Z]     self._wait_for_server(url=self.url_for("health"),
[2025-08-12T11:25:58Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
[2025-08-12T11:25:58Z]
[2025-08-12T11:25:58Z] self = <tests.utils.RemoteOpenAIServer object at 0x7f2e0fec34d0>
[2025-08-12T11:25:58Z]
[2025-08-12T11:25:58Z]     def _wait_for_server(self, *, url: str, timeout: float):
[2025-08-12T11:25:58Z]         # run health check
[2025-08-12T11:25:58Z]         start = time.time()
[2025-08-12T11:25:58Z]         client = (httpx.Client(transport=httpx.HTTPTransport(
[2025-08-12T11:25:58Z]             uds=self.uds)) if self.uds else requests)
[2025-08-12T11:25:58Z]         while True:
[2025-08-12T11:25:58Z]             try:
[2025-08-12T11:25:58Z]                 if client.get(url).status_code == 200:
[2025-08-12T11:25:58Z]                     break
[2025-08-12T11:25:58Z]             except Exception:
[2025-08-12T11:25:58Z]                 # this exception can only be raised by requests.get,
[2025-08-12T11:25:58Z]                 # which means the server is not ready yet.
[2025-08-12T11:25:58Z]                 # the stack trace is not useful, so we suppress it
[2025-08-12T11:25:58Z]                 # by using `raise from None`.
[2025-08-12T11:25:58Z]                 result = self.proc.poll()
[2025-08-12T11:25:58Z]                 if result is not None and result != 0:
[2025-08-12T11:25:58Z] >                   raise RuntimeError("Server exited unexpectedly.") from None
[2025-08-12T11:25:58Z] E                   RuntimeError: Server exited unexpectedly.
[2025-08-12T11:25:58Z]
[2025-08-12T11:25:58Z] utils.py:174: RuntimeError
[2025-08-12T11:25:58Z] =================================== FAILURES ===================================
[2025-08-12T11:25:58Z] ______ test_openapi_stateless (verbose_name='POST /v1/chat/completions') _______
[2025-08-12T11:25:58Z]
[2025-08-12T11:25:58Z]     @wraps(test)
[2025-08-12T11:25:58Z] >   def test_function(*args: Any, **kwargs: Any) -> Any:
[2025-08-12T11:25:58Z] E   schemathesis.exceptions.CheckFailed: Schemathesis found 2 distinct sets of failures.
[2025-08-12T11:25:58Z] E   ====================
[2025-08-12T11:25:58Z] E   
[2025-08-12T11:25:58Z] E   self = <OpenApi30 for FastAPI 0.1.0>
[2025-08-12T11:25:58Z] E   operation = APIOperation(path='/v1/chat/completions', method='post', schema=<OpenApi30 for FastAPI 0.1.0>, verbose_name='POST /v1/...est'}}, media_type='application/json', required=True, description=None)]), case_cls=<class 'schemathesis.models.Case'>)
[2025-08-12T11:25:58Z] E   response = <Response [500]>
[2025-08-12T11:25:58Z] E   
[2025-08-12T11:25:58Z] E       def validate_response(self, operation: APIOperation, response: GenericResponse) -> bool | None:
[2025-08-12T11:25:58Z] E           responses = {str(key): value for key, value in operation.definition.raw.get("responses", {}).items()}
[2025-08-12T11:25:58Z] E           status_code = str(response.status_code)
[2025-08-12T11:25:58Z] E           if status_code in responses:
[2025-08-12T11:25:58Z] E               definition = responses[status_code]
[2025-08-12T11:25:58Z] E           elif "default" in responses:

I'm still getting the 16 connection failures.

@DarkLight1337
Member

The failing test about christian-pinto/Prithvi-EO-2.0-300M-TL-VLLM is not caused by this PR. Maybe the rest failed because the memory didn't get cleared after the first failure

@ultmaster
Author

@DarkLight1337 Thank you. If there is any more action that needs to be done on my side, please let me know!

@DarkLight1337
Member

That issue should be fixed in latest main so you can try merging from main again

@ultmaster
Author

ultmaster commented Aug 12, 2025

Still:

[2025-08-12T15:07:59Z] = 16 failed, 523 passed, 32 skipped, 1 xfailed, 48 warnings, 11 subtests passed in 5579.35s (1:32:59) =

A lot of connection errors.

christian-pinto/Prithvi-EO-2.0-300M-TL-VLLM seems to be fixed.

@DarkLight1337
Member

Can you run those tests locally and see if they fail as well?

Comment on lines +401 to +405
# has_echoed[i] is reused here to indicate whether
# we have already returned the prompt token IDs.
if not has_echoed[i]:
    prompt_token_ids_to_return = prompt_token_ids
    has_echoed[i] = True
Member


So we always return the prompt_token_ids if return_token_ids is set, even if echo isn't set?

Maybe it would be better to only return them if echo is True?
