Summary
In async rollout mode (actor_rollout_ref.rollout.mode=async), the agent loop calls the vLLM async server (vLLMHttpServerBase.generate). The server currently computes max_tokens for each vLLM generation internally in a way that can exceed the intended per-generation token budget. When EOS is not emitted early, this leads to unnecessary over-generation and large latency spikes. This is most visible with short prompts.
Where it happens (buggy computation)
verl/workers/rollout/vllm_rollout/vllm_async_server.py, class vLLMHttpServerBase, async def generate(...)
The server computes (or computed) something equivalent to:
max_tokens = self.config.max_model_len - len(prompt_ids)
and then uses:
SamplingParams(max_tokens=max_tokens, **sampling_params)
Given max_model_len is derived from rollout config (commonly prompt_length + response_length), for short prompts this can make max_tokens significantly larger than the intended per-call generation budget.
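For a concrete sense of the gap, here is a small worked example with hypothetical numbers (prompt_length=2048, response_length=1024; these are illustrative values, not taken from any particular config):

```python
# Hypothetical numbers for illustration only:
prompt_length = 2048                               # rollout.prompt_length
response_length = 1024                             # rollout.response_length (intended per-call budget)
max_model_len = prompt_length + response_length    # commonly derived as 3072

prompt_ids = list(range(50))                       # a short prompt of 50 tokens

buggy_max_tokens = max_model_len - len(prompt_ids)   # 3072 - 50 = 3022
intended_max_tokens = response_length                # 1024

print(buggy_max_tokens, intended_max_tokens)       # 3022 vs 1024: ~3x the intended budget
```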
Why this is incorrect
In async rollout, the intended per-call max_tokens should generally be:
- Single-turn: bounded by rollout.response_length (usually tied to data.max_response_length).
- Multi-step / tool flows: bounded by the remaining response budget, e.g. remaining = max_response_length - len(generated_ids + tool/observation_ids)
Only the agent loop has the state needed to compute this remaining budget correctly. The vLLM server does not know how many response/tool tokens have already been produced in the current trajectory unless the caller passes it explicitly.
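A minimal sketch of that remaining-budget computation on the agent-loop side (the function and argument names here are illustrative, not actual verl identifiers):

```python
def remaining_response_budget(max_response_length: int,
                              generated_ids: list[int],
                              observation_ids: list[int]) -> int:
    """Per-trajectory budget left for the next generation call.

    Only the agent loop can compute this: it knows how many response and
    tool/observation tokens the current trajectory has already consumed.
    """
    used = len(generated_ids) + len(observation_ids)
    return max(max_response_length - used, 0)   # never request a negative budget
```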
Reproduction (minimal)
- Run async rollout: actor_rollout_ref.rollout.mode=async, actor_rollout_ref.rollout.name=vllm
- Use a short prompt that tends not to emit EOS early.
- Run two experiments with identical setup except:
  - data.max_response_length = 32
  - data.max_response_length = 1024
- Observe: async generation latency does not scale as expected with max_response_length (the “32” run can still be slow), consistent with max_tokens being larger than the intended per-generation budget.
Additional verification signal (recommended)
Log/inspect the number of tokens vLLM actually generated per request (e.g., len(token_ids) returned by the server). In the broken behavior, this can be much larger than the intended per-call budget on short prompts.
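One way to collect that signal is to wrap the server call and log the generated length. This sketch assumes the generate call takes request_id, prompt_ids, and sampling_params and returns the generated token ids; the exact signature and return shape of vLLMHttpServerBase.generate may differ:

```python
import logging
import time

logger = logging.getLogger("async_rollout_debug")

async def generate_and_log(server, request_id, prompt_ids, sampling_params):
    """Call the async vLLM server and log how many tokens it actually produced."""
    start = time.monotonic()
    token_ids = await server.generate(
        request_id=request_id,
        prompt_ids=prompt_ids,
        sampling_params=sampling_params,
    )
    logger.info(
        "request=%s prompt_len=%d generated=%d requested_max_tokens=%s latency=%.2fs",
        request_id,
        len(prompt_ids),
        len(token_ids),
        sampling_params.get("max_tokens"),
        time.monotonic() - start,
    )
    return token_ids
```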
Expected behavior
max_tokens should reflect the intended per-call generation budget:
- Agent loop computes the requested budget (requested_max_tokens) based on configuration and the remaining response budget.
- Server uses requested_max_tokens, only applying safety clamping (e.g., to the remaining context window), but does not attempt to infer the remaining response budget from prompt_ids alone.
Proposed fix
- Agent loop: compute and pass an explicit per-request sampling_params["max_tokens"]:
  - single turn: max_tokens = rollout.response_length
  - multi-step/tool: max_tokens = max_response_length - len(generated_ids + tool_ids) (clamped at >= 0)
- vLLM server: honor sampling_params["max_tokens"] and clamp it to any hard constraint: max_tokens = min(requested_max_tokens, max_model_len - len(prompt_ids))
Scope
This issue concerns the vLLM async server path used by AgentLoop (vLLMHttpServerBase.generate). It can be reproduced even with single_turn_agent (no tools) on short prompts if max_tokens is derived from max_model_len - len(prompt_ids) instead of the configured response budget.