Async rollout (AgentLoop + vLLM async server): max_tokens per generation is computed incorrectly, causing over-generation and slowdowns #4568

@roeehendel

Summary

In async rollout mode (actor_rollout_ref.rollout.mode=async), the agent loop calls the vLLM async server (vLLMHttpServerBase.generate). The server currently computes max_tokens for each vLLM generation internally in a way that can exceed the intended per-generation token budget. When EOS is not emitted early, this leads to unnecessary over-generation and large latency spikes. This is most visible with short prompts.

Where it happens (buggy computation)

  • verl/workers/rollout/vllm_rollout/vllm_async_server.py
    • class vLLMHttpServerBase
    • async def generate(...)

The server computes something equivalent to:

  • max_tokens = self.config.max_model_len - len(prompt_ids)

and then uses:

  • SamplingParams(max_tokens=max_tokens, **sampling_params)

Given that max_model_len is derived from the rollout config (commonly prompt_length + response_length), this can make max_tokens significantly larger than the intended per-call generation budget for short prompts.
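
For concreteness, a runnable numeric sketch of the gap (all values are hypothetical; the 1024/1024 split of max_model_len is just an example):

    # Hypothetical numbers for illustration only.
    max_model_len = 2048          # e.g. prompt_length (1024) + response_length (1024)
    response_length = 1024        # intended per-call generation budget
    prompt_ids = list(range(50))  # a short prompt of 50 tokens

    # Current server-side computation (as described above):
    max_tokens_buggy = max_model_len - len(prompt_ids)    # 1998

    # Intended budget for a single-turn call:
    max_tokens_intended = response_length                 # 1024

    print(max_tokens_buggy, max_tokens_intended)          # 1998 vs 1024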

Why this is incorrect

In async rollout, the intended per-call max_tokens should generally be:

  • Single-turn: bounded by rollout.response_length (usually tied to data.max_response_length).
  • Multi-step / tool flows: bounded by the remaining response budget, e.g.
    remaining = max_response_length - len(generated_ids + tool/observation_ids)

Only the agent loop has the state needed to compute this remaining budget correctly. The vLLM server does not know how many response/tool tokens have already been produced in the current trajectory unless the caller passes it explicitly.
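
A minimal sketch of how the agent loop could compute this remaining budget (remaining_response_budget is a hypothetical helper, not an existing verl API):

    from typing import Sequence

    def remaining_response_budget(
        max_response_length: int,
        generated_ids: Sequence[int],
        tool_or_obs_ids: Sequence[int],
    ) -> int:
        # Only the agent loop holds these sequences, so only it can compute
        # the remaining per-call budget.
        used = len(generated_ids) + len(tool_or_obs_ids)
        return max(max_response_length - used, 0)

    # Example: 1024-token budget, with 700 response tokens and 100
    # tool/observation tokens already produced -> 224 tokens left.
    assert remaining_response_budget(1024, [0] * 700, [0] * 100) == 224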

Reproduction (minimal)

  1. Run async rollout:
    • actor_rollout_ref.rollout.mode=async
    • actor_rollout_ref.rollout.name=vllm
  2. Use a short prompt that tends not to emit EOS early.
  3. Run two experiments with identical setup except:
    • data.max_response_length = 32
    • data.max_response_length = 1024
  4. Observe: async generation latency does not scale as expected with max_response_length (the “32” run can still be slow), consistent with max_tokens being larger than the intended per-generation budget.

Additional verification signal (recommended)

Log/inspect the number of tokens vLLM actually generated per request (e.g., len(token_ids) returned by the server). In the broken behavior, this can be much larger than the intended per-call budget on short prompts.
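
A hedged sketch of such a caller-side check (illustrative only; adapt the field access to the actual server response schema):

    import logging

    logger = logging.getLogger(__name__)

    def log_generation_length(token_ids, requested_max_tokens: int) -> None:
        # Compare the number of tokens the server actually returned against
        # the intended per-call budget and flag over-generation.
        n_generated = len(token_ids)
        if n_generated > requested_max_tokens:
            logger.warning(
                "over-generation: got %d tokens, per-call budget was %d",
                n_generated,
                requested_max_tokens,
            )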

Expected behavior

max_tokens should reflect the intended per-call generation budget:

  • The agent loop computes the requested budget (requested_max_tokens) from its configuration and the remaining response budget.
  • The server uses requested_max_tokens, applying only safety clamping (e.g., to the remaining context window), and does not attempt to infer the remaining response budget from prompt_ids alone.

Proposed fix

  1. Agent loop: compute and pass an explicit per-request sampling_params["max_tokens"]:
    • single turn: max_tokens = rollout.response_length
    • multi-step/tool: max_tokens = max_response_length - len(generated_ids + tool_ids) (clamped at >= 0)
  2. vLLM server: honor sampling_params["max_tokens"] and clamp it to any hard constraint:
    • max_tokens = min(requested_max_tokens, max_model_len - len(prompt_ids))
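
A minimal sketch of the server-side clamping under this split of responsibilities (effective_max_tokens is a hypothetical helper, not the actual vLLMHttpServerBase code):

    def effective_max_tokens(
        requested_max_tokens: int,
        max_model_len: int,
        prompt_ids: list,
    ) -> int:
        # Honor the caller's requested budget; clamp only to hard limits
        # such as the remaining context window.
        context_remaining = max_model_len - len(prompt_ids)
        return max(min(requested_max_tokens, context_remaining), 0)

    # Single-turn example: 50-token prompt, response_length=1024, max_model_len=2048.
    # The caller requests 1024 and the server keeps it (1024 <= 1998) instead of
    # silently inflating the budget to 1998.
    assert effective_max_tokens(1024, 2048, list(range(50))) == 1024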

Scope

This issue concerns the vLLM async server path used by AgentLoop (vLLMHttpServerBase.generate). It reproduces even with single_turn_agent (no tools) on short prompts when max_tokens is derived from max_model_len - len(prompt_ids) instead of from the configured response budget.
