Summary
In async rollout mode (actor_rollout_ref.rollout.mode=async), the agent loop calls the vLLM async server (vLLMHttpServerBase.generate). The server currently computes max_tokens for each vLLM generation internally in a way that can exceed the intended per-generation token budget. When EOS is not emitted early, this leads to unnecessary over-generation and large latency spikes. This is most visible with short prompts.
Where it happens (buggy computation)
verl/workers/rollout/vllm_rollout/vllm_async_server.py, class vLLMHttpServerBase, async def generate(...)
The server computes (or computed) something equivalent to:
max_tokens = self.config.max_model_len - len(prompt_ids)
and then uses:
SamplingParams(max_tokens=max_tokens, **sampling_params)
Given max_model_len is derived from rollout config (commonly prompt_length + response_length), for short prompts this can make max_tokens significantly larger than the intended per-call generation budget.
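For a concrete sense of the gap, here is a small worked example with hypothetical numbers (prompt_length=2048, response_length=1024; these are illustrative values, not taken from any particular config):

```python
# Hypothetical numbers for illustration only:
prompt_length = 2048                               # rollout.prompt_length
response_length = 1024                             # rollout.response_length (intended per-call budget)
max_model_len = prompt_length + response_length    # commonly derived as 3072

prompt_ids = list(range(50))                       # a short prompt of 50 tokens

buggy_max_tokens = max_model_len - len(prompt_ids)   # 3072 - 50 = 3022
intended_max_tokens = response_length                # 1024

print(buggy_max_tokens, intended_max_tokens)       # 3022 vs 1024: ~3x the intended budget
```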
Why this is incorrect
In async rollout, the intended per-call max_tokens should generally be:
- Single-turn: bounded by rollout.response_length (usually tied to data.max_response_length).
- Multi-step / tool flows: bounded by the remaining response budget, e.g. remaining = max_response_length - len(generated_ids + tool/observation_ids)
Only the agent loop has the state needed to compute this remaining budget correctly. The vLLM server does not know how many response/tool tokens have already been produced in the current trajectory unless the caller passes it explicitly.
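A minimal sketch of that remaining-budget computation on the agent-loop side (the function and argument names here are illustrative, not actual verl identifiers):

```python
def remaining_response_budget(max_response_length: int,
                              generated_ids: list[int],
                              observation_ids: list[int]) -> int:
    """Per-trajectory budget left for the next generation call.

    Only the agent loop can compute this: it knows how many response and
    tool/observation tokens the current trajectory has already consumed.
    """
    used = len(generated_ids) + len(observation_ids)
    return max(max_response_length - used, 0)   # never request a negative budget
```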
Reproduction (minimal)
- Run async rollout: actor_rollout_ref.rollout.mode=async, actor_rollout_ref.rollout.name=vllm
- Use a short prompt that tends not to emit EOS early.
- Run two experiments with identical setup except:
  - data.max_response_length = 32
  - data.max_response_length = 1024
- Observe: async generation latency does not scale as expected with max_response_length (the “32” run can still be slow), consistent with max_tokens being larger than the intended per-generation budget.
Additional verification signal (recommended)
Log/inspect the number of tokens vLLM actually generated per request (e.g., len(token_ids) returned by the server). In the broken behavior, this can be much larger than the intended per-call budget on short prompts.
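One way to collect that signal is to wrap the server call and log the generated length. This sketch assumes the generate call takes request_id, prompt_ids, and sampling_params and returns the generated token ids; the exact signature and return shape of vLLMHttpServerBase.generate may differ:

```python
import logging
import time

logger = logging.getLogger("async_rollout_debug")

async def generate_and_log(server, request_id, prompt_ids, sampling_params):
    """Call the async vLLM server and log how many tokens it actually produced."""
    start = time.monotonic()
    token_ids = await server.generate(
        request_id=request_id,
        prompt_ids=prompt_ids,
        sampling_params=sampling_params,
    )
    logger.info(
        "request=%s prompt_len=%d generated=%d requested_max_tokens=%s latency=%.2fs",
        request_id,
        len(prompt_ids),
        len(token_ids),
        sampling_params.get("max_tokens"),
        time.monotonic() - start,
    )
    return token_ids
```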
Expected behavior
max_tokens should reflect the intended per-call generation budget:
- Agent loop computes the requested budget (requested_max_tokens) based on configuration and the remaining response budget.
- Server uses requested_max_tokens, only applying safety clamping (e.g., to the remaining context window), but does not attempt to infer the remaining response budget from prompt_ids alone.
Proposed fix
- Agent loop: compute and pass an explicit per-request sampling_params["max_tokens"]:
  - single turn: max_tokens = rollout.response_length
  - multi-step/tool: max_tokens = max_response_length - len(generated_ids + tool_ids) (clamped at >= 0)
- vLLM server: honor sampling_params["max_tokens"] and clamp it to any hard constraint: max_tokens = min(requested_max_tokens, max_model_len - len(prompt_ids))
Scope
This issue concerns the vLLM async server path used by AgentLoop (vLLMHttpServerBase.generate). It can be reproduced even with single_turn_agent (no tools) on short prompts if max_tokens is derived from max_model_len - len(prompt_ids) instead of the configured response budget.