
[vllm, rollout, cfg, doc] feat: Accelerate RL rollouts with EAGLE/EAGLE3 speculative decoding#5925

Open
alekseymalakhov11 wants to merge 8 commits into verl-project:main from alekseymalakhov11:add-eagle-speculative-decoding

Conversation

@alekseymalakhov11

What does this PR do?

Add inference-only speculative decoding for vLLM rollouts using EAGLE / EAGLE3 draft models. This PR adds the vLLM rollout/config wiring for the new speculative decoding path, reload handling for the draft model during weight sync, rollout metrics for speculative decoding, and documentation.

Checklist Before Starting

  • Search for similar PRs. Paste at least one query link here: [rollout] feat: support eagle3 speculative decode in rollout
    [rollout] feat: support eagle3 speculative decode in rollout #5509

  • Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)

    • {modules} include fsdp, megatron, veomni, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, cfg, reward, fully_async, one_step_off
    • If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

[Screenshots (2026-04-08): reward curves and speculative decoding metrics from the test runs]

We used Qwen3-8B together with RedHatAI/Qwen3-8B-speculator.eagle3 as the draft model on verl-team/lighteval-MATH-preprocessed.

In our experiments, the final reward is comparable to the baseline. The screenshots also highlight two implementation details that are important for making this work correctly in RL rollout: reloading the draft-model weights after actor weight sync, and rebuilding the RoPE cos_sin cache after reload. They also show speculative decoding metrics such as acceptance rate and mean acceptance length, so the behavior and performance of EAGLE/EAGLE3 can be analyzed directly during training.

In some settings, the draft model's performance may degrade (its acceptance length drops). Even with that limitation, this change opens a useful direction for more practical research on speculative decoding in RL for LLMs, and makes it easier to explore follow-up ideas such as retraining or adapting the draft model.
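As background for why acceptance length matters, here is an illustrative back-of-the-envelope sketch (the helper name and the numbers are assumptions for illustration, not measurements from this PR): each target-model verification step emits the accepted draft tokens plus one token sampled by the target model itself, so mean acceptance length directly bounds the per-step token yield.

```python
def expected_tokens_per_step(mean_accepted: float) -> float:
    """Each target-model forward verifies the draft and emits the
    accepted draft tokens plus one token from the target model."""
    return mean_accepted + 1.0

# If the draft model averages 2 accepted tokens per step, each target
# forward yields ~3 tokens instead of 1 -- an upper bound on speedup,
# before accounting for the draft model's own overhead.
yield_per_step = expected_tokens_per_step(2.0)  # -> 3.0
```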

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

# Add code snippet or script demonstrating how to use this
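A hypothetical sketch of what the rollout configuration might look like. Apart from `speculative_decoding`, `draft_tensor_parallel_size`, and `tensor_model_parallel_size`, which appear in the diff, the key names and values below are assumptions, not the PR's actual schema:

```python
# Hypothetical rollout config sketch; exact key names may differ from the PR.
rollout_config = {
    "tensor_model_parallel_size": 1,
    "speculative_decoding": {
        "method": "eagle3",  # assumed key: EAGLE or EAGLE3
        # Draft model used in the Test section of this PR.
        "model": "RedHatAI/Qwen3-8B-speculator.eagle3",
        "num_speculative_tokens": 3,  # assumed key
        # Validated by the config: must be 1 or equal to the target
        # model's tensor_model_parallel_size.
        "draft_tensor_parallel_size": 1,
    },
}
```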

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

Contributor

@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces support for inference-only speculative decoding in vLLM rollouts, including the necessary configuration updates, metrics tracking, and weight-loading adjustments. I have identified two issues: incorrect handling of the acceptance rate when no draft tokens are generated, and a malformed f-string in a validation error message.

Comment on lines +1180 to +1182
spec_delta["num_accepted_tokens"] / spec_delta["num_draft_tokens"]
if spec_delta["num_draft_tokens"] > 0
else float("inf")

high

The calculation for acceptance_rate when spec_delta["num_draft_tokens"] is 0 results in float("inf"). This is likely incorrect and can cause issues with metric aggregation (e.g., np.mean over inf will be inf). When no draft tokens are generated, the acceptance rate should be 0.0.

Suggested change
- spec_delta["num_accepted_tokens"] / spec_delta["num_draft_tokens"]
- if spec_delta["num_draft_tokens"] > 0
- else float("inf")
+ spec_delta["num_accepted_tokens"] / spec_delta["num_draft_tokens"]
+ if spec_delta["num_draft_tokens"] > 0
+ else 0.0
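For reference, the guarded metric can be factored into a small standalone helper. This is a sketch; the `acceptance_rate` name and signature are illustrative, not the PR's code:

```python
def acceptance_rate(num_accepted_tokens: int, num_draft_tokens: int) -> float:
    """Fraction of draft tokens the target model accepted.

    Returns 0.0 when no draft tokens were proposed, so aggregates
    such as np.mean over a batch of rates stay finite.
    """
    if num_draft_tokens > 0:
        return num_accepted_tokens / num_draft_tokens
    return 0.0

acceptance_rate(6, 8)  # -> 0.75
acceptance_rate(0, 0)  # -> 0.0, not inf
```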

Comment on lines +369 to +373
raise ValueError(
f"draft_tensor_parallel_size={self.speculative_decoding.draft_tensor_parallel_size} "
"cannot be other value than 1 or target model "
"tensor_parallel_size={self.tensor_model_parallel_size} "
)

high

The f-string for the ValueError message is malformed. The variable self.tensor_model_parallel_size is outside the curly braces, so it will be printed literally instead of its value being interpolated. This will produce a confusing error message for users.

Suggested change
- raise ValueError(
-     f"draft_tensor_parallel_size={self.speculative_decoding.draft_tensor_parallel_size} "
-     "cannot be other value than 1 or target model "
-     "tensor_parallel_size={self.tensor_model_parallel_size} "
- )
+ raise ValueError(
+     f"draft_tensor_parallel_size={self.speculative_decoding.draft_tensor_parallel_size} "
+     "cannot be other value than 1 or target model "
+     f"tensor_parallel_size={self.tensor_model_parallel_size}"
+ )
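A minimal standalone demonstration of the pitfall, with illustrative values rather than the PR's code: only string literals carrying the `f` prefix interpolate `{...}` placeholders, and in implicitly concatenated literals each piece needs its own prefix.

```python
tensor_model_parallel_size = 4

# Without the f prefix, the placeholder is kept literally:
broken = "tensor_parallel_size={tensor_model_parallel_size}"
# -> 'tensor_parallel_size={tensor_model_parallel_size}'

# With the f prefix, the value is interpolated:
fixed = f"tensor_parallel_size={tensor_model_parallel_size}"
# -> 'tensor_parallel_size=4'
```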
