Bug Description
When I try to use a draft model for speculative decoding, as described in #1022, I hit another unexpected error. All the configs I use are the same as in #1022.
The error is:
[2025-12-18T16:24:02+08:00] leihaodi-qwen3-8b-rl-eagle3-20251218-162008s-3da56-cfcd2 >> (SGLangEngine pid=3111) [2025-12-18 08:24:01 TP7] Scheduler hit an exception: Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2683, in run_scheduler_process
    scheduler.event_loop_normal()
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 974, in event_loop_normal
    self.process_input_requests(recv_reqs)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1187, in process_input_requests
    output = self._request_dispatcher(recv_req)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/utils.py", line 507, in __call__
    return fn(obj)
           ^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler_update_weights_mixin.py", line 88, in update_weights_from_tensor
    success, message = worker.update_weights_from_tensor(recv_req)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/speculative/eagle_worker.py", line 1029, in update_weights_from_tensor
    success, message = self.model_runner.update_weights_from_tensor(
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 1285, in update_weights_from_tensor
    return self._update_weights_from_flattened_bucket(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 1336, in _update_weights_from_flattened_bucket
    self.model.load_weights(reconstructed_tensors)
  File "/sgl-workspace/sglang/python/sglang/srt/models/llama_eagle3.py", line 270, in load_weights
    weight_loader(param, loaded_weight)
  File "/sgl-workspace/sglang/python/sglang/srt/layers/vocab_parallel_embedding.py", line 452, in weight_loader
    assert loaded_weight.shape[output_dim] == (
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: self.org_vocab_size=32000 self.use_presharded_weights=False loaded_weight.shape[output_dim]=151936
It seems the SGLangEngine does not get the right vocab_size for the draft model, because the draft_vocab_size is usually smaller than the target_vocab_size.
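For reference, a minimal way to see the two numbers the assertion compares, independent of the engine (the paths below are placeholders, not the actual checkpoints from #1022, and draft_vocab_size is only read if the draft config happens to carry such a field):

# Sanity-check sketch: compare the vocab size the draft config reports with the
# target model's. The failing assertion in vocab_parallel_embedding.py compares
# exactly this pair (32000 vs. 151936 in the traceback above).
from transformers import AutoConfig

draft_path = "/path/to/eagle3-draft-checkpoint"  # placeholder
target_path = "/path/to/Qwen3-8B"                # placeholder

draft_cfg = AutoConfig.from_pretrained(draft_path)
target_cfg = AutoConfig.from_pretrained(target_path)

# Some EAGLE3 draft configs carry a separate draft_vocab_size; fall back to vocab_size.
print("draft vocab_size :", getattr(draft_cfg, "draft_vocab_size", None) or draft_cfg.vocab_size)
print("target vocab_size:", target_cfg.vocab_size)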
Surprisingly, simply changing the --sglang-speculative-algorithm parameter from EAGLE3 to EAGLE restored normal program operation, yet the decode phase consistently reported an accept len of 1.00, which defeats the purpose of speculative decoding. Like this:
Decode batch, #running-req: 16, #token: 63716, token usage: 0.14, accept len: 1.00, accept rate: 0.25, cuda graph: True, gen throughput (token/s): 1279.93, #queue-req: 0
This suggests the draft model weights were never properly loaded when using EAGLE as the parameter.
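To separate the two possibilities, the same target/draft pair can be served with sglang's offline Engine outside the RL loop. This is only a sketch: the paths are placeholders and the keyword names mirror sglang's speculative server arguments, which may differ across versions. If the accept length printed in the engine logs is healthy here, then the accept len of 1.00 during rollout points at the weight sync rather than at the draft checkpoint itself.

# Offline sanity check (sketch): run speculative decoding on static weights,
# then compare the "accept len" the engine logs against the 1.00 seen in rollout.
import sglang as sgl

llm = sgl.Engine(
    model_path="/path/to/Qwen3-8B",                       # placeholder target
    speculative_algorithm="EAGLE",                        # the variant that runs but shows accept len 1.00
    speculative_draft_model_path="/path/to/eagle-draft",  # placeholder draft
    speculative_num_steps=3,
    speculative_eagle_topk=1,
    speculative_num_draft_tokens=4,
)

outputs = llm.generate(
    ["The capital of France is"],
    {"temperature": 0, "max_new_tokens": 64},
)
print(outputs)
llm.shutdown()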
Steps for reproducing the bug
#1022 provides a complete reproduction pipeline, and the configuration I used is identical to the one there.
Expected behavior
When using --sglang-speculative-algorithm EAGLE3, the draft model weights should load correctly and the RL rollout process should proceed normally. The accept len should be much longer than 1.