
Bug when using an external draft model with --sglang-speculative-algorithm EAGLE3 #1177

@bingyang-lei

Bug Description

When I try to use a draft model for speculative decoding, as described in #1022, I hit another unexpected error; all the configs I use are the same as in #1022.
The error is:

[2025-12-18T16:24:02+08:00] leihaodi-qwen3-8b-rl-eagle3-20251218-162008s-3da56-cfcd2 >> (SGLangEngine pid=3111) [2025-12-18 08:24:01 TP7] Scheduler hit an exception: Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2683, in run_scheduler_process
    scheduler.event_loop_normal()
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 974, in event_loop_normal
    self.process_input_requests(recv_reqs)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1187, in process_input_requests
    output = self._request_dispatcher(recv_req)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/utils.py", line 507, in __call__
    return fn(obj)
           ^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler_update_weights_mixin.py", line 88, in update_weights_from_tensor
    success, message = worker.update_weights_from_tensor(recv_req)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/speculative/eagle_worker.py", line 1029, in update_weights_from_tensor
    success, message = self.model_runner.update_weights_from_tensor(
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 1285, in update_weights_from_tensor
    return self._update_weights_from_flattened_bucket(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 1336, in _update_weights_from_flattened_bucket
    self.model.load_weights(reconstructed_tensors)
  File "/sgl-workspace/sglang/python/sglang/srt/models/llama_eagle3.py", line 270, in load_weights
    weight_loader(param, loaded_weight)
  File "/sgl-workspace/sglang/python/sglang/srt/layers/vocab_parallel_embedding.py", line 452, in weight_loader
    assert loaded_weight.shape[output_dim] == (
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: self.org_vocab_size=32000 self.use_presharded_weights=False loaded_weight.shape[output_dim]=151936

It seems the SGLangEngine does not get the right vocab_size for the draft model, since the draft vocab_size is usually smaller than the target vocab_size (here the loader expects org_vocab_size=32000 while the incoming weight has 151936 rows).
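For context, the assertion that fires is essentially a shape check between the embedding parameter the engine registered at startup and the tensor pushed in during the weight update. Below is a minimal sketch of that check, using the sizes from the error message; it simplifies away the padding and tensor-parallel sharding that the real weight_loader in vocab_parallel_embedding.py handles, so the names and dimensions here are only illustrative:

import torch

# Illustrative sizes taken from the assertion message above.
org_vocab_size = 32000                                     # vocab size the engine registered for the embedding
output_dim = 0                                             # dimension of the weight that indexes the vocabulary
loaded_weight = torch.empty(151936, 4096, device="meta")   # tensor arriving via the weight update

# Simplified form of the shape check: the incoming tensor must match the
# vocab size the parameter was constructed with.
if loaded_weight.shape[output_dim] != org_vocab_size:
    print(
        f"mismatch: org_vocab_size={org_vocab_size}, "
        f"loaded_weight.shape[output_dim]={loaded_weight.shape[output_dim]}"
    )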

Surprisingly, simply changing the --sglang-speculative-algorithm parameter from EAGLE3 to EAGLE restores normal operation, yet the decode phase consistently reports an accept len of 1.00, which defeats the purpose of speculative decoding. Like this:

Decode batch, #running-req: 16, #token: 63716, token usage: 0.14, accept len: 1.00, accept rate: 0.25, cuda graph: True, gen throughput (token/s): 1279.93, #queue-req: 0

This suggests the draft model weights are never properly loaded when EAGLE is used.
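One way to sanity-check this is to fingerprint the draft model's weights before and after a weight update and see whether anything actually changed. This is only an illustrative sketch, assuming you can dump the draft weights to two state dicts (e.g. by saving them from the engine process); the variable names draft_before / draft_after are hypothetical, not part of the SGLang API:

import torch

def snapshot_norms(state_dict):
    # Cheap fingerprint (L2 norm) of every tensor in a state dict.
    return {name: t.float().norm().item() for name, t in state_dict.items()}

def changed_params(before, after, atol=1e-6):
    # Names of parameters whose fingerprint moved between the two snapshots.
    return [name for name in before if abs(before[name] - after[name]) > atol]

# Illustrative usage: draft_before / draft_after are assumed dumps of the draft
# model's weights taken before and after the weight update.
# changed = changed_params(snapshot_norms(draft_before), snapshot_norms(draft_after))
# print(len(changed))  # 0 here would confirm the draft weights were never updated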

Steps for reproducing the bug

#1022 provides a complete reproduction pipeline; the configuration I used is identical to it.

Expected behavior

When using --sglang-speculative-algorithm EAGLE3, the weights should load correctly and the RL rollout should proceed normally, with an accept len much larger than 1.
