Bug Description
When I try to use a draft model for speculative decoding, as described in #1022, I hit another unexpected error. All the configs I use are the same as in #1022.
The error is:
[2025-12-18T16:24:02+08:00] leihaodi-qwen3-8b-rl-eagle3-20251218-162008s-3da56-cfcd2 >> (SGLangEngine pid=3111) [2025-12-18 08:24:01 TP7] Scheduler hit an exception: Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2683, in run_scheduler_process
    scheduler.event_loop_normal()
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 974, in event_loop_normal
    self.process_input_requests(recv_reqs)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1187, in process_input_requests
    output = self._request_dispatcher(recv_req)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/utils.py", line 507, in __call__
    return fn(obj)
           ^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler_update_weights_mixin.py", line 88, in update_weights_from_tensor
    success, message = worker.update_weights_from_tensor(recv_req)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/speculative/eagle_worker.py", line 1029, in update_weights_from_tensor
    success, message = self.model_runner.update_weights_from_tensor(
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 1285, in update_weights_from_tensor
    return self._update_weights_from_flattened_bucket(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 1336, in _update_weights_from_flattened_bucket
    self.model.load_weights(reconstructed_tensors)
  File "/sgl-workspace/sglang/python/sglang/srt/models/llama_eagle3.py", line 270, in load_weights
    weight_loader(param, loaded_weight)
  File "/sgl-workspace/sglang/python/sglang/srt/layers/vocab_parallel_embedding.py", line 452, in weight_loader
    assert loaded_weight.shape[output_dim] == (
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: self.org_vocab_size=32000 self.use_presharded_weights=False loaded_weight.shape[output_dim]=151936
It seems the SGLangEngine does not get the right vocab_size for the draft model, because the draft_vocab_size is usually smaller than the target_vocab_size.
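For reference, a minimal way to see the two numbers the assertion compares, independent of the engine (the paths below are placeholders, not the actual checkpoints from #1022, and draft_vocab_size is only read if the draft config happens to carry such a field):

# Sanity-check sketch: compare the vocab size the draft config reports with the
# target model's. The failing assertion in vocab_parallel_embedding.py compares
# exactly this pair (32000 vs. 151936 in the traceback above).
from transformers import AutoConfig

draft_path = "/path/to/eagle3-draft-checkpoint"  # placeholder
target_path = "/path/to/Qwen3-8B"                # placeholder

draft_cfg = AutoConfig.from_pretrained(draft_path)
target_cfg = AutoConfig.from_pretrained(target_path)

# Some EAGLE3 draft configs carry a separate draft_vocab_size; fall back to vocab_size.
print("draft vocab_size :", getattr(draft_cfg, "draft_vocab_size", None) or draft_cfg.vocab_size)
print("target vocab_size:", target_cfg.vocab_size)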
Surprisingly, simply changing the --sglang-speculative-algorithm parameter from EAGLE3 to EAGLE restored normal program operation, yet the decode phase consistently reported an accept len of 1.00, which defeats the purpose of speculative decoding. Like this:
Decode batch, #running-req: 16, #token: 63716, token usage: 0.14, accept len: 1.00, accept rate: 0.25, cuda graph: True, gen throughput (token/s): 1279.93, #queue-req: 0
This suggests the draft model weights were never properly loaded when using EAGLE as the parameter.
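To separate the two possibilities, the same target/draft pair can be served with sglang's offline Engine outside the RL loop. This is only a sketch: the paths are placeholders and the keyword names mirror sglang's speculative server arguments, which may differ across versions. If the accept length printed in the engine logs is healthy here, then the accept len of 1.00 during rollout points at the weight sync rather than at the draft checkpoint itself.

# Offline sanity check (sketch): run speculative decoding on static weights,
# then compare the "accept len" the engine logs against the 1.00 seen in rollout.
import sglang as sgl

llm = sgl.Engine(
    model_path="/path/to/Qwen3-8B",                       # placeholder target
    speculative_algorithm="EAGLE",                        # the variant that runs but shows accept len 1.00
    speculative_draft_model_path="/path/to/eagle-draft",  # placeholder draft
    speculative_num_steps=3,
    speculative_eagle_topk=1,
    speculative_num_draft_tokens=4,
)

outputs = llm.generate(
    ["The capital of France is"],
    {"temperature": 0, "max_new_tokens": 64},
)
print(outputs)
llm.shutdown()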
Steps for reproducing the bug
#1022 provides a complete reproduction pipeline, and the configuration I used is identical to the one there.
Expected behavior
When using --sglang-speculative-algorithm EAGLE3, the draft model weights should load correctly and the RL rollout process should proceed normally. The accept len should be much longer than 1.