Description
Checklist
- I searched related issues but found no solution.
- The bug persists in the latest version.
- Issues without environment info and a minimal reproducible demo are hard to resolve and may receive no feedback.
- If this is not a bug report but a general question, please start a discussion at https://github.com/sgl-project/sglang/discussions. Otherwise, it will be closed.
- Please use English. Otherwise, it will be closed.
Describe the bug
Environment
- SGLang: 0.5.7 (docker image / pip package)
- Model (target): Qwen3-VL-30B-A3B-Instruct-AWQ (MoE)
- Draft model: SpecForge_qwen3-vl-model_eagle3
- Two platforms reproduced:
  - Jetson AGX Orin 64GB (ARM64)
  - x86_64 server
- dtype: bfloat16
- Context length: 4096
- Note: `export SGLANG_DISABLE_CUDNN_CHECK=1` on Jetson
Problem Summary
When enabling speculative decoding with Qwen3-VL-30B-A3B (MoE), the server fails to start due to missing methods on the target model wrapper class `Qwen3VLMoeForConditionalGeneration`:
- With `--speculative-algorithm EAGLE`, SGLang crashes in `EAGLEWorker` because the target model has no `get_embed_and_head()`.
- With `--speculative-algorithm EAGLE3` (Spec V2 path), SGLang crashes during cuda graph init because the target model has no `set_eagle3_layers_to_capture()`.
This happens consistently on both Jetson and x86.
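For a quick sanity check independent of the full launch path, the two hooks can be probed directly on the wrapper class. This is a minimal sketch; the module path `sglang.srt.models.qwen3_vl_moe` is an assumption, only the class name is confirmed by the traceback:

```python
# Minimal probe: does the target wrapper class expose the hooks that the
# speculative workers call? The module path is an assumption; the class name
# comes from the AttributeError in the logs.
import importlib

mod = importlib.import_module("sglang.srt.models.qwen3_vl_moe")  # assumed path
cls = getattr(mod, "Qwen3VLMoeForConditionalGeneration")

for hook in ("get_embed_and_head", "set_eagle3_layers_to_capture"):
    print(hook, "->", hasattr(cls, hook))
# On v0.5.7 both are expected to print False, matching the crashes above.
```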
Reproduction
A) Jetson AGX 64G (docker, sglang 0.5.7)
Command:
export SGLANG_DISABLE_CUDNN_CHECK=1
python3 -m sglang.launch_server \
--model-path /home/stardust/zc/models/Qwen3-VL-30B-A3B-Instruct-AWQ \
--speculative-draft-model-path /home/stardust/zc/models/SpecForge_qwen3-vl-model_eagle3 \
--tp 1 --port 8007 \
--enable-torch-compile \
--context-length 4096 --chunked-prefill-size 4096 \
--mem-fraction-static 0.6 \
--trust-remote-code \
--served-model-name Qwen3_VL_8B \
--cuda-graph-max-bs 4 \
--speculative-algorithm EAGLE \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4

Error (key part):
AttributeError: 'Qwen3VLMoeForConditionalGeneration' object has no attribute 'get_embed_and_head'
File ".../sglang/srt/speculative/eagle_worker.py", line 154, in __init__
embed, head = self.target_worker.model_runner.model.get_embed_and_head()
Also noticed this warning:
Unrecognized keys in `rope_scaling` for 'rope_type'='default': {'mrope_interleaved', 'mrope_section'}
B) x86_64 (sglang 0.5.7), EAGLE
Command:
python3 -m sglang.launch_server \
--model-path /home/stardust/zc/models/Qwen3-VL-30B-A3B-Instruct-AWQ/ \
--speculative-draft-model-path /home/stardust/zc/models/SpecForge_qwen3-vl-model_eagle3/ \
--tp 1 --port 8007 \
--context-length 4096 --chunked-prefill-size 4096 \
--mem-fraction-static 0.6 \
--trust-remote-code \
--served-model-name Qwen3_VL_8B \
--cuda-graph-max-bs 4 \
--speculative-algorithm EAGLE \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4

Error (same root cause):
AttributeError: 'Qwen3VLMoeForConditionalGeneration' object has no attribute 'get_embed_and_head'
C) x86_64 (sglang 0.5.7), EAGLE3
Command:
python3 -m sglang.launch_server \
--model-path /home/stardust/zc/models/Qwen3-VL-30B-A3B-Instruct-AWQ/ \
--speculative-draft-model-path /home/stardust/zc/models/SpecForge_qwen3-vl-model_eagle3/ \
--tp 1 --port 8007 \
--context-length 4096 --chunked-prefill-size 4096 \
--mem-fraction-static 0.6 \
--trust-remote-code \
--served-model-name Qwen3_VL_8B \
--cuda-graph-max-bs 4 \
--speculative-algorithm EAGLE3 \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4

Error (key part):
AttributeError: 'Qwen3VLMoeForConditionalGeneration' object has no attribute 'set_eagle3_layers_to_capture'
File ".../sglang/srt/model_executor/cuda_graph_runner.py", line 353, in __init__
self.model_runner.model.set_eagle3_layers_to_capture()
Expected Behavior
The server should start successfully with speculative decoding enabled (EAGLE / EAGLE3), or at least fail with a clear message that Qwen3-VL MoE is not yet supported by EAGLE/EAGLE3.
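To illustrate the second half of that expectation, here is a hedged sketch (not SGLang's actual API; the function and constant names are invented for illustration) of a fail-fast compatibility check that could run when the speculative worker is created, instead of letting an AttributeError surface from deep inside `EAGLEWorker` or the cuda graph runner:

```python
# Hypothetical fail-fast check; names and placement are illustrative only,
# not SGLang's actual implementation.
REQUIRED_SPEC_HOOKS = ("get_embed_and_head", "set_eagle3_layers_to_capture")

def check_speculative_support(target_model, algorithm: str) -> None:
    """Raise a clear error if the target model lacks the speculative hooks."""
    missing = [h for h in REQUIRED_SPEC_HOOKS if not hasattr(target_model, h)]
    if missing:
        raise NotImplementedError(
            f"{type(target_model).__name__} does not yet support speculative "
            f"decoding with {algorithm}; missing hooks: {', '.join(missing)}"
        )
```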
Additional Context / Suspected Cause
- In the speculative path, the workers appear to require the target model wrapper to implement:
  - `get_embed_and_head()` (used by the EAGLE worker)
  - `set_eagle3_layers_to_capture()` (used by EAGLE3 + cuda graph init)
- `Qwen3VLMoeForConditionalGeneration` currently lacks these methods in v0.5.7.
- There is a merged PR for Qwen2.5-VL + EAGLE3 support (PR "Qwen2.5-VL eagle3 infer" #8801). It seems Qwen3-VL (especially the MoE wrapper) may need a similar adaptation; a hedged sketch of the shape of these hooks follows below.
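If the fix follows the same shape as the Qwen2.5-VL adaptation, the wrapper would mainly need to forward these two hooks to its text backbone. Below is a self-contained, hedged sketch on dummy modules; the attribute names (`language_model`, `embed_tokens`, `lm_head`) and the aux-layer bookkeeping are assumptions for illustration, not the real Qwen3-VL MoE layout in SGLang:

```python
import torch.nn as nn

class DummyTextBackbone(nn.Module):
    """Stand-in for the text (language) part of the VL model."""
    def __init__(self, vocab_size=32, hidden_size=16):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, hidden_size)
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)
        self.layers_to_capture = None  # aux hidden states for EAGLE3

class DummyQwen3VLMoeWrapper(nn.Module):
    """Stand-in for the multimodal wrapper that currently lacks the hooks."""
    def __init__(self):
        super().__init__()
        self.language_model = DummyTextBackbone()

    def get_embed_and_head(self):
        # EAGLEWorker.__init__ reads the target model's input embedding and
        # LM head so the draft model can share them.
        return (
            self.language_model.embed_tokens.weight,
            self.language_model.lm_head.weight,
        )

    def set_eagle3_layers_to_capture(self, layer_ids=None):
        # The EAGLE3 cuda graph init asks the target model to record aux
        # hidden states from selected decoder layers.
        self.language_model.layers_to_capture = layer_ids

# Roughly what the speculative workers do at init time:
model = DummyQwen3VLMoeWrapper()
embed, head = model.get_embed_and_head()
model.set_eagle3_layers_to_capture([3, 7, 11])
```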
Questions
- Is Qwen3-VL MoE intended to be supported by EAGLE/EAGLE3 in current releases? If not, can we get an explicit compatibility note/error message?
- If yes, is there an existing branch/PR that adds `get_embed_and_head()` and `set_eagle3_layers_to_capture()` (and any related aux hidden state handling) for Qwen3-VL MoE?
- Any recommended workaround for now (e.g., disable cuda graph / torch compile / use STANDALONE speculative / NGRAM)?