
[Bug] Qwen3-VL-30B-A3B (MoE) fails to start with speculative decoding: missing get_embed_and_head() / set_eagle3_layers_to_capture() in Qwen3VLMoeForConditionalGeneration (sglang v0.5.7) #17935

@JasonNing96

Description


Checklist

  • I searched related issues but found no solution.
  • The bug persists in the latest version.
  • Issues without environment info and a minimal reproducible demo are hard to resolve and may receive no feedback.
  • If this is not a bug report but a general question, please start a discussion at https://github.com/sgl-project/sglang/discussions. Otherwise, it will be closed.
  • Please use English. Otherwise, it will be closed.

Describe the bug

Environment

  • SGLang: 0.5.7 (docker image / pip package)

  • Model (target): Qwen3-VL-30B-A3B-Instruct-AWQ (MoE)

  • Draft model: SpecForge_qwen3-vl-model_eagle3

  • Two platforms reproduced:

    1. Jetson AGX Orin 64GB (ARM64)
    2. x86_64 server
  • dtype: bfloat16

  • Context length: 4096

  • Note: export SGLANG_DISABLE_CUDNN_CHECK=1 on Jetson


Problem Summary

When enabling speculative decoding with Qwen3-VL-30B-A3B (MoE), the server fails to start due to missing methods on the target model wrapper class Qwen3VLMoeForConditionalGeneration:

  1. With --speculative-algorithm EAGLE, SGLang crashes in EAGLEWorker because the target model has no get_embed_and_head().

  2. With --speculative-algorithm EAGLE3 (Spec V2 path), SGLang crashes during cuda graph init because the target model has no set_eagle3_layers_to_capture().

This happens consistently on both Jetson and x86.


Reproduction

A) Jetson AGX 64G (docker, sglang 0.5.7)

Command:

export SGLANG_DISABLE_CUDNN_CHECK=1
python3 -m sglang.launch_server \
  --model-path /home/stardust/zc/models/Qwen3-VL-30B-A3B-Instruct-AWQ \
  --speculative-draft-model-path /home/stardust/zc/models/SpecForge_qwen3-vl-model_eagle3 \
  --tp 1 --port 8007 \
  --enable-torch-compile \
  --context-length 4096 --chunked-prefill-size 4096 \
  --mem-fraction-static 0.6 \
  --trust-remote-code \
  --served-model-name Qwen3_VL_8B \
  --cuda-graph-max-bs 4 \
  --speculative-algorithm EAGLE \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4

Error (key part):

AttributeError: 'Qwen3VLMoeForConditionalGeneration' object has no attribute 'get_embed_and_head'
  File ".../sglang/srt/speculative/eagle_worker.py", line 154, in __init__
    embed, head = self.target_worker.model_runner.model.get_embed_and_head()

Also observed this warning:

Unrecognized keys in `rope_scaling` for 'rope_type'='default': {'mrope_interleaved', 'mrope_section'}

B) x86_64 (sglang 0.5.7), EAGLE

Command:

python3 -m sglang.launch_server \
  --model-path /home/stardust/zc/models/Qwen3-VL-30B-A3B-Instruct-AWQ/ \
  --speculative-draft-model-path /home/stardust/zc/models/SpecForge_qwen3-vl-model_eagle3/ \
  --tp 1 --port 8007 \
  --context-length 4096 --chunked-prefill-size 4096 \
  --mem-fraction-static 0.6 \
  --trust-remote-code \
  --served-model-name Qwen3_VL_8B \
  --cuda-graph-max-bs 4 \
  --speculative-algorithm EAGLE \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4

Error (same root cause):

AttributeError: 'Qwen3VLMoeForConditionalGeneration' object has no attribute 'get_embed_and_head'

C) x86_64 (sglang 0.5.7), EAGLE3

Command:

python3 -m sglang.launch_server \
  --model-path /home/stardust/zc/models/Qwen3-VL-30B-A3B-Instruct-AWQ/ \
  --speculative-draft-model-path /home/stardust/zc/models/SpecForge_qwen3-vl-model_eagle3/ \
  --tp 1 --port 8007 \
  --context-length 4096 --chunked-prefill-size 4096 \
  --mem-fraction-static 0.6 \
  --trust-remote-code \
  --served-model-name Qwen3_VL_8B \
  --cuda-graph-max-bs 4 \
  --speculative-algorithm EAGLE3 \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4

Error (key part):

AttributeError: 'Qwen3VLMoeForConditionalGeneration' object has no attribute 'set_eagle3_layers_to_capture'
  File ".../sglang/srt/model_executor/cuda_graph_runner.py", line 353, in __init__
    self.model_runner.model.set_eagle3_layers_to_capture()

Expected Behavior

The server should start successfully with speculative decoding enabled (EAGLE / EAGLE3), or at least fail with a clear message that Qwen3-VL MoE is not yet supported by EAGLE/EAGLE3.
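Until real support lands, a small fail-fast check at worker init would at least produce the clear message requested above. This is only a hypothetical sketch (the function name `check_spec_decode_support` and the `model` parameter are illustrative; `model` stands in for `self.target_worker.model_runner.model` from the traceback), not code that exists in sglang:

```python
# Hypothetical fail-fast guard, sketched for eagle_worker.py-style init code.
# The required method names come from the tracebacks in this report.

def check_spec_decode_support(model, algorithm):
    """Raise a clear error if the target model lacks the hooks EAGLE/EAGLE3 need."""
    required = {
        "EAGLE": ["get_embed_and_head"],
        "EAGLE3": ["get_embed_and_head", "set_eagle3_layers_to_capture"],
    }
    missing = [m for m in required.get(algorithm, []) if not hasattr(model, m)]
    if missing:
        raise NotImplementedError(
            f"{type(model).__name__} does not support --speculative-algorithm "
            f"{algorithm}: missing {', '.join(missing)}"
        )
```

With such a guard, launching against `Qwen3VLMoeForConditionalGeneration` would fail immediately with an explicit compatibility error instead of an `AttributeError` deep inside worker or cuda graph init.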


Additional Context / Suspected Cause

  • In the speculative path, the workers appear to require the target model wrapper to implement:

    • get_embed_and_head() (used by EAGLE worker)
    • set_eagle3_layers_to_capture() (used by EAGLE3 + cuda graph init)
  • Qwen3VLMoeForConditionalGeneration currently lacks these methods in v0.5.7.

  • There is a merged PR adding Qwen2.5-VL + Eagle3 support (Qwen2.5-VL eagle3 infer #8801). Qwen3-VL (especially the MoE wrapper) likely needs a similar adaptation.
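For reference, in model wrappers that do support EAGLE/EAGLE3 these two hooks are typically small accessors. The sketch below is purely hypothetical: the class here is a stand-in, and the attribute names (`capture_aux_hidden_states`, `layers_to_capture`, the placeholder weight fields, and the default layer selection) are assumptions modeled on how other wrappers behave, not verified against the v0.5.7 source:

```python
# Hypothetical sketch of the two hooks the speculative workers expect.
# All attribute names below are assumptions; only the two method names
# (get_embed_and_head, set_eagle3_layers_to_capture) come from the tracebacks.

class Qwen3VLMoeForConditionalGeneration:  # illustrative stand-in, not the real class
    def __init__(self, num_hidden_layers=48):
        self.num_hidden_layers = num_hidden_layers
        self.embed_tokens_weight = "embed_weight"  # placeholder for the embedding weight tensor
        self.lm_head_weight = "head_weight"        # placeholder for the LM head weight tensor
        self.capture_aux_hidden_states = False
        self.layers_to_capture = []

    def get_embed_and_head(self):
        # The EAGLE worker needs the target model's embedding and LM head
        # weights so the draft model can share them.
        return self.embed_tokens_weight, self.lm_head_weight

    def set_eagle3_layers_to_capture(self, layer_ids=None):
        # EAGLE3 captures aux hidden states from a few layers (commonly an
        # early, a middle, and a late layer) during cuda graph init.
        self.capture_aux_hidden_states = True
        if layer_ids is None:
            n = self.num_hidden_layers
            layer_ids = [2, n // 2, n - 3]
        self.layers_to_capture = layer_ids
```

If the real fix follows this pattern, the main open question is which layers to capture for the MoE variant and how the aux hidden states interact with the AWQ-quantized weights.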


Questions

  1. Is Qwen3-VL MoE intended to be supported by EAGLE/EAGLE3 in current releases? If not, can we get an explicit compatibility note/error message?
  2. If yes, is there an existing branch/PR that adds get_embed_and_head() and set_eagle3_layers_to_capture() (and any related aux hidden state handling) for Qwen3-VL MoE?
  3. Any recommended workaround for now (e.g., disable cuda graph / torch compile / use STANDALONE speculative / NGRAM)?
