Description
Checklist
- I searched related issues but found no solution.
- The bug persists in the latest version.
- Issues without environment info and a minimal reproducible demo are hard to resolve and may receive no feedback.
- If this is not a bug report but a general question, please start a discussion at https://github.com/sgl-project/sglang/discussions. Otherwise, it will be closed.
- Please use English. Otherwise, it will be closed.
Describe the bug
Environment
- SGLang: 0.5.7 (docker image / pip package)
- Model (target): Qwen3-VL-30B-A3B-Instruct-AWQ (MoE)
- Draft model: SpecForge_qwen3-vl-model_eagle3
- Two platforms reproduced:
  - Jetson AGX Orin 64GB (ARM64)
  - x86_64 server
- dtype: bfloat16
- Context length: 4096
- Note: `export SGLANG_DISABLE_CUDNN_CHECK=1` on Jetson
Problem Summary
When enabling speculative decoding with Qwen3-VL-30B-A3B (MoE), the server fails to start due to missing methods on the target model wrapper class `Qwen3VLMoeForConditionalGeneration`:
- With `--speculative-algorithm EAGLE`, SGLang crashes in `EAGLEWorker` because the target model has no `get_embed_and_head()`.
- With `--speculative-algorithm EAGLE3` (Spec V2 path), SGLang crashes during cuda graph init because the target model has no `set_eagle3_layers_to_capture()`.
This happens consistently on both Jetson and x86.
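For a quick sanity check independent of the full launch path, the two hooks can be probed directly on the wrapper class. This is a minimal sketch; the module path `sglang.srt.models.qwen3_vl_moe` is an assumption, only the class name is confirmed by the traceback:

```python
# Minimal probe: does the target wrapper class expose the hooks that the
# speculative workers call? The module path is an assumption; the class name
# comes from the AttributeError in the logs.
import importlib

mod = importlib.import_module("sglang.srt.models.qwen3_vl_moe")  # assumed path
cls = getattr(mod, "Qwen3VLMoeForConditionalGeneration")

for hook in ("get_embed_and_head", "set_eagle3_layers_to_capture"):
    print(hook, "->", hasattr(cls, hook))
# On v0.5.7 both are expected to print False, matching the crashes above.
```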
Reproduction
A) Jetson AGX 64G (docker, sglang 0.5.7)
Command:
export SGLANG_DISABLE_CUDNN_CHECK=1
python3 -m sglang.launch_server \
--model-path /home/stardust/zc/models/Qwen3-VL-30B-A3B-Instruct-AWQ \
--speculative-draft-model-path /home/stardust/zc/models/SpecForge_qwen3-vl-model_eagle3 \
--tp 1 --port 8007 \
--enable-torch-compile \
--context-length 4096 --chunked-prefill-size 4096 \
--mem-fraction-static 0.6 \
--trust-remote-code \
--served-model-name Qwen3_VL_8B \
--cuda-graph-max-bs 4 \
--speculative-algorithm EAGLE \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4

Error (key part):
AttributeError: 'Qwen3VLMoeForConditionalGeneration' object has no attribute 'get_embed_and_head'
File ".../sglang/srt/speculative/eagle_worker.py", line 154, in __init__
embed, head = self.target_worker.model_runner.model.get_embed_and_head()
Also noticed this warning:
Unrecognized keys in `rope_scaling` for 'rope_type'='default': {'mrope_interleaved', 'mrope_section'}
B) x86_64 (sglang 0.5.7), EAGLE
Command:
python3 -m sglang.launch_server \
--model-path /home/stardust/zc/models/Qwen3-VL-30B-A3B-Instruct-AWQ/ \
--speculative-draft-model-path /home/stardust/zc/models/SpecForge_qwen3-vl-model_eagle3/ \
--tp 1 --port 8007 \
--context-length 4096 --chunked-prefill-size 4096 \
--mem-fraction-static 0.6 \
--trust-remote-code \
--served-model-name Qwen3_VL_8B \
--cuda-graph-max-bs 4 \
--speculative-algorithm EAGLE \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4

Error (same root cause):
AttributeError: 'Qwen3VLMoeForConditionalGeneration' object has no attribute 'get_embed_and_head'
C) x86_64 (sglang 0.5.7), EAGLE3
Command:
python3 -m sglang.launch_server \
--model-path /home/stardust/zc/models/Qwen3-VL-30B-A3B-Instruct-AWQ/ \
--speculative-draft-model-path /home/stardust/zc/models/SpecForge_qwen3-vl-model_eagle3/ \
--tp 1 --port 8007 \
--context-length 4096 --chunked-prefill-size 4096 \
--mem-fraction-static 0.6 \
--trust-remote-code \
--served-model-name Qwen3_VL_8B \
--cuda-graph-max-bs 4 \
--speculative-algorithm EAGLE3 \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4

Error (key part):
AttributeError: 'Qwen3VLMoeForConditionalGeneration' object has no attribute 'set_eagle3_layers_to_capture'
File ".../sglang/srt/model_executor/cuda_graph_runner.py", line 353, in __init__
self.model_runner.model.set_eagle3_layers_to_capture()
Expected Behavior
The server should start successfully with speculative decoding enabled (EAGLE / EAGLE3), or at least fail with a clear message that Qwen3-VL MoE is not yet supported by EAGLE/EAGLE3.
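To illustrate the second half of that expectation, here is a hedged sketch (not SGLang's actual API; the function and constant names are invented for illustration) of a fail-fast compatibility check that could run when the speculative worker is created, instead of letting an AttributeError surface from deep inside `EAGLEWorker` or the cuda graph runner:

```python
# Hypothetical fail-fast check; names and placement are illustrative only,
# not SGLang's actual implementation.
REQUIRED_SPEC_HOOKS = ("get_embed_and_head", "set_eagle3_layers_to_capture")

def check_speculative_support(target_model, algorithm: str) -> None:
    """Raise a clear error if the target model lacks the speculative hooks."""
    missing = [h for h in REQUIRED_SPEC_HOOKS if not hasattr(target_model, h)]
    if missing:
        raise NotImplementedError(
            f"{type(target_model).__name__} does not yet support speculative "
            f"decoding with {algorithm}; missing hooks: {', '.join(missing)}"
        )
```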
Additional Context / Suspected Cause
- In the speculative path, the workers appear to require the target model wrapper to implement:
  - `get_embed_and_head()` (used by the EAGLE worker)
  - `set_eagle3_layers_to_capture()` (used by EAGLE3 + cuda graph init)
- `Qwen3VLMoeForConditionalGeneration` currently lacks these methods in v0.5.7.
- There is a merged PR for Qwen2.5-VL + EAGLE3 support (PR "Qwen2.5-VL eagle3 infer" #8801). It seems Qwen3-VL (especially the MoE wrapper) may need a similar adaptation; a hedged sketch of the shape of these hooks follows below.
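If the fix follows the same shape as the Qwen2.5-VL adaptation, the wrapper would mainly need to forward these two hooks to its text backbone. Below is a self-contained, hedged sketch on dummy modules; the attribute names (`language_model`, `embed_tokens`, `lm_head`) and the aux-layer bookkeeping are assumptions for illustration, not the real Qwen3-VL MoE layout in SGLang:

```python
import torch.nn as nn

class DummyTextBackbone(nn.Module):
    """Stand-in for the text (language) part of the VL model."""
    def __init__(self, vocab_size=32, hidden_size=16):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, hidden_size)
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)
        self.layers_to_capture = None  # aux hidden states for EAGLE3

class DummyQwen3VLMoeWrapper(nn.Module):
    """Stand-in for the multimodal wrapper that currently lacks the hooks."""
    def __init__(self):
        super().__init__()
        self.language_model = DummyTextBackbone()

    def get_embed_and_head(self):
        # EAGLEWorker.__init__ reads the target model's input embedding and
        # LM head so the draft model can share them.
        return (
            self.language_model.embed_tokens.weight,
            self.language_model.lm_head.weight,
        )

    def set_eagle3_layers_to_capture(self, layer_ids=None):
        # The EAGLE3 cuda graph init asks the target model to record aux
        # hidden states from selected decoder layers.
        self.language_model.layers_to_capture = layer_ids

# Roughly what the speculative workers do at init time:
model = DummyQwen3VLMoeWrapper()
embed, head = model.get_embed_and_head()
model.set_eagle3_layers_to_capture([3, 7, 11])
```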
Questions
- Is Qwen3-VL MoE intended to be supported by EAGLE/EAGLE3 in current releases? If not, can we get an explicit compatibility note/error message?
- If yes, is there an existing branch/PR that adds `get_embed_and_head()` and `set_eagle3_layers_to_capture()` (and any related aux hidden state handling) for Qwen3-VL MoE?
- Any recommended workaround for now (e.g., disable cuda graph / torch compile / use STANDALONE speculative / NGRAM)?