
[CI] Fix tests/evals/gsm8k/test_gsm8k_correctness.py for Qwen3-Next-80B-A3B-NVFP4-EP2 #34999

Open
LucasWilkinson wants to merge 1 commit into vllm-project:main from neuralmagic:lwilkinson/fix-gdn-mixed-spec-decode-assert
Conversation

@LucasWilkinson
Collaborator

@LucasWilkinson LucasWilkinson commented Feb 20, 2026

FIX: #34993

#34077 broke `pytest -s -v tests/evals/gsm8k/test_gsm8k_correctness.py -k "Qwen3-Next-80B-A3B-NVFP4-EP2" --config-list-file=tests/evals/gsm8k/configs/models-blackwell.txt` with what appears to be an overly conservative assert.

Test Plan:

pytest -s -v tests/evals/gsm8k/test_gsm8k_correctness.py -k "Qwen3-Next-80B-A3B-NVFP4-EP2" --config-list-file=tests/evals/gsm8k/configs/models-blackwell.txt

Now passes

cc @vadiklyutiy

…es in GDN attention

The GDN attention metadata builder had an assertion that prevented
batches containing both regular decode requests and speculative decode
requests. This assertion was introduced in vllm-project#34077 as a defensive check,
but it is overly conservative.

Mixed batches naturally occur during MTP speculative decoding when a
request enters its first decode step (no draft tokens yet) while other
requests are already spec-decoding. The metadata builder's else branch
(line 247) already computes separate spec/non-spec tensors correctly
for this case, and the model forward pass in qwen3_next.py handles
mixed batches by separating, processing, and merging tokens
independently.
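
As a rough illustration of the separate-process-merge pattern described above (a hedged sketch, not the actual vLLM code: `forward_mixed`, the token flags, and the per-group operations are all hypothetical stand-ins for the real spec/non-spec attention paths):

```python
def forward_mixed(tokens, is_spec):
    """Process a batch mixing spec-decode and regular decode tokens.

    tokens: per-token values; is_spec: parallel list of booleans marking
    tokens that belong to spec-decoding requests.
    """
    # Separate the batch into spec and non-spec groups.
    spec = [t for t, s in zip(tokens, is_spec) if s]
    non_spec = [t for t, s in zip(tokens, is_spec) if not s]

    # Placeholder per-group processing; the real model would run the
    # spec and non-spec code paths here.
    spec_out = [t * 2 for t in spec]
    non_spec_out = [t * 3 for t in non_spec]

    # Merge the results back in the original token order.
    out, si, ni = [], iter(spec_out), iter(non_spec_out)
    for s in is_spec:
        out.append(next(si) if s else next(ni))
    return out
```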

CUDAGraphs are unaffected: the two CUDAGraph preparation blocks already
exclude mixed batches via their guard conditions (num_decodes==0 and
num_spec_decodes==0 respectively), so mixed batches fall back to eager
execution.
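
The guard logic can be sketched as follows (hypothetical function and return values; the real builder uses its own control flow, but the described invariant is the same: each CUDAGraph path requires the other request kind to be absent, so a mixed batch satisfies neither guard and runs eagerly):

```python
def choose_execution_mode(num_decodes: int, num_spec_decodes: int) -> str:
    # Spec-decode CUDAGraph path: only taken when there are no regular decodes.
    if num_spec_decodes > 0 and num_decodes == 0:
        return "cudagraph_spec"
    # Regular-decode CUDAGraph path: only taken when there are no spec decodes.
    if num_decodes > 0 and num_spec_decodes == 0:
        return "cudagraph_decode"
    # Mixed (or empty) batches fall through to eager execution.
    return "eager"
```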

Fixes vllm-project#34993

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
@dosubot

dosubot bot commented Feb 20, 2026

Related Documentation

Checked 0 published document(s) in 1 knowledge base(s). No updates required.


@LucasWilkinson LucasWilkinson changed the title [Bugfix] Fix tests/evals/gsm8k/test_gsm8k_correctness.py for Qwen3-Next-80B-A3B-NVFP4-EP2 [CI] Fix tests/evals/gsm8k/test_gsm8k_correctness.py for Qwen3-Next-80B-A3B-NVFP4-EP2 Feb 20, 2026
@mergify mergify bot added qwen Related to Qwen models v1 labels Feb 20, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request addresses a bug in the GDN attention backend by removing an assertion that was overly conservative. The assertion prevented batches from containing both speculative and non-speculative decode requests simultaneously. My analysis of the surrounding code confirms that the logic is designed to handle such mixed batches by partitioning requests and preparing separate metadata for each type. Therefore, removing the assertion is the correct fix. The change is minimal, targeted, and seems to resolve the issue as described.

@vadiklyutiy
Collaborator

As far as I understand this code (and the surrounding code), there shouldn't be a case where both num_decodes and num_spec_decodes are present: without spec decode, all requests go to num_decodes; with spec decode, they all go to num_spec_decodes.

Maybe some prefill was counted as a num_decode (I recall DS had such an issue).

@LucasWilkinson
Collaborator Author

Ya, decode is a loose term; short enough prefill chunks are considered "decodes" from a reordering perspective.
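
A minimal sketch of that bucketing idea (hypothetical function, names, and threshold; vLLM's actual reordering logic lives in the attention metadata builders and may use different criteria):

```python
def classify_request(num_scheduled_tokens: int, num_spec_tokens: int,
                     decode_threshold: int = 1) -> str:
    """Bucket a request for batch reordering.

    A request with draft tokens is a spec decode; otherwise, a request
    whose scheduled token count is at or below the threshold is treated
    as a "decode" even if those tokens are really a short prefill chunk.
    """
    if num_spec_tokens > 0:
        return "spec_decode"
    if num_scheduled_tokens <= decode_threshold:
        return "decode"
    return "prefill"
```

With a threshold above 1, a short prefill chunk lands in the "decode" bucket, which is exactly the situation the removed assertion tripped over.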

@LucasWilkinson LucasWilkinson added the ready ONLY add when PR is ready to merge/full CI is needed label Feb 21, 2026
@vadiklyutiy
Collaborator

> Ya, decode is a loose term, short enough prefill chunks are considered "decodes" from a reordering perspective

I am not sure that the code of this function, or the model implementation, will work correctly if we count a prefill as a decode.

What is the reason that short prefill we consider as decode?


Labels

qwen (Related to Qwen models), ready (ONLY add when PR is ready to merge/full CI is needed), v1

Development

Successfully merging this pull request may close these issues.

[CI] GDN attention backend assertion failure with MTP speculative decoding