[CI] Fix tests/evals/gsm8k/test_gsm8k_correctness.py for Qwen3-Next-80B-A3B-NVFP4-EP2 #34999
Conversation
…es in GDN attention

The GDN attention metadata builder had an assertion that prevented batches containing both regular decode requests and speculative decode requests. This assertion was introduced in vllm-project#34077 as a defensive check, but it is overly conservative. Mixed batches naturally occur during MTP speculative decoding when a request enters its first decode step (no draft tokens yet) while other requests are already spec-decoding.

The metadata builder's else branch (line 247) already computes separate spec/non-spec tensors correctly for this case, and the model forward pass in qwen3_next.py handles mixed batches by separating, processing, and merging tokens independently.

CUDAGraphs are unaffected: the two CUDAGraph preparation blocks already exclude mixed batches via their guard conditions (num_decodes == 0 and num_spec_decodes == 0 respectively), so mixed batches fall back to eager execution.

Fixes vllm-project#34993

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
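For readers not familiar with this file, a minimal sketch of the shape of the change, using hypothetical names (`GDNDecodeMetadata`, `build_decode_metadata` are illustrative, not the actual vLLM source): the removed assert rejected exactly the mixed case that the builder's else branch already handles.

```python
# Minimal sketch, assuming simplified hypothetical names -- not the
# actual vLLM metadata builder.
from dataclasses import dataclass


@dataclass
class GDNDecodeMetadata:
    num_decodes: int       # regular decodes: first decode step, no draft tokens yet
    num_spec_decodes: int  # speculative decodes: verifying MTP draft tokens


def build_decode_metadata(num_decodes: int,
                          num_spec_decodes: int) -> GDNDecodeMetadata:
    # Removed by this PR -- it rejected mixed batches even though the
    # else branch below already handles them:
    #   assert num_decodes == 0 or num_spec_decodes == 0
    if num_spec_decodes == 0:
        # Pure non-spec batch: only regular decode tensors are needed.
        pass
    else:
        # Spec (and possibly mixed) batch: separate spec/non-spec tensors
        # are computed for each partition of the batch.
        pass
    return GDNDecodeMetadata(num_decodes, num_spec_decodes)
```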
Code Review
This pull request addresses a bug in the GDN attention backend by removing an assertion that was overly conservative. The assertion prevented batches from containing both speculative and non-speculative decode requests simultaneously. My analysis of the surrounding code confirms that the logic is designed to handle such mixed batches by partitioning requests and preparing separate metadata for each type. Therefore, removing the assertion is the correct fix. The change is minimal, targeted, and seems to resolve the issue as described.
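To make the separate-process-merge pattern concrete, here is a hedged sketch (a hypothetical helper, not the qwen3_next.py source) of how a mixed batch can be partitioned by request type, each partition processed independently, and the outputs scattered back into the original token order:

```python
# Hedged sketch of the separate -> process -> merge pattern; the tensor
# layout and function names are assumptions, not the vLLM implementation.
import torch


def forward_mixed_batch(tokens: torch.Tensor, is_spec: torch.Tensor,
                        spec_fn, non_spec_fn) -> torch.Tensor:
    # Partition the flattened token batch by request type.
    spec_tokens = tokens[is_spec]
    non_spec_tokens = tokens[~is_spec]

    # Each partition runs its own attention path with its own metadata.
    spec_out = spec_fn(spec_tokens)
    non_spec_out = non_spec_fn(non_spec_tokens)

    # Merge the outputs back into the original token order.
    out = torch.empty_like(tokens)
    out[is_spec] = spec_out
    out[~is_spec] = non_spec_out
    return out
```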
As much as I understand this code (and the code around it), there shouldn't be a case where both spec and non-spec decodes end up in the same batch. Maybe some prefill was counted as a decode?
Ya, decode is a loose term; short enough prefill chunks are considered "decodes" from a reordering perspective.
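As an illustration of that classification, a hedged sketch assuming a hypothetical reorder threshold (the function and parameter names here are made up for illustration): any request scheduling at most `threshold` tokens in this step lands in the decode bucket, so a short prefill chunk can be counted as a "decode".

```python
# Hypothetical sketch of decode-vs-prefill bucketing during batch
# reordering; not the actual vLLM reorder logic.
def classify_for_reordering(num_scheduled_tokens: list[int],
                            threshold: int = 1) -> tuple[list[int], list[int]]:
    decodes, prefills = [], []
    for req_idx, n in enumerate(num_scheduled_tokens):
        # Short requests (including short prefill chunks) go in the
        # decode bucket so they can share the decode kernel path.
        (decodes if n <= threshold else prefills).append(req_idx)
    return decodes, prefills


# Example: a 1-token chunked prefill is bucketed with the decodes.
assert classify_for_reordering([1, 512, 1], threshold=1) == ([0, 2], [1])
```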
I am not sure the code of this function will work correctly if we count a prefill as a decode, and the same goes for the model implementation. What is the reason we consider a short prefill a decode?
FIX: #34993
#34077 broke
`pytest -s -v tests/evals/gsm8k/test_gsm8k_correctness.py -k "Qwen3-Next-80B-A3B-NVFP4-EP2" --config-list-file=tests/evals/gsm8k/configs/models-blackwell.txt` with what appears to be an overly conservative assert.

Test Plan:
Now passes
cc @vadiklyutiy