Support NVFP4 KV for prefill and batch attention kernels #2820
Tom-Zheng wants to merge 5 commits into flashinfer-ai:main
Conversation
Signed-off-by: Tian Zheng <29906817+Tom-Zheng@users.noreply.github.com>
📝 Walkthrough
This PR introduces NVFP4 (4-bit floating-point) KV cache support with per-block scale factors throughout the attention and prefill pipelines. Changes include Python API extensions to accept KV block scales, CUDA kernel modifications to load and apply scale factors, new dequantization paths for FP4, and corresponding test coverage.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
Summary of Changes
This pull request significantly enhances FlashInfer's capabilities by integrating native support for NVFP4 KV cache input across its batch prefill and batch attention kernels. This allows for more memory-efficient and potentially faster inference on NVIDIA GPUs by leveraging 4-bit quantization for key and value tensors. The changes span from low-level CUDA kernel implementations for data loading and computation to Python-level utilities for quantization, dequantization, and comprehensive testing, ensuring broad compatibility and correctness.
/bot run
Code Review
This pull request introduces support for NVFP4 (NVIDIA FP4) KV cache quantization in FlashInfer's prefill and batch attention operations. Key changes include updating benchmark routines to handle NVFP4 as a KV data type, adjusting tolerances for lower precision, and filtering unsupported backends. The core C++ kernels and Python JIT modules are extended to manage packed NVFP4 data and per-group scale factors, including modifications to memory access patterns and MMA operations for proper dequantization. Review comments highlight the need for improved clarity in the global_scale calculation in _to_nvfp4 and detailed explanations for the intricate scaling factor application logic within the compute_qk and compute_sfm_v device functions.
benchmarks/routines/attention.py
```python
min_val, max_val = x.float().aminmax()
amax = torch.maximum(min_val.abs(), max_val.abs()).clamp(min=1e-12)
global_scale = amax / (448.0 * 6.0)
```
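The constants in the last line come straight from the two-level NVFP4 encoding: per-block scale factors are stored as FP8 E4M3 (largest finite value 448) and each E2M1 element can represent at most ±6. A pure-Python sketch of the derivation (the helper name here is ours, not part of the PR):

```python
E4M3_MAX = 448.0  # largest finite FP8 E4M3 value (per-block scale-factor format)
E2M1_MAX = 6.0    # largest FP4 E2M1 magnitude

def nvfp4_global_scale(values):
    """Hypothetical helper mirroring the `amax / (448.0 * 6.0)` expression.

    A stored element decodes as global_scale * sf_e4m3 * code_e2m1, so the
    largest representable magnitude is global_scale * 448 * 6. Dividing the
    tensor's absolute max by that product makes amax exactly representable.
    """
    amax = max(abs(v) for v in values)
    amax = max(amax, 1e-12)  # matches .clamp(min=1e-12)
    return amax / (E4M3_MAX * E2M1_MAX)

print(nvfp4_global_scale([-2688.0, 1.0]))  # 2688 = 448 * 6, so the scale is 1.0
```

This is the reason a clarifying comment was requested: without knowing the two format maxima, `448.0 * 6.0` reads as a magic number.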
```cpp
if constexpr (is_fp4_type_v<typename KTraits::DTypeKV>) {
  // Apply scaling factors for K.
  // SF smem is linear: sf[kv_row * SF_COLS + hd_group], SF_COLS = HEAD_DIM_QK/16.
  // For m16n8k16 B layout, thread t's KV rows are t/4 and t/4+8 in the mma_kv tile.
  // b_frag[0,1] share KV row (t/4), b_frag[2,3] share KV row (t/4+8).
  using DTypeQ_ = typename KTraits::DTypeQ;
  using packed2_ = std::conditional_t<std::is_same_v<DTypeQ_, half>, half2, __nv_bfloat162>;
  constexpr uint32_t SF_COLS_K = KTraits::NUM_MMA_D_QK;  // HEAD_DIM_QK / 16
  uint32_t sf_base = (mma_kv * 16 + lane_idx / 4) * SF_COLS_K + mma_d;
  __nv_fp8_e4m3 sf_a_fp8, sf_b_fp8;
  sf_a_fp8.__x = k_sf_smem[sf_base];
  sf_b_fp8.__x = k_sf_smem[sf_base + 8 * SF_COLS_K];
  packed2_ scale_a{static_cast<DTypeQ_>(sf_a_fp8), static_cast<DTypeQ_>(sf_a_fp8)};
  packed2_ scale_b{static_cast<DTypeQ_>(sf_b_fp8), static_cast<DTypeQ_>(sf_b_fp8)};
  *(packed2_*)&b_frag[0] = __hmul2(*(packed2_*)&b_frag[0], scale_a);
  *(packed2_*)&b_frag[1] = __hmul2(*(packed2_*)&b_frag[1], scale_a);
  *(packed2_*)&b_frag[2] = __hmul2(*(packed2_*)&b_frag[2], scale_b);
  *(packed2_*)&b_frag[3] = __hmul2(*(packed2_*)&b_frag[3], scale_b);
}
```
The logic for applying scaling factors in compute_qk is quite intricate, involving SF_COLS_K, sf_base, lane_idx, and __hmul2. Adding comments to explain how sf_base is calculated to correctly index into k_sf_smem and how lane_idx contributes to the thread mapping for scaling would significantly improve the readability and maintainability of this critical section.
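To make that mapping concrete, here is a small host-side Python model of the index arithmetic (the function and its return shape are our invention; only the `sf_base` formula is taken from the diff):

```python
def k_sf_indices(lane_idx, mma_kv, mma_d, sf_cols_k):
    """Model the compute_qk scale-factor indexing.

    In the m16n8k16 B-operand layout, lane t of a warp holds KV rows
    t/4 and t/4 + 8 of the 16-row tile; both rows read the same
    16-element head-dim group `mma_d`. SF smem is row-major:
    sf[kv_row * sf_cols_k + hd_group].
    """
    row_a = mma_kv * 16 + lane_idx // 4   # scales b_frag[0] and b_frag[1]
    row_b = row_a + 8                     # scales b_frag[2] and b_frag[3]
    sf_base = row_a * sf_cols_k + mma_d
    # The `+ 8 * sf_cols_k` offset in the kernel is exactly
    # "same head-dim group, eight KV rows down".
    return sf_base, sf_base + 8 * sf_cols_k, row_a, row_b
```

A quick sanity check: for lane 5 with `sf_cols_k = 8` and `mma_d = 2`, the two indices returned are `1*8+2 = 10` and `9*8+2 = 74`, matching rows 1 and 9 of the tile.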
```cpp
if constexpr (is_fp4_type_v<typename KTraits::DTypeKV>) {
  // Apply scaling factors for V.
  // SF smem is linear: sf[kv_row * SF_COLS + hd_group], SF_COLS = HEAD_DIM_VO/16.
  // For transposed B (V), thread t's KV rows are 2*(t%4)+{0,1} and 2*(t%4)+{8,9}
  // in the mma_kv tile. After swap, b_frag[0,2] cover rows {r0, r0+1} and
  // b_frag[1,3] cover rows {r0+8, r0+9}. Each half2 needs two distinct SFs.
  using DTypeQ_ = typename KTraits::DTypeQ;
  using packed2_ = std::conditional_t<std::is_same_v<DTypeQ_, half>, half2, __nv_bfloat162>;
  constexpr uint32_t SF_COLS_V = KTraits::NUM_MMA_D_VO;  // HEAD_DIM_VO / 16
  uint32_t sf_base = (mma_kv * 16 + 2 * (lane_idx % 4)) * SF_COLS_V + mma_d;
  __nv_fp8_e4m3 sf0_fp8, sf1_fp8, sf2_fp8, sf3_fp8;
  sf0_fp8.__x = v_sf_smem[sf_base];
  sf1_fp8.__x = v_sf_smem[sf_base + SF_COLS_V];
  sf2_fp8.__x = v_sf_smem[sf_base + 8 * SF_COLS_V];
  sf3_fp8.__x = v_sf_smem[sf_base + 9 * SF_COLS_V];
  packed2_ scale_lo{static_cast<DTypeQ_>(sf0_fp8), static_cast<DTypeQ_>(sf1_fp8)};
  packed2_ scale_hi{static_cast<DTypeQ_>(sf2_fp8), static_cast<DTypeQ_>(sf3_fp8)};
  *(packed2_*)&b_frag[0] = __hmul2(*(packed2_*)&b_frag[0], scale_lo);
  *(packed2_*)&b_frag[1] = __hmul2(*(packed2_*)&b_frag[1], scale_hi);
  *(packed2_*)&b_frag[2] = __hmul2(*(packed2_*)&b_frag[2], scale_lo);
  *(packed2_*)&b_frag[3] = __hmul2(*(packed2_*)&b_frag[3], scale_hi);
}
```
Similar to compute_qk, the scaling factor application in compute_sfm_v is complex. Clarifying the calculation of SF_COLS_V, sf_base, and the role of lane_idx in mapping to v_sf_smem would be very helpful. An explanation of how packed2_ scale_lo and scale_hi are constructed and applied would also enhance understanding.
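Likewise for V, the four distinct scale factors per thread can be modeled in host-side Python (hypothetical helper; only the `sf_base` arithmetic comes from the diff):

```python
def v_sf_indices(lane_idx, mma_kv, mma_d, sf_cols_v):
    """Model the compute_sfm_v scale-factor indexing.

    For the transposed B operand, lane t covers the KV row pair
    r0 = 2*(t%4) and r0+1, plus the same pair shifted down by 8 rows,
    so four distinct per-row scale factors are needed per thread.
    """
    r0 = mma_kv * 16 + 2 * (lane_idx % 4)
    base = r0 * sf_cols_v + mma_d
    # sf0/sf1 build scale_lo (rows r0, r0+1);
    # sf2/sf3 build scale_hi (rows r0+8, r0+9).
    return [base, base + sf_cols_v, base + 8 * sf_cols_v, base + 9 * sf_cols_v]
```

For lane 7 with `sf_cols_v = 4` and `mma_d = 1`, this yields the linear indices of rows 6, 7, 14, and 15, which is why K needs one broadcast scale per `half2` while V needs two distinct scales.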
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
benchmarks/routines/attention.py (2)
1197-1201: ⚠️ Potential issue | 🟠 Major: Don't quantize the paged NVFP4 cache twice.
Lines 1036-1059 already build the packed kv_cache plus kv_cache_sf. Lines 1197-1201 then feed that packed uint8 data back into nvfp4_quantize_paged_kv_cache(...), which expects floating-point KV input and discards the scales you just computed. Reuse kv_cache_sf as kv_block_scales here.
Suggested change
```diff
-        if is_nvfp4_kv:
-            kv_cache_nvfp4, kv_block_scales, k_scale, v_scale = (
-                nvfp4_quantize_paged_kv_cache(kv_cache[:, 0], kv_cache[:, 1])
-            )
-            kv_cache = kv_cache_nvfp4
+        if use_nvfp4_kv:
+            kv_block_scales = kv_cache_sf
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@benchmarks/routines/attention.py` around lines 1197 - 1201, The code is re-quantizing an already-packed NVFP4 paged KV cache: when is_nvfp4_kv is true you call nvfp4_quantize_paged_kv_cache(kv_cache[:, 0], kv_cache[:, 1]) on uint8-packed data and overwrite the correct scales; instead, reuse the precomputed packed cache and scales (kv_cache and kv_cache_sf) produced earlier: set kv_cache = kv_cache (or keep existing packed variable) and assign kv_block_scales = kv_cache_sf (and ensure k_scale and v_scale use the previously computed values), removing the nvfp4_quantize_paged_kv_cache call inside the is_nvfp4_kv branch so you don't discard the original scales.
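The failure mode is easy to reproduce with a toy quantizer (a stand-in for `nvfp4_quantize_paged_kv_cache`, not its real implementation): deriving a scale from already-quantized codes produces a scale in code space, and the original scale is lost.

```python
def toy_quantize(block):
    """Toy per-block quantizer: floats in, integer codes plus one scale out.

    Real NVFP4 packs E2M1 nibbles with FP8 scale factors; the shape of the
    double-quantization bug is the same either way.
    """
    scale = max(abs(v) for v in block) / 6.0
    scale = scale if scale > 0 else 1.0
    codes = [round(v / scale) for v in block]
    return codes, scale

block = [1.5, -3.0, 0.75]
codes, scale = toy_quantize(block)  # scale derived from the original floats
# The bug: re-quantizing the codes derives a *new* scale from code space
# and discards `scale`, so later dequantization is off by scale / bad_scale.
bad_codes, bad_scale = toy_quantize([float(c) for c in codes])
```

The fix proposed in the review, reusing the scales already computed (`kv_block_scales = kv_cache_sf`), avoids the second pass entirely.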
865-867: ⚠️ Potential issue | 🔴 Critical: Use one NVFP4 feature-flag name throughout this function.
Line 865 defines is_nvfp4_kv, but the new branches later read use_nvfp4_kv (for example Lines 883, 965, 1036, and 1218). testBatchPrefillWithPagedKVCacheWrapper() currently throws before any benchmark runs.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@benchmarks/routines/attention.py` around lines 865 - 867, The function uses two inconsistent feature-flag names (is_nvfp4_kv and use_nvfp4_kv) causing branches to miss the intended flag; unify them by picking one name (e.g., replace the initial definition is_nvfp4_kv = args.kv_dtype == "nvfp4" with use_nvfp4_kv = args.kv_dtype == "nvfp4" or create use_nvfp4_kv = is_nvfp4_kv immediately after) and update all branch checks (references in the function such as the later conditionals at lines referencing use_nvfp4_kv) to use that single symbol so the NVFP4 path is consistently triggered (also ensure any dtype checks that set kv_dtype remain correct).
include/flashinfer/attention/prefill.cuh (1)
1671-1702: ⚠️ Potential issue | 🟠 Major: These prefill paths now consume FP4 scale tiles without ever producing them.
compute_qk / compute_sfm_v now dereference k_sf_smem and v_sf_smem for FP4 KV, but SinglePrefillWithKVCacheDevice and BatchPrefillWithRaggedKVCacheKernel still only call produce_kv(...). If either path is instantiated with __nv_fp4x2_e2m1, it will multiply against uninitialized shared memory and silently corrupt the result. Please either plumb per-row SF loads into these kernels too, or add a compile-time guard that keeps FP4 limited to the paged path for now.
Also applies to: 2118-2150
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@include/flashinfer/attention/prefill.cuh` around lines 1671 - 1702, The prefill paths call compute_qk and compute_sfm_v which read k_sf_smem/v_sf_smem for FP4 but the kernels SinglePrefillWithKVCacheDevice and BatchPrefillWithRaggedKVCacheKernel only call produce_kv and never populate those scale-factor tiles, so instantiating with __nv_fp4x2_e2m1 will read uninitialized shared memory; fix by adding a compile-time guard that prevents FP4 (e.g. static_assert or if constexpr) in SinglePrefillWithKVCacheDevice and BatchPrefillWithRaggedKVCacheKernel (or the wrapper that calls produce_kv) when KTraits::ScalarType == __nv_fp4x2_e2m1, OR alternatively plumb the per-row SF loads into those kernels by ensuring produce_kv is invoked with the SharedMemFillMode that fills k_sf_smem/v_sf_smem (or explicitly call the SF fill helper) before compute_qk/compute_sfm_v are executed; pick one approach and apply it consistently to both call sites referencing compute_qk, compute_sfm_v, k_sf_smem, v_sf_smem, and produce_kv.
🧹 Nitpick comments (4)
flashinfer/jit/attention/modules.py (1)
1836-1870: Factor the batch-attention setter generation out of this function.
This copies the nullable/scalar assignment rules from generate_additional_params(). A small helper that accepts the target prefix (params, params[i], params.additional_params) would keep the batch path from drifting the next time additional-parameter semantics change.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@flashinfer/jit/attention/modules.py` around lines 1836 - 1870, Extract the logic that builds the param-assignment lines into a small helper (e.g., generate_additional_params_setter(prefix, additional_tensor_names, additional_tensor_dtypes, additional_scalar_names)) and replace the inline batch_additional_params_setter construction with a call to that helper; the helper should implement the same nullable tensor/scalar rules currently duplicated (the conditional branch for var.startswith("maybe") and the scalar formatting) but use the provided target prefix (e.g., "params[i]", "params", or "params.additional_params") when formatting each assignment; update the call sites (the batch path that currently creates batch_additional_params_setter and any other place using generate_additional_params output) to call the new helper so semantics remain identical but the formatting logic is centralized.
.claude/memory/prefill_cuh_structure.md (1)
40-62: Call out the NVFP4 scale-factor path explicitly.
This note still reads like a generic prefill overview. The new FP4-specific pieces (maybe_k_cache_sf / maybe_v_cache_sf, page_produce_kv_sf, and the shared-memory scale buffers consumed by compute_qk / compute_sfm_v) are exactly what future readers will look for in this PR.
Based on learnings: keep documentation in sync with code changes, particularly CLAUDE.md and .claude/skills/, when modifying infrastructure: changes, patterns, new conventions, or deprecations.
Also applies to: 75-86
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.claude/memory/prefill_cuh_structure.md around lines 40 - 62, Update the prefill overview to explicitly document the NVFP4 scale-factor path: describe the new FP4-specific symbols maybe_k_cache_sf and maybe_v_cache_sf, the page_produce_kv_sf path, and the shared-memory scale buffers that compute_qk and compute_sfm_v consume; note where these are emitted/loaded and how they flow through page_produce_kv_sf → shared-memory buffers → compute_qk/compute_sfm_v, and add a short cross-reference to the infra docs/skills that must be updated when changing these conventions so readers can find the FP4 scale-factor behavior quickly.
tests/attention/test_batch_attention.py (1)
308-309: Exercise signed E2M1 codes too.Line 309 clears both sign bits, so this test never covers negative NVFP4 values. A sign-handling regression would still pass here; either remove the mask or add a second signed-data case.
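For context, the mask works because each packed byte holds two E2M1 nibbles whose bit 3 is a sign bit. A small decoder (ours, following the standard E2M1 value table) shows exactly what the mask removes:

```python
# Magnitudes of the eight unsigned E2M1 codes (1 sign, 2 exponent, 1 mantissa bit)
E2M1_LUT = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def decode_e2m1_byte(packed):
    """Decode one packed NVFP4 byte into two floats, low nibble first.

    Bit 3 of each nibble (0x08 in the low nibble, 0x80 in the byte for the
    high nibble) is the sign bit, which is what `packed &= 0x77` clears.
    """
    out = []
    for nib in (packed & 0x0F, packed >> 4):
        mag = E2M1_LUT[nib & 0x7]
        out.append(-mag if nib & 0x8 else mag)
    return out

print(decode_e2m1_byte(0x2A))         # [-1.0, 1.0]: low nibble carries a sign
print(decode_e2m1_byte(0x2A & 0x77))  # [1.0, 1.0]: the mask erased it
```

So a sign-handling regression in the kernel's dequantization would be invisible to any test data that went through this mask.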
Suggested change
```diff
-    packed &= 0x77  # clear bit 3 (0x08) and bit 7 (0x80) to ensure non-negative
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/attention/test_batch_attention.py` around lines 308 - 309, The test currently forces all packed bytes non-negative by applying "packed &= 0x77", so negative NVFP4 (signed E2M1) values are never exercised; either remove the mask expression "packed &= 0x77" to allow both signs, or add a second test case that constructs a signed-data variant (e.g., copy the existing "packed" and set the sign bits for NVFP4 by OR-ing the appropriate bits such as 0x08 and/or 0x80) and run the same assertions on that signed copy so both unsigned and negative NVFP4 paths are covered.
tests/attention/test_batch_prefill_kernels.py (1)
1074-1083: Add at least one causal NVFP4 case.
This matrix hardcodes causal=False, so the new NVFP4 path never exercises the masked/tail-tile logic that changed in the kernel code. A small causal=True case would cover the scale-factor path under masking as well.
Verify each finding against the current code and only fix it if needed. In `@tests/attention/test_batch_prefill_kernels.py` around lines 1074 - 1083, The test matrix currently forces causal=False so the NVFP4 kernel path never hits masked/tail-tile logic; update the test_batch_prefill_with_paged_kv_cache_nvfp4 parameterization (the `@pytest.mark.parametrize`("causal", ...) on the test) to include True (e.g., [False, True]) so at least one run exercises the causal/masked path for NVFP4; keep existing q_dtype values unchanged so the NVFP4 path is still exercised.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@benchmarks/routines/attention.py`:
- Around line 1667-1675: The NVFP4 allowlist is too permissive: in the
use_nvfp4_kv block (variable backends and list nvfp4_unsupported) remove "auto"
and "trtllm-native" from the supported set so ragged NVFP4 only stays enabled
for backends that actually wrap ragged K/V (e.g., "fa2" and "fa3"); update the
allowed list used to compute nvfp4_unsupported from ["fa2", "trtllm-native",
"auto"] to only ["fa2", "fa3"] (or the concrete backends that implement ragged
NVFP4) so backends that don't forward k_sf/v_sf are filtered out.
In `@include/flashinfer/cp_async.cuh`:
- Around line 191-224: The cp.async call in pred_load_128b_from_64b uses
cp_size=8 which doesn’t zero the upper 8 bytes; change the cp.async
invocation(s) to use cp_size=16 and pass src_size as a variable (src_size =
predicate ? 8 : 0) so cp.async zero-fills bytes 8..15 when src_size is 0, and
similarly adjust the kNoFill branch to issue cp.async with cp_size=16 and
src_size conditionally 8 or 0 (instead of cp_size=8); also apply the same fix
pattern to the 32b helper described in the comment: use cp_size appropriate to
the full destination (e.g., 16 for 128b destination or 4 for 32b helper) and
make src_size variable (0 when wanting explicit zero-fill, nonzero when copying)
so cp.async actually zeros the upper bytes.
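The zero-fill semantics this fix relies on can be sketched in Python: cp.async always writes cp-size bytes to shared memory, copying src-size of them from global memory and zeroing the rest. With `cp_size=8`, a false predicate can therefore never clear the upper half of a 16-byte destination (this is a byte-level model of the semantics, not real PTX):

```python
def cp_async_zfill(dst, src, cp_size, src_size):
    """Byte-level model of cp.async's src-size zero-fill behavior.

    cp_size bytes of `dst` are always written: the first src_size bytes
    come from `src`, the remaining cp_size - src_size bytes are zeroed.
    """
    dst[:src_size] = src[:src_size]
    dst[src_size:cp_size] = bytes(cp_size - src_size)

stale = bytearray(b"\xcd" * 16)
cp_async_zfill(stale, b"", cp_size=8, src_size=0)  # shape of the reviewed code
# bytes 8..15 still hold 0xcd: only the first 8 bytes were written

fixed = bytearray(b"\xcd" * 16)
cp_async_zfill(fixed, bytes(range(8)), cp_size=16, src_size=8)  # suggested fix
# bytes 8..15 are now zero-filled along with the 8-byte copy
```

This is why the prompt asks for `cp_size=16` with a variable `src_size` of 8 or 0: the instruction's destination width, not the predicate, determines how many bytes get defined values.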
In `@include/flashinfer/vec_dtypes.cuh`:
- Around line 486-510: The CUDA version gate incorrectly requires both
__CUDACC_VER_MAJOR__ >= 13 and __CUDACC_VER_MINOR__ >= 2, which fails for CUDA
14.x; update the preprocessor condition that guards the fast-path asm (the block
using cvt.rn.bf16x2.e2m1x2 and variable y/b) to check the combined version
(e.g., compare major/minor together or compute a single numeric version) so it
enables the fast path for CUDA >= 13.2 (including 14.0+).
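The version-gate bug is clearest with the check lifted into Python tuple comparison (a sketch of the logic only; the real code is CUDA preprocessor macros):

```python
def buggy_gate(major, minor):
    # Mirrors `__CUDACC_VER_MAJOR__ >= 13 && __CUDACC_VER_MINOR__ >= 2`:
    # CUDA 14.0 has minor == 0, so the fast path is wrongly disabled.
    return major >= 13 and minor >= 2

def fixed_gate(major, minor):
    # Compare (major, minor) as one version: true for every CUDA >= 13.2.
    return (major, minor) >= (13, 2)

print(buggy_gate(14, 0), fixed_gate(14, 0))  # False True
```

The equivalent preprocessor form would compare major and minor together, e.g. `major > 13 || (major == 13 && minor >= 2)`.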
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 9c342064-89d2-4570-b892-646302704193
📒 Files selected for processing (18)
- .claude/memory/MEMORY.md
- .claude/memory/prefill_cuh_structure.md
- benchmarks/routines/attention.py
- flashinfer/attention.py
- flashinfer/jit/attention/modules.py
- flashinfer/jit/utils.py
- flashinfer/prefill.py
- flashinfer/quantization/fp4_quantization.py
- flashinfer/utils.py
- include/flashinfer/attention/persistent.cuh
- include/flashinfer/attention/prefill.cuh
- include/flashinfer/cp_async.cuh
- include/flashinfer/frag_layout_swizzle.cuh
- include/flashinfer/permuted_smem.cuh
- include/flashinfer/vec_dtypes.cuh
- mha_ref.cu
- tests/attention/test_batch_attention.py
- tests/attention/test_batch_prefill_kernels.py
👮 Files not reviewed due to content moderation or server errors (7)
- flashinfer/utils.py
- .claude/memory/MEMORY.md
- flashinfer/jit/utils.py
- flashinfer/attention.py
- include/flashinfer/permuted_smem.cuh
- flashinfer/prefill.py
- flashinfer/quantization/fp4_quantization.py
benchmarks/routines/attention.py
```python
# NVFP4 KV is only supported by fa2/fa3/auto backends
if use_nvfp4_kv:
    nvfp4_unsupported = [
        b for b in backends if b not in ["fa2", "trtllm-native", "auto"]
    ]
    for b in nvfp4_unsupported:
        print(f"[INFO] Backend {b} does not support NVFP4 KV cache. Skipping.")
        backends.remove(b)
```
Ragged NVFP4 should not keep auto or trtllm-native in the allowlist.
auto never gets a wrapper in Lines 1843-1896, and the trtllm-native branch at Lines 1975-1998 runs packed uint8 K/V without ever forwarding k_sf / v_sf. Leaving either backend here turns --kv_dtype nvfp4 into an unsupported benchmark path.
Suggested change
```diff
-            nvfp4_unsupported = [
-                b for b in backends if b not in ["fa2", "trtllm-native", "auto"]
-            ]
+            nvfp4_unsupported = [b for b in backends if b not in ["fa2"]]
```
🤖 Prompt for AI Agents
+ nvfp4_unsupported = [b for b in backends if b not in ["fa2"]]🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@benchmarks/routines/attention.py` around lines 1667 - 1675, The NVFP4
allowlist is too permissive: in the use_nvfp4_kv block (variable backends and
list nvfp4_unsupported) remove "auto" and "trtllm-native" from the supported set
so ragged NVFP4 only stays enabled for backends that actually wrap ragged K/V
(e.g., "fa2" and "fa3"); update the allowed list used to compute
nvfp4_unsupported from ["fa2", "trtllm-native", "auto"] to only ["fa2", "fa3"]
(or the concrete backends that implement ragged NVFP4) so backends that don't
forward k_sf/v_sf are filtered out.
```cpp
template <PrefetchMode prefetch_mode, SharedMemFillMode fill_mode, typename T>
__device__ __forceinline__ void pred_load_128b_from_64b(T* smem_ptr, const T* gmem_ptr,
                                                        bool predicate) {
#ifdef FLASHINFER_CP_ASYNC_ENABLED
  uint32_t smem_int_ptr = static_cast<uint32_t>(__cvta_generic_to_shared(smem_ptr));
  if constexpr (fill_mode == SharedMemFillMode::kFillZero) {
    int src_in_bytes = predicate ? 8 : 0;
    asm volatile("cp.async.ca.shared.global [%0], [%1], %2, %3;\n" ::"r"(smem_int_ptr),
                 "l"(gmem_ptr), "n"(8), "r"(src_in_bytes));
  } else {
    // kNoFill: only issue the copy if predicate is true; cp.async always zeros the upper 8 bytes
    asm volatile(
        "{\n"
        " .reg .pred p;\n"
        " setp.ne.b32 p, %0, 0;\n"
        " @p cp.async.ca.shared.global [%1], [%2], %3, %4;\n"
        "}\n" ::"r"((int)predicate),
        "r"(smem_int_ptr), "l"(gmem_ptr), "n"(8), "n"(8));
  }
#else
  if (predicate) {
    uint64_t* smem_u64 = reinterpret_cast<uint64_t*>(smem_ptr);
    smem_u64[0] = *reinterpret_cast<const uint64_t*>(gmem_ptr);
    // smem_u64[1] = 0xcdcdcdcdcdcdcdcd;
    smem_u64[1] = 0;
  } else {
    if constexpr (fill_mode == SharedMemFillMode::kFillZero) {
      *((uint4*)smem_ptr) = make_uint4(0, 0, 0, 0);
      // *((uint4*)smem_ptr) = make_uint4(0xefef, 0xefef, 0xefef, 0xefef);
    }
  }
#endif
}
```
🧩 Analysis chain
🏁 Script executed:
```shell
# First, let's read the actual file content to see the exact implementation
head -n 250 include/flashinfer/cp_async.cuh | tail -n 70
```
Repository: flashinfer-ai/flashinfer
Length of output: 3192
🌐 Web query:
PTX ISA cp.async.ca.shared.global instruction semantics src-size cp-size parameters
💡 Result:
The PTX ISA instruction cp.async.ca.shared.global performs an asynchronous copy from global memory to shared memory with cache allocation hints. It is available on SM_80+ architectures (Ampere+). Syntax examples from LLVM NVVM intrinsics and forum discussions:

- cp.async.ca.shared.global.4 [dst_shared], [src_global], src_size
- cp.async.ca.shared.global.8 [dst_shared], [src_global], src_size
- cp.async.ca.shared.global.16 [dst_shared], [src_global], src_size

Semantics:

- Asynchronous: the copy is non-blocking; the thread continues execution immediately. Completion requires explicit synchronization like cp.async.wait_all or mbarrier.
- Direction: global (.global) source to shared (.shared) destination.
- Cache hints: ca = cache-at-level (allocates cache lines in L2), specific to global reads.
- Size variants: 4, 8, or 16 bytes per instruction.

src-size and cp-size parameters:

- src-size (third operand): specifies the actual source data size in bytes. Optional in some intrinsics but required for precise hardware behavior (e.g., LLVM nvvm_cp_async_ca_shared_global_4 takes src_size). Matches the transfer size (4/8/16) for aligned copies but allows optimization hints.
- cp-size: the copy size, same as the instruction variant (4/8/16 bytes). Not a separate parameter; encoded in the opcode.

The operation queues the copy in hardware pipelines for overlap with computation. Use with shared memory barriers or wait instructions for consumer threads. For bulk copies, use cp.async.bulk variants (SM_90+). Sources confirm usage in forums (1, 6, 11), LLVM docs (8, 18), PTX ISA docs (2, 3, 4, 12, 19), and CUDA guides (13, 22).
Citations:
- 1: https://forums.developer.nvidia.com/t/problem-about-ptx-instruction-cp-async-ca-shared-global/224219
- 2: https://docs.nvidia.com/cuda/parallel-thread-execution/
- 3: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html?highlight=st+async
- 4: https://docs.nvidia.com/cuda/pdf/ptx_isa_8.5.pdf
- 5: https://forums.developer.nvidia.com/t/ptx-cp-async-purpose-of-ignore-src-and-prefetch-size/217746
- 6: https://gevtushenko.github.io/cccl/libcudacxx/ptx/instructions/cp_async_bulk.html
- 7: https://llvm.org/docs/NVPTXUsage.html
- 8: https://forums.developer.nvidia.com/t/cuda-ptx-cp-async-only-supports-global-to-shared-memory-copy/245979
- 9: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html?highlight=async
- 10: https://docs.nvidia.com/cuda/cuda-programming-guide/04-special-topics/async-copies.html
- 11: https://reviews.llvm.org/D150820?id=523455
- 12: https://docs.nvidia.com/cuda/pdf/ptx_isa_8.8.pdf
- 13: https://docs.nvidia.com/cuda/cuda-programming-guide/03-advanced/advanced-kernel-programming.html
🏁 Script executed:
```shell
# Let's search for the actual PTX documentation reference or any existing implementation
# that uses cp.async in the codebase to understand the pattern
rg "cp\.async" include/flashinfer/ --type cuh --type h -A 2 -B 2
```
Repository: flashinfer-ai/flashinfer
Length of output: 95
🌐 Web query:
PTX ISA cp.async zero-fill semantics src-size parameter must be less than copy size
💡 Result:
In NVIDIA PTX ISA, the cp.async instructions (e.g., cp.async.cg.shared.global) support an optional src_size parameter for zero-fill semantics. When src_size is specified and is less than the copy size (typically 16 bytes), the instruction copies src_size bytes from the source and zero-fills the remaining bytes in the destination. This allows handling unaligned or partial copies while ensuring the full destination granularity (16 bytes) is written, with zeros in the unused portion, preventing data leakage or corruption in shared memory.

This feature is used in high-performance kernels for efficient asynchronous global-to-shared copies, especially with predicates or odd-sized tiles, as seen in CUTLASS and CCCL implementations (e.g., cp_async_zfill traits). The requirement ensures valid operation without reading beyond source bounds, aligning with hardware constraints on Ampere (SM_80+) and later architectures. Official PTX docs confirm cp.async variants with src_size for zero-fill behavior, though the exact syntax is in the instruction-set details (PTX ISA 8.5+).

Primary sources: NVIDIA PTX ISA docs (versions 8.5-9.2), CUTLASS issues/PRs on cp.async.zfill, LLVM NVPTX intrinsics adding src_size to nvvm_cp_async*, NVIDIA CCCL cp.async.bulk implementations requiring 16-byte multiples with alignment.
Citations:
- 1: https://docs.nvidia.com/cuda/parallel-thread-execution/
- 2: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html?highlight=async
- 3: https://docs.nvidia.com/cuda/pdf/ptx_isa_8.8.pdf
- 4: https://docs.nvidia.com/cuda/parallel-thread-execution/contents.html
- 5: https://gevtushenko.github.io/cccl/libcudacxx/ptx/instructions/cp_async_bulk.html
- 6: https://ita9naiwa.github.io/mlsys/2025/10/05/ptx-mental-model.html
- 7: https://docs.nvidia.com/cuda/cuda-programming-guide/04-special-topics/async-copies.html
- 8: cp async fallback NVIDIA/cutlass#761
- 9: [BUG] Non-ZFILL CP_ASYNC copy trait cause buffer overwriting with predicated copy NVIDIA/cutlass#1716
- 10: https://reviews.llvm.org/D150820?id=523455
🌐 Web query:
NVIDIA PTX cp.async.ca.shared.global src-size equal to copy-size valid
💡 Result:
Yes, for the NVIDIA PTX instruction cp.async.ca.shared.global, the source size (determined by the operand type, e.g., .u16 for 16 bytes, .u8 for 8 bytes) must equal the copy size. This instruction performs fixed-size asynchronous copies from global memory to shared memory, where the copy granularity is specified by the type size. Using mismatched sizes is invalid and may lead to undefined behavior or assembly errors. This follows the standard PTX async copy semantics where operand type dictates transfer size, as seen in related cp.async instructions and forum discussions on size requirements (e.g., 16-byte minimum for certain variants). Official PTX ISA documentation confirms fixed-size transfers for these operations, with no variable src-size parameter in the syntax.
Citations:
- 1: https://forums.developer.nvidia.com/t/problem-about-ptx-instruction-cp-async-ca-shared-global/224219
- 2: https://forums.developer.nvidia.com/t/cg-cache-modifier-with-async-copy-size-of-less-than-16/349443
- 3: https://docs.nvidia.com/cuda/cuda-programming-guide/04-special-topics/async-copies.html
- 4: https://forums.developer.nvidia.com/t/cuda-ptx-cp-async-only-supports-global-to-shared-memory-copy/245979
- 5: https://reviews.llvm.org/D100394
🏁 Script executed:
# Let's search for actual documentation or CUTLASS usage patterns
rg "cp\.async" include/flashinfer/ -A 2 -B 2

Repository: flashinfer-ai/flashinfer
Length of output: 9023
🌐 Web query:
NVIDIA PTX ISA 8.8 cp.async.ca.shared.global instruction specification detailed syntax
💡 Result:
The NVIDIA PTX ISA 8.8 documentation is available as a PDF at https://docs.nvidia.com/cuda/pdf/ptx_isa_8.8.pdf (published 2025-05-31). The cp.async.ca.shared.global instruction is an asynchronous copy operation from global memory to shared memory, introduced for compute capability 8.0+ (Ampere and later architectures). Syntax: cp.async.ca.shared.global [<dst_shared_address>], [<src_global_address>], ; - ca: cache allocation hint (cache at L2). - shared.global: specifies destination in shared memory, source in global memory. - It queues an asynchronous transfer; use cp.async.commit_group, cp.async.wait_group, or cp.async.wait_all to manage groups and synchronization. - Only supports global-to-shared direction (not shared-to-global for this variant). - Detailed semantics, modifiers, and examples are in PTX ISA 8.8 section 9.7.9 "Data Movement and Conversion Instructions" subsection on asynchronous copies (cp.async family), around 9.7.9.25.3.1. HTML docs at https://docs.nvidia.com/cuda/parallel-thread-execution/ reference the same content, with contents listing relevant sections like 9.7.9.25 Asynchronous Copies. Forums confirm usage for global-to-shared async loads with cache hints.
Citations:
- 1: https://docs.nvidia.com/cuda/pdf/ptx_isa_8.8.pdf
- 2: https://docs.nvidia.com/cuda/hopper-tuning-guide/parallel-thread-execution/contents.html
- 3: https://docs.nvidia.com/cuda/hopper-tuning-guide/parallel-thread-execution/index.html
- 4: https://www.aidoczh.com/cuda/parallel-thread-execution/contents.html
- 5: https://docs.nvidia.com/cuda/cuda-programming-guide/04-special-topics/async-copies.html
- 6: https://forums.developer.nvidia.com/t/problem-about-ptx-instruction-cp-async-ca-shared-global/224219
- 7: https://forums.developer.nvidia.com/t/cuda-ptx-cp-async-only-supports-global-to-shared-memory-copy/245979
- 8: https://stackoverflow.com/questions/66145682/how-do-compute-capabilities-7-x-8-x-assist-cooperative-group-operations
🌐 Web query:
CUTLASS cp.async.zfill implementation copy size src size parameter semantics
💡 Result:
In CUTLASS's CuTe library, cp.async.zfill refers to asynchronous copy operations using PTX cp.async.*.zfill instructions, wrapped in Copy_Atom traits like SM80_CP_ASYNC_CACHEGLOBAL_ZFILL. The copy size is determined by the source tensor partition size (thr_copy.partition_S(tG)) and destination partition size (thr_copy.partition_D(tS)), typically vector types like uint128_t for 16 bytes. For non-zfill variants (e.g., SM80_CP_ASYNC_CACHEGLOBAL), when used with predication (copy_if), there's a bug where it silently re-dispatches to the _ZFILL trait (lines 77-82 in copy_traits_sm80.hpp), causing zero-filling of predicated-out destination locations regardless of predication, leading to buffer overwriting. ZFILL variants explicitly zero-fill the destination beyond the source data extent, using the guard bit in cp.async.zfill to specify bytes to write vs. zero. The src size parameter semantics: the copy atom loads the full vector from source (if available), but zfill ensures destination lanes beyond source valid data are zeroed. Predication should mask entire copies, but the bug causes zfill behavior even without _ZFILL. Workaround: manually loop with if(p(i)) copy(...) or use ZFILL explicitly if zero-filling is desired. Primary sources confirm the implementation details and bug behavior.
Citations:
- 1: [BUG] Non-ZFILL CP_ASYNC copy trait cause buffer overwriting with predicated copy NVIDIA/cutlass#1716
- 2: https://github.com/NVIDIA/cutlass/blob/main/include/cute/atom/copy_traits_sm90_im2col.hpp
- 3: fix Copy_Traits<SM80_CP_ASYNC_*_ZFILL> without definition of with(pred) function NVIDIA/cutlass#1887
- 4: cp async fallback NVIDIA/cutlass#761
- 5: https://ipd.graylab.jhu.edu/rfdiffusion2/cutlass-3.5.1/include/cute/atom/
🏁 Script executed:
# Let's check how cp.async.cg (the working variant) differs from cp.async.ca
cat -n include/flashinfer/cp_async.cuh | sed -n '80,160p'

Repository: flashinfer-ai/flashinfer
Length of output: 3793
🌐 Web query:
PTX ISA cp.async.ca vs cp.async.cg cache hints semantics differences
💡 Result:
In NVIDIA PTX ISA, cp.async is an asynchronous copy instruction used primarily for efficient data movement between global memory and shared memory in CUDA kernels, introduced in Ampere (SM_80+) architectures. The suffixes .ca and .cg are cache hint operators applied to the source operand (typically global memory). cp.async.ca.shared.global [shared_dest], [global_src], size; - .ca: Cache at L2 (request data to be cached in L2 cache). Suitable for data reuse across warps or blocks where L2 hit is beneficial. cp.async.cg.shared.global [shared_dest], [global_src], size; - .cg: Cache at global (streaming load to L1/L2 but with global caching policy, bypassing or weakly caching in L1, streaming semantics). Used for sequential/streaming accesses with less reuse, reducing L1 pollution. Key differences in semantics: - .ca prefers L2 caching for better hit rates in reuse scenarios. - .cg optimizes for streaming, minimizing cache pressure in L1 while still leveraging L2. These hints guide hardware prefetching and caching behavior but are advisory; actual behavior depends on GPU architecture (Ampere, Hopper). Behavior may differ across architectures (e.g., Ampere vs Hopper as noted in forums). For precise details, refer to PTX ISA documentation section on Data Movement Instructions and Cache Operators (9.7.9.1). cp.async.bulk variants exist for larger transfers with additional qualifiers.
Citations:
- 1: https://docs.nvidia.com/cuda/parallel-thread-execution/
- 2: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html?highlight=async
- 3: https://research.meekolab.com/messing-around-with-gpus-again
- 4: https://forums.developer.nvidia.com/t/cg-cache-modifier-with-async-copy-size-of-less-than-16/349443
- 5: https://docs.nvidia.com/cuda/parallel-thread-execution/contents.html
- 6: https://docs.nvidia.com/cuda/cuda-programming-guide/03-advanced/advanced-kernel-programming.html
- 7: https://forums.developer.nvidia.com/t/cuda-ptx-cp-async-cg-performs-differently-on-ampere-and-hopper/298612
🌐 Web query:
"cp.async" "src_in_bytes" "cp_size" PTX semantics zero fill behavior specification
💡 Result:
The PTX instruction "cp.async" is NVIDIA's asynchronous copy instruction for GPU memory operations, introduced in SM_80+ architectures (Ampere and later). It enables non-blocking data transfers, typically from global to shared memory, overlapping compute with memory copies. Higher-level CUDA APIs like cuda::memcpy_async compile to cp.async PTX under the hood. Key semantics from PTX ISA and related docs: - Syntax examples: cp.async[.cg][.collective].shared.global[.collective_type][.sem][.wait_id][.noinc][.nocomplete][.src_in_bytes][.cp_size][.tenant] [dst], [src], src_size, [mbarrier]; - .src_in_bytes: Specifies source size in bytes (allows flexible sizing). - .cp_size: Likely specifies copy size or related parameter (exact PTX docs needed for full syntax). - Operations are asynchronous; completion tracked via mbarrier (memory barrier) objects, with instructions like cp.async.mbarrier.arrive, cp.async.wait_group, etc. - Variants include cp.async.bulk for bulk transfers, cp.async.bulk.tensor for tensor memory access, with qualifiers like .mbarrier::complete_tx::bytes, .multicast::cluster (SM_90a+). Zero fill behavior: No explicit mention in available sources of automatic zero-filling for cp.async. Unlike some bulk copy instructions (e.g., certain TMA or cp.async.bulk with .ignore_oob), standard cp.async copies the specified bytes without documented zero-padding or zero-fill for untransferred portions. If source size < destination allocation or partial copies occur, destination may retain prior contents (garbage) unless explicitly initialized. For zero-initialization, use separate memzero or cp.async with zero source. Confirm with latest PTX ISA 9.2 docs section on "Asynchronous Copy Instructions" for precise behavior on misalignment/partial copies. Authoritative sources: NVIDIA PTX ISA 9.2 (docs.nvidia.com/cuda/parallel-thread-execution/), CUDA Programming Guide async copies section.
Citations:
- 1: https://docs.nvidia.com/cuda/parallel-thread-execution/
- 2: https://gevtushenko.github.io/cccl/libcudacxx/ptx/instructions/cp_async_bulk.html
- 3: https://github.com/HazyResearch/ThunderKittens/pull/97/files
- 4: https://reviews.llvm.org/D100394
- 5: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html?highlight=async
- 6: https://reviews.llvm.org/D150820?id=523455
- 7: https://www.mail-archive.com/cfe-commits@lists.llvm.org/msg232455.html
🏁 Script executed:
# Let me look at the complete pred_load_32b function to see the exact implementation
cat -n include/flashinfer/cp_async.cuh | sed -n '226,270p'

Repository: flashinfer-ai/flashinfer
Length of output: 1967
🌐 Web query:
NVIDIA PTX ISA cp.async "src_size must be less than copy size" zero fill constraint specification
💡 Result:
The NVIDIA PTX ISA cp.async instructions (introduced in SM_80/Ampere and later) perform asynchronous copies, typically from global to shared memory. Variants like cp.async.zfill (or _ZFILL in CUTLASS) zero-fill the destination beyond the available source data. No official NVIDIA PTX documentation explicitly states a constraint "src_size must be less than copy size". Searches across PTX ISA docs (v8.5, v8.8, v9.2 HTML/PDF), CUDA Programming Guide, and developer forums did not yield this exact phrase or detailed src_size parameter spec. However, evidence from libraries and tools indicates related constraints on copy granularity for zero-fill modes: - cp.async requires minimum transfer sizes: 4 bytes minimum (Triton-lang error for 2 bytes [6]); 16 bytes for certain modifiers like .cg [3]. - CUTLASS cp_async_zfill falls back to synchronous copies for 1-2 byte accesses, implying hardware cp.async.zfill expects ElementSize >=4 bytes (4,8,16) [7]. - LLVM/Clang NVPTX intrinsics note optional src_size argument in later CUDA 11+ [15]. - cp.async.bulk requires 16-byte alignment and size multiple of 16 [1,8]. The constraint likely refers to a compiler/assembler validation or library assertion in zero-fill variants (e.g., CUTLASS), where src_size (available valid source bytes) must be < copy_size (fixed hardware granule, e.g., 16B) to trigger zero-filling the remainder. If src_size == copy_size, use non-zfill cp.async; if src_size > copy_size, invalid/overflow. This ensures proper zero-padding for misaligned/trailing data without source reads beyond bounds. For authoritative spec, consult latest PTX ISA PDF (e.g., ptx_isa_9.2.pdf "Asynchronous Copy" section in Data Movement Instructions, 9.7.15 or similar), as HTML versions lack full instruction details. Forum [2] discusses ignore-src (ignores bytes > src_size, zero-fills), supporting partial-src semantics.
Citations:
- 1: https://nvidia.github.io/cccl/libcudacxx/ptx/instructions/cp_async_bulk.html
- 2: https://forums.developer.nvidia.com/t/cg-cache-modifier-with-async-copy-size-of-less-than-16/349443
- 3: https://docs.nvidia.com/cuda/parallel-thread-execution/
- 4: error: cp.async does not support transfers smaller than 4 bytes (2) triton-lang/triton#8023
- 5: cp async fallback NVIDIA/cutlass#761
- 6: https://gevtushenko.github.io/cccl/libcudacxx/ptx/instructions/cp_async_bulk.html
- 7: https://docs.nvidia.com/cuda/cuda-programming-guide/04-special-topics/async-copies.html
- 8: https://reviews.llvm.org/D100394
- 9: https://docs.nvidia.com/cuda/pdf/ptx_isa_8.8.pdf
🏁 Script executed:
# Check git history or tests for pred_load_128b to see if this pattern is validated
git log --oneline include/flashinfer/cp_async.cuh | head -20

Repository: flashinfer-ai/flashinfer
Length of output: 77
🏁 Script executed:
# Search for actual usage of these functions to understand context
rg "pred_load_128b|pred_load_32b" include/flashinfer/ --type c --type cpp --type h | head -30

Repository: flashinfer-ai/flashinfer
Length of output: 50
🏁 Script executed:
# Let's search for actual usage of pred_load_128b to see if it's tested or used
rg "pred_load_128b\|pred_load_32b" --type-add "cuda:*.cu" --type-add "cuh:*.cuh" -B 2 -A 2

Repository: flashinfer-ai/flashinfer
Length of output: 50
🏁 Script executed:
# Check if there are any tests that validate the cp.async behavior
find . -name "*test*" -type f | grep -i "cp_async\|async" | head -10

Repository: flashinfer-ai/flashinfer
Length of output: 50
🏁 Script executed:
# Let's verify the exact semantics by looking at the comment in the kNoFill branch
# which explicitly states what cp.async does
cat -n include/flashinfer/cp_async.cuh | sed -n '210,225p'

Repository: flashinfer-ai/flashinfer
Length of output: 619
🏁 Script executed:
# Check if there are any related issues or PRs discussing this pattern
git log --all --oneline --grep="cp.async\|pred_load" 2>/dev/null | head -20

Repository: flashinfer-ai/flashinfer
Length of output: 223
🏁 Script executed:
# Let's check the commit that mentions "zero fill for cp.async" to understand the semantics
git show 541d25b --stat | head -20

Repository: flashinfer-ai/flashinfer
Length of output: 323
🏁 Script executed:
# Check the diff of that commit to see what was changed
git show 541d25b | head -100

Repository: flashinfer-ai/flashinfer
Length of output: 1504
🏁 Script executed:
# Let's also check the most recent commit that modified this file
git log --oneline -1 include/flashinfer/cp_async.cuh

Repository: flashinfer-ai/flashinfer
Length of output: 77
🏁 Script executed:
# Check when pred_load_128b_from_64b was added
git log --oneline include/flashinfer/cp_async.cuh | grep -i "load_128b\|from_64b" || git log -p include/flashinfer/cp_async.cuh | grep -B 5 "pred_load_128b_from_64b" | head -20

Repository: flashinfer-ai/flashinfer
Length of output: 50
🏁 Script executed:
# Let's look at the actual line numbers and comments more carefully
cat -n include/flashinfer/cp_async.cuh | sed -n '180,200p'

Repository: flashinfer-ai/flashinfer
Length of output: 1366
🏁 Script executed:
# Let me check the comment more carefully - the code says "cp.async always zeros the upper 8 bytes"
# This is a critical claim about cp.async behavior
cat -n include/flashinfer/cp_async.cuh | sed -n '200,215p'

Repository: flashinfer-ai/flashinfer
Length of output: 739
🌐 Web query:
cp.async "always zeros" upper bytes PTX behavior semantics
💡 Result:
In NVIDIA PTX, the cp.async instruction family performs asynchronous copies, typically from global to shared memory, available on SM 8.0+ (Ampere+). Standard cp.async variants (e.g., cp.async.ca.shared.global [.b32|.b64|.b128]) require source and destination alignments matching the vector size (4/8/16 bytes), and copies exact byte counts specified. Misalignment or smaller sizes lead to PTX assembly errors or fallback behaviors in libraries. A key semantic is the ZFILL variant (e.g., cp.async.cg.shared.global.zfill [.b32|.b64]), which always zero-fills the upper unused bytes in the destination vector when the source provides fewer bytes than the vector width. This is used in libraries like CUTLASS for handling unaligned or sub-vector copies safely, preventing buffer overwrites or garbage data. For example, in CUTLASS Copy_Atom<SM80_CP_ASYNC_CACHEGLOBAL_ZFILL<...>>, ZFILL ensures upper bytes are zeroed, while non-ZFILL may silently dispatch to ZFILL or cause overwriting issues with predicates (as noted in CUTLASS issues #1716, #1768). Non-ZFILL cp.async expects full vector loads; partial loads trigger errors like "unexpected value '8', expected 16". PTX requires 16-byte alignment for many async copies, with size multiples of vector width. Upper bytes behavior: ZFILL explicitly zeros them ("always zeros"); standard variants copy source data exactly or undefined on misalignment. This matches developer observations in CUTLASS PR #761 (fallbacks for 1-2 byte cases) and issues where non-ZFILL overwrites with zeros unexpectedly due to trait dispatch.
Citations:
- 1: cp async fallback NVIDIA/cutlass#761
- 2: [BUG] Non-ZFILL CP_ASYNC copy trait cause buffer overwriting with predicated copy NVIDIA/cutlass#1716
- 3: fix Copy_Traits<SM80_CP_ASYNC_*_ZFILL> without definition of with(pred) function NVIDIA/cutlass#1887
- 4: [QST] SM80_CP_ASYNC_CACHEGLOBAL doesn't allow anything but 128 bit NVIDIA/cutlass#1768
- 5: https://docs.nvidia.com/cuda/cuda-programming-guide/04-special-topics/async-copies.html
- 6: https://gevtushenko.github.io/cccl/libcudacxx/ptx/instructions/cp_async_bulk.html
🏁 Script executed:
# Let me check if there's a .zfill variant mentioned anywhere in the file
rg "zfill|ZFILL" include/flashinfer/cp_async.cuh

Repository: flashinfer-ai/flashinfer
Length of output: 50
🏁 Script executed:
# Check what the actual difference is between the kFillZero and kNoFill branches
# and understand the intended zero-fill semantics
cat -n include/flashinfer/cp_async.cuh | sed -n '191,224p' | cat -A

Repository: flashinfer-ai/flashinfer
Length of output: 1786
cp.async copy size is incorrect; the 64b helper cannot zero the upper half of the 128b destination.
PTX cp.async only zero-fills bytes beyond the declared copy size (cp_size), so cp_size=8 with src_size=8 cannot zero bytes 8-15 of the destination. The 64b helper needs cp_size=16 with variable src_size (8 when predicate is true, 0 when false) to properly load 64 bits and zero-fill the upper 64 bits. Similarly, the 32b helper's fill_mode=kFillZero branch needs restructuring: use cp_size=4 for non-zero predicate paths, and cp_size=4, src_size=0 for explicit zero-fill when predicate is false.
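The rule above can be sanity-checked with a tiny host-side model (Python, purely illustrative — `cp_async_model` is a made-up helper emulating the documented cp-size/src-size semantics, not PTX):

```python
# Host-side model (illustrative, not PTX) of cp.async's cp-size / src-size
# zero-fill rule: src_size bytes are copied, bytes [src_size, cp_size) of the
# destination are zeroed, and bytes at or beyond cp_size are never touched.

def cp_async_model(smem: bytearray, dst_off: int, src: bytes,
                   cp_size: int, src_size: int) -> None:
    assert 0 <= src_size <= cp_size
    smem[dst_off:dst_off + src_size] = src[:src_size]
    smem[dst_off + src_size:dst_off + cp_size] = bytes(cp_size - src_size)

# cp_size=8, src_size=8 (the current code): bytes 8..15 of the slot stay stale.
stale = bytearray(b"\xff" * 16)
cp_async_model(stale, 0, b"\x11" * 8, cp_size=8, src_size=8)

# cp_size=16, src_size=8 (the fix): bytes 8..15 are zero-filled as intended.
fixed = bytearray(b"\xff" * 16)
cp_async_model(fixed, 0, b"\x11" * 8, cp_size=16, src_size=8)

print(stale.hex(), fixed.hex())
```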
Suggested fix sketch
template <PrefetchMode prefetch_mode, SharedMemFillMode fill_mode, typename T>
__device__ __forceinline__ void pred_load_128b_from_64b(T* smem_ptr, const T* gmem_ptr,
                                                        bool predicate) {
#ifdef FLASHINFER_CP_ASYNC_ENABLED
  uint32_t smem_int_ptr = static_cast<uint32_t>(__cvta_generic_to_shared(smem_ptr));
  if constexpr (fill_mode == SharedMemFillMode::kFillZero) {
    int src_in_bytes = predicate ? 8 : 0;
-   asm volatile("cp.async.ca.shared.global [%0], [%1], %2, %3;\n" ::"r"(smem_int_ptr),
-                "l"(gmem_ptr), "n"(8), "r"(src_in_bytes));
+   asm volatile("cp.async.ca.shared.global [%0], [%1], %2, %3;\n" ::"r"(smem_int_ptr),
+                "l"(gmem_ptr), "n"(16), "r"(src_in_bytes));
  } else {
    asm volatile(
        "{\n"
        " .reg .pred p;\n"
        " setp.ne.b32 p, %0, 0;\n"
        " @p cp.async.ca.shared.global [%1], [%2], %3, %4;\n"
        "}\n" ::"r"((int)predicate),
-       "r"(smem_int_ptr), "l"(gmem_ptr), "n"(8), "n"(8));
+       "r"(smem_int_ptr), "l"(gmem_ptr), "n"(16), "n"(8));
  }
#else
  ...
#endif
}

template <SharedMemFillMode fill_mode>
__device__ __forceinline__ void pred_load_32b(uint32_t* smem_ptr, const uint32_t* gmem_ptr,
                                              bool predicate) {
#ifdef FLASHINFER_CP_ASYNC_ENABLED
  uint32_t smem_int_ptr = static_cast<uint32_t>(__cvta_generic_to_shared(smem_ptr));
  if constexpr (fill_mode == SharedMemFillMode::kFillZero) {
-   int src_in_bytes = predicate ? 4 : 0;
-   asm volatile("cp.async.ca.shared.global [%0], [%1], %2, %3;\n" ::"r"(smem_int_ptr),
-                "l"(gmem_ptr), "n"(4), "r"(src_in_bytes));
+   if (predicate) {
+     asm volatile("cp.async.ca.shared.global [%0], [%1], %2;\n" ::"r"(smem_int_ptr),
+                  "l"(gmem_ptr), "n"(4));
+   } else {
+     asm volatile("cp.async.ca.shared.global [%0], [%1], %2, %3;\n" ::"r"(smem_int_ptr),
+                  "l"(gmem_ptr), "n"(4), "r"(0));
+   }
  } else {
    ...
  }
#else
  ...
#endif
}

Also applies to: 236-244
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@include/flashinfer/cp_async.cuh` around lines 191 - 224, The cp.async call in
pred_load_128b_from_64b uses cp_size=8 which doesn’t zero the upper 8 bytes;
change the cp.async invocation(s) to use cp_size=16 and pass src_size as a
variable (src_size = predicate ? 8 : 0) so cp.async zero-fills bytes 8..15 when
src_size is 0, and similarly adjust the kNoFill branch to issue cp.async with
cp_size=16 and src_size conditionally 8 or 0 (instead of cp_size=8); also apply
the same fix pattern to the 32b helper described in the comment: use cp_size
appropriate to the full destination (e.g., 16 for 128b destination or 4 for 32b
helper) and make src_size variable (0 when wanting explicit zero-fill, nonzero
when copying) so cp.async actually zeros the upper bytes.
#if (defined __CUDACC_VER_MAJOR__) && (__CUDACC_VER_MAJOR__ >= 13) && \
    (defined __CUDACC_VER_MINOR__) && (__CUDACC_VER_MINOR__ >= 2)
  // cvt.rn.bf16x2.e2m1x2 requires CUDA Toolkit >= 13.2
  asm volatile(
      "{\n"
      ".reg .b8 fp4_byte;\n"
      "mov.b32 {fp4_byte, _, _, _}, %1;\n"
      "cvt.rn.bf16x2.e2m1x2 %0, fp4_byte;\n"
      "}"
      : "=r"(y)
      : "r"(b));
#else
  // Fallback: convert e2m1 -> fp16 -> bf16 when cvt.rn.bf16x2.e2m1x2 is unavailable
  uint32_t fp16x2;
  asm volatile(
      "{\n"
      ".reg .b8 fp4_byte;\n"
      "mov.b32 {fp4_byte, _, _, _}, %1;\n"
      "cvt.rn.f16x2.e2m1x2 %0, fp4_byte;\n"
      "}"
      : "=r"(fp16x2)
      : "r"(b));
  __half2 h2 = reinterpret_cast<__half2&>(fp16x2);
  __nv_bfloat162 bf16x2 = __float22bfloat162_rn(__half22float2(h2));
  y = reinterpret_cast<uint32_t&>(bf16x2);
🧩 Analysis chain
🌐 Web query:
For NVCC, what values do __CUDACC_VER_MAJOR__ and __CUDACC_VER_MINOR__ take on CUDA 14.0+, and what preprocessor expression is recommended for checking "CUDA Toolkit >= 13.2"?
💡 Result:
For NVCC in CUDA 14.0+, __CUDACC_VER_MAJOR__ is the major version number of nvcc (e.g., 14 for CUDA 14.x) and __CUDACC_VER_MINOR__ is the minor version number of nvcc (e.g., 0 for CUDA 14.0). These macros are predefined by nvcc when compiling CUDA source files and correspond to the nvcc compiler version, which matches the CUDA Toolkit version. The recommended preprocessor expression for checking "CUDA Toolkit >= 13.2" is (__CUDACC_VER_MAJOR__ > 13) || (__CUDACC_VER_MAJOR__ == 13 && __CUDACC_VER_MINOR__ >= 2), assuming nvcc version aligns with toolkit version as per release notes (e.g., CUDA NVCC 13.2.51 for CUDA 13.2).
Citations:
- 1: https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/
- 2: https://docs.nvidia.com/cuda/pdf/CUDA_Compiler_Driver_NVCC.pdf
- 3: https://stackoverflow.com/questions/50648671/how-do-i-convert-a-cudacc-ver-value-into-a-major-minor-build-triplet
- 4: https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html
- 5: http://developer.nvidia.com/cuda-toolkit/whatsnew
Fix the CUDA-version gate to enable the fast path on CUDA 14.x and later.
Lines 486-487 check __CUDACC_VER_MAJOR__ >= 13 and separately __CUDACC_VER_MINOR__ >= 2, which breaks on CUDA 14.0 (where __CUDACC_VER_MAJOR__=14 and __CUDACC_VER_MINOR__=0). The condition should combine the version numbers to properly detect CUDA >= 13.2.
Suggested fix
-#if (defined __CUDACC_VER_MAJOR__) && (__CUDACC_VER_MAJOR__ >= 13) && \
- (defined __CUDACC_VER_MINOR__) && (__CUDACC_VER_MINOR__ >= 2)
+#if defined(__CUDACC_VER_MAJOR__) && defined(__CUDACC_VER_MINOR__) && \
+    ((__CUDACC_VER_MAJOR__ * 10000 + __CUDACC_VER_MINOR__ * 100) >= 130200)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@include/flashinfer/vec_dtypes.cuh` around lines 486 - 510, The CUDA version
gate incorrectly requires both __CUDACC_VER_MAJOR__ >= 13 and
__CUDACC_VER_MINOR__ >= 2, which fails for CUDA 14.x; update the preprocessor
condition that guards the fast-path asm (the block using cvt.rn.bf16x2.e2m1x2
and variable y/b) to check the combined version (e.g., compare major/minor
together or compute a single numeric version) so it enables the fast path for
CUDA >= 13.2 (including 14.0+).
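The failure mode is easy to reproduce outside the preprocessor (a Python sketch of the two gating strategies; function names are illustrative):

```python
# Why the original gate breaks on CUDA 14.0: it compares the major and minor
# version numbers independently, so (14, 0) fails the minor check even though
# 14.0 > 13.2. Comparing the pair lexicographically — equivalent to the
# combined-arithmetic preprocessor fix — handles later majors correctly.

def naive_gate(major: int, minor: int) -> bool:
    return major >= 13 and minor >= 2        # mirrors the buggy #if

def combined_gate(major: int, minor: int) -> bool:
    return (major, minor) >= (13, 2)         # mirrors the fixed #if

print(naive_gate(13, 2), combined_gate(13, 2))   # True True
print(naive_gate(14, 0), combined_gate(14, 0))   # False True  <- the bug
print(naive_gate(13, 1), combined_gate(13, 1))   # False False
```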
Actionable comments posted: 2
🧹 Nitpick comments (2)
tests/attention/test_batch_prefill_kernels.py (1)
1151-1165: Please add an asymmetric head_dim_qk/head_dim_vo case here.

head_dim_vo is omitted, so this suite only covers the default head_dim_vo == head_dim_qk path. A 192/128-style case would exercise the packed-V sizing logic that the current fixture cannot catch. The same omission exists in tests/attention/test_batch_attention.py, so it would be worth updating both together.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/attention/test_batch_prefill_kernels.py` around lines 1151 - 1165, The test only exercises symmetric head sizes because wrapper.plan is called with head_dim only; add an asymmetric case by passing explicit head_dim_qk and head_dim_vo arguments to wrapper.plan (for example head_dim_qk=192, head_dim_vo=128) so the packed-V sizing logic is exercised; update the wrapper.plan invocation in tests/attention/test_batch_prefill_kernels.py (and mirror the same change in tests/attention/test_batch_attention.py) to include these two explicit parameters instead of relying on the default head_dim equality.

include/flashinfer/attention/prefill.cuh (1)
449-498: Consider adding a null check for defensive programming.

If sf_ptr is nullptr but is_fp4_type_v<DTypeKV> is true, the function computes offsets and calls pred_load_32b with an invalid source pointer. While the design assumes FP4 usage implies scales are provided, a null check would add robustness:

template <bool produce_v, typename KTraits, typename IdType>
__device__ __forceinline__ void page_produce_kv_sf(
    typename KTraits::SharedStorage* smem_storage, uint8_t* sf_ptr, ...) {
  if constexpr (!is_fp4_type_v<typename KTraits::DTypeKV>) return;
+ if (sf_ptr == nullptr) return;

This prevents undefined behavior if FP4 is compiled but scales are accidentally omitted at runtime.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@include/flashinfer/attention/prefill.cuh` around lines 449 - 498, The function page_produce_kv_sf may dereference sf_ptr when is_fp4_type_v<typename KTraits::DTypeKV> is true; add a defensive null check at the start of page_produce_kv_sf (after the is_fp4_type_v constexpr) that returns early if sf_ptr == nullptr to avoid computing sf_gmem_offset and calling cp_async::pred_load_32b with an invalid source pointer; keep the check independent of produce_v and ensure it triggers before the NUM_SF_ITERS loop so symbols page_produce_kv_sf, sf_ptr, is_fp4_type_v, and cp_async::pred_load_32b are addressed.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@flashinfer/prefill.py`:
- Around line 737-738: The code currently assumes packed NVFP4 V/O width equals
q.shape[-1] when kv_block_scales (or key_block_scales/value_block_scales)
indicate a packed V cache, which breaks configs where head_dim_vo !=
head_dim_qk; update the run path to use the planned head_dim_vo instead of
q.shape[-1] for packed outputs and persist the planned head_dim_vo from plan()
onto self (e.g., self.head_dim_vo) so run() can read it; adjust any branches
that check kv_block_scales/key_block_scales/value_block_scales to select
self.head_dim_vo as the V/O width when packed is detected.
In `@include/flashinfer/cp_async.cuh`:
- Around line 192-223: pred_load_128b_from_64b: ensure the cp.async path zeroes
the upper 8 bytes to match the fallback by changing the assembly copy size to 16
while keeping src-size=8 (i.e. use cp.async.ca.shared.global with cp-size=16,
src-size=8) in both the fill-mode (kFillZero) branch and the kNoFill branch so
the upper half of the 16-byte slot is zero-padded when only 8 bytes are sourced;
keep the predicate logic and the fallback (smem_u64[1] = 0) unchanged.
---
Nitpick comments:
In `@include/flashinfer/attention/prefill.cuh`:
- Around line 449-498: The function page_produce_kv_sf may dereference sf_ptr
when is_fp4_type_v<typename KTraits::DTypeKV> is true; add a defensive null
check at the start of page_produce_kv_sf (after the is_fp4_type_v constexpr)
that returns early if sf_ptr == nullptr to avoid computing sf_gmem_offset and
calling cp_async::pred_load_32b with an invalid source pointer; keep the check
independent of produce_v and ensure it triggers before the NUM_SF_ITERS loop so
symbols page_produce_kv_sf, sf_ptr, is_fp4_type_v, and cp_async::pred_load_32b
are addressed.
In `@tests/attention/test_batch_prefill_kernels.py`:
- Around line 1151-1165: The test only exercises symmetric head sizes because
wrapper.plan is called with head_dim only; add an asymmetric case by passing
explicit head_dim_qk and head_dim_vo arguments to wrapper.plan (for example
head_dim_qk=192, head_dim_vo=128) so the packed-V sizing logic is exercised;
update the wrapper.plan invocation in
tests/attention/test_batch_prefill_kernels.py (and mirror the same change in
tests/attention/test_batch_attention.py) to include these two explicit
parameters instead of relying on the default head_dim equality.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: bddea1b0-077d-4606-bfad-a761fbad1553
📒 Files selected for processing (14)
flashinfer/attention.py
flashinfer/jit/attention/modules.py
flashinfer/jit/utils.py
flashinfer/prefill.py
flashinfer/quantization/fp4_quantization.py
flashinfer/utils.py
include/flashinfer/attention/persistent.cuh
include/flashinfer/attention/prefill.cuh
include/flashinfer/cp_async.cuh
include/flashinfer/frag_layout_swizzle.cuh
include/flashinfer/permuted_smem.cuh
include/flashinfer/vec_dtypes.cuh
tests/attention/test_batch_attention.py
tests/attention/test_batch_prefill_kernels.py
key_block_scales,
value_block_scales,
Use the planned V/O width for packed NVFP4 outputs.
kv_block_scales only tells you the V cache is packed; it does not imply head_dim_vo == head_dim_qk. Using q.shape[-1] here breaks valid planned configurations with asymmetric QK/VO widths, because plan() accepts a separate head_dim_vo but run() no longer uses it once the packed path is enabled.
🐛 Suggested fix
```diff
- out_head_dim = q.shape[-1] if kv_block_scales is not None else v_cache.shape[-1]
+ out_head_dim = (
+     self._head_dim_vo
+     if self._cached_kv_data_type == torch.uint8
+     else v_cache.shape[-1]
+ )
```

Also persist head_dim_vo from plan() onto self so the run path can reuse the planned value.
Also applies to: 2243-2255
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@flashinfer/prefill.py` around lines 737 - 738, The code currently assumes
packed NVFP4 V/O width equals q.shape[-1] when kv_block_scales (or
key_block_scales/value_block_scales) indicate a packed V cache, which breaks
configs where head_dim_vo != head_dim_qk; update the run path to use the planned
head_dim_vo instead of q.shape[-1] for packed outputs and persist the planned
head_dim_vo from plan() onto self (e.g., self.head_dim_vo) so run() can read it;
adjust any branches that check
kv_block_scales/key_block_scales/value_block_scales to select self.head_dim_vo
as the V/O width when packed is detected.
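The intent of the suggested fix can be sketched host-side. The helper below is hypothetical (`select_out_head_dim` is not part of the flashinfer API); it illustrates why the `head_dim_vo` recorded at `plan()` time must win when the V cache is NVFP4-packed, since a packed uint8 cache halves `v_cache.shape[-1]` and `q.shape[-1]` only matches the V/O width in the symmetric case:

```python
def select_out_head_dim(q_head_dim: int, v_cache_last_dim: int,
                        planned_head_dim_vo: int, kv_is_packed: bool) -> int:
    """Pick the V/O head dim for the output buffer.

    When the KV cache is NVFP4-packed (uint8, two FP4 values per byte),
    v_cache.shape[-1] is half the logical width, so use the value planned
    via plan(head_dim_vo=...); q.shape[-1] equals the V/O width only when
    head_dim_qk == head_dim_vo.
    """
    return planned_head_dim_vo if kv_is_packed else v_cache_last_dim


# Asymmetric planned config: head_dim_qk=192, head_dim_vo=128, packed cache.
packed_v_last_dim = 128 // 2  # uint8-packed storage width
assert select_out_head_dim(192, packed_v_last_dim, 128, kv_is_packed=True) == 128
# Unpacked cache: the physical last dim is already the logical width.
assert select_out_head_dim(192, 128, 128, kv_is_packed=False) == 128
```

Using `q.shape[-1]` in the packed branch would silently return 192 in the first case, which is exactly the asymmetric-width breakage the finding describes.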
```cuda
__device__ __forceinline__ void pred_load_128b_from_64b(T* smem_ptr, const T* gmem_ptr,
                                                        bool predicate) {
#ifdef FLASHINFER_CP_ASYNC_ENABLED
  uint32_t smem_int_ptr = static_cast<uint32_t>(__cvta_generic_to_shared(smem_ptr));
  if constexpr (fill_mode == SharedMemFillMode::kFillZero) {
    int src_in_bytes = predicate ? 8 : 0;
    asm volatile("cp.async.ca.shared.global [%0], [%1], %2, %3;\n" ::"r"(smem_int_ptr),
                 "l"(gmem_ptr), "n"(8), "r"(src_in_bytes));
  } else {
    // kNoFill: only issue the copy if predicate is true; cp.async always zeros the upper 8 bytes
    asm volatile(
        "{\n"
        "  .reg .pred p;\n"
        "  setp.ne.b32 p, %0, 0;\n"
        "  @p cp.async.ca.shared.global [%1], [%2], %3, %4;\n"
        "}\n" ::"r"((int)predicate),
        "r"(smem_int_ptr), "l"(gmem_ptr), "n"(8), "n"(8));
  }
#else
  if (predicate) {
    uint64_t* smem_u64 = reinterpret_cast<uint64_t*>(smem_ptr);
    smem_u64[0] = *reinterpret_cast<const uint64_t*>(gmem_ptr);
    // smem_u64[1] = 0xcdcdcdcdcdcdcdcd;
    smem_u64[1] = 0;
  } else {
    if constexpr (fill_mode == SharedMemFillMode::kFillZero) {
      *((uint4*)smem_ptr) = make_uint4(0, 0, 0, 0);
      // *((uint4*)smem_ptr) = make_uint4(0xefef, 0xefef, 0xefef, 0xefef);
    }
  }
#endif
```
🧩 Analysis chain
🌐 Web query:
In NVIDIA PTX, for `cp.async.ca.shared.global [dst], [src], cp-size, src-size`, if `cp-size=8` and `src-size=8`, are bytes beyond the first 8 destination bytes touched or zero-filled?
💡 Result:
No, bytes beyond the first 8 destination bytes are neither touched nor zero-filled when cp-size=8 and src-size=8 in the PTX instruction cp.async.ca.shared.global [dst], [src], 8, 8. The instruction copies exactly cp-size (8) bytes from the source (up to src-size bytes available) to the destination, without affecting additional bytes in shared memory.
Citations:
- 1: https://docs.modular.com/mojo/stdlib/gpu/memory/async_copy
- 2: https://ita9naiwa.github.io/mlsys/2025/10/05/ptx-mental-model.html
- 3: https://forums.developer.nvidia.com/t/ptx-cp-async-purpose-of-ignore-src-and-prefetch-size/217746
- 4: cp async fallback NVIDIA/cutlass#761
pred_load_128b_from_64b leaves the upper half of the slot uninitialized.
cp.async.ca.shared.global [dst], [src], 8, src-size writes only 8 bytes; the remaining 8 bytes of the 16-byte shared-memory slot are left untouched and retain stale data. The fallback path explicitly zeroes the upper half with smem_u64[1] = 0, creating divergent behavior. Use cp-size=16 with src-size=8 to ensure zero-padding matches the fallback path.
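The semantics above can be checked with a small host-side model (plain Python, not CUDA, written for illustration) of `cp.async.ca.shared.global`: the instruction copies `min(cp-size, src-size)` bytes and zero-fills only up to `cp-size`, so with `cp-size=8` the upper 8 bytes of a 16-byte slot keep stale data, while `cp-size=16` with `src-size=8` zero-pads them to match the fallback:

```python
def cp_async_model(slot: bytearray, src: bytes, cp_size: int, src_size: int) -> None:
    """Model of PTX cp.async.ca.shared.global [dst], [src], cp-size, src-size:
    copy min(cp_size, src_size) bytes from src, zero-fill the remainder of the
    cp_size window; bytes beyond cp_size are never touched."""
    n = min(cp_size, src_size)
    slot[:n] = src[:n]
    slot[n:cp_size] = bytes(cp_size - n)


stale = bytearray(b"\xcd" * 16)
cp_async_model(stale, b"\x01" * 8, cp_size=8, src_size=8)
assert stale[8:] == b"\xcd" * 8  # upper half keeps stale contents

padded = bytearray(b"\xcd" * 16)
cp_async_model(padded, b"\x01" * 8, cp_size=16, src_size=8)
assert padded[8:] == bytes(8)  # upper half zero-filled, like the fallback path
```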
🐛 Suggested fix
```diff
-    int src_in_bytes = predicate ? 8 : 0;
-    asm volatile("cp.async.ca.shared.global [%0], [%1], %2, %3;\n" ::"r"(smem_int_ptr),
-                 "l"(gmem_ptr), "n"(8), "r"(src_in_bytes));
+    int src_in_bytes = predicate ? 8 : 0;
+    asm volatile("cp.async.ca.shared.global [%0], [%1], %2, %3;\n" ::"r"(smem_int_ptr),
+                 "l"(gmem_ptr), "n"(16), "r"(src_in_bytes));
```

```diff
-        "r"(smem_int_ptr), "l"(gmem_ptr), "n"(8), "n"(8));
+        "r"(smem_int_ptr), "l"(gmem_ptr), "n"(16), "n"(8));
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```cuda
__device__ __forceinline__ void pred_load_128b_from_64b(T* smem_ptr, const T* gmem_ptr,
                                                        bool predicate) {
#ifdef FLASHINFER_CP_ASYNC_ENABLED
  uint32_t smem_int_ptr = static_cast<uint32_t>(__cvta_generic_to_shared(smem_ptr));
  if constexpr (fill_mode == SharedMemFillMode::kFillZero) {
    int src_in_bytes = predicate ? 8 : 0;
    asm volatile("cp.async.ca.shared.global [%0], [%1], %2, %3;\n" ::"r"(smem_int_ptr),
                 "l"(gmem_ptr), "n"(16), "r"(src_in_bytes));
  } else {
    // kNoFill: only issue the copy if predicate is true; cp.async always zeros the upper 8 bytes
    asm volatile(
        "{\n"
        "  .reg .pred p;\n"
        "  setp.ne.b32 p, %0, 0;\n"
        "  @p cp.async.ca.shared.global [%1], [%2], %3, %4;\n"
        "}\n" ::"r"((int)predicate),
        "r"(smem_int_ptr), "l"(gmem_ptr), "n"(16), "n"(8));
  }
#else
  if (predicate) {
    uint64_t* smem_u64 = reinterpret_cast<uint64_t*>(smem_ptr);
    smem_u64[0] = *reinterpret_cast<const uint64_t*>(gmem_ptr);
    // smem_u64[1] = 0xcdcdcdcdcdcdcdcd;
    smem_u64[1] = 0;
  } else {
    if constexpr (fill_mode == SharedMemFillMode::kFillZero) {
      *((uint4*)smem_ptr) = make_uint4(0, 0, 0, 0);
      // *((uint4*)smem_ptr) = make_uint4(0xefef, 0xefef, 0xefef, 0xefef);
    }
  }
#endif
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@include/flashinfer/cp_async.cuh` around lines 192 - 223,
pred_load_128b_from_64b: ensure the cp.async path zeroes the upper 8 bytes to
match the fallback by changing the assembly copy size to 16 while keeping
src-size=8 (i.e. use cp.async.ca.shared.global with cp-size=16, src-size=8) in
both the fill-mode (kFillZero) branch and the kNoFill branch so the upper half
of the 16-byte slot is zero-padded when only 8 bytes are sourced; keep the
predicate logic and the fallback (smem_u64[1] = 0) unchanged.
[FAILED] Pipeline #46514756: 6/20 passed

/bot run
…2725)

## Summary

SM120 desktop Blackwell GPUs (RTX PRO 6000, RTX 5090) are blocked from NVFP4 MoE grouped GEMM due to hardcoded SM100-only checks.

**Changes:**
- `jit/fused_moe.py`: Add major version 12 to `supported_major_versions`
- `csrc/trtllm_fused_moe_kernel_launcher.cu`: `ICHECK_EQ(major, 10)` -> `ICHECK_GE(major, 10)`

**Benchmark** (Qwen3.5-397B on 4x RTX PRO 6000 SM120):

| Config | tok/s | Output |
|--------|-------|--------|
| compute_120f (CUDA 13.0) | 39.0 | Correct |
| compute_120a (CUDA 12.8) | 14.6 | Correct (slow fallback) |
| Marlin W4A16 | 46-49 | Correct |

**Root cause:** All TMA WS grouped GEMM autotuner tactics fail on `compute_120a`, requiring `compute_120f` (CUDA 13.0). CuTe DSL `admissible_archs` in vendored CUTLASS also needs `sm_120a`/`sm_120f` (cpasync/copy.py, tcgen05/mma.py, arch/mbar.py, etc).

Related: CUTLASS #2820, #2800; vLLM #33416, #33333; FlashInfer #2577

## Summary by CodeRabbit

* **Bug Fixes**
  * Broadened GPU architecture checks to accept additional modern compute capabilities (SM 10.x and 12.x), improving compatibility and clearer SM reporting.
  * Improved compute-capability detection and encoding, preserving user-provided architecture suffixes and more accurately generating nvcc architecture flags.
  * Expanded JIT module generation to include additional CUDA majors so fused-MoE kernels run on more recent GPUs.

Signed-off-by: Brandon Music <brandon.m.music@gmail.com>
Co-authored-by: Brandon Music <brandonmmusic-max@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Brandon Music <brandonmusic@pop-os.tail8674da.ts.net>
[FAILED] Pipeline #46572392: 6/20 passed
📌 Description
This PR adds support for NVFP4 KV input to the batch prefill and batch attention kernels. The support covers all architectures.
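For reviewers unfamiliar with the format, a reference dequantization of NVFP4 (FP4 E2M1 values grouped into blocks, each block multiplied by a stored scale factor) can be sketched in NumPy. The low-nibble-first packing order and float32 scales here are illustrative assumptions, not necessarily the kernel's exact layout:

```python
import numpy as np

# Magnitudes of the 8 non-negative FP4 E2M1 code points (bit 3 is the sign).
E2M1_LUT = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)

def dequant_nvfp4(packed: np.ndarray, scales: np.ndarray, block: int = 16) -> np.ndarray:
    """Unpack two FP4 values per uint8 (low nibble first, an assumed order)
    and apply one scale factor per `block` consecutive values."""
    nibbles = np.empty(packed.size * 2, dtype=np.uint8)
    nibbles[0::2] = packed & 0x0F
    nibbles[1::2] = packed >> 4
    sign = np.where(nibbles & 0x8, -1.0, 1.0).astype(np.float32)
    vals = sign * E2M1_LUT[nibbles & 0x7]
    return (vals.reshape(-1, block) * scales.reshape(-1, 1)).reshape(-1)


# One byte packing codes 0x2 (=1.0) and 0xF (=-6.0), one block of 2, scale 2.0:
out = dequant_nvfp4(np.array([0xF2], dtype=np.uint8),
                    np.array([2.0], dtype=np.float32), block=2)
assert out.tolist() == [2.0, -12.0]
```

This is the logical inverse of the quantization path; the kernels in this PR apply the per-block scale during attention rather than materializing a dequantized cache.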
🔍 Related Issues
🚀 Pull Request Checklist
Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.
✅ Pre-commit Checks
- I have installed `pre-commit` by running `pip install pre-commit` (or used your preferred method).
- I have installed the hooks with `pre-commit install`.
- I have run the hooks manually with `pre-commit run --all-files` and fixed any reported issues.

🧪 Tests

- Tests have been added or updated as needed.
- All tests are passing (`unittest`, etc.).

Reviewer Notes
Summary by CodeRabbit
Release Notes
New Features
Tests