Support NVFP4 KV for prefill and batch attention kernels #2820
Tom-Zheng wants to merge 5 commits into flashinfer-ai:main
Conversation
Signed-off-by: Tian Zheng <29906817+Tom-Zheng@users.noreply.github.com>
📝 Walkthrough
This PR introduces NVFP4 (4-bit floating-point) KV cache support with per-block scale factors throughout the attention and prefill pipelines. Changes include Python API extensions to accept KV block scales, CUDA kernel modifications to load and apply scale factors, new dequantization paths for FP4, and corresponding test coverage.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
Summary of Changes
This pull request significantly enhances FlashInfer's capabilities by integrating native support for NVFP4 KV cache input across its batch prefill and batch attention kernels. This allows for more memory-efficient and potentially faster inference on NVIDIA GPUs by leveraging 4-bit quantization for key and value tensors. The changes span from low-level CUDA kernel implementations for data loading and computation to Python-level utilities for quantization, dequantization, and comprehensive testing, ensuring broad compatibility and correctness.
/bot run
Code Review
This pull request introduces support for NVFP4 (NVIDIA FP4) KV cache quantization in FlashInfer's prefill and batch attention operations. Key changes include updating benchmark routines to handle NVFP4 as a KV data type, adjusting tolerances for lower precision, and filtering unsupported backends. The core C++ kernels and Python JIT modules are extended to manage packed NVFP4 data and per-group scale factors, including modifications to memory access patterns and MMA operations for proper dequantization. Review comments highlight the need for improved clarity in the global_scale calculation in _to_nvfp4 and detailed explanations for the intricate scaling factor application logic within the compute_qk and compute_sfm_v device functions.
benchmarks/routines/attention.py
```python
min_val, max_val = x.float().aminmax()
amax = torch.maximum(min_val.abs(), max_val.abs()).clamp(min=1e-12)
global_scale = amax / (448.0 * 6.0)
```
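The constants in the last line come straight from the two-level NVFP4 encoding: per-block scale factors are stored as FP8 E4M3 (largest finite value 448) and each E2M1 element can represent at most ±6. A pure-Python sketch of the derivation (the helper name here is ours, not part of the PR):

```python
E4M3_MAX = 448.0  # largest finite FP8 E4M3 value (per-block scale-factor format)
E2M1_MAX = 6.0    # largest FP4 E2M1 magnitude

def nvfp4_global_scale(values):
    """Hypothetical helper mirroring the `amax / (448.0 * 6.0)` expression.

    A stored element decodes as global_scale * sf_e4m3 * code_e2m1, so the
    largest representable magnitude is global_scale * 448 * 6. Dividing the
    tensor's absolute max by that product makes amax exactly representable.
    """
    amax = max(abs(v) for v in values)
    amax = max(amax, 1e-12)  # matches .clamp(min=1e-12)
    return amax / (E4M3_MAX * E2M1_MAX)

print(nvfp4_global_scale([-2688.0, 1.0]))  # 2688 = 448 * 6, so the scale is 1.0
```

This is the reason a clarifying comment was requested: without knowing the two format maxima, `448.0 * 6.0` reads as a magic number.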
```cpp
if constexpr (is_fp4_type_v<typename KTraits::DTypeKV>) {
  // Apply scaling factors for K.
  // SF smem is linear: sf[kv_row * SF_COLS + hd_group], SF_COLS = HEAD_DIM_QK/16.
  // For m16n8k16 B layout, thread t's KV rows are t/4 and t/4+8 in the mma_kv tile.
  // b_frag[0,1] share KV row (t/4), b_frag[2,3] share KV row (t/4+8).
  using DTypeQ_ = typename KTraits::DTypeQ;
  using packed2_ = std::conditional_t<std::is_same_v<DTypeQ_, half>, half2, __nv_bfloat162>;
  constexpr uint32_t SF_COLS_K = KTraits::NUM_MMA_D_QK;  // HEAD_DIM_QK / 16
  uint32_t sf_base = (mma_kv * 16 + lane_idx / 4) * SF_COLS_K + mma_d;
  __nv_fp8_e4m3 sf_a_fp8, sf_b_fp8;
  sf_a_fp8.__x = k_sf_smem[sf_base];
  sf_b_fp8.__x = k_sf_smem[sf_base + 8 * SF_COLS_K];
  packed2_ scale_a{static_cast<DTypeQ_>(sf_a_fp8), static_cast<DTypeQ_>(sf_a_fp8)};
  packed2_ scale_b{static_cast<DTypeQ_>(sf_b_fp8), static_cast<DTypeQ_>(sf_b_fp8)};
  *(packed2_*)&b_frag[0] = __hmul2(*(packed2_*)&b_frag[0], scale_a);
  *(packed2_*)&b_frag[1] = __hmul2(*(packed2_*)&b_frag[1], scale_a);
  *(packed2_*)&b_frag[2] = __hmul2(*(packed2_*)&b_frag[2], scale_b);
  *(packed2_*)&b_frag[3] = __hmul2(*(packed2_*)&b_frag[3], scale_b);
}
```
The logic for applying scaling factors in compute_qk is quite intricate, involving SF_COLS_K, sf_base, lane_idx, and __hmul2. Adding comments to explain how sf_base is calculated to correctly index into k_sf_smem and how lane_idx contributes to the thread mapping for scaling would significantly improve the readability and maintainability of this critical section.
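To make that mapping concrete, here is a small host-side Python model of the index arithmetic (the function and its return shape are our invention; only the `sf_base` formula is taken from the diff):

```python
def k_sf_indices(lane_idx, mma_kv, mma_d, sf_cols_k):
    """Model the compute_qk scale-factor indexing.

    In the m16n8k16 B-operand layout, lane t of a warp holds KV rows
    t/4 and t/4 + 8 of the 16-row tile; both rows read the same
    16-element head-dim group `mma_d`. SF smem is row-major:
    sf[kv_row * sf_cols_k + hd_group].
    """
    row_a = mma_kv * 16 + lane_idx // 4   # scales b_frag[0] and b_frag[1]
    row_b = row_a + 8                     # scales b_frag[2] and b_frag[3]
    sf_base = row_a * sf_cols_k + mma_d
    # The `+ 8 * sf_cols_k` offset in the kernel is exactly
    # "same head-dim group, eight KV rows down".
    return sf_base, sf_base + 8 * sf_cols_k, row_a, row_b
```

A quick sanity check: for lane 5 with `sf_cols_k = 8` and `mma_d = 2`, the two indices returned are `1*8+2 = 10` and `9*8+2 = 74`, matching rows 1 and 9 of the tile.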
```cpp
if constexpr (is_fp4_type_v<typename KTraits::DTypeKV>) {
  // Apply scaling factors for V.
  // SF smem is linear: sf[kv_row * SF_COLS + hd_group], SF_COLS = HEAD_DIM_VO/16.
  // For transposed B (V), thread t's KV rows are 2*(t%4)+{0,1} and 2*(t%4)+{8,9}
  // in the mma_kv tile. After swap, b_frag[0,2] cover rows {r0, r0+1} and
  // b_frag[1,3] cover rows {r0+8, r0+9}. Each half2 needs two distinct SFs.
  using DTypeQ_ = typename KTraits::DTypeQ;
  using packed2_ = std::conditional_t<std::is_same_v<DTypeQ_, half>, half2, __nv_bfloat162>;
  constexpr uint32_t SF_COLS_V = KTraits::NUM_MMA_D_VO;  // HEAD_DIM_VO / 16
  uint32_t sf_base = (mma_kv * 16 + 2 * (lane_idx % 4)) * SF_COLS_V + mma_d;
  __nv_fp8_e4m3 sf0_fp8, sf1_fp8, sf2_fp8, sf3_fp8;
  sf0_fp8.__x = v_sf_smem[sf_base];
  sf1_fp8.__x = v_sf_smem[sf_base + SF_COLS_V];
  sf2_fp8.__x = v_sf_smem[sf_base + 8 * SF_COLS_V];
  sf3_fp8.__x = v_sf_smem[sf_base + 9 * SF_COLS_V];
  packed2_ scale_lo{static_cast<DTypeQ_>(sf0_fp8), static_cast<DTypeQ_>(sf1_fp8)};
  packed2_ scale_hi{static_cast<DTypeQ_>(sf2_fp8), static_cast<DTypeQ_>(sf3_fp8)};
  *(packed2_*)&b_frag[0] = __hmul2(*(packed2_*)&b_frag[0], scale_lo);
  *(packed2_*)&b_frag[1] = __hmul2(*(packed2_*)&b_frag[1], scale_hi);
  *(packed2_*)&b_frag[2] = __hmul2(*(packed2_*)&b_frag[2], scale_lo);
  *(packed2_*)&b_frag[3] = __hmul2(*(packed2_*)&b_frag[3], scale_hi);
}
```
Similar to compute_qk, the scaling factor application in compute_sfm_v is complex. Clarifying the calculation of SF_COLS_V, sf_base, and the role of lane_idx in mapping to v_sf_smem would be very helpful. An explanation of how packed2_ scale_lo and scale_hi are constructed and applied would also enhance understanding.
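Likewise for V, the four distinct scale factors per thread can be modeled in host-side Python (hypothetical helper; only the `sf_base` arithmetic comes from the diff):

```python
def v_sf_indices(lane_idx, mma_kv, mma_d, sf_cols_v):
    """Model the compute_sfm_v scale-factor indexing.

    For the transposed B operand, lane t covers the KV row pair
    r0 = 2*(t%4) and r0+1, plus the same pair shifted down by 8 rows,
    so four distinct per-row scale factors are needed per thread.
    """
    r0 = mma_kv * 16 + 2 * (lane_idx % 4)
    base = r0 * sf_cols_v + mma_d
    # sf0/sf1 build scale_lo (rows r0, r0+1);
    # sf2/sf3 build scale_hi (rows r0+8, r0+9).
    return [base, base + sf_cols_v, base + 8 * sf_cols_v, base + 9 * sf_cols_v]
```

For lane 7 with `sf_cols_v = 4` and `mma_d = 1`, this yields the linear indices of rows 6, 7, 14, and 15, which is why K needs one broadcast scale per `half2` while V needs two distinct scales.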
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
benchmarks/routines/attention.py (2)
1197-1201: ⚠️ Potential issue | 🟠 Major: Don't quantize the paged NVFP4 cache twice.
Lines 1036-1059 already build the packed kv_cache plus kv_cache_sf. Lines 1197-1201 then feed that packed uint8 data back into nvfp4_quantize_paged_kv_cache(...), which expects floating-point KV input and discards the scales you just computed. Reuse kv_cache_sf as kv_block_scales here.
Suggested change
```diff
-        if is_nvfp4_kv:
-            kv_cache_nvfp4, kv_block_scales, k_scale, v_scale = (
-                nvfp4_quantize_paged_kv_cache(kv_cache[:, 0], kv_cache[:, 1])
-            )
-            kv_cache = kv_cache_nvfp4
+        if use_nvfp4_kv:
+            kv_block_scales = kv_cache_sf
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@benchmarks/routines/attention.py` around lines 1197 - 1201, The code is re-quantizing an already-packed NVFP4 paged KV cache: when is_nvfp4_kv is true you call nvfp4_quantize_paged_kv_cache(kv_cache[:, 0], kv_cache[:, 1]) on uint8-packed data and overwrite the correct scales; instead, reuse the precomputed packed cache and scales (kv_cache and kv_cache_sf) produced earlier: set kv_cache = kv_cache (or keep existing packed variable) and assign kv_block_scales = kv_cache_sf (and ensure k_scale and v_scale use the previously computed values), removing the nvfp4_quantize_paged_kv_cache call inside the is_nvfp4_kv branch so you don't discard the original scales.
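The failure mode is easy to reproduce with a toy quantizer (a stand-in for `nvfp4_quantize_paged_kv_cache`, not its real implementation): deriving a scale from already-quantized codes produces a scale in code space, and the original scale is lost.

```python
def toy_quantize(block):
    """Toy per-block quantizer: floats in, integer codes plus one scale out.

    Real NVFP4 packs E2M1 nibbles with FP8 scale factors; the shape of the
    double-quantization bug is the same either way.
    """
    scale = max(abs(v) for v in block) / 6.0
    scale = scale if scale > 0 else 1.0
    codes = [round(v / scale) for v in block]
    return codes, scale

block = [1.5, -3.0, 0.75]
codes, scale = toy_quantize(block)  # scale derived from the original floats
# The bug: re-quantizing the codes derives a *new* scale from code space
# and discards `scale`, so later dequantization is off by scale / bad_scale.
bad_codes, bad_scale = toy_quantize([float(c) for c in codes])
```

The fix proposed in the review, reusing the scales already computed (`kv_block_scales = kv_cache_sf`), avoids the second pass entirely.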
865-867: ⚠️ Potential issue | 🔴 Critical: Use one NVFP4 feature-flag name throughout this function.
Line 865 defines is_nvfp4_kv, but the new branches later read use_nvfp4_kv (for example Lines 883, 965, 1036, and 1218). testBatchPrefillWithPagedKVCacheWrapper() currently throws before any benchmark runs.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@benchmarks/routines/attention.py` around lines 865 - 867, The function uses two inconsistent feature-flag names (is_nvfp4_kv and use_nvfp4_kv) causing branches to miss the intended flag; unify them by picking one name (e.g., replace the initial definition is_nvfp4_kv = args.kv_dtype == "nvfp4" with use_nvfp4_kv = args.kv_dtype == "nvfp4" or create use_nvfp4_kv = is_nvfp4_kv immediately after) and update all branch checks (references in the function such as the later conditionals at lines referencing use_nvfp4_kv) to use that single symbol so the NVFP4 path is consistently triggered (also ensure any dtype checks that set kv_dtype remain correct).
include/flashinfer/attention/prefill.cuh (1)
1671-1702: ⚠️ Potential issue | 🟠 Major: These prefill paths now consume FP4 scale tiles without ever producing them.
compute_qk / compute_sfm_v now dereference k_sf_smem and v_sf_smem for FP4 KV, but SinglePrefillWithKVCacheDevice and BatchPrefillWithRaggedKVCacheKernel still only call produce_kv(...). If either path is instantiated with __nv_fp4x2_e2m1, it will multiply against uninitialized shared memory and silently corrupt the result. Please either plumb per-row SF loads into these kernels too, or add a compile-time guard that keeps FP4 limited to the paged path for now.
Also applies to: 2118-2150
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@include/flashinfer/attention/prefill.cuh` around lines 1671 - 1702, The prefill paths call compute_qk and compute_sfm_v which read k_sf_smem/v_sf_smem for FP4 but the kernels SinglePrefillWithKVCacheDevice and BatchPrefillWithRaggedKVCacheKernel only call produce_kv and never populate those scale-factor tiles, so instantiating with __nv_fp4x2_e2m1 will read uninitialized shared memory; fix by adding a compile-time guard that prevents FP4 (e.g. static_assert or if constexpr) in SinglePrefillWithKVCacheDevice and BatchPrefillWithRaggedKVCacheKernel (or the wrapper that calls produce_kv) when KTraits::ScalarType == __nv_fp4x2_e2m1, OR alternatively plumb the per-row SF loads into those kernels by ensuring produce_kv is invoked with the SharedMemFillMode that fills k_sf_smem/v_sf_smem (or explicitly call the SF fill helper) before compute_qk/compute_sfm_v are executed; pick one approach and apply it consistently to both call sites referencing compute_qk, compute_sfm_v, k_sf_smem, v_sf_smem, and produce_kv.
🧹 Nitpick comments (4)
flashinfer/jit/attention/modules.py (1)
1836-1870: Factor the batch-attention setter generation out of this function.
This copies the nullable/scalar assignment rules from generate_additional_params(). A small helper that accepts the target prefix (params, params[i], params.additional_params) would keep the batch path from drifting the next time additional-parameter semantics change.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@flashinfer/jit/attention/modules.py` around lines 1836 - 1870, Extract the logic that builds the param-assignment lines into a small helper (e.g., generate_additional_params_setter(prefix, additional_tensor_names, additional_tensor_dtypes, additional_scalar_names)) and replace the inline batch_additional_params_setter construction with a call to that helper; the helper should implement the same nullable tensor/scalar rules currently duplicated (the conditional branch for var.startswith("maybe") and the scalar formatting) but use the provided target prefix (e.g., "params[i]", "params", or "params.additional_params") when formatting each assignment; update the call sites (the batch path that currently creates batch_additional_params_setter and any other place using generate_additional_params output) to call the new helper so semantics remain identical but the formatting logic is centralized.
.claude/memory/prefill_cuh_structure.md (1)
40-62: Call out the NVFP4 scale-factor path explicitly.
This note still reads like a generic prefill overview. The new FP4-specific pieces (maybe_k_cache_sf / maybe_v_cache_sf, page_produce_kv_sf, and the shared-memory scale buffers consumed by compute_qk / compute_sfm_v) are exactly what future readers will look for in this PR.
Based on learnings: keep documentation in sync with code changes, particularly CLAUDE.md and .claude/skills/, when modifying infrastructure: changes, patterns, new conventions, or deprecations.
Also applies to: 75-86
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.claude/memory/prefill_cuh_structure.md around lines 40 - 62, Update the prefill overview to explicitly document the NVFP4 scale-factor path: describe the new FP4-specific symbols maybe_k_cache_sf and maybe_v_cache_sf, the page_produce_kv_sf path, and the shared-memory scale buffers that compute_qk and compute_sfm_v consume; note where these are emitted/loaded and how they flow through page_produce_kv_sf → shared-memory buffers → compute_qk/compute_sfm_v, and add a short cross-reference to the infra docs/skills that must be updated when changing these conventions so readers can find the FP4 scale-factor behavior quickly.
tests/attention/test_batch_attention.py (1)
308-309: Exercise signed E2M1 codes too.Line 309 clears both sign bits, so this test never covers negative NVFP4 values. A sign-handling regression would still pass here; either remove the mask or add a second signed-data case.
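For context, the mask works because each packed byte holds two E2M1 nibbles whose bit 3 is a sign bit. A small decoder (ours, following the standard E2M1 value table) shows exactly what the mask removes:

```python
# Magnitudes of the eight unsigned E2M1 codes (1 sign, 2 exponent, 1 mantissa bit)
E2M1_LUT = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def decode_e2m1_byte(packed):
    """Decode one packed NVFP4 byte into two floats, low nibble first.

    Bit 3 of each nibble (0x08 in the low nibble, 0x80 in the byte for the
    high nibble) is the sign bit, which is what `packed &= 0x77` clears.
    """
    out = []
    for nib in (packed & 0x0F, packed >> 4):
        mag = E2M1_LUT[nib & 0x7]
        out.append(-mag if nib & 0x8 else mag)
    return out

print(decode_e2m1_byte(0x2A))         # [-1.0, 1.0]: low nibble carries a sign
print(decode_e2m1_byte(0x2A & 0x77))  # [1.0, 1.0]: the mask erased it
```

So a sign-handling regression in the kernel's dequantization would be invisible to any test data that went through this mask.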
Suggested change
```diff
-    packed &= 0x77  # clear bit 3 (0x08) and bit 7 (0x80) to ensure non-negative
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/attention/test_batch_attention.py` around lines 308 - 309, The test currently forces all packed bytes non-negative by applying "packed &= 0x77", so negative NVFP4 (signed E2M1) values are never exercised; either remove the mask expression "packed &= 0x77" to allow both signs, or add a second test case that constructs a signed-data variant (e.g., copy the existing "packed" and set the sign bits for NVFP4 by OR-ing the appropriate bits such as 0x08 and/or 0x80) and run the same assertions on that signed copy so both unsigned and negative NVFP4 paths are covered.
tests/attention/test_batch_prefill_kernels.py (1)
1074-1083: Add at least one causal NVFP4 case.
This matrix hardcodes causal=False, so the new NVFP4 path never exercises the masked/tail-tile logic that changed in the kernel code. A small causal=True case would cover the scale-factor path under masking as well.
Verify each finding against the current code and only fix it if needed. In `@tests/attention/test_batch_prefill_kernels.py` around lines 1074 - 1083, The test matrix currently forces causal=False so the NVFP4 kernel path never hits masked/tail-tile logic; update the test_batch_prefill_with_paged_kv_cache_nvfp4 parameterization (the `@pytest.mark.parametrize`("causal", ...) on the test) to include True (e.g., [False, True]) so at least one run exercises the causal/masked path for NVFP4; keep existing q_dtype values unchanged so the NVFP4 path is still exercised.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@benchmarks/routines/attention.py`:
- Around line 1667-1675: The NVFP4 allowlist is too permissive: in the
use_nvfp4_kv block (variable backends and list nvfp4_unsupported) remove "auto"
and "trtllm-native" from the supported set so ragged NVFP4 only stays enabled
for backends that actually wrap ragged K/V (e.g., "fa2" and "fa3"); update the
allowed list used to compute nvfp4_unsupported from ["fa2", "trtllm-native",
"auto"] to only ["fa2", "fa3"] (or the concrete backends that implement ragged
NVFP4) so backends that don't forward k_sf/v_sf are filtered out.
In `@include/flashinfer/cp_async.cuh`:
- Around line 191-224: The cp.async call in pred_load_128b_from_64b uses
cp_size=8 which doesn’t zero the upper 8 bytes; change the cp.async
invocation(s) to use cp_size=16 and pass src_size as a variable (src_size =
predicate ? 8 : 0) so cp.async zero-fills bytes 8..15 when src_size is 0, and
similarly adjust the kNoFill branch to issue cp.async with cp_size=16 and
src_size conditionally 8 or 0 (instead of cp_size=8); also apply the same fix
pattern to the 32b helper described in the comment: use cp_size appropriate to
the full destination (e.g., 16 for 128b destination or 4 for 32b helper) and
make src_size variable (0 when wanting explicit zero-fill, nonzero when copying)
so cp.async actually zeros the upper bytes.
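The zero-fill semantics this fix relies on can be sketched in Python: cp.async always writes cp-size bytes to shared memory, copying src-size of them from global memory and zeroing the rest. With `cp_size=8`, a false predicate can therefore never clear the upper half of a 16-byte destination (this is a byte-level model of the semantics, not real PTX):

```python
def cp_async_zfill(dst, src, cp_size, src_size):
    """Byte-level model of cp.async's src-size zero-fill behavior.

    cp_size bytes of `dst` are always written: the first src_size bytes
    come from `src`, the remaining cp_size - src_size bytes are zeroed.
    """
    dst[:src_size] = src[:src_size]
    dst[src_size:cp_size] = bytes(cp_size - src_size)

stale = bytearray(b"\xcd" * 16)
cp_async_zfill(stale, b"", cp_size=8, src_size=0)  # shape of the reviewed code
# bytes 8..15 still hold 0xcd: only the first 8 bytes were written

fixed = bytearray(b"\xcd" * 16)
cp_async_zfill(fixed, bytes(range(8)), cp_size=16, src_size=8)  # suggested fix
# bytes 8..15 are now zero-filled along with the 8-byte copy
```

This is why the prompt asks for `cp_size=16` with a variable `src_size` of 8 or 0: the instruction's destination width, not the predicate, determines how many bytes get defined values.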
In `@include/flashinfer/vec_dtypes.cuh`:
- Around line 486-510: The CUDA version gate incorrectly requires both
__CUDACC_VER_MAJOR__ >= 13 and __CUDACC_VER_MINOR__ >= 2, which fails for CUDA
14.x; update the preprocessor condition that guards the fast-path asm (the block
using cvt.rn.bf16x2.e2m1x2 and variable y/b) to check the combined version
(e.g., compare major/minor together or compute a single numeric version) so it
enables the fast path for CUDA >= 13.2 (including 14.0+).
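The version-gate bug is clearest with the check lifted into Python tuple comparison (a sketch of the logic only; the real code is CUDA preprocessor macros):

```python
def buggy_gate(major, minor):
    # Mirrors `__CUDACC_VER_MAJOR__ >= 13 && __CUDACC_VER_MINOR__ >= 2`:
    # CUDA 14.0 has minor == 0, so the fast path is wrongly disabled.
    return major >= 13 and minor >= 2

def fixed_gate(major, minor):
    # Compare (major, minor) as one version: true for every CUDA >= 13.2.
    return (major, minor) >= (13, 2)

print(buggy_gate(14, 0), fixed_gate(14, 0))  # False True
```

The equivalent preprocessor form would compare major and minor together, e.g. `major > 13 || (major == 13 && minor >= 2)`.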
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 9c342064-89d2-4570-b892-646302704193
📒 Files selected for processing (18)
- .claude/memory/MEMORY.md
- .claude/memory/prefill_cuh_structure.md
- benchmarks/routines/attention.py
- flashinfer/attention.py
- flashinfer/jit/attention/modules.py
- flashinfer/jit/utils.py
- flashinfer/prefill.py
- flashinfer/quantization/fp4_quantization.py
- flashinfer/utils.py
- include/flashinfer/attention/persistent.cuh
- include/flashinfer/attention/prefill.cuh
- include/flashinfer/cp_async.cuh
- include/flashinfer/frag_layout_swizzle.cuh
- include/flashinfer/permuted_smem.cuh
- include/flashinfer/vec_dtypes.cuh
- mha_ref.cu
- tests/attention/test_batch_attention.py
- tests/attention/test_batch_prefill_kernels.py
👮 Files not reviewed due to content moderation or server errors (7)
- flashinfer/utils.py
- .claude/memory/MEMORY.md
- flashinfer/jit/utils.py
- flashinfer/attention.py
- include/flashinfer/permuted_smem.cuh
- flashinfer/prefill.py
- flashinfer/quantization/fp4_quantization.py
benchmarks/routines/attention.py
```python
# NVFP4 KV is only supported by fa2/fa3/auto backends
if use_nvfp4_kv:
    nvfp4_unsupported = [
        b for b in backends if b not in ["fa2", "trtllm-native", "auto"]
    ]
    for b in nvfp4_unsupported:
        print(f"[INFO] Backend {b} does not support NVFP4 KV cache. Skipping.")
        backends.remove(b)
```
Ragged NVFP4 should not keep auto or trtllm-native in the allowlist.
auto never gets a wrapper in Lines 1843-1896, and the trtllm-native branch at Lines 1975-1998 runs packed uint8 K/V without ever forwarding k_sf / v_sf. Leaving either backend here turns --kv_dtype nvfp4 into an unsupported benchmark path.
Suggested change
```diff
-            nvfp4_unsupported = [
-                b for b in backends if b not in ["fa2", "trtllm-native", "auto"]
-            ]
+            nvfp4_unsupported = [b for b in backends if b not in ["fa2"]]
```
🤖 Prompt for AI Agents
+ nvfp4_unsupported = [b for b in backends if b not in ["fa2"]]🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@benchmarks/routines/attention.py` around lines 1667 - 1675, The NVFP4
allowlist is too permissive: in the use_nvfp4_kv block (variable backends and
list nvfp4_unsupported) remove "auto" and "trtllm-native" from the supported set
so ragged NVFP4 only stays enabled for backends that actually wrap ragged K/V
(e.g., "fa2" and "fa3"); update the allowed list used to compute
nvfp4_unsupported from ["fa2", "trtllm-native", "auto"] to only ["fa2", "fa3"]
(or the concrete backends that implement ragged NVFP4) so backends that don't
forward k_sf/v_sf are filtered out.
```cpp
template <PrefetchMode prefetch_mode, SharedMemFillMode fill_mode, typename T>
__device__ __forceinline__ void pred_load_128b_from_64b(T* smem_ptr, const T* gmem_ptr,
                                                        bool predicate) {
#ifdef FLASHINFER_CP_ASYNC_ENABLED
  uint32_t smem_int_ptr = static_cast<uint32_t>(__cvta_generic_to_shared(smem_ptr));
  if constexpr (fill_mode == SharedMemFillMode::kFillZero) {
    int src_in_bytes = predicate ? 8 : 0;
    asm volatile("cp.async.ca.shared.global [%0], [%1], %2, %3;\n" ::"r"(smem_int_ptr),
                 "l"(gmem_ptr), "n"(8), "r"(src_in_bytes));
  } else {
    // kNoFill: only issue the copy if predicate is true; cp.async always zeros the upper 8 bytes
    asm volatile(
        "{\n"
        " .reg .pred p;\n"
        " setp.ne.b32 p, %0, 0;\n"
        " @p cp.async.ca.shared.global [%1], [%2], %3, %4;\n"
        "}\n" ::"r"((int)predicate),
        "r"(smem_int_ptr), "l"(gmem_ptr), "n"(8), "n"(8));
  }
#else
  if (predicate) {
    uint64_t* smem_u64 = reinterpret_cast<uint64_t*>(smem_ptr);
    smem_u64[0] = *reinterpret_cast<const uint64_t*>(gmem_ptr);
    // smem_u64[1] = 0xcdcdcdcdcdcdcdcd;
    smem_u64[1] = 0;
  } else {
    if constexpr (fill_mode == SharedMemFillMode::kFillZero) {
      *((uint4*)smem_ptr) = make_uint4(0, 0, 0, 0);
      // *((uint4*)smem_ptr) = make_uint4(0xefef, 0xefef, 0xefef, 0xefef);
    }
  }
#endif
}
```
🧩 Analysis chain
🏁 Script executed:
```shell
# First, let's read the actual file content to see the exact implementation
head -n 250 include/flashinfer/cp_async.cuh | tail -n 70
```
Repository: flashinfer-ai/flashinfer
Length of output: 3192
🌐 Web query:
PTX ISA cp.async.ca.shared.global instruction semantics src-size cp-size parameters
💡 Result:
The PTX ISA instruction cp.async.ca.shared.global performs an asynchronous copy from global memory to shared memory with cache allocation hints. It is available on SM_80+ architectures (Ampere+). Syntax examples from LLVM NVVM intrinsics and forum discussions:

- cp.async.ca.shared.global.4 [dst_shared], [src_global], src_size
- cp.async.ca.shared.global.8 [dst_shared], [src_global], src_size
- cp.async.ca.shared.global.16 [dst_shared], [src_global], src_size

Semantics:

- Asynchronous: the copy is non-blocking; the thread continues execution immediately. Completion requires explicit synchronization like cp.async.wait_all or mbarrier.
- Direction: global (.global) source to shared (.shared) destination.
- Cache hints: ca = cache-at-level (allocates cache lines in L2), specific to global reads.
- Size variants: 4, 8, or 16 bytes per instruction.

src-size and cp-size parameters:

- src-size (third operand): specifies the actual source data size in bytes. Optional in some intrinsics but required for precise hardware behavior (e.g., LLVM nvvm_cp_async_ca_shared_global_4 takes src_size). Matches the transfer size (4/8/16) for aligned copies but allows optimization hints.
- cp-size: the copy size, same as the instruction variant (4/8/16 bytes). Not a separate parameter; encoded in the opcode.

The operation queues the copy in hardware pipelines for overlap with computation. Use with shared memory barriers or wait instructions for consumer threads. For bulk copies, use cp.async.bulk variants (SM_90+). Sources confirm usage in forums (1, 6, 11), LLVM docs (8, 18), PTX ISA docs (2, 3, 4, 12, 19), and CUDA guides (13, 22).
Citations:
- 1: https://forums.developer.nvidia.com/t/problem-about-ptx-instruction-cp-async-ca-shared-global/224219
- 2: https://docs.nvidia.com/cuda/parallel-thread-execution/
- 3: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html?highlight=st+async
- 4: https://docs.nvidia.com/cuda/pdf/ptx_isa_8.5.pdf
- 5: https://forums.developer.nvidia.com/t/ptx-cp-async-purpose-of-ignore-src-and-prefetch-size/217746
- 6: https://gevtushenko.github.io/cccl/libcudacxx/ptx/instructions/cp_async_bulk.html
- 7: https://llvm.org/docs/NVPTXUsage.html
- 8: https://forums.developer.nvidia.com/t/cuda-ptx-cp-async-only-supports-global-to-shared-memory-copy/245979
- 9: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html?highlight=async
- 10: https://docs.nvidia.com/cuda/cuda-programming-guide/04-special-topics/async-copies.html
- 11: https://reviews.llvm.org/D150820?id=523455
- 12: https://docs.nvidia.com/cuda/pdf/ptx_isa_8.8.pdf
- 13: https://docs.nvidia.com/cuda/cuda-programming-guide/03-advanced/advanced-kernel-programming.html
🏁 Script executed:
```shell
# Let's search for the actual PTX documentation reference or any existing implementation
# that uses cp.async in the codebase to understand the pattern
rg "cp\.async" include/flashinfer/ --type cuh --type h -A 2 -B 2
```
Repository: flashinfer-ai/flashinfer
Length of output: 95
🌐 Web query:
PTX ISA cp.async zero-fill semantics src-size parameter must be less than copy size
💡 Result:
In NVIDIA PTX ISA, the cp.async instructions (e.g., cp.async.cg.shared.global) support an optional src_size parameter for zero-fill semantics. When src_size is specified and is less than the copy size (typically 16 bytes), the instruction copies src_size bytes from the source and zero-fills the remaining bytes in the destination. This allows handling unaligned or partial copies while ensuring the full destination granularity (16 bytes) is written, with zeros in the unused portion, preventing data leakage or corruption in shared memory.

This feature is used in high-performance kernels for efficient asynchronous global-to-shared copies, especially with predicates or odd-sized tiles, as seen in CUTLASS and CCCL implementations (e.g., cp_async_zfill traits). The requirement ensures valid operation without reading beyond source bounds, aligning with hardware constraints on Ampere (SM_80+) and later architectures. Official PTX docs confirm cp.async variants with src_size for zero-fill behavior, though the exact syntax is in the instruction-set details (PTX ISA 8.5+).

Primary sources: NVIDIA PTX ISA docs (versions 8.5-9.2), CUTLASS issues/PRs on cp.async.zfill, LLVM NVPTX intrinsics adding src_size to nvvm_cp_async*, NVIDIA CCCL cp.async.bulk implementations requiring 16-byte multiples with alignment.
Citations:
- 1: https://docs.nvidia.com/cuda/parallel-thread-execution/
- 2: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html?highlight=async
- 3: https://docs.nvidia.com/cuda/pdf/ptx_isa_8.8.pdf
- 4: https://docs.nvidia.com/cuda/parallel-thread-execution/contents.html
- 5: https://gevtushenko.github.io/cccl/libcudacxx/ptx/instructions/cp_async_bulk.html
- 6: https://ita9naiwa.github.io/mlsys/2025/10/05/ptx-mental-model.html
- 7: https://docs.nvidia.com/cuda/cuda-programming-guide/04-special-topics/async-copies.html
- 8: cp async fallback NVIDIA/cutlass#761
- 9: [BUG] Non-ZFILL CP_ASYNC copy trait cause buffer overwriting with predicated copy NVIDIA/cutlass#1716
- 10: https://reviews.llvm.org/D150820?id=523455
🌐 Web query:
NVIDIA PTX cp.async.ca.shared.global src-size equal to copy-size valid
💡 Result:
Yes, for the NVIDIA PTX instruction cp.async.ca.shared.global, the source size (determined by the operand type, e.g., .u16 for 16 bytes, .u8 for 8 bytes) must equal the copy size. This instruction performs fixed-size asynchronous copies from global memory to shared memory, where the copy granularity is specified by the type size. Using mismatched sizes is invalid and may lead to undefined behavior or assembly errors. This follows the standard PTX async copy semantics where operand type dictates transfer size, as seen in related cp.async instructions and forum discussions on size requirements (e.g., 16-byte minimum for certain variants). Official PTX ISA documentation confirms fixed-size transfers for these operations, with no variable src-size parameter in the syntax.
Citations:
- 1: https://forums.developer.nvidia.com/t/problem-about-ptx-instruction-cp-async-ca-shared-global/224219
- 2: https://forums.developer.nvidia.com/t/cg-cache-modifier-with-async-copy-size-of-less-than-16/349443
- 3: https://docs.nvidia.com/cuda/cuda-programming-guide/04-special-topics/async-copies.html
- 4: https://forums.developer.nvidia.com/t/cuda-ptx-cp-async-only-supports-global-to-shared-memory-copy/245979
- 5: https://reviews.llvm.org/D100394
🏁 Script executed:
# Let's search for actual documentation or CUTLASS usage patterns
rg "cp\.async" include/flashinfer/ -A 2 -B 2

Repository: flashinfer-ai/flashinfer
Length of output: 9023
🌐 Web query:
NVIDIA PTX ISA 8.8 cp.async.ca.shared.global instruction specification detailed syntax
💡 Result:
The NVIDIA PTX ISA 8.8 documentation is available as a PDF at https://docs.nvidia.com/cuda/pdf/ptx_isa_8.8.pdf (published 2025-05-31). The cp.async.ca.shared.global instruction is an asynchronous copy operation from global memory to shared memory, introduced for compute capability 8.0+ (Ampere and later architectures). Syntax: cp.async.ca.shared.global [<dst_shared_address>], [<src_global_address>], ; - ca: cache allocation hint (cache at L2). - shared.global: specifies destination in shared memory, source in global memory. - It queues an asynchronous transfer; use cp.async.commit_group, cp.async.wait_group, or cp.async.wait_all to manage groups and synchronization. - Only supports global-to-shared direction (not shared-to-global for this variant). - Detailed semantics, modifiers, and examples are in PTX ISA 8.8 section 9.7.9 "Data Movement and Conversion Instructions" subsection on asynchronous copies (cp.async family), around 9.7.9.25.3.1. HTML docs at https://docs.nvidia.com/cuda/parallel-thread-execution/ reference the same content, with contents listing relevant sections like 9.7.9.25 Asynchronous Copies. Forums confirm usage for global-to-shared async loads with cache hints.
Citations:
- 1: https://docs.nvidia.com/cuda/pdf/ptx_isa_8.8.pdf
- 2: https://docs.nvidia.com/cuda/hopper-tuning-guide/parallel-thread-execution/contents.html
- 3: https://docs.nvidia.com/cuda/hopper-tuning-guide/parallel-thread-execution/index.html
- 4: https://www.aidoczh.com/cuda/parallel-thread-execution/contents.html
- 5: https://docs.nvidia.com/cuda/cuda-programming-guide/04-special-topics/async-copies.html
- 6: https://forums.developer.nvidia.com/t/problem-about-ptx-instruction-cp-async-ca-shared-global/224219
- 7: https://forums.developer.nvidia.com/t/cuda-ptx-cp-async-only-supports-global-to-shared-memory-copy/245979
- 8: https://stackoverflow.com/questions/66145682/how-do-compute-capabilities-7-x-8-x-assist-cooperative-group-operations
🌐 Web query:
CUTLASS cp.async.zfill implementation copy size src size parameter semantics
💡 Result:
In CUTLASS's CuTe library, cp.async.zfill refers to asynchronous copy operations using PTX cp.async.*.zfill instructions, wrapped in Copy_Atom traits like SM80_CP_ASYNC_CACHEGLOBAL_ZFILL. The copy size is determined by the source tensor partition size (thr_copy.partition_S(tG)) and destination partition size (thr_copy.partition_D(tS)), typically vector types like uint128_t for 16 bytes. For non-zfill variants (e.g., SM80_CP_ASYNC_CACHEGLOBAL), when used with predication (copy_if), there's a bug where it silently re-dispatches to the _ZFILL trait (lines 77-82 in copy_traits_sm80.hpp), causing zero-filling of predicated-out destination locations regardless of predication, leading to buffer overwriting. ZFILL variants explicitly zero-fill the destination beyond the source data extent, using the guard bit in cp.async.zfill to specify bytes to write vs. zero. The src size parameter semantics: the copy atom loads the full vector from source (if available), but zfill ensures destination lanes beyond source valid data are zeroed. Predication should mask entire copies, but the bug causes zfill behavior even without _ZFILL. Workaround: manually loop with if(p(i)) copy(...) or use ZFILL explicitly if zero-filling is desired. Primary sources confirm the implementation details and bug behavior.
Citations:
- 1: [BUG] Non-ZFILL CP_ASYNC copy trait cause buffer overwriting with predicated copy NVIDIA/cutlass#1716
- 2: https://github.com/NVIDIA/cutlass/blob/main/include/cute/atom/copy_traits_sm90_im2col.hpp
- 3: fix Copy_Traits<SM80_CP_ASYNC_*_ZFILL> without definition of with(pred) function NVIDIA/cutlass#1887
- 4: cp async fallback NVIDIA/cutlass#761
- 5: https://ipd.graylab.jhu.edu/rfdiffusion2/cutlass-3.5.1/include/cute/atom/
🏁 Script executed:
# Let's check how cp.async.cg (the working variant) differs from cp.async.ca
cat -n include/flashinfer/cp_async.cuh | sed -n '80,160p'

Repository: flashinfer-ai/flashinfer
Length of output: 3793
🌐 Web query:
PTX ISA cp.async.ca vs cp.async.cg cache hints semantics differences
💡 Result:
In NVIDIA PTX ISA, cp.async is an asynchronous copy instruction used primarily for efficient data movement between global memory and shared memory in CUDA kernels, introduced in Ampere (SM_80+) architectures. The suffixes .ca and .cg are cache hint operators applied to the source operand (typically global memory). cp.async.ca.shared.global [shared_dest], [global_src], size; - .ca: Cache at L2 (request data to be cached in L2 cache). Suitable for data reuse across warps or blocks where L2 hit is beneficial. cp.async.cg.shared.global [shared_dest], [global_src], size; - .cg: Cache at global (streaming load to L1/L2 but with global caching policy, bypassing or weakly caching in L1, streaming semantics). Used for sequential/streaming accesses with less reuse, reducing L1 pollution. Key differences in semantics: - .ca prefers L2 caching for better hit rates in reuse scenarios. - .cg optimizes for streaming, minimizing cache pressure in L1 while still leveraging L2. These hints guide hardware prefetching and caching behavior but are advisory; actual behavior depends on GPU architecture (Ampere, Hopper). Behavior may differ across architectures (e.g., Ampere vs Hopper as noted in forums). For precise details, refer to PTX ISA documentation section on Data Movement Instructions and Cache Operators (9.7.9.1). cp.async.bulk variants exist for larger transfers with additional qualifiers.
Citations:
- 1: https://docs.nvidia.com/cuda/parallel-thread-execution/
- 2: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html?highlight=async
- 3: https://research.meekolab.com/messing-around-with-gpus-again
- 4: https://forums.developer.nvidia.com/t/cg-cache-modifier-with-async-copy-size-of-less-than-16/349443
- 5: https://docs.nvidia.com/cuda/parallel-thread-execution/contents.html
- 6: https://docs.nvidia.com/cuda/cuda-programming-guide/03-advanced/advanced-kernel-programming.html
- 7: https://forums.developer.nvidia.com/t/cuda-ptx-cp-async-cg-performs-differently-on-ampere-and-hopper/298612
🌐 Web query:
"cp.async" "src_in_bytes" "cp_size" PTX semantics zero fill behavior specification
💡 Result:
The PTX instruction "cp.async" is NVIDIA's asynchronous copy instruction for GPU memory operations, introduced in SM_80+ architectures (Ampere and later). It enables non-blocking data transfers, typically from global to shared memory, overlapping compute with memory copies. Higher-level CUDA APIs like cuda::memcpy_async compile to cp.async PTX under the hood. Key semantics from PTX ISA and related docs: - Syntax examples: cp.async[.cg][.collective].shared.global[.collective_type][.sem][.wait_id][.noinc][.nocomplete][.src_in_bytes][.cp_size][.tenant] [dst], [src], src_size, [mbarrier]; - .src_in_bytes: Specifies source size in bytes (allows flexible sizing). - .cp_size: Likely specifies copy size or related parameter (exact PTX docs needed for full syntax). - Operations are asynchronous; completion tracked via mbarrier (memory barrier) objects, with instructions like cp.async.mbarrier.arrive, cp.async.wait_group, etc. - Variants include cp.async.bulk for bulk transfers, cp.async.bulk.tensor for tensor memory access, with qualifiers like .mbarrier::complete_tx::bytes, .multicast::cluster (SM_90a+). Zero fill behavior: No explicit mention in available sources of automatic zero-filling for cp.async. Unlike some bulk copy instructions (e.g., certain TMA or cp.async.bulk with .ignore_oob), standard cp.async copies the specified bytes without documented zero-padding or zero-fill for untransferred portions. If source size < destination allocation or partial copies occur, destination may retain prior contents (garbage) unless explicitly initialized. For zero-initialization, use separate memzero or cp.async with zero source. Confirm with latest PTX ISA 9.2 docs section on "Asynchronous Copy Instructions" for precise behavior on misalignment/partial copies. Authoritative sources: NVIDIA PTX ISA 9.2 (docs.nvidia.com/cuda/parallel-thread-execution/), CUDA Programming Guide async copies section.
Citations:
- 1: https://docs.nvidia.com/cuda/parallel-thread-execution/
- 2: https://gevtushenko.github.io/cccl/libcudacxx/ptx/instructions/cp_async_bulk.html
- 3: https://github.com/HazyResearch/ThunderKittens/pull/97/files
- 4: https://reviews.llvm.org/D100394
- 5: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html?highlight=async
- 6: https://reviews.llvm.org/D150820?id=523455
- 7: https://www.mail-archive.com/cfe-commits@lists.llvm.org/msg232455.html
🏁 Script executed:
# Let me look at the complete pred_load_32b function to see the exact implementation
cat -n include/flashinfer/cp_async.cuh | sed -n '226,270p'

Repository: flashinfer-ai/flashinfer
Length of output: 1967
🌐 Web query:
NVIDIA PTX ISA cp.async "src_size must be less than copy size" zero fill constraint specification
💡 Result:
The NVIDIA PTX ISA cp.async instructions (introduced in SM_80/Ampere and later) perform asynchronous copies, typically from global to shared memory. Variants like cp.async.zfill (or _ZFILL in CUTLASS) zero-fill the destination beyond the available source data. No official NVIDIA PTX documentation explicitly states a constraint "src_size must be less than copy size". Searches across PTX ISA docs (v8.5, v8.8, v9.2 HTML/PDF), CUDA Programming Guide, and developer forums did not yield this exact phrase or detailed src_size parameter spec. However, evidence from libraries and tools indicates related constraints on copy granularity for zero-fill modes: - cp.async requires minimum transfer sizes: 4 bytes minimum (Triton-lang error for 2 bytes [6]); 16 bytes for certain modifiers like .cg [3]. - CUTLASS cp_async_zfill falls back to synchronous copies for 1-2 byte accesses, implying hardware cp.async.zfill expects ElementSize >=4 bytes (4,8,16) [7]. - LLVM/Clang NVPTX intrinsics note optional src_size argument in later CUDA 11+ [15]. - cp.async.bulk requires 16-byte alignment and size multiple of 16 [1,8]. The constraint likely refers to a compiler/assembler validation or library assertion in zero-fill variants (e.g., CUTLASS), where src_size (available valid source bytes) must be < copy_size (fixed hardware granule, e.g., 16B) to trigger zero-filling the remainder. If src_size == copy_size, use non-zfill cp.async; if src_size > copy_size, invalid/overflow. This ensures proper zero-padding for misaligned/trailing data without source reads beyond bounds. For authoritative spec, consult latest PTX ISA PDF (e.g., ptx_isa_9.2.pdf "Asynchronous Copy" section in Data Movement Instructions, 9.7.15 or similar), as HTML versions lack full instruction details. Forum [2] discusses ignore-src (ignores bytes > src_size, zero-fills), supporting partial-src semantics.
Citations:
- 1: https://nvidia.github.io/cccl/libcudacxx/ptx/instructions/cp_async_bulk.html
- 2: https://forums.developer.nvidia.com/t/cg-cache-modifier-with-async-copy-size-of-less-than-16/349443
- 3: https://docs.nvidia.com/cuda/parallel-thread-execution/
- 4: error: cp.async does not support transfers smaller than 4 bytes (2) triton-lang/triton#8023
- 5: cp async fallback NVIDIA/cutlass#761
- 6: https://gevtushenko.github.io/cccl/libcudacxx/ptx/instructions/cp_async_bulk.html
- 7: https://docs.nvidia.com/cuda/cuda-programming-guide/04-special-topics/async-copies.html
- 8: https://reviews.llvm.org/D100394
- 9: https://docs.nvidia.com/cuda/pdf/ptx_isa_8.8.pdf
🏁 Script executed:
# Check git history or tests for pred_load_128b to see if this pattern is validated
git log --oneline include/flashinfer/cp_async.cuh | head -20

Repository: flashinfer-ai/flashinfer
Length of output: 77
🏁 Script executed:
# Search for actual usage of these functions to understand context
rg "pred_load_128b|pred_load_32b" include/flashinfer/ --type c --type cpp --type h | head -30

Repository: flashinfer-ai/flashinfer
Length of output: 50
🏁 Script executed:
# Let's search for actual usage of pred_load_128b to see if it's tested or used
rg "pred_load_128b\|pred_load_32b" --type-add "cuda:*.cu" --type-add "cuh:*.cuh" -B 2 -A 2

Repository: flashinfer-ai/flashinfer
Length of output: 50
🏁 Script executed:
# Check if there are any tests that validate the cp.async behavior
find . -name "*test*" -type f | grep -i "cp_async\|async" | head -10

Repository: flashinfer-ai/flashinfer
Length of output: 50
🏁 Script executed:
# Let's verify the exact semantics by looking at the comment in the kNoFill branch
# which explicitly states what cp.async does
cat -n include/flashinfer/cp_async.cuh | sed -n '210,225p'

Repository: flashinfer-ai/flashinfer
Length of output: 619
🏁 Script executed:
# Check if there are any related issues or PRs discussing this pattern
git log --all --oneline --grep="cp.async\|pred_load" 2>/dev/null | head -20

Repository: flashinfer-ai/flashinfer
Length of output: 223
🏁 Script executed:
# Let's check the commit that mentions "zero fill for cp.async" to understand the semantics
git show 541d25b --stat | head -20

Repository: flashinfer-ai/flashinfer
Length of output: 323
🏁 Script executed:
# Check the diff of that commit to see what was changed
git show 541d25b | head -100

Repository: flashinfer-ai/flashinfer
Length of output: 1504
🏁 Script executed:
# Let's also check the most recent commit that modified this file
git log --oneline -1 include/flashinfer/cp_async.cuh

Repository: flashinfer-ai/flashinfer
Length of output: 77
🏁 Script executed:
# Check when pred_load_128b_from_64b was added
git log --oneline include/flashinfer/cp_async.cuh | grep -i "load_128b\|from_64b" || git log -p include/flashinfer/cp_async.cuh | grep -B 5 "pred_load_128b_from_64b" | head -20

Repository: flashinfer-ai/flashinfer
Length of output: 50
🏁 Script executed:
# Let's look at the actual line numbers and comments more carefully
cat -n include/flashinfer/cp_async.cuh | sed -n '180,200p'

Repository: flashinfer-ai/flashinfer
Length of output: 1366
🏁 Script executed:
# Let me check the comment more carefully - the code says "cp.async always zeros the upper 8 bytes"
# This is a critical claim about cp.async behavior
cat -n include/flashinfer/cp_async.cuh | sed -n '200,215p'

Repository: flashinfer-ai/flashinfer
Length of output: 739
🌐 Web query:
cp.async "always zeros" upper bytes PTX behavior semantics
💡 Result:
In NVIDIA PTX, the cp.async instruction family performs asynchronous copies, typically from global to shared memory, available on SM 8.0+ (Ampere+). Standard cp.async variants (e.g., cp.async.ca.shared.global [.b32|.b64|.b128]) require source and destination alignments matching the vector size (4/8/16 bytes), and copies exact byte counts specified. Misalignment or smaller sizes lead to PTX assembly errors or fallback behaviors in libraries. A key semantic is the ZFILL variant (e.g., cp.async.cg.shared.global.zfill [.b32|.b64]), which always zero-fills the upper unused bytes in the destination vector when the source provides fewer bytes than the vector width. This is used in libraries like CUTLASS for handling unaligned or sub-vector copies safely, preventing buffer overwrites or garbage data. For example, in CUTLASS Copy_Atom<SM80_CP_ASYNC_CACHEGLOBAL_ZFILL<...>>, ZFILL ensures upper bytes are zeroed, while non-ZFILL may silently dispatch to ZFILL or cause overwriting issues with predicates (as noted in CUTLASS issues #1716, #1768). Non-ZFILL cp.async expects full vector loads; partial loads trigger errors like "unexpected value '8', expected 16". PTX requires 16-byte alignment for many async copies, with size multiples of vector width. Upper bytes behavior: ZFILL explicitly zeros them ("always zeros"); standard variants copy source data exactly or undefined on misalignment. This matches developer observations in CUTLASS PR #761 (fallbacks for 1-2 byte cases) and issues where non-ZFILL overwrites with zeros unexpectedly due to trait dispatch.
Citations:
- 1: cp async fallback NVIDIA/cutlass#761
- 2: [BUG] Non-ZFILL CP_ASYNC copy trait cause buffer overwriting with predicated copy NVIDIA/cutlass#1716
- 3: fix Copy_Traits<SM80_CP_ASYNC_*_ZFILL> without definition of with(pred) function NVIDIA/cutlass#1887
- 4: [QST] SM80_CP_ASYNC_CACHEGLOBAL doesn't allow anything but 128 bit NVIDIA/cutlass#1768
- 5: https://docs.nvidia.com/cuda/cuda-programming-guide/04-special-topics/async-copies.html
- 6: https://gevtushenko.github.io/cccl/libcudacxx/ptx/instructions/cp_async_bulk.html
🏁 Script executed:
# Let me check if there's a .zfill variant mentioned anywhere in the file
rg "zfill|ZFILL" include/flashinfer/cp_async.cuh

Repository: flashinfer-ai/flashinfer
Length of output: 50
🏁 Script executed:
# Check what the actual difference is between the kFillZero and kNoFill branches
# and understand the intended zero-fill semantics
cat -n include/flashinfer/cp_async.cuh | sed -n '191,224p' | cat -A

Repository: flashinfer-ai/flashinfer
Length of output: 1786
cp.async copy size is incorrect; the 64b helper cannot zero the upper half of the 128b destination.
PTX cp.async only zero-fills bytes beyond the declared copy size (cp_size), so cp_size=8 with src_size=8 cannot zero bytes 8-15 of the destination. The 64b helper needs cp_size=16 with variable src_size (8 when predicate is true, 0 when false) to properly load 64 bits and zero-fill the upper 64 bits. Similarly, the 32b helper's fill_mode=kFillZero branch needs restructuring: use cp_size=4 for non-zero predicate paths, and cp_size=4, src_size=0 for explicit zero-fill when predicate is false.
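The rule above can be sanity-checked with a tiny host-side model (Python, purely illustrative — `cp_async_model` is a made-up helper emulating the documented cp-size/src-size semantics, not PTX):

```python
# Host-side model (illustrative, not PTX) of cp.async's cp-size / src-size
# zero-fill rule: src_size bytes are copied, bytes [src_size, cp_size) of the
# destination are zeroed, and bytes at or beyond cp_size are never touched.

def cp_async_model(smem: bytearray, dst_off: int, src: bytes,
                   cp_size: int, src_size: int) -> None:
    assert 0 <= src_size <= cp_size
    smem[dst_off:dst_off + src_size] = src[:src_size]
    smem[dst_off + src_size:dst_off + cp_size] = bytes(cp_size - src_size)

# cp_size=8, src_size=8 (the current code): bytes 8..15 of the slot stay stale.
stale = bytearray(b"\xff" * 16)
cp_async_model(stale, 0, b"\x11" * 8, cp_size=8, src_size=8)

# cp_size=16, src_size=8 (the fix): bytes 8..15 are zero-filled as intended.
fixed = bytearray(b"\xff" * 16)
cp_async_model(fixed, 0, b"\x11" * 8, cp_size=16, src_size=8)

print(stale.hex(), fixed.hex())
```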
Suggested fix sketch
template <PrefetchMode prefetch_mode, SharedMemFillMode fill_mode, typename T>
__device__ __forceinline__ void pred_load_128b_from_64b(T* smem_ptr, const T* gmem_ptr,
                                                        bool predicate) {
#ifdef FLASHINFER_CP_ASYNC_ENABLED
  uint32_t smem_int_ptr = static_cast<uint32_t>(__cvta_generic_to_shared(smem_ptr));
  if constexpr (fill_mode == SharedMemFillMode::kFillZero) {
    int src_in_bytes = predicate ? 8 : 0;
-   asm volatile("cp.async.ca.shared.global [%0], [%1], %2, %3;\n" ::"r"(smem_int_ptr),
-                "l"(gmem_ptr), "n"(8), "r"(src_in_bytes));
+   asm volatile("cp.async.ca.shared.global [%0], [%1], %2, %3;\n" ::"r"(smem_int_ptr),
+                "l"(gmem_ptr), "n"(16), "r"(src_in_bytes));
  } else {
    asm volatile(
        "{\n"
        " .reg .pred p;\n"
        " setp.ne.b32 p, %0, 0;\n"
        " @p cp.async.ca.shared.global [%1], [%2], %3, %4;\n"
        "}\n" ::"r"((int)predicate),
-       "r"(smem_int_ptr), "l"(gmem_ptr), "n"(8), "n"(8));
+       "r"(smem_int_ptr), "l"(gmem_ptr), "n"(16), "n"(8));
  }
#else
  ...
#endif
}

template <SharedMemFillMode fill_mode>
__device__ __forceinline__ void pred_load_32b(uint32_t* smem_ptr, const uint32_t* gmem_ptr,
                                              bool predicate) {
#ifdef FLASHINFER_CP_ASYNC_ENABLED
  uint32_t smem_int_ptr = static_cast<uint32_t>(__cvta_generic_to_shared(smem_ptr));
  if constexpr (fill_mode == SharedMemFillMode::kFillZero) {
-   int src_in_bytes = predicate ? 4 : 0;
-   asm volatile("cp.async.ca.shared.global [%0], [%1], %2, %3;\n" ::"r"(smem_int_ptr),
-                "l"(gmem_ptr), "n"(4), "r"(src_in_bytes));
+   if (predicate) {
+     asm volatile("cp.async.ca.shared.global [%0], [%1], %2;\n" ::"r"(smem_int_ptr),
+                  "l"(gmem_ptr), "n"(4));
+   } else {
+     asm volatile("cp.async.ca.shared.global [%0], [%1], %2, %3;\n" ::"r"(smem_int_ptr),
+                  "l"(gmem_ptr), "n"(4), "r"(0));
+   }
  } else {
    ...
  }
#else
  ...
#endif
}

Also applies to: 236-244
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@include/flashinfer/cp_async.cuh` around lines 191 - 224, The cp.async call in
pred_load_128b_from_64b uses cp_size=8 which doesn’t zero the upper 8 bytes;
change the cp.async invocation(s) to use cp_size=16 and pass src_size as a
variable (src_size = predicate ? 8 : 0) so cp.async zero-fills bytes 8..15 when
src_size is 0, and similarly adjust the kNoFill branch to issue cp.async with
cp_size=16 and src_size conditionally 8 or 0 (instead of cp_size=8); also apply
the same fix pattern to the 32b helper described in the comment: use cp_size
appropriate to the full destination (e.g., 16 for 128b destination or 4 for 32b
helper) and make src_size variable (0 when wanting explicit zero-fill, nonzero
when copying) so cp.async actually zeros the upper bytes.
#if (defined __CUDACC_VER_MAJOR__) && (__CUDACC_VER_MAJOR__ >= 13) && \
    (defined __CUDACC_VER_MINOR__) && (__CUDACC_VER_MINOR__ >= 2)
  // cvt.rn.bf16x2.e2m1x2 requires CUDA Toolkit >= 13.2
  asm volatile(
      "{\n"
      ".reg .b8 fp4_byte;\n"
      "mov.b32 {fp4_byte, _, _, _}, %1;\n"
      "cvt.rn.bf16x2.e2m1x2 %0, fp4_byte;\n"
      "}"
      : "=r"(y)
      : "r"(b));
#else
  // Fallback: convert e2m1 -> fp16 -> bf16 when cvt.rn.bf16x2.e2m1x2 is unavailable
  uint32_t fp16x2;
  asm volatile(
      "{\n"
      ".reg .b8 fp4_byte;\n"
      "mov.b32 {fp4_byte, _, _, _}, %1;\n"
      "cvt.rn.f16x2.e2m1x2 %0, fp4_byte;\n"
      "}"
      : "=r"(fp16x2)
      : "r"(b));
  __half2 h2 = reinterpret_cast<__half2&>(fp16x2);
  __nv_bfloat162 bf16x2 = __float22bfloat162_rn(__half22float2(h2));
  y = reinterpret_cast<uint32_t&>(bf16x2);
🧩 Analysis chain
🌐 Web query:
For NVCC, what values do __CUDACC_VER_MAJOR__ and __CUDACC_VER_MINOR__ take on CUDA 14.0+, and what preprocessor expression is recommended for checking "CUDA Toolkit >= 13.2"?
💡 Result:
For NVCC in CUDA 14.0+, __CUDACC_VER_MAJOR__ is the major version number of nvcc (e.g., 14 for CUDA 14.x) and __CUDACC_VER_MINOR__ is the minor version number of nvcc (e.g., 0 for CUDA 14.0). These macros are predefined by nvcc when compiling CUDA source files and correspond to the nvcc compiler version, which matches the CUDA Toolkit version. The recommended preprocessor expression for checking "CUDA Toolkit >= 13.2" is (__CUDACC_VER_MAJOR__ > 13) || (__CUDACC_VER_MAJOR__ == 13 && __CUDACC_VER_MINOR__ >= 2), assuming nvcc version aligns with toolkit version as per release notes (e.g., CUDA NVCC 13.2.51 for CUDA 13.2).
Citations:
- 1: https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/
- 2: https://docs.nvidia.com/cuda/pdf/CUDA_Compiler_Driver_NVCC.pdf
- 3: https://stackoverflow.com/questions/50648671/how-do-i-convert-a-cudacc-ver-value-into-a-major-minor-build-triplet
- 4: https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html
- 5: http://developer.nvidia.com/cuda-toolkit/whatsnew
Fix the CUDA-version gate to enable the fast path on CUDA 14.x and later.
Lines 486-487 check __CUDACC_VER_MAJOR__ >= 13 and separately __CUDACC_VER_MINOR__ >= 2, which breaks on CUDA 14.0 (where __CUDACC_VER_MAJOR__=14 and __CUDACC_VER_MINOR__=0). The condition should combine the version numbers to properly detect CUDA >= 13.2.
Suggested fix
-#if (defined __CUDACC_VER_MAJOR__) && (__CUDACC_VER_MAJOR__ >= 13) && \
- (defined __CUDACC_VER_MINOR__) && (__CUDACC_VER_MINOR__ >= 2)
+#if defined(__CUDACC_VER_MAJOR__) && defined(__CUDACC_VER_MINOR__) && \
+    ((__CUDACC_VER_MAJOR__ * 10000 + __CUDACC_VER_MINOR__ * 100) >= 130200)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@include/flashinfer/vec_dtypes.cuh` around lines 486 - 510, The CUDA version
gate incorrectly requires both __CUDACC_VER_MAJOR__ >= 13 and
__CUDACC_VER_MINOR__ >= 2, which fails for CUDA 14.x; update the preprocessor
condition that guards the fast-path asm (the block using cvt.rn.bf16x2.e2m1x2
and variable y/b) to check the combined version (e.g., compare major/minor
together or compute a single numeric version) so it enables the fast path for
CUDA >= 13.2 (including 14.0+).
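The failure mode is easy to reproduce outside the preprocessor (a Python sketch of the two gating strategies; function names are illustrative):

```python
# Why the original gate breaks on CUDA 14.0: it compares the major and minor
# version numbers independently, so (14, 0) fails the minor check even though
# 14.0 > 13.2. Comparing the pair lexicographically — equivalent to the
# combined-arithmetic preprocessor fix — handles later majors correctly.

def naive_gate(major: int, minor: int) -> bool:
    return major >= 13 and minor >= 2        # mirrors the buggy #if

def combined_gate(major: int, minor: int) -> bool:
    return (major, minor) >= (13, 2)         # mirrors the fixed #if

print(naive_gate(13, 2), combined_gate(13, 2))   # True True
print(naive_gate(14, 0), combined_gate(14, 0))   # False True  <- the bug
print(naive_gate(13, 1), combined_gate(13, 1))   # False False
```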
Actionable comments posted: 2
🧹 Nitpick comments (2)
tests/attention/test_batch_prefill_kernels.py (1)
1151-1165: Please add an asymmetric head_dim_qk/head_dim_vo case here.

head_dim_vo is omitted, so this suite only covers the default head_dim_vo == head_dim_qk path. A 192/128-style case would exercise the packed-V sizing logic that the current fixture cannot catch. The same omission exists in tests/attention/test_batch_attention.py, so it would be worth updating both together.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/attention/test_batch_prefill_kernels.py` around lines 1151 - 1165, The test only exercises symmetric head sizes because wrapper.plan is called with head_dim only; add an asymmetric case by passing explicit head_dim_qk and head_dim_vo arguments to wrapper.plan (for example head_dim_qk=192, head_dim_vo=128) so the packed-V sizing logic is exercised; update the wrapper.plan invocation in tests/attention/test_batch_prefill_kernels.py (and mirror the same change in tests/attention/test_batch_attention.py) to include these two explicit parameters instead of relying on the default head_dim equality.

include/flashinfer/attention/prefill.cuh (1)
449-498: Consider adding a null check for defensive programming.

If sf_ptr is nullptr but is_fp4_type_v<DTypeKV> is true, the function computes offsets and calls pred_load_32b with an invalid source pointer. While the design assumes FP4 usage implies scales are provided, a null check would add robustness:

template <bool produce_v, typename KTraits, typename IdType>
__device__ __forceinline__ void page_produce_kv_sf(
    typename KTraits::SharedStorage* smem_storage, uint8_t* sf_ptr, ...) {
  if constexpr (!is_fp4_type_v<typename KTraits::DTypeKV>) return;
+ if (sf_ptr == nullptr) return;

This prevents undefined behavior if FP4 is compiled but scales are accidentally omitted at runtime.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@include/flashinfer/attention/prefill.cuh` around lines 449 - 498, The function page_produce_kv_sf may dereference sf_ptr when is_fp4_type_v<typename KTraits::DTypeKV> is true; add a defensive null check at the start of page_produce_kv_sf (after the is_fp4_type_v constexpr) that returns early if sf_ptr == nullptr to avoid computing sf_gmem_offset and calling cp_async::pred_load_32b with an invalid source pointer; keep the check independent of produce_v and ensure it triggers before the NUM_SF_ITERS loop so symbols page_produce_kv_sf, sf_ptr, is_fp4_type_v, and cp_async::pred_load_32b are addressed.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@flashinfer/prefill.py`:
- Around line 737-738: The code currently assumes packed NVFP4 V/O width equals
q.shape[-1] when kv_block_scales (or key_block_scales/value_block_scales)
indicate a packed V cache, which breaks configs where head_dim_vo !=
head_dim_qk; update the run path to use the planned head_dim_vo instead of
q.shape[-1] for packed outputs and persist the planned head_dim_vo from plan()
onto self (e.g., self.head_dim_vo) so run() can read it; adjust any branches
that check kv_block_scales/key_block_scales/value_block_scales to select
self.head_dim_vo as the V/O width when packed is detected.
In `@include/flashinfer/cp_async.cuh`:
- Around line 192-223: pred_load_128b_from_64b: ensure the cp.async path zeroes
the upper 8 bytes to match the fallback by changing the assembly copy size to 16
while keeping src-size=8 (i.e. use cp.async.ca.shared.global with cp-size=16,
src-size=8) in both the fill-mode (kFillZero) branch and the kNoFill branch so
the upper half of the 16-byte slot is zero-padded when only 8 bytes are sourced;
keep the predicate logic and the fallback (smem_u64[1] = 0) unchanged.
---
Nitpick comments:
In `@include/flashinfer/attention/prefill.cuh`:
- Around line 449-498: The function page_produce_kv_sf may dereference sf_ptr
when is_fp4_type_v<typename KTraits::DTypeKV> is true; add a defensive null
check at the start of page_produce_kv_sf (after the is_fp4_type_v constexpr)
that returns early if sf_ptr == nullptr to avoid computing sf_gmem_offset and
calling cp_async::pred_load_32b with an invalid source pointer; keep the check
independent of produce_v and ensure it triggers before the NUM_SF_ITERS loop so
symbols page_produce_kv_sf, sf_ptr, is_fp4_type_v, and cp_async::pred_load_32b
are addressed.
In `@tests/attention/test_batch_prefill_kernels.py`:
- Around line 1151-1165: The test only exercises symmetric head sizes because
wrapper.plan is called with head_dim only; add an asymmetric case by passing
explicit head_dim_qk and head_dim_vo arguments to wrapper.plan (for example
head_dim_qk=192, head_dim_vo=128) so the packed-V sizing logic is exercised;
update the wrapper.plan invocation in
tests/attention/test_batch_prefill_kernels.py (and mirror the same change in
tests/attention/test_batch_attention.py) to include these two explicit
parameters instead of relying on the default head_dim equality.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: bddea1b0-077d-4606-bfad-a761fbad1553
📒 Files selected for processing (14)
flashinfer/attention.py
flashinfer/jit/attention/modules.py
flashinfer/jit/utils.py
flashinfer/prefill.py
flashinfer/quantization/fp4_quantization.py
flashinfer/utils.py
include/flashinfer/attention/persistent.cuh
include/flashinfer/attention/prefill.cuh
include/flashinfer/cp_async.cuh
include/flashinfer/frag_layout_swizzle.cuh
include/flashinfer/permuted_smem.cuh
include/flashinfer/vec_dtypes.cuh
tests/attention/test_batch_attention.py
tests/attention/test_batch_prefill_kernels.py
key_block_scales,
value_block_scales,
Use the planned V/O width for packed NVFP4 outputs.
kv_block_scales only tells you the V cache is packed; it does not imply head_dim_vo == head_dim_qk. Using q.shape[-1] here breaks valid planned configurations with asymmetric QK/VO widths, because plan() accepts a separate head_dim_vo but run() no longer uses it once the packed path is enabled.
🐛 Suggested fix
```diff
- out_head_dim = q.shape[-1] if kv_block_scales is not None else v_cache.shape[-1]
+ out_head_dim = (
+     self._head_dim_vo
+     if self._cached_kv_data_type == torch.uint8
+     else v_cache.shape[-1]
+ )
```

Also persist head_dim_vo from plan() onto self so the run path can reuse the planned value.
Also applies to: 2243-2255
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@flashinfer/prefill.py` around lines 737 - 738, The code currently assumes
packed NVFP4 V/O width equals q.shape[-1] when kv_block_scales (or
key_block_scales/value_block_scales) indicate a packed V cache, which breaks
configs where head_dim_vo != head_dim_qk; update the run path to use the planned
head_dim_vo instead of q.shape[-1] for packed outputs and persist the planned
head_dim_vo from plan() onto self (e.g., self.head_dim_vo) so run() can read it;
adjust any branches that check
kv_block_scales/key_block_scales/value_block_scales to select self.head_dim_vo
as the V/O width when packed is detected.
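The intent of the suggested fix can be sketched host-side. The helper below is hypothetical (`select_out_head_dim` is not part of the flashinfer API); it illustrates why the `head_dim_vo` recorded at `plan()` time must win when the V cache is NVFP4-packed, since a packed uint8 cache halves `v_cache.shape[-1]` and `q.shape[-1]` only matches the V/O width in the symmetric case:

```python
def select_out_head_dim(q_head_dim: int, v_cache_last_dim: int,
                        planned_head_dim_vo: int, kv_is_packed: bool) -> int:
    """Pick the V/O head dim for the output buffer.

    When the KV cache is NVFP4-packed (uint8, two FP4 values per byte),
    v_cache.shape[-1] is half the logical width, so use the value planned
    via plan(head_dim_vo=...); q.shape[-1] equals the V/O width only when
    head_dim_qk == head_dim_vo.
    """
    return planned_head_dim_vo if kv_is_packed else v_cache_last_dim


# Asymmetric planned config: head_dim_qk=192, head_dim_vo=128, packed cache.
packed_v_last_dim = 128 // 2  # uint8-packed storage width
assert select_out_head_dim(192, packed_v_last_dim, 128, kv_is_packed=True) == 128
# Unpacked cache: the physical last dim is already the logical width.
assert select_out_head_dim(192, 128, 128, kv_is_packed=False) == 128
```

Using `q.shape[-1]` in the packed branch would silently return 192 in the first case, which is exactly the asymmetric-width breakage the finding describes.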
```cuda
__device__ __forceinline__ void pred_load_128b_from_64b(T* smem_ptr, const T* gmem_ptr,
                                                        bool predicate) {
#ifdef FLASHINFER_CP_ASYNC_ENABLED
  uint32_t smem_int_ptr = static_cast<uint32_t>(__cvta_generic_to_shared(smem_ptr));
  if constexpr (fill_mode == SharedMemFillMode::kFillZero) {
    int src_in_bytes = predicate ? 8 : 0;
    asm volatile("cp.async.ca.shared.global [%0], [%1], %2, %3;\n" ::"r"(smem_int_ptr),
                 "l"(gmem_ptr), "n"(8), "r"(src_in_bytes));
  } else {
    // kNoFill: only issue the copy if predicate is true; cp.async always zeros the upper 8 bytes
    asm volatile(
        "{\n"
        "  .reg .pred p;\n"
        "  setp.ne.b32 p, %0, 0;\n"
        "  @p cp.async.ca.shared.global [%1], [%2], %3, %4;\n"
        "}\n" ::"r"((int)predicate),
        "r"(smem_int_ptr), "l"(gmem_ptr), "n"(8), "n"(8));
  }
#else
  if (predicate) {
    uint64_t* smem_u64 = reinterpret_cast<uint64_t*>(smem_ptr);
    smem_u64[0] = *reinterpret_cast<const uint64_t*>(gmem_ptr);
    // smem_u64[1] = 0xcdcdcdcdcdcdcdcd;
    smem_u64[1] = 0;
  } else {
    if constexpr (fill_mode == SharedMemFillMode::kFillZero) {
      *((uint4*)smem_ptr) = make_uint4(0, 0, 0, 0);
      // *((uint4*)smem_ptr) = make_uint4(0xefef, 0xefef, 0xefef, 0xefef);
    }
  }
#endif
```
🧩 Analysis chain
🌐 Web query:
In NVIDIA PTX, for `cp.async.ca.shared.global [dst], [src], cp-size, src-size`, if `cp-size=8` and `src-size=8`, are bytes beyond the first 8 destination bytes touched or zero-filled?
💡 Result:
No, bytes beyond the first 8 destination bytes are neither touched nor zero-filled when cp-size=8 and src-size=8 in the PTX instruction cp.async.ca.shared.global [dst], [src], 8, 8. The instruction copies exactly cp-size (8) bytes from the source (up to src-size bytes available) to the destination, without affecting additional bytes in shared memory.
Citations:
- 1: https://docs.modular.com/mojo/stdlib/gpu/memory/async_copy
- 2: https://ita9naiwa.github.io/mlsys/2025/10/05/ptx-mental-model.html
- 3: https://forums.developer.nvidia.com/t/ptx-cp-async-purpose-of-ignore-src-and-prefetch-size/217746
- 4: cp async fallback NVIDIA/cutlass#761
pred_load_128b_from_64b leaves the upper half of the slot uninitialized.
cp.async.ca.shared.global [dst], [src], 8, src-size writes only 8 bytes; the remaining 8 bytes of the 16-byte shared-memory slot are left untouched and retain stale data. The fallback path explicitly zeroes the upper half with smem_u64[1] = 0, creating divergent behavior. Use cp-size=16 with src-size=8 to ensure zero-padding matches the fallback path.
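The semantics above can be checked with a small host-side model (plain Python, not CUDA, written for illustration) of `cp.async.ca.shared.global`: the instruction copies `min(cp-size, src-size)` bytes and zero-fills only up to `cp-size`, so with `cp-size=8` the upper 8 bytes of a 16-byte slot keep stale data, while `cp-size=16` with `src-size=8` zero-pads them to match the fallback:

```python
def cp_async_model(slot: bytearray, src: bytes, cp_size: int, src_size: int) -> None:
    """Model of PTX cp.async.ca.shared.global [dst], [src], cp-size, src-size:
    copy min(cp_size, src_size) bytes from src, zero-fill the remainder of the
    cp_size window; bytes beyond cp_size are never touched."""
    n = min(cp_size, src_size)
    slot[:n] = src[:n]
    slot[n:cp_size] = bytes(cp_size - n)


stale = bytearray(b"\xcd" * 16)
cp_async_model(stale, b"\x01" * 8, cp_size=8, src_size=8)
assert stale[8:] == b"\xcd" * 8  # upper half keeps stale contents

padded = bytearray(b"\xcd" * 16)
cp_async_model(padded, b"\x01" * 8, cp_size=16, src_size=8)
assert padded[8:] == bytes(8)  # upper half zero-filled, like the fallback path
```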
🐛 Suggested fix
```diff
-    int src_in_bytes = predicate ? 8 : 0;
-    asm volatile("cp.async.ca.shared.global [%0], [%1], %2, %3;\n" ::"r"(smem_int_ptr),
-                 "l"(gmem_ptr), "n"(8), "r"(src_in_bytes));
+    int src_in_bytes = predicate ? 8 : 0;
+    asm volatile("cp.async.ca.shared.global [%0], [%1], %2, %3;\n" ::"r"(smem_int_ptr),
+                 "l"(gmem_ptr), "n"(16), "r"(src_in_bytes));
```

```diff
-        "r"(smem_int_ptr), "l"(gmem_ptr), "n"(8), "n"(8));
+        "r"(smem_int_ptr), "l"(gmem_ptr), "n"(16), "n"(8));
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```cuda
__device__ __forceinline__ void pred_load_128b_from_64b(T* smem_ptr, const T* gmem_ptr,
                                                        bool predicate) {
#ifdef FLASHINFER_CP_ASYNC_ENABLED
  uint32_t smem_int_ptr = static_cast<uint32_t>(__cvta_generic_to_shared(smem_ptr));
  if constexpr (fill_mode == SharedMemFillMode::kFillZero) {
    int src_in_bytes = predicate ? 8 : 0;
    asm volatile("cp.async.ca.shared.global [%0], [%1], %2, %3;\n" ::"r"(smem_int_ptr),
                 "l"(gmem_ptr), "n"(16), "r"(src_in_bytes));
  } else {
    // kNoFill: only issue the copy if predicate is true; cp.async always zeros the upper 8 bytes
    asm volatile(
        "{\n"
        "  .reg .pred p;\n"
        "  setp.ne.b32 p, %0, 0;\n"
        "  @p cp.async.ca.shared.global [%1], [%2], %3, %4;\n"
        "}\n" ::"r"((int)predicate),
        "r"(smem_int_ptr), "l"(gmem_ptr), "n"(16), "n"(8));
  }
#else
  if (predicate) {
    uint64_t* smem_u64 = reinterpret_cast<uint64_t*>(smem_ptr);
    smem_u64[0] = *reinterpret_cast<const uint64_t*>(gmem_ptr);
    // smem_u64[1] = 0xcdcdcdcdcdcdcdcd;
    smem_u64[1] = 0;
  } else {
    if constexpr (fill_mode == SharedMemFillMode::kFillZero) {
      *((uint4*)smem_ptr) = make_uint4(0, 0, 0, 0);
      // *((uint4*)smem_ptr) = make_uint4(0xefef, 0xefef, 0xefef, 0xefef);
    }
  }
#endif
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@include/flashinfer/cp_async.cuh` around lines 192 - 223,
pred_load_128b_from_64b: ensure the cp.async path zeroes the upper 8 bytes to
match the fallback by changing the assembly copy size to 16 while keeping
src-size=8 (i.e. use cp.async.ca.shared.global with cp-size=16, src-size=8) in
both the fill-mode (kFillZero) branch and the kNoFill branch so the upper half
of the 16-byte slot is zero-padded when only 8 bytes are sourced; keep the
predicate logic and the fallback (smem_u64[1] = 0) unchanged.
[FAILED] Pipeline #46514756: 6/20 passed

/bot run
…2725)

## Summary

SM120 desktop Blackwell GPUs (RTX PRO 6000, RTX 5090) are blocked from NVFP4 MoE grouped GEMM due to hardcoded SM100-only checks.

**Changes:**
- `jit/fused_moe.py`: Add major version 12 to `supported_major_versions`
- `csrc/trtllm_fused_moe_kernel_launcher.cu`: `ICHECK_EQ(major, 10)` -> `ICHECK_GE(major, 10)`

**Benchmark** (Qwen3.5-397B on 4x RTX PRO 6000 SM120):

| Config | tok/s | Output |
|--------|-------|--------|
| compute_120f (CUDA 13.0) | 39.0 | Correct |
| compute_120a (CUDA 12.8) | 14.6 | Correct (slow fallback) |
| Marlin W4A16 | 46-49 | Correct |

**Root cause:** All TMA WS grouped GEMM autotuner tactics fail on `compute_120a`, requiring `compute_120f` (CUDA 13.0). CuTe DSL `admissible_archs` in vendored CUTLASS also needs `sm_120a`/`sm_120f` (cpasync/copy.py, tcgen05/mma.py, arch/mbar.py, etc).

Related: CUTLASS #2820, #2800; vLLM #33416, #33333; FlashInfer #2577

## Summary by CodeRabbit

* **Bug Fixes**
  * Broadened GPU architecture checks to accept additional modern compute capabilities (SM 10.x and 12.x), improving compatibility and clearer SM reporting.
  * Improved compute-capability detection and encoding, preserving user-provided architecture suffixes and more accurately generating nvcc architecture flags.
  * Expanded JIT module generation to include additional CUDA majors so fused-MoE kernels run on more recent GPUs.

Signed-off-by: Brandon Music <brandon.m.music@gmail.com>
Co-authored-by: Brandon Music <brandonmmusic-max@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Brandon Music <brandonmusic@pop-os.tail8674da.ts.net>
[FAILED] Pipeline #46572392: 6/20 passed
📌 Description
This PR adds support for NVFP4 KV input to the batch prefill and batch attention kernels. The support covers all architectures.
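For reviewers unfamiliar with the format, a reference dequantization of NVFP4 (FP4 E2M1 values grouped into blocks, each block multiplied by a stored scale factor) can be sketched in NumPy. The low-nibble-first packing order and float32 scales here are illustrative assumptions, not necessarily the kernel's exact layout:

```python
import numpy as np

# Magnitudes of the 8 non-negative FP4 E2M1 code points (bit 3 is the sign).
E2M1_LUT = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)

def dequant_nvfp4(packed: np.ndarray, scales: np.ndarray, block: int = 16) -> np.ndarray:
    """Unpack two FP4 values per uint8 (low nibble first, an assumed order)
    and apply one scale factor per `block` consecutive values."""
    nibbles = np.empty(packed.size * 2, dtype=np.uint8)
    nibbles[0::2] = packed & 0x0F
    nibbles[1::2] = packed >> 4
    sign = np.where(nibbles & 0x8, -1.0, 1.0).astype(np.float32)
    vals = sign * E2M1_LUT[nibbles & 0x7]
    return (vals.reshape(-1, block) * scales.reshape(-1, 1)).reshape(-1)


# One byte packing codes 0x2 (=1.0) and 0xF (=-6.0), one block of 2, scale 2.0:
out = dequant_nvfp4(np.array([0xF2], dtype=np.uint8),
                    np.array([2.0], dtype=np.float32), block=2)
assert out.tolist() == [2.0, -12.0]
```

This is the logical inverse of the quantization path; the kernels in this PR apply the per-block scale during attention rather than materializing a dequantized cache.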
🔍 Related Issues
🚀 Pull Request Checklist
Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.
✅ Pre-commit Checks
- I have installed `pre-commit` by running `pip install pre-commit` (or used your preferred method).
- I have installed the hooks with `pre-commit install`.
- I have run the hooks manually with `pre-commit run --all-files` and fixed any reported issues.

🧪 Tests

- Tests have been added or updated as needed.
- All tests are passing (`unittest`, etc.).

Reviewer Notes
Summary by CodeRabbit
Release Notes
New Features
Tests