
Conversation

@samuellees (Contributor)

📌 Description

This PR implements an NVFP4 KV cache for SM100. It is currently a draft PR for testing.

🔍 Related Issues

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used my preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes


coderabbitai bot commented Jan 16, 2026

Important

Review skipped: draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.


Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.



@gemini-code-assist (Contributor)

Summary of Changes

Hello @samuellees, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly upgrades the FlashInfer library by integrating native NVFP4 KV cache support for SM100 GPUs. The changes span the C++ CUDA kernels and Python interfaces, enabling more efficient memory utilization and faster attention computation. Key improvements include a sophisticated kernel selection mechanism that adapts to various generation scenarios, precise handling of FP4 data types and their scaling factors, and refined memory access patterns through updated TMA descriptors. These enhancements aim to boost the performance and scalability of large language models on NVIDIA's latest architecture.

Highlights

  • NVFP4 KV Cache Support: Introduced comprehensive support for the NVFP4 (NVIDIA FP4) key-value (KV) cache on SM100 GPUs, including new parameters for block scaling factors in kernel launchers and Python bindings (a dequantization sketch follows this list).
  • Dynamic Kernel Selection and Optimization: Implemented advanced kernel selection heuristics for generation tasks (MLA and GQA), dynamically choosing optimal kernels based on parameters like tileSizeQ and numTokensPerCtaQ. This includes a cost model for GQA generation to select the best tileSizeQ (see the tile-size sketch after this list).
  • CUDA Kernel Parameter Refinements: Modified core CUDA kernel parameter structures (KernelParams, TllmGenSelectKernelParams) to accommodate FP4 data types, new stride calculations, and optimized integer division via FastModDivInt32 (see the reciprocal-division sketch after this list).
  • TMA Descriptor Updates for FP4: Updated TMA (Tensor Memory Accelerator) descriptor building logic to correctly handle FP4 data, including the dedicated data format (CU_TENSOR_MAP_DATA_TYPE_16U4_ALIGN16B), swizzling, and precise bit-level calculations for strides and memory access (see the descriptor sketch after this list).
  • SM Compatibility and Kernel Loading: Enhanced kernel loading and selection to intelligently handle SM (Streaming Multiprocessor) compatibility, preferring specific SM versions over family versions (e.g., SM100 vs SM100f) to ensure optimal kernel execution.
  • Generation Mask Type Adjustment: Changed the default attention mask type for generation kernels from Dense to Causal, with an accompanying explanation that this better reflects the processing of a single token per CTA.
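
To make the block-scaling scheme concrete, here is a minimal host-side sketch of NVFP4 dequantization. It assumes the usual packing of two 4-bit e2m1 values per byte and one scale per 16-element block; scales are taken as float here, although the cache itself stores them in FP8 (e4m3). Names and layout are illustrative, not FlashInfer's actual code.

#include <cstdint>
#include <cstddef>

// E2M1 (FP4) magnitudes: 1 sign bit, 2 exponent bits, 1 mantissa bit.
static const float kE2M1Lut[8] = {0.0f, 0.5f, 1.0f, 1.5f, 2.0f, 3.0f, 4.0f, 6.0f};

// Dequantize a packed NVFP4 buffer: two 4-bit values per byte (low nibble
// holds the even element), one scale per block of 16 elements.
void dequantNvfp4(const uint8_t* packed, const float* blockScales,
                  float* out, size_t numElems) {
  for (size_t i = 0; i < numElems; ++i) {
    uint8_t nib = (i & 1) ? (packed[i / 2] >> 4) : (packed[i / 2] & 0xF);
    float mag = kE2M1Lut[nib & 0x7];
    out[i] = ((nib & 0x8) ? -mag : mag) * blockScales[i / 16];
  }
}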
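The cost model itself is not reproduced in this summary. The following hypothetical sketch shows the general shape of such a heuristic: trade padding waste inside the Q tile against keeping enough CTAs in flight to fill the device. Every name here (selectTileSizeQ, the candidate list, the weighting) is illustrative rather than FlashInfer's actual logic.

#include <limits>

// Pick a tileSizeQ for GQA generation: in GQA, the head group of each KV
// head is packed into the Q-tile rows, so the row count is tokens * groupSize.
int selectTileSizeQ(int numTokensQ, int numHeadsQPerKv, int numSms) {
  const int candidates[] = {8, 16, 32, 64, 128};
  int best = candidates[0];
  double bestCost = std::numeric_limits<double>::max();
  for (int tileSizeQ : candidates) {
    int rowsQ = numTokensQ * numHeadsQPerKv;
    int numCtas = (rowsQ + tileSizeQ - 1) / tileSizeQ;
    // Fraction of the last tile that is padding (wasted math).
    double waste = double(numCtas * tileSizeQ - rowsQ) / tileSizeQ;
    // Penalize configurations that cannot occupy every SM.
    double occupancyPenalty =
        numCtas < numSms ? double(numSms - numCtas) / numSms : 0.0;
    double cost = waste + occupancyPenalty;
    if (cost < bestCost) { bestCost = cost; best = tileSizeQ; }
  }
  return best;
}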
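FastModDivInt32's implementation is likewise not shown in this summary; the sketch below illustrates the standard technique such helpers rely on, namely Lemire-style division by a precomputed 64-bit reciprocal, which replaces hardware div/mod in hot loops. It assumes a divisor of at least 2 and a compiler providing __uint128_t; FlashInfer's actual struct may differ.

#include <cstdint>

// Fast div/mod by a runtime-invariant 32-bit divisor (must be >= 2).
// magic = ceil(2^64 / divisor); the 128-bit high multiply recovers the
// quotient exactly for every 32-bit numerator.
struct FastModDivInt32 {
  uint64_t magic;
  uint32_t divisor;
  explicit FastModDivInt32(uint32_t d) : magic(~0ull / d + 1), divisor(d) {}

  uint32_t div(uint32_t n) const {  // floor(n / divisor)
    return uint32_t((__uint128_t(magic) * n) >> 64);
  }
  uint32_t mod(uint32_t n) const {  // n % divisor
    uint64_t lowbits = magic * n;   // wrapping 64-bit multiply
    return uint32_t((__uint128_t(lowbits) * divisor) >> 64);
  }
};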
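Finally, a sketch of what encoding a TMA descriptor for packed FP4 data looks like through the CUDA driver API. The rank, extents, strides, and swizzle mode below are placeholders: the exact semantics of the packed-U4 data types (alignment requirements, how the inner extent is counted) follow the CUDA documentation and the PR's builder logic, which this sketch does not attempt to reproduce.

#include <cuda.h>

// Build a 2D tiled TMA descriptor over packed FP4 data. All sizes are
// illustrative; real code derives them from the KV-cache layout and must
// respect the 16-byte alignment implied by the ALIGN16B data type.
CUtensorMap buildFp4TmaDesc(void* globalPtr, cuuint64_t rows,
                            cuuint64_t rowStrideBytes) {
  CUtensorMap desc;
  cuuint64_t globalDim[2] = {/*inner 4-bit elems=*/128, rows};
  cuuint64_t globalStrides[1] = {rowStrideBytes};  // stride of dim 1, in bytes
  cuuint32_t boxDim[2] = {128, 64};                // tile fetched per TMA op
  cuuint32_t elementStrides[2] = {1, 1};
  CUresult status = cuTensorMapEncodeTiled(
      &desc, CU_TENSOR_MAP_DATA_TYPE_16U4_ALIGN16B,
      /*tensorRank=*/2, globalPtr, globalDim, globalStrides, boxDim,
      elementStrides, CU_TENSOR_MAP_INTERLEAVE_NONE,
      CU_TENSOR_MAP_SWIZZLE_128B, CU_TENSOR_MAP_L2_PROMOTION_L2_128B,
      CU_TENSOR_MAP_FLOAT_OOB_FILL_NONE);
  (void)status;  // production code must check status == CUDA_SUCCESS
  return desc;
}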


@gemini-code-assist bot left a comment


Code Review

This pull request implements an NVFP4 KV cache for SM100, introducing changes across CUDA kernel launchers, the kernel implementation, and the Python bindings. The core logic for handling FP4 data types, including scale factors and packed data, appears to be correctly implemented. The kernel selection logic in fmhaKernels.cuh has been significantly refactored for better heuristics and maintainability. However, I've identified a few issues: several leftover debugging statements (fflush(stdout)) should be removed, and, more critically, a hardcoded value for mSumOfSeqLensKv in kernelParams.h appears to be a bug and needs correction.

Comment on lines +929 to +930
// params.mSumOfSeqLensKv = options.mSumOfSeqLensKv;
params.mSumOfSeqLensKv = 64;

Severity: high

The mSumOfSeqLensKv parameter is hardcoded to 64, while the correct assignment from options.mSumOfSeqLensKv is commented out. This will likely cause incorrect behavior for cases where the sum of KV sequence lengths is not 64. Please restore the original logic.

    params.mSumOfSeqLensKv = options.mSumOfSeqLensKv;

int64_t workspace_size, Optional<TensorView> attention_sinks,
Optional<TensorView> cum_seq_lens_q, Optional<TensorView> key_block_scales,
Optional<TensorView> value_block_scales) {
fflush(stdout);

Severity: medium

This fflush(stdout) call appears to be a leftover debugging statement. It should be removed from production code to avoid potential performance impacts and unnecessary console output.

int numCtasX = numCtasPerSeqQ;
// printf("[numCtasX trace] Initial: numCtasX = %d (numCtasPerSeqQ = %d)\n", numCtasX,
// numCtasPerSeqQ);
fflush(stdout);

Severity: medium

This fflush(stdout) call seems to be a debugging artifact and should be removed from the final code.

// printf("[numCtasX trace] After multiCtasKv: numCtasX = %d (was %d, multiplied by
// numCtasPerSeqKv=%d), maxNumCtasPerSeqKv= %d\n", numCtasX, oldNumCtasX, numCtasPerSeqKv,
// maxNumCtasPerSeqKv);
fflush(stdout);

Severity: medium

Another fflush(stdout) call that looks like a leftover from debugging. Please remove it.
