
Conversation

@samuellees (Contributor)

📌 Description

This PR implements an NVFP4 KV cache for SM100. It is currently a draft PR for testing.

🔍 Related Issues

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used my preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes


coderabbitai bot commented Jan 16, 2026

Important

Review skipped: draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.


Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.



@gemini-code-assist (Contributor)

Summary of Changes

Hello @samuellees, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly upgrades the FlashInfer library by integrating native NVFP4 KV cache support for SM100 GPUs. The changes span the C++ CUDA kernels and Python interfaces, enabling more efficient memory utilization and faster attention computation. Key improvements include a sophisticated kernel selection mechanism that adapts to various generation scenarios, precise handling of FP4 data types and their scaling factors, and refined memory access patterns through updated TMA descriptors. These enhancements aim to boost the performance and scalability of large language models on NVIDIA's latest architecture.

Highlights

  • NVFP4 KV Cache Support: Introduced comprehensive support for the NVFP4 (NVIDIA FP4) key-value (KV) cache on SM100 GPUs, including new parameters for block scaling factors in kernel launchers and Python bindings (a dequantization sketch follows this list).
  • Dynamic Kernel Selection and Optimization: Implemented advanced kernel selection heuristics for generation tasks (MLA and GQA), dynamically choosing optimal kernels based on parameters like tileSizeQ and numTokensPerCtaQ. This includes a cost model for GQA generation to select the best tileSizeQ (see the tile-size sketch after this list).
  • CUDA Kernel Parameter Refinements: Modified core CUDA kernel parameter structures (KernelParams, TllmGenSelectKernelParams) to accommodate FP4 data types, new stride calculations, and optimized integer division via FastModDivInt32 (see the reciprocal-division sketch after this list).
  • TMA Descriptor Updates for FP4: Updated TMA (Tensor Memory Accelerator) descriptor building logic to correctly handle FP4 data, including the dedicated data format (CU_TENSOR_MAP_DATA_TYPE_16U4_ALIGN16B), swizzling, and precise bit-level calculations for strides and memory access (see the descriptor sketch after this list).
  • SM Compatibility and Kernel Loading: Enhanced kernel loading and selection to intelligently handle SM (Streaming Multiprocessor) compatibility, preferring specific SM versions over family versions (e.g., SM100 vs SM100f) to ensure optimal kernel execution.
  • Generation Mask Type Adjustment: Changed the default attention mask type for generation kernels from Dense to Causal, with an accompanying explanation that this better reflects the processing of a single token per CTA.
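
To make the block-scaling scheme concrete, here is a minimal host-side sketch of NVFP4 dequantization. It assumes the usual packing of two 4-bit e2m1 values per byte and one scale per 16-element block; scales are taken as float here, although the cache itself stores them in FP8 (e4m3). Names and layout are illustrative, not FlashInfer's actual code.

#include <cstdint>
#include <cstddef>

// E2M1 (FP4) magnitudes: 1 sign bit, 2 exponent bits, 1 mantissa bit.
static const float kE2M1Lut[8] = {0.0f, 0.5f, 1.0f, 1.5f, 2.0f, 3.0f, 4.0f, 6.0f};

// Dequantize a packed NVFP4 buffer: two 4-bit values per byte (low nibble
// holds the even element), one scale per block of 16 elements.
void dequantNvfp4(const uint8_t* packed, const float* blockScales,
                  float* out, size_t numElems) {
  for (size_t i = 0; i < numElems; ++i) {
    uint8_t nib = (i & 1) ? (packed[i / 2] >> 4) : (packed[i / 2] & 0xF);
    float mag = kE2M1Lut[nib & 0x7];
    out[i] = ((nib & 0x8) ? -mag : mag) * blockScales[i / 16];
  }
}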
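The cost model itself is not reproduced in this summary. The following hypothetical sketch shows the general shape of such a heuristic: trade padding waste inside the Q tile against keeping enough CTAs in flight to fill the device. Every name here (selectTileSizeQ, the candidate list, the weighting) is illustrative rather than FlashInfer's actual logic.

#include <limits>

// Pick a tileSizeQ for GQA generation: in GQA, the head group of each KV
// head is packed into the Q-tile rows, so the row count is tokens * groupSize.
int selectTileSizeQ(int numTokensQ, int numHeadsQPerKv, int numSms) {
  const int candidates[] = {8, 16, 32, 64, 128};
  int best = candidates[0];
  double bestCost = std::numeric_limits<double>::max();
  for (int tileSizeQ : candidates) {
    int rowsQ = numTokensQ * numHeadsQPerKv;
    int numCtas = (rowsQ + tileSizeQ - 1) / tileSizeQ;
    // Fraction of the last tile that is padding (wasted math).
    double waste = double(numCtas * tileSizeQ - rowsQ) / tileSizeQ;
    // Penalize configurations that cannot occupy every SM.
    double occupancyPenalty =
        numCtas < numSms ? double(numSms - numCtas) / numSms : 0.0;
    double cost = waste + occupancyPenalty;
    if (cost < bestCost) { bestCost = cost; best = tileSizeQ; }
  }
  return best;
}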
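FastModDivInt32's implementation is likewise not shown in this summary; the sketch below illustrates the standard technique such helpers rely on, namely Lemire-style division by a precomputed 64-bit reciprocal, which replaces hardware div/mod in hot loops. It assumes a divisor of at least 2 and a compiler providing __uint128_t; FlashInfer's actual struct may differ.

#include <cstdint>

// Fast div/mod by a runtime-invariant 32-bit divisor (must be >= 2).
// magic = ceil(2^64 / divisor); the 128-bit high multiply recovers the
// quotient exactly for every 32-bit numerator.
struct FastModDivInt32 {
  uint64_t magic;
  uint32_t divisor;
  explicit FastModDivInt32(uint32_t d) : magic(~0ull / d + 1), divisor(d) {}

  uint32_t div(uint32_t n) const {  // floor(n / divisor)
    return uint32_t((__uint128_t(magic) * n) >> 64);
  }
  uint32_t mod(uint32_t n) const {  // n % divisor
    uint64_t lowbits = magic * n;   // wrapping 64-bit multiply
    return uint32_t((__uint128_t(lowbits) * divisor) >> 64);
  }
};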
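Finally, a sketch of what encoding a TMA descriptor for packed FP4 data looks like through the CUDA driver API. The rank, extents, strides, and swizzle mode below are placeholders: the exact semantics of the packed-U4 data types (alignment requirements, how the inner extent is counted) follow the CUDA documentation and the PR's builder logic, which this sketch does not attempt to reproduce.

#include <cuda.h>

// Build a 2D tiled TMA descriptor over packed FP4 data. All sizes are
// illustrative; real code derives them from the KV-cache layout and must
// respect the 16-byte alignment implied by the ALIGN16B data type.
CUtensorMap buildFp4TmaDesc(void* globalPtr, cuuint64_t rows,
                            cuuint64_t rowStrideBytes) {
  CUtensorMap desc;
  cuuint64_t globalDim[2] = {/*inner 4-bit elems=*/128, rows};
  cuuint64_t globalStrides[1] = {rowStrideBytes};  // stride of dim 1, in bytes
  cuuint32_t boxDim[2] = {128, 64};                // tile fetched per TMA op
  cuuint32_t elementStrides[2] = {1, 1};
  CUresult status = cuTensorMapEncodeTiled(
      &desc, CU_TENSOR_MAP_DATA_TYPE_16U4_ALIGN16B,
      /*tensorRank=*/2, globalPtr, globalDim, globalStrides, boxDim,
      elementStrides, CU_TENSOR_MAP_INTERLEAVE_NONE,
      CU_TENSOR_MAP_SWIZZLE_128B, CU_TENSOR_MAP_L2_PROMOTION_L2_128B,
      CU_TENSOR_MAP_FLOAT_OOB_FILL_NONE);
  (void)status;  // production code must check status == CUDA_SUCCESS
  return desc;
}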


@gemini-code-assist bot left a comment


Code Review

This pull request implements an NVFP4 KV cache for SM100, introducing changes across CUDA kernel launchers, the kernel implementation, and the Python bindings. The core logic for handling FP4 data types, including scale factors and packed data, appears to be correctly implemented. The kernel selection logic in fmhaKernels.cuh has been significantly refactored for better heuristics and maintainability. However, I've identified a few issues: several leftover debugging statements (fflush(stdout)) should be removed, and, more critically, a hardcoded value for mSumOfSeqLensKv in kernelParams.h appears to be a bug and needs correction.

Comment on lines +929 to +930
// params.mSumOfSeqLensKv = options.mSumOfSeqLensKv;
params.mSumOfSeqLensKv = 64;

Severity: high

The mSumOfSeqLensKv parameter is hardcoded to 64, while the correct assignment from options.mSumOfSeqLensKv is commented out. This will likely cause incorrect behavior for cases where the sum of KV sequence lengths is not 64. Please restore the original logic.

    params.mSumOfSeqLensKv = options.mSumOfSeqLensKv;

int64_t workspace_size, Optional<TensorView> attention_sinks,
Optional<TensorView> cum_seq_lens_q, Optional<TensorView> key_block_scales,
Optional<TensorView> value_block_scales) {
fflush(stdout);

Severity: medium

This fflush(stdout) call appears to be a leftover debugging statement. It should be removed from production code to avoid potential performance impacts and unnecessary console output.

int numCtasX = numCtasPerSeqQ;
// printf("[numCtasX trace] Initial: numCtasX = %d (numCtasPerSeqQ = %d)\n", numCtasX,
// numCtasPerSeqQ);
fflush(stdout);

Severity: medium

This fflush(stdout) call seems to be a debugging artifact and should be removed from the final code.

// printf("[numCtasX trace] After multiCtasKv: numCtasX = %d (was %d, multiplied by
// numCtasPerSeqKv=%d), maxNumCtasPerSeqKv= %d\n", numCtasX, oldNumCtasX, numCtasPerSeqKv,
// maxNumCtasPerSeqKv);
fflush(stdout);

Severity: medium

Another fflush(stdout) call that looks like a leftover from debugging. Please remove it.
