
Fix bmm_fp8 cublasLt handle usage in autotuned cublas runner #26381 (#2808)

Open
baonudesifeizhai wants to merge 1 commit into flashinfer-ai:main from baonudesifeizhai:fixbmm_fp8cublas

Conversation

@baonudesifeizhai
Contributor

@baonudesifeizhai baonudesifeizhai commented Mar 17, 2026

📌 Description

🔍 Related Issues

vllm-project/vllm#26381
Root cause:
torch.cuda.current_blas_handle() returns cublasHandle_t, but bmm_fp8 reinterpreted it as cublasLtHandle_t before calling cublasLt APIs. This can fail in autotune/profiling paths on Blackwell.

On the vLLM side:
before: https://paste.ubuntu.com/p/npS2tkZY2c/
after: https://paste.ubuntu.com/p/c6Ys69PqvR/

And the command used to verify the fix:

 CUDA_VISIBLE_DEVICES=4,5,6,7 \
VLLM_DISABLE_COMPILE_CACHE=1 \
TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 \
python examples/basic/offline_inference/generate.py \
  --model redhatai/meta-Llama-3.1-8B-FP8 \
  --tensor-parallel-size 4 \
  --kv-cache-dtype fp8 \
  --trust-remote-code \
  --max-model-len 4096 \
  --max-tokens 32 \
  --temperature 0

Output: https://paste.ubuntu.com/p/rPYcXQSwPC/


🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

  • New Features

    • Enhanced segment GEMM operations with additional layout and configuration parameters for improved flexibility.
  • Refactor

    • Simplified FP8 batch matrix multiplication API by removing external handle parameter, reducing function call complexity.
    • Improved internal resource management with automatic handle lifecycle management.

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a critical issue where the bmm_fp8 operation incorrectly handled cuBLASLt handles, causing stability problems during autotuning and profiling, particularly on Blackwell GPUs. The core of the fix involves refactoring the handle management to internally create and cache cublasLtHandle_t instances per device and thread, ensuring the correct handle type is always used. This change enhances the reliability and correctness of FP8 GEMM operations without requiring external handle management.

Highlights

  • Internalized cuBLASLt Handle Management: The cublasLtHandle_t is now created and managed internally within csrc/bmm_fp8.cu using a thread-local cache, eliminating the need to pass it as an argument.
  • Removed cublas_handle Parameter: The bmm_fp8 function signature in C++ and its Python binding have been updated to remove the int64_t cublas_handle parameter, simplifying its interface.
  • Improved Robustness for Autotuning: This change addresses a root cause where torch.cuda.current_blas_handle() returned a cublasHandle_t which was incorrectly reinterpreted as cublasLtHandle_t, leading to failures in autotune/profiling paths, especially on Blackwell architectures.


Changelog
  • csrc/bmm_fp8.cu
    • Implemented a thread-local cublasLtHandle_t cache and creation mechanism.
    • Modified the bmm_fp8 function signature to remove the cublas_handle parameter.
    • Updated the bmm_fp8 implementation to retrieve the cublasLtHandle_t from the newly introduced internal cache.
  • csrc/flashinfer_gemm_binding.cu
    • Updated the declaration of the bmm_fp8 function to match the new signature, removing the cublas_handle parameter.
  • flashinfer/gemm/gemm_base.py
    • Removed the acquisition of torch.cuda.current_blas_handle().
    • Updated the call to module.bmm_fp8 to no longer pass the cublas_handle argument.
Activity
  • The author identified a root cause where torch.cuda.current_blas_handle() was misused, leading to cublasLtHandle_t reinterpretation issues.
  • The author provided links to 'before' and 'after' states on the vllm side, demonstrating the impact and resolution of the issue.
  • A specific CUDA command was provided to reproduce and verify the fix, indicating thorough testing by the author.

@coderabbitai
Contributor

coderabbitai bot commented Mar 17, 2026

📝 Walkthrough

This pull request refactors cuBLASLt handle management by introducing thread-local caching per device in the GPU kernel code. The bmm_fp8 function signature is simplified by removing the explicit cublas_handle parameter, and corresponding updates are made to the binding layer and Python frontend. The CutlassSegmentGEMM function signature is also expanded with additional parameters for enhanced functionality.

Changes

  • cuBLASLt Handle Caching — csrc/bmm_fp8.cu: Introduces a ThreadLocalCublasLtHandles thread-local cache and a get_cublaslt_handle() helper function to manage cuBLASLt handles per device automatically. Updates bmm_fp8() to obtain handles internally instead of accepting them as parameters.
  • Function Signature Updates — csrc/flashinfer_gemm_binding.cu: Removes the int64_t cublas_handle parameter from the bmm_fp8() signature. Extends CutlassSegmentGEMM() with three new parameters: TensorView y_ld, TensorView empty_x_data, and bool weight_column_major.
  • Python Frontend Alignment — flashinfer/gemm/gemm_base.py: Removes explicit cuBLAS handle acquisition and passing in the FP8 GEMM forward path. Calls module.bmm_fp8() without the handle parameter, relying on internal handle management.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

op: comm

Suggested reviewers

  • nvmbreughe
  • djmmoss
  • yongwww
  • bkryu
  • yzh119

Poem

🐰 Thread-local caches hop with glee,
No more handles passed so carelessly,
The kernels manage their own way,
With automatic cleanup each day! ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)

  • Description check — ❓ Inconclusive: The PR description clearly identifies the root cause, provides the related issue link, and includes reproduction steps, but the Description section is empty per the template. Resolution: fill in the Description section explaining what the PR does and why these changes are needed to fix the cublasLt handle issue.

✅ Passed checks (2 passed)

  • Title check — ✅ Passed: The title directly reflects the main change: fixing bmm_fp8 cublasLt handle usage in the autotuned cublas runner.
  • Docstring Coverage — ✅ Passed: No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.


Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request correctly addresses a bug where a cublasHandle_t was being incorrectly reinterpreted as a cublasLtHandle_t. The fix introduces a thread-local cache to manage cublasLtHandle_t instances, which is a robust and appropriate solution. The implementation is clean and effectively resolves the issue. I have one suggestion to enhance error reporting during resource cleanup.

Comment on lines +29 to +33
for (auto& [_, handle] : handles) {
if (handle != nullptr) {
(void)cublasLtDestroy(handle);
}
}

medium

The destructor for ThreadLocalCublasLtHandles silently ignores the return status of cublasLtDestroy. While it's correct not to throw an exception from a destructor, logging a failure to stderr would provide valuable diagnostic information if resource cleanup fails. This can happen during program shutdown when the CUDA context might no longer be valid, and logging would help debug potential resource leaks.

    for (auto& [_, handle] : handles) {
      if (handle != nullptr) {
        if (cublasStatus_t status = cublasLtDestroy(handle); status != CUBLAS_STATUS_SUCCESS) {
          // Cannot throw in a destructor, but logging to stderr is helpful for debugging.
          std::cerr << "[FlashInfer] Warning: cublasLtDestroy failed in destructor with status: "
                    << cublasGetStatusString(status) << std::endl;
        }
      }
    }

@yzh119
Collaborator

yzh119 commented Mar 18, 2026

/bot run

@yzh119 yzh119 added the run-ci label Mar 18, 2026
@flashinfer-bot
Collaborator

GitLab MR !427 has been created, and the CI pipeline #46412613 is currently running. I'll report back once the pipeline job completes.

Collaborator

@yzh119 yzh119 left a comment


Hi @baonudesifeizhai, thanks for the fix, it looks good.

torch.cuda.current_blas_handle() returns cublasHandle_t, but bmm_fp8 reinterpreted it as cublasLtHandle_t before calling cublasLt APIs

I noticed that cublasLtHandle was also used in pytorch, do you know if it's exposed in python (https://github.com/pytorch/pytorch/blob/98849e6ad451547ea09a958757e6bb422996e65f/aten/src/ATen/cuda/CublasHandlePool.cpp#L417)?

@baonudesifeizhai
Contributor Author

baonudesifeizhai commented Mar 18, 2026

Python only exposes torch.cuda.current_blas_handle(), but that loses the type context — at the Python layer we can only get an integer handle for the current state.

Maybe we could use at::cuda::getCurrentCUDABlasHandle(), but I want to stay decoupled from the PyTorch side...


@flashinfer-bot
Collaborator

[CANCELING] Pipeline #46412613: canceled

@yzh119
Copy link
Collaborator

yzh119 commented Mar 19, 2026

/bot run

@flashinfer-bot
Collaborator

GitLab MR !427 has been created, and the CI pipeline #46557796 is currently running. I'll report back once the pipeline job completes.

@flashinfer-bot
Collaborator

[FAILED] Pipeline #46557796: 13/20 passed
