
Fix bmm_fp8 cublasLt handle usage in autotuned cublas runner #26381 (#2808)

Open
baonudesifeizhai wants to merge 1 commit into flashinfer-ai:main from baonudesifeizhai:fixbmm_fp8cublas

Conversation

@baonudesifeizhai
Contributor

@baonudesifeizhai baonudesifeizhai commented Mar 17, 2026

📌 Description

🔍 Related Issues

vllm-project/vllm#26381
Root cause:
torch.cuda.current_blas_handle() returns cublasHandle_t, but bmm_fp8 reinterpreted it as cublasLtHandle_t before calling cublasLt APIs. This can fail in autotune/profiling paths on Blackwell.

On the vLLM side:
before: https://paste.ubuntu.com/p/npS2tkZY2c/
after: https://paste.ubuntu.com/p/c6Ys69PqvR/

And the command used to verify the fix:

 CUDA_VISIBLE_DEVICES=4,5,6,7 \
VLLM_DISABLE_COMPILE_CACHE=1 \
TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 \
python examples/basic/offline_inference/generate.py \
  --model redhatai/meta-Llama-3.1-8B-FP8 \
  --tensor-parallel-size 4 \
  --kv-cache-dtype fp8 \
  --trust-remote-code \
  --max-model-len 4096 \
  --max-tokens 32 \
  --temperature 0

Output: https://paste.ubuntu.com/p/rPYcXQSwPC/


🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

  • New Features

    • Enhanced segment GEMM operations with additional layout and configuration parameters for improved flexibility.
  • Refactor

    • Simplified FP8 batch matrix multiplication API by removing external handle parameter, reducing function call complexity.
    • Improved internal resource management with automatic handle lifecycle management.

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a critical issue where the bmm_fp8 operation incorrectly handled cuBLASLt handles, causing stability problems during autotuning and profiling, particularly on Blackwell GPUs. The core of the fix involves refactoring the handle management to internally create and cache cublasLtHandle_t instances per device and thread, ensuring the correct handle type is always used. This change enhances the reliability and correctness of FP8 GEMM operations without requiring external handle management.

Highlights

  • Internalized cuBLASLt Handle Management: The cublasLtHandle_t is now created and managed internally within csrc/bmm_fp8.cu using a thread-local cache, eliminating the need to pass it as an argument.
  • Removed cublas_handle Parameter: The bmm_fp8 function signature in C++ and its Python binding have been updated to remove the int64_t cublas_handle parameter, simplifying its interface.
  • Improved Robustness for Autotuning: This change addresses a root cause where torch.cuda.current_blas_handle() returned a cublasHandle_t which was incorrectly reinterpreted as cublasLtHandle_t, leading to failures in autotune/profiling paths, especially on Blackwell architectures.


Changelog
  • csrc/bmm_fp8.cu
    • Implemented a thread-local cublasLtHandle_t cache and creation mechanism.
    • Modified the bmm_fp8 function signature to remove the cublas_handle parameter.
    • Updated the bmm_fp8 implementation to retrieve the cublasLtHandle_t from the newly introduced internal cache.
  • csrc/flashinfer_gemm_binding.cu
    • Updated the declaration of the bmm_fp8 function to match the new signature, removing the cublas_handle parameter.
  • flashinfer/gemm/gemm_base.py
    • Removed the acquisition of torch.cuda.current_blas_handle().
    • Updated the call to module.bmm_fp8 to no longer pass the cublas_handle argument.
Activity
  • The author identified a root cause where torch.cuda.current_blas_handle() was misused, leading to cublasLtHandle_t reinterpretation issues.
  • The author provided links to 'before' and 'after' states on the vllm side, demonstrating the impact and resolution of the issue.
  • A specific CUDA command was provided to reproduce and verify the fix, indicating thorough testing by the author.

@coderabbitai
Contributor

coderabbitai bot commented Mar 17, 2026

📝 Walkthrough

This pull request refactors cuBLASLt handle management by introducing thread-local caching per device in the GPU kernel code. The bmm_fp8 function signature is simplified by removing the explicit cublas_handle parameter, and corresponding updates are made to the binding layer and Python frontend. The CutlassSegmentGEMM function signature is also expanded with additional parameters for enhanced functionality.

Changes

  • cuBLASLt Handle Caching — csrc/bmm_fp8.cu: Introduces a ThreadLocalCublasLtHandles thread-local cache and a get_cublaslt_handle() helper function to manage cuBLASLt handles per device automatically. Updates bmm_fp8() to obtain handles internally instead of accepting them as parameters.
  • Function Signature Updates — csrc/flashinfer_gemm_binding.cu: Removes the int64_t cublas_handle parameter from the bmm_fp8() signature. Extends CutlassSegmentGEMM() with three new parameters: TensorView y_ld, TensorView empty_x_data, and bool weight_column_major.
  • Python Frontend Alignment — flashinfer/gemm/gemm_base.py: Removes explicit cuBLAS handle acquisition and passing in the FP8 GEMM forward path. Calls module.bmm_fp8() without the handle parameter, relying on internal handle management.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

op: comm

Suggested reviewers

  • nvmbreughe
  • djmmoss
  • yongwww
  • bkryu
  • yzh119

Poem

🐰 Thread-local caches hop with glee,
No more handles passed so carelessly,
The kernels manage their own way,
With automatic cleanup each day! ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)

  • Description check — ❓ Inconclusive: The PR description clearly identifies the root cause, provides the related issue link, and includes reproduction steps, but the Description section is empty per the template. Resolution: fill in the Description section explaining what the PR does and why these changes are needed to fix the cublasLt handle issue.

✅ Passed checks (2 passed)

  • Title check — ✅ Passed: The title directly reflects the main change: fixing bmm_fp8 cublasLt handle usage in the autotuned cublas runner.
  • Docstring Coverage — ✅ Passed: No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.


Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request correctly addresses a bug where a cublasHandle_t was being incorrectly reinterpreted as a cublasLtHandle_t. The fix introduces a thread-local cache to manage cublasLtHandle_t instances, which is a robust and appropriate solution. The implementation is clean and effectively resolves the issue. I have one suggestion to enhance error reporting during resource cleanup.

Comment on lines +29 to +33
for (auto& [_, handle] : handles) {
if (handle != nullptr) {
(void)cublasLtDestroy(handle);
}
}

medium

The destructor for ThreadLocalCublasLtHandles silently ignores the return status of cublasLtDestroy. While it's correct not to throw an exception from a destructor, logging a failure to stderr would provide valuable diagnostic information if resource cleanup fails. This can happen during program shutdown when the CUDA context might no longer be valid, and logging would help debug potential resource leaks.

    for (auto& [_, handle] : handles) {
      if (handle != nullptr) {
        if (cublasStatus_t status = cublasLtDestroy(handle); status != CUBLAS_STATUS_SUCCESS) {
          // Cannot throw in a destructor, but logging to stderr is helpful for debugging.
          std::cerr << "[FlashInfer] Warning: cublasLtDestroy failed in destructor with status: "
                    << cublasGetStatusString(status) << std::endl;
        }
      }
    }

@yzh119
Collaborator

yzh119 commented Mar 18, 2026

/bot run

@yzh119 yzh119 added the run-ci label Mar 18, 2026
@flashinfer-bot
Collaborator

GitLab MR !427 has been created, and the CI pipeline #46412613 is currently running. I'll report back once the pipeline job completes.

Collaborator

@yzh119 yzh119 left a comment


Hi @baonudesifeizhai, thanks for the fix, it looks good.

torch.cuda.current_blas_handle() returns cublasHandle_t, but bmm_fp8 reinterpreted it as cublasLtHandle_t before calling cublasLt APIs

I noticed that cublasLtHandle was also used in pytorch, do you know if it's exposed in python (https://github.com/pytorch/pytorch/blob/98849e6ad451547ea09a958757e6bb422996e65f/aten/src/ATen/cuda/CublasHandlePool.cpp#L417)?

@baonudesifeizhai
Contributor Author

baonudesifeizhai commented Mar 18, 2026

Python only exposes torch.cuda.current_blas_handle(), but that loses the type context — at the Python layer we can only get an integer handle for the current state.

Maybe we could use at::cuda::getCurrentCUDABlasHandle(), but I want to stay decoupled from the PyTorch side...


@flashinfer-bot
Collaborator

[CANCELING] Pipeline #46412613: canceled

@yzh119
Copy link
Collaborator

yzh119 commented Mar 19, 2026

/bot run

@flashinfer-bot
Collaborator

GitLab MR !427 has been created, and the CI pipeline #46557796 is currently running. I'll report back once the pipeline job completes.

@flashinfer-bot
Collaborator

[FAILED] Pipeline #46557796: 13/20 passed
