
gm/sm120-fp8 #15

Merged
gmorgachev merged 4 commits into gm/poc-layer-exp from gm/sm120-fp8
Feb 16, 2026

Conversation


@gmorgachev gmorgachev commented Feb 16, 2026


Implement SM120 blockwise FP8 scaled matrix multiplication kernels using
CUTLASS v4.x, while keeping CUTLASS v3.9.2 for the SM89/SM90/SM100
architectures.

Changes:
- Add CUTLASS v4.x FetchContent for SM120 kernel compilation
- Add enable_sm120_only guard in common.hpp
- Add cutlass_3x_gemm_sm120 template using Sm120 collective builders
- Add SM120 per-tensor and blockwise FP8 kernels and dispatch logic
- Add runtime dispatch to route SM120 GPUs to dedicated kernels
- Configure CMake to build SM120 sources with v4.x includes
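The runtime dispatch in the list above can be sketched as follows. This is an illustrative Python sketch of the routing decision only, not the PR's actual C++/CUDA dispatch code; the function and backend names are hypothetical:

```python
# Hypothetical sketch of compute-capability-based kernel routing.
# The real dispatch lives in the C++ extension; names here are illustrative.

def select_fp8_gemm_backend(major: int, minor: int) -> str:
    """Map a CUDA compute capability to an FP8 GEMM kernel family."""
    cc = major * 10 + minor
    if cc == 120:                # Blackwell RTX (SM120): CUTLASS v4.x kernels
        return "cutlass_v4_sm120"
    if cc in (89, 90, 100):      # Ada / Hopper / Blackwell DC: CUTLASS v3.9.2
        return f"cutlass_v3_sm{cc}"
    raise ValueError(f"FP8 GEMM unsupported on SM{cc}")

# Example: an RTX 5060 Ti reports compute capability (12, 0).
print(select_fp8_gemm_backend(12, 0))  # cutlass_v4_sm120
print(select_fp8_gemm_backend(9, 0))   # cutlass_v3_sm90
```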

This enables FP8 quantization on RTX PRO 6000 / RTX 5060 Ti GPUs.

Upgrade pytorch_triton to >=3.6.0 from PyTorch nightly to enable
Triton MoE kernels on Blackwell (SM120) GPUs.
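For context on "blockwise FP8 scaled": each block of a tensor gets its own scale so its values fit the narrow FP8 range. A toy pure-Python sketch of per-block scale computation (illustrative only; the block size is arbitrary here, and 448.0 is the standard float8_e4m3 maximum, not a value taken from this PR's kernels):

```python
# Toy illustration of blockwise FP8 scaling (not the PR's kernel code).
FP8_E4M3_MAX = 448.0  # largest finite value representable in float8_e4m3

def blockwise_scales(row, block_size=4):
    """Compute one scale per block so block / scale fits in [-448, 448]."""
    scales = []
    for i in range(0, len(row), block_size):
        block = row[i:i + block_size]
        amax = max(abs(x) for x in block) or 1.0  # avoid divide-by-zero
        scales.append(amax / FP8_E4M3_MAX)
    return scales

row = [0.5, -896.0, 2.0, 1.0, 0.1, 0.2, -0.4, 0.3]
print(blockwise_scales(row))  # first block scale: 896.0 / 448.0 = 2.0
```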

Tested: fused_moe_kernel compiles and runs successfully on RTX 5060 Ti.

Triton's TritonGPUAccelerateMatmul MLIR pass crashes on SM120, so
detect SM120 and fall back to the PyTorch iterative MoE implementation.
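The fallback decision described above can be sketched as follows (torch-free and illustrative; the real check would query torch.cuda.get_device_capability, and these function names are hypothetical):

```python
# Illustrative sketch of the SM120 MoE fallback decision (hypothetical names).

def use_triton_fused_moe(capability):
    """Return False on SM120, where Triton's TritonGPUAccelerateMatmul
    MLIR pass crashes, so the PyTorch iterative MoE path is used instead."""
    major, minor = capability
    return (major, minor) != (12, 0)

def run_moe(capability):
    if use_triton_fused_moe(capability):
        return "triton_fused_moe"     # fast fused Triton kernel
    return "pytorch_iterative_moe"    # slower, but works on SM120

print(run_moe((12, 0)))  # pytorch_iterative_moe
print(run_moe((9, 0)))   # triton_fused_moe
```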

Performance note: this is roughly 2-4x slower than the Triton fused MoE,
but it allows MoE models to run on Blackwell GPUs until Triton is fixed.

Extend the SM120 fallback to Fp8MoEMethod (not just
UnquantizedFusedMoEMethod).

Remove the Triton 3.6.0 upgrade, as it breaks PyTorch inductor
compatibility.
@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly. You can run further CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@gmorgachev gmorgachev merged commit 93d97e2 into gm/poc-layer-exp Feb 16, 2026
2 of 4 checks passed
