Implement SM120 blockwise FP8 scaled matrix multiplication kernels using CUTLASS v4.x while maintaining CUTLASS v3.9.2 for the SM89/SM90/SM100 archs. Changes:
- Add CUTLASS v4.x FetchContent for SM120 kernel compilation
- Add enable_sm120_only guard in common.hpp
- Add cutlass_3x_gemm_sm120 template using Sm120 collective builders
- Add SM120 per-tensor and blockwise FP8 kernels and dispatch logic
- Add runtime dispatch to route SM120 GPUs to dedicated kernels
- Configure CMake to build SM120 sources with v4.x includes

This enables FP8 quantization on RTX PRO 6000 / RTX 5060 Ti GPUs.
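The runtime dispatch described above can be sketched as a capability-keyed lookup. This is a hedged illustration, not the PR's actual C++ code: `select_fp8_gemm_kernel` is a hypothetical helper, and only the kernel names (`cutlass_3x_gemm_sm120`, etc.) come from the change description.

```python
def select_fp8_gemm_kernel(capability: tuple[int, int]) -> str:
    """Route a GPU compute capability to the matching FP8 GEMM kernel.

    Hypothetical sketch: mirrors the per-arch dispatch this PR adds,
    where SM120 (compute capability 12.0) gets kernels built against
    CUTLASS v4.x and older archs keep the CUTLASS v3.9.2 builds.
    """
    if capability == (12, 0):   # RTX PRO 6000 / RTX 50-series (SM120)
        return "cutlass_3x_gemm_sm120"
    if capability == (10, 0):   # SM100
        return "cutlass_3x_gemm_sm100"
    if capability == (9, 0):    # SM90 (Hopper)
        return "cutlass_3x_gemm_sm90"
    if capability == (8, 9):    # SM89 (Ada)
        return "cutlass_3x_gemm_sm89"
    return "torch_fallback"     # unsupported arch: plain PyTorch path
```

At runtime the capability would come from `torch.cuda.get_device_capability()`; keeping the decision in one pure function makes the routing easy to unit-test without a GPU.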
Upgrade pytorch_triton to >=3.6.0 from PyTorch nightly to enable Triton MoE kernels on Blackwell (SM120) GPUs. Tested: fused_moe_kernel compiles and runs successfully on RTX 5060 Ti.
Triton's TritonGPUAccelerateMatmul MLIR pass crashes on SM120. Detect SM120 and fall back to the PyTorch iterative MoE implementation. Performance note: this is ~2-4x slower than Triton fused MoE, but it allows MoE models to run on Blackwell GPUs until Triton is fixed.
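The iterative PyTorch path used as the fallback can be sketched roughly as an expert-by-expert loop. This is a simplified, hypothetical version (each expert reduced to a single linear layer, `iterative_moe` is not vLLM's actual function), shown only to illustrate why it is slower than the fused Triton kernel: every expert launches its own gather/GEMM/scatter instead of one fused kernel.

```python
import torch

def iterative_moe(hidden: torch.Tensor,
                  expert_weights: list[torch.Tensor],
                  topk_ids: torch.Tensor,
                  topk_vals: torch.Tensor) -> torch.Tensor:
    """Run MoE one expert at a time (simplified sketch).

    hidden:         [T, H] token activations
    expert_weights: one [H, H] weight per expert (real experts are MLPs)
    topk_ids:       [T, K] routed expert ids per token
    topk_vals:      [T, K] router weights per token
    """
    out = torch.zeros_like(hidden)
    for eid, w in enumerate(expert_weights):
        mask = (topk_ids == eid)                   # [T, K] bool
        token_idx, slot_idx = mask.nonzero(as_tuple=True)
        if token_idx.numel() == 0:
            continue                               # no tokens routed here
        contrib = hidden[token_idx] @ w.t()        # run expert on its tokens
        scale = topk_vals[token_idx, slot_idx, None]
        out.index_add_(0, token_idx, contrib * scale)
    return out
```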
Extend the SM120 fallback to Fp8MoEMethod (not just UnquantizedFusedMoEMethod). Remove the Triton 3.6.0 upgrade, as it breaks PyTorch inductor compatibility.
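Extending the fallback to both MoE methods amounts to sharing one capability guard. A minimal sketch, assuming hypothetical method bodies (only the class names `UnquantizedFusedMoEMethod` and `Fp8MoEMethod` and the guard condition come from the PR description):

```python
def triton_fused_moe_supported(capability: tuple[int, int]) -> bool:
    # SM120 (12, 0) crashes Triton's TritonGPUAccelerateMatmul pass,
    # so those GPUs take the slower iterative PyTorch path.
    return capability != (12, 0)

class _MoEMethodBase:
    """Shared dispatch: both MoE methods consult the same SM120 guard."""
    def apply(self, hidden, capability):
        if triton_fused_moe_supported(capability):
            return self._fused_triton(hidden)
        return self._iterative_torch(hidden)

class UnquantizedFusedMoEMethod(_MoEMethodBase):
    # Placeholder bodies: they just report which path was taken.
    def _fused_triton(self, h): return ("triton", h)
    def _iterative_torch(self, h): return ("torch", h)

class Fp8MoEMethod(_MoEMethodBase):
    def _fused_triton(self, h): return ("triton-fp8", h)
    def _iterative_torch(self, h): return ("torch-fp8", h)
```

Centralizing the check means a future Triton fix only needs to relax `triton_fused_moe_supported` in one place.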
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small, essential subset of tests runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.

Purpose
Test Plan
Test Result
(Optional) Documentation Update