Commit d0a47e9

Minor updates (#152)
1 parent 306412f commit d0a47e9

1 file changed: +2 −2 lines changed


blog/2025-06-16-gb200-part-1.md

Lines changed: 2 additions & 2 deletions

```diff
@@ -11,10 +11,10 @@ The GB200 NVL72 is the world's most advanced hardware for AI training and inference
 
 As preliminary work, we integrated the following components into SGLang:
 
-* **Blackwell DeepGEMM**: A high-performance General Matrix Multiply (GEMM) library tailored for FP8 precision, rewritten to fully exploit the Blackwell architecture. Quantization and packing are introduced for input scales in the new API.
+* **Blackwell DeepGEMM**: A high-performance General Matrix Multiplication (GEMM) library tailored for FP8 precision, rewritten to fully exploit the Blackwell architecture. Quantization and packing are introduced for input scales in the new API, and the newly introduced UMMA feature is used for fast matrix multiplications.
 
 * **Blackwell DeepEP**: A communication library designed to shuffle tokens for routed experts in Mixture of Experts (MoE). The new NVLink-only environment is supported by mapping remote GPU memory to the local virtual address space. We also improved DeepEP performance by about 15%.
 
 * **FlashInfer Blackwell FMHA**: A high-performance Fused Multi-Head Attention (FMHA) kernel for DeepSeek prefilling, rewritten to support the Blackwell architecture.
-* **Blackwell CUTLASS MLA**: A Multi-head Latent Attention (MLA) kernel optimized for the Blackwell architecture, leveraging a 2-SM cluster design.
+* **Blackwell CUTLASS MLA**: A Multi-Head Latent Attention (MLA) kernel optimized for the Blackwell architecture. It leverages the new UMMA feature and enables [2-SM](https://github.com/NVIDIA/cutlass/blob/main/examples/77_blackwell_fmha/kernel/sm100_fmha_mla_tma_warpspecialized.hpp#L119) cluster mode for TMA, reducing L2 read traffic on the KV cache.
 
 * **Blackwell Mooncake**: A transfer engine utilized in Key-Value (KV) cache transfer for prefill-decode disaggregation. It also employs techniques similar to DeepEP to support NVLink.
 
 ## Experiments
```
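The DeepGEMM bullet mentions introducing quantization and packing for input scales. As a rough, CPU-only illustration of what per-block scale quantization means for FP8 inputs (function names, the block size, and the plain-`round` cast are hypothetical stand-ins, not DeepGEMM's actual API):

```python
# Illustrative sketch of per-block input-scale quantization: each block of
# values shares one scale so the scaled values fit the narrow FP8 range.
# This is a CPU toy, not DeepGEMM's real packing layout or FP8 cast.

FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 e4m3


def quantize_blockwise(values, block_size=128):
    """Split `values` into blocks, pick one scale per block, and return
    (quantized_values, scales). Dequantize an element q as q * scale."""
    quantized, scales = [], []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        amax = max(abs(v) for v in block)
        scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
        scales.append(scale)
        # round() stands in for the real narrowing cast to FP8.
        quantized.extend(round(v / scale) for v in block)
    return quantized, scales


def dequantize_blockwise(quantized, scales, block_size=128):
    """Recover approximate original values from (quantized, scales)."""
    return [q * scales[i // block_size] for i, q in enumerate(quantized)]
```

Because the scale is chosen per block rather than per tensor, a few large outliers only hurt the precision of their own block, which is why FP8 GEMM pipelines carry these per-block scales alongside the packed data.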
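The DeepEP bullet describes shuffling tokens to their routed experts. A minimal single-process sketch of that data movement (the function names are hypothetical; real libraries such as DeepEP perform this shuffle across GPUs over NVLink/RDMA rather than in local Python lists):

```python
# Toy version of MoE dispatch/combine: gather each token into its routed
# expert's bucket so every expert sees a contiguous batch, then scatter the
# expert outputs back to the original token order.


def dispatch(tokens, expert_ids, num_experts):
    """Return (per_expert_batches, permutation); `permutation` records each
    token's original position so `combine` can undo the shuffle."""
    buckets = [[] for _ in range(num_experts)]
    permutation = [[] for _ in range(num_experts)]
    for idx, (tok, eid) in enumerate(zip(tokens, expert_ids)):
        buckets[eid].append(tok)
        permutation[eid].append(idx)
    return buckets, permutation


def combine(expert_outputs, permutation, num_tokens):
    """Inverse of dispatch: scatter per-expert outputs back to token order."""
    out = [None] * num_tokens
    for eid, outputs in enumerate(expert_outputs):
        for tok, idx in zip(outputs, permutation[eid]):
            out[idx] = tok
    return out
```

For example, dispatching tokens `["a", "b", "c", "d"]` with expert ids `[1, 0, 1, 0]` yields buckets `[["b", "d"], ["a", "c"]]`, and `combine` restores the original order; the distributed version does the same bookkeeping with remote memory writes instead of list appends.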
