Commit d0a47e9

Minor updates (#152)
1 parent 306412f commit d0a47e9

1 file changed: +2 −2 lines changed


blog/2025-06-16-gb200-part-1.md

Lines changed: 2 additions & 2 deletions

```diff
@@ -11,10 +11,10 @@ The GB200 NVL72 is the world's most advanced hardware for AI training and inference
 
 As preliminary work, we integrated the following components into SGLang:
 
-* **Blackwell DeepGEMM**: A high-performance General Matrix Multiply (GEMM) library tailored for FP8 precision, rewritten to fully exploit the Blackwell architecture. Quantization and packing are introduced for input scales in the new API.
+* **Blackwell DeepGEMM**: A high-performance General Matrix Multiplication (GEMM) library tailored for FP8 precision, rewritten to fully exploit the Blackwell architecture. Quantization and packing are introduced for input scales in the new API, and the newly introduced UMMA feature is used for fast matrix multiplications.
 
 * **Blackwell DeepEP**: A communication library designed to shuffle tokens for routed experts in Mixture of Experts (MoE). The new NVLink-only environment is supported by mapping remote GPU memory to the local virtual address space. We also improved DeepEP performance by about 15%.
 
 * **FlashInfer Blackwell FMHA**: A high-performance Fused Multi-Head Attention (FMHA) kernel for DeepSeek prefilling, rewritten to support the Blackwell architecture.
-* **Blackwell CUTLASS MLA**: A Multi-head Latent Attention (MLA) kernel optimized for the Blackwell architecture, leveraging a 2-SM cluster design.
+* **Blackwell CUTLASS MLA**: A Multi-Head Latent Attention (MLA) kernel optimized for the Blackwell architecture. It leverages the new UMMA feature and enables [2-SM](https://github.com/NVIDIA/cutlass/blob/main/examples/77_blackwell_fmha/kernel/sm100_fmha_mla_tma_warpspecialized.hpp#L119) cluster mode for TMA, reducing L2 read traffic on the KV cache.
 
 * **Blackwell Mooncake**: A transfer engine utilized in Key-Value (KV) cache transfer for prefill-decode disaggregation. It also employs techniques similar to DeepEP to support NVLink.
 
 ## Experiments
```
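The DeepGEMM bullet mentions introducing quantization and packing for input scales. As a rough, CPU-only illustration of what per-block scale quantization means for FP8 inputs (function names, the block size, and the plain-`round` cast are hypothetical stand-ins, not DeepGEMM's actual API):

```python
# Illustrative sketch of per-block input-scale quantization: each block of
# values shares one scale so the scaled values fit the narrow FP8 range.
# This is a CPU toy, not DeepGEMM's real packing layout or FP8 cast.

FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 e4m3


def quantize_blockwise(values, block_size=128):
    """Split `values` into blocks, pick one scale per block, and return
    (quantized_values, scales). Dequantize an element q as q * scale."""
    quantized, scales = [], []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        amax = max(abs(v) for v in block)
        scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
        scales.append(scale)
        # round() stands in for the real narrowing cast to FP8.
        quantized.extend(round(v / scale) for v in block)
    return quantized, scales


def dequantize_blockwise(quantized, scales, block_size=128):
    """Recover approximate original values from (quantized, scales)."""
    return [q * scales[i // block_size] for i, q in enumerate(quantized)]
```

Because the scale is chosen per block rather than per tensor, a few large outliers only hurt the precision of their own block, which is why FP8 GEMM pipelines carry these per-block scales alongside the packed data.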
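The DeepEP bullet describes shuffling tokens to their routed experts. A minimal single-process sketch of that data movement (the function names are hypothetical; real libraries such as DeepEP perform this shuffle across GPUs over NVLink/RDMA rather than in local Python lists):

```python
# Toy version of MoE dispatch/combine: gather each token into its routed
# expert's bucket so every expert sees a contiguous batch, then scatter the
# expert outputs back to the original token order.


def dispatch(tokens, expert_ids, num_experts):
    """Return (per_expert_batches, permutation); `permutation` records each
    token's original position so `combine` can undo the shuffle."""
    buckets = [[] for _ in range(num_experts)]
    permutation = [[] for _ in range(num_experts)]
    for idx, (tok, eid) in enumerate(zip(tokens, expert_ids)):
        buckets[eid].append(tok)
        permutation[eid].append(idx)
    return buckets, permutation


def combine(expert_outputs, permutation, num_tokens):
    """Inverse of dispatch: scatter per-expert outputs back to token order."""
    out = [None] * num_tokens
    for eid, outputs in enumerate(expert_outputs):
        for tok, idx in zip(outputs, permutation[eid]):
            out[idx] = tok
    return out
```

For example, dispatching tokens `["a", "b", "c", "d"]` with expert ids `[1, 0, 1, 0]` yields buckets `[["b", "d"], ["a", "c"]]`, and `combine` restores the original order; the distributed version does the same bookkeeping with remote memory writes instead of list appends.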
