
FBGEMM_GPU v1.5.0 Release Notes


Released by @gchalump on 27 Jan 02:35
· 164 commits to main since this release

Highlights

CUDA 13 and Blackwell Support

  • Enabled CUDA 13 builds in OSS with full preparation for next-generation GPU architectures (#5143, #5100, #5301)
  • Added lazy TMEM allocation for Blackwell decode kernel for improved memory efficiency (#5262)
  • Added support for Blackwell CUTLASS attention kernels in torch.compile (#5136)
  • Added Paged Attention support to FMHA CUTLASS Blackwell Forward kernel for both fixed and variable length sequences (#4999, #5033)
  • Upgraded CUTLASS dependency to 4.3 with SM100 convolution fixes (#5127, #5047)

Table Batched Embedding (TBE) Improvements

  • Added hash_zch_identities and hash_zch_runtime_meta streaming logic for improved ZCH (Zero Collision Hashing) support (#5144, #5194)
  • Introduced KVZCHEvictionTBEConfig for flexible KVZCH eviction configuration (#5058)
  • Added sync trigger eviction support with Python API and all2all synchronization (#4984, #5062)
  • Added feature score eviction policy with no-eviction mode support (#5059)

GenAI and GEMM Performance

  • Added split-K support and heuristics for decode attention kernel, improving inference performance (#5213, #5225)
  • Added sliding window attention support to split-K generation kernel (#5231)
  • Added FP16 support for CUTLASS grouped GEMM operations (#5111)
  • Improved kleidi-ai matmul register usage and matrix partitioning for better performance (#5165, #5155)
  • Optimized FmhaKernelBwdConvert block size and grid shape (#5229)
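For reference, the split-K decode technique mentioned above partitions the KV sequence into chunks that each compute a partial softmax, then merges the partial results with a log-sum-exp correction. A minimal pure-Python sketch of that merge (illustrative only, not FBGEMM's kernel; `attend` and `attend_splitk` are hypothetical names):

```python
import math

def attend(q, K, V):
    # Reference: full softmax attention for a single query vector.
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in K]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    d = len(V[0])
    return [sum(w[i] * V[i][j] for i in range(len(V))) / z for j in range(d)]

def attend_splitk(q, K, V, splits=2):
    # Split the KV sequence into chunks, compute partial (max, sum-of-exp,
    # weighted-value) triples, then merge with a log-sum-exp correction.
    n, d = len(K), len(V[0])
    step = (n + splits - 1) // splits
    partials = []
    for start in range(0, n, step):
        Kc, Vc = K[start:start + step], V[start:start + step]
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in Kc]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        acc = [sum(w[i] * Vc[i][j] for i in range(len(Vc))) for j in range(d)]
        partials.append((m, z, acc))
    # Merge: rescale each chunk's partial by exp(m_chunk - m_global).
    m_g = max(m for m, _, _ in partials)
    z_g = sum(z * math.exp(m - m_g) for m, z, _ in partials)
    out = [0.0] * d
    for m, z, acc in partials:
        scale = math.exp(m - m_g)
        for j in range(d):
            out[j] += acc[j] * scale
    return [o / z_g for o in out]
```

Because each chunk carries its own running max and normalizer, the merged result matches the single-pass softmax exactly (up to floating-point rounding), which is what makes the KV split safe to parallelize.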

Quantization Improvements

  • Enabled direct MX4→BF16 dequantization to reduce memory footprint (#5206)
  • Added MXFP8 grouped GEMM improvements with better heuristics and assertions (#5190, #5203)
  • Enabled specifying output dtype for FP8 quantized communication (#5154)
  • Added FP8 Convolution Kernel with improved heuristics (#4994, #5118)
  • NVFP4 grouped tuning and alignment with eager PyTorch numerics (#5012, #5156)
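For context, MX4 above refers to the OCP microscaling FP4 layout: blocks of signed e2m1 elements sharing one power-of-two scale. A minimal decode sketch in plain Python (illustrative only, not FBGEMM's CUDA path; block size and scale handling are simplified):

```python
# e2m1 magnitude table, indexed by the 3 low bits of a 4-bit code
E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def dequant_mx4_block(codes, shared_exp):
    # codes: 4-bit values (sign bit in bit 3, e2m1 magnitude in bits 0-2);
    # shared_exp: the block's shared power-of-two exponent (E8M0-style scale).
    scale = 2.0 ** shared_exp
    out = []
    for c in codes:
        sign = -1.0 if (c & 0x8) else 1.0
        out.append(sign * E2M1[c & 0x7] * scale)
    return out
```

Decoding directly to BF16, as in #5206, skips the intermediate FP32 buffer that a two-step MX4→FP32→BF16 path would allocate.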

ARM / AArch64 Platform Support

  • Added multiple NEON-optimized quantization implementations for ARM64 (#5089, #5115, #5199)
  • Vectorized requantize_ for Arm64 with NEON intrinsics (#5130)
  • Improved kleidi-ai matmul for ARM architecture (#5155, #5165)

ROCm / AMD Platform Support

  • Added MI350 performance optimizations for embedding forward and backward passes (#5064, #5177)
  • Updated OSS build script to support AMD and CPU variants (#5257)
  • Updated default target ROCm architectures in OSS build (#5219)

Better Engineering

  • Upgraded GitHub Actions to latest versions for improved CI reliability (#5223)
  • Upgraded CUTLASS dependency to version 4.3 (#5127)
  • Improved sparse ops with Kineto tracing support for better profiling (#5060, #5061)
  • Added comprehensive FMHA tests and improved test organization (#5108, #5237)

Software Requirements

FBGEMM_GPU v1.5.0 has been tested and known to work on the following setups:

  • PyTorch: v2.10
  • CUDA: v12.6, 12.8, 12.9, 13.0
  • Python: v3.10, 3.11, 3.12, 3.13, 3.14
  • ROCm: v6.3, 6.4, 7.0 (with MI350X support)
  • ARM/AArch64: Apple Silicon, ARM64 builds enabled

It is recommended to install and run FBGEMM_GPU in an isolated environment, such as a Conda environment and/or a Docker container.
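For example, a fresh Conda environment can be set up as follows (the environment name and Python version here are illustrative):

```shell
# Create and activate an isolated environment
conda create -y -n fbgemm python=3.12
conda activate fbgemm

# Install FBGEMM_GPU inside the environment
pip install fbgemm-gpu==1.5.0
```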

Availability

FBGEMM_GPU can be fetched directly from PyPI:

# FBGEMM_GPU CUDA variant
pip install fbgemm-gpu==1.5.0

# FBGEMM_GPU CPU variant
pip install fbgemm-gpu-cpu==1.5.0

Alternatively, it can be fetched from PyTorch PIP:

# FBGEMM_GPU CUDA variant
pip install fbgemm-gpu==1.5.0 --index-url https://download.pytorch.org/whl/cu128/

# FBGEMM_GPU CPU variant
pip install fbgemm-gpu==1.5.0 --index-url https://download.pytorch.org/whl/cpu

Changes

Table Batched Embedding (TBE) Operators

For GPU

  • [Improvement] [v2] Tune max segment length per cta in triton table batched embeddings (#5274)
  • [New] Add robust field filtering in TBEDataConfig.from_json (#5164)
  • [New] Add hash_zch_runtime_meta streaming logic to TBE (#5194)
  • [Improvement] Bypass tbe eeg reporter when indices.numel == 0 (#5160)
  • [Improvement] shortcut for merge_pooled_embedding (#5147)
  • [New] Add table_names as an instance variable of SplitTableBatchedEmbeddingBagsCodegen (#5133)
  • [New] Add hash_zch_identities to SplitTableBatchedEmbeddingBagsCodegen forward (#5144)
  • [Improvement] Change kvzch_eviction_tbe_config to kvzch_tbe_config (#5084)
  • [Improvement] embedding forward optimization for MI350 (#5064)
  • [Improvement] Map hash_zch_identities to corresponding unique indices in TBE (#5077)
  • [New] Adding KVZCHEvictionTBEConfig in FBGEMM (#5058)
  • [Improvement] disable random init in inference operator for embedding cache (#5026)

SSD Table Batched Embedding (TBE) Operators

  • [Improvement] Update to batch processing delta update for kvzch and fix table idx issue. (#5148)
  • [Improvement] Change kvzch_eviction_tbe_config to kvzch_tbe_config (#5084)
  • [New] Support no eviction in Feature score eviction policy (#5059)
  • [Improvement] Free mem trigger with all2all for sync trigger eviction (#5062)
  • [New] Adding KVZCHEvictionTBEConfig in FBGEMM (#5058)
  • [New] Adding python api to support sync trigger evict (#4984)

Optimizer Support

  • [Improvement] Optimize FmhaKernelBwdConvert block size and grid shape (#5229)
  • [Improvement] backward performance optimization for MI350 (#5177)
  • [Improvement] Use LLM optimized rowwise quantization kernel (#5101)
  • [Improvement] optimization: move set_metadata out of main stream (#5082)
  • [Improvement] embedding forward optimization for MI350 (#5064)
  • [Improvement] group_index_select_or_add_2d_kernel forward pass optimization (#5080)

GenAI Support and Operators

Attention Kernels

  • [Improvement] reorganize blackwell_fmha_test.py a bit (#5237)
  • [Improvement] Ignore new blackwell attention splitk tests in FBGEMM CI (#5238)
  • [Improvement] Remove the guard around blackwell attention splitk tests (#5234)
  • [New] Add sliding window attention to splitk gen kernel (#5231)
  • [Improvement] Optimize FmhaKernelBwdConvert block size and grid shape (#5229)
  • [New] Add split-K heuristic for decode attention (#5225)
  • [New] Add split-K support for decode attention kernel (#5213)
  • [Improvement] fix example cpp run from paged attention (#5208)
  • [New] Support Blackwell CUTLASS attention kernels in torch.compile (#5136)
  • [Improvement] blackwell cutlass fmha changes ported from 4.3.0 (#5131)
  • [Improvement] Fix uncoalesced global memory access in decode attention bf16 kernel (#5109)
  • [Improvement] Fix cutlass_blackwell_fmha_custom_op and add comprehensive FMHA tests (#5108)
  • [Improvement] cutlass_blackwell_fmha_gen make kernel call argument batch_idx optional. (#5102)
  • [Improvement] Use int64_t instead of int32_t in decode attention kernel to avoid index calc overflow (#5095)
  • [Improvement] Bring 4.2.1 changes to FBGEMM blackwell cutlass fmha (#5052)
  • [Improvement] Update return args for CutlassBlackwellFmhaFunc BWD (#5040)
  • [New] Add Paged Attention support to FMHA FWD CUTLASS kernel for variable length (#5033)
  • [New] Add Paged Attention to FMHA Cutlass Blackwell Forward kernel for fixed length (#4999)
  • [New] Add gqa to decode unit tests (#5016)

CUTLASS/GEMM Support

  • [Improvement] Dismantle pyramid of doom in set_grouped_gemm_args_kernel (#5272)
  • [Improvement] Don't read C matrix in mxfp8 grouped gemm (#5275)
  • [Improvement] Port part of cutlass decode PR to fix static assertion w/ TileShape<64, 256, 128> (#5232)
  • [Improvement] Ignore new blackwell attention splitk tests in FBGEMM CI (#5238)
  • [Improvement] Update FBGEMM versioning to 1.5.0 (#5230)
  • [Improvement] update pinned versions of cutlass for fbgemm and mslk (#5220)
  • [New] Add better assertions for MXFP8 group gemm (#5203)
  • [Improvement] Bump setuptools from 75.1.0 to 78.1.1 in /fbgemm_gpu (#5139)
  • [Improvement] Improve kleidi-ai matmul register usage (#5165)
  • [Improvement] fbgemm nvfp4 cast: align division numerics with eager PyTorch (#5156)
  • [Improvement] Improve kleidi-ai matmul matrix partitioning (#5155)
  • [New] Support Blackwell CUTLASS attention kernels in torch.compile (#5136)
  • [Improvement] blackwell cutlass fmha changes ported from 4.3.0 (#5131)
  • [Improvement] Upgrade cutlass dependency to 4.3. (#5127)
  • [Improvement] update fbgemm genai install instructions (#5122)
  • [Improvement] Fix cutlass_blackwell_fmha_custom_op and add comprehensive FMHA tests (#5108)
  • [Improvement] Update fbgemm fp8 conv heuristic (#5118)
  • [New] Support fp16 for cutlass grouped GEMM (#5111)
  • [Improvement] Prepare FBGEMM_GPU for CUDA 13 builds (#5100)
  • [Improvement] cutlass_blackwell_fmha_gen make kernel call argument batch_idx optional. (#5102)
  • [Improvement] Refine Register fbgemm::sum_reduce_to_one (#5107)
  • [Improvement] Deprecate tl.async_task from fbgemm (#5094)
  • [Improvement] Cutlass Qtile Size shrunk to 64 (#5072)
  • [Improvement] Bring 4.2.1 changes to FBGEMM blackwell cutlass fmha (#5052)
  • [Improvement] Update return args for CutlassBlackwellFmhaFunc BWD (#5040)
  • [Improvement] Update CUTLASS in fbgemm_gpu for SM100 Convolution Fix (#5047)
  • [Improvement] FBGEMM fp8 conv for WAN 2.2 (#5043)
  • [New] Add Paged Attention support to FMHA FWD CUTLASS kernel for variable length (#5033)
  • [New] Add Paged Attention to FMHA Cutlass Blackwell Forward kernel for fixed length (#4999)
  • [Improvement] fix cutlass tmem sync issue (#5022)

Triton GEMM Support

  • [Improvement] [v2] Tune max segment length per cta in triton table batched embeddings (#5274)

Quantization Operators

  • [Improvement] Don't read C matrix in mxfp8 grouped gemm (#5275)
  • [New] Add FusedNBitRowwiseQuantizedSBHalfToFloatOrHalfNeon (#5199)
  • [New] Enable direct MX4→BF16 dequantization to reduce memory (cuda/cpp side) (1/2) (#5206)
  • [New] Add better assertions for MXFP8 group gemm (#5203)
  • [Improvement] Improve MXFP8 grouped heuristic (#5190)
  • [Improvement] Fix memory overflow issue that shows up when OutType is uint8_t and input_stride > output_stride (#5173)
  • [New] Enable specifying output dtype for fp8 quantized communication (#5154)
  • [Improvement] fbgemm nvfp4 cast: align division numerics with eager PyTorch (#5156)
  • [Improvement] Fix tail handling in arm64 requantize_ (#5153)
  • [Improvement] Vectorize requantize_ for Arm64 with NEON intrinsics (#5130)
  • [New] Add NEON implementation of FloatOrHalfToFusedNBitRowwiseQuantizedSBHalf (#5115)
  • [Improvement] Update fbgemm fp8 conv heuristic (#5118)
  • [Improvement] Use LLM optimized rowwise quantization kernel (#5101)
  • [New] Add NEON-based FloatOrHalfToFused8BitRowwiseQuantizedSBFloat (#5089)
  • [Improvement] FBGEMM fp8 conv for WAN 2.2 (#5043)
  • [Improvement] FP8 Convolution Kernel (#4994)
  • [Improvement] NVFP4 grouped tuning (#5012)

Sparse Operators

  • [Improvement] Use TORCH_CHECK_VALUE in sparse ops (#5215)
  • [Improvement] accelerate permute_1D_data_kernel (#5110)
  • [New] Adding Kineto support to bench:sparse_ops (#5060)
  • [New] Support 2D weights permute for strided keys (#5145)
  • [Improvement] Back out "bucket_permute kernel to accelerate bucketize in rebatching" (#5079)
  • [New] Support larger lookup in permute (#5086)
  • [Improvement] remove pt2 compliant xfails for jagged ops (#5068)
  • [New] Add kineto tracing to bench:jagged_tensor (#5061)
  • [Improvement] bucket_permute kernel to accelerate bucketize in rebatching (#5050)

Comm Operators

  • [New] Enable specifying output dtype for fp8 quantized communication (#5154)

Platform Support

CUDA / Blackwell / GB200

  • [New] Implement lazy TMEM allocation for Blackwell decode kernel (#5262)
  • [Improvement] reorganize blackwell_fmha_test.py a bit (#5237)
  • [Improvement] Ignore new blackwell attention splitk tests in FBGEMM CI (#5238)
  • [Improvement] Remove the guard around blackwell attention splitk tests (#5234)
  • [New] Support Blackwell CUTLASS attention kernels in torch.compile (#5136)
  • [Improvement] blackwell cutlass fmha changes ported from 4.3.0 (#5131)
  • [Improvement] Fix cutlass_blackwell_fmha_custom_op and add comprehensive FMHA tests (#5108)
  • [New] Add CUDAGuard to ensure correct device (#5113)
  • [Improvement] Prepare FBGEMM_GPU for CUDA 13 builds (#5100)
  • [Improvement] cutlass_blackwell_fmha_gen make kernel call argument batch_idx optional. (#5102)
  • [Improvement] Bring 4.2.1 changes to FBGEMM blackwell cutlass fmha (#5052)
  • [Improvement] Update return args for CutlassBlackwellFmhaFunc BWD (#5040)
  • [Improvement] Update CUTLASS in fbgemm_gpu for SM100 Convolution Fix (#5047)
  • [New] Add Paged Attention to FMHA Cutlass Blackwell Forward kernel for fixed length (#4999)

ARM / AArch64

  • [New] Add FusedNBitRowwiseQuantizedSBHalfToFloatOrHalfNeon (#5199)
  • [Improvement] Fix tail handling in arm64 requantize_ (#5153)
  • [Improvement] Vectorize requantize_ for Arm64 with NEON intrinsics (#5130)
  • [New] Add NEON implementation of FloatOrHalfToFusedNBitRowwiseQuantizedSBHalf (#5115)
  • [New] Add NEON-based FloatOrHalfToFused8BitRowwiseQuantizedSBFloat (#5089)
  • [Improvement] Remove AVX compilation on aarch64 (#5065)

ROCm / AMD

  • [Improvement] Update OSS build script to support AMD and CPU variants (#5257)
  • [Improvement] Update default target ROCm architectures in OSS build (#5219)
  • [Improvement] Bug fix in one specialized HIP instantiation of the warp-per-row kernel (#5214)
  • [Improvement] backward performance optimization for MI350 (#5177)
  • [Improvement] Fix ROCm HIPify failure in OSS (#5174)
  • [Improvement] embedding forward optimization for MI350 (#5064)

Build / CI Improvements and Better Engineering

  • [Improvement] Upgrade GitHub Actions to latest versions (#5223)