# Changelog

## NVIDIA Megatron Core 0.12.0

- Add FP8 recipe selection to arguments (--fp8-recipe, --first-last-layers-bf16, --num-layers-at-start-in-bf16, --num-layers-at-end-in-bf16)
- Context parallel: fix loss scaling when calculate_per_token_loss=True
- Make the number of data-parallel communication buckets configurable (--ddp-num-buckets, --ddp-pad-buckets-for-high-nccl-busbw)
- Inference
  - Support in-flight batching and a chunked KV cache
  - Reduce memory usage
    - by not materializing the full attention mask
    - by only materializing logits for the last token during decode
    - by removing an obsolete tensor reference
- Hybrid Model
  - Inference
    - Add CUDA graph support
    - Change tools/run_mamba_text_generation_server.py to use megatron.core.inference
    - Fix a shape issue when materializing logits for the Mamba model
  - Improve initialization of Mamba layers
  - Add configuration switches (--mamba-state-dim, --mamba-head-dim, --mamba-num-groups, --is-hybrid-model)
  - Make num_floating_point_operations work with hybrid models
  - Make hybrid_conversion.py work with a mixer that uses TE linear layers
  - Add FP8 support
  - Fix Mamba dt_bias tensor parallelism
  - Support multimodal tokenizer
  - Improve data-parallelism scaling
- MoE
  - Features:
    - DeepEP support, compatible with all parallelisms and both token-drop and dropless modes
    - Important precision improvement: enable FP32/FP64 routing and unpermutation via --moe-router-dtype; FP32 is recommended for all fine-grained MoE training
    - CUDA graph support for MoE
    - Multi-Token Prediction (MTP) support
    - Fused indices_to_multihot kernel for the DeepEP dispatcher
  - Bug fixes:
    - Fix a hang issue with MoE+Dense hybrid models
    - Update theoretical memory and TFLOPS estimation for MoE and MLA
    - Fix MoE aux-loss scaling for per-token loss
    - Fixes for group-limited routing and expert bias, verified through DeepSeek-V3 (dsv3) end-to-end runs
  - Known issues:
    - Checkpoints trained with custom FSDP for MoE may not be compatible with 3D-parallel training

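The fused indices_to_multihot kernel above accelerates a simple mapping: each token's top-k expert indices become a multi-hot mask over all experts. The following NumPy sketch shows the semantics of that mapping only; the function signature and shapes are illustrative assumptions, not the kernel's actual interface.

```python
import numpy as np

def indices_to_multihot(indices: np.ndarray, num_experts: int) -> np.ndarray:
    """Convert per-token top-k expert indices of shape (tokens, k) into a
    multi-hot mask of shape (tokens, num_experts).

    Illustrative re-implementation of the operation's semantics; the
    fused kernel in Megatron Core runs on GPU for the DeepEP dispatcher.
    """
    tokens, _ = indices.shape
    multihot = np.zeros((tokens, num_experts), dtype=bool)
    # Scatter True at each token's selected expert slots.
    np.put_along_axis(multihot, indices, True, axis=1)
    return multihot

# Two tokens, top-2 routing over 4 experts.
mask = indices_to_multihot(np.array([[0, 2], [1, 3]]), num_experts=4)
```
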
## NVIDIA Megatron Core 0.11.0

- Add multi-datacenter training support through N/S connection
- Fix TEGroupedMLP distckpt compatibility issue with FP8 padding/unpadding
- Known issues:
  - When training a Dense+MoE hybrid model, the process will hang if any PP rank does not have expert params
- Add MX-FP16 support for the optimizer and master weights
- CUDA graph memory optimizations
- Enable the UCC backend for PP communication
- Optimizer CPU offload support for memory savings
- Models
  - Initial RADIO/CRADIO implementation
  - Llama 3.2 support
- Hybrid Model
  - Support quantization via TensorRT Model Optimizer
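
The optimizer and master-weight items above build on the standard mixed-precision pattern: the optimizer keeps an FP32 master copy of each parameter, applies the update there, and refreshes the low-precision working copy from it. A minimal NumPy sketch of that pattern with plain SGD (MX-FP16's block-scaled format is not reproduced here; all names are illustrative):

```python
import numpy as np

def master_weight_step(master_fp32, grad, lr):
    """One SGD step on the FP32 master weights; returns the refreshed
    FP16 working copy. Sketch of the master-weight pattern only; real
    optimizers (and the MX-FP16 format) are more involved."""
    # Accumulate in FP32 so tiny updates are not lost to FP16 rounding.
    master_fp32 -= lr * grad.astype(np.float32)
    return master_fp32.astype(np.float16)

# A parameter at 1.0 receiving updates of ~1e-5 per step: far below
# FP16's resolution near 1.0, so an FP16-only accumulator would never
# move, while the FP32 master accumulates the updates correctly.
master = np.array([1.0], dtype=np.float32)
grad = np.array([1e-4], dtype=np.float16)
for _ in range(10):
    working = master_weight_step(master, grad, lr=0.1)
```
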

## NVIDIA Megatron Core 0.10.0

- Qwen model support
- Known issues
  - When using sequence parallelism, dropout in the transformer-block forward pass does not use the appropriate RNG context
- NVRx / fault tolerance
  - Fault and hang detection in addition to existing straggler detection
  - Graceful exit and auto-restart
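
Hang detection of the kind listed above is commonly built on heartbeats: workers timestamp their progress, and a watchdog flags a hang when the heartbeat goes stale. A minimal generic sketch of that idea (the class name, API, and timeout are invented for illustration, not NVRx's interface):

```python
import threading
import time

class HeartbeatWatchdog:
    """Flag a hang when no heartbeat arrives within `timeout` seconds.

    Generic illustration of heartbeat-based hang detection; the actual
    NVRx mechanism and API differ."""

    def __init__(self, timeout: float):
        self.timeout = timeout
        self._lock = threading.Lock()
        self._last_beat = time.monotonic()

    def beat(self) -> None:
        # Called from the training loop whenever it makes progress.
        with self._lock:
            self._last_beat = time.monotonic()

    def is_hung(self) -> bool:
        with self._lock:
            return time.monotonic() - self._last_beat > self.timeout

wd = HeartbeatWatchdog(timeout=0.05)
wd.beat()
alive = not wd.is_hung()   # fresh heartbeat: not flagged
time.sleep(0.1)            # simulate a stalled worker
hung = wd.is_hung()
```
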
## NVIDIA Megatron Core 0.8.0
