Commit d02f745

ADLR/megatron-lm!3229 - chore: Update changelog 0.12
1 parent 3cdfd61

1 file changed: CHANGELOG.md (+51, -0 lines)
# Changelog

## NVIDIA Megatron Core 0.12.0

- Add FP8 recipe selection to arguments (--fp8-recipe, --first-last-layers-bf16, --num-layers-at-start-in-bf16, --num-layers-at-end-in-bf16)
- Context parallel: fix loss scaling when calculate_per_token_loss=True
- Make the number of data parallel communication buckets configurable (--ddp-num-buckets, --ddp-pad-buckets-for-high-nccl-busbw)
- Inference
  - Support in-flight batching and chunked KV cache
  - Reduce memory usage:
    - by not materializing the full attention mask
    - by only materializing logits for the last token during decode
    - by removing an obsolete tensor reference
- Hybrid Model
  - Inference
    - Add CUDA graph support
    - Change tools/run_mamba_text_generation_server.py to use megatron.core.inference
    - Fix a shape issue when materializing logits for the Mamba model
  - Improve initialization of Mamba layers
  - Add configuration switches (--mamba-state-dim, --mamba-head-dim, --mamba-num-groups, --is-hybrid-model)
  - Make num_floating_point_operations work with hybrid models
  - Make hybrid_conversion.py work with a mixer that uses TE linear layers
  - Add FP8 support
  - Fix Mamba dt_bias tensor parallelism
- Support multimodal tokenizer
- Improve data parallelism scaling
- MoE
  - Features:
    - DeepEP support, compatible with all parallelisms and both token-drop and dropless modes
    - Important precision improvement: enable FP32/FP64 routing and unpermutation using --moe-router-dtype. FP32 is recommended for all fine-grained MoE training
    - CUDA graph support for MoE
    - Multi-Token Prediction (MTP) support
    - Fused indices_to_multihot kernel for the DeepEP dispatcher
  - Bug fixes:
    - Fix hang issue with MoE + dense hybrid models
    - Update theoretical memory and TFLOPS estimation for MoE and MLA
    - Fix MoE aux loss scaling for per-token loss
    - Fixes for group-limited routing and expert bias, verified through DeepSeek-V3 end-to-end runs
  - Known issues:
    - Checkpoints trained with Custom FSDP for MoE may not be compatible with 3D parallel training.

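The context-parallel loss-scaling fix above can be illustrated with a minimal sketch (plain Python, hypothetical helper names, not Megatron's actual code): with calculate_per_token_loss=True, per-token losses must be normalized by the global token count across all context-parallel ranks, not by averaging each rank's local mean, because ranks can hold different numbers of tokens.

```python
# Illustrative sketch only (not Megatron's implementation).
def per_token_loss_scale(rank_losses):
    """rank_losses: one list of per-token losses per context-parallel rank.
    Correct scaling: divide the global loss sum by the global token count."""
    total_loss = sum(sum(losses) for losses in rank_losses)
    total_tokens = sum(len(losses) for losses in rank_losses)
    return total_loss / total_tokens

def buggy_scale(rank_losses):
    """Averaging per-rank means weights every rank equally even when token
    counts differ across ranks -- the kind of mis-scaling the fix addresses."""
    means = [sum(losses) / len(losses) for losses in rank_losses]
    return sum(means) / len(means)

shards = [[2.0, 4.0, 6.0], [8.0]]       # uneven shards across 2 CP ranks
print(per_token_loss_scale(shards))     # 20 / 4 tokens = 5.0
print(buggy_scale(shards))              # (4.0 + 8.0) / 2 = 6.0
```

With even shards the two agree; the discrepancy appears exactly when token counts differ per rank.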
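The configurable bucket count (--ddp-num-buckets) concerns how parameters are grouped for gradient all-reduce; a rough conceptual sketch (hypothetical helper, not Megatron's implementation) of splitting parameter sizes into a fixed number of roughly equal contiguous buckets:

```python
# Conceptual sketch only: fewer, larger buckets tend to raise NCCL bus
# bandwidth per collective at the cost of compute/communication overlap.
def bucket_params(param_sizes, num_buckets):
    """Greedily split a list of parameter sizes into at most num_buckets
    contiguous buckets of roughly equal total size."""
    target = sum(param_sizes) / num_buckets
    buckets, current = [], []
    for size in param_sizes:
        current.append(size)
        # Close the bucket once it reaches the target, unless it is the last.
        if sum(current) >= target and len(buckets) < num_buckets - 1:
            buckets.append(current)
            current = []
    if current:
        buckets.append(current)
    return buckets

sizes = [10, 30, 20, 25, 15]            # parameter element counts
print(bucket_params(sizes, 2))          # [[10, 30, 20], [25, 15]]
```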
## NVIDIA Megatron Core 0.11.0

- Add multi-datacenter training support through N/S connection

[...]

- Fix TEGroupedMLP distckpt compatibility issue with FP8 padding/unpadding.
- Known Issues:
  - When training the Dense+MoE hybrid model, the process will hang if any PP rank does not have expert params.
- Add MX-FP16 support for optimizer and master weights
- CUDA Graph memory optimizations
- Enable UCC backend for PP communication
- Optimizer CPU offload support for memory savings
- Models
  - Initial RADIO/CRADIO implementation
  - llama3.2 support
- Hybrid Model
  - Support quantization via TensorRT Model Optimizer

## NVIDIA Megatron Core 0.10.0

[...]

- Qwen Model support
- Known Issues
  - When using sequence parallel, during the transformer block forward pass, dropout is not using the appropriate rng context.
- NVRx / Fault tolerance
  - Fault and hang detection in addition to existing straggler detection
  - Graceful exit and auto restart
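The graceful-exit/auto-restart item above can be sketched conceptually (plain Python, hypothetical names, not the NVRx API): a supervisor wraps the training run so that a detected fault triggers a controlled retry from the last good state instead of a hard crash.

```python
# Conceptual sketch only (not the NVRx interface).
def run_with_auto_restart(train, max_restarts=3):
    """Re-invoke train() after a fault, up to a fixed restart budget."""
    attempts = 0
    while True:
        try:
            return train()
        except RuntimeError:
            attempts += 1
            if attempts > max_restarts:
                raise  # budget exhausted: surface the failure
            # In a real system: validate the last checkpoint, then relaunch.

calls = {"n": 0}
def flaky_train():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("simulated hang/fault")
    return "done"

print(run_with_auto_restart(flaky_train))  # "done" after two restarts
```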

## NVIDIA Megatron Core 0.8.0
