Commit bf71eba

Add 0.1.0rc3 changelog (#862) (#889)
* Add 0.1.0rc3 changelog
* Fix 0.1.0a0 changelog
* Update Changelog
* Apply suggestions from code review
* Update changelog with migration guide
* Add perf summary link
* Update CHANGELOG.md

Signed-off-by: Charlie Truong <[email protected]>
Co-authored-by: Charlie Truong <[email protected]>
1 parent 0993dd8 commit bf71eba

File tree: 1 file changed (+33, −0 lines changed)


CHANGELOG.md

Lines changed: 33 additions & 0 deletions
@@ -1,5 +1,38 @@
 # Changelog
 
+## NVIDIA Megatron-Bridge 0.1.0rc3
+
+* Model Collection Support
+  * Llama
+  * Qwen 2, Qwen 3, Qwen 3 MoE
+  * DeepSeek
+  * Mamba
+* [Migration guide from NeMo 2 to Megatron-Bridge](https://docs.nvidia.com/nemo/megatron-bridge/0.1.0/nemo2-migration-guide.html)
+* [Contribution guide for adding a new model](https://docs.nvidia.com/nemo/megatron-bridge/0.1.0/adding-new-models.html)
+* [Checkpoint conversion from Hugging Face to Megatron](https://github.com/NVIDIA-NeMo/Megatron-Bridge/tree/main/src/megatron/bridge/models/conversion) (see the usage sketch after this diff)
+* [Performance](https://docs.nvidia.com/nemo/megatron-bridge/0.1.0/performance-summary.html)
+* MoE LLM
+  * Change the model to dropless MoE with balanced gating
+  * Fusion of operators in the router function
+  * Global permutation fusion with the A2A (all-to-all) dispatcher
+  * EP A2A communication overlap with computation in both 1F1B pipelining and non-pipelined training
+  * Precision-aware optimizer update to support BF16 states
+* Megatron FSDP
+  * Migration from MCore FSDP to Megatron FSDP
+  * Fusion of the weight-gradient copy to the reduce-scatter communication buffer into the WGRAD GEMM
+  * Removed redundant optimizer operations
+  * Use ZeRO-1 (optimizer and master parameter sharding) in the replica domain of hybrid FSDP to further lower memory usage
+  * IB-SHARP support for the IB AllReduce of hybrid FSDP, available as a patch with NCCL 2.28
+* MXFP8
+  * Improved activation gradient all-gather overlap performance via Userbuffers
+  * Parameter all-gather overlap with computation while sharing the communication buffer with reduce-scatter
+  * Fusion of the MXFP8 scaling-factor swizzling kernels
+  * Use PDL (Programmatic Dependent Launch) for quantization kernels to lower CPU overhead
+* Others
+  * Full-iteration CUDA graph for dense models without pipelining
+  * Fusion of activation and cast (currently tensor-wise scaling only)
+  * Store SwiGLU input in FP8 to save activation memory
+
 ## NVIDIA Megatron-Bridge 0.1.0a0
 
 * Llama and Qwen

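The rc3 notes above link to the Hugging Face-to-Megatron checkpoint conversion module but this commit does not show how it is invoked. Below is a minimal, non-authoritative sketch of what such a conversion call might look like; the import path and the `AutoBridge`, `from_hf_pretrained`, and `to_megatron_provider` names are assumptions made for illustration, not an API confirmed by this changelog, so consult the linked conversion module for the actual entry points.

```python
# Minimal sketch only. The import path and the AutoBridge.from_hf_pretrained /
# to_megatron_provider names below are assumptions for illustration; check the
# linked conversion module for the real entry points and signatures.
from megatron.bridge import AutoBridge  # assumed import path


def hf_checkpoint_to_megatron(hf_model_id: str):
    # Read the Hugging Face checkpoint and map its weights onto the
    # corresponding Megatron model definition via the bridge.
    bridge = AutoBridge.from_hf_pretrained(hf_model_id)  # assumed API
    # Return a Megatron-side model provider built from the converted weights.
    return bridge.to_megatron_provider()  # assumed API


# Hypothetical usage with an example model id:
# provider = hf_checkpoint_to_megatron("meta-llama/Llama-3.1-8B")
```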