# Changelog

## NVIDIA Megatron-Bridge 0.1.0rc3

* Model Collection Support
  * Llama
  * Qwen 2, Qwen 3, Qwen 3 MoE
  * DeepSeek
  * Mamba
* [Migration guide from NeMo 2 to Megatron-Bridge](https://docs.nvidia.com/nemo/megatron-bridge/0.1.0/nemo2-migration-guide.html)
* [Contribution guide for adding a new model](https://docs.nvidia.com/nemo/megatron-bridge/0.1.0/adding-new-models.html)
* [Checkpoint conversion from Hugging Face to Megatron](https://github.com/NVIDIA-NeMo/Megatron-Bridge/tree/main/src/megatron/bridge/models/conversion) (see the conversion sketch after this list)
* [Performance](https://docs.nvidia.com/nemo/megatron-bridge/0.1.0/performance-summary.html)
  * MoE LLM
    * Changed the model to dropless MoE (no token dropping) with balanced gating
    * Fusion of operators in the router function
    * Global permutation fusion with the all-to-all (A2A) dispatcher
    * Expert-parallel (EP) A2A communication overlap with computation in both 1F1B pipelined and non-pipelined training
    * Precision-aware optimizer updates with support for BF16 optimizer states
  * Megatron FSDP
    * Migration from MCore FSDP to Megatron FSDP
    * Fusion of the weight-gradient copy (into the reduce-scatter communication buffer) with the WGRAD GEMM
    * Removed redundant optimizer operations
    * Use ZeRO-1 (optimizer and master parameter sharding) in the replica domain of hybrid FSDP to further lower memory usage
    * IB-SHARP support for the InfiniBand AllReduce of hybrid FSDP, delivered as a patch with NCCL 2.28
  * MXFP8
    * Improved activation-gradient all-gather overlap performance via userbuffers
    * Parameter all-gather overlap with computation, sharing the communication buffer with reduce-scatter
    * Fusion of the MXFP8 scaling-factor swizzling kernels
    * Use of PDL (Programmatic Dependent Launch) for quantization kernels to lower CPU overhead
  * Others
    * Full-iteration CUDA graph for dense models without pipelining
    * Fusion of activation and cast operations (currently tensor-wise scaling only)
    * Store the SwiGLU input in FP8 to save activation memory (see the FP8 sketch after this list)
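
The Hugging Face-to-Megatron checkpoint conversion linked above is implemented by the bridge classes in `megatron.bridge.models.conversion`. Below is a minimal, hedged round-trip sketch; the `AutoBridge` entry point and the `from_hf_pretrained`, `to_megatron_provider`, and `save_hf_pretrained` names are assumptions inferred from the linked module, and exact signatures may differ between releases.

```python
# Hedged sketch of the Hugging Face -> Megatron -> Hugging Face round trip.
# The AutoBridge names below are assumptions, not a verified API of this release.
from megatron.bridge import AutoBridge

# Load a Hugging Face checkpoint and construct a bridge for it (assumed entry point).
bridge = AutoBridge.from_hf_pretrained("meta-llama/Llama-3.2-1B")

# Map the HF weights onto a Megatron model provider/config (assumed method name).
provider = bridge.to_megatron_provider()

# ... build and train the Megatron model produced from `provider` ...

# Export the (possibly fine-tuned) Megatron weights back to HF format (assumed method name).
# bridge.save_hf_pretrained(megatron_model, "./llama-3.2-1b-hf")
```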
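
The "Store the SwiGLU input in FP8" item above trades a small quantization error for roughly half the memory held for that activation between forward and backward. The sketch below is an illustrative, self-contained PyTorch version of the idea using a plain per-tensor cast to `torch.float8_e4m3fn`; it is not the Transformer Engine kernel actually used, and the `FP8SwiGLU` class is a hypothetical name introduced here for illustration.

```python
import torch


class FP8SwiGLU(torch.autograd.Function):
    """Illustrative sketch: keep the SwiGLU input for backward in FP8 (E4M3)
    instead of BF16, halving the memory held for this activation. A plain
    unscaled cast is used here; production kernels use scaled FP8 recipes."""

    @staticmethod
    def forward(ctx, x):
        # x has shape [..., 2 * d]: first half is the gate, second half the up projection.
        gate, up = x.chunk(2, dim=-1)
        out = torch.nn.functional.silu(gate) * up
        # Save the input for backward in FP8 rather than in its original dtype.
        ctx.save_for_backward(x.to(torch.float8_e4m3fn))
        ctx.orig_dtype = x.dtype
        return out

    @staticmethod
    def backward(ctx, grad_out):
        (x_fp8,) = ctx.saved_tensors
        x = x_fp8.to(ctx.orig_dtype)  # dequantize the stored activation
        gate, up = x.chunk(2, dim=-1)
        sig = torch.sigmoid(gate)
        silu = gate * sig
        grad_gate = grad_out * up * (sig + silu * (1 - sig))  # d(silu(gate))/d(gate)
        grad_up = grad_out * silu
        return torch.cat([grad_gate, grad_up], dim=-1)


# Usage: the activation saved for backward is FP8, roughly half the bytes of a BF16 copy.
x = torch.randn(4, 8, dtype=torch.bfloat16, requires_grad=True)
y = FP8SwiGLU.apply(x)
y.sum().backward()
print(x.grad.shape)  # torch.Size([4, 8])
```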
## NVIDIA Megatron-Bridge 0.1.0a0

* Llama and Qwen model support
|