# Changelog

## NVIDIA Megatron Core 0.12.0

- Add FP8 recipe selection to arguments (--fp8-recipe, --first-last-layers-bf16, --num-layers-at-start-in-bf16, --num-layers-at-end-in-bf16)
- Context parallel: fix loss scaling when calculate_per_token_loss=True
- Make the number of data-parallel communication buckets configurable (--ddp-num-buckets, --ddp-pad-buckets-for-high-nccl-busbw)
- Inference
  - Support in-flight batching and a chunked KV cache
  - Reduce memory usage
    - by not materializing the full attention mask
    - by only materializing logits for the last token during decode
    - by removing an obsolete tensor reference
- Hybrid Model
  - Inference
    - Add CUDA graph support
    - Change tools/run_mamba_text_generation_server.py to use megatron.core.inference
    - Fix a shape issue when materializing logits for the Mamba model
  - Improve initialization of Mamba layers
  - Add configuration switches (--mamba-state-dim, --mamba-head-dim, --mamba-num-groups, --is-hybrid-model)
  - Make num_floating_point_operations work with hybrid models
  - Make hybrid_conversion.py work with a mixer that uses TE linear layers
  - Add FP8 support
  - Fix Mamba dt_bias tensor parallelism
  - Support multimodal tokenizer
  - Improve data-parallelism scaling
- MoE
  - Features:
    - DeepEP support, compatible with all parallelisms and both token-drop and dropless modes
    - Important precision improvement: enable FP32/FP64 routing and unpermutation via --moe-router-dtype; FP32 is recommended for all fine-grained MoE training
    - CUDA graph support for MoE
    - Multi-Token Prediction (MTP) support
    - Fused indices_to_multihot kernel for the DeepEP dispatcher
  - Bug fixes:
    - Fix a hang issue with MoE+Dense hybrid models
    - Update theoretical memory and TFLOPS estimation for MoE and MLA
    - Fix MoE aux-loss scaling for per-token loss
    - Fixes for group-limited routing and expert bias, verified through DeepSeek-V3 (dsv3) end-to-end runs
  - Known issues:
    - Checkpoints trained with custom FSDP for MoE may not be compatible with 3D-parallel training

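The fused indices_to_multihot kernel above accelerates a simple mapping: each token's top-k expert indices become a multi-hot mask over all experts. The following NumPy sketch shows the semantics of that mapping only; the function signature and shapes are illustrative assumptions, not the kernel's actual interface.

```python
import numpy as np

def indices_to_multihot(indices: np.ndarray, num_experts: int) -> np.ndarray:
    """Convert per-token top-k expert indices of shape (tokens, k) into a
    multi-hot mask of shape (tokens, num_experts).

    Illustrative re-implementation of the operation's semantics; the
    fused kernel in Megatron Core runs on GPU for the DeepEP dispatcher.
    """
    tokens, _ = indices.shape
    multihot = np.zeros((tokens, num_experts), dtype=bool)
    # Scatter True at each token's selected expert slots.
    np.put_along_axis(multihot, indices, True, axis=1)
    return multihot

# Two tokens, top-2 routing over 4 experts.
mask = indices_to_multihot(np.array([[0, 2], [1, 3]]), num_experts=4)
```
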
## NVIDIA Megatron Core 0.11.0

- Add multi-datacenter training support through N/S connection
- Fix TEGroupedMLP distckpt compatibility issue with FP8 padding/unpadding
- Known issues:
  - When training a Dense+MoE hybrid model, the process will hang if any PP rank does not have expert params
- Add MX-FP16 support for the optimizer and master weights
- CUDA graph memory optimizations
- Enable the UCC backend for PP communication
- Optimizer CPU offload support for memory savings
- Models
  - Initial RADIO/CRADIO implementation
  - Llama 3.2 support
- Hybrid Model
  - Support quantization via TensorRT Model Optimizer
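
The optimizer and master-weight items above build on the standard mixed-precision pattern: the optimizer keeps an FP32 master copy of each parameter, applies the update there, and refreshes the low-precision working copy from it. A minimal NumPy sketch of that pattern with plain SGD (MX-FP16's block-scaled format is not reproduced here; all names are illustrative):

```python
import numpy as np

def master_weight_step(master_fp32, grad, lr):
    """One SGD step on the FP32 master weights; returns the refreshed
    FP16 working copy. Sketch of the master-weight pattern only; real
    optimizers (and the MX-FP16 format) are more involved."""
    # Accumulate in FP32 so tiny updates are not lost to FP16 rounding.
    master_fp32 -= lr * grad.astype(np.float32)
    return master_fp32.astype(np.float16)

# A parameter at 1.0 receiving updates of ~1e-5 per step: far below
# FP16's resolution near 1.0, so an FP16-only accumulator would never
# move, while the FP32 master accumulates the updates correctly.
master = np.array([1.0], dtype=np.float32)
grad = np.array([1e-4], dtype=np.float16)
for _ in range(10):
    working = master_weight_step(master, grad, lr=0.1)
```
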

## NVIDIA Megatron Core 0.10.0

- Qwen model support
- Known issues
  - When using sequence parallelism, dropout in the transformer-block forward pass does not use the appropriate RNG context
- NVRx / fault tolerance
  - Fault and hang detection in addition to existing straggler detection
  - Graceful exit and auto-restart
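
Hang detection of the kind listed above is commonly built on heartbeats: workers timestamp their progress, and a watchdog flags a hang when the heartbeat goes stale. A minimal generic sketch of that idea (the class name, API, and timeout are invented for illustration, not NVRx's interface):

```python
import threading
import time

class HeartbeatWatchdog:
    """Flag a hang when no heartbeat arrives within `timeout` seconds.

    Generic illustration of heartbeat-based hang detection; the actual
    NVRx mechanism and API differ."""

    def __init__(self, timeout: float):
        self.timeout = timeout
        self._lock = threading.Lock()
        self._last_beat = time.monotonic()

    def beat(self) -> None:
        # Called from the training loop whenever it makes progress.
        with self._lock:
            self._last_beat = time.monotonic()

    def is_hung(self) -> bool:
        with self._lock:
            return time.monotonic() - self._last_beat > self.timeout

wd = HeartbeatWatchdog(timeout=0.05)
wd.beat()
alive = not wd.is_hung()   # fresh heartbeat: not flagged
time.sleep(0.1)            # simulate a stalled worker
hung = wd.is_hung()
```
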
## NVIDIA Megatron Core 0.8.0
