Releases: NVIDIA/Megatron-LM

NVIDIA Megatron Core 0.13.0rc0

27 Jun 16:39

Prerelease: NVIDIA Megatron Core 0.13.0rc0 (2025-06-27)

NVIDIA Megatron Core 0.12.2

09 Jul 07:46

Merge branch 'completions_unit_test_fix' into 'core_r0.12.0'

Fixes for completions endpoint unit test

See merge request ADLR/megatron-lm!3445

NVIDIA Megatron Core 0.12.1

23 May 09:54

Merge branch 'gaod/llama4/te_fix' into 'core_r0.12.0'

Fix the TE assertion for release

See merge request ADLR/megatron-lm!3340

NVIDIA Megatron Core 0.12.0

06 May 21:10
core_v0.12.0 (commit d580efc)

  • Add FP8 recipe selection to arguments (--fp8-recipe, --first-last-layers-bf16, --num-layers-at-start-in-bf16, --num-layers-at-end-in-bf16)
  • Context parallel: fix loss scaling when calculate_per_token_loss=True
  • Make the number of data parallel communication buckets configurable (--ddp-num-buckets, --ddp-pad-buckets-for-high-nccl-busbw)
  • Inference
    • Support in-flight batching and chunked KV cache
    • Reduce memory usage
      • by not materializing the full attention mask
      • by materializing logits only for the last token during decode
      • by removing an obsolete tensor reference
  • Hybrid Model
    • Inference
      • Add CUDA graph support
      • Change tools/run_mamba_text_generation_server.py to use megatron.core.inference
      • Fix a shape issue when materializing logits for Mamba model
    • Improve initialization of Mamba layers
    • Add configuration switches (--mamba-state-dim, --mamba-head-dim, --mamba-num-groups, --is-hybrid-model)
    • Make num_floating_point_operations work with hybrid model
    • Make hybrid_conversion.py work with mixer that uses TE linear
    • Add FP8 support
    • Fix Mamba dt_bias tensor parallelism
    • Support multimodal tokenizer
    • Improve data parallelism scaling
  • MoE
    • Features:
      • DeepEP support, compatible with all parallelism modes and both token-drop and dropless dispatch
      • Important precision improvement: enable FP32/FP64 routing and unpermutation via --moe-router-dtype; FP32 is recommended for all fine-grained MoE training
      • CUDA Graph support for MoE
      • Multi-Token Prediction (MTP) Support
      • Fused indices_to_multihot kernel for DeepEP dispatcher
    • Bug fixes:
      • Fix hang issue with MoE + dense hybrid models
      • Update theoretical memory and TFLOPS estimation for MoE and MLA
      • Fix MoE aux-loss scaling for per-token loss
      • Fixes for group-limited routing and expert bias, verified through DeepSeek-V3 end-to-end runs
    • Known issues:
      • Checkpoints trained with custom FSDP for MoE may not be compatible with 3D-parallel training.
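
The new flags listed above could be combined in a launch command like the following sketch. The script name, launcher, and flag values are illustrative placeholders, not a tested recipe from this release:

```shell
# Illustrative sketch only: combining the new 0.12.0 flags.
# pretrain_gpt.py, torchrun settings, and the "delayed" recipe value
# are assumptions; consult the arguments reference for valid values.
torchrun --nproc-per-node 8 pretrain_gpt.py \
    --fp8-recipe delayed \
    --first-last-layers-bf16 \
    --num-layers-at-start-in-bf16 1 \
    --num-layers-at-end-in-bf16 1 \
    --moe-router-dtype fp32 \
    --ddp-num-buckets 8 \
    --ddp-pad-buckets-for-high-nccl-busbw
```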

NVIDIA Megatron Core 0.12.0rc3

15 Apr 19:50

Prerelease: NVIDIA Megatron Core 0.12.0rc3 (2025-04-15)

NVIDIA Megatron Core 0.12.0rc2

09 Apr 10:27

Prerelease: NVIDIA Megatron Core 0.12.0rc2 (2025-04-09)

NVIDIA Megatron Core 0.11.0

14 Mar 22:59
commit aa6207e

  • Add multi-datacenter training support through N/S connection
  • MoE
    • Features
      • Support DeepSeek-V3 fine-tuning
        • Aux-loss-free load balancing strategy
        • Node-limited routing and device-limited routing support
        • Tensor Parallelism support for MLA and Sequence Auxiliary Loss
        • MTP (with TP and PP support) is coming soon.
      • Permutation / Unpermutation fusion kernel from TransformerEngine.
      • Uneven virtual pipeline parallel split support in the first and last PP stages
    • Bug fixes:
      • Fix the grad scale when TP != expert-TP and average_in_collective is enabled in DDP.
      • Fix TEGroupedMLP distckpt compatibility issue with FP8 padding/unpadding.
    • Known Issues:
      • When training a dense + MoE hybrid model, the process hangs if any PP rank has no expert params.
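
The aux-loss-free load balancing strategy listed above steers the router with a per-expert bias instead of an auxiliary loss term: the bias is added to router scores only for top-k expert selection, and is nudged after each step toward balanced load. A minimal sketch of one bias update, assuming a sign-based rule with an update speed `gamma` (names and the exact rule are illustrative, not Megatron Core's implementation):

```python
import numpy as np

def update_expert_bias(bias, tokens_per_expert, gamma=1e-3):
    """One bias update of an aux-loss-free balancing scheme (sketch).

    Experts that received more than the average load get their routing
    bias decreased; under-loaded experts get it increased. `gamma`
    (the update speed) is an assumed hyperparameter.
    """
    mean_load = tokens_per_expert.mean()
    # sign(): -1 for overloaded experts, +1 for underloaded ones
    return bias + gamma * np.sign(mean_load - tokens_per_expert)

# Toy example: expert 0 is overloaded, expert 3 is starved.
bias = update_expert_bias(np.zeros(4), np.array([300.0, 100.0, 100.0, 0.0]))
print(bias)  # expert 0 pushed down, the others pushed up
```

Because the bias only reorders top-k selection and never scales gating weights, balancing does not perturb the main loss the way an auxiliary loss does.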

NVIDIA Megatron Core 0.11.0rc0

20 Feb 10:43
commit 7c00175

Prerelease: NVIDIA Megatron Core 0.11.0rc0 (2025-02-20)

NVIDIA Megatron Core 0.10.0

17 Feb 17:31
commit 7ee599a

  • Add MLA to MCore
  • Enable FP8 for GroupedMLP
  • MoE Parallel Folding
  • Enhance MoE Architecture: Support MoE Layer Frequency Patterns and Configurable MoE FFN Hidden Size
  • Multimodal: NVLM training and evaluation support in MCore
  • Mamba Hybrid
    • Increase performance and reduce memory footprint of Triton language/compiler distributed caching
    • Add more unit testing and fix bugs
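
The configurable MoE layer frequency pattern mentioned above decides, per transformer layer, whether that layer uses a MoE or a dense FFN. A sketch of how such a spec might expand into a per-layer mask, assuming an integer means "every N-th layer is MoE" and a 0/1 list is used directly (the exact spec format in Megatron Core may differ):

```python
def moe_layer_mask(num_layers, moe_freq):
    """Expand a MoE layer-frequency spec into a per-layer 0/1 mask (sketch).

    An int N marks every N-th layer (starting from layer 0) as MoE;
    a list of 0/1 flags is taken as the explicit per-layer pattern.
    Illustrative only; not the Megatron Core API.
    """
    if isinstance(moe_freq, int):
        return [1 if i % moe_freq == 0 else 0 for i in range(num_layers)]
    assert len(moe_freq) == num_layers, "pattern must cover every layer"
    return list(moe_freq)

print(moe_layer_mask(8, 2))             # [1, 0, 1, 0, 1, 0, 1, 0]
print(moe_layer_mask(4, [0, 1, 1, 1]))  # first layer dense, rest MoE
```

The explicit-list form is what makes mixed dense/MoE stacks (e.g. a dense first layer) expressible.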

NVIDIA Megatron Core 0.9.0

24 Oct 10:30

  • Uneven pipeline parallelism
    • Enable pipeline parallelism where first and last ranks have fewer transformer layers than the intermediate ranks
  • Per layer CUDAGraph support for GPT training with Transformer Engine modules
  • Enable different TP sizes for the vision encoder
  • Enable pipeline parallelism for T5 & Llava models
  • Support multi-tile multi-image input in Llava models
  • MoE
    • FP8 support
    • Runtime upcycling support
    • Dispatcher implementation optimizations
    • Shared expert support with overlapping optimizations
    • Qwen model support
  • Mamba Hybrid
    • Main branch is no longer compatible with released checkpoints (use ssm branch)
    • Add distributed checkpointing
    • Fix bugs related to inference
    • Add unit tests
  • Known Issues
    • When using sequence parallelism, dropout in the transformer block forward pass does not use the appropriate RNG context.
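
The uneven pipeline parallelism feature at the top of this release lets the first and last pipeline ranks hold fewer transformer layers than the intermediate ranks, leaving headroom for the embedding and output layers. A sketch of the resulting layer distribution, with an illustrative function name and signature (not the Megatron Core API):

```python
def uneven_pp_split(num_layers, pp_size, first, last):
    """Distribute transformer layers over pipeline ranks (sketch).

    The first and last ranks hold `first` and `last` layers (typically
    fewer than the rest); the remaining layers are split evenly across
    the intermediate ranks. Illustrative only.
    """
    middle = num_layers - first - last
    ranks = pp_size - 2
    assert ranks > 0 and middle % ranks == 0, "middle layers must divide evenly"
    return [first] + [middle // ranks] * ranks + [last]

# 30 layers on 4 pipeline ranks, with lighter first/last stages:
print(uneven_pp_split(30, 4, 6, 8))  # [6, 8, 8, 8]
```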