Releases: NVIDIA/Megatron-LM

NVIDIA Megatron Core 0.13.0rc0

27 Jun 16:39

Prerelease: NVIDIA Megatron Core 0.13.0rc0 (2025-06-27)

NVIDIA Megatron Core 0.12.2

09 Jul 07:46

Merge branch 'completions_unit_test_fix' into 'core_r0.12.0'

Fixes for completions endpoint unit test

See merge request ADLR/megatron-lm!3445

NVIDIA Megatron Core 0.12.1

23 May 09:54

Merge branch 'gaod/llama4/te_fix' into 'core_r0.12.0'

Fix the TE assertion for release

See merge request ADLR/megatron-lm!3340

NVIDIA Megatron Core 0.12.0

06 May 21:10
core_v0.12.0 (commit d580efc)

  • Add FP8 recipe selection to arguments (--fp8-recipe, --first-last-layers-bf16, --num-layers-at-start-in-bf16, --num-layers-at-end-in-bf16)
  • Context parallel: fix loss scaling when calculate_per_token_loss=True
  • Make the number of data parallel communication buckets configurable (--ddp-num-buckets, --ddp-pad-buckets-for-high-nccl-busbw)
  • Inference
    • Support in-flight batching and chunked KV cache
    • Reduce memory usage
      • by not materializing the full attention mask
      • by materializing logits only for the last token during decode
      • by removing an obsolete tensor reference
  • Hybrid Model
    • Inference
      • Add CUDA graph support
      • Change tools/run_mamba_text_generation_server.py to use megatron.core.inference
      • Fix a shape issue when materializing logits for Mamba model
    • Improve initialization of Mamba layers
    • Add configuration switches (--mamba-state-dim, --mamba-head-dim, --mamba-num-groups, --is-hybrid-model)
    • Make num_floating_point_operations work with hybrid model
    • Make hybrid_conversion.py work with mixer that uses TE linear
    • Add FP8 support
    • Fix Mamba dt_bias tensor parallelism
    • Support multimodal tokenizer
    • Improve data parallelism scaling
  • MoE
    • Features:
      • DeepEP support, compatible with all parallelism modes and both token-drop and dropless dispatch
      • Important precision improvement: enable FP32/FP64 routing and unpermutation via --moe-router-dtype; FP32 is recommended for all fine-grained MoE training
      • CUDA Graph support for MoE
      • Multi-Token Prediction (MTP) Support
      • Fused indices_to_multihot kernel for DeepEP dispatcher
    • Bug fixes:
      • Fix hang issue with MoE + dense hybrid models
      • Update theoretical memory and TFLOPS estimation for MoE and MLA
      • Fix MoE aux-loss scaling for per-token loss
      • Fixes for group-limited routing and expert bias, verified through DeepSeek-V3 end-to-end runs
    • Known issues:
      • Checkpoints trained with custom FSDP for MoE may not be compatible with 3D-parallel training.
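
The new flags listed above could be combined in a launch command like the following sketch. The script name, launcher, and flag values are illustrative placeholders, not a tested recipe from this release:

```shell
# Illustrative sketch only: combining the new 0.12.0 flags.
# pretrain_gpt.py, torchrun settings, and the "delayed" recipe value
# are assumptions; consult the arguments reference for valid values.
torchrun --nproc-per-node 8 pretrain_gpt.py \
    --fp8-recipe delayed \
    --first-last-layers-bf16 \
    --num-layers-at-start-in-bf16 1 \
    --num-layers-at-end-in-bf16 1 \
    --moe-router-dtype fp32 \
    --ddp-num-buckets 8 \
    --ddp-pad-buckets-for-high-nccl-busbw
```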

NVIDIA Megatron Core 0.12.0rc3

15 Apr 19:50

Prerelease: NVIDIA Megatron Core 0.12.0rc3 (2025-04-15)

NVIDIA Megatron Core 0.12.0rc2

09 Apr 10:27

Prerelease: NVIDIA Megatron Core 0.12.0rc2 (2025-04-09)

NVIDIA Megatron Core 0.11.0

14 Mar 22:59
commit aa6207e

  • Add multi-datacenter training support through N/S connection
  • MoE
    • Features
      • Support DeepSeek-V3 fine-tuning
        • Aux-loss-free load balancing strategy
        • Node-limited routing and device-limited routing support
        • Tensor Parallelism support for MLA and Sequence Auxiliary Loss
        • MTP (with TP and PP support) is coming soon.
      • Permutation / Unpermutation fusion kernel from TransformerEngine.
      • Uneven virtual pipeline parallel split support in the first and last PP stages
    • Bug fixes:
      • Fix the grad scale when TP != expert-TP and average_in_collective is enabled in DDP.
      • Fix TEGroupedMLP distckpt compatibility issue with FP8 padding/unpadding.
    • Known Issues:
      • When training a dense + MoE hybrid model, the process hangs if any PP rank has no expert params.
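
The aux-loss-free load balancing strategy listed above steers the router with a per-expert bias instead of an auxiliary loss term: the bias is added to router scores only for top-k expert selection, and is nudged after each step toward balanced load. A minimal sketch of one bias update, assuming a sign-based rule with an update speed `gamma` (names and the exact rule are illustrative, not Megatron Core's implementation):

```python
import numpy as np

def update_expert_bias(bias, tokens_per_expert, gamma=1e-3):
    """One bias update of an aux-loss-free balancing scheme (sketch).

    Experts that received more than the average load get their routing
    bias decreased; under-loaded experts get it increased. `gamma`
    (the update speed) is an assumed hyperparameter.
    """
    mean_load = tokens_per_expert.mean()
    # sign(): -1 for overloaded experts, +1 for underloaded ones
    return bias + gamma * np.sign(mean_load - tokens_per_expert)

# Toy example: expert 0 is overloaded, expert 3 is starved.
bias = update_expert_bias(np.zeros(4), np.array([300.0, 100.0, 100.0, 0.0]))
print(bias)  # expert 0 pushed down, the others pushed up
```

Because the bias only reorders top-k selection and never scales gating weights, balancing does not perturb the main loss the way an auxiliary loss does.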

NVIDIA Megatron Core 0.11.0rc0

20 Feb 10:43
commit 7c00175

Prerelease: NVIDIA Megatron Core 0.11.0rc0 (2025-02-20)

NVIDIA Megatron Core 0.10.0

17 Feb 17:31
commit 7ee599a

  • Add MLA to MCore
  • Enable FP8 for GroupedMLP
  • MoE Parallel Folding
  • Enhance MoE Architecture: Support MoE Layer Frequency Patterns and Configurable MoE FFN Hidden Size
  • Multimodal: NVLM training and evaluation support in MCore
  • Mamba Hybrid
    • Increase performance and reduce memory footprint of Triton language/compiler distributed caching
    • Add more unit testing and fix bugs
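
The configurable MoE layer frequency pattern mentioned above decides, per transformer layer, whether that layer uses a MoE or a dense FFN. A sketch of how such a spec might expand into a per-layer mask, assuming an integer means "every N-th layer is MoE" and a 0/1 list is used directly (the exact spec format in Megatron Core may differ):

```python
def moe_layer_mask(num_layers, moe_freq):
    """Expand a MoE layer-frequency spec into a per-layer 0/1 mask (sketch).

    An int N marks every N-th layer (starting from layer 0) as MoE;
    a list of 0/1 flags is taken as the explicit per-layer pattern.
    Illustrative only; not the Megatron Core API.
    """
    if isinstance(moe_freq, int):
        return [1 if i % moe_freq == 0 else 0 for i in range(num_layers)]
    assert len(moe_freq) == num_layers, "pattern must cover every layer"
    return list(moe_freq)

print(moe_layer_mask(8, 2))             # [1, 0, 1, 0, 1, 0, 1, 0]
print(moe_layer_mask(4, [0, 1, 1, 1]))  # first layer dense, rest MoE
```

The explicit-list form is what makes mixed dense/MoE stacks (e.g. a dense first layer) expressible.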

NVIDIA Megatron Core 0.9.0

24 Oct 10:30

  • Uneven pipeline parallelism
    • Enable pipeline parallelism where first and last ranks have fewer transformer layers than the intermediate ranks
  • Per layer CUDAGraph support for GPT training with Transformer Engine modules
  • Enable different TP sizes for the vision encoder
  • Enable pipeline parallelism for T5 & Llava models
  • Support multi-tile multi-image input in Llava models
  • MoE
    • FP8 support
    • Runtime upcycling support
    • Dispatcher implementation optimizations
    • Shared expert support with overlapping optimizations
    • Qwen model support
  • Mamba Hybrid
    • Main branch is no longer compatible with released checkpoints (use ssm branch)
    • Add distributed checkpointing
    • Fix bugs related to inference
    • Add unit tests
  • Known Issues
    • When using sequence parallelism, dropout in the transformer block forward pass does not use the appropriate RNG context.
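
The uneven pipeline parallelism feature at the top of this release lets the first and last pipeline ranks hold fewer transformer layers than the intermediate ranks, leaving headroom for the embedding and output layers. A sketch of the resulting layer distribution, with an illustrative function name and signature (not the Megatron Core API):

```python
def uneven_pp_split(num_layers, pp_size, first, last):
    """Distribute transformer layers over pipeline ranks (sketch).

    The first and last ranks hold `first` and `last` layers (typically
    fewer than the rest); the remaining layers are split evenly across
    the intermediate ranks. Illustrative only.
    """
    middle = num_layers - first - last
    ranks = pp_size - 2
    assert ranks > 0 and middle % ranks == 0, "middle layers must divide evenly"
    return [first] + [middle // ranks] * ranks + [last]

# 30 layers on 4 pipeline ranks, with lighter first/last stages:
print(uneven_pp_split(30, 4, 6, 8))  # [6, 8, 8, 8]
```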