Releases: NVIDIA-NeMo/Megatron-Bridge

NVIDIA Megatron-Bridge 0.2.1

18 Dec 00:04
v0.2.1
1c43b39

  • Performance
    • Activation offloading to host memory with pipelining
      • Supports the high activation memory needs of MoE model training with dynamic shapes
    • Fixed the Nemotron FLOPS calculation
  • Model Collection Support
    • Ministral 3
  • Enhanced LoRA support
    • LoRA support for Mamba layers (for fine-tuning Nemotron Nano V2 and NemotronH; see the sketch below)
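
To illustrate what LoRA on Mamba layers amounts to, here is a minimal plain-PyTorch sketch (not the Megatron-Bridge API): a low-rank adapter wrapped around one of the mixer's linear projections while the pretrained weight stays frozen. The projection name `in_proj` is illustrative.

```python
# Conceptual sketch only: a LoRA adapter around a linear projection, as used on
# Mamba mixer projections for fine-tuning. Plain PyTorch, not Megatron-Bridge code.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)   # freeze the pretrained weight
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)        # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

# e.g. wrap a Mamba mixer's input projection (layer name is illustrative):
layer = LoRALinear(nn.Linear(256, 512), rank=16, alpha=32)
out = layer(torch.randn(4, 256))
```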

NVIDIA Megatron-Bridge 0.2.0

04 Dec 23:56
v0.2.0
7af9601

NVIDIA Megatron-Bridge 0.1.0rc4

23 Oct 20:35
6725f70

Pre-release
  • Fix docs build
  • Update performance scripts

NVIDIA Megatron-Bridge 0.1.0rc3

08 Oct 01:05
bf71eba

Pre-release
  • Model Collection Support
    • Llama
    • Qwen 2, Qwen 3, Qwen 3 MoE
    • DeepSeek
    • Mamba
  • Migration guide from NeMo 2 to Megatron-Bridge
  • Contribution guide for adding a new model
  • Checkpoint conversion from Hugging Face to Megatron (see the conversion sketch after this list)
  • Performance
    • MoE LLM
      • Changed the MoE model to dropless with balanced gating
      • Fusion of operators in router function
      • Global permutation fusion with A2A dispatcher
      • EP A2A communication overlap with computation in both 1F1B pipelining and non-pipelined training
      • Precision-aware optimizer update to support BF16 states
    • Megatron FSDP
      • Migration from Megatron Core (mcore) FSDP to Megatron FSDP
      • Fused the copy of weight gradients into the reduce-scatter communication buffer with the WGRAD GEMM
      • Removed redundant optimizer operations
      • Use ZeRO-1 (optimizer and master parameter sharding) in the replica domain of hybrid FSDP to further lower memory usage
      • IB SHARP support for the InfiniBand AllReduce of hybrid FSDP (via a patch with NCCL 2.28)
    • MXFP8
      • Improved activation-gradient all-gather overlap performance via userbuffers
      • Parameter all-gather overlapped with computation, with the communication buffer shared with reduce-scatter
      • Fusion of MXFP8 scaling factor swizzling kernels
      • Use PDL (Programmatic Dependent Launch) for quantization kernels to lower CPU overhead
    • Others
      • Full-iteration CUDA graph for dense models without pipelining
      • Fusion of activation and cast (currently tensor-wise scaling only)
      • Store SwiGLU input in FP8 to save activation memory
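
For the Hugging Face → Megatron checkpoint conversion listed above, the sketch below shows the typical flow. The `AutoBridge` entry point and its `from_hf_pretrained` / `to_megatron_provider` methods follow the project's documented usage pattern, but treat them as assumptions and verify against this release's docs.

```python
# Minimal sketch: loading a Hugging Face checkpoint and converting it to a
# Megatron model. Class and method names are assumed from Megatron-Bridge's
# documented AutoBridge pattern; verify against the release documentation.
from megatron.bridge import AutoBridge

# Load the Hugging Face checkpoint and build a bridge around it.
bridge = AutoBridge.from_hf_pretrained("meta-llama/Llama-3.2-1B")

# Obtain a Megatron model provider with the converted weights, which can then
# feed pretraining or fine-tuning, or be saved as a Megatron checkpoint.
provider = bridge.to_megatron_provider()
```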

NVIDIA Megatron-Bridge 0.1.0a0

15 Aug 13:59
c6976d9

Pre-release
  • Llama and Qwen
  • Pretrain/SFT
  • PEFT
  • Recipe structure with examples for plain Python and NeMo Run usage (see the sketch below)
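
As a rough illustration of the plain-Python recipe flow, the sketch below shows the general pattern of building a configuration from a recipe and overriding fields before launching training. All module paths, factory names, and config fields here are hypothetical placeholders; see the examples shipped with this release for the real entry points.

```python
# Hypothetical sketch of plain-Python recipe usage; the module path, factory name,
# and config fields below are placeholders, not the verified Megatron-Bridge API.
from megatron.bridge.recipes.llama import llama3_8b  # assumed recipe module

cfg = llama3_8b.pretrain_config()   # assumed factory returning a config object
cfg.train.train_iters = 100         # override fields before launching

# The config would then be handed to the training entry point, e.g.:
# pretrain(cfg)
```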