Releases · NVIDIA/Megatron-LM

26 Feb 04:17

svcnvidia-nemo-ci

core_v0.16.0

3bec9aa

NVIDIA Megatron Core 0.16.0 Latest

Latest

Changelog Details

ci: Fix copyright checker by @ko3n1g :: PR: #1893
chore: Add codeowners by @ko3n1g :: PR: #1897
ci: Extend queue-manager for dev branch by @ko3n1g :: PR: #1906
ci: Move test optimizer into its own bucket by @ko3n1g :: PR: #1909
ci: Configure cherrypick bot by @ko3n1g :: PR: #1925
Ci approve dev by @ko3n1g :: PR: #1933
ci: Update nightly schedule by @ko3n1g :: PR: #1934
ci: Bump pre-flight for runs on main/dev by @ko3n1g :: PR: #1935
ci: Allow skipping on main by @ko3n1g :: PR: #1936
Ko3n1g/ci/pr template community bot by @ko3n1g :: PR: #1937
ci: More granular unit tests buckets by @ko3n1g :: PR: #1932
Add sequence packing to RL by @tdene :: PR: #1911
chore: Update template by @ko3n1g :: PR: #1939
chore: Add description about who can merge by @ko3n1g :: PR: #1940
Ko3n1g/ci/fix main on eos by @ko3n1g :: PR: #1938
Ko3n1g/ci/internal mrs by @ko3n1g :: PR: #1942
ci: Fix branch of approval bot by @ko3n1g :: PR: #1944
ci: Approvalbot for other branches by @ko3n1g :: PR: #1947
ci(fix): Approval bot by @ko3n1g :: PR: #1949
Ko3n1g/ci/sync branches by @ko3n1g :: PR: #1956
Ko3n1g/ci/add milestone by @ko3n1g :: PR: #1951
Remove M-FSDP testing under LTS environment by @shjwudp :: PR: #1959
ci: Run on push to release branch by @ko3n1g :: PR: #1960
Fix typo in rl section of CODEOWNERS by @tdene :: PR: #1968
ci: Update copyright checker by @ko3n1g :: PR: #1973
Ko3n1g/ci/auto reminder GitHub by @ko3n1g :: PR: #1955
ci(fix): Run tests label by @ko3n1g :: PR: #1970
Make get_asyncio_loop safe to use repeatedly by @tdene :: PR: #1990
chore: Update codeowners by @ko3n1g :: PR: #2012
zarr soft deprecation by @dimapihtar :: PR: #2004
Deduplicate dynamic engine + coordinator. by @lmcafee-nvidia :: PR: #1981
Update symmetric registration interface to sync-up with upstream pytorch change by @youngeunkwon0405 :: PR: #1924
Safely access state dict args in load ckpt by @maanug-nv :: PR: #1957
Allow mixed-batch sampling in dynamic inference by @tdene :: PR: #1927
Stop Nemo_CICD_Test from failing in forks by @tdene :: PR: #2024
Clean up dynamic inference step by @tdene :: PR: #1992
ci: Auto-update copy-pr-bot vetters by @ko3n1g :: PR: #1850
ci: Fix build-push-wheel workflow by @ko3n1g :: PR: #2022
ci: Enable integration tests by @ko3n1g :: PR: #2023
chore: Update tooling for interactive jobs by @ko3n1g :: PR: #2032
Have datasets account for tokenizers which incorrectly define PAD by @tdene :: PR: #2017
revert(hotfix): ci: trustees_override by @ko3n1g :: PR: #2041
add missing warnings import in model parallel config by @yashaswikarnati :: PR: #2039
Reduce-scatter implementation with FP32 accumulation by @deepakn94 :: PR: #1967
ci(fix): Workflows on main by @ko3n1g :: PR: #2045
build: Bump modelopt by @ko3n1g :: PR: #2046
Remove TestCaptureFreezeGC unit test. by @lmcafee-nvidia :: PR: #1978
ci: Add multi-approval action by @ko3n1g :: PR: #2051
Ko3n1g/ci/test iteration time by @ko3n1g :: PR: #2067
Allow inference test throughput to vary by 10% by @mathemakitten :: PR: #2070
chore: Fix autoformatter by @ko3n1g :: PR: #2073
ci(hotfix): Bypass approvalbot in merge-queue by @ko3n1g :: PR: #2082
chore: Update local tooling by @ko3n1g :: PR: #2066
Add extra RL files by @tdene :: PR: #2077
Prevent summary jobs from running in forks by @tdene :: PR: #2083
ci: Fix test scope by @ko3n1g :: PR: #2091
Refactor the attention metadata into separate classes by @kanz-nv :: PR: #2001
Guard against incorrectly using MoE prefill graphs by @tdene :: PR: #2030
Run mr-slim tests in lightweight-mode by @chtruong814 :: PR: #2106
Inference | Lazy compile UVM allocator. by @lmcafee-nvidia :: PR: #1977
chore: Reenable trustees by @ko3n1g :: PR: #2108
Ko3n1g/chore/update release settings by @ko3n1g :: PR: #2097
ci(fix): Changeset of copyright checker by @ko3n1g :: PR: #2110
Remove unnecessary check on rotary_pos_cos by @santhnm2 :: PR: #2003
(Reverted) Inference | Lazy compile UVM allocator. by @lmcafee-nvidia :: PR: #2125
Refactor Attention Metadata to Separate Classes by @kanz-nv :: PR: #2112
Refactor model_provider to model_builder format for ModelOpt examples by @AAnoosheh :: PR: #2107
wandb Inference stats logging by @wdykas :: PR: #2026
Make PipelineParallelLayout always return str from __repr__ by @ananthsub :: PR: #2055
Add flash_attn_3 as first option for FA3 import by @santhnm2 :: PR: #2010
Add debugging hint for case when cudagraphs are created but no matching runner is found by @mathemakitten :: PR: #2129
ci: LTS container by @ko3n1g :: PR: #2133
Fix param init by @cuichenx :: PR: #2033
Hotfix to unit tests on hopper FA3 by @tdene :: PR: #2143
Add BytesIO to safe_globals by @tdene :: PR: #2074
add deprecation warning for legacy tokenizer system by @dimapihtar :: PR: #2145
replay: ci: Bump LTS container by @ko3n1g :: PR: #2157
Hotfix to unit tests on hopper FA3 (bis) by @tdene :: PR: #2179
Fix has_modelopt_state() for native Torch checkpoint format by @AAnoosheh :: PR: #2160
chore: Remove codeowners by @ko3n1g :: PR: #2175
Fix FP8 inference with sequence parallelism by @santhnm2 :: PR: #2009
Replace ModelOpt generation server by @AAnoosheh :: PR: #2147
Add hybrid model support for dynamic inference engine by @santhnm2 :: PR: #1907
Async task and event loop safety in Megatron Core by @tdene :: PR: #2025
Rename skip_prompt_log_probs by @tdene :: PR: #2181
Dynamic inference context | UVM only. by @lmcafee-nvidia :: PR: #1983
ci: Run auto-update-copy-pr-bot only on forks by @ko3n1g :: PR: #2191
Inference throughput tests: refactor goldens to be in list format by @mathemakitten :: PR: #2072
Enable TE custom quantization recipe by @negvet :: PR: #2005
Add MoE parameters to ModelOpt pruning example + conf fixes by @kevalmorabia97 :: PR: #2205
Add repr to pg collection class by @yashaswikarnati :: PR: #2089
Move data_samplers.py from legacy to training.datasets & add DistributedSignalHandler to DataLoader workers by @asolergi-nv :: PR: #2068
Fix Megatron-FSDP checkpoint save failure by @shjwudp :: PR: #2138
Fix moe CODEOWNERS. by @jaredcasper :: PR: #2200
chore: Update LICENSE by @ko3n1g :: PR: #2219
remove megatron.training dependency from megatron.core for FSDP checkpoint with EP by @ananthsub :: PR: #2113
Tensorize dynamic inference mixed sampling by @tdene :: PR: #2105
Add unit test for inference DP coordinator by @tdene :: PR: #2187
Inference linear layer by @sidsingh-nvidia :: PR: #1908
chore: Prefer Nvidia email addresses for reminder bot by @ko3n1g :: PR: #2221
[Megatron-FSDP] Fix hang caused by non-deterministic reduce-scatter by @shjwudp :: PR: #2218
Remove qwen symlink to fix for case-insensitive FS by @kevalmorabia97 :: PR: #2235
Optimizer refactor: clean up public get_megatron_optimizer interface and provide a more general API to support passing in different hyperparameters to subsets of parameters by @deepakn94 :: PR: #2047
Fix CI for PR#1983 by @lmcafee-nvidia :: PR: #2245
Fix aux-loss logging for hybrid models by @deepakn94 :: PR: #2197
Update flops calculation (for throughput) for hybrid MoEs by @deepakn94 :: PR: #2198
Enable kv cache in training for eagle by @yeyu-nvidia :: PR: #1895
Tensorize dynamic inference mixed sampling (bis) by @tdene :: PR: #2231
chore: Fix codeowners by @ko3n1g :: PR: #2264
Allow loading checkpoint from iteration 0 by @ananthsub :: PR: #2199
ci: Skip install test in merge queue by @chtruong814 :: PR: #2281
Add MoE layer type to hybrid models by @deepakn94 :: PR: #2259
Add the Hybrid-EP backend to the Flex Dispatcher by @Autumn1998 :: PR: #2176
[MAIN][NVFP4] Support NVFP4 MOE with Proper Padding by @zhongbozhu :: PR: #1985
Update ModelOpt example readmes and advanced usage by @kevalmorabia97 :: PR: #2273
Fix UVM compatibility with CUDA 13. by @lmcafee-nvidia :: PR: #2243
ci: Add flaky marker to LTS tests by @ko3n1g :: PR: #2290
Dynamic engine suspend/resume via prefill. by @lmcafee-nvidia :: PR: #1982
fix: Pass the timeout argument for the EP group by @yanring :: PR: #2268
JIT for MoE router and preprocess by @yaox12 :: PR: #1919
Hotfix to CI, until the fix gets reviewed by @tdene :: PR: #2298
Add functional test for DP coordinator throughput by @tdene :: PR: #2189
Add asyncio Queue like in Python 3.13 by @tdene :: PR: #2224
Fixes for PR#1982 by @lmcafee-nvidia :: PR: #2303
Fix PP KV cache allocation and enable multi-node PP inference by @santhnm2 :: PR: #2182
Revert active-buffer-size-gb arg name. by @lmcafee-nvidia :: PR: #2257
feat: check: api backwards compatibility by @pablo-garay :: PR: #2251
Add MambaInferenceStateConfig dataclass by @santhnm2 :: PR: #2265
Fix typo in inference example by @santhnm2 :: PR: #2311
feat: initialization of API backward compatibility verification by @pablo-garay :: PR: #2310
Fix Mamba TP and remove confusing legacy initialization by @jaredcasper :: PR: #2202
Refactor KD to use ModelOpt plugins file by @AAnoosheh :: PR: #2305
Fix dynamic context syntax and remove redundant tensors by @kanz-nv :: PR: #2336
Improve asyncio exception handling by @tdene :: PR: #2300
ci: Upload to testpypi only on main by @ko3n1g :: PR: #2342
implement graph config by @kanz-nv :: PR: #2203
feat: required check adjustment by @pablo-garay :: PR: #2350
fix: load iteration 0 for release checkpoints by @ananthsub :: PR: #2351
Explicitly zero out padding token activations for dynamic inference by @santhnm2 :: PR: #2008
Bugfix for Mamba with Chunked-Prefill by @sidsingh-nvidia :: PR: #2293
Break apart dynamic inference step into 2 methods by @tdene :: PR: #2192
Prevent unnecessarily overwriting the default Hugging Face chat te...

Contributors

jaredcasper, sbhavani, and 88 other contributors

Assets 2

06 Feb 16:30

ko3n1g

v0.15.3

309ffca

NVIDIA Megatron Core 0.15.3

This release addresses known security issues. For the latest NVIDIA Vulnerability Disclosure Information visit https://www.nvidia.com/en-us/security/, for acknowledgement please reach out to the NVIDIA PSIRT team at PSIRT@nvidia.com

Assets 2

08 Jan 15:42

ko3n1g

core_v0.15.2

45b404c

NVIDIA Megatron Core 0.15.2

core_v0.15.2

Megatron-Core v0.15.2

Assets 2

07 Jan 18:23

ko3n1g

core_v0.15.1

512da5d

NVIDIA Megatron Core 0.15.1

core_v0.15.1

Core v0.15.1

Assets 2

17 Dec 23:08

ko3n1g

core_v0.15.0

0d7e02b

NVIDIA Megatron Core 0.15.0

Features
- Performance
  - Fused QKV preprocessing with precomputed RoPE caches (3x preprocessing speedup, 10-14% E2E) (MR !3912)
  - Use new TE interface for user buffers (MR !3886)
  - Add CPU activation offloading via TE (MR !4286)
  - Add setting to support Adam or AdamW optimizer (MR !3866)
- MoE
  - Add DTensor support for EP and DSv3 modules (MR !3955)
  - Add HybridEP backend to Flex Dispatcher (PR !2176)
  - Implement NVFP4 Zero Padding for MoE (PR !1985)
  - Compute shared experts before router (MR !4068)
  - Enable bias in expert MLP (MR !3858)
- Model support
  - Add YaRN support for GPT-OSS (MR !4044)
  - Add FP8 init for MTP (MR !3958)
  - Add fp8_dpa option for FP8 scaling (MR !4053)
- FSDP
  - Enable joint training of parallel modules (MR !3850)
- Inference
  - Add CUDA Graph runner lookup table cache (up to 2x E2E speedup) (MR !4082)
  - Add MoE dropping and padding router for CUDA Graph + decode (MR !3816)
  - Integrate unified memory for dynamic inference context (MR !3985)
- Post-training
  - Add GPT-OSS ModelOpt support with quantization, import/export (MR !4169)
  - Enable KD support with hybrid training loop (MR !4021)
  - Add ModelOpt pruning example (MR !4022)
- RL
  - Add importance sampling and partial rollouts to Megatron RL (MR !4000)
  - Add sequence packing for RL (MR !4191)
- Ease of use
  - Handle CUDA absence during import (MR !4120)
  - Enable SWA mixing with attention (MR !3855)
Bug fixes
- Fix convergence bug in MXFP8 parameter gradient buffer reuse (MR !3999)
- Fix loss mask cloning to prevent incorrect updates (MR !4164)
- Fix metadata loss in checkpoints (MR !4182)
- Fix FSDP grad accum fusion support (MR !4018)
- Fix non-TE optimizer checkpoint issue (MR !3931)
- Fix BERT virtual pipeline parallelism (MR !3993)
- Fix gc.freeze() slowdown by adding gc.collect() on last layer (MR !4003)
- Fix full iteration CUDA graph non-tensor handling (MR !4019)
- Fix model_auto_sync mis-set and add gradient assertion (MR !4062)
- Fix HF import dtype and checkpoint loading issues (MR !4095)
- Fix missing initialization in ProcessGroupCollection (MR !4159)
- Fix sink attention TP (MR !4173)
- Fix 1f1b overlap unit tests for MTP standalone (MR !4210)
- Fix stale state dict handling (MR !4226)
Known issues
New Contributors
- @marksverdhei made their first contribution in #1980
- @Skylion007 made their first contribution in #2047
- @azzhipa made their first contribution in 5db6704
- @vicoooo26 made their first contribution in 5db6704
- @A-transformer made their first contribution in e002b5c
- @chaitanyadwivedii made their first contribution in 20b3954

We'd like to thank all our external contributors whose work was merged in this release:

External Contributor Acknowledgements
- Fix ImportError and NameError in examples/run_simple_mcore_train_loop.py by @marksverdhei in #1980
- Optimizer refactor: clean up public get_megatron_optimizer interface by @Skylion007 in #2047
- Typo fixes from community with co-authors @vicoooo26, @azzhipa, @A-transformer in 5db6704 and e002b5c
- Fix router input jitter dtype by @chaitanyadwivedii in 20b3954

Note: Some contributions came through internal MRs and use commit hashes instead of PR numbers. We are now GitHub first so all PRs moving forward will be tested and merged in public.

Contributors

Skylion007, vicoooo26, and 4 other contributors

Assets 2

08 Oct 15:04

ko3n1g

core_v0.14.0

23e00ed

NVIDIA Megatron Core 0.14.0

Features
- Inference
  - Add async support for DynamicInferenceEngine (MR !3187)
  - Pad input tensors and enable FP8 weights for FP8 inference (MR !3341)
  - Force inference to always gather logits with tensor parallelism (MR !3442)
  - Multi batch size CUDA Graphs for Dynamic Inference (MR !3402)
- Post-training
  - ModelOpt updates (MR !3268)
    - Add speculative decoding AR validation feature
    - Add DeepSeek and Qwen model configs
- Performance
  - ModelCommProcessGroup integration (MR !3391)
  - Add HyperCommGrid: N-Dimensional Communication Grid for Model Parallelism (MR !3398)
    - Flexible creation and management of communication groups
  - Add support for Spike No More embedding initializations and weight decay skipping (MR !3500)
- MoE
  - We're actively optimizing large-scale fine-grained MoE performance on Blackwell Platform.
  - Features:
    - Support Expert Parallel A2A Overlapping (MR !3470; MR !3074)
    - Support CP and recompute for MTP (MR !3330)
    - Add support for global aux loss (MR !3318)
  - Memory Optimization
    - Support recomputation for FP8 layernorm/moe_act/shared_experts (MR !3465)
    - Support optimizer offloading for DSV3 FP8 training (MR !3659)
  - Performance Optimization
    - Add MoE router fusion (MR !3809)
    - Updates for MoE cudagraph (MR !3631)
  - Bug fixes:
    - Fix router input jitter dtype (MR !3774)
- Model support
  - Add MiMo video VLM train example (MR !3543)
  - Add AVLM for MIMO (MR !3624)
- Ease of use
  - Add uv support for source installs (MR !3615)
  - Automated weekly prereleases (MR !3574)
Bug fixes
- Use mscale_all_dim for softmax_factor (MR !2800)
- Fix FP8 param blockwise scaling unit test (MR !3480)
- Fix unit test blockwise scaling (MR !3491)
- Optimize prefill for token-less requests (MR !3499)
- Add default values for Fp8Padding and Fp8Unpadding (MR !3501)
- Fix CUDA graph logic for flexible pp layout (MR !3505)
- Load FP8 models with strict=False (MR !3508)
- Skip rope check for torch < 1.4.0 (MR !3528)
- Disable Apex tests for stability (MR !3539)
- Fix typo in parallel_state expert parallelism (MR !3548)
- Guard modelopt on macOS (MR !3549)
- Retry on CUDA function failure (MR !3554)
- Fix NCCL mem pool creation error (MR !3557)
- Fix get_rotary_seq_len return type (MR !3559)
- Retry on CUDA function failure (MR !3560)
- Fix NCCL allocator attribute error (MR !3565)
- Ensure multi-prompt inference works (MR !3568)
- Fix MD5 on FIPS systems (MR !3577)
- Fixes dynamic context and inference bugs (MR !3582)
- Fix TE version for interleaved fused RoPE (MR !3586)
- Fix MTP with MoE and TP logging (MR !3594)
- Guard TE import fix (MR !3596)
- Add assertion for NCCL UB case (MR !3599)
- Remove Encoder PP related Functions (MR !3604)
- Fix segfaults in tests (MR !3605)
- Fix TE error in distributed optimizer (MR !3625)
- Remove redundant barrier in checkpoint flow (MR !3626)
- Support VPP MTP, fix logging (MR !3630)
- Retry mechanism for free(): invalid pointer errors (MR !3632)
- Fix test_replication.py issues (MR !3633)
- Fix typo in parallel_state (MR !3634)
- Fix CUDA graph logic determination (MR !3635)
- Fix TE installation error (MR !3636)
- Ensure correct sharding type in local tests (MR !3643)
- Fix cudagraphed backward buffer reuse for last layer (MR !3645)
- Set default for packed_seq_params in get_rotary_seq_len (MR !3651)
- Fix dynamic example script errors (MR !3653)
- Guard TE import fix (MR !3666)
Breaking changes:
- megatron.core.distributed.custom_fsdp refactored as breaking change to megatron.core.distributed.fsdp.src.megatron_fsdp
Known issues

Assets 2

03 Oct 14:41

mmarcinkiewicz

25.09-alpha.rc1

d339190

25.09-alpha.rc1

Add fp8 attn knobs

Assets 2

12 Aug 18:33

ko3n1g

core_v0.13.1

ea651a3

NVIDIA Megatron Core 0.13.1

Merge branch 'cherry-pick-f36e1705' into 'core_r0.13.0'

Cherry-pick 'Use ruff linter (3627)' into 'core_r0.13.0'

See merge request ADLR/megatron-lm!3793

Assets 2

11 Aug 04:12

ko3n1g

core_v0.14.0rc5

d338252

NVIDIA Megatron Core 0.14.0rc5 Pre-release

Pre-release

Prerelease: NVIDIA Megatron Core 0.14.0rc5 (2025-08-11)

Assets 2

12 Aug 18:12

ko3n1g

core_v0.12.3

3ea68ad

NVIDIA Megatron Core 0.12.3

Merge branch 'chtruong/cherry-pick-3627' into 'core_r0.12.0'

Cherry-pick 'use yaml safe load  (3627)' into 'core_r0.12.0'

See merge request ADLR/megatron-lm!3795

Assets 2

Releases: NVIDIA/Megatron-LM

NVIDIA Megatron Core 0.16.0

Contributors

Uh oh!

NVIDIA Megatron Core 0.15.3

Uh oh!

NVIDIA Megatron Core 0.15.2

Uh oh!

NVIDIA Megatron Core 0.15.1

Uh oh!

NVIDIA Megatron Core 0.15.0

Contributors

Uh oh!

NVIDIA Megatron Core 0.14.0

Uh oh!

25.09-alpha.rc1

Uh oh!

NVIDIA Megatron Core 0.13.1

Uh oh!

NVIDIA Megatron Core 0.14.0rc5

Uh oh!

NVIDIA Megatron Core 0.12.3

Uh oh!