Releases: NVIDIA/Megatron-LM
NVIDIA Megatron Core 0.16.0
Changelog Details
- ci: Fix copyright checker by @ko3n1g :: PR: #1893
- chore: Add codeowners by @ko3n1g :: PR: #1897
- ci: Extend queue-manager for dev branch by @ko3n1g :: PR: #1906
- ci: Move test optimizer into its own bucket by @ko3n1g :: PR: #1909
- ci: Configure cherrypick bot by @ko3n1g :: PR: #1925
- Ci approve dev by @ko3n1g :: PR: #1933
- ci: Update nightly schedule by @ko3n1g :: PR: #1934
- ci: Bump pre-flight for runs on main/dev by @ko3n1g :: PR: #1935
- ci: Allow skipping on main by @ko3n1g :: PR: #1936
- Ko3n1g/ci/pr template community bot by @ko3n1g :: PR: #1937
- ci: More granular unit tests buckets by @ko3n1g :: PR: #1932
- Add sequence packing to RL by @tdene :: PR: #1911
- chore: Update template by @ko3n1g :: PR: #1939
- chore: Add description about who can merge by @ko3n1g :: PR: #1940
- Ko3n1g/ci/fix main on eos by @ko3n1g :: PR: #1938
- Ko3n1g/ci/internal mrs by @ko3n1g :: PR: #1942
- ci: Fix branch of approval bot by @ko3n1g :: PR: #1944
- ci: Approvalbot for other branches by @ko3n1g :: PR: #1947
- ci(fix): Approval bot by @ko3n1g :: PR: #1949
- Ko3n1g/ci/sync branches by @ko3n1g :: PR: #1956
- Ko3n1g/ci/add milestone by @ko3n1g :: PR: #1951
- Remove M-FSDP testing under LTS environment by @shjwudp :: PR: #1959
- ci: Run on push to release branch by @ko3n1g :: PR: #1960
- Fix typo in rl section of CODEOWNERS by @tdene :: PR: #1968
- ci: Update copyright checker by @ko3n1g :: PR: #1973
- Ko3n1g/ci/auto reminder GitHub by @ko3n1g :: PR: #1955
- ci(fix): `Run tests` label by @ko3n1g :: PR: #1970
- Make `get_asyncio_loop` safe to use repeatedly by @tdene :: PR: #1990
- chore: Update codeowners by @ko3n1g :: PR: #2012
- zarr soft deprecation by @dimapihtar :: PR: #2004
- Deduplicate dynamic engine + coordinator. by @lmcafee-nvidia :: PR: #1981
- Update symmetric registration interface to sync-up with upstream pytorch change by @youngeunkwon0405 :: PR: #1924
- Safely access state dict args in load ckpt by @maanug-nv :: PR: #1957
- Allow mixed-batch sampling in dynamic inference by @tdene :: PR: #1927
- Stop Nemo_CICD_Test from failing in forks by @tdene :: PR: #2024
- Clean up dynamic inference step by @tdene :: PR: #1992
- ci: Auto-update copy-pr-bot vetters by @ko3n1g :: PR: #1850
- ci: Fix build-push-wheel workflow by @ko3n1g :: PR: #2022
- ci: Enable integration tests by @ko3n1g :: PR: #2023
- chore: Update tooling for interactive jobs by @ko3n1g :: PR: #2032
- Have datasets account for tokenizers which incorrectly define PAD by @tdene :: PR: #2017
- revert(hotfix): ci: trustees_override by @ko3n1g :: PR: #2041
- add missing warnings import in model parallel config by @yashaswikarnati :: PR: #2039
- Reduce-scatter implementation with FP32 accumulation by @deepakn94 :: PR: #1967
- ci(fix): Workflows on `main` by @ko3n1g :: PR: #2045
- build: Bump modelopt by @ko3n1g :: PR: #2046
- Remove TestCaptureFreezeGC unit test. by @lmcafee-nvidia :: PR: #1978
- ci: Add multi-approval action by @ko3n1g :: PR: #2051
- Ko3n1g/ci/test iteration time by @ko3n1g :: PR: #2067
- Allow inference test throughput to vary by 10% by @mathemakitten :: PR: #2070
- chore: Fix autoformatter by @ko3n1g :: PR: #2073
- ci(hotfix): Bypass approvalbot in merge-queue by @ko3n1g :: PR: #2082
- chore: Update local tooling by @ko3n1g :: PR: #2066
- Add extra RL files by @tdene :: PR: #2077
- Prevent summary jobs from running in forks by @tdene :: PR: #2083
- ci: Fix test scope by @ko3n1g :: PR: #2091
- Refactor the attention metadata into separate classes by @kanz-nv :: PR: #2001
- Guard against incorrectly using MoE prefill graphs by @tdene :: PR: #2030
- Run mr-slim tests in lightweight-mode by @chtruong814 :: PR: #2106
- Inference | Lazy compile UVM allocator. by @lmcafee-nvidia :: PR: #1977
- chore: Reenable trustees by @ko3n1g :: PR: #2108
- Ko3n1g/chore/update release settings by @ko3n1g :: PR: #2097
- ci(fix): Changeset of copyright checker by @ko3n1g :: PR: #2110
- Remove unnecessary check on rotary_pos_cos by @santhnm2 :: PR: #2003
- (Reverted) Inference | Lazy compile UVM allocator. by @lmcafee-nvidia :: PR: #2125
- Refactor Attention Metadata to Separate Classes by @kanz-nv :: PR: #2112
- Refactor model_provider to model_builder format for ModelOpt examples by @AAnoosheh :: PR: #2107
- wandb Inference stats logging by @wdykas :: PR: #2026
- Make `PipelineParallelLayout` always return str from `__repr__` by @ananthsub :: PR: #2055
- Add flash_attn_3 as first option for FA3 import by @santhnm2 :: PR: #2010
- Add debugging hint for case when cudagraphs are created but no matching runner is found by @mathemakitten :: PR: #2129
- ci: LTS container by @ko3n1g :: PR: #2133
- Fix param init by @cuichenx :: PR: #2033
- Hotfix to unit tests on hopper FA3 by @tdene :: PR: #2143
- Add BytesIO to safe_globals by @tdene :: PR: #2074
- add deprecation warning for legacy tokenizer system by @dimapihtar :: PR: #2145
- replay: ci: Bump LTS container by @ko3n1g :: PR: #2157
- Hotfix to unit tests on hopper FA3 (bis) by @tdene :: PR: #2179
- Fix has_modelopt_state() for native Torch checkpoint format by @AAnoosheh :: PR: #2160
- chore: Remove codeowners by @ko3n1g :: PR: #2175
- Fix FP8 inference with sequence parallelism by @santhnm2 :: PR: #2009
- Replace ModelOpt generation server by @AAnoosheh :: PR: #2147
- Add hybrid model support for dynamic inference engine by @santhnm2 :: PR: #1907
- Async task and event loop safety in Megatron Core by @tdene :: PR: #2025
- Rename skip_prompt_log_probs by @tdene :: PR: #2181
- Dynamic inference context | UVM only. by @lmcafee-nvidia :: PR: #1983
- ci: Run `auto-update-copy-pr-bot` only on forks by @ko3n1g :: PR: #2191
- Inference throughput tests: refactor goldens to be in list format by @mathemakitten :: PR: #2072
- Enable TE custom quantization recipe by @negvet :: PR: #2005
- Add MoE parameters to ModelOpt pruning example + conf fixes by @kevalmorabia97 :: PR: #2205
- Add repr to pg collection class by @yashaswikarnati :: PR: #2089
- Move `data_samplers.py` from `legacy` to `training.datasets` &amp; add `DistributedSignalHandler` to DataLoader workers by @asolergi-nv :: PR: #2068
- Fix Megatron-FSDP checkpoint save failure by @shjwudp :: PR: #2138
- Fix moe CODEOWNERS. by @jaredcasper :: PR: #2200
- chore: Update LICENSE by @ko3n1g :: PR: #2219
- remove `megatron.training` dependency from `megatron.core` for FSDP checkpoint with EP by @ananthsub :: PR: #2113
- Tensorize dynamic inference mixed sampling by @tdene :: PR: #2105
- Add unit test for inference DP coordinator by @tdene :: PR: #2187
- Inference linear layer by @sidsingh-nvidia :: PR: #1908
- chore: Prefer Nvidia email addresses for reminder bot by @ko3n1g :: PR: #2221
- [Megatron-FSDP] Fix hang caused by non-deterministic reduce-scatter by @shjwudp :: PR: #2218
- Remove qwen symlink to fix for case-insensitive FS by @kevalmorabia97 :: PR: #2235
- Optimizer refactor: clean up public `get_megatron_optimizer` interface and provide a more general API to support passing different hyperparameters to subsets of parameters by @deepakn94 :: PR: #2047
- Fix CI for PR#1983 by @lmcafee-nvidia :: PR: #2245
- Fix aux-loss logging for hybrid models by @deepakn94 :: PR: #2197
- Update flops calculation (for throughput) for hybrid MoEs by @deepakn94 :: PR: #2198
- Enable kv cache in training for eagle by @yeyu-nvidia :: PR: #1895
- Tensorize dynamic inference mixed sampling (bis) by @tdene :: PR: #2231
- chore: Fix codeowners by @ko3n1g :: PR: #2264
- Allow loading checkpoint from iteration 0 by @ananthsub :: PR: #2199
- ci: Skip install test in merge queue by @chtruong814 :: PR: #2281
- Add MoE layer type to hybrid models by @deepakn94 :: PR: #2259
- Add the Hybrid-EP backend to the Flex Dispatcher by @Autumn1998 :: PR: #2176
- [MAIN][NVFP4] Support NVFP4 MOE with Proper Padding by @zhongbozhu :: PR: #1985
- Update ModelOpt example readmes and advanced usage by @kevalmorabia97 :: PR: #2273
- Fix UVM compatibility with CUDA 13. by @lmcafee-nvidia :: PR: #2243
- ci: Add flaky marker to LTS tests by @ko3n1g :: PR: #2290
- Dynamic engine suspend/resume via prefill. by @lmcafee-nvidia :: PR: #1982
- fix: Pass the timeout argument for the EP group by @yanring :: PR: #2268
- JIT for MoE router and preprocess by @yaox12 :: PR: #1919
- Hotfix to CI, until the fix gets reviewed by @tdene :: PR: #2298
- Add functional test for DP coordinator throughput by @tdene :: PR: #2189
- Add asyncio Queue like in Python 3.13 by @tdene :: PR: #2224
- Fixes for PR#1982 by @lmcafee-nvidia :: PR: #2303
- Fix PP KV cache allocation and enable multi-node PP inference by @santhnm2 :: PR: #2182
- Revert active-buffer-size-gb arg name. by @lmcafee-nvidia :: PR: #2257
- feat: check: api backwards compatibility by @pablo-garay :: PR: #2251
- Add MambaInferenceStateConfig dataclass by @santhnm2 :: PR: #2265
- Fix typo in inference example by @santhnm2 :: PR: #2311
- feat: initialization of API backward compatibility verification by @pablo-garay :: PR: #2310
- Fix Mamba TP and remove confusing legacy initialization by @jaredcasper :: PR: #2202
- Refactor KD to use ModelOpt plugins file by @AAnoosheh :: PR: #2305
- Fix dynamic context syntax and remove redundant tensors by @kanz-nv :: PR: #2336
- Improve asyncio exception handling by @tdene :: PR: #2300
- ci: Upload to testpypi only on main by @ko3n1g :: PR: #2342
- implement graph config by @kanz-nv :: PR: #2203
- feat: required check adjustment by @pablo-garay :: PR: #2350
- fix: load iteration 0 for release checkpoints by @ananthsub :: PR: #2351
- Explicitly zero out padding token activations for dynamic inference by @santhnm2 :: PR: #2008
- Bugfix for Mamba with Chunked-Prefill by @sidsingh-nvidia :: PR: #2293
- Break apart dynamic inference step into 2 methods by @tdene :: PR: #2192
- Prevent unnecessarily overwriting the default Hugging Face chat te...
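Several entries above concern asyncio safety in the inference stack, including PR #2224 ("Add asyncio Queue like in Python 3.13"). Python 3.13 added `Queue.shutdown()` and the `QueueShutDown` exception to `asyncio`; a minimal stdlib backport sketch of those semantics (illustrative only, not Megatron-LM's actual implementation) looks like:

```python
import asyncio
from collections import deque


class QueueShutDown(Exception):
    """Raised by put()/get() once the queue has been shut down."""


class ShutdownQueue:
    """Sketch of a Python-3.13-style asyncio queue with shutdown().

    Hypothetical illustration: names and behavior mirror
    asyncio.Queue.shutdown() from Python 3.13, not Megatron's code.
    """

    def __init__(self):
        self._items = deque()
        self._cond = asyncio.Condition()
        self._is_shutdown = False

    async def put(self, item):
        async with self._cond:
            if self._is_shutdown:
                raise QueueShutDown
            self._items.append(item)
            self._cond.notify()

    async def get(self):
        async with self._cond:
            # Remaining items are still drained after shutdown(immediate=False).
            while not self._items:
                if self._is_shutdown:
                    raise QueueShutDown
                await self._cond.wait()
            return self._items.popleft()

    async def shutdown(self, immediate=False):
        async with self._cond:
            self._is_shutdown = True
            if immediate:
                self._items.clear()
            self._cond.notify_all()  # wake blocked getters so they can raise


async def demo():
    q = ShutdownQueue()
    await q.put("work")
    await q.shutdown()
    events = [await q.get()]  # existing item is drained first
    try:
        await q.get()         # now empty and shut down
    except QueueShutDown:
        events.append("shutdown")
    return events
```

Waking blocked getters via `notify_all()` and raising `QueueShutDown` lets consumer tasks exit cleanly instead of hanging forever, which is the hazard such a queue guards against.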
NVIDIA Megatron Core 0.15.3
This release addresses known security issues. For the latest NVIDIA Vulnerability Disclosure Information, visit https://www.nvidia.com/en-us/security/. For acknowledgement, please reach out to the NVIDIA PSIRT team at PSIRT@nvidia.com.
NVIDIA Megatron Core 0.15.2
core_v0.15.2 Megatron-Core v0.15.2
NVIDIA Megatron Core 0.15.1
core_v0.15.1 Core v0.15.1
NVIDIA Megatron Core 0.15.0
- Features
- Performance
- MoE
- Model support
- FSDP
- Enable joint training of parallel modules (MR !3850)
- Inference
- Post-training
- RL
- Ease of use
- Bug fixes
- Fix convergence bug in MXFP8 parameter gradient buffer reuse (MR !3999)
- Fix loss mask cloning to prevent incorrect updates (MR !4164)
- Fix metadata loss in checkpoints (MR !4182)
- Fix FSDP grad accum fusion support (MR !4018)
- Fix non-TE optimizer checkpoint issue (MR !3931)
- Fix BERT virtual pipeline parallelism (MR !3993)
- Fix gc.freeze() slowdown by adding gc.collect() on last layer (MR !4003)
- Fix full iteration CUDA graph non-tensor handling (MR !4019)
- Fix model_auto_sync mis-set and add gradient assertion (MR !4062)
- Fix HF import dtype and checkpoint loading issues (MR !4095)
- Fix missing initialization in ProcessGroupCollection (MR !4159)
- Fix sink attention TP (MR !4173)
- Fix 1f1b overlap unit tests for MTP standalone (MR !4210)
- Fix stale state dict handling (MR !4226)
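The loss-mask cloning fix above (MR !4164) is an instance of a general aliasing hazard: mutating a shared mask in place corrupts the caller's copy for later use. A plain-Python sketch of the bug and the fix (illustrative only, using lists in place of tensors; the function names are hypothetical):

```python
def mask_pad_positions_inplace(loss_mask, pad_positions):
    """Buggy pattern: writes through to the caller's mask."""
    for p in pad_positions:
        loss_mask[p] = 0
    return loss_mask


def mask_pad_positions_cloned(loss_mask, pad_positions):
    """Fixed pattern: copy first (tensor.clone() in real code),
    so the shared mask is never mutated."""
    masked = list(loss_mask)
    for p in pad_positions:
        masked[p] = 0
    return masked


shared_mask = [1, 1, 1, 1]
mask_pad_positions_cloned(shared_mask, [2, 3])
assert shared_mask == [1, 1, 1, 1]  # caller's mask survives intact
```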
- Known issues
- New Contributors
- @marksverdhei made their first contribution in #1980
- @Skylion007 made their first contribution in #2047
- @azzhipa made their first contribution in 5db6704
- @vicoooo26 made their first contribution in 5db6704
- @A-transformer made their first contribution in e002b5c
- @chaitanyadwivedii made their first contribution in 20b3954
We'd like to thank all our external contributors whose work was merged in this release:
- External Contributor Acknowledgements
- Fix ImportError and NameError in examples/run_simple_mcore_train_loop.py by @marksverdhei in #1980
- Optimizer refactor: clean up public get_megatron_optimizer interface by @Skylion007 in #2047
- Typo fixes from community with co-authors @vicoooo26, @azzhipa, @A-transformer in 5db6704 and e002b5c
- Fix router input jitter dtype by @chaitanyadwivedii in 20b3954
Note: Some contributions came through internal MRs and use commit hashes instead of PR numbers. We are now GitHub-first, so all PRs moving forward will be tested and merged in public.
NVIDIA Megatron Core 0.14.0
- Features
- Inference
- Post-training
- ModelOpt updates (MR !3268)
- Add speculative decoding AR validation feature
- Add DeepSeek and Qwen model configs
- ModelOpt updates (MR !3268)
- Performance
- MoE
- We're actively optimizing large-scale fine-grained MoE performance on Blackwell Platform.
- Features:
- Memory Optimization
- Performance Optimization
- Bug fixes:
- Fix router input jitter dtype (MR !3774)
- Model support
- Ease of use
- Bug fixes
- Use mscale_all_dim for softmax_factor (MR !2800)
- Fix FP8 param blockwise scaling unit test (MR !3480)
- Fix unit test blockwise scaling (MR !3491)
- Optimize prefill for token-less requests (MR !3499)
- Add default values for Fp8Padding and Fp8Unpadding (MR !3501)
- Fix CUDA graph logic for flexible pp layout (MR !3505)
- Load FP8 models with strict=False (MR !3508)
- Skip rope check for torch < 1.4.0 (MR !3528)
- Disable Apex tests for stability (MR !3539)
- Fix typo in parallel_state expert parallelism (MR !3548)
- Guard modelopt on macOS (MR !3549)
- Retry on CUDA function failure (MR !3554)
- Fix NCCL mem pool creation error (MR !3557)
- Fix get_rotary_seq_len return type (MR !3559)
- Retry on CUDA function failure (MR !3560)
- Fix NCCL allocator attribute error (MR !3565)
- Ensure multi-prompt inference works (MR !3568)
- Fix MD5 on FIPS systems (MR !3577)
- Fixes dynamic context and inference bugs (MR !3582)
- Fix TE version for interleaved fused RoPE (MR !3586)
- Fix MTP with MoE and TP logging (MR !3594)
- Guard TE import fix (MR !3596)
- Add assertion for NCCL UB case (MR !3599)
- Remove Encoder PP related Functions (MR !3604)
- Fix segfaults in tests (MR !3605)
- Fix TE error in distributed optimizer (MR !3625)
- Remove redundant barrier in checkpoint flow (MR !3626)
- Support VPP MTP, fix logging (MR !3630)
- Retry mechanism for free(): invalid pointer errors (MR !3632)
- Fix test_replication.py issues (MR !3633)
- Fix typo in parallel_state (MR !3634)
- Fix CUDA graph logic determination (MR !3635)
- Fix TE installation error (MR !3636)
- Ensure correct sharding type in local tests (MR !3643)
- Fix cudagraphed backward buffer reuse for last layer (MR !3645)
- Set default for packed_seq_params in get_rotary_seq_len (MR !3651)
- Fix dynamic example script errors (MR !3653)
- Guard TE import fix (MR !3666)
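The FIPS fix above (MR !3577) reflects a standard pattern: on FIPS-enabled systems, plain `hashlib.md5()` can fail because MD5 is disabled for cryptographic use, and Python 3.9+ offers `usedforsecurity=False` for non-cryptographic uses such as cache keys. A sketch under that assumption (the function name is illustrative, not Megatron's):

```python
import hashlib


def content_fingerprint(data: bytes) -> str:
    # usedforsecurity=False (Python 3.9+) marks this as a non-cryptographic
    # checksum, so it keeps working where FIPS mode blocks security MD5.
    return hashlib.md5(data, usedforsecurity=False).hexdigest()
```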
- Breaking changes:
`megatron.core.distributed.custom_fsdp` refactored as a breaking change to `megatron.core.distributed.fsdp.src.megatron_fsdp`
- Known issues
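For the `custom_fsdp` to `megatron_fsdp` rename listed under breaking changes, downstream code that must span both versions commonly falls back across import paths. A generic helper sketch (the helper name is hypothetical; the module paths in the docstring come from the breaking-change note above):

```python
import importlib


def import_first(*candidates):
    """Return the first module in candidates that imports cleanly.

    Hypothetical compatibility helper for surviving module renames, e.g.:
        fsdp = import_first(
            "megatron.core.distributed.fsdp.src.megatron_fsdp",  # newer
            "megatron.core.distributed.custom_fsdp",             # older
        )
    """
    errors = []
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError as exc:
            errors.append(f"{name}: {exc}")
    raise ImportError("no candidate imported: " + "; ".join(errors))
```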
25.09-alpha.rc1
Add fp8 attn knobs
NVIDIA Megatron Core 0.13.1
Merge branch 'cherry-pick-f36e1705' into 'core_r0.13.0' Cherry-pick 'Use ruff linter (3627)' into 'core_r0.13.0' See merge request ADLR/megatron-lm!3793
NVIDIA Megatron Core 0.14.0rc5
Prerelease: NVIDIA Megatron Core 0.14.0rc5 (2025-08-11)
NVIDIA Megatron Core 0.12.3
Merge branch 'chtruong/cherry-pick-3627' into 'core_r0.12.0' Cherry-pick 'use yaml safe load (3627)' into 'core_r0.12.0' See merge request ADLR/megatron-lm!3795