Releases · AMD-AGI/Primus
v0.7.0
Docker Version
This release tag is used to build the v26.1 Docker image.
What's Changed
- [Primus performance projection] Initial version of pipeline simulation using measured layer-wise latencies. by @yuankaichen-amd in #362
- [Megatron][feat] Add deterministic training support for the Megatron-LM backend by @RuibinCheung in #376 (a generic determinism sketch follows this list)
- refactor(megatron): Add Megatron MoE patch modules by @Xiaoming-AMD in #443
- refactor(megatron): Add Transformer Engine patches module by @Xiaoming-AMD in #445
- refactor(megatron): Add Primus Turbo patches module by @Xiaoming-AMD in #446
- refactor(megatron): add zbpp patches for Megatron zero-bubble pipeline by @Xiaoming-AMD in #447
- feat(megatron): add FP8 patches for Megatron fp8 context by @Xiaoming-AMD in #448
- feature(megatron): Megatron PP patches into parallelism package by @Xiaoming-AMD in #449
- feature(megatron): Move MLA attention patch into Megatron patch system by @Xiaoming-AMD in #450
- feature(megatron): Add recompute_layer patches to Megatron patch system by @Xiaoming-AMD in #451
- feature(megatron): Add Torch FSDP2 patch to Megatron patch system by @Xiaoming-AMD in #452
- feature: Integrate Primus runtime patch system with Megatron backend by @Xiaoming-AMD in #453
- Migrate Megatron patch logic to Primus patch framework by @Xiaoming-AMD in #442
- refactor: integrate preflight tool into Primus CLI by @Xiaoming-AMD in #372
- docs: align Megatron example with current config by @Xiaoming-AMD in #454
- RCCL benchmarking update: FSDP and rccl-test command generation by @lorri-rao in #302
- fix(preflight): lazy-load preflight runner to avoid JAX UT import failure by @Xiaoming-AMD in #457
- feat(cli): let pretrain hooks return extra args via stdout by @Xiaoming-AMD in #455
- feat(primus-pipeline): support combined 1f1b for zbv-formatted, fix s… by @ChengYao-amd in #456
- refactor: move TorchTitan trainer patches into backend patch system by @Xiaoming-AMD in #459
- fix up evaluate loss for megatron by @llying-001 in #458
- feat(primus-pipe): optimize communication overlap by @ChengYao-amd in #461
- Align Primus TFLOPS computation with Megatron default definition by @Xiaoming-AMD in #462
- Megatron: align training flow with Primus patch architecture by @Xiaoming-AMD in #460
- Fix and modernize GEMM benchmark script by @Xiaoming-AMD in #465
- CI: Temporarily disable ainic docker image build by @Xiaoming-AMD in #468
- feat: add optional torchft dependencies for fault tolerance support by @zhanglei-amd in #467
- NCCL_IB_GID_INDEX on base_env.sh by @agelman-dn in #466
- feat: allow container name configuration and add user to default name by @olehtika in #347
- feat(runner): improve primus-cli env, hooks, and patch handling by @Xiaoming-AMD in #469
- benchmark: enhance RCCL microbench reporting and CLI integration by @Xiaoming-AMD in #290
- update llama & grok config by @JohnQinAMD in #472
- [Megatron-LM] feat(mxfp4): support mxfp4 in megatron-lm backend by @RuibinCheung in #470
- fix training issue on NVIDIA GPUs by @wenxie-amd in #473
- feat(torchtitan): integrate TorchTitan into Primus new architecture by @Xiaoming-AMD in #463
- Maximize batch sizes for DeepSeek V3 on MI355X by @clairesonglee in #366
- fix(torchtitan): adjust DeepSeek V3 16B batch size to 8 by @Xiaoming-AMD in #474
- Fix Megatron Patch System Initialization Flow by @Xiaoming-AMD in #475
- Improve Slurm CLI entry env setup and logging by @Xiaoming-AMD in #476
- Switch TorchTitan pretrain to core runtime by default and unify TorchTitan patch logging by @Xiaoming-AMD in #477
- Add MaxText Support to primus-cli by @Xiaoming-AMD in #478
- fix tp issue by @wenxie-amd in #479
- Pass RCCL_* and IONIC_* environment variables by @agelman-dn in #482
- feat(moun): add moun optimizer config by @ChengYao-amd in #483
- Add CLI-based example scripts for training workflow by @Xiaoming-AMD in #480
- megatron(turbo): rely on patch system for missing imports in spec provider by @Xiaoming-AMD in #485
- [Fix] fix quant config arg name by @GeneDer in #488
- Docs: add Benchmark Suite documentation by @Xiaoming-AMD in #491
- benchmark: remove legacy kernel/gemm script by @Xiaoming-AMD in #493
- feat(benchmark): add attention suite; remove legacy kernel attention by @Xiaoming-AMD in #494
- Preflight refactor: unified info/perf flow, simplified flags, richer host+GPU+network report by @Xiaoming-AMD in #490
- megatron: improve compatibility with Megatron-LM v0.10.0+ (args & pretrain API) by @Xiaoming-AMD in #496
- real ip address by @zhanglei-amd in #481
- Bugfix/titan patch by @wenxie-amd in #497
- feat: add build_uccl hook and install rocSHMEM in Dockerfile by @zhenhuang12 in #487
- ci(benchmark): add daily benchmark script by @HuangWei-95 in #489
- Add FP8 Support to GEMM Benchmarks by @Xiaoming-AMD in #499
- Post-Training Framework(Megatron-Bridge) Support by @Xiaoming-AMD in #500
- Enable turbo classic attention by @clairesonglee in #503
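Several of the changes above relate to reproducibility, most directly the deterministic training support for the Megatron-LM backend in #376. As a rough, generic illustration only (this is not the Primus or Megatron-LM implementation), a PyTorch determinism setup typically fixes the RNG seeds and forces deterministic kernel selection; the functions below are standard PyTorch APIs, and the seed value and function name are arbitrary assumptions.

```python
import os
import random

import numpy as np
import torch


def enable_determinism(seed: int = 1234) -> None:
    """Generic PyTorch determinism setup (illustrative only, not Primus code)."""
    # Fix all RNG seeds used during training.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

    # Force deterministic kernel selection where PyTorch supports it.
    torch.use_deterministic_algorithms(True)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

    # Needed by some deterministic GEMM paths; should be set before the first GEMM runs.
    os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")
```

Deterministic kernel selection usually costs some throughput, which is why such a mode is typically opt-in.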
New Contributors
- @zhanglei-amd made their first contribution in #467
- @agelman-dn made their first contribution in #466
Full Changelog: v0.6.0...v0.7.0
v0.6.0
Docker Version
This release tag is used to build the v25.11 Docker image.
What's Changed
- fix(config): update TorchTitan fp8 configs for new converter and training schema by @Xiaoming-AMD in #236
- feat(torchtitan): auto-install nightly torch based on ROCm version before dataset preparation by @Xiaoming-AMD in #237
- feat(torchtitan): add Qwen3 model configs (0.6B / 1.7B / 32B) by @Xiaoming-AMD in #238
- feat(megatron): Primus-Megatron uses PrimusTurboSpecProvider as the model backend by @zhenhuang12 in #231
- Refactor: Organize training configs by GPU architecture (MI300 / MI355) by @Xiaoming-AMD in #240
- fix(compat): add torch API compatibility patches for Titan imports on ROCm 7.0 by @Xiaoming-AMD in #242
- [titan] feat(config): add llama3 & llama3.3 config by @RuibinCheung in #239
- feat(zerobubble): zerobubble add legacy group gemm & te backend by @ChengYao-amd in #241
- Support Megatron's a2a and deepep overlap in pipeline by @yuankaichen-amd in #244
- feat(torchtitan): add DeepSeek-V3 16B & 671B configs for MI300X and MI355X by @Xiaoming-AMD in #243
- feat(deepep): support fully syncfree moe stage3. by @zhenhuang12 in #232
- fix: parse moe_layer_freq from string to list in Megatron config adapter by @Xiaoming-AMD in #245
- Build Primus Docker (Private) by @wenxie-amd in #247
- chore: update turbo by @xiaobochen-amd in #248
- feat(benchmark/gemm): add base, dense, and deepseek GEMM benchmarks by @Xiaoming-AMD in #226
- upgrade torch to 2.10.0 nightly by @wenxie-amd in #249
- Fix: Align nn.Embedding output with AMP autocast precision by @Xiaoming-AMD in #251
- fix: fix torchtitan training issue in TurboAttention by @kyle-256 in #253
- chore: llama-3.1-8b enable turbo by @xiaobochen-amd in #250
- feat(torchtitan): enable dynamic model parameter override via CLI by @Xiaoming-AMD in #254
- update docker image to nightly; add turbo fa env by @wenxie-amd in #255
- test(torchtitan): add model unit tests for TorchTitan backend by @Xiaoming-AMD in #256
- Feature: Add Mock HuggingFace Dataset Support for TorchTitan by @Xiaoming-AMD in #260
- fix(deepep): fix moe overlap error with sync-free moe. by @zhenhuang12 in #258
- fix(tp-overlap): adapt transformer_engine 2.4 for Megatron backend by @zhenhuang12 in #259
- feat(zero-bubble): reorder comm-nodes for batch-p2p by @ChengYao-amd in #257
- refactor(torchtitan): rollback Titan to 99c0cb2(20250907) and stabilize trainer UTs by @Xiaoming-AMD in #262
- feat(turbo): Add turbo RMSNorm patch by @ChengYao-amd in #263
- fix: Disable double DDP construction inside build_model() via runtime patch by @Xiaoming-AMD in #264
- fix(light-megatron): resolve config parsing and adapter compatibility issues by @Xiaoming-AMD in #265
- Support torch_dist async checkpoint for latest Megatron-LM by @limou102 in #267
- New envs: HSA_KERNARG_POOL_SIZE, ENABLE_NUMA_BINDING by @wenxie-amd in #268
- feat: megatron support turbo fp8 grouped gemm by @xiaobochen-amd in #261
- feat(cli): add dataset preparation hooks for train/pretrain workflows by @Xiaoming-AMD in #270
- strided allgather benchmark by @wenxie-amd in #271
- disable cross entropy flags to avoid convergence loss divergence by @clairesonglee in #269
- [Fix] import MXLinear from Primus Turbo by @GeneDer in #272
- add public primus-safe link in readme by @wenxie-amd in #274
- fix readme's typo error by @wenxie-amd in #275
- Add memory projection cli by @yuankaichen-amd in #273
- feat(CLI): Add Runner Library and Test Suite by @Xiaoming-AMD in #276
- feat(runner): add patch execution system with comprehensive test suite by @Xiaoming-AMD in #279
- feat: Support wgrad in MoE overlap by @yuankaichen-amd in #278
- feature(cli): refactor environment configuration with layered design by @Xiaoming-AMD in #280
- refactor: Runner CLI Refactoring and Optimization by @Xiaoming-AMD in #281
- Documentation Reorganization and Structure Improvements by @Xiaoming-AMD in #283
- docs: reorganize backend patch notes + link docs by @Xiaoming-AMD in #284
- feature(cli): cli auto-discover subcommands and add tests by @Xiaoming-AMD in #285
- specify recompute_layer_ids by @llying-001 in #286
- fix(zerobubble): fix zerobubble pp warmup error by @ChengYao-amd in #287
- fix(token_dispatcher): fix deep_ep internode_combine hang when enable sync-free stage 2. by @zhenhuang12 in #289
- fix loading data bug when using vpp by @lhzhang333 in #277
- benchmark: emit GEMM reports as markdown by @Xiaoming-AMD in #288
- support maxtext backend by @llying-001 in #291
- bench: fix dense gemm pre-commit issues by @Xiaoming-AMD in #295
- feat(IRLens): open-source the IRLens tool by @yihuang-amd in #296
- fix errors when patch_moe_overlap is set and distribute Wgrads to overlap with different comms by @yuankaichen-amd in #294
- deprecate decoder_pipeline_manual_split_list by @lhzhang333 in #293
- feat(config): make Megatron workspace path configurable via env var by @poznano-amd in #298
- feat(IRLens): fix nested sub-computations called via call() operations by @yihuang-amd in #300
- chore: update latest turbo by @xiaobochen-amd in #301
- MoE package - version1 by @wenxie-amd in #282
- fix multi node aiter build bug by @wenxie-amd in #303
- maxtext: add xla_dump_hlo switch, update requirement and dpsk_v2_light config by @llying-001 in #304
- primus ainic docker image by @wenxie-amd in #308
- fix:megatron trainer import save_checkpoint_and_time by @olehtika in #307
- Refactor(config): Unify config inheritance and enhance Megatron args handling by @Xiaoming-AMD in #312
- feature(core): Add Patch Framework for Backend- and Version-Aware Patch Handling by @Xiaoming-AMD in #314
- Print global max mem usage by @wenxie-amd in #317
- backend(maxtext): support custom model config and update model args via cli by @llying-001 in #315
- feature(core): Add BackendAdapter unit tests and improve trainer creation workflow stability by @Xiaoming-AMD in #318
- feat(IRLens): use different variable name for nested loops by @yihuang-amd in #319
- feature(core): add BackendRegistry with lazy loading and comprehensive tests by @Xiaoming-AMD in #320
- feature(backends): enhance MegatronArgBuilder argument construction by @Xiaoming-AMD in #322
- feature(core): add TrainerComponent and unify trainer patch workflow by @Xiaoming-AMD in #323
- feat(backends): introduce Megatron backend trainers and pretrain wiring tests by @Xiaoming-AMD in #324
- feat(backends): Add Megatron backend adapter and wiring tests by @Xiaoming-AMD in #325
- fix(benchmark): fix gemm_bench.py duplicate and indent by @olehtika in #306
- feat(core): add unified train runtime orchestrator and optional CLI entry by @Xiaoming-AMD in #327
- backends/megatron: add args patches for paths, logging and wandb by @Xiaoming-AMD in #328
- backends/megatron: add checkpoint patch and platform_config fallback by @Xiaoming-AMD in #329
- backend...
v0.5.0
What's Changed
- fix(config): update TorchTitan fp8 configs for new converter and training schema by @Xiaoming-AMD in #236
- feat(torchtitan): auto-install nightly torch based on ROCm version before dataset preparation by @Xiaoming-AMD in #237
- feat(torchtitan): add Qwen3 model configs (0.6B / 1.7B / 32B) by @Xiaoming-AMD in #238
- feat(megatron): Primus-Megatron uses PrimusTurboSpecProvider as the model backend by @zhenhuang12 in #231
- Refactor: Organize training configs by GPU architecture (MI300 / MI355) by @Xiaoming-AMD in #240
- fix(compat): add torch API compatibility patches for Titan imports on ROCm 7.0 by @Xiaoming-AMD in #242
- [titan] feat(config): add llama3 & llama3.3 config by @RuibinCheung in #239
- feat(zerobubble): zerobubble add legacy group gemm & te backend by @ChengYao-amd in #241
- Support Megatron's a2a and deepep overlap in pipeline by @yuankaichen-amd in #244
- feat(torchtitan): add DeepSeek-V3 16B & 671B configs for MI300X and MI355X by @Xiaoming-AMD in #243
- feat(deepep): support fully syncfree moe stage3. by @zhenhuang12 in #232
- fix: parse moe_layer_freq from string to list in Megatron config adapter by @Xiaoming-AMD in #245
- Build Primus Docker (Private) by @wenxie-amd in #247
- chore: update turbo by @xiaobochen-amd in #248
- feat(benchmark/gemm): add base, dense, and deepseek GEMM benchmarks by @Xiaoming-AMD in #226
- upgrade torch to 2.10.0 nightly by @wenxie-amd in #249
- Fix: Align nn.Embedding output with AMP autocast precision by @Xiaoming-AMD in #251
- fix: fix torchtitan training issue in TurboAttention by @kyle-256 in #253
- chore: llama-3.1-8b enable turbo by @xiaobochen-amd in #250
- feat(torchtitan): enable dynamic model parameter override via CLI by @Xiaoming-AMD in #254
- update docker image to nightly; add turbo fa env by @wenxie-amd in #255
- test(torchtitan): add model unit tests for TorchTitan backend by @Xiaoming-AMD in #256
- Feature: Add Mock HuggingFace Dataset Support for TorchTitan by @Xiaoming-AMD in #260
- fix(deepep): fix moe overlap error with sync-free moe. by @zhenhuang12 in #258
- fix(tp-overlap): adapt transformer_engine 2.4 for Megatron backend by @zhenhuang12 in #259
- feat(zero-bubble): reorder comm-nodes for batch-p2p by @ChengYao-amd in #257
- refactor(torchtitan): rollback Titan to 99c0cb2(20250907) and stabilize trainer UTs by @Xiaoming-AMD in #262
- feat(turbo): Add turbo RMSNorm patch by @ChengYao-amd in #263
- fix: Disable double DDP construction inside build_model() via runtime patch by @Xiaoming-AMD in #264
- fix(light-megatron): resolve config parsing and adapter compatibility issues by @Xiaoming-AMD in #265
- Support torch_dist async checkpoint for latest Megatron-LM by @limou102 in #267
- New envs: HSA_KERNARG_POOL_SIZE, ENABLE_NUMA_BINDING by @wenxie-amd in #268
- feat: megatron support turbo fp8 grouped gemm by @xiaobochen-amd in #261
- feat(cli): add dataset preparation hooks for train/pretrain workflows by @Xiaoming-AMD in #270
- strided allgather benchmark by @wenxie-amd in #271
- disable cross entropy flags to avoid convergence loss divergence by @clairesonglee in #269
- [Fix] import MXLinear from Primus Turbo by @GeneDer in #272
- add public primus-safe link in readme by @wenxie-amd in #274
- fix readme's typo error by @wenxie-amd in #275
- Add memory projection cli by @yuankaichen-amd in #273
- feat(CLI): Add Runner Library and Test Suite by @Xiaoming-AMD in #276
- feat(runner): add patch execution system with comprehensive test suite by @Xiaoming-AMD in #279
- feat: Support wgrad in MoE overlap by @yuankaichen-amd in #278
- feature(cli): refactor environment configuration with layered design by @Xiaoming-AMD in #280
- refactor: Runner CLI Refactoring and Optimization by @Xiaoming-AMD in #281
- Documentation Reorganization and Structure Improvements by @Xiaoming-AMD in #283
- docs: reorganize backend patch notes + link docs by @Xiaoming-AMD in #284
- feature(cli): cli auto-discover subcommands and add tests by @Xiaoming-AMD in #285
- specify recompute_layer_ids by @llying-001 in #286
- fix(zerobubble): fix zerobubble pp warmup error by @ChengYao-amd in #287
- fix(token_dispatcher): fix deep_ep internode_combine hang when enable sync-free stage 2. by @zhenhuang12 in #289
- fix loading data bug when using vpp by @lhzhang333 in #277
- benchmark: emit GEMM reports as markdown by @Xiaoming-AMD in #288
- support maxtext backend by @llying-001 in #291
- bench: fix dense gemm pre-commit issues by @Xiaoming-AMD in #295
- feat(IRLens): open-source the IRLens tool by @yihuang-amd in #296
- fix errors when patch_moe_overlap is set and distribute Wgrads to overlap with different comms by @yuankaichen-amd in #294
- deprecate decoder_pipeline_manual_split_list by @lhzhang333 in #293
- feat(config): make Megatron workspace path configurable via env var by @poznano-amd in #298
New Contributors
- @GeneDer made their first contribution in #272
- @poznano-amd made their first contribution in #298
Full Changelog: v0.4.0...v0.5.0
v0.4.0
What's Changed
- fix(config): correct flavor to 405B in torchtitan/llama3.1_405B.yaml by @Xiaoming-AMD in #189
- perf(torchtitan/config): enable compile for Llama-3.1 (8B/70B/405B) by @Xiaoming-AMD in #193
- disable dump_pp_data when pp size is one by @lhzhang333 in #191
- remove turbo token by @wenxie-amd in #197
- feat(async-tp) change gemm_rs_overlap api for multi-stream method by @llying-001 in #171
- Support for torchtitan with Primus-Turbo by @clairesonglee in #188
- chore: update default rocm/megatron-lm image to v25.8_py310 by @Xiaoming-AMD in #198
- perf(aiter): add AITER_JIT_DIR env for cached build to speed up re-compilation by @Xiaoming-AMD in #199
- feat: align primus-turbo fp8 linear's args to megatron by @RuibinCheung in #195
- Add wandb_enable config and Torchtitan unit tests by @zitree in #194
- fix: wrapper turbo quant config in megatron extension by @RuibinCheung in #202
- feat(cli): add Python-based primus entrypoint for PATH installation by @Xiaoming-AMD in #200
- feat(zero-bubble): support zero bubble pipeline parallelism by @ChengYao-amd in #208
- Primus product matrix by @wenxie-amd in #210
- fix: remove MXQuantConfig from titan and add warning msg by @RuibinCheung in #212
- fix 8B perf regression (v25.9) by @wenxie-amd in #215
- feat(zero-bubble): support GroupGemm wgrad split, add debug_scheduler_table flag by @ChengYao-amd in #213
- add support for grok1 by @JohnQinAMD in #216
- improve torch profiling by @wenxie-amd in #218
- supports: userId for request by @weilei0120 in #214
- support mlflow tracking by @wenxie-amd in #219
- feat: Update Megatron-LM to 8477817(20251011) by @Xiaoming-AMD in #221
- test(megatron): add Qwen2.5-7B and Qwen2.5-72B pretrain cases by @Xiaoming-AMD in #222
- feat(CLI): add unified shell entry scripts for Slurm, container, and direct modes by @Xiaoming-AMD in #209
- Add tensor size print for comm op benchmark by @lorri-rao in #223
- fix(megatron): fix bugs for fitting the newest megatron by @ChengYao-amd in #224
- Docker Release v25.9 by @wenxie-amd in #217
- Add grok2 model support by @wenxie-amd in #227
- Use PRIMUS_xxx env, export all envs for slurm by @wenxie-amd in #229
- feat(deepep): add PrimusTurboDeepEPTokenDispatcher and support syncfree moe stage 0-2 by @zhenhuang12 in #220
- upgrade(torchtitan): sync torchtitan to 5fb7cc2e3bbb9b9dc0ab7af34ed5cc58b5f32021 (2025-10-16) by @Xiaoming-AMD in #228
- chore(docker): update default image to rocm/primus:v25.9_gfx942 by @Xiaoming-AMD in #230
- fix(tests): add missing expecttest dependency for distributed tests by @Xiaoming-AMD in #233
- fix(config): use 1.0e-2 for moe_aux_loss_coeff to ensure correct float parsing by @Xiaoming-AMD in #234
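The moe_aux_loss_coeff fix in #234 points at a YAML 1.1 parsing quirk: PyYAML's default resolver only recognizes scientific notation as a float when it matches the YAML 1.1 grammar, so a bare 1e-2 loads as a string while 1.0e-2 loads as a float. A minimal sketch of the behavior, assuming PyYAML's safe_load (this is not Primus's config loader):

```python
import yaml  # PyYAML

# YAML 1.1 (PyYAML default) does not treat bare "1e-2" as a float.
cfg = yaml.safe_load("moe_aux_loss_coeff: 1e-2")
print(type(cfg["moe_aux_loss_coeff"]))  # <class 'str'>

# Adding the decimal point matches the YAML 1.1 float grammar.
cfg = yaml.safe_load("moe_aux_loss_coeff: 1.0e-2")
print(type(cfg["moe_aux_loss_coeff"]))  # <class 'float'>
```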
New Contributors
- @clairesonglee made their first contribution in #188
- @zitree made their first contribution in #194
- @lorri-rao made their first contribution in #223
Full Changelog: v0.2.0...v0.4.0
v0.3.0
What's Changed
- fix(config): correct flavor to 405B in torchtitan/llama3.1_405B.yaml by @Xiaoming-AMD in #189
- perf(torchtitan/config): enable compile for Llama-3.1 (8B/70B/405B) by @Xiaoming-AMD in #193
- disable dump_pp_data when pp size is one by @lhzhang333 in #191
- remove turbo token by @wenxie-amd in #197
- feat(async-tp) change gemm_rs_overlap api for multi-stream method by @llying-001 in #171
- Support for torchtitan with Primus-Turbo by @clairesonglee in #188
- chore: update default rocm/megatron-lm image to v25.8_py310 by @Xiaoming-AMD in #198
- perf(aiter): add AITER_JIT_DIR env for cached build to speed up re-compilation by @Xiaoming-AMD in #199
- feat: align primus-turbo fp8 linear's args to megatron by @RuibinCheung in #195
- Add wandb_enable config and Torchtitan unit tests by @zitree in #194
- fix: wrapper turbo quant config in megatron extension by @RuibinCheung in #202
- feat(cli): add Python-based primus entrypoint for PATH installation by @Xiaoming-AMD in #200
- feat(zero-bubble): support zero bubble pipeline parallelism by @ChengYao-amd in #208
- Primus product matrix by @wenxie-amd in #210
- fix: remove MXQuantConfig from titan and add warning msg by @RuibinCheung in #212
- fix 8B perf regression (v25.9) by @wenxie-amd in #215
- feat(zero-bubble): support GroupGemm wgrad split, add debug_scheduler_table flag by @ChengYao-amd in #213
New Contributors
- @clairesonglee made their first contribution in #188
- @zitree made their first contribution in #194
Full Changelog: v0.2.0...v0.3.0
v0.2.0
What's Changed
- feat: Unify config/backend CLI & add config export support by @Xiaoming-AMD in #151
- feat: reduce cpu sync of moe_router_force_load_balancing by @RuibinCheung in #153
- feat(light-megatron): add LightMegatronPretrainTrainer with clean config-based integration by @Xiaoming-AMD in #136
- fix(docker): Use docker_podman_proxy for container cleanup by @Xiaoming-AMD in #157
- feat(moe): fused moe router add scatter logics, modify flags to primus_turbo.yaml by @ChengYao-amd in #141
- feat(turbo): update turbo grouped gemm bf16/fp16 by @xiaobochen-amd in #149
- fix(pp): fix the validation issue when vpp is not set in manual split mode by @lhzhang333 in #161
- Add initial llama4 configs by @chriscai-amd in #163
- (ut)add megatron ut scripts by @llying-001 in #164
- refactor(attn): update attention utils interface by @ChengYao-amd in #159
- Update Llama-4-Scout-17B-16E Megatron Configs by @chriscai-amd in #165
- update log/wandb/tensorboard by @wenxie-amd in #169
- [Llama4] Add Llama4 17B128E Maverick config by @chriscai-amd in #172
- feat(turbo): attn interface fit turbo by @ChengYao-amd in #173
- turn on manual gc by @wenxie-amd in #175 (a generic sketch of the manual-GC technique follows this list)
- add userid to header by @weilei0120 in #177
- (feat)async tp: adapt async-tp for te2.x api by @llying-001 in #178
- [Perf Issue] Disable manual_gc by default and update rocm_mem behavior by @wenxie-amd in #179
- update proxy model config by @wenxie-amd in #167
- upgrade docker image by @wenxie-amd in #176
- Enable turbo v25.8 by @vidushi8 in #180
- fix wandb/tensorboard mem item by @wenxie-amd in #181
- (test) add torchtitan ut and integration test by @llying-001 in #170
- add te fused cross entropy argument by @wenxie-amd in #182
- make pp_data_dir configurable and add pp_vis dependencies by @lhzhang333 in #183
- pp_warmup optimization by @lhzhang333 in #185
- move clean step into UT by @wenxie-amd in #186
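Two entries above toggle manual garbage collection (#175 turns it on, #179 later disables it by default after a perf issue). The general technique, sketched below in generic Python rather than Primus code, is to keep Python's automatic collector from firing at arbitrary points in the training hot loop and instead collect at a fixed iteration interval so GC pauses land at predictable points on every rank; the step callable and interval are placeholders.

```python
import gc
from typing import Callable


def train(step_fn: Callable[[int], None], num_iters: int, gc_interval: int = 100) -> None:
    """Illustrative manual-GC training loop (not Primus code)."""
    gc.disable()   # stop automatic collection during the hot loop
    gc.collect()   # start from a clean state
    try:
        for it in range(num_iters):
            step_fn(it)  # per-iteration training work supplied by the caller
            if gc_interval > 0 and (it + 1) % gc_interval == 0:
                gc.collect()  # collect at a predictable cadence on every rank
    finally:
        gc.enable()
```

Whether this helps depends on allocation patterns, which is consistent with the later change making it opt-in rather than default.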
New Contributors
- @chriscai-amd made their first contribution in #163
- @weilei0120 made their first contribution in #177
Full Changelog: v0.1.0-rc1...v0.2.0
First Version Release
What's Changed
- hipblaslt auto tune by @wenxie-amd in #23
- fix(examples): fix Megatron path used in prepare_dataset by @Xiaoming-AMD in #24
- fix(torch_fsdp): get ddp_config failed when use torch_fsdp by @Xiaoming-AMD in #27
- preflight by @wenxie-amd in #25
- merge trace file by @wenxie-amd in #31
- add inter-node ring p2p test by @limou102 in #30
- feat(megatron): Align TFLOPs calculation for megatron by @Xiaoming-AMD in #28
- [Fix] fp8 option not work by @RuibinCheung in #33
- feat(HSA ENV): tune ROCm runtime with HSA_NO_SCRATCH_RECLAIM and HSA_ENABLE_SDMA by @Xiaoming-AMD in #32
- feat(config parse): replace yaml value (int/float) from env('KEY') by @Xiaoming-AMD in #34
- feat(fsdp): patch Megatron torch_FSDP2 with Primus implementation by @Xiaoming-AMD in #35
- Add README by @wenxie-amd in #36
- fix typo error of preflight script by @wenxie-amd in #37
- [Feat] Add tensile tuning example by @RuibinCheung in #38
- refactor(examples): simplify usage and improve structure for clarity by @Xiaoming-AMD in #39
- test:Add model-specific Megatron trainer test cases with isolated test config by @Xiaoming-AMD in #40
- Primus benchmark by @xiaobochen-amd in #43
- docs(contributing): add initial contributing guidelines by @Xiaoming-AMD in #42
- Dev/yaoc/mixtral by @ChengYao-amd in #44
- fix fast async checkpoint on ROCm by @limou102 in #46
- docs & refactor: reorganize README, unify config usage, and improve Megatron pretrain scripts for Primus by @Xiaoming-AMD in #45
- optimize: reduce FP8 training memory usage via scoped TE layer config overrides by @Xiaoming-AMD in #47
- fix(megatron): add missing import for 'inspect' in TE kwargs patch by @Xiaoming-AMD in #49
- chore(submodule): update Megatron-LM from 20250324 to 20250522 by @Xiaoming-AMD in #50
- refactor: improve benchmark runner and report parser with multi-node support by @Xiaoming-AMD in #51
- feature(model): add mixtral pretrain config by @ChengYao-amd in #52
- update trace_moe_metric call to fit new megatron interface by @ChengYao-amd in #53
- fix(Megatron): fix interleaved virtual pipeline training error and add corresponding UT by @lhzhang333 in #54
- opt(UT): add num_workers=1 in UT yaml to save most of the time on exit by @lhzhang333 in #56
- Update mixtral pretrain configs by @yuankaichen-amd in #55
- refactor(docker): Update docker image to v25.5_py310 by @wenxie-amd in #57
- feat(config): Update LLaMA pretrain configs by @Xiaoming-AMD in #58
- feature(RDMA): Add filtering for gpu RDMA network adapters by @chaojhou in #59
- fix(trainer-test): improve training script success detection using stdout … by @Xiaoming-AMD in #63
- chore(license): add MIT LICENSE file for Primus by @Xiaoming-AMD in #61
- refactor: move Megatron run scripts to examples root and add --backend parameter for multi-backend support by @Xiaoming-AMD in #64
- feat(torchtitan): Add TorchTitan Backend Support (Initial Stub) by @Xiaoming-AMD in #65
- feat(torchtitan): add --local-ranks-filter support in torchrun launcher by @Xiaoming-AMD in #67
- fix(slurm): remove --reservation flag and quote variables in run_slurm_pretrain.sh by @Xiaoming-AMD in #68
- feat(megatron): enable manual pipeline split in (interleaved) 1F1B-PP by monkey patching by @lhzhang333 in #69
- rebase main to instella branch by @wenxie-amd in #71
- fix(ip-interface): socket interface env regression by @Xiaoming-AMD in #70
- feat: Add run_k8s_pretrain interface for Kubernetes workload submission by @Xiaoming-AMD in #72
- feat(run_k8s_pretrain): support --workspace and improve job spec defaults by @Xiaoming-AMD in #73
- feat(megatron): support mock_data mode to skip dataset preparation by @Xiaoming-AMD in #74
- feat(k8s_pretrain): support log in stdout and file by @chaojhou in #76
- feat(k8s): Support for Node Selection via --nodelist and Add nodes by @Xiaoming-AMD in #75
- docs: add TorchTitan backend support entry to README by @Xiaoming-AMD in #78
- add benchmark for checkpoint saving by @limou102 in #81
- feat(torchtitan):Add model configs for LLaMA3-405B and LLaMA3-70B (TorchTitan) by @Xiaoming-AMD in #82
- feat(tp-overlap): add te backend and support tp overlap for megatron. by @zhenhuang12 in #79
- feat(benchmark): update kernel benchmark and add llama405B config by @xiaobochen-amd in #77
- llama3.1_405B model config by @wenxie-amd in #84
- print training envs by @wenxie-amd in #85
- add checkpoint loading metrics by @limou102 in #86
- feat: add new ckpt args of megatron by @wenxie-amd in #88
- doc: Add Mistral Models and Fix Formatting in examples/README.md by @Xiaoming-AMD in #87
- refactor(cli): Enhance Primus CLI with --override Support & Simplify Platform Defaults by @Xiaoming-AMD in #89
- chore(license): add AMD license headers by @Xiaoming-AMD in #90
- feat(k8s launch):Support forwarding unrecognized --args to ENTRY_POINT by @Xiaoming-AMD in #91
- fix(megatron): sync initialize_megatron of primus with that of megatron by @lhzhang333 in #93
- enable deepseek qk_layernorm by @wenxie-amd in #94
- checkout Primus-Turbo by github secret by @wenxie-amd in #96
- feat(tp-overlap): support torchtitan by patch fused_all_gather_matmul of torch op by @zhenhuang12 in #92
- add deprecated_20251209 moe layer by @wenxie-amd in #98
- feat(megatron): add attn warmup to save iter1's time when pp is used by @lhzhang333 in #97
- Primus Config/Patch Document by @wenxie-amd in #100
- feat(megatron): enable dumping pp schedule data and add pp visualization tool by @lhzhang333 in #99
- add patch readme for attn_warmup and decoder_pipeline_manual_split_list by @lhzhang333 in #101
- feat(megatron): add model and pretrain config for LLaMA3.1-405B by @Xiaoming-AMD in #102
- refactor: Refactor Torchtitan Config & Launch: YAML Unification, Backend Auto-Selection by @Xiaoming-AMD in #106
- refactor(torchtitan): switch llama3 configs from TOML to YAML by @Xiaoming-AMD in #108
- doc(examples): Rename Torchtitan LLaMA3 Configs to LLaMA3.1 and Update README Links by @Xiaoming-AMD in #110
- Add tas k8s runner's ci file by @haishuok0525 in #109
- test(megatron): add Mixtral-8x22B/Mixtral-8x7B test and TRAIN_LOG override support by @Xiaoming-AMD in #114
- Speedup primus-turbo build in k8s-ci runner by @wenxie-amd in #113
- fix(trainer): auto-enable tensorboard when profiling is enabled by @Xiaoming-AMD in #116
- Code isolation from shared path by @haishuok0525 in #119
- feat(megatron): add moe_use_fused_router_with_aux_score by @ChengYao-amd in #111
- [UT] Add deterministic extra check and unit test by @RuibinCheung in h...