What's Changed
- Update eksctl cluster versions and ML CBR usage by @bryantbiggs in #573
- Deleting SMP/SMDDP test-cases by @shimomut in #617
- adding picotron by @KeitaW in #584
- Update readme, deprecate test cases, and move Pytorch test cases under pytorch subdirectory by @KeitaW in #620
- Add EKS node autorepair example cluster manifest by @iankouls-aws in #619
- added AmazonEKS_CNI_Policy to SM Exec Role by @bluecrayon52 in #624
- Reduce efa exporter container images by @mhuguesaws in #611
- Change EFA, NCCL version in pipeline by @mhuguesaws in #626
- added DOCKER_NETWORK and env_var persistence for SageMaker Code Editor use at AWS Events by @bluecrayon52 in #623
- updated fsx_ubuntu.sh script with wait loop by @bluecrayon52 in #633
- Change PyTorch version for FSDP case and remove conda by @mhuguesaws in #629
- Change prometheus version for SMHP by @mhuguesaws in #628
- Openzfs smhp by @amanshanbhag in #622
- Fix cloudwatch access from Grafana by @mhuguesaws in #627
- Fixing recently raised Studio Issues by @amanshanbhag in #640
- Terraform Modules for HyperPod EKS by @bluecrayon52 in #586
- Slurm cluster creation issues by @amanshanbhag in #641
- Update 0.distributed-training.Dockerfile by @KeitaW in #645
- Improvements/fsdp restructure by @mhuguesaws in #630
- Add automated Grafana dashboard deployment by @mhuguesaws in #607
- Fix FSDP to use venv first by @mhuguesaws in #650
- nvshmem by @pbelevich in #599
- Update install_enroot_pyxis.sh by @KeitaW in #661
- feat: Add Hyperpod Optimum-neuron LoRA example by @Captainia in #631
- Adding custom dcgm metrics for EKS by @nadknish in #666
- re-adding deepspeed by @KeitaW in #659
- Lcc studio jl by @amanshanbhag in #669
- Update 0.distributed-training.Dockerfile by @nicolaven in #671
- utility to dump details of all nodes in a cluster, into a csv file by @amitosaurus in #652
- Update setup_mariadb_accounting.sh with apg installation by @amanshanbhag in #672
- U 2204 patch -- update from #672 by @amanshanbhag in #673
- Upgrade pinned version of Ansible by @amanshanbhag in #681
- Nghtm patch 2 by @nghtm in #683
- Fix minor spelling mistake in start_slurm.sh by @sammyhori in #686
- Fix nvidia container toolkit to 1.17.6 by @mhuguesaws in #689
- Update 2.SageMakerVPC.yaml by @nghtm in #691
- Skip fsx_ubuntu.sh execution when no FSx parameters are provided in the provisioning parameters by @vaikor-amazon in #692
- Change nccl-tests to have cuda version by @mhuguesaws in #694
- Adding a template for HyperPod EventBridge email notifications by @shimomut in #687
- Improvements/nccl cuda verison bump by @mhuguesaws in #695
- ec2 get metadata replacement by @gmgtamz in #515
- Replacing ********* with localhost in OZFS mount script by @amanshanbhag in #696
- Adding ssh keys to additional (OZFS at /home) file system by @amanshanbhag in #700
- [feat]: Add describe alarm permissions in the execution role for Rolling Update Autorollback. by @divincode in #698
- fsdp k8s yaml to use c10d rdzv backend instead of etcd, updated readm… by @mvinci12 in #701
- Fixing Race Conditions reported in #674 by @amanshanbhag in #703
- feat: Add LoRA fine-tuning optimum-neuron example for slurm by @Captainia in #643
- Fsdp regression tests by @amanshanbhag in #714
- Fix FSDP venv creation by @mhuguesaws in #720
- Updating venv test case for FSDP to point to correct train.py by @amanshanbhag in #725
- Bump requests from 2.32.0 to 2.32.4 in /3.test_cases/pytorch/bionemo by @dependabot[bot] in #727
- new commit for fixing fsdp dataset, using allenai/c4 with HF token by @mvinci12 in #729
- Adding test configs to matrix by @amanshanbhag in #731
- Change FSDP steps and checkpoint steps by @mhuguesaws in #730
- Incorrect indent in container reg test by @amanshanbhag in #732
- Change FSDP steps to reduce time by @mhuguesaws in #734
- Adding SMHP test cluster to matrix (venv) by @amanshanbhag in #740
- Fixing path to match readme instructions by @amanshanbhag in #742
- Feat/picotron resume from checkpoint by @KeitaW in #656
- Fix FSDP venv run by @mhuguesaws in #733
- slurm and eks readme edits by @mvinci12 in #735
- Change FSDP PyTorch to 2.7.1 by @mhuguesaws in #739
- Change FSDP to truncate dataset by @mhuguesaws in #743
- fix typo in NCCL tests README by @KeitaW in #746
- Enable 1click for SageMaker HyperPod by @mhuguesaws in #670
- Fix FSDP requirements.txt to effectively use cuda 128 by @mhuguesaws in #748
- Terraform Modules Updates by @bluecrayon52 in #744
- HyperPod EKS Helper Script Fixes by @bluecrayon52 in #709
- Observability change target scrapping rate to 1 minute by @mhuguesaws in #750
- Fix FSDP destroy process group by @mhuguesaws in #749
- docker library version on eks by @mvinci12 in #753
- Add GPU Health, Slurm exporter to 1click observability by @mhuguesaws in #751
- Add DCGM exporter dashboard with hostnames by @mhuguesaws in #752
- adding llamav3 support on slurm and EKS by @allela-roy in #737
- updating FSDP slurm documentation by @allela-roy in #745
- Updating Parallelcluster deployment guide by @KeitaW in #721
- Update README.md by @nghtm in #755
- updating FSDP EKS documentation by @allela-roy in #756
- refactoring megatron-lm test case by @KeitaW in #637
- update sagemakerimageversionalias by @mvinci12 in #759
- Update README.md by @KeitaW in #757
- Add NeMo 2.0 on EKS example by @olaoyea4 in #761
- Add helper scripts and Update instructions to Nemo example on EKS by @olaoyea4 in #763
- Update push.sh - remove backslash from repo name by @iankouls-aws in #764
- Update push.sh - remove extra double quote by @iankouls-aws in #765
- Update data path in megatron-lm k8s test case by @KeitaW in #762
- Change NCCL tests EFA 1.42 and plugin to 1.16.0 by @mhuguesaws in #766
- Create fsdp-eks-regression.yml by @amanshanbhag in #754
- Adding ACTUAL_JOB_NAME back to fsdp-eks-regression by @amanshanbhag in #767
- Grafana slack by @nghtm in #760
- Update README.md by @hariby in #770
- Fix typo on README.md and cluster-vanilla.yaml by @koshieguchi in #771
- Add Megatron-LM container build CI by @mhuguesaws in #772
- Change FSDP container CI to have container build shared across test cases. by @mhuguesaws in #768
- Change Megatron-LM versions by @mhuguesaws in #773
- Lustre mount via Ansible for SMHP Slurm LCS by @amanshanbhag in #682
- Nghtm patch 2 by @nghtm in #774
- Bump transformers from 4.36.0 to 4.52.1 in /3.test_cases/pytorch/neuronx-distributed/llama3/kubernetes/src by @dependabot[bot] in #775
- Do not apply slurm_pam_adopt on login nodes by @xkoegler-h in #778
- Add retry logic w/ exponential backoff to get_ip_address() function. by @giuseppeporcelli in #780
- Update venv.sh by @aroraakshit in #779
- fix(mount_fs): improve FSx Lustre mount reliability with retries and … by @nitishsaik in #782
- feat: Update LCS of hyperpod eks to support AL2023 OS by @prajjwal-24 in #783
- Adding pod checks for "Succeeded" for worker pods (since pytorchjob s… by @amanshanbhag in #786
- Resolving error message for HF trust remote code by @amanshanbhag in #785
- Update 0.llm-foundry.Dockerfile by @KeitaW in #787
- Resolving "Unsupported block type" error in Terraform by @allela-roy in #788
- Fix typo in README.md for gpt3 test case by @renanmagagnin in #789
- Upgrading helm provider to v3.0.0 to support blocks to nested objects by @bluecrayon52 in #790
- Adding Region to HMA Helm chart for Nova customization with Restricted Instance Group by @bluecrayon52 in #798
- Adding example for using HyperPod training operator by @aruncs2005 in #802
- Updating EFA validation script CUDA version and OFI NCCL path by @bluecrayon52 in #800
- Update README.md by @KeitaW in #801
- add GPU accounting for SMHP by @KeitaW in #462
- Feature/slinkly slurm hyperpod eks by @bluecrayon52 in #651
- Multi-node LLM post-training with GRPO by @pbelevich in #684
- Updating NeMo 2.0 test case with latest Dockerfile by @amanshanbhag in #805
- Update Dockerfile with required packages for perf testing by @amanshanbhag in #806
- Adding efa for backwards compatibility by @amanshanbhag in #810
- Update README.md to resolve CF stack deployment error by @allela-roy in #818
- Add note to OpenZFS for Slinky and modify config by @andjsmi in #811
- regionalized hma helm deployment and simplified helm_chart module by @bluecrayon52 in #819
- patch(no-op): Add No Op for Al2023 to use default root for containerd by @nitishsaik in #820
- Adding script to parse node labels for cluster topology by @nadknish in #821
- Upgrade NCCL version to same as tests by @amanshanbhag in #823
- Revising instance types on topology visualizer by @nadknish in #822
- Implement AL2023 containerd configuration with custom data-root by @prajjwal-24 in #825
- Update containerd config for local pause image and data migration by @prajjwal-24 in #827
- Update efa-cheatsheet.md by @nghtm in #834
- Added support for nccl tests on gb200 on k8s by @harishvs in #830
- Update Neuronx-distributed llama3 example with newer NxD container im… by @flamingofugang in #829
- efa1.43.2-ofiv1.16.3-ncclv2.27.7-1-testsv2.16.9 by @pbelevich in #836
- Modify NVLinks table instance reference by @KeitaW in #835
- NVSHMEM 3.3.9 by @pbelevich in #837
- add default log rotation for Slurm daemon logs by @taewonk-amzn in #824
- CFN/TF HyperPod EKS Parameter Updates by @bluecrayon52 in #841
- Add SM 100 for GB200 by @pbelevich in #845
- Update package versions in nemo Dockerfile by @allela-roy in #847
- Update run.py to fix missing import by @allela-roy in #850
- Removing RDMA option for EFA from NCCL tests by @amanshanbhag in #851
- updated easy ssh script to add cluster names with similar prefix by @aravneelaws in #842
- Adding Terraform modules for SMHP Slurm by @allela-roy in #839
- HyperPod Slurm observability improvement by @shimomut in #857
- Distillation example by @chivatam in #758
- Delete micro-benchmarks/nccl-tests/nccl-tests-gb200.Dockerfile by @pbelevich in #844
- Remove Install AWS-OFI-NCCL plugin by @pbelevich in #861
- Ld path fix by @amanshanbhag in #862
- Bump transformers from 4.52.4 to 4.53.0 in /3.test_cases/pytorch/FSDP/src by @dependabot[bot] in #863
- Update for SDK 2.25 by @jimburtoft in #858
- Added support for topographical ordering of hostnames in mpi run by @harishvs in #846
- studio stack that includes fsx integration by @mvinci12 in #870
- Updating ddp.py path by @krao14 in #869
- Bump torch from 2.7.1 to 2.8.0 in /3.test_cases/pytorch/distillation/src by @dependabot[bot] in #860
- Added unit tests for hostfile_topologify.py by @harishvs in #866
- Fixing Slurm exporter installation script for HyperPod Slurm by @shimomut in #874
- Update nccl-tests-gb200.yaml by @RobertNorthard in #876
- updated mount_fsx.sh to configure FSxL EFA client by @mayankgupta14 in #849
- Add progress logs (debug) + specific branch (v3) to lambda_function.py by @amanshanbhag in #882
- Nodes upgraded from Amazon Linux 2 to AL23 fail to create pod sandboxes by @anshuman8800 in #884
- Update GB200 NCCL tests configuration for p6e instances by @nghtm in #886
- EFA Node exporter updates by @rpovelik in #885
- minor patch for AL2023 by @aravneelaws in #883
- Update EFA Node exporter dockerfile to latest procfs and node exporte… by @rpovelik in #887
- Fix version parsing logic of ofi nccl plugin by @Zhenye-Na in #894
- Add FSx OpenZFS support to SMHP Terraform modules by @allela-roy in #890
- nccl-tests/Dockerfile - Add support for custom aws-ofi-nccl & cleanup by @erezzarum in #881
- aravneel update lcs - openzfs mount and user add permissions by @aravneelaws in #899
- RLVR Recipe in added post-training section by @mvinci12 in #891
- fixing and updating megatron-lm sample by @allela-roy in #903
- Terraform Mods for RIG support by @bluecrayon52 in #902
- AWS Batch P6-B200 Distributed Training with Multi-Node Parallel Support by @cyberchip-wang in #893
- feat(nodeadm): enable CDI by default in containerd config by @anshuman8800 in #907
- dynamically set Global Batch Size by @allela-roy in #904
- ray dashboard integration improvement by @mvinci12 in #905
- Revert "nccl-tests/Dockerfile - Add support for custom aws-ofi-nccl &… by @pbelevich in #910
- Minor Updates for RIG Support with Better UX by @bluecrayon52 in #906
- Adding no_root_squash option to prevent root squashing from NFS side by @amanshanbhag in #915
- HyperPod EKS Lifecycle Script Improvement by @shimomut in #916
- Revert "HyperPod EKS Lifecycle Script Improvement" by @shimomut in #918
- Adding 3rd party license information of slurm_exporter by @shimomut in #928
- Updating on_create.sh to install and configure EFA client for FSx. Ma… by @mayankgupta14 in #920
- Update README to remove refactoring warning by @KeitaW in #929
- HyperPod EKS Lifecycle Script Improvement by @shimomut in #927
- Update comment for P5 FI_* in fsdp.yaml-template by @KeitaW in #934
- nccl-tests.yaml LD_LIBRARY_PATH for libnccl-net.so by @pbelevich in #935
- Update NCCL_TUNER_PLUGIN path in nccl-tests.yaml by @pbelevich in #936
- Terraform Updates for SageMaker HyperPod EKS by @bluecrayon52 in #926
- Check if on_create_main.sh exist and skip if doesn't by @shimomut in #937
- Updating CF stack for HypePod EKS to upload multiple LCS files by @shimomut in #932
- Fixing DCGM exporter container failure by @shimomut in #941
- Dynamically find the suffix of libnvidia-compute package version by @shimomut in #943
- Update TF for closed network option by @mvinci12 in #939
- enabling custom labels and taints by @mvinci12 in #946
- AMP & AMG for DCGM Exporter on EKS by @mvinci12 in #948
- Update install_docker.sh for containerd configuration by @PremiumSpider in #945
- Update CUDA version in nccl-tests-ami.sbatch by @KeitaW in #942
- Upgrade dependencies in nccl-tests Dockerfile by @KeitaW in #925
- Support new EventBridge event type "SageMaker HyperPod Cluster Event" by @shimomut in #938
- refactor: enhance hostfile_topologify.py readability by @Zhenye-Na in #909
- Fix formatting and whitespace in EFA node exporter files by @Zhenye-Na in #900
- Feat/ddp mlflow by @KeitaW in #655
- Fix DDP documentation and script bugs from conda-to-venv migration by @KeitaW in #955
New Contributors
- @Captainia made their first contribution in #631
- @nadknish made their first contribution in #666
- @nicolaven made their first contribution in #671
- @amitosaurus made their first contribution in #652
- @sammyhori made their first contribution in #686
- @vaikor-amazon made their first contribution in #692
- @divincode made their first contribution in #698
- @mvinci12 made their first contribution in #701
- @hariby made their first contribution in #770
- @koshieguchi made their first contribution in #771
- @xkoegler-h made their first contribution in #778
- @nitishsaik made their first contribution in #782
- @prajjwal-24 made their first contribution in #783
- @renanmagagnin made their first contribution in #789
- @andjsmi made their first contribution in #811
- @harishvs made their first contribution in #830
- @flamingofugang made their first contribution in #829
- @taewonk-amzn made their first contribution in #824
- @aravneelaws made their first contribution in #842
- @chivatam made their first contribution in #758
- @krao14 made their first contribution in #869
- @RobertNorthard made their first contribution in #876
- @mayankgupta14 made their first contribution in #849
- @anshuman8800 made their first contribution in #884
- @rpovelik made their first contribution in #885
- @Zhenye-Na made their first contribution in #894
- @erezzarum made their first contribution in #881
- @cyberchip-wang made their first contribution in #893
- @PremiumSpider made their first contribution in #945
Full Changelog: v1.1.0...v1.2.0