Move verl/rlvr to verl/hyperpod-eks/rlvr by KeitaW · Pull Request #960 · aws-samples/awsome-distributed-training

KeitaW · 2026-02-16T02:24:20Z

Summary

Moves 3.test_cases/pytorch/verl/rlvr/ to 3.test_cases/pytorch/verl/hyperpod-eks/rlvr/ to clarify that this test case is specific to HyperPod EKS, not general EKS usage
Includes all MTC (Managed Tiered Checkpointing) additions from mtc with rlvr #912 in the new location
Updates internal path references in README

Context

Part of cleanup effort for #959. The RLVR + MTC content is HyperPod EKS specific and the directory structure should reflect that.

Test plan

Verify all files moved correctly with no broken internal references
Confirm MTC subdirectory and IRSA setup scripts are present in new location
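
The first test-plan item can be approximated with a small scan for leftover mentions of the old directory. This is a minimal sketch, not part of the PR: the function name `find_stale_references` and the list of file suffixes are assumptions made for illustration, and only the two paths come from the summary above.

```python
import tempfile
from pathlib import Path

# Paths taken from the PR summary; everything else here is illustrative.
OLD_PATH = "3.test_cases/pytorch/verl/rlvr"
NEW_PATH = "3.test_cases/pytorch/verl/hyperpod-eks/rlvr"

# Assumed set of text file types worth scanning (hypothetical choice).
TEXT_SUFFIXES = (".md", ".sh", ".yaml", ".yml", ".py")


def find_stale_references(root, old_path=OLD_PATH, suffixes=TEXT_SUFFIXES):
    """Return (relative_file, line_number, line) for each line still naming old_path."""
    root = Path(root)
    hits = []
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        for number, line in enumerate(text.splitlines(), start=1):
            # The new path does not contain the old one as a substring,
            # so updated references are not reported.
            if old_path in line:
                hits.append((str(path.relative_to(root)), number, line.strip()))
    return hits


if __name__ == "__main__":
    # Tiny self-contained demo on a throwaway directory.
    with tempfile.TemporaryDirectory() as tmp:
        readme = Path(tmp) / "README.md"
        readme.write_text(f"cd {OLD_PATH}\ncd {NEW_PATH}\n", encoding="utf-8")
        for hit in find_stale_references(tmp):
            print(hit)
```

Running it against a checkout of the branch and getting an empty list would suggest, under these assumptions, that no scanned file still points at the old location.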


* Improved lifecycle script for HP-EKS
- Separate bootstrap script and main script to redirect all stdout/stderr to CloudWatch Logs
- Redirect Kubelet data path in addition to containerd.
- Allow choosing volume for containerd and kubelet.
* Updated message to explain why 60 seconds
* Updating Terraform for HyperPod EKS, to upload multiple lifecycle script files to S3


…king it consistent to Slurm, which now supports EFA for fsx. (#920)

The change checks that the OS is supported for EFA-backed FSx and that the instance has EFA available before proceeding with client installation. Because FSx is mounted later on EKS, we cannot verify beforehand whether the file system is EFA-enabled, or whether the file system and the instance are in the same AZ. In those cases FSx automatically falls back to TCP instead of EFA, so there is no drawback to installing the client at provisioning time.


* Improved lifecycle script for HP-EKS
- Separate bootstrap script and main script to redirect all stdout/stderr to CloudWatch Logs
- Redirect Kubelet data path in addition to containerd.
- Allow choosing volume for containerd and kubelet.
* Updating Terraform for HyperPod EKS, to upload multiple lifecycle script files to S3
* Moving EFA-FSxL client installation to on_create_main.sh

* Update comment for P5 FI_* in fsdp.yaml-template
* Update comment for P5 FI_* environment variable
* Update comment for P5 FI_PROVIDER in YAML file
* Update comment for P5 FI_* environment variable
* Update comment for P5 FI_* configuration
* Update comment for P5 FI_* environment variable
* Update comment for P5 FI_ configuration in YAML
* Update comment for P5 FI_* environment variable
* Update comment on FI_PROVIDER in YAML file
* Update comment for P5 FI_* configuration
* Update comment for P5 FI_* configuration


* adding observability module
* wired observability module into main
* updated to multi-az subnet deployment
* added hpto, hpio, logging s3 bucket, and cert-manager eks addon
* added alb controller patching for sagemaker tolerations, auth for kubernetes provider
* added task governance, moved all disabled features for RIG into locals
* added updates to conditional logic
* updated aws provider version, dependency chain for hpto, added guardduty vpce cleanup
* modularized hpto and task governance, added waits for node, pod identity, and cert manager, consolidated provider versions
* pass cluster_arn to hyperpod_cluster module to avoid forced replacement, fixed grafana rule alerts, extended grafana service token
* added fsx_lustre module, existing subnet to az map, git cleanup for helm_releases, and some code tidy
* update hpio policies, add oidc provider, code tidy
* added lifecycle rules and triggers to avoid forced resource replacement on update
* reverting change on null resource triggers
* added additional git cleanup steps, updated README.md
* updated examples with availability_zone_id in instance group definition
* added instructions for GuardDuty cleanup script
* Fixes for Observability addon (#924)
* Fixes for observability addOn
* Adding region validation for AMP
* Adding tag for Sagemaker, fixing the observability addOn condition
* added wait for kueue mutating webhook
* Fix dashboard location
* readme update
* updates to FSxL CSI Driver and HPTO pod identity association, README to include training plan examples
* code tidy
* update default eks version in variables.tf to 1.33
* reverting custom.tfvars to generic naming
* adding static FSxL provisioning
* configurable fsxl pvc namespace, removed timestamp triggers from helm null resources, updated ignore changes for observability resources
* adding ignore lifecycles for variable rereads
* convert instance_groups from map to list to allow user to preserve index independent of name alphabetical order

---------

Co-authored-by: Madhubalasri-B <madbal@amazon.com>
Co-authored-by: Mark Vinciguerra <mvincig@amazon.com>


* Update install_docker.sh for containerd configuration

## Summary
Fixes sed regex in containerd root configuration to use correct single backslash `\?` instead of double backslash `\\?`.

## Problem
The double backslash `\\?` in the sed pattern looks for a literal backslash character, not an optional `#`. This prevents the containerd root configuration from being uncommented and updated.

## Solution
Changed sed pattern from `^#\\?root` to `^#\?root` to correctly match optional `#` character.

## Testing
Verified on live cluster that:
- Double backslash `\\?` fails: `#root = "/var/lib/containerd"` (stays commented)
- Single backslash `\?` works: `root = "/opt/sagemaker/containerd/data-root"` (uncommented and updated)

## Changes
- Line 84: Containerd config for `/opt/sagemaker` with correct sed
- Line 101: Containerd config for `/opt/dlami/nvme` with correct sed

Fixes the issue reported in PR #914.

* Fix containerd path naming consistency and sed regex
- Use consistent naming: containerd/data-root for both paths
- Fix sed regex: use `\?` instead of `\\?` to match optional #
- Minimal changes: just sed commands without extra logic
- Addresses feedback from PR #914

* Add containerd restart to apply config changes

Containerd only reads config at startup, so restart is required to apply the new root path. Verified containerd data is written to the new location after restart.
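
The intended effect of the corrected pattern, matching a `root` line whether or not it starts with `#` and rewriting it, can be illustrated with a Python analogue. This is a sketch under stated assumptions, not the command used in install_docker.sh: the function name `set_containerd_root` is hypothetical, and Python's `re` syntax differs from sed's basic regular expressions (in Python, a bare `?` is already the "optional" quantifier).

```python
import re

# A `root = ...` line, optionally commented out with a single leading '#'.
# [ \t]* is used instead of \s* so the match never spans a newline.
ROOT_LINE = re.compile(r"^#?[ \t]*root[ \t]*=.*$", flags=re.MULTILINE)


def set_containerd_root(config_text, new_root):
    """Uncomment (if needed) and update the first `root = ...` line."""
    replacement = f'root = "{new_root}"'
    # A function replacement avoids backslash processing in the new value.
    return ROOT_LINE.sub(lambda _match: replacement, config_text, count=1)


if __name__ == "__main__":
    before = '#root = "/var/lib/containerd"\nstate = "/run/containerd"\n'
    print(set_containerd_root(before, "/opt/sagemaker/containerd/data-root"))
```

As the last commit message notes, rewriting the file is not enough on its own: containerd has to be restarted before the new root takes effect.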

* Upgrade dependencies in nccl-tests Dockerfile

Updated CUDA, EFA, AWS OFI NCCL, NCCL, and NCCL tests versions in the Dockerfile.

Update EFA installer to 1.45.1 which supports https://github.com/aws/aws-ofi-nccl/releases/tag/v1.17.2

> Upgrade to libnccl-ofi 1.17.2 (https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-changelog.html)

EFA_INSTALLER_VERSION=1.45.1
AWS_OFI_NCCL_VERSION=1.17.2

Update NCCL to v2.28.7-1, the latest version supported by aws-ofi-nccl 1.17.2. https://github.com/aws/aws-ofi-nccl/releases

> The 1.17.2 release series supports NCCL v2.28.7-1 while maintaining backward compatibility with older NCCL versions (NCCL v2.17.1 and later).
> With this release, building with platform-aws requires Libfabric v1.22.0amzn4.0 or greater. And it is currently tested with versions up to Libfabric v2.3.1amzn1.0.

Update CUDA to 12.9.1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/ -> Provides https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/libnccl2_2.28.7-1+cuda12.9_amd64.deb

* Update NCCL and related package versions

Updated versions for EFA installer, NCCL, and NCCL tests.

---------

Co-authored-by: Pavel Belevich <belevich@amazon.com>


* ignore venv
* make ddp test case compatible with CPU/GPU and add mlflow support
* make the test case compatible with Managed MLFlow
* update
* Update 3.test_cases/pytorch/ddp/slurm/0.create-venv.sh
* Fix missing pip install in Dockerfile

Add pip install command for mlflow and sagemaker-mlflow packages

* Add MLflow error handling and update dependencies for Ubuntu 22.04
- Wrap MLflow calls in try/except so training continues if tracking is unavailable
- Remove Ubuntu 20.04 Python version check from create-venv.sh
- Update torch 2.1.1->2.10.0, torchvision 0.16.1->0.25.0, unpin numpy
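
The "wrap MLflow calls in try/except" item describes a best-effort tracking pattern. The sketch below shows one way that pattern could look. It is not the code from the DDP test case: `safe_track` and `train` are hypothetical names, the loss value is a placeholder, and the tracking function is passed in as a plain callable so the example runs without MLflow installed.

```python
import logging

logger = logging.getLogger("tracking")


def safe_track(call, *args, **kwargs):
    """Run a tracking call; log and swallow any failure so training continues."""
    try:
        call(*args, **kwargs)
        return True
    except Exception as exc:  # tracking is best-effort by design
        name = getattr(call, "__name__", repr(call))
        logger.warning("Tracking call %s failed: %s", name, exc)
        return False


def train(num_epochs, log_metric):
    """Toy loop: log_metric is any callable taking (key, value, step=...)."""
    losses = []
    for epoch in range(num_epochs):
        loss = 1.0 / (epoch + 1)  # placeholder for a real training step
        losses.append(loss)
        safe_track(log_metric, "loss", loss, step=epoch)
    return losses


if __name__ == "__main__":
    def unavailable_tracker(key, value, step=None):
        raise ConnectionError("tracking server unreachable")

    # Training still finishes even though every tracking call fails.
    print(train(3, unavailable_tracker))
```

With MLflow installed, `mlflow.log_metric` accepts the same `(key, value, step=...)` arguments and could be passed as `log_metric`.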

)
- Remove --use-mlflow from TORCHRUN_ARGS in container sbatch (crashes torchrun)
- Fix undefined ${ENROOT_IMAGE} variable in enroot image script
- Fix Kubernetes template: rename fsdp→ddp, fix torchrun path, fix positional args
- Update READMEs: replace stale conda/fsdp references with venv/ddp
- Fix MLflow default URI documentation to match actual code default
- Fix script filenames in READMEs to match actual files on disk


* FSDP2: updated files from the sample repo
* updated sbatch files with conda activate command
* Refactor FSDP training configuration
- Downgrade PyTorch from 2.9.1 to 2.7.0
- Split Dockerfile into multi-stage build (Base and HTPO stages)
- Remove venv activation from Slurm training scripts
- Reverting Kubernetes template configuration
* fixing typo HTPO to HPTO
* Docker target does not accept uppercase, so changed it to lowercase (base or hpto); 2.7.0 is not supported, so switching to 2.6.0 for FSDP2 and HyperPod Elastic Agent
* Update PyTorch and related package versions

Let's come back to this in a separate PR. For now, let me merge the PR before we address #959

---------

Co-authored-by: Keita Watanabe <mlkeita@amazon.com>

* NeMo 2 Performance instructions
* Update PERFORMANCE.md
* Update PERFORMANCE.md
* Address review feedback for NeMo 2 Performance instructions
- Add Table of Contents and improve document structure
- Add Prerequisites section with NeMo version compatibility matrix
- Update EFA installation instructions with links to AWS docs
- Add Environment Variables configuration section
- Add Expected Outputs section explaining performance metrics
- Add Multi-Node Distributed Training section with examples
- Fix 'error' placeholder with explanatory note about B200 configuration
- Add section headers for Mixtral, Nemotron, and DeepSeek models
- Improve Fine-Tuning section with better formatting and HF_TOKEN note
- Add comprehensive Troubleshooting section
- Update EFA installer version from 1.43.1 to 1.47.0 in Dockerfile

Fixes review comments from nghtm and KeitaW

* Update Slurm workflows to use GitHub-hosted runners with SSH
- Migrate fsdp-regression-test-container.yml from self-hosted to ubuntu-latest + SSH
- Migrate fsdp-regression-test-venv.yml from self-hosted to ubuntu-latest + SSH
- Migrate megatron-ci-slurm.yaml from self-hosted to ubuntu-latest + SSH
- Add AWS OIDC authentication for all workflows
- Add real-time log streaming from p5en.smml.aiml.aws.dev cluster
- Add SSH retry logic and job cancellation on workflow abort
- Implement enroot image cleanup after test completion
- Add new pr-review-and-slurm-test.yml for comprehensive PR testing
- Use /fsx/agents/pr-reviews/ for code/checkpoints and /home/ghactions for logs

---------

Co-authored-by: Pavel Belevich <belevichp@gmail.com>
Co-authored-by: Pavel Belevich <belevich@amazon.com>
Co-authored-by: Keita Watanabe <mlkeita@amazon.com>


bluecrayon52 and others added 30 commits December 8, 2025 06:41

Minor Updates for RIG Support with Better UX (#906)

61a36b4

* auto-disabled igs and lcs for rig mode for better UX
* simplified logic for disabling s3_bucket module, added conditional outputs for s3_bucket module

Adding no_root_squash option to prevent root squashing from NFS side (#915)

2c46dd1

Revert "HyperPod EKS Lifecycle Script Improvement (#916)" (#918)

eb8f9ca

This reverts commit fca2364.

Adding 3rd party license information of slurm_exporter. (#928)

29b0f12

Update README to remove refactoring warning (#929)

1937c80

Removed warning about major refactoring and deprecated test cases.

nccl-tests.yaml LD_LIBRARY_PATH for libnccl-net.so (#935)

a6efe92

Update NCCL_TUNER_PLUGIN path in nccl-tests.yaml (#936)

9e2d9bc

Fix related to #861 and #935 The correct path for NCCL_TUNER_PLUGIN is /opt/amazon/ofi-nccl/lib/x86_64-linux-gnu/libnccl-ofi-tuner.so

Check if on_create_main.sh exist and skip if doesn't (#937)

645c1eb

Updating CF stack for HP-EKS to upload multiple LCS files (#932)

32a449f

Detect right version of nvidia-container-toolkit from pre-installed nvidia-container-toolkit-base (#941)

ba28c61

Dynamically find the suffix of libnvidia-compute package version (#943)

8fae1a9

Update TF for closed network option (#939)

03c6318

* Update TF for closed network option
* including creating new vpc, subnets, etc in closed network
* rm simple copy
* removed unnecessary image copies from feedback

enabling custom labels and taints (#946)

51223d4

* enabling custom labels and taints
* revert to og

AMP & AMG for DCGM Exporter on EKS (#948)

5288ecf

Update CUDA version in nccl-tests-ami.sbatch (#942)

fc6cdbc

Support new EventBridge event type "SageMaker HyperPod Cluster Event" (#938)

e2c0437

* Support new event type "SageMaker HyperPod Cluster Event"
* Refer to new workshop page

refactor: enhance hostfile_topologify.py readability (#909)

5627949

- Adopt Pythonic idioms (is None, exception handling)
- Use defaultdict to simplify nested dictionary initialization
- Use context managers for cleaner file handling

Signed-off-by: Nathan Na <nzhenye@amazon.com>
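
The three idioms named in this commit can be shown together in a short sketch. It does not reproduce hostfile_topologify.py: the `host spine leaf` input format and the names `group_hosts` and `first_host` are assumptions made purely for illustration.

```python
from collections import defaultdict


def group_hosts(path):
    """Read 'host spine leaf' lines and return {spine: {leaf: [hosts]}}."""
    # Nested defaultdict: no manual "if key not in dict" initialization needed.
    groups = defaultdict(lambda: defaultdict(list))
    # Context manager: the file is closed even if parsing raises.
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            fields = line.split()
            if len(fields) != 3:
                continue  # skip blank or malformed lines
            host, spine, leaf = fields
            groups[spine][leaf].append(host)
    return groups


def first_host(groups, spine, leaf):
    """Return the first host under spine/leaf, or None if there is none."""
    # .get() avoids creating empty entries in the defaultdict as a side effect.
    hosts = groups.get(spine, {}).get(leaf)
    if hosts is None:  # identity comparison, not "== None"
        return None
    return hosts[0]
```

A nested `defaultdict` compares equal to a plain nested `dict` with the same contents, so callers that only read the result do not need to know which one they received.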

Fix formatting and whitespace in EFA node exporter files (#900)

1296e24

* Fix version parsing logic of ofi nccl plugin

Signed-off-by: Nathan Na <nzhenye@amazon.com>

* docs: fix typos and improve formatting consistency

---------

Signed-off-by: Nathan Na <nzhenye@amazon.com>

Update workshop links in README.md

8fc4e4a

Modify CODEOWNERS for HyperPod lifecycle scripts

46710bd

Updated CODEOWNERS to reflect new team for approvals.

paragao and others added 3 commits February 16, 2026 11:03

Fix typo in val_batch_size and remove unused imports (#908)

38eb8d8

* Fix typo in val_batch_size and remove unused imports

Fix typo in val_batch_size and remove unused imports

* Fix typo in val_batch_size and remove unused imports

---------

Co-authored-by: Keita Watanabe <mlkeita@amazon.com>

Move verl/rlvr to verl/hyperpod-eks/rlvr for HyperPod EKS specificity

3e08883

The RLVR test case with MTC support is specific to HyperPod EKS and not generally useful for all EKS users. Move the directory to make this distinction clear in the repo structure.

KeitaW closed this Feb 16, 2026

KeitaW deleted the merge-pr-912-hyperpod-eks branch February 16, 2026 02:28

KeitaW wants to merge 33 commits into mtc-rlvr from merge-pr-912-hyperpod-eks


12 participants
