Closed
* auto-disabled igs and lcs for rig mode for better UX * simplified logic for disabling s3_bucket module, added conditional outputs for s3_bucket module
* Improved lifecycle script for HP-EKS - Separate bootstrap script and main script to redirect all stdout/stderr to CloudWatch Logs - Redirect Kubelet data path in addition to containerd. - Allow choosing volume for containerd and kubelet. * Updated message to explain why 60 seconds * Updating Terraform for HyperPod EKS, to upload multiple lifecycle script files to S3
…king it consistent to Slurm, which now supports EFA for fsx. (#920) The change is checking and making sure the OS is supported for EFA backed FSx and the instance has EFA available before proceeding with client installation. Since fsx is mounted later with eks, we can't verify if fsx is efa enabled or not beforehand, or if the fsx and instance are in the same AZ, but in that case, fsx will automatically fall back to use TCP instead of EFA, so, there is no drawback in installing the client at provisioning time.
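The gating described above (install the client only when the OS is supported and the instance exposes EFA) could be sketched as follows. This is a minimal illustration, not the lifecycle script's actual logic: the `NodeInfo` type, the `should_install_efa_fsx_client` function, and the contents of `ASSUMED_SUPPORTED_OS` are all hypothetical, since the commit message does not list the supported OS releases or say how EFA is detected.

```python
from dataclasses import dataclass

# Hypothetical allow-list. The commit message does not name the supported
# OS releases, so these entries are placeholders for illustration only.
ASSUMED_SUPPORTED_OS = frozenset({"ubuntu-22.04", "amzn-2023"})


@dataclass(frozen=True)
class NodeInfo:
    """Facts about a node gathered at provisioning time (hypothetical shape)."""
    os_release: str
    efa_device_count: int


def should_install_efa_fsx_client(node, supported_os=ASSUMED_SUPPORTED_OS):
    """Return True only if both preconditions from the commit message hold."""
    if node.os_release not in supported_os:
        return False  # OS not supported for EFA-backed FSx
    # No check of the file system itself: per the commit message, it is
    # mounted later and falls back to TCP if EFA cannot be used.
    return node.efa_device_count > 0
```

Whether the file system is EFA-enabled, or in the same AZ, is deliberately left out of the check, matching the reasoning in the commit message that installing the client early has no drawback.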
Removed warning about major refactoring and deprecated test cases.
* Improved lifecycle script for HP-EKS - Separate bootstrap script and main script to redirect all stdout/stderr to CloudWatch Logs - Redirect Kubelet data path in addition to containerd. - Allow choosing volume for containerd and kubelet. * Updating Terraform for HyperPod EKS, to upload multiple lifecycle script files to S3 * Moving EFA-FSxL client installation to on_create_main.sh
* Update comment for P5 FI_* in fsdp.yaml-template * Update comment for P5 FI_* environment variable * Update comment for P5 FI_PROVIDER in YAML file * Update comment for P5 FI_* environment variable * Update comment for P5 FI_* configuration * Update comment for P5 FI_* environment variable * Update comment for P5 FI_ configuration in YAML * Update comment for P5 FI_* environment variable * Update comment on FI_PROVIDER in YAML file * Update comment for P5 FI_* configuration * Update comment for P5 FI_* configuration
* adding observability module * wired observability module into main * updated to multi-az subnet deployment * added hpto, hpio, logging s3 bucket, and cert-manager eks addon * added alb controller patching for sagemaker tolerations, auth for kubernetes provider * added task governance, moved all disabled features for RIG into locals * added updates to conditional logic * updated aws provider version, dependency chain for hpto, added guardduty vpce cleanup * modularized hpto and task governance, added waits for node, pod identity, and cert manager, consolidated provider versions * pass cluster_arn to hyperpod_cluster module to avoid forced replacement, fixed grafana rule alerts, extended grafana service token * added fsx_lustre module, existing subnet to az map, git cleanup for helm_releases, and some code tidy * update hpio policies, add oidc provider, code tidy * added lifecycle rules and triggers to avoid forced resource replacement on update * reverting change on null resource triggers * added additional git cleanup steps, updated README.md * updated examples with availability_zone_id in instance group definition * added instructions for GuardDuty cleanup script * Fixes for Observability addon (#924) * Fixes for observability addOn * Adding region validation for AMP * Adding tag for Sagemaker, fixing the observability addOn condition * added wait for kueue mutating webhook * Fix dashboard location * readme update * updates to FSxL CSI Driver and HPTO pod identity association, README to include training plan examples * code tidy * update default eks version in variables.tf to 1.33 * reverting custom.tfvars to generic naming * adding static FSxL provisioning * configurable fsxl pvc namespace, removed timestamp triggers from helm null resources, updated ignore changes for observability resources * adding ignore lifecycles for variable rereads * convert instance_groups from map to list to allow user to preserve index independent of name alphabetical order ---------
Co-authored-by: Madhubalasri-B <madbal@amazon.com> Co-authored-by: Mark Vinciguerra <mvincig@amazon.com>
…vidia-container-toolkit-base (#941)
* Update TF for closed network option * including creating new vpc, subnets, etc in closed network * rm simple copy * removed unnecessary image copies based on feedback
* enabling custom labels and taints * revert to og
* Update install_docker.sh for containerd configuration

## Summary
Fixes sed regex in containerd root configuration to use correct single backslash `\?` instead of double backslash `\\?`.

## Problem
The double backslash `\\?` in the sed pattern looks for a literal backslash character, not an optional `#`. This prevents the containerd root configuration from being uncommented and updated.

## Solution
Changed sed pattern from `^#\\?root` to `^#\?root` to correctly match optional `#` character.

## Testing
Verified on live cluster that:
- Double backslash `\\?` fails: `#root = "/var/lib/containerd"` (stays commented)
- Single backslash `\?` works: `root = "/opt/sagemaker/containerd/data-root"` (uncommented and updated)

## Changes
- Line 84: Containerd config for `/opt/sagemaker` with correct sed
- Line 101: Containerd config for `/opt/dlami/nvme` with correct sed

Fixes the issue reported in PR #914.

* Fix containerd path naming consistency and sed regex
- Use consistent naming: containerd/data-root for both paths
- Fix sed regex: use `\?` instead of `\\?` to match optional `#`
- Minimal changes: just sed commands without extra logic
- Addresses feedback from PR #914

* Add containerd restart to apply config changes
Containerd only reads config at startup, so restart is required to apply the new root path. Verified containerd data is written to the new location after restart.
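The difference between the two sed patterns can be reproduced in Python as a rough sketch. The patterns below are hand translations of the sed basic-regex forms into Python `re` syntax, not the commands from `install_docker.sh`; the `set_root` helper, the sample config text, and the replacement format are assumptions for illustration.

```python
import re

# Sample config line as reported in the commit message's Testing section.
CONFIG = '#root = "/var/lib/containerd"\nstate = "/run/containerd"\n'
NEW_ROOT = "/opt/sagemaker/containerd/data-root"

# sed `^#\?root` (GNU basic regex: optional '#') translates to `^#?root`.
FIXED = re.compile(r"^#?root\s*=.*$", re.MULTILINE)

# sed `^#\\?root` means '#', a literal backslash, a literal '?', then 'root'.
# In Python syntax that is `^#\\\?root`, which a normal config line never matches.
BROKEN = re.compile(r"^#\\\?root\s*=.*$", re.MULTILINE)


def set_root(text, pattern, new_root):
    """Replace the (possibly commented) root line; leave text as-is on no match."""
    return pattern.sub(lambda _match: f'root = "{new_root}"', text)


fixed_result = set_root(CONFIG, FIXED, NEW_ROOT)
broken_result = set_root(CONFIG, BROKEN, NEW_ROOT)
```

With the fixed pattern the first line becomes an uncommented `root = ...` entry, while the broken pattern leaves the text unchanged, which mirrors the "stays commented" observation in the commit message.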
* Upgrade dependencies in nccl-tests Dockerfile Updated CUDA, EFA, AWS OFI NCCL, NCCL, and NCCL tests versions in the Dockerfile. Update EFA installer to 1.45.1 which supports https://github.com/aws/aws-ofi-nccl/releases/tag/v1.17.2 > Upgrade to libnccl-ofi 1.17.2 (https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-changelog.html) EFA_INSTALLER_VERSION=1.45.1 AWS_OFI_NCCL_VERSION=1.17.2 Update NCCL to v2.28.7-1, the latest version supported by aws-ofi-nccl 1.17.2. https://github.com/aws/aws-ofi-nccl/releases > The 1.17.2 release series supports NCCL v2.28.7-1 while maintaining backward compatibility with older NCCL versions (NCCL v2.17.1 and later). > With this release, building with platform-aws requires Libfabric v1.22.0amzn4.0 or greater. And it is currently tested with versions up to Libfabric v2.3.1amzn1.0. Update CUDA to 12.9.1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/ -> Provides https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/libnccl2_2.28.7-1+cuda12.9_amd64.deb * Update NCCL and related package versions Updated versions for EFA installer, NCCL, and NCCL tests. --------- Co-authored-by: Pavel Belevich <belevich@amazon.com>
…#938) * Support new event type "SageMaker HyperPod Cluster Event" * Refer to new workshop page
- Adopt Pythonic idioms (is None, exception handling) - Use defaultdict to simplify nested dictionary initialization - Use context managers for cleaner file handling Signed-off-by: Nathan Na <nzhenye@amazon.com>
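The three idioms named in this commit can be shown together in a short sketch. The functions and data shapes below (`group_events`, `load_events`, the event dictionaries) are hypothetical and are not taken from the changed file; they only illustrate `is None` checks, `defaultdict` for nested dictionaries, and context managers with exception handling.

```python
import json
from collections import defaultdict


def group_events(events, default_region=None):
    """Group event ids by region and then by type (hypothetical data shape)."""
    # Compare against None with `is`, not `==`.
    if default_region is None:
        default_region = "unknown"
    # Nested defaultdict removes the need for "if key not in dict" setup code.
    grouped = defaultdict(lambda: defaultdict(list))
    for event in events:
        region = event.get("region", default_region)
        grouped[region][event["type"]].append(event["id"])
    return grouped


def load_events(path):
    """Read a JSON list of events, returning [] if the file is missing or invalid."""
    try:
        # The context manager closes the file even if json.load raises.
        with open(path, encoding="utf-8") as handle:
            return json.load(handle)
    except (OSError, json.JSONDecodeError):
        return []
```

Catching only `OSError` and `json.JSONDecodeError` keeps unrelated bugs visible instead of hiding them behind a bare `except`.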
* Fix version parsing logic of ofi nccl plugin Signed-off-by: Nathan Na <nzhenye@amazon.com> * docs: fix typos and improve formatting consistency --------- Signed-off-by: Nathan Na <nzhenye@amazon.com>
* ignore venv * make ddp test case compatible with CPU/GPU and add mlflow support * make the test case compatible with Managed MLFlow * update * Update 3.test_cases/pytorch/ddp/slurm/0.create-venv.sh * Fix missing pip install in Dockerfile Add pip install command for mlflow and sagemaker-mlflow packages * Add MLflow error handling and update dependencies for Ubuntu 22.04 - Wrap MLflow calls in try/except so training continues if tracking is unavailable - Remove Ubuntu 20.04 Python version check from create-venv.sh - Update torch 2.1.1->2.10.0, torchvision 0.16.1->0.25.0, unpin numpy
) - Remove --use-mlflow from TORCHRUN_ARGS in container sbatch (crashes torchrun) - Fix undefined ${ENROOT_IMAGE} variable in enroot image script - Fix Kubernetes template: rename fsdp→ddp, fix torchrun path, fix positional args - Update READMEs: replace stale conda/fsdp references with venv/ddp - Fix MLflow default URI documentation to match actual code default - Fix script filenames in READMEs to match actual files on disk
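The "wrap MLflow calls in try/except so training continues if tracking is unavailable" change can be sketched without depending on MLflow itself. In the example below, `safe_track` and `train` are hypothetical names, and `log_metric` stands in for whatever tracking call the training script makes; this is an illustration of the pattern, not the test case's actual code.

```python
import logging

logger = logging.getLogger("ddp_example")


def safe_track(log_fn, *args, **kwargs):
    """Call a tracking function; log a warning and return False if it fails."""
    try:
        log_fn(*args, **kwargs)
        return True
    except Exception as exc:  # tracking must never stop training
        logger.warning("Tracking call failed, continuing: %s", exc)
        return False


def train(num_steps, log_metric):
    """Toy training loop that keeps going when metric logging raises."""
    completed = 0
    for step in range(num_steps):
        loss = 1.0 / (step + 1)  # placeholder for a real training step
        safe_track(log_metric, "loss", loss, step=step)
        completed += 1
    return completed
```

Passing the tracking function in as an argument keeps the loop testable with a stub, for example one that always raises `ConnectionError` to simulate an unreachable tracking server.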
Updated CODEOWNERS to reflect new team for approvals.
* FSDP2: updated files from the sample repo * updated sbatch files with conda activate command * Refactor FSDP training configuration - Downgrade PyTorch from 2.9.1 to 2.7.0 - Split Dockerfile into multi-stage build (Base and HTPO stages) - Remove venv activation from Slurm training scripts - Reverting Kubernetes template configuration * fixing typo HTPO to HPTO * Docker build targets do not accept uppercase, so changed them to lowercase (base or hpto); 2.7.0 is not supported, so switching to 2.6.0 for FSDP2 and HyperPod Elastic Agent * Update PyTorch and related package versions Let's come back to this in a separate PR. For now, let me merge the PR before we address #959 --------- Co-authored-by: Keita Watanabe <mlkeita@amazon.com>
* NeMo 2 Performance instructions * Update PERFORMANCE.md * Update PERFORMANCE.md * Address review feedback for NeMo 2 Performance instructions - Add Table of Contents and improve document structure - Add Prerequisites section with NeMo version compatibility matrix - Update EFA installation instructions with links to AWS docs - Add Environment Variables configuration section - Add Expected Outputs section explaining performance metrics - Add Multi-Node Distributed Training section with examples - Fix 'error' placeholder with explanatory note about B200 configuration - Add section headers for Mixtral, Nemotron, and DeepSeek models - Improve Fine-Tuning section with better formatting and HF_TOKEN note - Add comprehensive Troubleshooting section - Update EFA installer version from 1.43.1 to 1.47.0 in Dockerfile Fixes review comments from nghtm and KeitaW * Update Slurm workflows to use GitHub-hosted runners with SSH - Migrate fsdp-regression-test-container.yml from self-hosted to ubuntu-latest + SSH - Migrate fsdp-regression-test-venv.yml from self-hosted to ubuntu-latest + SSH - Migrate megatron-ci-slurm.yaml from self-hosted to ubuntu-latest + SSH - Add AWS OIDC authentication for all workflows - Add real-time log streaming from p5en.smml.aiml.aws.dev cluster - Add SSH retry logic and job cancellation on workflow abort - Implement enroot image cleanup after test completion - Add new pr-review-and-slurm-test.yml for comprehensive PR testing - Use /fsx/agents/pr-reviews/ for code/checkpoints and /home/ghactions for logs --------- Co-authored-by: Pavel Belevich <belevichp@gmail.com> Co-authored-by: Pavel Belevich <belevich@amazon.com> Co-authored-by: Keita Watanabe <mlkeita@amazon.com>
* Fix typo in val_batch_size and remove unused imports Fix typo in val_batch_size and remove unused imports * Fix typo in val_batch_size and remove unused imports --------- Co-authored-by: Keita Watanabe <mlkeita@amazon.com>
The RLVR test case with MTC support is specific to HyperPod EKS and not generally useful for all EKS users. Move the directory to make this distinction clear in the repo structure.
Summary
Move 3.test_cases/pytorch/verl/rlvr/ to 3.test_cases/pytorch/verl/hyperpod-eks/rlvr/ to clarify that this test case is specific to HyperPod EKS, not general EKS usage.
Context
Part of cleanup effort for #959. The RLVR + MTC content is HyperPod EKS specific and the directory structure should reflect that.
Test plan