Commit d9bf867

Merge branch 'main' into update-slurm-workflows-ssh
2 parents 198b726 + 223c9e9 commit d9bf867

286 files changed: +19314 -4628 lines changed

.github/CODEOWNERS

Lines changed: 8 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -0,0 +1,8 @@
1+
# See https://help.github.com/articles/about-codeowners/
2+
3+
4+
# Requires review from a member of the @aws-samples/sagemaker-hyperpod-dev team for HyperPod lifecycle scripts.
5+
# They must approve any PRs that modify files under either base-config directory,
6+
# including all nested subdirectories and files.
7+
/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config @aws-samples/hyperpod-lcs-dev
8+
/1.architectures/7.sagemaker-hyperpod-eks/LifecycleScripts/base-config @aws-samples/hyperpod-lcs-dev

.gitignore

Lines changed: 14 additions & 0 deletions
@@ -120,6 +120,10 @@ ENV/
120120
env.bak/
121121
venv.bak/
122122

123+
# User-specific environment variables (keep .example files)
124+
**/env_vars
125+
!**/env_vars.example
126+
123127
# IDE
124128
.c9/
125129
.idea/
@@ -153,3 +157,13 @@ wandb/
153157

154158
# Enroot container image
155159
*.sqsh
160+
161+
*.csv
162+
submitted_jobs*.txt
163+
topo_sorted_hostnames.txt
164+
micro-benchmarks/nccl-tests/slurm/topology-aware-nccl-tests/hostnames.txt
165+
*.patch
166+
micro-benchmarks/nccl-tests/slurm/find_bad_nodes/logs/analysis_summary_*.txt
167+
micro-benchmarks/nccl-tests/slurm/find_bad_nodes/logs/node_combinations_*.txt
168+
.testenv
169+
micro-benchmarks/nccl-tests/slurm/topology-aware-nccl-tests/debug_topology.json

1.architectures/0.common/README.md

Lines changed: 1 addition & 1 deletion
@@ -9,6 +9,6 @@ This template creates a S3 Bucket with all public access disabled. To deploy it,
99

1010
## HyperPod cluster status change / node health event notifications
1111

12-
This template deploys a stack to receive human-readable email notifications for HyperPod cluster status changes and node health events. See the [workshop page](https://catalog.workshops.aws/sagemaker-hyperpod/en-US/07-tips-and-tricks/26-event-bridge) for more details.
12+
This template deploys a stack to receive human-readable email notifications for HyperPod cluster status changes and node health events. See the [workshop page](https://awslabs.github.io/ai-on-sagemaker-hyperpod/docs/Tips/EKS/event-bridge) for more details.
1313

1414
[<kbd> <br> 1-Click Deploy 🚀 <br> </kbd>](https://console.aws.amazon.com/cloudformation/home?#/stacks/quickcreate?templateURL=https://ws-assets-prod-iad-r-iad-ed304a55c2ca1aee.s3.us-east-1.amazonaws.com/e3752eec-63b5-4033-9720-fa68d35164e9/hyperpod-event-bridge-email.yaml&stackName=hyperpod-event-bridge-email)

1.architectures/0.common/hyperpod-event-bridge-email.yaml

Lines changed: 92 additions & 5 deletions
@@ -102,17 +102,29 @@ Resources:
102102
ZipFile: |
103103
import os
104104
import json
105-
105+
from datetime import datetime
106106
import boto3
107107
108+
def format_event_time(event_time):
109+
try:
110+
dt = datetime.fromtimestamp(int(event_time) / 1000)
111+
return dt.strftime("%Y-%m-%d %H:%M:%S UTC")
112+
except:
113+
return event_time
108114
109115
def get_console_url(event):
110116
region = event["region"]
111-
cluster_name = event["detail"]["ClusterName"]
117+
118+
if "ClusterName" in event["detail"]:
119+
cluster_name = event["detail"]["ClusterName"]
120+
elif "EventDetails" in event["detail"]:
121+
cluster_name = event["detail"]["EventDetails"]["ClusterName"]
122+
else:
123+
cluster_name = ""
124+
112125
console_url = f"https://{region}.console.aws.amazon.com/sagemaker/home?region={region}#/cluster-management/{cluster_name}"
113126
return console_url
114127
115-
116128
def format_html_for_cluster_status_event(event):
117129
118130
html_body = '<body style="font-family:Helvetica; font-size: 11pt;">\n'
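
Review note on `format_event_time` in the hunk above: `datetime.fromtimestamp` called without a `tz` argument returns local time, so the `UTC` suffix is only accurate when the runtime's local zone happens to be UTC. The bare `except:` also swallows every exception type. A minimal sketch of a zone-explicit variant follows. The name `format_event_time_utc` is hypothetical, and the sketch assumes, as the diff does, that `EventTime` is an epoch value in milliseconds.

```python
from datetime import datetime, timezone


def format_event_time_utc(event_time):
    """Convert an epoch-milliseconds value to a UTC timestamp string.

    Hypothetical variant of format_event_time from the diff. Values that
    cannot be parsed (for example the "N/A" default) are returned unchanged.
    """
    try:
        # tz=timezone.utc makes the result independent of the local time zone.
        dt = datetime.fromtimestamp(int(event_time) / 1000, tz=timezone.utc)
    except (TypeError, ValueError, OverflowError, OSError):
        return event_time
    return dt.strftime("%Y-%m-%d %H:%M:%S UTC")
```

Catching only the conversion errors keeps the fallback behaviour of the original while letting unrelated failures surface.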
@@ -173,7 +185,6 @@ Resources:
173185
174186
return html_body
175187
176-
177188
def format_html_for_node_health_event(event):
178189
179190
html_body = '<body style="font-family:Helvetica; font-size: 11pt;">\n'
@@ -240,7 +251,76 @@ Resources:
240251
241252
return html_body
242253
254+
def format_html_for_cluster_event(event):
255+
"""Format HTML for generic cluster events with EventDetails."""
256+
257+
event_details = event["detail"]["EventDetails"]
258+
259+
html_body = '<body style="font-family:Helvetica; font-size: 11pt;">\n'
260+
261+
# Table of summary
262+
html_body += '<table>\n'
263+
264+
html_body += "<tr> <td>%s</td> <td>%s</td> </tr>\n" % (
265+
"AWS account:",
266+
event["account"]
267+
)
268+
269+
html_body += "<tr> <td>%s</td> <td>%s</td> </tr>\n" % (
270+
"Region:",
271+
event["region"]
272+
)
273+
274+
html_body += "<tr> <td>%s</td> <td>%s</td> </tr>\n" % (
275+
"Cluster name:",
276+
event_details.get("ClusterName", "N/A")
277+
)
278+
279+
html_body += '</table>\n'
243280
281+
html_body += "<br>\n"
282+
283+
# Table of event details
284+
html_body += '<table border="1" >\n'
285+
html_body += '<caption>Event details</caption>\n'
286+
287+
html_body += "<tr> <td>%s</td> <td>%s</td> </tr>\n" % (
288+
"Resource Type",
289+
event_details.get("ResourceType", "N/A")
290+
)
291+
292+
if "InstanceGroupName" in event_details:
293+
html_body += "<tr> <td>%s</td> <td>%s</td> </tr>\n" % (
294+
"Instance Group",
295+
event_details["InstanceGroupName"]
296+
)
297+
298+
if "InstanceId" in event_details:
299+
html_body += "<tr> <td>%s</td> <td>%s</td> </tr>\n" % (
300+
"Instance ID",
301+
event_details["InstanceId"]
302+
)
303+
304+
html_body += "<tr> <td>%s</td> <td>%s</td> </tr>\n" % (
305+
"Event Time",
306+
format_event_time(event_details.get("EventTime", "N/A"))
307+
)
308+
309+
html_body += "<tr> <td>%s</td> <td>%s</td> </tr>\n" % (
310+
"Description",
311+
event_details.get("Description", "N/A")
312+
)
313+
314+
html_body += '</table>\n'
315+
316+
html_body += "<br>\n"
317+
318+
# Hyperlink to console page
319+
html_body += '<a href="%s">Link to HyperPod console</a>\n' % get_console_url(event)
320+
321+
html_body += "</body>"
322+
323+
return html_body
244324
245325
def lambda_handler(event, context):
246326
ses = boto3.client('ses')
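
Review note on `format_html_for_cluster_event` in the hunk above: the function interpolates event fields such as `Description` directly into HTML and repeats the same row template for every field. The sketch below shows one way to factor the rows into a helper that escapes its inputs. The helper names `table_row` and `event_detail_rows` are hypothetical, only fields that appear in the diff are used, and the `Event Time` row is left out for brevity.

```python
import html


def table_row(label, value):
    """Render one two-column table row, HTML-escaping both cells."""
    return "<tr> <td>%s</td> <td>%s</td> </tr>\n" % (
        html.escape(str(label)),
        html.escape(str(value)),
    )


def event_detail_rows(event_details):
    """Build the rows of the event-details table from an EventDetails dict."""
    rows = table_row("Resource Type", event_details.get("ResourceType", "N/A"))

    # Optional fields are rendered only when the event carries them.
    optional_fields = (
        ("Instance Group", "InstanceGroupName"),
        ("Instance ID", "InstanceId"),
    )
    for label, key in optional_fields:
        if key in event_details:
            rows += table_row(label, event_details[key])

    rows += table_row("Description", event_details.get("Description", "N/A"))
    return rows
```

Escaping matters here because the values come from the event payload and end up in an email body rendered as HTML.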
@@ -257,6 +337,12 @@ Resources:
257337
node_status = event["detail"]["HealthSummary"]["HealthStatus"]
258338
email_subject = f"HyperPod Cluster Node Health Event - {node_status}"
259339
email_body = format_html_for_node_health_event(event)
340+
elif event_type == "SageMaker HyperPod Cluster Event":
341+
event_details = event["detail"]["EventDetails"]
342+
cluster_name = event_details.get("ClusterName", "Unknown")
343+
resource_type = event_details.get("ResourceType", "Unknown")
344+
email_subject = f"HyperPod Cluster Event - {cluster_name} - {resource_type}"
345+
email_body = format_html_for_cluster_event(event)
260346
else:
261347
assert False, f"Unknown event type {event_type}"
262348
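
Review note on the `lambda_handler` hunk above: the fallback branch uses `assert False`, which is skipped when Python runs with assertions disabled (`-O`). The sketch below isolates the new subject-line logic and raises `ValueError` instead. The function name `cluster_event_subject` and the sample event are hypothetical; the field names and the subject format are taken from the diff.

```python
def cluster_event_subject(event):
    """Build the email subject for a 'SageMaker HyperPod Cluster Event'."""
    event_type = event["detail-type"]
    if event_type != "SageMaker HyperPod Cluster Event":
        # An explicit exception still fires when assertions are disabled.
        raise ValueError(f"Unknown event type {event_type}")

    event_details = event["detail"]["EventDetails"]
    cluster_name = event_details.get("ClusterName", "Unknown")
    resource_type = event_details.get("ResourceType", "Unknown")
    return f"HyperPod Cluster Event - {cluster_name} - {resource_type}"


# Hypothetical payload for illustration only; real events carry more fields.
sample_event = {
    "detail-type": "SageMaker HyperPod Cluster Event",
    "detail": {
        "EventDetails": {
            "ClusterName": "demo-cluster",
            "ResourceType": "Instance",
        }
    },
}

subject = cluster_event_subject(sample_event)
```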
@@ -295,7 +381,8 @@ Resources:
295381
"source": ["aws.sagemaker"],
296382
"detail-type": [
297383
"SageMaker HyperPod Cluster State Change",
298-
"SageMaker HyperPod Cluster Node Health Event"
384+
"SageMaker HyperPod Cluster Node Health Event",
385+
"SageMaker HyperPod Cluster Event"
299386
]
300387
}
301388
Targets:
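
Review note on the event pattern hunk above: the rule now forwards three `detail-type` values to the Lambda function. The sketch below is a deliberately simplified local matcher, useful for checking which sample events the rule would select. It handles only top-level fields with exact-match value lists and is not an implementation of the full EventBridge pattern language; the names `EVENT_PATTERN` and `matches` are hypothetical.

```python
# Pattern copied from the template after this commit.
EVENT_PATTERN = {
    "source": ["aws.sagemaker"],
    "detail-type": [
        "SageMaker HyperPod Cluster State Change",
        "SageMaker HyperPod Cluster Node Health Event",
        "SageMaker HyperPod Cluster Event",
    ],
}


def matches(pattern, event):
    """Return True when every pattern field lists the event's value.

    Simplification: each pattern field is a list of allowed exact values,
    and a field that is absent from the event never matches.
    """
    return all(event.get(field) in allowed for field, allowed in pattern.items())
```

For example, `matches(EVENT_PATTERN, {"source": "aws.sagemaker", "detail-type": "SageMaker HyperPod Cluster Event"})` evaluates to `True`, while an event with any other `detail-type` is rejected.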
Lines changed: 151 additions & 38 deletions
@@ -1,61 +1,174 @@
1-
# AWS Batch distributed training architectures
1+
# AWS Batch Distributed Training Architectures
22

3-
This architecture serves as an example to run distributed training jobs on p4d.24xlarge instances but can be easily be modified to accommodate other instance kinds (Trn or other P instances).
3+
This repository provides CloudFormation templates and examples for running distributed training jobs on AWS Batch using GPU instances. The architecture can be easily modified to accommodate different instance types including Trainium (Trn) and other P-series instances.
44

5-
> **Important**: it is assumed that you deployed the VPC template [`2.vpc-one-az.yaml`](../0.vpc_network/2.vpc-oneaz.yaml) as our Batch template will fetch automatically the EFA Security Group ID (SG) and Subnet ID to setup the AWS Batch Compute Environment. Both the SG and Subnet are exported values from the VPC template.
5+
## Table of Contents
66

7-
This architecture consists of the following resources:
7+
- [Prerequisites](#prerequisites)
8+
- [Architecture Overview](#architecture-overview)
9+
- [Available Templates](#available-templates)
10+
- [P4 Instance Deployment](#p4-instance-deployment)
11+
- [P5 Instance Deployment](#p5-instance-deployment)
12+
- [P6 Instance Deployment](#p6-instance-deployment)
13+
- [Important Considerations](#important-considerations)
814

9-
- [AWS Batch Compute Environment](https://docs.aws.amazon.com/batch/latest/userguide/compute_environments.html) for [Multi-node parallel jobs](https://docs.aws.amazon.com/batch/latest/userguide/multi-node-parallel-jobs.html). It is similar to a compute cluster.
10-
- [AWS Batch Job Queue](https://docs.aws.amazon.com/batch/latest/userguide/job_queues.html) attached to the compute environment. It is similar to a queue for job schedulers (Slurm, LSF...).
11-
- [EC2 Launch Template](https://docs.aws.amazon.com/autoscaling/ec2/userguide/launch-templates.html) which used to setup 4 EFA cards on our instance.
12-
- [Job Definition](https://docs.aws.amazon.com/batch/latest/userguide/job_definitions.html) serves as a template for our jobs and refers to the container registry to pull containers
13-
- [ECR Container Registry](https://docs.aws.amazon.com/AmazonECR/latest/userguide/what-is-ecr.html) is used to store containers.
15+
## Prerequisites
1416

15-
## Template
17+
> **⚠️ Important**: You must deploy the VPC template [`2.vpc-one-az.yaml`](../../1.architectures/1.vpc_network/2.vpc-one-az.yaml) before deploying any Batch template. The Batch templates automatically fetch the EFA Security Group ID and Subnet ID from the VPC template's exported values.
1618
17-
This template deploys AWS Batch and EC2 resources. It can be deployed via the console and the AWS CLI. Regardless of the deployment method it is assumed that you deployed the VPC template [`2.vpc-one-az.yaml`](../0.vpc_network/2.vpc-oneaz.yaml) prior to deploying that one.
19+
## Architecture Overview
1820

19-
- **Template file**: [`0.aws-batch-distributed-training.yaml`](./0.aws-batch-distributed-training.yaml)
21+
This architecture consists of the following AWS resources:
2022

21-
### Quick Create
23+
| Component | Purpose | Documentation |
24+
|-----------|---------|---------------|
25+
| **AWS Batch Compute Environment** | Manages compute resources for multi-node parallel jobs (similar to a compute cluster) | [AWS Docs](https://docs.aws.amazon.com/batch/latest/userguide/compute_environments.html) |
26+
| **AWS Batch Job Queue** | Queues jobs for execution (similar to Slurm/LSF schedulers) | [AWS Docs](https://docs.aws.amazon.com/batch/latest/userguide/job_queues.html) |
27+
| **EC2 Launch Template** | Configures EFA network interfaces for high-performance networking | [AWS Docs](https://docs.aws.amazon.com/autoscaling/ec2/userguide/launch-templates.html) |
28+
| **Job Definition** | Template for job execution, references container images | [AWS Docs](https://docs.aws.amazon.com/batch/latest/userguide/job_definitions.html) |
29+
| **ECR Container Registry** | Stores Docker container images | [AWS Docs](https://docs.aws.amazon.com/AmazonECR/latest/userguide/what-is-ecr.html) |
2230

23-
[<kbd> <br> 1-Click Deploy 🚀 <br> </kbd>](https://console.aws.amazon.com/cloudformation/home?#/stacks/quickcreate?templateURL=https://awsome-distributed-training.s3.amazonaws.com/templates/0.aws-batch-distributed-training.yaml&stackName=AWS-Batch)
31+
<img src="../../0.docs/batch-arch.png" width="600" alt="AWS Batch Architecture Diagram">
2432

33+
## Available Templates
2534

26-
## List of Parameters
35+
| Template | Instance Types | Features |
36+
|----------|----------------|----------|
37+
| [`0.aws-batch-distributed-training.yaml`](./0.aws-batch-distributed-training.yaml) | P4d.24xlarge (default) | Standard deployment with 4 EFA interfaces |
38+
| [`0.aws-batch-distributed-training-p5.yaml`](./0.aws-batch-distributed-training-p5.yaml) | P5.48xlarge | Optimized for P5 instances |
39+
| [`aws-batch-distributed-training-p6.yaml`](./aws-batch-distributed-training-p6.yaml) | P6-b200.48xlarge | P6 deployment with sample AWS Batch MNP job setup |
2740

28-
The templates takes parameters that are mandatory and optional, see below for more details.
41+
## P4 Instance Deployment
2942

30-
| Name | Type | Details |
31-
|-------------------------|-------------|-----------------------------------------------------------------------|
32-
| `VPCStackParameter` | Required | Name of the VPC stack in CloudFormation. |
33-
| `AMI` | Optional | ID of the AMI if using a custom one otherwise leave blank |
34-
| `CapacityReservationId` | Optional | Use that or the ResourceGroup to refer to an EC2 reservation |
35-
| `CapacityReservationResourceGroup` | Optional | Use that or the CapacityReservationId. |
36-
| `EC2KeyPair` | Optional | EC2 key pair to use in case you want to connect through ssh for debug.|
37-
| `CreatePlacementGroup` | Optional | Create a placement group for the instances. |
43+
### Quick Deploy
3844

45+
Deploy the standard template with one click:
3946

40-
## Deploy with the AWS CLI
47+
[<kbd> <br> 1-Click Deploy 🚀 <br> </kbd>](https://console.aws.amazon.com/cloudformation/home?#/stacks/quickcreate?templateURL=https://awsome-distributed-training.s3.amazonaws.com/templates/0.aws-batch-distributed-training.yaml&stackName=AWS-Batch)
4148

42-
If you'd like to deploy through the AWS CLI instead of the quick create link above, the command to deploy the template is shown below. Please edit the parameters values with your own configuration.
49+
### Parameters
50+
51+
| Parameter | Type | Description |
52+
|-----------|------|-------------|
53+
| `VPCStackParameter` | **Required** | Name of the VPC CloudFormation stack |
54+
| `AMI` | Optional | Custom AMI ID (leave blank for default) |
55+
| `CapacityReservationId` | Optional | EC2 Capacity Reservation ID |
56+
| `CapacityReservationResourceGroup` | Optional | Alternative to CapacityReservationId |
57+
| `EC2KeyPair` | Optional | EC2 key pair for SSH debugging |
58+
| `CreatePlacementGroup` | Optional | Create placement group for instances |
59+
60+
## P5 Instance Deployment
4361

4462
```bash
45-
aws cloudformation create-stack --stack-name aws-batch-p5 \
46-
--template-body file://0.aws-batch-distributed-training-p5.yaml \
47-
--parameters ParameterKey=VPCStackParameter,ParameterValue="aws-batch-vpc" \
48-
ParameterKey=CapacityReservationId,ParameterValue="cr-1234567890" \
49-
--capabilities CAPABILITY_NAMED_IAM
63+
aws cloudformation create-stack \
64+
--stack-name aws-batch-distributed-training \
65+
--template-body file://0.aws-batch-distributed-training.yaml \
66+
--parameters \
67+
ParameterKey=VPCStackParameter,ParameterValue="your-vpc-stack-name" \
68+
ParameterKey=CapacityReservationId,ParameterValue="cr-1234567890" \
69+
--capabilities CAPABILITY_NAMED_IAM
5070
```
5171

52-
## Gotchas
72+
## P6 Instance Deployment
73+
74+
### Template Parameters
75+
76+
| Parameter | Type | Description |
77+
|-----------|------|-------------|
78+
| `VPCStackParameter` | **Required** | Name of the VPC CloudFormation stack |
79+
| `CapacityReservationId` | **Required** | Capacity Reservation ID (e.g., cr-1234567890) |
80+
81+
### Deployment Steps
82+
83+
#### Step 1: Deploy CloudFormation Stack
84+
85+
```bash
86+
aws cloudformation create-stack \
87+
--stack-name batch-p6 \
88+
--template-body file://aws-batch-distributed-training-p6.yaml \
89+
--parameters \
90+
ParameterKey=VPCStackParameter,ParameterValue="your-vpc-stack-name" \
91+
ParameterKey=CapacityReservationId,ParameterValue="cr-1234567890" \
92+
--capabilities CAPABILITY_NAMED_IAM
93+
```
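
The same stack can also be created from Python. The following is a sketch only, assuming `boto3` is installed and credentials are configured; the helper names `build_create_stack_kwargs` and `deploy_p6_stack` are hypothetical, and the parameter values are the placeholders from the CLI command above.

```python
def build_create_stack_kwargs(stack_name, template_body, parameters):
    """Translate a plain dict of parameters into create_stack keyword arguments."""
    return {
        "StackName": stack_name,
        "TemplateBody": template_body,
        "Parameters": [
            {"ParameterKey": key, "ParameterValue": value}
            for key, value in parameters.items()
        ],
        "Capabilities": ["CAPABILITY_NAMED_IAM"],
    }


def deploy_p6_stack(template_path="aws-batch-distributed-training-p6.yaml"):
    """Create the batch-p6 stack; requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the helper above works without boto3

    with open(template_path) as template_file:
        template_body = template_file.read()

    kwargs = build_create_stack_kwargs(
        "batch-p6",
        template_body,
        {
            "VPCStackParameter": "your-vpc-stack-name",  # placeholder
            "CapacityReservationId": "cr-1234567890",  # placeholder
        },
    )
    return boto3.client("cloudformation").create_stack(**kwargs)
```

Keeping the argument construction separate from the API call makes the parameter mapping easy to inspect before anything is deployed.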
94+
95+
#### Step 2: Generate and Store SSH Key
96+
97+
```bash
98+
# Generate SSH key pair
99+
ssh-keygen -t rsa -b 2048 -N '' -f /tmp/batch_key
100+
101+
# Store private key in Secrets Manager
102+
aws secretsmanager put-secret-value \
103+
--secret-id batch-p6-ssh-key \
104+
--secret-string file:///tmp/batch_key
105+
106+
# Clean up temporary files
107+
rm /tmp/batch_key /tmp/batch_key.pub
108+
```
109+
110+
### Testing Your Deployment
111+
112+
Submit a multi-node NCCL test job to verify the setup:
113+
114+
```bash
115+
# Retrieve stack outputs
116+
JOB_DEFINITION=$(aws cloudformation describe-stacks \
117+
--stack-name batch-p6 \
118+
--query 'Stacks[0].Outputs[?OutputKey==`JobDefinitionMultiInstance`].OutputValue' \
119+
--output text)
120+
121+
JOB_QUEUE=$(aws cloudformation describe-stacks \
122+
--stack-name batch-p6 \
123+
--query 'Stacks[0].Outputs[?OutputKey==`DistributedDeepLearningJQ`].OutputValue' \
124+
--output text)
125+
126+
# Submit test job
127+
aws batch submit-job \
128+
--job-name nccl-test-2node \
129+
--job-queue ${JOB_QUEUE} \
130+
--job-definition ${JOB_DEFINITION} \
131+
--node-overrides numNodes=2
132+
133+
# Monitor job status
134+
aws batch describe-jobs --jobs <job-id>
135+
136+
# View logs
137+
aws logs tail /aws/batch/job --follow
138+
```
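
Instead of re-running `describe-jobs` by hand, the job can be polled until it finishes. The sketch below assumes a `boto3` Batch client is passed in; the function name `wait_for_job` and the 30-second default interval are hypothetical choices, not part of the template.

```python
import time

# AWS Batch job statuses after which the job no longer changes state.
TERMINAL_STATES = {"SUCCEEDED", "FAILED"}


def wait_for_job(batch_client, job_id, poll_seconds=30, sleep=time.sleep):
    """Poll describe_jobs until the job reaches SUCCEEDED or FAILED.

    batch_client is expected to behave like boto3.client("batch");
    sleep is injectable so the loop can be exercised without waiting.
    """
    while True:
        response = batch_client.describe_jobs(jobs=[job_id])
        status = response["jobs"][0]["status"]
        if status in TERMINAL_STATES:
            return status
        sleep(poll_seconds)
```

A typical call would be `wait_for_job(boto3.client("batch"), job_id)`, where `job_id` is the `jobId` value returned by `submit-job`.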
139+
140+
### P6 Architecture Details
141+
142+
- **Container Image**: `public.ecr.aws/hpc-cloud/nccl-tests:latest`
143+
- **Network Configuration**: 8 EFA interfaces per instance
144+
- **SSH Setup**: Automated via inline bash script in Job Definition
145+
- **Default Test**: `all_reduce_perf` with 8 GPUs per node (16 total processes for 2-node job)
146+
- **Key Management**: SSH keys retrieved from Secrets Manager at container startup
147+
148+
## Important Considerations
149+
150+
### EFA Network Configuration
151+
152+
- EFA interfaces must be explicitly declared in the EC2 Launch Template
153+
- The EFA security group must be provided and properly configured
154+
- Network performance is critical for distributed training workloads
155+
156+
### VPC Dependencies
157+
158+
- The Compute Environment retrieves private subnet information from the VPC template
159+
- Ensure the VPC template exports the required subnet and security group values
160+
- Both templates must be deployed in the same AWS region
161+
162+
### Capacity Management
163+
164+
- Use Capacity Reservations for guaranteed instance availability
165+
- Consider using Capacity Reservation Resource Groups for easier management
166+
- Monitor your EC2 limits and request increases if needed
53167

54-
There are a few things to know as you evaluate this architecture:
55-
- EFA interfaces need to be declared explicitly in the EC2 Launch Template and you need to provide the security group used for EFA.
56-
- The Compute Environment must retrieve the list of private subnets from the VPC template. This list is exported by the VPC template.
57-
- The Batch Job Definition assumes you are pushing a container with `stress-ng` and is pre-configured as such.
168+
---
58169

59-
## Architecture Diagram
170+
## Additional Resources
60171

61-
<img src="../../0.docs/batch-arch.png" width="500">
172+
- [AWS Batch User Guide](https://docs.aws.amazon.com/batch/latest/userguide/)
173+
- [Multi-node Parallel Jobs](https://docs.aws.amazon.com/batch/latest/userguide/multi-node-parallel-jobs.html)
174+
- [EFA Documentation](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html)
