Commit a306252

Added training plan logic into automate script (#510)
Training plan logic added to the automate script.
1 parent f1eaede commit a306252

File tree: 3 files changed (+234 additions, −20 deletions)

1.architectures/5.sagemaker-hyperpod/automate-smhp-slurm/README.md

Lines changed: 91 additions & 6 deletions

```diff
@@ -1,9 +1,94 @@
-# Automate SageMaker HyperPod Cluster Creation
-A bash script that automates the manual cluster creation process for SageMaker HyperPod SLURM
+# SageMaker Hyperpod Cluster Automation Script
 
-This automates the steps from the [SageMaker HyperPod SLURM Workshop](https://catalog.workshops.aws/sagemaker-hyperpod/en-US)
+This project provides a script to automate the creation and setup of a SageMaker Hyperpod cluster with SLURM integration.
 
-## 🚀 Installation and Usage
-Using this script is very simple. Run ```bash automate-cluster-creation.sh```
+The automation script streamlines the process of setting up a distributed training environment using AWS SageMaker Hyperpod.
+It handles the installation and configuration of necessary tools, clones the required repository, sets up environment variables, and configures lifecycle scripts for the SageMaker Hyperpod architecture.
 
-The script will walk you through creating the cluster configuration for your SageMaker HyperPod Slurm cluster. Please read through the instructions provided while running the script for the best experience.
+## Demo
+
+![SageMaker Hyperpod Cluster Automation Demo](/1.architectures/5.sagemaker-hyperpod/automate-smhp-slurm/media/automate-smhp-demo.gif)
+
+This demo gif showcases the step-by-step process of creating and setting up a SageMaker Hyperpod cluster using our automation script.
+
+- `automate-cluster-creation.sh`: The main script that automates the cluster creation process.
+- `README.md`: This file, providing information about the project.
+
+## Usage Instructions
+
+### Prerequisites
+
+- AWS CLI (version 2.17.1 or higher)
+- Git
+- Bash shell environment
+- AWS account with appropriate permissions
+
+### Installation
+
+1. Clone this repository:
+```bash
+git clone https://github.com/aws-samples/awsome-distributed-training.git
+cd 1.architectures/5.sagemaker-hyperpod/automate-smhp-slurm
+```
+
+2. Make the script executable:
+```bash
+chmod +x automate-cluster-creation.sh
+```
+
+### Running the Script
+
+Execute the script:
+
+```bash
+./automate-cluster-creation.sh
+```
+
+The script will guide you through the following steps:
+
+1. Check and install/update AWS CLI if necessary.
+2. Verify Git installation.
+3. Clone the "awsome-distributed-training" repository.
+4. Set up environment variables.
+5. Configure lifecycle scripts for SageMaker Hyperpod.
+
+### Configuration
+
+During execution, you'll be prompted to provide the following information:
+
+- Name of the SageMaker VPC CloudFormation stack (default: sagemaker-hyperpod)
+- Confirmation if you deployed the optional hyperpod-observability CloudFormation stack
+- Instance group configuration (group name, instance type, instance count, etc.)
+
+### Troubleshooting
+
+- If you encounter permission issues when attaching IAM policies, the script will provide options to:
+  1. Run `aws configure` as an admin user within the script.
+  2. Exit the script to run `aws configure` manually.
+  3. Continue without configuring this step.
+
+- If environment variable generation fails:
+  1. You can choose to continue with the rest of the script (not recommended unless you know how to set the variables manually).
+  2. Exit the script to troubleshoot the issue.
+
+## Data Flow
+
+The automation script follows this general flow:
+
+1. Check and set up prerequisites (AWS CLI, Git)
+2. Clone necessary repositories
+3. Set up environment variables
+4. Configure lifecycle scripts
+5. Enable observability (if applicable)
+6. Attach IAM policies (if applicable)
+7. Create cluster configuration
+8. Create cluster
+
+```
+[Prerequisites] -> [Clone Repos] -> [Setup Env Vars] -> [Configure LCS] -> [Enable Observability] -> [Attach IAM Policies] -> [Create Cluster Configuration] -> [Create Cluster]
+```
+
+Important technical considerations:
+- Ensure you have the necessary AWS permissions before running the script.
+- The script modifies the `config.py` file to enable observability if selected.
+- IAM policy attachment requires admin permissions.
```
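The AWS CLI floor listed under Prerequisites (2.17.1 or higher) can be verified up front with a `sort -V` comparison. This is a minimal sketch; the `version_ok` helper name is ours, not part of `automate-cluster-creation.sh`:

```shell
# version_ok: succeed when dotted version $1 >= $2 (hypothetical helper,
# not defined in the actual script).
version_ok() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# Extract the version from "aws-cli/X.Y.Z ..." output, if the CLI is installed.
installed=$(aws --version 2>/dev/null | sed -E 's#^aws-cli/([0-9.]+).*#\1#')
if version_ok "${installed:-0}" "2.17.1"; then
    echo "AWS CLI version OK: $installed"
else
    echo "AWS CLI 2.17.1 or higher required (found: ${installed:-none})"
fi
```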

1.architectures/5.sagemaker-hyperpod/automate-smhp-slurm/automate-cluster-creation.sh

Lines changed: 143 additions & 14 deletions
```diff
@@ -198,8 +198,8 @@ setup_lifecycle_scripts() {
 
     # Helper function for attaching IAM policies (specific to observability stack only!)
     attach_policies() {
-        aws iam attach-role-policy --role-name $ROLENAME --policy-arn arn:aws:iam::aws:policy/AmazonPrometheusRemoteWriteAccess
-        aws iam attach-role-policy --role-name $ROLENAME --policy-arn arn:aws:iam::aws:policy/AWSCloudFormationReadOnlyAccess
+        aws iam attach-role-policy --role-name $ROLENAME --policy-arn arn:aws:iam::aws:policy/AmazonPrometheusRemoteWriteAccess --output json
+        aws iam attach-role-policy --role-name $ROLENAME --policy-arn arn:aws:iam::aws:policy/AWSCloudFormationReadOnlyAccess --output json
     }
 
     # Capture stdout + stderr
```
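The `# Capture stdout + stderr` context line hints at why `--output json` is pinned here: each helper's combined output is captured and only surfaced on failure, so the response format must be predictable. A minimal sketch of that capture pattern, with a deliberately failing stand-in command in place of the real `aws iam` calls:

```shell
# Stand-in helper: in the real script the body is the two
# aws iam attach-role-policy calls shown in the hunk above.
attach_policies() {
    ls /nonexistent-path-for-demo
}

# Capture stdout + stderr together, then branch on the exit status.
if error_output=$(attach_policies 2>&1); then
    echo "Policies attached."
else
    echo "Failed to attach policies: $error_output"
fi
```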
```diff
@@ -254,7 +254,7 @@ setup_lifecycle_scripts() {
     echo -e "${BLUE}Uploading your lifecycle scripts to S3 bucket ${YELLOW}${BUCKET}${NC}"
     # upload data
     upload_to_s3() {
-        aws s3 cp --recursive base-config/ s3://${BUCKET}/src
+        aws s3 cp --recursive base-config/ s3://${BUCKET}/src --output json
     }
 
     if error_output=$(upload_to_s3 2>&1); then
```
```diff
@@ -355,19 +355,22 @@ create_config() {
     echo -e "\n${BLUE}=== Worker Group Configuration ===${NC}"
     while true; do
         if [[ $WORKER_GROUP_COUNT -eq 1 ]]; then
-            echo -e "${GREEN}Do you want to add a worker instance group? (yes/no): ${NC}"
+            ADD_WORKER=$(get_input "Do you want to add a worker instance group? (yes/no):" "yes")
         else
-            echo -e "${GREEN}Do you want to add another worker instance group? (yes/no): ${NC}"
+            ADD_WORKER=$(get_input "Do you want to add another worker instance group? (yes/no):" "no")
         fi
-        read -e ADD_WORKER
+
         if [[ $ADD_WORKER != "yes" ]]; then
             break
         fi
 
         echo -e "${YELLOW}Configuring Worker Group $WORKER_GROUP_COUNT${NC}"
         INSTANCE_TYPE=$(get_input "Enter the instance type for worker group $WORKER_GROUP_COUNT" "ml.c5.4xlarge")
         INSTANCE_COUNT=$(get_input "Enter the instance count for worker group $WORKER_GROUP_COUNT" "4")
-
+
+        echo -e "${GREEN}Are you using training plans? (yes/no): ${NC}"
+        read -e USE_TRAINING_PLAN
+
         INSTANCE_GROUPS+=",
         {
             \"InstanceGroupName\": \"worker-group-$WORKER_GROUP_COUNT\",
```
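The refactor above replaces bare `read` prompts with `get_input`, which supplies a default when the user just presses Enter. `get_input` is defined elsewhere in the script; this sketch infers its behavior from the call sites and may differ from the real implementation:

```shell
# Sketch of a prompt-with-default helper matching get_input's call sites:
# get_input "<prompt>" "<default>" -> echoes the entered value, or the default.
get_input() {
    local prompt="$1" default="$2" value
    read -r -p "$prompt [$default]: " value
    echo "${value:-$default}"
}

# Pressing Enter (empty input) yields the default:
printf '\n' | get_input "Do you want to add a worker instance group? (yes/no):" "yes"
```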
```diff
@@ -387,7 +390,132 @@ create_config() {
             \"ExecutionRole\": \"${ROLE}\",
             \"ThreadsPerCore\": 1"
 
-        # More coming Re:Invent 2024!!!
+        if [[ $USE_TRAINING_PLAN == "yes" ]]; then
+            echo -e "\n${BLUE}=== Training Plan Configuration ===${NC}"
+            # aws iam attach-role-policy --role-name $ROLENAME --policy-arn arn:aws:iam::aws:policy/AmazonEC2FullAccess
+
+            TRAINING_PLAN=$(get_input "Enter the training plan name" "")
+
+            count=0
+            while true; do
+                # Attempt to describe the training plan
+                echo -e "${YELLOW}Attempting to retrieve training plan details...${NC}"
+
+                if ! TRAINING_PLAN_DESCRIPTION=$(aws sagemaker describe-training-plan --training-plan-name "$TRAINING_PLAN" --output json 2>&1); then
+                    echo -e "${BLUE}❌ Error: Training plan '$TRAINING_PLAN' not found. Please try again.${NC}"
+                    echo -e "${GREEN}Are you using training plans (Beta feature)? (yes/no)${NC}"
+                    read -e USE_TRAINING_PLAN
+                    if [[ $USE_TRAINING_PLAN != "yes" ]]; then
+                        echo -e "${YELLOW}Exiting training plan configuration.${NC}"
+                        break
+                    else
+                        TRAINING_PLAN=$(get_input "Enter the training plan name" "")
+                    fi
+                else
+                    # Extract relevant information from the description
+                    TRAINING_PLAN_ARN=$(echo "$TRAINING_PLAN_DESCRIPTION" | jq -r '.TrainingPlanArn')
+                    AVAILABLE_INSTANCE_COUNT=$(echo "$TRAINING_PLAN_DESCRIPTION" | jq -r '.AvailableInstanceCount')
+                    TOTAL_INSTANCE_COUNT=$(echo "$TRAINING_PLAN_DESCRIPTION" | jq -r '.TotalInstanceCount')
+                    TRAINING_PLAN_AZ=$(echo "$TRAINING_PLAN_DESCRIPTION" | jq -r '.ReservedCapacitySummaries[0].AvailabilityZone')
+                    TP_INSTANCE_TYPE=$(echo "$TRAINING_PLAN_DESCRIPTION" | jq -r '.ReservedCapacitySummaries[0].InstanceType')
+
+                    CF_AZ=$(aws ec2 describe-subnets --subnet-ids $SUBNET_ID --output json | jq -r '.Subnets[0].AvailabilityZone')
+
+                    # Only print if count=0
+                    if [[ $count -eq 0 ]]; then
+                        echo -e "${GREEN}Training Plan Details:${NC}"
+                        echo -e " ${YELLOW}Name:${NC} $TRAINING_PLAN"
+                        echo -e " ${YELLOW}Available Instance Count:${NC} $AVAILABLE_INSTANCE_COUNT"
+                        echo -e " ${YELLOW}Total Instance Count:${NC} $TOTAL_INSTANCE_COUNT"
+                        echo -e " ${YELLOW}Training Plan Availability Zone:${NC} $TRAINING_PLAN_AZ"
+                        echo -e " ${YELLOW}Training Plan Instance Type:${NC} $TP_INSTANCE_TYPE"
+                    fi
+
+                    # Compare INSTANCE_COUNT with AVAILABLE_INSTANCE_COUNT
+                    INSTANCE_COUNT_OK="n"
+                    if [[ $INSTANCE_COUNT -gt $AVAILABLE_INSTANCE_COUNT ]]; then
+                        echo -e "${YELLOW}Warning: The requested instance count ($INSTANCE_COUNT) is greater than the available instances in the training plan ($AVAILABLE_INSTANCE_COUNT).${NC}"
+                        echo -e "${BLUE}Do you want to continue anyway? (yes/no)${NC}"
+                        read -e CONTINUE
+                        if [[ $CONTINUE != "yes" ]]; then
+                            NEW_INSTANCE_COUNT=$(get_input "Enter the new number of instances" "1")
+                            # Update INSTANCE_GROUPS with new INSTANCE_COUNT for the current worker group
+                            INSTANCE_GROUPS=$(echo "$INSTANCE_GROUPS" | perl -pe '
+                                BEGIN {
+                                    $group = "worker-group-'"$WORKER_GROUP_COUNT"'";
+                                    $count = '"$NEW_INSTANCE_COUNT"';
+                                    $in_group = 0;
+                                }
+                                if (/"InstanceGroupName":\s*"$group"/) {
+                                    $in_group = 1;
+                                }
+                                if ($in_group && /"InstanceCount":\s*\d+/) {
+                                    s/("InstanceCount":\s*)\d+/$1$count/;
+                                    $in_group = 0;
+                                }
+                            ')
+                            INSTANCE_COUNT=$NEW_INSTANCE_COUNT
+                            echo -e "${GREEN}Updated instance count for worker-group-$WORKER_GROUP_COUNT to $INSTANCE_COUNT${NC}"
+                        fi
+                        INSTANCE_COUNT_OK="y"
+                    else
+                        INSTANCE_COUNT_OK="y"
+                    fi
+
+                    if [[ $INSTANCE_COUNT_OK == "y" ]]; then
+                        INSTANCE_TYPE_OK="n"
+                        # Compare INSTANCE_TYPE with TP_INSTANCE_TYPE
+                        if [[ $INSTANCE_TYPE != $TP_INSTANCE_TYPE ]]; then
+                            echo -e "${YELLOW}Warning: The requested instance type ($INSTANCE_TYPE) does not match the instance type in the training plan ($TP_INSTANCE_TYPE).${NC}"
+                            echo -e "${BLUE}Do you want to continue anyway? If you choose \"no\", then the script will update the instance type for you and proceed. (yes/no)${NC}"
+                            read -e CONTINUE
+                            if [[ $CONTINUE != "yes" ]]; then
+                                NEW_INSTANCE_TYPE=$TP_INSTANCE_TYPE
+                                # Update INSTANCE_GROUPS with new INSTANCE_TYPE for the current worker group
+                                INSTANCE_GROUPS=$(echo "$INSTANCE_GROUPS" | perl -pe '
+                                    BEGIN {
+                                        $group = "worker-group-'"$WORKER_GROUP_COUNT"'";
+                                        $type = "'"$NEW_INSTANCE_TYPE"'";
+                                        $in_group = 0;
+                                    }
+                                    if (/"InstanceGroupName":\s*"$group"/) {
+                                        $in_group = 1;
+                                    }
+                                    if ($in_group && /"InstanceType":\s*"[^"]*"/) {
+                                        s/("InstanceType":\s*")[^"]*"/$1$type"/;
+                                        $in_group = 0;
+                                    }
+                                ')
+                                INSTANCE_TYPE=$NEW_INSTANCE_TYPE
+                                echo -e "${GREEN}Updated instance type for worker-group-$WORKER_GROUP_COUNT to $INSTANCE_TYPE${NC}"
+                            fi
+                            INSTANCE_TYPE_OK="y"
+                        else
+                            INSTANCE_TYPE_OK="y"
+                        fi
+
+                        if [[ $INSTANCE_TYPE_OK == "y" ]]; then
+                            # Compare TRAINING_PLAN_AZ with CF_AZ
+                            if [[ $TRAINING_PLAN_AZ != $CF_AZ ]]; then
+                                echo -e "${YELLOW}Warning: The training plan availability zone ($TRAINING_PLAN_AZ) does not match the cluster availability zone ($CF_AZ).${NC}"
+                                echo -e "${BLUE}Do you want to continue anyway? (yes/no)${NC}"
+                                read -e CONTINUE
+                                if [[ $CONTINUE != "yes" ]]; then
+                                    echo -e "${YELLOW}Please ensure that your VPC is in the same Availability Zone as your training plan (or vice versa). If you used the workshop, this should be the CF stack \"sagemaker-hyperpod\". Exiting training plan configuration.${NC}"
+                                    continue
+                                fi
+                            fi
+                        fi
+                    fi
+
+                    echo -e "${GREEN}Adding Training Plan ARN to instance group configuration.${NC}"
+                    INSTANCE_GROUPS+=",
+            \"TrainingPlanArn\": \"$TRAINING_PLAN_ARN\""
+                    break
+                fi
+                count+=1
+            done
+        fi
 
         INSTANCE_GROUPS+="
         }"
```
```diff
@@ -468,7 +596,7 @@ EOL
 
     # upload data
     upload_to_s3() {
-        aws s3 cp provisioning_parameters.json s3://${BUCKET}/src/
+        aws s3 cp provisioning_parameters.json s3://${BUCKET}/src/ --output json
     }
 
     if error_output=$(upload_to_s3 2>&1); then
```
```diff
@@ -557,7 +685,8 @@ create_cluster() {
 
     if ! output=$(aws sagemaker create-cluster \
         --cli-input-json file://cluster-config.json \
-        --region $AWS_REGION 2>&1); then
+        --region $AWS_REGION \
+        --output json 2>&1); then
 
         echo -e "${YELLOW}⚠️ Error occurred while creating the cluster:${NC}"
         echo -e "${YELLOW}$output${NC}"
```
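`--output json` is pinned on these calls because the CLI otherwise honors the caller's configured default format (table, yaml, text), which would break the `jq` parsing used elsewhere in the script. A sketch over a canned response carrying the same fields the script reads from `describe-training-plan` (the ARN and counts are made-up sample values):

```shell
# Canned response shaped like describe-training-plan output (sample values only).
RESPONSE='{
  "TrainingPlanArn": "arn:aws:sagemaker:us-west-2:111122223333:training-plan/demo",
  "AvailableInstanceCount": 2,
  "TotalInstanceCount": 4
}'

# With the format pinned to JSON, jq extraction is reliable regardless of
# the user's configured CLI defaults.
AVAILABLE=$(echo "$RESPONSE" | jq -r '.AvailableInstanceCount')
TOTAL=$(echo "$RESPONSE" | jq -r '.TotalInstanceCount')
echo "Available: $AVAILABLE of $TOTAL"
```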
```diff
@@ -569,7 +698,7 @@ create_cluster() {
     # Command to create the cluster
     echo -e "${GREEN} aws sagemaker create-cluster \\"
     echo -e "${GREEN} --cli-input-json file://cluster-config.json \\"
-    echo -e "${GREEN} --region $AWS_REGION${NC}\n"
+    echo -e "${GREEN} --region $AWS_REGION --output json${NC}\n"
 
     read -e -p "Select an option (Enter/Ctrl+C): " choice
 
```
```diff
@@ -603,7 +732,6 @@ goodbye() {
     echo -e "\n${BLUE}Exiting script. Good luck with your SageMaker HyperPod journey! 👋${NC}\n"
 }
 
-
 #===Main Script===
 main() {
     print_header "🚀 Welcome to the SageMaker HyperPod Cluster Creation Script! 🚀"
```
```diff
@@ -672,7 +800,8 @@ main() {
     echo -e "${GREEN}Congratulations! You've completed all the preparatory steps.${NC}"
     echo -e "${YELLOW}Next Steps:${NC}"
 
-    read -e -p "Do you want the script to create the cluster for you now? (yes/no): " CREATE_CLUSTER
+    CREATE_CLUSTER=$(get_input "Do you want the script to create the cluster for you now? (yes/no):" "yes")
+    # read -e -p "Do you want the script to create the cluster for you now? (yes/no): " CREATE_CLUSTER
     if [[ "$CREATE_CLUSTER" == "yes" ]]; then
         warning
         create_cluster
```

```diff
@@ -683,7 +812,7 @@ main() {
     # Command to create the cluster
     echo -e "${GREEN} aws sagemaker create-cluster \\"
     echo -e "${GREEN} --cli-input-json file://cluster-config.json \\"
-    echo -e "${GREEN} --region $AWS_REGION${NC}\n"
+    echo -e "${GREEN} --region $AWS_REGION --output json${NC}\n"
 
     echo -e "${YELLOW}To monitor the progress of cluster creation, you can either check the SageMaker console, or you can run:${NC}"
     echo -e "${GREEN}watch -n 1 aws sagemaker list-clusters --output table${NC}"
```
1.architectures/5.sagemaker-hyperpod/automate-smhp-slurm/media/automate-smhp-demo.gif (16.5 MB binary file)
