Commit e1ac050

Update docs for Cluster Management (#240)
**Description** Updated the docs to reflect the latest state and improved the content to make it more useful. **Testing Done** The docs look correct; each action item was verified.
1 parent 16b48dd commit e1ac050

5 files changed

+218
-45
lines changed

doc/cli/cluster_management/cli_cluster_management.md

Lines changed: 93 additions & 31 deletions
@@ -10,9 +10,9 @@ Complete reference for SageMaker HyperPod cluster management parameters and conf
1010

1111
* [Initialize Configuration](#hyp-init)
1212
* [Create Cluster Stack](#hyp-create)
13-
* [Update Cluster](#hyp-update-hyp-cluster)
14-
* [List Cluster Stacks](#hyp-list-hyp-cluster)
15-
* [Describe Cluster Stack](#hyp-describe-hyp-cluster)
13+
* [Update Cluster](#hyp-update-cluster)
14+
* [List Cluster Stacks](#hyp-list-cluster-stack)
15+
* [Describe Cluster Stack](#hyp-describe-cluster-stack)
1616
* [List HyperPod Clusters](#hyp-list-cluster)
1717
* [Set Cluster Context](#hyp-set-cluster-context)
1818
* [Get Cluster Context](#hyp-get-cluster-context)
@@ -36,12 +36,14 @@ hyp init TEMPLATE [DIRECTORY] [OPTIONS]
3636

3737
| Parameter | Type | Required | Description |
3838
|-----------|------|----------|-------------|
39-
| `TEMPLATE` | CHOICE | Yes | Template type (hyp-cluster, hyp-pytorch-job, hyp-custom-endpoint, hyp-jumpstart-endpoint) |
39+
| `TEMPLATE` | CHOICE | Yes | Template type (cluster-stack, hyp-pytorch-job, hyp-custom-endpoint, hyp-jumpstart-endpoint) |
4040
| `DIRECTORY` | PATH | No | Target directory (default: current directory) |
4141
| `--version` | TEXT | No | Schema version to use |
4242

4343
```{important}
4444
The `resource_name_prefix` parameter in the generated `config.yaml` file serves as the primary identifier for all AWS resources created during deployment. Each deployment must use a unique resource name prefix to avoid conflicts. This prefix is automatically appended with a unique identifier during cluster creation to ensure resource uniqueness.
45+
46+
**Cluster stack names must be unique within each AWS region.** If you attempt to create a cluster stack with a name that already exists in the same region, the deployment will fail.
4547
```
4648

4749
## hyp create
@@ -61,14 +63,18 @@ hyp create [OPTIONS]
6163
| `--region` | TEXT | No | AWS region where the cluster stack will be created |
6264
| `--debug` | FLAG | No | Enable debug logging |
6365

64-
## hyp update hyp-cluster
66+
## hyp update cluster
6567

6668
Update an existing HyperPod cluster configuration.
6769

70+
```{important}
71+
**Runtime vs Configuration Commands**: This command modifies an **existing, deployed cluster's** runtime settings (instance groups, node recovery). This is different from `hyp configure`, which only modifies local configuration files before cluster creation.
72+
```
73+
6874
#### Syntax
6975

7076
```bash
71-
hyp update hyp-cluster [OPTIONS]
77+
hyp update cluster [OPTIONS]
7278
```
7379

7480
#### Parameters
@@ -82,14 +88,14 @@ hyp update hyp-cluster [OPTIONS]
8288
| `--node-recovery` | TEXT | No | Node recovery setting (Automatic or None) |
8389
| `--debug` | FLAG | No | Enable debug logging |
8490

85-
## hyp list hyp-cluster
91+
## hyp list cluster-stack
8692

8793
List all HyperPod cluster stacks (CloudFormation stacks).
8894

8995
#### Syntax
9096

9197
```bash
92-
hyp list hyp-cluster [OPTIONS]
98+
hyp list cluster-stack [OPTIONS]
9399
```
94100

95101
#### Parameters
@@ -100,14 +106,18 @@ hyp list hyp-cluster [OPTIONS]
100106
| `--status` | TEXT | No | Filter by stack status. Format: "['CREATE_COMPLETE', 'UPDATE_COMPLETE']" |
101107
| `--debug` | FLAG | No | Enable debug logging |
102108

103-
## hyp describe hyp-cluster
109+
## hyp describe cluster-stack
104110

105111
Describe a specific HyperPod cluster stack.
106112

113+
```{note}
114+
**Region-Specific Stack Names**: Cluster stack names are unique within each AWS region. When describing a stack, ensure you specify the correct region where the stack was created, or the command will fail to find the stack.
115+
```
116+
107117
#### Syntax
108118

109119
```bash
110-
hyp describe hyp-cluster STACK-NAME [OPTIONS]
120+
hyp describe cluster-stack STACK-NAME [OPTIONS]
111121
```
112122

113123
#### Parameters
@@ -195,6 +205,10 @@ hyp get-monitoring [OPTIONS]
195205

196206
Configure cluster parameters interactively or via command line.
197207

208+
```{important}
209+
**Pre-Deployment Configuration**: This command modifies local `config.yaml` files **before** cluster creation. For updating **existing, deployed clusters**, use `hyp update cluster` instead.
210+
```
211+
198212
#### Syntax
199213

200214
```bash
@@ -208,13 +222,23 @@ This command dynamically supports all configuration parameters available in the
208222
| Parameter | Type | Required | Description |
209223
|-----------|------|----------|-------------|
210224
| `--resource-name-prefix` | TEXT | No | Prefix for all AWS resources |
211-
| `--stage` | TEXT | No | Deployment stage ("gamma" or "prod") |
212-
| `--vpc-cidr` | TEXT | No | VPC CIDR block |
213-
| `--kubernetes-version` | TEXT | No | Kubernetes version for EKS cluster |
225+
| `--create-hyperpod-cluster-stack` | BOOLEAN | No | Create HyperPod Cluster Stack |
226+
| `--hyperpod-cluster-name` | TEXT | No | Name of SageMaker HyperPod Cluster |
227+
| `--create-eks-cluster-stack` | BOOLEAN | No | Create EKS Cluster Stack |
228+
| `--kubernetes-version` | TEXT | No | Kubernetes version |
229+
| `--eks-cluster-name` | TEXT | No | Name of the EKS cluster |
230+
| `--create-helm-chart-stack` | BOOLEAN | No | Create Helm Chart Stack |
231+
| `--namespace` | TEXT | No | Namespace to deploy HyperPod Helm chart |
232+
| `--node-provisioning-mode` | TEXT | No | Continuous provisioning mode |
214233
| `--node-recovery` | TEXT | No | Node recovery setting ("Automatic" or "None") |
215-
| `--env` | JSON | No | Environment variables as JSON object |
216-
| `--args` | JSON | No | Command arguments as JSON array |
217-
| `--command` | JSON | No | Command to run as JSON array |
234+
| `--create-vpc-stack` | BOOLEAN | No | Create VPC Stack |
235+
| `--vpc-id` | TEXT | No | Existing VPC ID |
236+
| `--vpc-cidr` | TEXT | No | VPC CIDR block |
237+
| `--create-security-group-stack` | BOOLEAN | No | Create Security Group Stack |
238+
| `--enable-hp-inference-feature` | BOOLEAN | No | Enable inference operator |
239+
| `--stage` | TEXT | No | Deployment stage ("gamma" or "prod") |
240+
| `--create-fsx-stack` | BOOLEAN | No | Create FSx Stack |
241+
| `--storage-capacity` | INTEGER | No | FSx storage capacity in GiB |
218242
| `--tags` | JSON | No | Resource tags as JSON object |
219243

220244
**Note:** The exact parameters available depend on your current template type and version. Run `hyp configure --help` to see all available options for your specific configuration.
@@ -302,18 +326,56 @@ The `config.yaml` file supports the following parameters:
302326

303327
| Parameter | Type | Description | Default |
304328
|-----------|------|-------------|---------|
305-
| `template` | TEXT | Template name | "hyp-cluster" |
306-
| `namespace` | TEXT | Kubernetes namespace | "kube-system" |
307-
| `stage` | TEXT | Deployment stage | "gamma" |
308-
| `resource_name_prefix` | TEXT | Resource name prefix | "sagemaker-hyperpod-eks" |
309-
| `vpc_cidr` | TEXT | VPC CIDR block | "10.192.0.0/16" |
329+
| `resource_name_prefix` | TEXT | Prefix for all AWS resources (4-digit UUID added during submission) | "hyp-eks-stack" |
330+
| `create_hyperpod_cluster_stack` | BOOLEAN | Create HyperPod Cluster Stack | true |
331+
| `hyperpod_cluster_name` | TEXT | Name of SageMaker HyperPod Cluster | "hyperpod-cluster" |
332+
| `create_eks_cluster_stack` | BOOLEAN | Create EKS Cluster Stack | true |
310333
| `kubernetes_version` | TEXT | Kubernetes version | "1.31" |
311-
| `node_recovery` | TEXT | Node recovery setting | "Automatic" |
312-
| `create_vpc_stack` | BOOLEAN | Create new VPC | true |
313-
| `create_eks_cluster_stack` | BOOLEAN | Create new EKS cluster | true |
314-
| `create_hyperpod_cluster_stack` | BOOLEAN | Create HyperPod cluster | true |
315-
316-
**Note:** The actual available configuration parameters depend on the specific template schema version. Use `hyp init hyp-cluster` to see all available parameters for your version.
334+
| `eks_cluster_name` | TEXT | Name of the EKS cluster | "eks-cluster" |
335+
| `create_helm_chart_stack` | BOOLEAN | Create Helm Chart Stack | true |
336+
| `namespace` | TEXT | Namespace to deploy HyperPod Helm chart | "kube-system" |
337+
| `helm_repo_url` | TEXT | URL of Helm repo containing HyperPod Helm chart | "https://github.com/aws/sagemaker-hyperpod-cli.git" |
338+
| `helm_repo_path` | TEXT | Path to HyperPod Helm chart in repo | "helm_chart/HyperPodHelmChart" |
339+
| `helm_operators` | TEXT | Configuration of HyperPod Helm chart | "mlflow.enabled=true,trainingOperators.enabled=true,..." |
340+
| `helm_release` | TEXT | Name for Helm chart release | "dependencies" |
341+
| `node_provisioning_mode` | TEXT | Continuous provisioning mode ("Continuous" or empty) | "Continuous" |
342+
| `node_recovery` | TEXT | Automatic node recovery ("Automatic" or "None") | "Automatic" |
343+
| `instance_group_settings` | ARRAY | List of instance group configurations | [Default controller group] |
344+
| `rig_settings` | ARRAY | Restricted instance group configurations | null |
345+
| `rig_s3_bucket_name` | TEXT | S3 bucket for RIG resources | null |
346+
| `tags` | ARRAY | Custom tags for SageMaker HyperPod cluster | null |
347+
| `create_vpc_stack` | BOOLEAN | Create VPC Stack | true |
348+
| `vpc_id` | TEXT | Existing VPC ID (if not creating new) | null |
349+
| `vpc_cidr` | TEXT | IP range for VPC | "10.192.0.0/16" |
350+
| `availability_zone_ids` | ARRAY | List of AZs to deploy subnets | null |
351+
| `create_security_group_stack` | BOOLEAN | Create Security Group Stack | true |
352+
| `security_group_id` | TEXT | Existing security group ID | null |
353+
| `security_group_ids` | ARRAY | Security groups for HyperPod cluster | null |
354+
| `private_subnet_ids` | ARRAY | Private subnet IDs for HyperPod cluster | null |
355+
| `eks_private_subnet_ids` | ARRAY | Private subnet IDs for EKS cluster | null |
356+
| `nat_gateway_ids` | ARRAY | NAT Gateway IDs for internet routing | null |
357+
| `private_route_table_ids` | ARRAY | Private route table IDs | null |
358+
| `create_s3_endpoint_stack` | BOOLEAN | Create S3 Endpoint stack | true |
359+
| `enable_hp_inference_feature` | BOOLEAN | Enable inference operator | false |
360+
| `stage` | TEXT | Deployment stage ("gamma" or "prod") | "prod" |
361+
| `custom_bucket_name` | TEXT | S3 bucket name for templates | "sagemaker-hyperpod-cluster-stack-bucket" |
362+
| `create_life_cycle_script_stack` | BOOLEAN | Create Life Cycle Script Stack | true |
363+
| `create_s3_bucket_stack` | BOOLEAN | Create S3 Bucket Stack | true |
364+
| `s3_bucket_name` | TEXT | S3 bucket for cluster lifecycle scripts | "s3-bucket" |
365+
| `github_raw_url` | TEXT | Raw GitHub URL for lifecycle script | "https://raw.githubusercontent.com/aws-samples/..." |
366+
| `on_create_path` | TEXT | File name of lifecycle script | "sagemaker-hyperpod-eks-bucket" |
367+
| `create_sagemaker_iam_role_stack` | BOOLEAN | Create SageMaker IAM Role Stack | true |
368+
| `sagemaker_iam_role_name` | TEXT | IAM role name for SageMaker cluster creation | "create-cluster-role" |
369+
| `create_fsx_stack` | BOOLEAN | Create FSx Stack | true |
370+
| `fsx_subnet_id` | TEXT | Subnet ID for FSx creation | "" |
371+
| `fsx_availability_zone_id` | TEXT | Availability zone for FSx subnet | "" |
372+
| `per_unit_storage_throughput` | INTEGER | Per unit storage throughput | 250 |
373+
| `data_compression_type` | TEXT | Data compression type ("NONE" or "LZ4") | "NONE" |
374+
| `file_system_type_version` | FLOAT | File system type version | 2.15 |
375+
| `storage_capacity` | INTEGER | Storage capacity in GiB | 1200 |
376+
| `fsx_file_system_id` | TEXT | Existing FSx file system ID | "" |
377+
378+
**Note:** The actual available configuration parameters depend on the specific template schema version. Use `hyp init cluster-stack` to see all available parameters for your version.
317379

318380
## Examples
319381

@@ -325,7 +387,7 @@ mkdir my-hyperpod-cluster
325387
cd my-hyperpod-cluster
326388

327389
# Initialize cluster configuration
328-
hyp init hyp-cluster
390+
hyp init cluster-stack
329391

330392
# Configure basic parameters
331393
hyp configure --resource-name-prefix my-cluster --stage prod
@@ -341,7 +403,7 @@ hyp create --region us-west-2
341403

342404
```bash
343405
# Update instance groups
344-
hyp update hyp-cluster \
406+
hyp update cluster \
345407
--cluster-name my-cluster \
346408
--instance-groups '[{"InstanceCount":2,"InstanceGroupName":"worker-nodes","InstanceType":"ml.m5.large"}]' \
347409
--region us-west-2
@@ -351,10 +413,10 @@ hyp update hyp-cluster \
351413

352414
```bash
353415
# List all cluster stacks
354-
hyp list hyp-cluster --region us-west-2
416+
hyp list cluster-stack --region us-west-2
355417

356418
# Describe specific cluster stack
357-
hyp describe hyp-cluster my-stack-name --region us-west-2
419+
hyp describe cluster-stack my-stack-name --region us-west-2
358420

359421
# List HyperPod clusters with capacity info
360422
hyp list-cluster --region us-west-2 --output table
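
The `--instance-groups` value in the update example above is a JSON array passed as a single shell argument, which is easy to misquote by hand. A minimal sketch of building that argument programmatically follows. The helper name `build_update_command` is hypothetical and not part of the CLI or SDK. The flags and field names are copied from the example in this diff.

```python
import json
import shlex


def build_update_command(cluster_name, instance_groups, region):
    """Return a `hyp update cluster` invocation as an argument list.

    Hypothetical helper for illustration only; it assembles the flags
    shown in the documentation example and does not run anything.
    """
    return [
        "hyp", "update", "cluster",
        "--cluster-name", cluster_name,
        # Serialize the instance groups as compact JSON, matching the doc example.
        "--instance-groups", json.dumps(instance_groups, separators=(",", ":")),
        "--region", region,
    ]


groups = [
    {
        "InstanceCount": 2,
        "InstanceGroupName": "worker-nodes",
        "InstanceType": "ml.m5.large",
    }
]
argv = build_update_command("my-cluster", groups, "us-west-2")

# shlex.join quotes the JSON so the printed line can be pasted into a shell.
print(shlex.join(argv))
```

Passing `argv` as a list to `subprocess.run` would avoid shell quoting altogether; the printed form is only for copying into a terminal.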

doc/cli/cluster_management/cli_cluster_management_autogen.rst

Lines changed: 4 additions & 4 deletions
@@ -4,13 +4,13 @@
44
.. ========================================
55
66
.. .. .. click:: sagemaker.hyperpod.cli.commands.cluster_stack:create_cluster_stack
7-
.. .. :prog: hyp create hyp-cluster
7+
.. .. :prog: hyp create cluster-stack
88
99
.. .. click:: sagemaker.hyperpod.cli.commands.cluster_stack:describe_cluster_stack
10-
.. :prog: hyp describe hyp-cluster
10+
.. :prog: hyp describe cluster-stack
1111
1212
.. .. click:: sagemaker.hyperpod.cli.commands.cluster_stack:list_cluster_stacks
13-
.. :prog: hyp list hyp-cluster
13+
.. :prog: hyp list cluster-stack
1414
1515
.. .. click:: sagemaker.hyperpod.cli.commands.cluster_stack:update_cluster
16-
.. :prog: hyp update hyp-cluster
16+
.. :prog: hyp update cluster

doc/examples.md

Lines changed: 24 additions & 1 deletion
@@ -2,6 +2,29 @@
22

33
# Example Notebooks
44

5+
## Cluster Management Example Notebooks
6+
7+
For detailed examples of cluster management with HyperPod, see:
8+
9+
::::{grid} 1 2 2 2
10+
:gutter: 3
11+
12+
:::{grid-item-card} CLI Cluster Management Example
13+
:link: https://github.com/aws/sagemaker-hyperpod-cli/blob/main/examples/cluster_management/cluster_creation_init_experience.ipynb
14+
:class-card: sd-border-primary
15+
16+
**Cluster Management Examples** Refer to the Cluster Management CLI Example.
17+
:::
18+
19+
:::{grid-item-card} SDK Cluster Management Example
20+
:link: https://github.com/aws/sagemaker-hyperpod-cli/blob/main/examples/cluster_management/cluster_creation_sdk_experience.ipynb
21+
:class-card: sd-border-primary
22+
23+
**Cluster Management Examples** Refer to the Cluster Management SDK Example.
24+
:::
25+
26+
::::
27+
528
## Training Example Notebooks
629

730
For detailed examples of training with HyperPod, see:
@@ -47,4 +70,4 @@ For detailed examples of inference with HyperPod, see:
4770

4871
:::
4972

50-
::::
73+
::::

doc/getting_started/cluster_management.rst

Lines changed: 26 additions & 7 deletions
@@ -15,6 +15,8 @@ Before you begin, ensure you have:
1515
.. note::
1616
**Region Configuration**: For commands that accept the ``--region`` option, if no region is explicitly provided, the command will use the default region from your AWS credentials configuration.
1717

18+
**Cluster stack names must be unique within each AWS region.** If you attempt to create a cluster stack with a name that already exists in the same region, the deployment will fail.
19+
1820
Creating Your First Cluster
1921
----------------------------
2022

@@ -37,7 +39,7 @@ It's recommended to start with a new and clean directory for each cluster config
3739

3840
.. code-block:: bash
3941
40-
hyp init hyp-cluster
42+
hyp init cluster-stack
4143
4244
This creates three files:
4345

@@ -59,24 +61,30 @@ The config.yaml file contains key parameters like:
5961

6062
.. code-block:: yaml
6163
62-
template: hyp-cluster
64+
template: cluster-stack
6365
namespace: kube-system
6466
stage: gamma
6567
resource_name_prefix: sagemaker-hyperpod-eks
6668
67-
**Option 2: Use CLI/SDK commands**
69+
**Option 2: Use CLI/SDK commands (Pre-Deployment)**
6870

6971
.. tab-set::
7072

7173
.. tab-item:: CLI
7274

7375
.. code-block:: bash
7476
75-
hyp configure --resource-name-prefix your-resource-prefix
77+
hyp configure --resource-name-prefix your-resource-prefix
78+
79+
.. note::
80+
The ``hyp configure`` command only modifies local configuration files. It does not affect existing deployed clusters.
7681

7782
4. Create the Cluster
7883
~~~~~~~~~~~~~~~~~~~~~
7984

85+
.. warning::
86+
**Cluster Stack Name Uniqueness**: Cluster stack names must be unique within each AWS region. Ensure your ``resource_name_prefix`` in ``config.yaml`` generates a unique stack name for the target region to avoid deployment conflicts.
87+
8088
.. tab-set::
8189

8290
.. tab-item:: CLI
@@ -102,7 +110,7 @@ Check the status of your cluster:
102110

103111
.. code-block:: bash
104112
105-
hyp describe hyp-cluster your-cluster-name --region your-region
113+
hyp describe cluster-stack your-cluster-name --region your-region
106114
107115
.. tab-item:: SDK
108116

@@ -114,6 +122,9 @@ Check the status of your cluster:
114122
response = HpClusterStack.describe("your-cluster-name", region="your-region")
115123
print(f"Stack Status: {response['Stacks'][0]['StackStatus']}")
116124
print(f"Stack Name: {response['Stacks'][0]['StackName']}")
125+
126+
.. note::
127+
**Region-Specific Stack Names**: Cluster stack names are unique within each AWS region. When describing a stack, ensure you specify the correct region where the stack was created, or the command will fail to find the stack.
117128

118129

119130
List all clusters:
@@ -124,7 +135,7 @@ List all clusters:
124135

125136
.. code-block:: bash
126137
127-
hyp list hyp-cluster --region your-region
138+
hyp list cluster-stack --region your-region
128139
129140
.. tab-item:: SDK
130141

@@ -144,13 +155,21 @@ Common Operations
144155
Update a Cluster
145156
~~~~~~~~~~~~~~~~~
146157

158+
.. important::
159+
**Runtime vs Configuration Commands**:
160+
161+
- ``hyp update cluster`` modifies **existing, deployed clusters** (runtime settings like instance groups, node recovery)
162+
- ``hyp configure`` modifies local ``config.yaml`` files **before** cluster creation
163+
164+
Use the appropriate command based on whether your cluster is already deployed or not.
165+
147166
.. tab-set::
148167

149168
.. tab-item:: CLI
150169

151170
.. code-block:: bash
152171
153-
hyp update hyp-cluster \
172+
hyp update cluster \
154173
--cluster-name your-cluster-name \
155174
--instance-groups "[]" \
156175
--region your-region
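
The SDK snippet in this file reads the stack status from `response['Stacks'][0]`. A minimal sketch of wrapping that lookup in a readiness check follows. The functions `summarize_stack` and `is_ready` and the sample response are hypothetical. Only the `Stacks`, `StackName`, and `StackStatus` keys and the two status values come from the docs in this diff.

```python
def summarize_stack(response):
    """Return (stack name, stack status) from a describe-style response dict."""
    stack = response["Stacks"][0]
    return stack["StackName"], stack["StackStatus"]


def is_ready(response, ready_states=("CREATE_COMPLETE", "UPDATE_COMPLETE")):
    """Return True when the stack status is one of the given ready states."""
    _, status = summarize_stack(response)
    return status in ready_states


# Hypothetical response shaped like the one the SDK example prints from.
sample = {"Stacks": [{"StackName": "my-stack-name", "StackStatus": "CREATE_COMPLETE"}]}

name, status = summarize_stack(sample)
print(f"{name}: {status} (ready={is_ready(sample)})")
```

In practice the dictionary would come from `HpClusterStack.describe(...)` called with the region where the stack was created, as the note in this file points out.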
