Commit bb25aed

xiaoxshecan-sunamazon-autojswudibaiyli
authored
Support recipes and scheduler in Hyperpod CLI (#41)
* add recipes feature for distributed training
* improve unit test coverage for recipes feature
* add support recipes along with command line args
* add recipes
* Crescendo helm chart for role and rolebinding (#17)
* update the helm chart to create team level roles and bindings
* revert unrelated changes
* Rename quotaAllocationTarget to computeQuotaTarget
* remove kueue related resources from helm chart
* Remove parameters of kueue from chart
* flip the team role creation to false
* Revise readme to add instructions to create the role and binding
* add changelog for distributed training
* change to public submodules
* QuotaAllocation support for Hyperpod CLI (#12)
* QuotaAllocation support for Hyperpod CLI
---------
Co-authored-by: Amazon GitHub Automation <[email protected]>
Co-authored-by: Song Jiang <[email protected]>
Co-authored-by: Baiyang Li <[email protected]>
Co-authored-by: baiyli <[email protected]>
* Remove custom_launcher folder
* sync with mainline
---------
Co-authored-by: cansun <[email protected]>
Co-authored-by: Amazon GitHub Automation <[email protected]>
Co-authored-by: Song Jiang <[email protected]>
Co-authored-by: Baiyang Li <[email protected]>
Co-authored-by: baiyli <[email protected]>
Co-authored-by: Can Sun <[email protected]>
1 parent 87f5660 commit bb25aed

69 files changed, +3475 -2392 lines changed

.gitignore

Lines changed: 4 additions & 0 deletions
@@ -17,3 +17,7 @@ __pycache__/

 /doc/_apidoc/
 /build
+
+# Ignore all contents of result and results directories
+/result/
+/results/

.gitmodules

Lines changed: 3 additions & 4 deletions
@@ -1,4 +1,3 @@
-[submodule "src/hyperpod_cli/custom_launcher/launcher/nemo/nemo_framework_launcher"]
-path = src/hyperpod_cli/custom_launcher/launcher/nemo/nemo_framework_launcher
-url = https://github.com/NVIDIA/NeMo-Framework-Launcher.git
-branch = 3d41c31
+[submodule "src/hyperpod_cli/sagemaker_hyperpod_recipes"]
+path = src/hyperpod_cli/sagemaker_hyperpod_recipes
+url = https://github.com/aws/sagemaker-hyperpod-recipes.git

CHANGELOG.md

Lines changed: 6 additions & 0 deletions
@@ -1,5 +1,11 @@
 # Changelog

+## v2.0.0 (2024-12-04)
+
+### Features
+
+- feature: The HyperPod CLI now support ([Hyperpod recipes](https://github.com/aws/sagemaker-hyperpod-recipes.git)). The HyperPod recipes enable customers to get started training and fine-tuning popular publicly-available foundation models like Llama 3.1 405B in minutes. Learn more ([here](https://github.com/aws/sagemaker-hyperpod-recipes.git)).
+
 ## v1.0.0 (2024-09-09)

 ### Features

README.md

Lines changed: 33 additions & 13 deletions
@@ -24,7 +24,7 @@ This documentation serves as a reference for the available HyperPod CLI commands

 ## Overview

-The SageMaker HyperPod CLI is a tool that helps submit training jobs to the Amazon SageMaker HyperPod clusters orchestrated by Amazon EKS. It provides a set of commands for managing the full lifecycle of training jobs, including submitting, describing, listing, and canceling jobs, as well as accessing logs and executing commands within the job's containers. The CLI is designed to abstract away the complexity of working directly with Kubernetes for these core actions of managing jobs on SageMaker HyperPod clusters orchestrated by Amazon EKS.
+The SageMaker HyperPod CLI is a tool that helps submit training jobs to the Amazon SageMaker HyperPod clusters orchestrated by Amazon EKS. It provides a set of commands for managing the full lifecycle of training jobs, including submitting, describing, listing, patching and canceling jobs, as well as accessing logs and executing commands within the job's containers. The CLI is designed to abstract away the complexity of working directly with Kubernetes for these core actions of managing jobs on SageMaker HyperPod clusters orchestrated by Amazon EKS.

 ## Prerequisites

@@ -76,6 +76,10 @@ SageMaker HyperPod CLI currently supports start training job with:
 ```
 hyperpod get-clusters
 ```
+- Get your HyperPod clusters to show their capacities and quota allocation info for a team.
+```
+hyperpod get-clusters -n hyperpod-ns-<team-name>
+```
 - Connect to one HyperPod cluster and specify a namespace you have access to.
 ```
 hyperpod connect-cluster --cluster-name <cluster-name>
@@ -104,11 +108,12 @@ The HyperPod CLI provides the following commands:
 This command lists the available SageMaker HyperPod clusters and their capacity information.

 ```
-hyperpod get-clusters [--region <region>] [--clusters <cluster1,cluster2>] [--orchestrator <eks>] [--output <json|table>]
+hyperpod get-clusters [--region <region>] [--clusters <cluster1,cluster2>] [--namespace <namespace>] [--orchestrator <eks>] [--output <json|table>]
 ```

 * `region` (string) - Optional. The region that the SageMaker HyperPod and EKS clusters are located. If not specified, it will be set to the region from the current AWS account credentials.
 * `clusters` (list[string]) - Optional. A list of SageMaker HyperPod cluster names that users want to check the capacity for. This is useful for users who know some of their most commonly used clusters and want to check the capacity status of the clusters in the AWS account.
+* `namespace` (string) - Optional. The namespace that users want to check the quota with. Only the SageMaker managed namespaces are supported.
 * `orchestrator` (enum) - Optional. The orchestrator type for the cluster. Currently, `'eks'` is the only available option.
 * `output` (enum) - Optional. The output format. Available values are `table` and `json`. The default value is `json`.

@@ -122,19 +127,19 @@ hyperpod connect-cluster --cluster-name <cluster-name> [--region <region>] [--na

 * `cluster-name` (string) - Required. The SageMaker HyperPod cluster name to configure with.
 * `region` (string) - Optional. The region that the SageMaker HyperPod and EKS clusters are located. If not specified, it will be set to the region from the current AWS account credentials.
-* `namespace` (string) - Optional. The namespace that you want to connect to. If not specified, this command uses the [Kubernetes namespace](https://kubernetes.io/docs/concepts/overview/working-with-objects/namespaces/) of the Amazon EKS cluster associated with the SageMaker HyperPod cluster in your AWS account.
+* `namespace` (string) - Optional. The namespace that you want to connect to. If not specified, Hyperpod cli commands will auto discover the accessible namespace.

 ### Submitting a Job

 This command submits a new training job to the connected SageMaker HyperPod cluster.

 ```
-hyperpod start-job --job-name <job-name> [--namespace <namespace>] [--job-kind <kubeflow/PyTorchJob>] [--image <image>] [--command <command>] [--entry-script <script>] [--script-args <arg1 arg2>] [--environment <key=value>] [--pull-policy <Always|IfNotPresent|Never>] [--instance-type <instance-type>] [--node-count <count>] [--tasks-per-node <count>] [--label-selector <key=value>] [--deep-health-check-passed-nodes-only] [--scheduler-type <Kueue>] [--queue-name <queue-name>] [--priority <priority>] [--auto-resume] [--max-retry <count>] [--restart-policy <Always|OnFailure|Never|ExitCode>] [--volumes <volume1,volume2>] [--persistent-volume-claims <claim1:/mount/path,claim2:/mount/path>] [--results-dir <dir>] [--service-account-name <account>]
+hyperpod start-job --job-name <job-name> [--namespace <namespace>] [--job-kind <kubeflow/PyTorchJob>] [--image <image>] [--command <command>] [--entry-script <script>] [--script-args <arg1 arg2>] [--environment <key=value>] [--pull-policy <Always|IfNotPresent|Never>] [--instance-type <instance-type>] [--node-count <count>] [--tasks-per-node <count>] [--label-selector <key=value>] [--deep-health-check-passed-nodes-only] [--scheduler-type <Kueue SageMaker None>] [--queue-name <queue-name>] [--priority <priority>] [--auto-resume] [--max-retry <count>] [--restart-policy <Always|OnFailure|Never|ExitCode>] [--volumes <volume1,volume2>] [--persistent-volume-claims <claim1:/mount/path,claim2:/mount/path>] [--results-dir <dir>] [--service-account-name <account>]
 ```

 * `job-name` (string) - Required. The name of the job.
 * `job-kind` (string) - Optional. The training job kind. The job type currently supported is `kubeflow/PyTorchJob`.
-* `namespace` (string) - Optional. The namespace to use. If not specified, this command uses the [Kubernetes namespace](https://kubernetes.io/docs/concepts/overview/working-with-objects/namespaces/) of the Amazon EKS cluster associated with the SageMaker HyperPod cluster in your AWS account.
+* `namespace` (string) - Optional. The namespace to use. If not specified, this command will first use the namespace when connecting the cluster. Otherwise if namespace is not configured when connecting to the cluster, a namespace that is managed by SageMaker will be auto discovered.
 * `image` (string) - Required. The image used when creating the training job.
 * `pull-policy` (enum) - Optional. The policy to pull the container image. Valid values are `Always`, `IfNotPresent`, and `Never`, as available from the PyTorchJob. The default is `Always`.
 * `command` (string) - Optional. The command to run the entrypoint script. Currently, only `torchrun` is supported.
@@ -146,7 +151,7 @@ hyperpod start-job --job-name <job-name> [--namespace <namespace>] [--job-kind <
 * `tasks-per-node` (int) - Optional. The number of devices to use per instance.
 * `label-selector` (dict[string, list[string]]) - Optional. A dictionary of labels and their values that will override the predefined node selection rules based on the SageMaker HyperPod `node-health-status` label and values. If users provide this field, the CLI will launch the job with this customized label selection.
 * `deep-health-check-passed-nodes-only` (bool) - Optional. If set to `true`, the job will be launched only on nodes that have the `deep-health-check-status` label with the value `passed`.
-* `scheduler-type` (enum) - Optional. The scheduler type to use. Currently, only `Kueue` is supported.
+* `scheduler-type` (enum) - Optional. The scheduler type to use which can be `SageMaker`, `Kueue` or `None`. Default value is `SageMaker`.
 * `queue-name` (string) - Optional. The name of the queue to submit the job to, which is created by the cluster admin users in your AWS account.
 * `priority` (string) - Optional. The priority for the job, which needs to be created by the cluster admin users and match the name in the cluster.
 * `auto-resume` (bool) - Optional. The flag to enable HyperPod resilience job auto resume. If set to `true`, the job will automatically resume after pod or node failure. To enable `auto-resume`, you also should set `restart-policy` to `OnFailure`.
@@ -167,7 +172,7 @@ hyperpod get-job --job-name <job-name> [--namespace <namespace>] [--verbose]
 ```

 * `job-name` (string) - Required. The name of the job.
-* `namespace` (string) - Optional. The namespace to describe the job in. If not provided, the CLI will try to describe the job in the namespace set by the user while connecting to the cluster. If provided, and the user has access to the namespace, the CLI will describe the job from the specified namespace.
+* `namespace` (string) - Optional. The namespace to use. If not specified, this command will first use the namespace when connecting the cluster. Otherwise if namespace is not configured when connecting to the cluster, a namespace that is managed by SageMaker will be auto discovered.
 * `verbose` (flag) - Optional. If set to `True`, the command enables verbose mode and prints out more detailed output with additional fields.

 ### Listing Jobs
@@ -178,7 +183,7 @@ This command lists all the training jobs in the connected SageMaker HyperPod clu
 hyperpod list-jobs [--namespace <namespace>] [--all-namespaces] [--selector <key=value>]
 ```

-* `namespace` (string) - Optional. The namespace to list the jobs in. If not provided, this command lists the jobs in the namespace specified during connecting to the cluster. If the namespace is provided and if the user has access to the namespace, this command lists the jobs from the specified namespace.
+* `namespace` (string) - Optional. The namespace to use. If not specified, this command will first use the namespace when connecting the cluster. Otherwise if namespace is not configured when connecting to the cluster, a namespace that is managed by SageMaker will be auto discovered.
 * `all-namespaces` (flag) - Optional. If set, this command lists jobs from all namespaces the data scientist users have access to. The namespace in the current AWS account credentials will be ignored, even if specified with the `--namespace` option.
 * `selector` (string) - Optional. A label selector to filter the listed jobs. The selector supports the '=', '==', and '!=' operators (e.g., `-l key1=value1,key2=value2`).

@@ -191,7 +196,7 @@ hyperpod cancel-job --job-name <job-name> [--namespace <namespace>]
 ```

 * `job-name` (string) - Required. The name of the job to cancel.
-* `namespace` (string) - Optional. The namespace to cancel the job in. If not provided, the CLI will try to cancel the job in the namespace set by the user while connecting to the cluster. If provided, and the user has access to the namespace, the CLI will cancel the job from the specified namespace.
+* `namespace` (string) - Optional. The namespace to use. If not specified, this command will first use the namespace when connecting the cluster. Otherwise if namespace is not configured when connecting to the cluster, a namespace that is managed by SageMaker will be auto discovered.

 ### Listing Pods

@@ -202,7 +207,7 @@ hyperpod list-pods --job-name <job-name> [--namespace <namespace>]
 ```

 * `job-name` (string) - Required. The name of the job to list pods for.
-* `namespace` (string) - Optional. The namespace to list the pods in. If not provided, the CLI will list the pods in the namespace set by the user while connecting to the cluster. If provided, and the user has access to the namespace, the CLI will list the pods from the specified namespace.
+* `namespace` (string) - Optional. The namespace to use. If not specified, this command will first use the namespace when connecting the cluster. Otherwise if namespace is not configured when connecting to the cluster, a namespace that is managed by SageMaker will be auto discovered.

 ### Accessing Logs

@@ -214,7 +219,7 @@ hyperpod get-log --job-name <job-name> --pod <pod-name> [--namespace <namespace>

 * `job-name` (string) - Required. The name of the job to get the log for.
 * `pod` (string) - Required. The name of the pod to get the log from.
-* `namespace` (string) - Optional. The namespace to get the log from. If not provided, the CLI will get the log from the pod in the namespace set by the user while connecting to the cluster. If provided, and the user has access to the namespace, the CLI will get the log from the pod in the specified namespace.
+* `namespace` (string) - Optional. The namespace to use. If not specified, this command will first use the namespace when connecting the cluster. Otherwise if namespace is not configured when connecting to the cluster, a namespace that is managed by SageMaker will be auto discovered.

 ### Executing Commands

@@ -226,6 +231,21 @@ hyperpod exec --job-name <job-name> [-p <pod-name>] [--all-pods] -- <command>

 * `job-name` (string) - Required. The name of the job to execute the command within the container of a pod associated with a training job.
 * `bash-command` (string) - Required. The bash command(s) to run.
-* `namespace` (string) - Optional. The namespace to execute the command in. If not provided, the CLI will try to execute the command in the pod in the namespace set by the user while connecting to the cluster. If provided, and the user has access to the namespace, the CLI will execute the command in the pod from the specified namespace.
+* `namespace` (string) - Optional. The namespace to use. If not specified, this command will first use the namespace when connecting the cluster. Otherwise if namespace is not configured when connecting to the cluster, a namespace that is managed by SageMaker will be auto discovered.
 * `pod` (string) - Optional. The name of the pod to execute the command in. You must provide either `--pod` or `--all-pods`.
-* `all-pods` (flag) - Optional. If set, the command will be executed in all pods associated with the job.
+* `all-pods` (flag) - Optional. If set, the command will be executed in all pods associated with the job.
+
+### Patch Jobs
+
+This command patches a job with certain operation. Currently only `suspend` and `unsuspend` are supported.
+
+```
+hyperpod patch-job suspend --job-name <job-name> [--namespace <namespace>]
+```
+
+```
+hyperpod patch-job unsuspend --job-name <job-name> [--namespace <namespace>]
+```
+
+* `job-name` (string) - Required. The name of the job to be patched.
+* `namespace` (string) - Optional. The namespace to use. If not specified, this command will first use the namespace when connecting the cluster. Otherwise if namespace is not configured when connecting to the cluster, a namespace that is managed by SageMaker will be auto discovered.
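
Reviewer note: the updated `namespace` entries in this README diff describe a three-step fallback: an explicit `--namespace`, then the namespace saved when connecting to the cluster, then auto-discovery of a SageMaker managed namespace. The sketch below illustrates that order only. It is not taken from the commit; `resolve_namespace`, its parameters, and the `discover` callback are hypothetical names, and how the CLI actually discovers a namespace is not shown in this diff.

```python
from typing import Callable, Optional


def resolve_namespace(
    cli_namespace: Optional[str],
    connected_namespace: Optional[str],
    discover: Callable[[], Optional[str]],
) -> Optional[str]:
    """Pick a namespace using the fallback order described in the README.

    All names here are hypothetical; `discover` stands in for whatever
    lookup the CLI performs to find a SageMaker managed namespace.
    """
    # 1. An explicit --namespace argument always wins.
    if cli_namespace:
        return cli_namespace
    # 2. Otherwise use the namespace recorded by `connect-cluster`, if any.
    if connected_namespace:
        return connected_namespace
    # 3. Otherwise fall back to auto-discovery.
    return discover()
```

Passing discovery in as a callback keeps the lookup lazy, so no cluster call is needed when one of the first two sources already supplies a namespace.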

examples/basic-job-example-config.yaml

Lines changed: 7 additions & 1 deletion
@@ -64,8 +64,10 @@ cluster:
 # Mapping to '--namespace' argument in 'start-job' command.
 namespace: kubeflow
 # custom_labels: Optional. Used to specify the name of the queue, which is created by the cluster admin users.
+# The priority class label is mapped to '--priority' argument in 'start-job' command if your scheduler type is 'SageMaker'.
 # custom_labels:
 # kueue.x-k8s.io/queue-name: low-priority-queue2
+# kueue.x-k8s.io/priority-class: sample-priority
 custom_labels: null
 # priority_class_name: Optional. The priority for the job, which is created by the cluster admin users.
 # Mapping to '--priority' argument in 'start-job' command.
@@ -96,7 +98,11 @@ cluster:
 # To use SageMaker Hyperpod AutoResume functionality, please set it to OnFailure.
 # Mapping to '--restart-policy' argument in 'start-job' command.
 restartPolicy: OnFailure
-
+# scheduler_type: Optional. Used to decide which type of scheduler to use. Default value is 'SageMaker' which makes
+# the job only scheduled on queues created via SageMaker. Another valid value is 'Kueue', with this option, queue name
+# and namespace has to be manually filled out.
+# scheduler_type: Kueue
+scheduler_type: SageMaker
 # base_results_dir: Optional. Location to store the results, checkpoints and logs.
 # Mapping to '--results-dir' argument in 'start-job' command.
 base_results_dir: ./result
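
Reviewer note: the new `scheduler_type` comments say the default is `SageMaker` and that `Kueue` requires the queue name and namespace to be filled in manually, while the README diff also lists `None` as a value. A minimal sketch of a pre-submission check built on those statements follows. It is an assumption-laden illustration, not the CLI's validation code: `check_scheduler_settings`, `VALID_SCHEDULER_TYPES`, and the message strings are hypothetical.

```python
from typing import List, Optional

# Values listed for scheduler-type in the README diff of this commit.
VALID_SCHEDULER_TYPES = ("SageMaker", "Kueue", "None")


def check_scheduler_settings(
    scheduler_type: str = "SageMaker",
    queue_name: Optional[str] = None,
    namespace: Optional[str] = None,
) -> List[str]:
    """Return a list of problems; an empty list means the settings look consistent.

    Hypothetical helper: it only encodes what the config comments state.
    """
    problems: List[str] = []
    if scheduler_type not in VALID_SCHEDULER_TYPES:
        problems.append(f"unknown scheduler_type: {scheduler_type!r}")
    elif scheduler_type == "Kueue":
        # Per the example config, Kueue needs both values supplied by the user.
        if not queue_name:
            problems.append("Kueue requires an explicit queue name")
        if not namespace:
            problems.append("Kueue requires an explicit namespace")
    return problems
```

Returning a list instead of raising on the first problem lets a caller report every missing field at once.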

helm_chart/HyperPodHelmChart/Chart.yaml

Lines changed: 5 additions & 5 deletions
@@ -42,11 +42,7 @@ dependencies:
 - name: neuron-device-plugin
 version: "0.1.0"
 repository: "file://charts/neuron-device-plugin"
-condition: neuron-device-plugin.devicePlugin.enabled
-- name: kueue
-version: "0.1.0"
-repository: "file://charts/kueue"
-condition: kueue.enabled
+condition: neuron-device-plugin.devicePlugin.enabled
 - name: storage
 version: "0.1.0"
 repository: "file://charts/storage"
@@ -75,3 +71,7 @@ dependencies:
 version: "0.1.0"
 repository: "file://charts/namespaced-role-and-bindings"
 condition: namespaced-role-and-bindings.enabled
+- name: team-role-and-bindings
+version: "0.1.0"
+repository: "file://charts/team-role-and-bindings"
+condition: team-role-and-bindings.enabled

helm_chart/HyperPodHelmChart/charts/kueue/Chart.yaml

Lines changed: 0 additions & 7 deletions
This file was deleted.

helm_chart/HyperPodHelmChart/charts/kueue/templates/priority-class.yaml

Lines changed: 0 additions & 10 deletions
This file was deleted.

helm_chart/HyperPodHelmChart/charts/kueue/templates/queue.yaml

Lines changed: 0 additions & 17 deletions
This file was deleted.

helm_chart/HyperPodHelmChart/charts/kueue/values.yaml

Lines changed: 0 additions & 21 deletions
This file was deleted.
