Support recipes and scheduler in Hyperpod CLI (#41)
* add recipes feature for distributed training
* improve unit test coverage for recipes feature
* add support recipes along with command line args
* add recipes
* Crescendo helm chart for role and rolebinding (#17)
* update the helm chart to create team level roles and bindings
* revert unrelated changes
* Rename quotaAllocationTarget to computeQuotaTarget
* remove kueue related resources from helm chart
* Remove parameters of kueue from chart
* flip the team role creation to false
* Revise readme to add instructions to create the role and binding
* add changelog for distributed training
* change to public submodules
* QuotaAllocation support for Hyperpod CLI (#12)
* QuotaAllocation support for Hyperpod CLI
---------
Co-authored-by: Amazon GitHub Automation <[email protected]>
Co-authored-by: Song Jiang <[email protected]>
Co-authored-by: Baiyang Li <[email protected]>
Co-authored-by: baiyli <[email protected]>
* Remove custom_launcher folder
* sync with mainline
---------
Co-authored-by: cansun <[email protected]>
Co-authored-by: Amazon GitHub Automation <[email protected]>
Co-authored-by: Song Jiang <[email protected]>
Co-authored-by: Baiyang Li <[email protected]>
Co-authored-by: baiyli <[email protected]>
Co-authored-by: Can Sun <[email protected]>
CHANGELOG.md: 6 additions & 0 deletions
@@ -1,5 +1,11 @@
 # Changelog

+## v2.0.0 (2024-12-04)
+
+### Features
+
+- feature: The HyperPod CLI now supports [HyperPod recipes](https://github.com/aws/sagemaker-hyperpod-recipes.git). HyperPod recipes enable customers to get started training and fine-tuning popular publicly available foundation models, such as Llama 3.1 405B, in minutes. Learn more [here](https://github.com/aws/sagemaker-hyperpod-recipes.git).
README.md: 33 additions & 13 deletions
@@ -24,7 +24,7 @@ This documentation serves as a reference for the available HyperPod CLI commands
 ## Overview

-The SageMaker HyperPod CLI is a tool that helps submit training jobs to the Amazon SageMaker HyperPod clusters orchestrated by Amazon EKS. It provides a set of commands for managing the full lifecycle of training jobs, including submitting, describing, listing, and canceling jobs, as well as accessing logs and executing commands within the job's containers. The CLI is designed to abstract away the complexity of working directly with Kubernetes for these core actions of managing jobs on SageMaker HyperPod clusters orchestrated by Amazon EKS.
+The SageMaker HyperPod CLI is a tool that helps submit training jobs to the Amazon SageMaker HyperPod clusters orchestrated by Amazon EKS. It provides a set of commands for managing the full lifecycle of training jobs, including submitting, describing, listing, patching, and canceling jobs, as well as accessing logs and executing commands within the job's containers. The CLI is designed to abstract away the complexity of working directly with Kubernetes for these core actions of managing jobs on SageMaker HyperPod clusters orchestrated by Amazon EKS.

 ## Prerequisites
@@ -76,6 +76,10 @@ SageMaker HyperPod CLI currently supports start training job with:
 ```
 hyperpod get-clusters
 ```
+- Get your HyperPod clusters to show their capacities and quota allocation info for a team.
+```
+hyperpod get-clusters -n hyperpod-ns-<team-name>
+```
 - Connect to one HyperPod cluster and specify a namespace you have access to.
 * `region` (string) - Optional. The region where the SageMaker HyperPod and EKS clusters are located. If not specified, it will be set to the region from the current AWS account credentials.
 * `clusters` (list[string]) - Optional. A list of SageMaker HyperPod cluster names that users want to check the capacity for. This is useful for users who know some of their most commonly used clusters and want to check the capacity status of the clusters in the AWS account.
+* `namespace` (string) - Optional. The namespace whose quota users want to check. Only SageMaker-managed namespaces are supported.
 * `orchestrator` (enum) - Optional. The orchestrator type for the cluster. Currently, `'eks'` is the only available option.
 * `output` (enum) - Optional. The output format. Available values are `table` and `json`. The default value is `json`.
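As a rough illustration of the team-scoped `get-clusters` call added in this change, the sketch below assembles the argument list for `hyperpod get-clusters -n hyperpod-ns-<team-name>`. The helper name `get_clusters_args` is hypothetical, and the `hyperpod-ns-` prefix is taken from the example in this diff rather than from a verified naming rule.

```python
from typing import List, Optional

# Prefix copied from the `hyperpod-ns-<team-name>` example; assumed, not verified.
TEAM_NAMESPACE_PREFIX = "hyperpod-ns-"


def get_clusters_args(team_name: Optional[str] = None) -> List[str]:
    """Build argv for `hyperpod get-clusters` (hypothetical helper, not part of the CLI)."""
    args = ["hyperpod", "get-clusters"]
    if team_name:
        # `-n` is the short namespace flag shown in the README example.
        args += ["-n", f"{TEAM_NAMESPACE_PREFIX}{team_name}"]
    return args


if __name__ == "__main__":
    # The resulting list could be passed to subprocess.run on a machine with the CLI installed.
    print(get_clusters_args("research"))
```

Building the arguments as a list, rather than as a single string, avoids shell quoting issues if a team name ever contains unusual characters.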
 * `cluster-name` (string) - Required. The SageMaker HyperPod cluster name to configure with.
 * `region` (string) - Optional. The region where the SageMaker HyperPod and EKS clusters are located. If not specified, it will be set to the region from the current AWS account credentials.
-* `namespace` (string) - Optional. The namespace that you want to connect to. If not specified, this command uses the [Kubernetes namespace](https://kubernetes.io/docs/concepts/overview/working-with-objects/namespaces/) of the Amazon EKS cluster associated with the SageMaker HyperPod cluster in your AWS account.
+* `namespace` (string) - Optional. The namespace that you want to connect to. If not specified, HyperPod CLI commands will auto-discover the accessible namespace.

 ### Submitting a Job

 This command submits a new training job to the connected SageMaker HyperPod cluster.
 * `job-name` (string) - Required. The name of the job.
 * `job-kind` (string) - Optional. The training job kind. The job type currently supported is `kubeflow/PyTorchJob`.
-* `namespace` (string) - Optional. The namespace to use. If not specified, this command uses the [Kubernetes namespace](https://kubernetes.io/docs/concepts/overview/working-with-objects/namespaces/) of the Amazon EKS cluster associated with the SageMaker HyperPod cluster in your AWS account.
+* `namespace` (string) - Optional. The namespace to use. If not specified, this command first uses the namespace configured when connecting to the cluster. If no namespace was configured when connecting, a namespace managed by SageMaker is auto-discovered.
 * `image` (string) - Required. The image used when creating the training job.
 * `pull-policy` (enum) - Optional. The policy to pull the container image. Valid values are `Always`, `IfNotPresent`, and `Never`, as available from the PyTorchJob. The default is `Always`.
 * `command` (string) - Optional. The command to run the entrypoint script. Currently, only `torchrun` is supported.
 * `tasks-per-node` (int) - Optional. The number of devices to use per instance.
 * `label-selector` (dict[string, list[string]]) - Optional. A dictionary of labels and their values that will override the predefined node selection rules based on the SageMaker HyperPod `node-health-status` label and values. If users provide this field, the CLI will launch the job with this customized label selection.
 * `deep-health-check-passed-nodes-only` (bool) - Optional. If set to `true`, the job will be launched only on nodes that have the `deep-health-check-status` label with the value `passed`.
-* `scheduler-type` (enum) - Optional. The scheduler type to use. Currently, only `Kueue` is supported.
+* `scheduler-type` (enum) - Optional. The scheduler type to use, which can be `SageMaker`, `Kueue`, or `None`. The default value is `SageMaker`.
 * `queue-name` (string) - Optional. The name of the queue to submit the job to, which is created by the cluster admin users in your AWS account.
 * `priority` (string) - Optional. The priority for the job, which needs to be created by the cluster admin users and match the name in the cluster.
 * `auto-resume` (bool) - Optional. The flag to enable HyperPod resilience job auto resume. If set to `true`, the job will automatically resume after pod or node failure. To enable `auto-resume`, you should also set `restart-policy` to `OnFailure`.
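The updated `scheduler-type` option accepts three values and defaults to `SageMaker`. A minimal sketch of how such an option could be validated is shown below; `SchedulerType` and `parse_scheduler_type` are hypothetical names, and case-sensitive matching is an assumption, since this diff does not show how the CLI parses the value.

```python
from enum import Enum
from typing import Optional


class SchedulerType(Enum):
    """Values listed for `scheduler-type` in the updated README."""

    SAGEMAKER = "SageMaker"
    KUEUE = "Kueue"
    NONE = "None"


def parse_scheduler_type(value: Optional[str] = None) -> SchedulerType:
    """Return the scheduler type, falling back to SageMaker when the option is omitted."""
    if value is None:
        return SchedulerType.SAGEMAKER
    try:
        # Lookup by value; matching is case-sensitive (an assumption).
        return SchedulerType(value)
    except ValueError:
        allowed = ", ".join(member.value for member in SchedulerType)
        raise ValueError(f"scheduler-type must be one of: {allowed}") from None


if __name__ == "__main__":
    print(parse_scheduler_type())         # option omitted, default applies
    print(parse_scheduler_type("Kueue"))  # explicit choice
```

Note that the string `"None"` is a literal option value here, distinct from Python's `None`, which this sketch uses to mean "option not provided".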
 * `job-name` (string) - Required. The name of the job.
-* `namespace` (string) - Optional. The namespace to describe the job in. If not provided, the CLI will try to describe the job in the namespace set by the user while connecting to the cluster. If provided, and the user has access to the namespace, the CLI will describe the job from the specified namespace.
+* `namespace` (string) - Optional. The namespace to use. If not specified, this command first uses the namespace configured when connecting to the cluster. If no namespace was configured when connecting, a namespace managed by SageMaker is auto-discovered.
 * `verbose` (flag) - Optional. If set to `True`, the command enables verbose mode and prints out more detailed output with additional fields.

 ### Listing Jobs
@@ -178,7 +183,7 @@ This command lists all the training jobs in the connected SageMaker HyperPod clu
-* `namespace` (string) - Optional. The namespace to list the jobs in. If not provided, this command lists the jobs in the namespace specified during connecting to the cluster. If the namespace is provided and if the user has access to the namespace, this command lists the jobs from the specified namespace.
+* `namespace` (string) - Optional. The namespace to use. If not specified, this command first uses the namespace configured when connecting to the cluster. If no namespace was configured when connecting, a namespace managed by SageMaker is auto-discovered.
 * `all-namespaces` (flag) - Optional. If set, this command lists jobs from all namespaces the data scientist users have access to. The namespace in the current AWS account credentials will be ignored, even if specified with the `--namespace` option.
 * `selector` (string) - Optional. A label selector to filter the listed jobs. The selector supports the '=', '==', and '!=' operators (e.g., `-l key1=value1,key2=value2`).
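The namespace fallback described by the updated `namespace` entries can be sketched as a small resolution function. This is an illustrative sketch only, not the CLI's code: `resolve_namespace` and the `discover` callback are hypothetical names, and how the CLI actually discovers a SageMaker-managed namespace is not shown in this diff.

```python
from typing import Callable, Optional


def resolve_namespace(
    explicit: Optional[str],
    connected: Optional[str],
    discover: Callable[[], str],
) -> str:
    """Pick the namespace in the order the README describes.

    1. The namespace passed to the command, if any.
    2. The namespace configured when connecting to the cluster, if any.
    3. A SageMaker-managed namespace found by `discover` (hypothetical callback).
    """
    if explicit:
        return explicit
    if connected:
        return connected
    # Discovery runs only when neither earlier source supplied a namespace.
    return discover()


if __name__ == "__main__":
    # Stand-in discovery function; the returned name is made up for illustration.
    print(resolve_namespace(None, None, lambda: "hyperpod-ns-example"))
```

Passing discovery in as a callback keeps the lookup lazy, so a potentially slow cluster query happens only when both earlier sources are empty.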
 * `job-name` (string) - Required. The name of the job to cancel.
-* `namespace` (string) - Optional. The namespace to cancel the job in. If not provided, the CLI will try to cancel the job in the namespace set by the user while connecting to the cluster. If provided, and the user has access to the namespace, the CLI will cancel the job from the specified namespace.
+* `namespace` (string) - Optional. The namespace to use. If not specified, this command first uses the namespace configured when connecting to the cluster. If no namespace was configured when connecting, a namespace managed by SageMaker is auto-discovered.
 * `job-name` (string) - Required. The name of the job to list pods for.
-* `namespace` (string) - Optional. The namespace to list the pods in. If not provided, the CLI will list the pods in the namespace set by the user while connecting to the cluster. If provided, and the user has access to the namespace, the CLI will list the pods from the specified namespace.
+* `namespace` (string) - Optional. The namespace to use. If not specified, this command first uses the namespace configured when connecting to the cluster. If no namespace was configured when connecting, a namespace managed by SageMaker is auto-discovered.
 * `job-name` (string) - Required. The name of the job to get the log for.
 * `pod` (string) - Required. The name of the pod to get the log from.
-* `namespace` (string) - Optional. The namespace to get the log from. If not provided, the CLI will get the log from the pod in the namespace set by the user while connecting to the cluster. If provided, and the user has access to the namespace, the CLI will get the log from the pod in the specified namespace.
+* `namespace` (string) - Optional. The namespace to use. If not specified, this command first uses the namespace configured when connecting to the cluster. If no namespace was configured when connecting, a namespace managed by SageMaker is auto-discovered.
 * `job-name` (string) - Required. The name of the job to execute the command within the container of a pod associated with a training job.
 * `bash-command` (string) - Required. The bash command(s) to run.
-* `namespace` (string) - Optional. The namespace to execute the command in. If not provided, the CLI will try to execute the command in the pod in the namespace set by the user while connecting to the cluster. If provided, and the user has access to the namespace, the CLI will execute the command in the pod from the specified namespace.
+* `namespace` (string) - Optional. The namespace to use. If not specified, this command first uses the namespace configured when connecting to the cluster. If no namespace was configured when connecting, a namespace managed by SageMaker is auto-discovered.
 * `pod` (string) - Optional. The name of the pod to execute the command in. You must provide either `--pod` or `--all-pods`.
-* `all-pods` (flag) - Optional. If set, the command will be executed in all pods associated with the job.
+* `all-pods` (flag) - Optional. If set, the command will be executed in all pods associated with the job.
+
+### Patch Jobs
+
+This command patches a job with a specified operation. Currently, only `suspend` and `unsuspend` are supported.
+* `job-name` (string) - Required. The name of the job to be patched.
+* `namespace` (string) - Optional. The namespace to use. If not specified, this command first uses the namespace configured when connecting to the cluster. If no namespace was configured when connecting, a namespace managed by SageMaker is auto-discovered.
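To make the constraints on the new patch command concrete, here is a minimal sketch that validates a patch request before it would be sent. `PatchRequest` and `SUPPORTED_PATCH_OPERATIONS` are hypothetical names introduced for illustration; only the two operation names and the required and optional fields come from the README text in this diff.

```python
from dataclasses import dataclass
from typing import Optional

# Operations the README lists as currently supported.
SUPPORTED_PATCH_OPERATIONS = ("suspend", "unsuspend")


@dataclass(frozen=True)
class PatchRequest:
    """Hypothetical container for the inputs of a patch command."""

    operation: str
    job_name: str                    # required per the README
    namespace: Optional[str] = None  # optional; the CLI resolves it when omitted

    def __post_init__(self) -> None:
        if self.operation not in SUPPORTED_PATCH_OPERATIONS:
            allowed = ", ".join(SUPPORTED_PATCH_OPERATIONS)
            raise ValueError(f"unsupported patch operation {self.operation!r}; expected one of: {allowed}")
        if not self.job_name:
            raise ValueError("job-name is required")


if __name__ == "__main__":
    # The job name below is a made-up example.
    print(PatchRequest(operation="suspend", job_name="example-job"))
```

Validating in `__post_init__` means an invalid request cannot be constructed at all, which keeps later code free of repeated checks.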