Commit e716b54

Merge branch 'aws:main' into main-feature-get-op-logs
2 parents: 9c51633 + 73a41b3

File tree

21 files changed: +559 −174 lines

README.md

Lines changed: 3 additions & 14 deletions

````diff
@@ -54,24 +54,13 @@ SageMaker HyperPod CLI currently supports start training job with:
 1. Make sure that your local python version is 3.8, 3.9, 3.10 or 3.11.
 
-1. Install ```helm```.
-
-   The SageMaker Hyperpod CLI uses Helm to start training jobs. See also the [Helm installation guide](https://helm.sh/docs/intro/install/).
-
-   ```
-   curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
-   chmod 700 get_helm.sh
-   ./get_helm.sh
-   rm -f ./get_helm.sh
-   ```
-
-1. Clone and install the sagemaker-hyperpod-cli package.
+2. Install the sagemaker-hyperpod-cli package.
 
    ```
    pip install sagemaker-hyperpod
    ```
 
-1. Verify if the installation succeeded by running the following command.
+3. Verify if the installation succeeded by running the following command.
 
    ```
    hyp --help
@@ -207,7 +196,7 @@ hyp invoke hyp-jumpstart-endpoint \
 ```
 hyp list hyp-jumpstart-endpoint
-hyp get hyp-jumpstart-endpoint --name endpoint-jumpstart
+hyp describe hyp-jumpstart-endpoint --name endpoint-jumpstart
 ```
 
 #### Creating a Custom Inference Endpoint
````

doc/cli_training.md

Lines changed: 1 addition & 1 deletion

```diff
@@ -40,7 +40,7 @@ hyp create hyp-pytorch-job [OPTIONS]
 - `--tasks-per-node INTEGER`: Number of tasks per node (minimum: 1)
 - `--label-selector OBJECT`: Node label selector as key-value pairs
 - `--deep-health-check-passed-nodes-only BOOLEAN`: Schedule pods only on nodes that passed deep health check (default: false)
-- `--scheduler-type TEXT`: Scheduler type
+- `--scheduler-type TEXT`: If specified, the training job pod is dispatched by the specified scheduler; otherwise it is dispatched by the default scheduler.
 - `--queue-name TEXT`: Queue name for job scheduling (1-63 characters, alphanumeric with hyphens)
 - `--priority TEXT`: Priority class for job scheduling
 - `--max-retry INTEGER`: Maximum number of job retries (minimum: 0)
```
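
The `--scheduler-type` fallback described above mirrors how Kubernetes treats an unset `schedulerName`: the pod falls through to the default scheduler. A minimal sketch of that semantics, assuming hypothetical names (`scheduler_for`, `DEFAULT_SCHEDULER` are illustrative and not part of the CLI):

```python
# Illustrative sketch of the --scheduler-type fallback semantics; the
# function and constant names are hypothetical, not sagemaker-hyperpod API.
from typing import Optional

DEFAULT_SCHEDULER = "default-scheduler"

def scheduler_for(scheduler_type: Optional[str]) -> str:
    """Return the scheduler that dispatches the training job pod."""
    # A provided --scheduler-type wins; anything falsy (None or "")
    # falls back to the default scheduler.
    return scheduler_type if scheduler_type else DEFAULT_SCHEDULER

print(scheduler_for("kueue-scheduler"))  # kueue-scheduler
print(scheduler_for(None))               # default-scheduler
```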

helm_chart/HyperPodHelmChart/Chart.yaml

Lines changed: 4 additions & 0 deletions

```diff
@@ -24,6 +24,10 @@ version: 0.1.0
 appVersion: "1.16.0"
 
 dependencies:
+  - name: cert-manager
+    version: "v1.18.2"
+    repository: oci://quay.io/jetstack/charts
+    condition: cert-manager.enabled
   - name: training-operators
     version: "0.1.0"
     repository: "file://charts/training-operators"
```

helm_chart/HyperPodHelmChart/values.yaml

Lines changed: 9 additions & 0 deletions

```diff
@@ -115,6 +115,15 @@ namespace:
   create: true
   name: aws-hyperpod
 
+cert-manager:
+  enabled: true
+  namespace: cert-manager
+  global:
+    leaderElection:
+      namespace: cert-manager
+  crds:
+    enabled: true
+
 mlflow:
   enabled: false
```

helm_chart/readme.md

Lines changed: 15 additions & 0 deletions

````diff
@@ -33,6 +33,7 @@ More information about orchestration features for cluster admins [here](https://
 | [Kubeflow Training Operator](https://www.kubeflow.org/docs/components/trainer/legacy-v1/overview/) | Installs operators for managing various machine learning training jobs, such as TensorFlow, PyTorch, and MXNet, providing native Kubernetes support for distributed training workloads. | | Yes |
 | HyperPod patching | Deploys the RBAC and controller resources needed for orchestrating rolling updates and patching workflows in SageMaker HyperPod clusters. Includes pod eviction and node monitoring. | HyperPod Resiliency | Yes |
 | hyperpod-inference-operator | Installs the HyperPod Inference Operator and its dependencies to the cluster, allowing cluster deployment and inferencing of JumpStart, s3-hosted, and FSx-hosted models | No |
+| [cert-manager](https://github.com/cert-manager/cert-manager) | Automatically provisions and manages TLS certificates in Kubernetes clusters. Provides certificate lifecycle management including issuance, renewal, and revocation for secure communications. | [Hyperpod training operator](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-operator.html) | Yes |
 
 > **_Note_** The `mpijob` scheme is disabled in the Training Operator helm chart to avoid conflicting with the MPI Operator.
@@ -48,6 +49,20 @@ storage:
   enabled: true
 ```
 
+To enable cert-manager for TLS certificate management, pass `--set cert-manager.enabled=true` when installing or upgrading the main chart, or set the following in the values.yaml file:
+```
+cert-manager:
+  enabled: true
+  namespace: cert-manager
+  global:
+    leaderElection:
+      namespace: cert-manager
+  crds:
+    enabled: true
+```
+`namespace` specifies the namespace into which cert-manager is installed.
+
 ---
 
 The following plugins are only required for HyperPod Resiliency if you are using the following supported devices, such as GPU/Neuron instances, unless you install these plugins on your own.
````

hyperpod-custom-inference-template/hyperpod_custom_inference_template/v1_0/model.py

Lines changed: 28 additions & 8 deletions

```diff
@@ -10,7 +10,7 @@
 # distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF
 # ANY KIND, either express or implied. See the License for the specific
 # language governing permissions and limitations under the License.
-from pydantic import BaseModel, Field
+from pydantic import BaseModel, Field, model_validator, ConfigDict
 from typing import Optional, List, Dict, Union, Literal
 
 from sagemaker.hyperpod.inference.config.hp_endpoint_config import (
@@ -31,9 +31,19 @@
 from sagemaker.hyperpod.inference.hp_endpoint import HPEndpoint
 
 class FlatHPEndpoint(BaseModel):
+    model_config = ConfigDict(extra="forbid")
+
+    metadata_name: Optional[str] = Field(
+        None,
+        alias="metadata_name",
+        description="Name of the jumpstart endpoint object",
+        max_length=63,
+        pattern=r"^[a-zA-Z0-9](-*[a-zA-Z0-9]){0,62}$",
+    )
+
     # endpoint_name
     endpoint_name: Optional[str] = Field(
-        "",
+        None,
         alias="endpoint_name",
         description="Name of SageMaker endpoint; empty string means no creation",
         max_length=63,
@@ -130,7 +140,7 @@ class FlatHPEndpoint(BaseModel):
         description="FSX File System DNS Name",
     )
     fsx_file_system_id: Optional[str] = Field(
-        ...,
+        None,
         alias="fsx_file_system_id",
         description="FSX File System ID",
     )
@@ -142,12 +152,12 @@ class FlatHPEndpoint(BaseModel):
 
     # S3Storage
     s3_bucket_name: Optional[str] = Field(
-        ...,
+        None,
         alias="s3_bucket_name",
         description="S3 bucket location",
     )
     s3_region: Optional[str] = Field(
-        ...,
+        None,
         alias="s3_region",
         description="S3 bucket region",
     )
@@ -229,12 +239,22 @@ class FlatHPEndpoint(BaseModel):
     invocation_endpoint: Optional[str] = Field(
         default="invocations",
         description=(
-            "The invocation endpoint of the model server. "
-            "http://<host>:<port>/ would be pre-populated based on the other fields. "
+            "The invocation endpoint of the model server. http://<host>:<port>/ would be pre-populated based on the other fields. "
             "Please fill in the path after http://<host>:<port>/ specific to your model server.",
         )
     )
-
+
+    @model_validator(mode='after')
+    def validate_model_source_config(self):
+        """Validate that required fields are provided based on model_source_type"""
+        if self.model_source_type == "s3":
+            if not self.s3_bucket_name or not self.s3_region:
+                raise ValueError("s3_bucket_name and s3_region are required when model_source_type is 's3'")
+        elif self.model_source_type == "fsx":
+            if not self.fsx_file_system_id:
+                raise ValueError("fsx_file_system_id is required when model_source_type is 'fsx'")
+        return self
 
     def to_domain(self) -> HPEndpoint:
         env_vars = None
         if self.env:
```
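
The new `model_validator` makes the storage fields optional at the type level while enforcing them conditionally: which fields are required depends on `model_source_type`. The same cross-field rule can be sketched without pydantic using a stdlib dataclass (`EndpointSketch` below is a hypothetical stripped-down stand-in, not the real `FlatHPEndpoint`):

```python
# Stdlib sketch of the cross-field validation the new pydantic
# model_validator enforces. EndpointSketch is a hypothetical stand-in,
# not the real FlatHPEndpoint model.
from dataclasses import dataclass
from typing import Optional

@dataclass
class EndpointSketch:
    model_source_type: str  # "s3" or "fsx"
    s3_bucket_name: Optional[str] = None
    s3_region: Optional[str] = None
    fsx_file_system_id: Optional[str] = None

    def __post_init__(self):
        # Mirrors validate_model_source_config in the diff above: the
        # required fields depend on the chosen model source type.
        if self.model_source_type == "s3":
            if not self.s3_bucket_name or not self.s3_region:
                raise ValueError(
                    "s3_bucket_name and s3_region are required when "
                    "model_source_type is 's3'"
                )
        elif self.model_source_type == "fsx":
            if not self.fsx_file_system_id:
                raise ValueError(
                    "fsx_file_system_id is required when "
                    "model_source_type is 'fsx'"
                )

# An s3 source with both fields passes; omitting s3_region would raise.
ok = EndpointSketch("s3", s3_bucket_name="my-bucket", s3_region="us-west-2")
```

This is why the diff can safely relax `...` (required) to `None` on `s3_bucket_name`, `s3_region`, and `fsx_file_system_id`: the validator re-imposes the requirement only when the corresponding source type is selected.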
