Commit f03ecf5

Author: Mohamed Zeidan
Merge remote-tracking branch 'upstream/main'
2 parents: 512b0b3 + 935a4d9

63 files changed: +5876 −793 lines


.github/workflows/codebuild-ci.yml

Lines changed: 1 addition & 2 deletions
@@ -2,8 +2,7 @@ name: PR Checks
 on:
   pull_request_target:
     branches:
-      - "master*"
-      - "main*"
+      - "*"
 
 concurrency:
   group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.head_ref }}

CHANGELOG.md

Lines changed: 6 additions & 0 deletions
@@ -1,5 +1,11 @@
 # Changelog
 
+## v3.1.0 (2025-08-13)
+
+### Features
+
+ * Task Governance feature for training jobs.
+
 ## v3.0.2 (2025-07-31)
 
 ### Features

README.md

Lines changed: 8 additions & 22 deletions
@@ -54,24 +54,13 @@ SageMaker HyperPod CLI currently supports start training job with:
 
 1. Make sure that your local python version is 3.8, 3.9, 3.10 or 3.11.
 
-1. Install ```helm```.
-
-   The SageMaker Hyperpod CLI uses Helm to start training jobs. See also the [Helm installation guide](https://helm.sh/docs/intro/install/).
-
-   ```
-   curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
-   chmod 700 get_helm.sh
-   ./get_helm.sh
-   rm -f ./get_helm.sh
-   ```
-
-1. Clone and install the sagemaker-hyperpod-cli package.
+2. Install the sagemaker-hyperpod-cli package.
 
    ```
   pip install sagemaker-hyperpod
   ```
 
-1. Verify if the installation succeeded by running the following command.
+3. Verify if the installation succeeded by running the following command.
 
   ```
   hyp --help

@@ -171,7 +160,7 @@ hyp create hyp-pytorch-job \
   --priority "high" \
   --max-retry 3 \
   --volume name=model-data,type=hostPath,mount_path=/data,path=/data \
-  --volume name=training-output,type=pvc,mount_path=/data,claim_name=my-pvc,read_only=false
+  --volume name=training-output,type=pvc,mount_path=/data2,claim_name=my-pvc,read_only=false
 ```
 
 Key required parameters explained:

@@ -192,7 +181,6 @@ hyp create hyp-jumpstart-endpoint \
   --model-id jumpstart-model-id\
   --instance-type ml.g5.8xlarge \
   --endpoint-name endpoint-jumpstart \
-  --tls-output-s3-uri s3://sample-bucket
 ```
 

@@ -208,7 +196,7 @@ hyp invoke hyp-jumpstart-endpoint \
 
 ```
 hyp list hyp-jumpstart-endpoint
-hyp get hyp-jumpstart-endpoint --name endpoint-jumpstart
+hyp describe hyp-jumpstart-endpoint --name endpoint-jumpstart
 ```
 
 #### Creating a Custom Inference Endpoint

@@ -219,7 +207,8 @@ hyp create hyp-custom-endpoint \
   --endpoint-name my-custom-endpoint \
   --model-name my-pytorch-model \
   --model-source-type s3 \
-  --model-location my-pytorch-training/model.tar.gz \
+  --model-location my-pytorch-training \
+  --model-volume-mount-name test-volume \
   --s3-bucket-name your-bucket \
   --s3-region us-east-1 \
   --instance-type ml.g5.8xlarge \

@@ -333,20 +322,17 @@ from sagemaker.hyperpod.inference.config.hp_jumpstart_endpoint_config import Mod
 from sagemaker.hyperpod.inference.hp_jumpstart_endpoint import HPJumpStartEndpoint
 
 model=Model(
-    model_id='deepseek-llm-r1-distill-qwen-1-5b',
-    model_version='2.0.4',
+    model_id='deepseek-llm-r1-distill-qwen-1-5b'
 )
 server=Server(
     instance_type='ml.g5.8xlarge',
 )
 endpoint_name=SageMakerEndpoint(name='<my-endpoint-name>')
-tls_config=TlsConfig(tls_certificate_output_s3_uri='s3://<my-tls-bucket>')
 
 js_endpoint=HPJumpStartEndpoint(
     model=model,
     server=server,
-    sage_maker_endpoint=endpoint_name,
-    tls_config=tls_config,
+    sage_maker_endpoint=endpoint_name
 )
 
 js_endpoint.create()

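The changed `--volume` flag above packs several fields into one comma-separated key=value string. As a rough illustration of that format (a hypothetical helper, not the CLI's actual parser), such a spec could be split like this:

```python
def parse_volume_spec(spec: str) -> dict:
    """Split a comma-separated key=value volume flag into a dict."""
    volume = {}
    for field in spec.split(","):
        key, _, value = field.partition("=")
        volume[key.strip()] = value.strip()
    return volume

# The PVC volume from the updated README example:
pvc = parse_volume_spec(
    "name=training-output,type=pvc,mount_path=/data2,claim_name=my-pvc,read_only=false"
)
print(pvc["type"], pvc["mount_path"])  # pvc /data2
```

Note that values like `read_only=false` stay strings here; a real parser would also coerce booleans.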
doc/cli_training.md

Lines changed: 1 addition & 1 deletion
@@ -40,7 +40,7 @@ hyp create hyp-pytorch-job [OPTIONS]
 - `--tasks-per-node INTEGER`: Number of tasks per node (minimum: 1)
 - `--label-selector OBJECT`: Node label selector as key-value pairs
 - `--deep-health-check-passed-nodes-only BOOLEAN`: Schedule pods only on nodes that passed deep health check (default: false)
-- `--scheduler-type TEXT`: Scheduler type
+- `--scheduler-type TEXT`: If specified, the training job pod will be dispatched by the specified scheduler. If not specified, the pod will be dispatched by the default scheduler.
 - `--queue-name TEXT`: Queue name for job scheduling (1-63 characters, alphanumeric with hyphens)
 - `--priority TEXT`: Priority class for job scheduling
 - `--max-retry INTEGER`: Maximum number of job retries (minimum: 0)
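The reworded `--scheduler-type` description matches standard Kubernetes behavior: a pod with an explicit `schedulerName` is picked up by that scheduler, and a pod without one goes to the default scheduler. A minimal sketch of that mapping, assuming a hypothetical helper (not part of the CLI):

```python
from typing import Optional

def pod_spec_with_scheduler(scheduler_type: Optional[str]) -> dict:
    """Build a minimal pod spec; set schedulerName only when a scheduler is chosen."""
    spec = {"containers": [{"name": "trainer", "image": "my-training-image"}]}
    if scheduler_type:
        # Kubernetes dispatches pods with an explicit schedulerName via that
        # scheduler; pods without the field fall back to the default scheduler.
        spec["schedulerName"] = scheduler_type
    return spec
```

The image name and helper are illustrative only; the real CLI renders the pod spec through its job templates.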

doc/inference.md

Lines changed: 3 additions & 7 deletions
@@ -37,8 +37,7 @@ from sagemaker.hyperpod.inference.config.hp_jumpstart_endpoint_config import Mod
 from sagemaker.hyperpod.inference.hp_jumpstart_endpoint import HPJumpStartEndpoint
 
 model = Model(
-    model_id="deepseek-llm-r1-distill-qwen-1-5b",
-    model_version="2.0.4"
+    model_id="deepseek-llm-r1-distill-qwen-1-5b"
 )
 
 server = Server(

@@ -47,13 +46,10 @@ server = Server(
 
 endpoint_name = SageMakerEndpoint(name="endpoint-jumpstart")
 
-tls_config = TlsConfig(tls_certificate_output_s3_uri="s3://sample-bucket")
-
 js_endpoint = HPJumpStartEndpoint(
     model=model,
     server=server,
-    sage_maker_endpoint=endpoint_name,
-    tls_config=tls_config
+    sage_maker_endpoint=endpoint_name
 )
 
 js_endpoint.create()

@@ -85,7 +81,7 @@ from sagemaker.hyperpod.inference.hp_endpoint import HPEndpoint
 
 model = Model(
     model_source_type="s3",
-    model_location="test-pytorch-job/model.tar.gz",
+    model_location="test-pytorch-job",
     s3_bucket_name="my-bucket",
     s3_region="us-east-2",
     prefetch_enabled=True
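The hunks in this file consistently trim the same two things from the JumpStart example: the pinned `model_version` and the `TlsConfig` argument. A stand-in sketch of the resulting shape, using plain dataclasses (hypothetical, for illustration only; the real classes live in `sagemaker.hyperpod.inference` and require a HyperPod cluster):

```python
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class Model:
    model_id: str
    model_version: Optional[str] = None  # no longer pinned to '2.0.4' in the examples

@dataclass
class HPJumpStartEndpoint:
    model: Model
    instance_type: str
    endpoint_name: str
    # no tls_config field: TLS settings are gone from the updated examples

spec = HPJumpStartEndpoint(
    model=Model(model_id="deepseek-llm-r1-distill-qwen-1-5b"),
    instance_type="ml.g5.8xlarge",
    endpoint_name="endpoint-jumpstart",
)
```

This only mirrors the fields the diff keeps; consult the SDK itself for the authoritative signatures.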

examples/inference/SDK/inference-jumpstart-e2e.ipynb

Lines changed: 2 additions & 5 deletions
@@ -107,21 +107,18 @@
    "source": [
     "# create configs\n",
     "model=Model(\n",
-    "    model_id='deepseek-llm-r1-distill-qwen-1-5b',\n",
-    "    model_version='2.0.4',\n",
+    "    model_id='deepseek-llm-r1-distill-qwen-1-5b'\n",
     ")\n",
     "server=Server(\n",
     "    instance_type='ml.g5.8xlarge',\n",
     ")\n",
     "endpoint_name=SageMakerEndpoint(name='<my-endpoint-name>')\n",
-    "tls_config=TlsConfig(tls_certificate_output_s3_uri='s3://<my-tls-bucket>')\n",
     "\n",
     "# create spec\n",
     "js_endpoint=HPJumpStartEndpoint(\n",
     "    model=model,\n",
     "    server=server,\n",
-    "    sage_maker_endpoint=endpoint_name,\n",
-    "    tls_config=tls_config,\n",
+    "    sage_maker_endpoint=endpoint_name\n",
     ")"
    ]
   },

helm_chart/HyperPodHelmChart/Chart.yaml

Lines changed: 4 additions & 0 deletions
@@ -24,6 +24,10 @@ version: 0.1.0
 appVersion: "1.16.0"
 
 dependencies:
+  - name: cert-manager
+    version: "v1.18.2"
+    repository: oci://quay.io/jetstack/charts
+    condition: cert-manager.enabled
   - name: training-operators
     version: "0.1.0"
     repository: "file://charts/training-operators"

helm_chart/HyperPodHelmChart/charts/health-monitoring-agent/templates/_helpers.tpl

Lines changed: 1 addition & 1 deletion
@@ -55,7 +55,7 @@ Generate the health monitoring agent image URI based on AWS region
 */}}
 {{- define "health-monitoring-agent.imageUri" -}}
 {{- $region := "" -}}
-{{- $imageTag := .Values.imageTag | default "1.0.674.0_1.0.199.0" -}}
+{{- $imageTag := .Values.imageTag | default "1.0.742.0_1.0.241.0" -}}
 
 {{/* Debug: Show image tag selection if debug is enabled */}}
 {{- if .Values.debug -}}
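Helm's `| default` pipeline falls back whenever the piped value is empty or unset, which is why the chart can ship `imageTag: ""` in values.yaml and still resolve the new tag. The same selection logic, re-expressed in Python for clarity (a hypothetical helper, not part of the chart):

```python
def resolve_image_tag(values: dict) -> str:
    """Mirror Helm's `.Values.imageTag | default "..."`: empty string and
    missing key are both falsy, so either falls back to the new default tag."""
    return values.get("imageTag") or "1.0.742.0_1.0.241.0"
```

An explicit non-empty `imageTag` in values.yaml still wins, matching the template's behavior.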

helm_chart/HyperPodHelmChart/charts/health-monitoring-agent/templates/health-monitoring-agent.yaml

Lines changed: 1 addition & 0 deletions
@@ -111,6 +111,7 @@ spec:
       - ml.g6e.48xlarge
       - ml.trn2.48xlarge
       - ml.p6-b200.48xlarge
+      - ml.p6e-gb200.36xlarge
       containers:
       - name: health-monitoring-agent
         args:

helm_chart/HyperPodHelmChart/charts/health-monitoring-agent/values.yaml

Lines changed: 1 addition & 1 deletion
@@ -25,7 +25,7 @@ imageTag: ""
 
 # Override the health monitoring agent image URI
 # If specified, this will override the automatic region-based URI selection
-# Example: "905418368575.dkr.ecr.us-west-2.amazonaws.com/hyperpod-health-monitoring-agent:1.0.674.0_1.0.199.0"
+# Example: "905418368575.dkr.ecr.us-west-2.amazonaws.com/hyperpod-health-monitoring-agent:1.0.742.0_1.0.241.0"
 hmaimage: ""
 
 # Enable debug output for region selection process
