Commit f940caf

Merge remote-tracking branch 'upstream/main'
Author: Mohamed Zeidan
Parents: 835e21f + 88bfd93

18 files changed: +2001 -54 lines changed


.github/pull_request_template.md

Lines changed: 14 additions & 19 deletions
@@ -14,22 +14,17 @@
 <!-- Describe your testing approach -->
 
 
-## Unit test coverage
-<!-- Check unit test coverage for your changes -->
-- [ ] All new/modified code has unit tests
-- [ ] Coverage verified for changed code
-- [ ] N/A - no testable code changes
-
-## Do we need integration tests?
-<!-- Consider if integration tests are needed -->
-- [ ] Yes - integration tests added
-- [ ] No - unit tests sufficient
-- [ ] No - infrastructure/config change only
-- [ ] Unsure - please advise
-
----
-
-## Checklist
-- [ ] PR title clearly describes the change
-- [ ] No sensitive information exposed and security is maintained
-- [ ] Ready for review
+## Are unit tests added?
+
+
+## Are integration tests added?
+
+
+## Reviewer Guidelines
+
+‼️ **Merge Requirements**: PRs with failing integration tests cannot be merged without justification.
+
+One of the following must be true:
+- [ ] All automated PR checks pass
+- [ ] Failed tests include local run results/screenshots proving they work
+- [ ] Changes are documentation-only

.gitignore

Lines changed: 4 additions & 0 deletions
@@ -22,7 +22,11 @@ doc/_build/
 /sagemaker-hyperpod/build
 /sagemaker-hyperpod/.coverage
 /sagemaker-hyperpod/.coverage.*
+
 /hyperpod-cluster-stack-template/build
+/hyperpod-pytorch-job-template/build
+/hyperpod-custom-inference-template/build
+/hyperpod-jumpstart-inference-template/build
 
 # Ignore all contents of result and results directories
 /result/

README.md

Lines changed: 1 addition & 1 deletion
@@ -106,7 +106,7 @@ hyp list-cluster
 |--------|------|-------------|
 | `--region <region>` | Optional | The region that the SageMaker HyperPod and EKS clusters are located. If not specified, it will be set to the region from the current AWS account credentials. |
 | `--namespace <namespace>` | Optional | The namespace that users want to check the quota with. Only the SageMaker managed namespaces are supported. |
-| `--output <json|table>` | Optional | The output format. Available values are `table` and `json`. The default value is `json`. |
+| `--output <json\|table>` | Optional | The output format. Available values are `table` and `json`. The default value is `json`. |
 | `--debug` | Optional | Enable debug mode for detailed logging. |
 
 ### Connecting to a Cluster
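Note on the `--output` row change above: in GitHub Flavored Markdown, an unescaped `|` inside a table cell is treated as a column separator, even inside a code span, which is why the row needed `\|`. The following is a rough sketch of that effect, not a real Markdown parser; `split_cells` is a hypothetical helper introduced only for illustration.

```python
import re


def split_cells(row: str) -> list[str]:
    """Split one Markdown table row into cells on unescaped pipes.

    Simplified illustration only: real GFM parsing has more rules.
    """
    inner = row.strip()
    # Drop the optional leading and trailing pipe delimiters.
    if inner.startswith("|"):
        inner = inner[1:]
    if inner.endswith("|") and not inner.endswith("\\|"):
        inner = inner[:-1]
    # Split only on pipes that are not preceded by a backslash,
    # then turn the escaped pipes back into literal ones.
    cells = re.split(r"(?<!\\)\|", inner)
    return [cell.strip().replace("\\|", "|") for cell in cells]


before = r"| `--output <json|table>` | Optional | The output format. |"
after = r"| `--output <json\|table>` | Optional | The output format. |"

print(len(split_cells(before)))  # one cell too many for a 3-column table
print(len(split_cells(after)))   # matches the 3-column header
```

With the unescaped row, the option name is split across two cells and the remaining columns shift right, which is the rendering problem the one-line change addresses.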

helm_chart/HyperPodHelmChart/charts/health-monitoring-agent/templates/_helpers.tpl

Lines changed: 1 addition & 1 deletion
@@ -55,7 +55,7 @@ Generate the health monitoring agent image URI based on AWS region
 */}}
 {{- define "health-monitoring-agent.imageUri" -}}
 {{- $region := "" -}}
-{{- $imageTag := .Values.imageTag | default "1.0.790.0_1.0.266.0" -}}
+{{- $imageTag := .Values.imageTag | default "1.0.819.0_1.0.267.0" -}}
 
 {{/* Debug: Show image tag selection if debug is enabled */}}
 {{- if .Values.debug -}}

helm_chart/HyperPodHelmChart/charts/health-monitoring-agent/templates/health-monitoring-agent.yaml

Lines changed: 90 additions & 7 deletions
@@ -85,12 +85,6 @@ spec:
                       - ml.g5.16xlarge
                       - ml.g5.24xlarge
                       - ml.g5.48xlarge
-                      - ml.inf2.xlarge
-                      - ml.inf2.8xlarge
-                      - ml.inf2.24xlarge
-                      - ml.inf2.48xlarge
-                      - ml.trn1.32xlarge
-                      - ml.trn1n.32xlarge
                       - ml.g6.xlarge
                       - ml.g6.2xlarge
                       - ml.g6.4xlarge
@@ -109,7 +103,6 @@ spec:
                       - ml.g6e.12xlarge
                       - ml.g6e.24xlarge
                       - ml.g6e.48xlarge
-                      - ml.trn2.48xlarge
                       - ml.p6-b200.48xlarge
                       - ml.p6e-gb200.36xlarge
       containers:
@@ -166,3 +159,93 @@ spec:
           operator: Exists
         - effect: NoExecute
           operator: Exists
+---
+apiVersion: apps/v1
+kind: DaemonSet
+metadata:
+  name: health-monitoring-agent-non-nvidia
+  namespace: {{ .Values.namespace }}
+  labels:
+    app: health-monitoring-agent-non-nvidia
+spec:
+  selector:
+    matchLabels:
+      app: health-monitoring-agent-non-nvidia
+  template:
+    metadata:
+      labels:
+        app: health-monitoring-agent-non-nvidia
+    spec:
+      affinity:
+        nodeAffinity:
+          requiredDuringSchedulingIgnoredDuringExecution:
+            nodeSelectorTerms:
+              - matchExpressions:
+                  - key: node.kubernetes.io/instance-type
+                    operator: In
+                    values:
+                      - ml.inf2.xlarge
+                      - ml.inf2.8xlarge
+                      - ml.inf2.24xlarge
+                      - ml.inf2.48xlarge
+                      - ml.trn1.32xlarge
+                      - ml.trn1n.32xlarge
+                      - ml.trn2.48xlarge
+      containers:
+        - name: health-monitoring-agent-non-nvidia
+          args:
+            - --enable-k8s-exporter=false
+            - --config.system-log-monitor=/config/system-message-monitor.json
+          image: {{ include "health-monitoring-agent.imageUri" . }}
+          resources:
+            limits:
+              cpu: 500m
+              memory: 512Mi
+            requests:
+              cpu: 500m
+              memory: 512Mi
+          imagePullPolicy: IfNotPresent
+          securityContext:
+            runAsUser: 1000
+            runAsGroup: 2000
+          env:
+            - name: NODE_NAME
+              valueFrom:
+                fieldRef:
+                  fieldPath: spec.nodeName
+            - name: NODE_IP
+              valueFrom:
+                fieldRef:
+                  fieldPath: status.hostIP
+            - name: NVIDIA_VISIBLE_DEVICES
+              value: "void"
+            - name: NVIDIA_DRIVER_CAPABILITIES
+              value: ""
+          volumeMounts:
+            - name: log
+              mountPath: /var/log
+            - name: kmsg
+              mountPath: /dev/kmsg
+              readOnly: true
+            # Make sure node problem detector is in the same timezone
+            # with the host.
+            - name: localtime
+              mountPath: /etc/localtime
+              readOnly: true
+      serviceAccountName: health-monitoring-agent
+      volumes:
+        - name: log
+          # Config `log` to your system log directory
+          hostPath:
+            path: /var/log/
+        - name: kmsg
+          hostPath:
+            path: /dev/kmsg
+        - name: localtime
+          hostPath:
+            path: /etc/localtime
+      tolerations:
+        - effect: NoSchedule
+          operator: Exists
+        - effect: NoExecute
+          operator: Exists
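Note on the DaemonSet split above: the Inferentia and Trainium instance types move out of the original DaemonSet's node affinity list and into the new `health-monitoring-agent-non-nvidia` DaemonSet. As a minimal sketch of how a `matchExpressions` entry with operator `In` selects nodes, assuming the node carries the usual `node.kubernetes.io/instance-type` label, the check can be modelled as below. The function names are hypothetical and this is not the scheduler's implementation.

```python
# Instance types listed for health-monitoring-agent-non-nvidia in this diff.
NON_NVIDIA_INSTANCE_TYPES = {
    "ml.inf2.xlarge",
    "ml.inf2.8xlarge",
    "ml.inf2.24xlarge",
    "ml.inf2.48xlarge",
    "ml.trn1.32xlarge",
    "ml.trn1n.32xlarge",
    "ml.trn2.48xlarge",
}

INSTANCE_TYPE_LABEL = "node.kubernetes.io/instance-type"


def matches_in_expression(node_labels: dict, key: str, values: set) -> bool:
    """Model a node affinity match expression with operator `In`.

    The expression matches only if the label exists on the node and its
    value is one of the listed values.
    """
    return node_labels.get(key) in values


def selected_by_non_nvidia_daemonset(node_labels: dict) -> bool:
    """Hypothetical helper: would the new DaemonSet schedule onto this node?"""
    return matches_in_expression(
        node_labels, INSTANCE_TYPE_LABEL, NON_NVIDIA_INSTANCE_TYPES
    )


print(selected_by_non_nvidia_daemonset({INSTANCE_TYPE_LABEL: "ml.trn1.32xlarge"}))
print(selected_by_non_nvidia_daemonset({INSTANCE_TYPE_LABEL: "ml.g5.48xlarge"}))
```

Because the seven types were removed from the first DaemonSet's list in the same commit, a node of one of these types should match only the new DaemonSet, assuming no other affinity terms apply.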

helm_chart/HyperPodHelmChart/charts/health-monitoring-agent/values.yaml

Lines changed: 1 addition & 1 deletion
@@ -25,7 +25,7 @@ imageTag: ""
 
 # Override the health monitoring agent image URI
 # If specified, this will override the automatic region-based URI selection
-# Example: "905418368575.dkr.ecr.us-west-2.amazonaws.com/hyperpod-health-monitoring-agent:1.0.790.0_1.0.266.0"
+# Example: "905418368575.dkr.ecr.us-west-2.amazonaws.com/hyperpod-health-monitoring-agent:1.0.819.0_1.0.267.0"
 hmaimage: ""
 
 # Enable debug output for region selection process

helm_chart/readme.md

Lines changed: 13 additions & 13 deletions
@@ -234,19 +234,19 @@ helm upgrade dependencies helm_chart/HyperPodHelmChart --namespace kube-system
 
 - **Supported Regions and their ECR URIs**:
 ```
-us-east-1 (US East (N. Virginia)): 767398015722.dkr.ecr.us-east-1.amazonaws.com/hyperpod-health-monitoring-agent:1.0.790.0_1.0.266.0
-us-west-2 (US West (Oregon)): 905418368575.dkr.ecr.us-west-2.amazonaws.com/hyperpod-health-monitoring-agent:1.0.790.0_1.0.266.0
-us-east-2 (US East (Ohio)): 851725546812.dkr.ecr.us-east-2.amazonaws.com/hyperpod-health-monitoring-agent:1.0.790.0_1.0.266.0
-us-west-1 (US West (N. California)): 011528288828.dkr.ecr.us-west-1.amazonaws.com/hyperpod-health-monitoring-agent:1.0.790.0_1.0.266.0
-eu-central-1 (Europe (Frankfurt)): 211125453373.dkr.ecr.eu-central-1.amazonaws.com/hyperpod-health-monitoring-agent:1.0.790.0_1.0.266.0
-eu-north-1 (Europe (Stockholm)): 654654141839.dkr.ecr.eu-north-1.amazonaws.com/hyperpod-health-monitoring-agent:1.0.790.0_1.0.266.0
-eu-west-1 (Europe (Ireland)): 533267293120.dkr.ecr.eu-west-1.amazonaws.com/hyperpod-health-monitoring-agent:1.0.790.0_1.0.266.0
-eu-west-2 (Europe (London)): 011528288831.dkr.ecr.eu-west-2.amazonaws.com/hyperpod-health-monitoring-agent:1.0.790.0_1.0.266.0
-ap-northeast-1 (Asia Pacific (Tokyo)): 533267052152.dkr.ecr.ap-northeast-1.amazonaws.com/hyperpod-health-monitoring-agent:1.0.790.0_1.0.266.0
-ap-south-1 (Asia Pacific (Mumbai)): 011528288864.dkr.ecr.ap-south-1.amazonaws.com/hyperpod-health-monitoring-agent:1.0.790.0_1.0.266.0
-ap-southeast-1 (Asia Pacific (Singapore)): 905418428165.dkr.ecr.ap-southeast-1.amazonaws.com/hyperpod-health-monitoring-agent:1.0.790.0_1.0.266.0
-ap-southeast-2 (Asia Pacific (Sydney)): 851725636348.dkr.ecr.ap-southeast-2.amazonaws.com/hyperpod-health-monitoring-agent:1.0.790.0_1.0.266.0
-sa-east-1 (South America (São Paulo)): 025066253954.dkr.ecr.sa-east-1.amazonaws.com/hyperpod-health-monitoring-agent:1.0.790.0_1.0.266.0
+us-east-1 (US East (N. Virginia)): 767398015722.dkr.ecr.us-east-1.amazonaws.com/hyperpod-health-monitoring-agent:1.0.819.0_1.0.267.0
+us-west-2 (US West (Oregon)): 905418368575.dkr.ecr.us-west-2.amazonaws.com/hyperpod-health-monitoring-agent:1.0.819.0_1.0.267.0
+us-east-2 (US East (Ohio)): 851725546812.dkr.ecr.us-east-2.amazonaws.com/hyperpod-health-monitoring-agent:1.0.819.0_1.0.267.0
+us-west-1 (US West (N. California)): 011528288828.dkr.ecr.us-west-1.amazonaws.com/hyperpod-health-monitoring-agent:1.0.819.0_1.0.267.0
+eu-central-1 (Europe (Frankfurt)): 211125453373.dkr.ecr.eu-central-1.amazonaws.com/hyperpod-health-monitoring-agent:1.0.819.0_1.0.267.0
+eu-north-1 (Europe (Stockholm)): 654654141839.dkr.ecr.eu-north-1.amazonaws.com/hyperpod-health-monitoring-agent:1.0.819.0_1.0.267.0
+eu-west-1 (Europe (Ireland)): 533267293120.dkr.ecr.eu-west-1.amazonaws.com/hyperpod-health-monitoring-agent:1.0.819.0_1.0.267.0
+eu-west-2 (Europe (London)): 011528288831.dkr.ecr.eu-west-2.amazonaws.com/hyperpod-health-monitoring-agent:1.0.819.0_1.0.267.0
+ap-northeast-1 (Asia Pacific (Tokyo)): 533267052152.dkr.ecr.ap-northeast-1.amazonaws.com/hyperpod-health-monitoring-agent:1.0.819.0_1.0.267.0
+ap-south-1 (Asia Pacific (Mumbai)): 011528288864.dkr.ecr.ap-south-1.amazonaws.com/hyperpod-health-monitoring-agent:1.0.819.0_1.0.267.0
+ap-southeast-1 (Asia Pacific (Singapore)): 905418428165.dkr.ecr.ap-southeast-1.amazonaws.com/hyperpod-health-monitoring-agent:1.0.819.0_1.0.267.0
+ap-southeast-2 (Asia Pacific (Sydney)): 851725636348.dkr.ecr.ap-southeast-2.amazonaws.com/hyperpod-health-monitoring-agent:1.0.819.0_1.0.267.0
+sa-east-1 (South America (São Paulo)): 025066253954.dkr.ecr.sa-east-1.amazonaws.com/hyperpod-health-monitoring-agent:1.0.819.0_1.0.267.0
 ```
 
 ## 7. Troubleshooting
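Note on the image tag bump above: the region list, the `imageTag` default in `_helpers.tpl`, and the `hmaimage` override in `values.yaml` together describe how the agent image URI is chosen. The body of the Helm helper is not shown in this diff, so the following is only a sketch of the behaviour those comments describe. The function name `image_uri`, its parameters, and the choice to raise on an unlisted region are assumptions made for illustration.

```python
DEFAULT_IMAGE_TAG = "1.0.819.0_1.0.267.0"
REPOSITORY = "hyperpod-health-monitoring-agent"

# Region to ECR account ID, copied from the readme list in this commit.
ACCOUNT_BY_REGION = {
    "us-east-1": "767398015722",
    "us-west-2": "905418368575",
    "us-east-2": "851725546812",
    "us-west-1": "011528288828",
    "eu-central-1": "211125453373",
    "eu-north-1": "654654141839",
    "eu-west-1": "533267293120",
    "eu-west-2": "011528288831",
    "ap-northeast-1": "533267052152",
    "ap-south-1": "011528288864",
    "ap-southeast-1": "905418428165",
    "ap-southeast-2": "851725636348",
    "sa-east-1": "025066253954",
}


def image_uri(region: str, image_tag: str = "", hmaimage: str = "") -> str:
    """Sketch of region-based image URI selection (hypothetical helper).

    - A non-empty `hmaimage` overrides everything, as values.yaml describes.
    - An empty `image_tag` falls back to the default tag, mirroring the
      `default` filter in the Helm helper.
    - Unlisted regions raise KeyError here; how the real template handles
      them is not visible in this diff.
    """
    if hmaimage:
        return hmaimage
    account = ACCOUNT_BY_REGION[region]
    tag = image_tag or DEFAULT_IMAGE_TAG
    return f"{account}.dkr.ecr.{region}.amazonaws.com/{REPOSITORY}:{tag}"


print(image_uri("us-west-2"))
```

For `us-west-2` with no overrides, this yields the same URI as the example comment in `values.yaml`.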

setup.cfg

Lines changed: 3 additions & 3 deletions
@@ -50,7 +50,7 @@ xfail_strict = true
 addopts =
     --verbose
     --ignore=build/private
-    --cov hyperpod_cli
+    --cov sagemaker.hyperpod
     --cov-config setup.cfg
     --cov-report term-missing
     --cov-report html:build/hyperpod-documentation/coverage
@@ -59,8 +59,8 @@ addopts =
     --durations=5
     # Default to colorful output
     --color=yes
-    # Uncomment to enforce a minimum code coverage threshold.
-    # --cov-fail-under 50
+    # Enforce a minimum code coverage threshold
+    --cov-fail-under 50
 testpaths = test
 looponfailroots = src test
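Note on the `setup.cfg` change above: with `--cov-fail-under 50` now active, a test run fails when total coverage of `sagemaker.hyperpod` is below 50%. pytest-cov performs this check itself; the sketch below only illustrates the arithmetic, and both function names are hypothetical.

```python
def coverage_percent(covered_statements: int, total_statements: int) -> float:
    """Total coverage as a percentage.

    Assumption: an empty code base counts as fully covered.
    """
    if total_statements == 0:
        return 100.0
    return 100.0 * covered_statements / total_statements


def passes_threshold(
    covered_statements: int, total_statements: int, fail_under: float = 50.0
) -> bool:
    """Return True when coverage meets or exceeds the configured minimum."""
    return coverage_percent(covered_statements, total_statements) >= fail_under


print(passes_threshold(1200, 2001))  # above the 50% threshold
print(passes_threshold(900, 2001))   # below the 50% threshold
```

The statement counts in the usage lines are made-up inputs, not measurements from this repository.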
