Commit 30dac6a

Add CriticalAddonsOnly toleration for all dependencies
1 parent 84963ab commit 30dac6a

4 files changed: 11 additions and 2 deletions

helm_chart/HyperPodHelmChart/charts/health-monitoring-agent/templates/health-monitoring-agent.yaml

Lines changed: 2 additions & 0 deletions
@@ -164,3 +164,5 @@ spec:
     operator: Exists
   - effect: NoExecute
     operator: Exists
+  - key: CriticalAddonsOnly
+    operator: Exists

helm_chart/HyperPodHelmChart/charts/mpi-operator/values.yaml

Lines changed: 3 additions & 1 deletion
@@ -22,6 +22,8 @@ mpiOperator:
   ## Tolerations for pod assignment
   ## Ref: https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/
   tolerations:
+    - key: CriticalAddonsOnly
+      operator: Exists
     - key: sagemaker.amazonaws.com/node-health-status
       operator: "Equal"
       value: "Unschedulable"
@@ -35,4 +37,4 @@ mpiOperator:
   imagePullPolicy: IfNotPresent

   ## Apply extra labels to all created resources
-  extraLabels: {}
+  extraLabels: {}

helm_chart/HyperPodHelmChart/charts/training-operators/templates/Deployment/training-operator-kubeflow-Deployment.yaml

Lines changed: 4 additions & 1 deletion
@@ -54,5 +54,8 @@ spec:
           timeoutSeconds: 3
         securityContext:
           allowPrivilegeEscalation: false
+      tolerations:
+        - key: CriticalAddonsOnly
+          operator: Exists
       serviceAccountName: training-operator
-      terminationGracePeriodSeconds: 10
+      terminationGracePeriodSeconds: 10

helm_chart/HyperPodHelmChart/values.yaml

Lines changed: 2 additions & 0 deletions
@@ -180,6 +180,8 @@ nvidia-device-plugin:
       operator: Equal
       value: Unschedulable
       effect: NoSchedule
+    - key: CriticalAddonsOnly
+      operator: Exists

 neuron-device-plugin:
   devicePlugin:
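
For context, here is a minimal sketch of the kind of node taint the added tolerations are meant to match. The commit only shows the toleration side, so the node name, the value `"true"`, and the `NoSchedule` effect below are assumptions for illustration. Because each added toleration uses `operator: Exists` and sets no `effect`, it matches any taint whose key is `CriticalAddonsOnly`, regardless of that taint's value or effect.

```yaml
# Hypothetical Node excerpt: a taint that the tolerations added in this commit would match.
# The node name, value, and effect are illustrative assumptions, not taken from this commit.
apiVersion: v1
kind: Node
metadata:
  name: example-system-node
spec:
  taints:
    - key: CriticalAddonsOnly
      value: "true"
      effect: NoSchedule
```

Pods without a matching toleration would not be scheduled onto a node tainted this way, which is presumably why the toleration is added to every dependency in the chart.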
