Commit 1d16fbb

Merge pull request #277881 from schaffererin/bug262390
Updated DaemonSet YAML with priorityClassName
2 parents e80675f + 7d4c33b commit 1d16fbb

1 file changed: +16 -21 lines changed

articles/aks/gpu-cluster.md

Lines changed: 16 additions & 21 deletions
@@ -164,7 +164,7 @@ To use Azure Linux, you specify the OS SKU by setting `os-sku` to `AzureLinux` d
     kind: DaemonSet
     metadata:
       name: nvidia-device-plugin-daemonset
-      namespace: gpu-resources
+      namespace: kube-system
     spec:
       selector:
         matchLabels:
@@ -173,40 +173,35 @@ To use Azure Linux, you specify the OS SKU by setting `os-sku` to `AzureLinux` d
         type: RollingUpdate
       template:
         metadata:
-          # Mark this pod as a critical add-on; when enabled, the critical add-on scheduler
-          # reserves resources for critical add-on pods so that they can be rescheduled after
-          # a failure. This annotation works in tandem with the toleration below.
-          annotations:
-            scheduler.alpha.kubernetes.io/critical-pod: ""
           labels:
             name: nvidia-device-plugin-ds
         spec:
           tolerations:
-          # Allow this pod to be rescheduled while the node is in "critical add-ons only" mode.
-          # This, along with the annotation above marks this pod as a critical add-on.
-          - key: CriticalAddonsOnly
-            operator: Exists
           - key: nvidia.com/gpu
             operator: Exists
             effect: NoSchedule
-          - key: "sku"
-            operator: "Equal"
-            value: "gpu"
-            effect: "NoSchedule"
+          # Mark this pod as a critical add-on; when enabled, the critical add-on
+          # scheduler reserves resources for critical add-on pods so that they can
+          # be rescheduled after a failure.
+          # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
+          priorityClassName: "system-node-critical"
           containers:
-          - image: mcr.microsoft.com/oss/nvidia/k8s-device-plugin:v0.14.1
+          - image: nvcr.io/nvidia/k8s-device-plugin:v0.15.0
             name: nvidia-device-plugin-ctr
+            env:
+              - name: FAIL_ON_INIT_ERROR
+                value: "false"
             securityContext:
               allowPrivilegeEscalation: false
               capabilities:
                 drop: ["ALL"]
             volumeMounts:
-              - name: device-plugin
-                mountPath: /var/lib/kubelet/device-plugins
-          volumes:
             - name: device-plugin
-              hostPath:
-                path: /var/lib/kubelet/device-plugins
+              mountPath: /var/lib/kubelet/device-plugins
+          volumes:
+          - name: device-plugin
+            hostPath:
+              path: /var/lib/kubelet/device-plugins
     ```
 
 3. Create the DaemonSet and confirm the NVIDIA device plugin is created successfully using the [`kubectl apply`][kubectl-apply] command.
@@ -499,7 +494,7 @@ To see the GPU in action, you can schedule a GPU-enabled workload with the appro
 [kubectl-create]: https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#create
 [azure-pricing]: https://azure.microsoft.com/pricing/
 [azure-availability]: https://azure.microsoft.com/global-infrastructure/services/
-[nvidia-github]: https://github.com/NVIDIA/k8s-device-plugin
+[nvidia-github]: https://github.com/NVIDIA/k8s-device-plugin/blob/4b3d6b0a6613a3672f71ea4719fd8633eaafb4f3/deployments/static/nvidia-device-plugin.yml
 
 <!-- LINKS - internal -->
 [az-aks-create]: /cli/azure/aks#az_aks_create
