Commit 4924d38

Merge pull request #265337 from schaffererin/windowsGPUreview
windowsGPU PR review
2 parents d4fc9f0 + eda56b0 commit 4924d38

File tree: 2 files changed, +345 -5 lines changed

articles/aks/TOC.yml

Lines changed: 25 additions & 5 deletions
```diff
@@ -288,8 +288,6 @@
       href: artifact-streaming.md
     - name: Add an Azure Spot node pool
       href: spot-node-pool.md
-    - name: Multi-instance GPU node pool
-      href: gpu-multi-instance.md
     - name: Node pool snapshot
       href: node-pool-snapshot.md
     - name: Use system node pools
@@ -305,7 +303,31 @@
     - name: Use the Azure portal
       href: virtual-nodes-portal.md
   - name: Workloads
-    items:
+    items:
+    - name: GPU workloads
+      items:
+      - name: Use GPUs
+        href: gpu-cluster.md
+      - name: Use Windows GPUs
+        href: use-windows-gpu.md
+      - name: Multi-instance GPU node pool
+        href: gpu-multi-instance.md
+    - name: Vertical Pod Autoscaler
+      items:
+      - name: About Vertical Pod Autoscaler
+        href: vertical-pod-autoscaler.md
+      - name: Vertical Pod Autoscaler API reference
+        href: vertical-pod-autoscaler-api-reference.md
+      - name: Configure Metrics Server VPA
+        href: use-metrics-server-vertical-pod-autoscaler.md
+    - name: Proximity placement groups
+      href: reduce-latency-ppg.md
+    - name: Cluster autoscaler
+      items:
+      - name: Cluster autoscaler overview
+        href: cluster-autoscaler-overview.md
+      - name: Use the cluster autoscaler on AKS
+        href: cluster-autoscaler.md
     - name: Node autoprovision
       href: node-autoprovision.md
   - name: Availability zones
@@ -744,8 +766,6 @@
       href: /visualstudio/bridge/bridge-to-kubernetes-vs?toc=/azure/aks/toc.json&bc=/azure/aks/breadcrumb/toc.json
     - name: Use OpenFaaS
       href: openfaas.md
-    - name: Use GPUs
-      href: gpu-cluster.md
     - name: Create containerized app with Draft
       href: draft.md
     - name: Build Django app with PostgreSQL
```
articles/aks/use-windows-gpu.md

Lines changed: 320 additions & 0 deletions
---
title: Use GPUs for Windows node pools on Azure Kubernetes Service (AKS)
description: Learn how to use Windows GPUs for high-performance compute or graphics-intensive workloads on Azure Kubernetes Service (AKS).
ms.topic: article
ms.date: 03/18/2024
#Customer intent: As a cluster administrator or developer, I want to create an AKS cluster that can use high-performance GPU-based VMs for compute-intensive workloads using a Windows OS.
---

# Use Windows GPUs for compute-intensive workloads on Azure Kubernetes Service (AKS)

Graphical processing units (GPUs) are often used for compute-intensive workloads, such as graphics and visualization workloads. AKS supports GPU-enabled Windows and [Linux](./gpu-cluster.md) node pools to run compute-intensive Kubernetes workloads.

This article helps you provision Windows nodes with schedulable GPUs on new and existing AKS clusters.

## Supported GPU-enabled virtual machines (VMs)

To view supported GPU-enabled VMs, see [GPU-optimized VM sizes in Azure][gpu-skus]. For AKS node pools, we recommend a minimum size of *Standard_NC6s_v3*. The NVv4 series (based on AMD GPUs) isn't supported on AKS.

> [!NOTE]
> GPU-enabled VMs contain specialized hardware subject to higher pricing and region availability. For more information, see the [pricing][azure-pricing] tool and [region availability][azure-availability].

## Limitations

* Updating an existing Windows node pool to add GPUs isn't supported.
* Windows GPU node pools aren't supported on Kubernetes version 1.28 and earlier.

## Before you begin

* This article assumes you have an existing AKS cluster. If you don't have a cluster, create one using the [Azure CLI][aks-quickstart-cli], [Azure PowerShell][aks-quickstart-powershell], or the [Azure portal][aks-quickstart-portal].
* You need the `aks-preview` Azure CLI extension version 1.0.0b2 or later installed and configured to use the `--skip-gpu-driver-install` field with the `az aks nodepool add` command. Run `az --version` to find the version. If you need to install or upgrade, see [Install Azure CLI][install-azure-cli].

## Get the credentials for your cluster

* Get the credentials for your AKS cluster using the [`az aks get-credentials`][az-aks-get-credentials] command. The following example command gets the credentials for the *myAKSCluster* cluster in the *myResourceGroup* resource group:

    ```azurecli-interactive
    az aks get-credentials --resource-group myResourceGroup --name myAKSCluster
    ```

## Using Windows GPU with automatic driver installation

Using NVIDIA GPUs involves the installation of various NVIDIA software components, such as the [DirectX device plugin for Kubernetes](https://github.com/aarnaud/k8s-directx-device-plugin), GPU driver installation, and more. When you create a Windows node pool with a supported GPU-enabled VM, these components and the appropriate NVIDIA CUDA or GRID drivers are installed. For NC and ND series VM sizes, the CUDA driver is installed. For NV series VM sizes, the GRID driver is installed.
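The driver family follows from the node pool's VM size. If you need to confirm which series an existing node pool uses, a minimal check such as the following works; the *gpunp* node pool name is an assumption carried over from the examples later in this article:

```azurecli-interactive
# Show the VM size of an existing node pool.
# NC- and ND-series sizes receive the CUDA driver; NV-series sizes receive the GRID driver.
az aks nodepool show \
    --resource-group myResourceGroup \
    --cluster-name myAKSCluster \
    --name gpunp \
    --query vmSize \
    --output tsv
```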
[!INCLUDE [preview features callout](includes/preview/preview-callout.md)]

### Install the `aks-preview` Azure CLI extension

* Register or update the aks-preview extension using the [`az extension add`][az-extension-add] or [`az extension update`][az-extension-update] command.

    ```azurecli-interactive
    # Register the aks-preview extension
    az extension add --name aks-preview

    # Update the aks-preview extension
    az extension update --name aks-preview
    ```

### Register the `WindowsGPUPreview` feature flag

1. Register the `WindowsGPUPreview` feature flag using the [`az feature register`][az-feature-register] command.

    ```azurecli-interactive
    az feature register --namespace "Microsoft.ContainerService" --name "WindowsGPUPreview"
    ```

    It takes a few minutes for the status to show *Registered*.

2. Verify the registration status using the [`az feature show`][az-feature-show] command.

    ```azurecli-interactive
    az feature show --namespace "Microsoft.ContainerService" --name "WindowsGPUPreview"
    ```

3. When the status reflects *Registered*, refresh the registration of the *Microsoft.ContainerService* resource provider using the [`az provider register`][az-provider-register] command.

    ```azurecli-interactive
    az provider register --namespace Microsoft.ContainerService
    ```

### Create a Windows GPU-enabled node pool (preview)

To create a Windows GPU-enabled node pool, you need to use a supported GPU-enabled VM size and specify the `os-type` as `Windows`. The default Windows `os-sku` is `Windows2022`, but all Windows `os-sku` options are supported.

1. Create a Windows GPU-enabled node pool using the [`az aks nodepool add`][az-aks-nodepool-add] command.

    ```azurecli-interactive
    az aks nodepool add \
        --resource-group myResourceGroup \
        --cluster-name myAKSCluster \
        --name gpunp \
        --node-count 1 \
        --os-type Windows \
        --kubernetes-version 1.29.0 \
        --node-vm-size Standard_NC6s_v3
    ```

2. Check that your [GPUs are schedulable](#confirm-that-gpus-are-schedulable).
3. Once you confirm that your GPUs are schedulable, you can run your GPU workload. For an example manifest, see the sketch after these steps.
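This article doesn't ship a sample workload, but the [clean-up step](#clean-up-resources) removes a job named *windows-gpu-workload*. As a minimal sketch of what such a job could look like, the following manifest requests one GPU through the `microsoft.com/directx` resource that the DirectX device plugin exposes; the container image and command are placeholders, not part of this article:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: windows-gpu-workload
spec:
  template:
    spec:
      nodeSelector:
        # Make sure the pod lands on a Windows node.
        kubernetes.io/os: windows
      containers:
      - name: directx-sample
        # Placeholder image and command; substitute your own GPU-enabled Windows container.
        image: mcr.microsoft.com/windows/servercore:ltsc2022
        command: ["cmd", "/c", "echo placeholder for a GPU workload"]
        resources:
          limits:
            # Request a whole GPU; fractional GPUs can't be requested.
            microsoft.com/directx: "1"
      restartPolicy: Never
```

Save the manifest (for example, as *windows-gpu-workload.yaml*, an arbitrary name) and deploy it with `kubectl apply -f windows-gpu-workload.yaml`.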
## Using Windows GPU with manual driver installation

When creating a Windows node pool with N-series (NVIDIA GPU) VM sizes in AKS, the GPU driver and Kubernetes DirectX device plugin are installed automatically. To bypass this automatic installation, use the following steps:

1. [Skip GPU driver installation (preview)](#skip-gpu-driver-installation-preview) using `--skip-gpu-driver-install`.
2. [Manually install the Kubernetes DirectX device plugin](#manually-install-the-kubernetes-directx-device-plugin).

### Skip GPU driver installation (preview)

AKS has automatic GPU driver installation enabled by default. In some cases, such as installing your own drivers, you may want to skip GPU driver installation.

[!INCLUDE [preview features callout](includes/preview/preview-callout.md)]

1. Register or update the aks-preview extension using the [`az extension add`][az-extension-add] or [`az extension update`][az-extension-update] command.

    ```azurecli-interactive
    # Register the aks-preview extension
    az extension add --name aks-preview

    # Update the aks-preview extension
    az extension update --name aks-preview
    ```

2. Create a node pool using the [`az aks nodepool add`][az-aks-nodepool-add] command with the `--skip-gpu-driver-install` flag to skip automatic GPU driver installation.

    ```azurecli-interactive
    az aks nodepool add \
        --resource-group myResourceGroup \
        --cluster-name myAKSCluster \
        --name gpunp \
        --node-count 1 \
        --os-type windows \
        --os-sku windows2022 \
        --node-vm-size Standard_NC6s_v3 \
        --skip-gpu-driver-install
    ```

    > [!NOTE]
    > If the `--node-vm-size` that you're using isn't yet onboarded on AKS, you can't use GPUs and `--skip-gpu-driver-install` doesn't work.
### Manually install the Kubernetes DirectX device plugin

You can deploy a DaemonSet for the Kubernetes DirectX device plugin, which runs a pod on each node to expose the GPUs to Kubernetes workloads.

* Add a node pool to your cluster using the [`az aks nodepool add`][az-aks-nodepool-add] command.

    ```azurecli-interactive
    az aks nodepool add \
        --resource-group myResourceGroup \
        --cluster-name myAKSCluster \
        --name gpunp \
        --node-count 1 \
        --os-type windows \
        --os-sku windows2022 \
        --node-vm-size Standard_NC6s_v3
    ```

## Create a namespace and deploy the Kubernetes DirectX device plugin

1. Create a namespace using the [`kubectl create namespace`][kubectl-create] command.

    ```bash
    kubectl create namespace gpu-resources
    ```

2. Create a file named *k8s-directx-device-plugin.yaml* and paste the following YAML manifest provided as part of the [NVIDIA device plugin for Kubernetes project][nvidia-github]:

    ```yaml
    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: nvidia-device-plugin-daemonset
      namespace: gpu-resources
    spec:
      selector:
        matchLabels:
          name: nvidia-device-plugin-ds
      updateStrategy:
        type: RollingUpdate
      template:
        metadata:
          # Mark this pod as a critical add-on; when enabled, the critical add-on scheduler
          # reserves resources for critical add-on pods so that they can be rescheduled after
          # a failure. This annotation works in tandem with the toleration below.
          annotations:
            scheduler.alpha.kubernetes.io/critical-pod: ""
          labels:
            name: nvidia-device-plugin-ds
        spec:
          tolerations:
          # Allow this pod to be rescheduled while the node is in "critical add-ons only" mode.
          # This, along with the annotation above, marks this pod as a critical add-on.
          - key: CriticalAddonsOnly
            operator: Exists
          - key: nvidia.com/gpu
            operator: Exists
            effect: NoSchedule
          - key: "sku"
            operator: "Equal"
            value: "gpu"
            effect: "NoSchedule"
          containers:
          - image: mcr.microsoft.com/oss/nvidia/k8s-device-plugin:v0.14.1
            name: nvidia-device-plugin-ctr
            securityContext:
              allowPrivilegeEscalation: false
              capabilities:
                drop: ["ALL"]
            volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
          volumes:
          - name: device-plugin
            hostPath:
              path: /var/lib/kubelet/device-plugins
    ```

3. Create the DaemonSet and confirm the NVIDIA device plugin is created successfully using the [`kubectl apply`][kubectl-apply] command.

    ```bash
    kubectl apply -f k8s-directx-device-plugin.yaml
    ```

4. Now that you successfully installed the NVIDIA device plugin, you can check that your [GPUs are schedulable](#confirm-that-gpus-are-schedulable).
## Confirm that GPUs are schedulable

After creating your cluster, confirm that GPUs are schedulable in Kubernetes.

1. List the nodes in your cluster using the [`kubectl get nodes`][kubectl-get] command.

    ```console
    kubectl get nodes
    ```

    Your output should look similar to the following example output:

    ```console
    NAME                   STATUS   ROLES   AGE   VERSION
    aks-gpunp-28993262-0   Ready    agent   13m   v1.20.7
    ```

2. Confirm the GPUs are schedulable using the [`kubectl describe node`][kubectl-describe] command.

    ```console
    kubectl describe node aks-gpunp-28993262-0
    ```

    Under the *Capacity* section, the GPU should list as `microsoft.com/directx: 1`. Your output should look similar to the following condensed example output:

    ```output
    Capacity:
    [...]
     microsoft.com/directx: 1
    [...]
    ```
## Use Container Insights to monitor GPU usage

[Container Insights with AKS][aks-container-insights] monitors the following GPU usage metrics:

| Metric name | Metric dimension (tags) | Description |
|-------------|-------------------------|-------------|
| containerGpuDutyCycle | `container.azm.ms/clusterId`, `container.azm.ms/clusterName`, `containerName`, `gpuId`, `gpuModel`, `gpuVendor` | Percentage of time over the past sample period (60 seconds) during which the GPU was busy/actively processing for a container. Duty cycle is a number between 1 and 100. |
| containerGpuLimits | `container.azm.ms/clusterId`, `container.azm.ms/clusterName`, `containerName` | Each container can specify limits as one or more GPUs. It's not possible to request or limit a fraction of a GPU. |
| containerGpuRequests | `container.azm.ms/clusterId`, `container.azm.ms/clusterName`, `containerName` | Each container can request one or more GPUs. It's not possible to request or limit a fraction of a GPU. |
| containerGpumemoryTotalBytes | `container.azm.ms/clusterId`, `container.azm.ms/clusterName`, `containerName`, `gpuId`, `gpuModel`, `gpuVendor` | Amount of GPU memory in bytes available to use for a specific container. |
| containerGpumemoryUsedBytes | `container.azm.ms/clusterId`, `container.azm.ms/clusterName`, `containerName`, `gpuId`, `gpuModel`, `gpuVendor` | Amount of GPU memory in bytes used by a specific container. |
| nodeGpuAllocatable | `container.azm.ms/clusterId`, `container.azm.ms/clusterName`, `gpuVendor` | Number of GPUs in a node that Kubernetes can use. |
| nodeGpuCapacity | `container.azm.ms/clusterId`, `container.azm.ms/clusterName`, `gpuVendor` | Total number of GPUs in a node. |
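If you prefer the command line over the portal, you can spot-check these metrics in the Log Analytics workspace behind Container Insights. The following is a minimal sketch, assuming the GPU metrics land in the `InsightsMetrics` table and using a placeholder workspace ID:

```azurecli-interactive
# Placeholder workspace GUID; replace it with the Log Analytics workspace backing your cluster.
az monitor log-analytics query \
    --workspace "00000000-0000-0000-0000-000000000000" \
    --analytics-query "InsightsMetrics | where Name == 'nodeGpuCapacity' | project TimeGenerated, Computer, Val | take 10"
```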
## Clean up resources

* Remove the associated Kubernetes objects you created in this article using the [`kubectl delete job`][kubectl-delete] command.

    ```console
    kubectl delete jobs windows-gpu-workload
    ```

## Next steps

* To run Apache Spark jobs, see [Run Apache Spark jobs on AKS][aks-spark].
* For more information on features of the Kubernetes scheduler, see [Best practices for advanced scheduler features in AKS][advanced-scheduler-aks].
* For more information on Azure Kubernetes Service and Azure Machine Learning, see:
  * [Configure a Kubernetes cluster for ML model training or deployment][azureml-aks].
  * [Deploy a model with an online endpoint][azureml-deploy].
  * [High-performance serving with Triton Inference Server][azureml-triton].
  * [Labs for Kubernetes and Kubeflow][kubeflow].
<!-- LINKS - external -->
[kubectl-apply]: https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#apply
[kubectl-get]: https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#get
[kubeflow]: https://github.com/Azure/kubeflow-labs
[kubectl-describe]: https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#describe
[kubectl-logs]: https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#logs
[kubectl-delete]: https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#delete
[kubectl-create]: https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#create
[azure-pricing]: https://azure.microsoft.com/pricing/
[azure-availability]: https://azure.microsoft.com/global-infrastructure/services/
[nvidia-github]: https://github.com/NVIDIA/k8s-device-plugin

<!-- LINKS - internal -->
[az-aks-create]: /cli/azure/aks#az_aks_create
[az-aks-nodepool-update]: /cli/azure/aks/nodepool#az_aks_nodepool_update
[az-aks-nodepool-add]: /cli/azure/aks/nodepool#az_aks_nodepool_add
[az-aks-get-credentials]: /cli/azure/aks#az_aks_get_credentials
[aks-quickstart-cli]: ./learn/quick-windows-container-deploy-cli.md
[aks-quickstart-portal]: ./learn/quick-windows-container-deploy-portal.md
[aks-quickstart-powershell]: ./learn/quick-windows-container-deploy-powershell.md
[aks-spark]: spark-job.md
[gpu-skus]: ../virtual-machines/sizes-gpu.md
[install-azure-cli]: /cli/azure/install-azure-cli
[azureml-aks]: ../machine-learning/how-to-attach-kubernetes-anywhere.md
[azureml-deploy]: ../machine-learning/how-to-deploy-managed-online-endpoints.md
[azureml-triton]: ../machine-learning/how-to-deploy-with-triton.md
[aks-container-insights]: monitor-aks.md#integrations
[advanced-scheduler-aks]: operator-best-practices-advanced-scheduler.md
[az-provider-register]: /cli/azure/provider#az-provider-register
[az-feature-register]: /cli/azure/feature#az-feature-register
[az-feature-show]: /cli/azure/feature#az-feature-show
[az-extension-add]: /cli/azure/extension#az-extension-add
[az-extension-update]: /cli/azure/extension#az-extension-update
[NVadsA10]: /azure/virtual-machines/nva10v5-series
