This article helps you provision nodes with schedulable GPUs on new and existing AKS clusters.

## Supported GPU-enabled VMs

To view supported GPU-enabled VMs, see [GPU-optimized VM sizes in Azure][gpu-skus]. For AKS node pools, we recommend a minimum size of *Standard_NC6s_v3*. The NVv4 series (based on AMD GPUs) isn't supported on AKS.
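To check which GPU sizes are actually offered in your region before creating a node pool, you can query the SKU list with `az vm list-skus` (a sketch that requires an Azure login; *eastus* and the *Standard_NC* prefix are placeholders for your own region and size family):

```azurecli-interactive
# List NC-series VM sizes offered in a region (replace eastus with your region)
az vm list-skus \
    --location eastus \
    --resource-type virtualMachines \
    --size Standard_NC \
    --output table
```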

> [!NOTE]
> GPU-enabled VMs contain specialized hardware subject to higher pricing and region availability. For more information, see the [pricing][azure-pricing] tool and [region availability][azure-availability].

## Limitations

* If you're using an Azure Linux GPU-enabled node pool, automatic security patches aren't applied, and the default behavior for the cluster is *Unmanaged*. For more information, see [auto-upgrade](./auto-upgrade-node-image.md).
* The [NVadsA10](../virtual-machines/nva10v5-series.md) v5-series is *not* a recommended SKU for GPU VHD.
* AKS doesn't support Windows GPU-enabled node pools.
* Updating an existing node pool to add GPU isn't supported.

## Before you begin

* This article assumes you have an existing AKS cluster. If you don't have a cluster, create one using the [Azure CLI][aks-quickstart-cli], [Azure PowerShell][aks-quickstart-powershell], or the [Azure portal][aks-quickstart-portal].
* You need the Azure CLI version 2.0.64 or later installed and configured. Run `az --version` to find the version. If you need to install or upgrade, see [Install Azure CLI][install-azure-cli].

## Get the credentials for your cluster
## Options for using NVIDIA GPUs

Using NVIDIA GPUs involves the installation of various NVIDIA software components such as the [NVIDIA device plugin for Kubernetes](https://github.com/NVIDIA/k8s-device-plugin?tab=readme-ov-file), GPU driver installation, and more.

### Skip GPU driver installation (preview)

AKS has automatic GPU driver installation enabled by default. In some cases, such as installing your own drivers or using the NVIDIA GPU Operator, you may want to skip GPU driver installation.

[!INCLUDE [preview features callout](includes/preview/preview-callout.md)]

1. Install or update the `aks-preview` Azure CLI extension using the [`az extension add`][az-extension-add] or [`az extension update`][az-extension-update] command.

```azurecli-interactive
# Install the aks-preview extension
az extension add --name aks-preview

# Update the aks-preview extension
az extension update --name aks-preview
```

2. Create a node pool using the [`az aks nodepool add`][az-aks-nodepool-add] command with the `--skip-gpu-driver-install` flag to skip automatic GPU driver installation.

```azurecli-interactive
az aks nodepool add \
    --resource-group myResourceGroup \
    --cluster-name myAKSCluster \
    --name gpunp \
    --node-count 1 \
    --skip-gpu-driver-install \
    --node-vm-size Standard_NC6s_v3 \
    --node-taints sku=gpu:NoSchedule \
    --enable-cluster-autoscaler \
    --min-count 1 \
    --max-count 3
```

Adding the `--skip-gpu-driver-install` flag during node pool creation skips the automatic GPU driver installation. Any existing nodes aren't changed. You can scale the node pool to zero and then back up to make the change take effect.
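For example, one way to recycle the pool so that new driver-free nodes replace the old ones is to scale to zero and back with `az aks nodepool scale` (a sketch using the names from the example above; if the cluster autoscaler is enabled on the pool, disable it first with `az aks nodepool update --disable-cluster-autoscaler`):

```azurecli-interactive
# Remove the existing nodes, which were created before the flag took effect
az aks nodepool scale \
    --resource-group myResourceGroup \
    --cluster-name myAKSCluster \
    --name gpunp \
    --node-count 0

# Recreate them; the new nodes come up without the GPU driver installed
az aks nodepool scale \
    --resource-group myResourceGroup \
    --cluster-name myAKSCluster \
    --name gpunp \
    --node-count 1
```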

### NVIDIA device plugin installation

NVIDIA device plugin installation is required when using GPUs on AKS. In some cases, the installation is handled automatically, such as when using the [NVIDIA GPU Operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/microsoft-aks.html) or the [AKS GPU image (preview)](#use-the-aks-gpu-image-preview). Alternatively, you can manually install the NVIDIA device plugin.

#### Manually install the NVIDIA device plugin

You can deploy a DaemonSet for the NVIDIA device plugin, which runs a pod on each node to provide the required drivers for the GPUs. This is the recommended approach when using GPU-enabled node pools for Azure Linux.

##### [Ubuntu Linux node pool (default SKU)](#tab/add-ubuntu-gpu-node-pool)

To use the default OS SKU, you create the node pool without specifying an OS SKU. The node pool is configured for the default operating system based on the Kubernetes version of the cluster.

1. Add a node pool to your cluster using the [`az aks nodepool add`][az-aks-nodepool-add] command.

```azurecli-interactive
az aks nodepool add \
    --resource-group myResourceGroup \
    --cluster-name myAKSCluster \
    --name gpunp \
    --node-count 1 \
    --node-vm-size Standard_NC6s_v3 \
    --node-taints sku=gpu:NoSchedule \
    --enable-cluster-autoscaler \
    --min-count 1 \
    --max-count 3
```

This command adds a node pool named *gpunp* to *myAKSCluster* in *myResourceGroup* and uses parameters to configure the following node pool settings:

* `--node-vm-size`: Sets the VM size for the node in the node pool to *Standard_NC6s_v3*.
* `--node-taints`: Specifies a *sku=gpu:NoSchedule* taint on the node pool.
* `--enable-cluster-autoscaler`: Enables the cluster autoscaler.
* `--min-count`: Configures the cluster autoscaler to maintain a minimum of one node in the node pool.
* `--max-count`: Configures the cluster autoscaler to maintain a maximum of three nodes in the node pool.

> [!NOTE]
> Taints and VM sizes can only be set for node pools during node pool creation, but you can update autoscaler settings at any time.

##### [Azure Linux node pool](#tab/add-azure-linux-gpu-node-pool)

To use Azure Linux, you specify the OS SKU by setting `os-sku` to `AzureLinux` during node pool creation. The `os-type` is set to `Linux` by default.

1. Add a node pool to your cluster using the [`az aks nodepool add`][az-aks-nodepool-add] command with the `--os-sku` flag set to `AzureLinux`.

```azurecli-interactive
az aks nodepool add \
    --resource-group myResourceGroup \
    --cluster-name myAKSCluster \
    --name gpunp \
    --node-count 1 \
    --os-sku AzureLinux \
    --node-vm-size Standard_NC6s_v3 \
    --node-taints sku=gpu:NoSchedule \
    --enable-cluster-autoscaler \
    --min-count 1 \
    --max-count 3
```

This command adds a node pool named *gpunp* to *myAKSCluster* in *myResourceGroup* and uses parameters to configure the following node pool settings:

* `--node-vm-size`: Sets the VM size for the node in the node pool to *Standard_NC6s_v3*.
* `--node-taints`: Specifies a *sku=gpu:NoSchedule* taint on the node pool.
* `--enable-cluster-autoscaler`: Enables the cluster autoscaler.
* `--min-count`: Configures the cluster autoscaler to maintain a minimum of one node in the node pool.
* `--max-count`: Configures the cluster autoscaler to maintain a maximum of three nodes in the node pool.

> [!NOTE]
> Taints and VM sizes can only be set for node pools during node pool creation, but you can update autoscaler settings at any time. Certain SKUs, including A100 and H100 VM SKUs, aren't available for Azure Linux. For more information, see [GPU-optimized VM sizes in Azure][gpu-skus].

---

1. Create a namespace using the [`kubectl create namespace`][kubectl-create] command.

```bash
kubectl create namespace gpu-resources
```

2. Create a file named *nvidia-device-plugin-ds.yaml* and paste the following YAML manifest provided as part of the [NVIDIA device plugin for Kubernetes project][nvidia-github]:

```yaml
apiVersion: apps/v1
# ... (DaemonSet manifest continues)
          path: /var/lib/kubelet/device-plugins
```

3. Create the DaemonSet and confirm the NVIDIA device plugin is created successfully using the [`kubectl apply`][kubectl-apply] command.

```bash
kubectl apply -f nvidia-device-plugin-ds.yaml
```

4. Now that you successfully installed the NVIDIA device plugin, you can check that your [GPUs are schedulable](#confirm-that-gpus-are-schedulable) and [run a GPU workload](#run-a-gpu-enabled-workload).
### Use NVIDIA GPU Operator with AKS

The NVIDIA GPU Operator automates the management of all NVIDIA software components needed to provision GPUs, including driver installation, the [NVIDIA device plugin for Kubernetes](https://github.com/NVIDIA/k8s-device-plugin?tab=readme-ov-file), the NVIDIA container runtime, and more. Because the GPU Operator handles these components, you don't need to manually install the NVIDIA device plugin, and automatic GPU driver installation on AKS is no longer required.

1. Skip automatic GPU driver installation by creating a node pool using the [`az aks nodepool add`][az-aks-nodepool-add] command with the `--skip-gpu-driver-install` flag. Any existing nodes aren't changed; you can scale the node pool to zero and then back up to make the change take effect.
2. Follow the NVIDIA documentation to [Install the GPU Operator](https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/install-gpu-ocp.html#install-nvidiagpu:~:text=NVIDIA%20GPU%20Operator-,Installing%20the%20NVIDIA%20GPU%20Operator,-%EF%83%81).
3. Now that you successfully installed the GPU Operator, you can check that your [GPUs are schedulable](#confirm-that-gpus-are-schedulable) and [run a GPU workload](#run-a-gpu-enabled-workload).

> [!WARNING]
> We don't recommend manually installing the NVIDIA device plugin daemon set with clusters using the AKS GPU image.

### Use the AKS GPU image (preview)

AKS provides a fully configured AKS image containing the [NVIDIA device plugin for Kubernetes][nvidia-github]. The AKS GPU image is currently only supported for Ubuntu 18.04.

[!INCLUDE [preview features callout](includes/preview/preview-callout.md)]

1. Install the `aks-preview` Azure CLI extension using the [`az extension add`][az-extension-add] command.

```azurecli-interactive
az extension add --name aks-preview
```
2. Update to the latest version of the extension using the [`az extension update`][az-extension-update] command.
```azurecli-interactive
az extension update --name aks-preview
```
3. Register the `GPUDedicatedVHDPreview` feature flag using the [`az feature register`][az-feature-register] command.
```azurecli-interactive
az feature register --namespace "Microsoft.ContainerService" --name "GPUDedicatedVHDPreview"
```
It takes a few minutes for the status to show *Registered*.
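If you'd rather wait in a script than re-check by hand, a small polling loop over the feature state works (a sketch; adjust the interval to taste):

```azurecli-interactive
# Poll every 30 seconds until the feature flag reports Registered
while [ "$(az feature show \
    --namespace "Microsoft.ContainerService" \
    --name "GPUDedicatedVHDPreview" \
    --query properties.state \
    --output tsv)" != "Registered" ]; do
    echo "Waiting for feature registration..."
    sleep 30
done
```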
4. Verify the registration status using the [`az feature show`][az-feature-show] command.
```azurecli-interactive
az feature show --namespace "Microsoft.ContainerService" --name "GPUDedicatedVHDPreview"
```
5. When the status reflects *Registered*, refresh the registration of the *Microsoft.ContainerService* resource provider using the [`az provider register`][az-provider-register] command.
```azurecli-interactive
az provider register --namespace Microsoft.ContainerService
```
Now that you updated your cluster to use the AKS GPU image, you can add a node pool for GPU nodes to your cluster.
6. Add a node pool using the [`az aks nodepool add`][az-aks-nodepool-add] command.
```azurecli-interactive
az aks nodepool add \
    --resource-group myResourceGroup \
    --cluster-name myAKSCluster \
    --name gpunp \
    --node-count 1 \
    --node-vm-size Standard_NC6s_v3 \
    --node-taints sku=gpu:NoSchedule \
    --aks-custom-headers UseGPUDedicatedVHD=true \
    --enable-cluster-autoscaler \
    --min-count 1 \
    --max-count 3
```
The previous example command adds a node pool named *gpunp* to *myAKSCluster* in *myResourceGroup* and uses parameters to configure the following node pool settings:
* `--node-vm-size`: Sets the VM size for the node in the node pool to *Standard_NC6s_v3*.
* `--node-taints`: Specifies a *sku=gpu:NoSchedule* taint on the node pool.
* `--aks-custom-headers`: Specifies a specialized AKS GPU image, *UseGPUDedicatedVHD=true*. If your GPU SKU requires generation 2 VMs, use *--aks-custom-headers UseGPUDedicatedVHD=true,usegen2vm=true* instead.
* `--enable-cluster-autoscaler`: Enables the cluster autoscaler.
* `--min-count`: Configures the cluster autoscaler to maintain a minimum of one node in the node pool.
* `--max-count`: Configures the cluster autoscaler to maintain a maximum of three nodes in the node pool.
> [!NOTE]
> Taints and VM sizes can only be set for node pools during node pool creation, but you can update autoscaler settings at any time.
7. Now that you successfully created a node pool using the GPU image, you can check that your [GPUs are schedulable](#confirm-that-gpus-are-schedulable) and [run a GPU workload](#run-a-gpu-enabled-workload).
## Confirm that GPUs are schedulable
221
302
222
303
After creating your cluster, confirm that GPUs are schedulable in Kubernetes.
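A quick check is to look for the `nvidia.com/gpu` resource in a node's capacity. The snippet below runs that check against a captured sample of `kubectl describe node` output, since live values vary by VM size (the numbers here are illustrative); on a real cluster you'd pipe `kubectl describe node <node-name>` straight into the same `grep`.

```bash
# Sample excerpt of `kubectl describe node <gpu-node>` output (illustrative values)
cat <<'EOF' > node-describe.txt
Capacity:
  cpu:             6
  memory:          115387160Ki
  nvidia.com/gpu:  1
EOF

# On a live cluster: kubectl describe node <node-name> | grep 'nvidia.com/gpu'
grep 'nvidia.com/gpu' node-describe.txt
```

The `grep` prints the capacity line when the device plugin has registered the GPU; no output means the node isn't advertising a schedulable GPU.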