Commit 6806663

Merge pull request #263851 from schaffererin/aks-gpu-edits: Review/update to changes to Use GPUs doc
2 parents 6523bb3 + d7a603b
File tree
articles/aks/gpu-cluster.md

Lines changed: 137 additions & 56 deletions
This article helps you provision nodes with schedulable GPUs on new and existing AKS clusters.

## Supported GPU-enabled VMs

To view supported GPU-enabled VMs, see [GPU-optimized VM sizes in Azure][gpu-skus]. For AKS node pools, we recommend a minimum size of *Standard_NC6s_v3*. The NVv4 series (based on AMD GPUs) isn't supported on AKS.

> [!NOTE]
> GPU-enabled VMs contain specialized hardware subject to higher pricing and region availability. For more information, see the [pricing][azure-pricing] tool and [region availability][azure-availability].

## Limitations

* If you're using an Azure Linux GPU-enabled node pool, automatic security patches aren't applied, and the default behavior for the cluster is *Unmanaged*. For more information, see [auto-upgrade](./auto-upgrade-node-image.md).
* [NVadsA10](../virtual-machines/nva10v5-series.md) v5-series are *not* a recommended SKU for GPU VHD.
* AKS doesn't support Windows GPU-enabled node pools.
* Updating an existing node pool to add GPU isn't supported.

## Before you begin

* This article assumes you have an existing AKS cluster. If you don't have a cluster, create one using the [Azure CLI][aks-quickstart-cli], [Azure PowerShell][aks-quickstart-powershell], or the [Azure portal][aks-quickstart-portal].
* You need the Azure CLI version 2.0.64 or later installed and configured. Run `az --version` to find the version. If you need to install or upgrade, see [Install Azure CLI][install-azure-cli].
## Get the credentials for your cluster
## Options for using NVIDIA GPUs

Using NVIDIA GPUs involves the installation of various NVIDIA software components such as the [NVIDIA device plugin for Kubernetes](https://github.com/NVIDIA/k8s-device-plugin?tab=readme-ov-file), GPU driver installation, and more.

### Skip GPU driver installation (preview)

AKS has automatic GPU driver installation enabled by default. In some cases, such as installing your own drivers or using the NVIDIA GPU Operator, you may want to skip GPU driver installation.

[!INCLUDE [preview features callout](includes/preview/preview-callout.md)]

1. Register or update the aks-preview extension using the [`az extension add`][az-extension-add] or [`az extension update`][az-extension-update] command.

    ```azurecli-interactive
    # Register the aks-preview extension
    az extension add --name aks-preview

    # Update the aks-preview extension
    az extension update --name aks-preview
    ```

2. Create a node pool using the [`az aks nodepool add`][az-aks-nodepool-add] command with the `--skip-gpu-driver-install` flag to skip automatic GPU driver installation.

    ```azurecli-interactive
    az aks nodepool add \
        --resource-group myResourceGroup \
        --cluster-name myAKSCluster \
        --name gpunp \
        --node-count 1 \
        --skip-gpu-driver-install \
        --node-vm-size Standard_NC6s_v3 \
        --node-taints sku=gpu:NoSchedule \
        --enable-cluster-autoscaler \
        --min-count 1 \
        --max-count 3
    ```

    Adding the `--skip-gpu-driver-install` flag during node pool creation skips the automatic GPU driver installation. Any existing nodes aren't changed. You can scale the node pool to zero and then back up to make the change take effect.
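The scale-to-zero step can be sketched as follows; a hypothetical example reusing the *myResourceGroup*, *myAKSCluster*, and *gpunp* names from this article. Note that a node pool with the cluster autoscaler enabled must have it disabled before manual scaling.

```shell
# If the cluster autoscaler is enabled on the pool, disable it before manual scaling.
az aks nodepool update --resource-group myResourceGroup --cluster-name myAKSCluster --name gpunp --disable-cluster-autoscaler

# Scale to zero and back up so newly created nodes pick up the setting.
az aks nodepool scale --resource-group myResourceGroup --cluster-name myAKSCluster --name gpunp --node-count 0
az aks nodepool scale --resource-group myResourceGroup --cluster-name myAKSCluster --name gpunp --node-count 1
```

You can re-enable the autoscaler afterwards with `az aks nodepool update --enable-cluster-autoscaler`.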

### NVIDIA device plugin installation

NVIDIA device plugin installation is required when using GPUs on AKS. In some cases, the installation is handled automatically, such as when using the [NVIDIA GPU Operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/microsoft-aks.html) or the [AKS GPU image (preview)](#use-the-aks-gpu-image-preview). Alternatively, you can manually install the NVIDIA device plugin.

#### Manually install the NVIDIA device plugin

You can deploy a DaemonSet for the NVIDIA device plugin, which runs a pod on each node to provide the required drivers for the GPUs. This is the recommended approach when using GPU-enabled node pools for Azure Linux.

##### [Ubuntu Linux node pool (default SKU)](#tab/add-ubuntu-gpu-node-pool)

To use the default OS SKU, you create the node pool without specifying an OS SKU. The node pool is configured for the default operating system based on the Kubernetes version of the cluster.

1. Add a node pool to your cluster using the [`az aks nodepool add`][az-aks-nodepool-add] command.

    ```azurecli-interactive
    az aks nodepool add \
        --resource-group myResourceGroup \
        --cluster-name myAKSCluster \
        --name gpunp \
        --node-count 1 \
        --node-vm-size Standard_NC6s_v3 \
        --node-taints sku=gpu:NoSchedule \
        --enable-cluster-autoscaler \
        --min-count 1 \
        --max-count 3
    ```

    This command adds a node pool named *gpunp* to *myAKSCluster* in *myResourceGroup* and uses parameters to configure the following node pool settings:

    * `--node-vm-size`: Sets the VM size for the node in the node pool to *Standard_NC6s_v3*.
    * `--node-taints`: Specifies a *sku=gpu:NoSchedule* taint on the node pool.
    * `--enable-cluster-autoscaler`: Enables the cluster autoscaler.
    * `--min-count`: Configures the cluster autoscaler to maintain a minimum of one node in the node pool.
    * `--max-count`: Configures the cluster autoscaler to maintain a maximum of three nodes in the node pool.

    > [!NOTE]
    > Taints and VM sizes can only be set for node pools during node pool creation, but you can update autoscaler settings at any time.

##### [Azure Linux node pool](#tab/add-azure-linux-gpu-node-pool)

To use Azure Linux, you specify the OS SKU by setting `os-sku` to `AzureLinux` during node pool creation. The `os-type` is set to `Linux` by default.

1. Add a node pool to your cluster using the [`az aks nodepool add`][az-aks-nodepool-add] command with the `--os-sku` flag set to `AzureLinux`.

    ```azurecli-interactive
    az aks nodepool add \
        --resource-group myResourceGroup \
        --cluster-name myAKSCluster \
        --name gpunp \
        --node-count 1 \
        --os-sku AzureLinux \
        --node-vm-size Standard_NC6s_v3 \
        --node-taints sku=gpu:NoSchedule \
        --enable-cluster-autoscaler \
        --min-count 1 \
        --max-count 3
    ```

    This command adds a node pool named *gpunp* to *myAKSCluster* in *myResourceGroup* and uses parameters to configure the following node pool settings:

    * `--node-vm-size`: Sets the VM size for the node in the node pool to *Standard_NC6s_v3*.
    * `--node-taints`: Specifies a *sku=gpu:NoSchedule* taint on the node pool.
    * `--enable-cluster-autoscaler`: Enables the cluster autoscaler.
    * `--min-count`: Configures the cluster autoscaler to maintain a minimum of one node in the node pool.
    * `--max-count`: Configures the cluster autoscaler to maintain a maximum of three nodes in the node pool.

    > [!NOTE]
    > Taints and VM sizes can only be set for node pools during node pool creation, but you can update autoscaler settings at any time. Certain SKUs, including A100 and H100 VM SKUs, aren't available for Azure Linux. For more information, see [GPU-optimized VM sizes in Azure][gpu-skus].

---

1. Create a namespace using the [`kubectl create namespace`][kubectl-create] command.

    ```bash
    kubectl create namespace gpu-resources
    ```

2. Create a file named *nvidia-device-plugin-ds.yaml* and paste the following YAML manifest provided as part of the [NVIDIA device plugin for Kubernetes project][nvidia-github]:

    ```yaml
    apiVersion: apps/v1
    ...
              path: /var/lib/kubelet/device-plugins
    ```

3. Create the DaemonSet and confirm the NVIDIA device plugin is created successfully using the [`kubectl apply`][kubectl-apply] command.

    ```bash
    kubectl apply -f nvidia-device-plugin-ds.yaml
    ```
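Once applied, the rollout can be verified; a hedged sketch assuming the upstream manifest's DaemonSet name (*nvidia-device-plugin-daemonset*) and the *gpu-resources* namespace created earlier, both of which may differ if you customized the manifest:

```shell
# Wait for the device plugin DaemonSet rollout to finish, then list its pods.
kubectl -n gpu-resources rollout status daemonset/nvidia-device-plugin-daemonset
kubectl -n gpu-resources get pods -o wide
```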

4. Now that you successfully installed the NVIDIA device plugin, you can check that your [GPUs are schedulable](#confirm-that-gpus-are-schedulable) and [run a GPU workload](#run-a-gpu-enabled-workload).

### Use NVIDIA GPU Operator with AKS

The NVIDIA GPU Operator automates the management of all NVIDIA software components needed to provision GPUs, including driver installation, the [NVIDIA device plugin for Kubernetes](https://github.com/NVIDIA/k8s-device-plugin?tab=readme-ov-file), the NVIDIA container runtime, and more. Because the GPU Operator handles these components, it's not necessary to manually install the NVIDIA device plugin. This also means that the automatic GPU driver installation on AKS is no longer required.

1. Skip automatic GPU driver installation by creating a node pool using the [`az aks nodepool add`][az-aks-nodepool-add] command with `--skip-gpu-driver-install`. Adding the `--skip-gpu-driver-install` flag during node pool creation skips the automatic GPU driver installation. Any existing nodes aren't changed. You can scale the node pool to zero and then back up to make the change take effect.

2. Follow the NVIDIA documentation to [install the GPU Operator](https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/install-gpu-ocp.html#install-nvidiagpu:~:text=NVIDIA%20GPU%20Operator-,Installing%20the%20NVIDIA%20GPU%20Operator,-%EF%83%81).

3. Now that you successfully installed the GPU Operator, you can check that your [GPUs are schedulable](#confirm-that-gpus-are-schedulable) and [run a GPU workload](#run-a-gpu-enabled-workload).
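The operator install in step 2 is typically done with Helm; a minimal, hedged sketch using NVIDIA's public Helm repository. The namespace name is an illustrative choice, and the flags and chart values you need may differ for your cluster, so consult the NVIDIA documentation linked above:

```shell
# Add NVIDIA's Helm repository and install the GPU Operator into its own namespace.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# --wait blocks until the operator components report ready.
helm install gpu-operator nvidia/gpu-operator \
    --namespace gpu-operator \
    --create-namespace \
    --wait
```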

> [!WARNING]
> We don't recommend manually installing the NVIDIA device plugin daemon set with clusters using the AKS GPU image.

### Use the AKS GPU image (preview)

AKS provides a fully configured AKS image containing the [NVIDIA device plugin for Kubernetes][nvidia-github]. The AKS GPU image is currently only supported for Ubuntu 18.04.

[!INCLUDE [preview features callout](includes/preview/preview-callout.md)]

1. Install the `aks-preview` Azure CLI extension using the [`az extension add`][az-extension-add] command.

    ```azurecli-interactive
    az extension add --name aks-preview
    ```

2. Update to the latest version of the extension using the [`az extension update`][az-extension-update] command.

    ```azurecli-interactive
    az extension update --name aks-preview
    ```

3. Register the `GPUDedicatedVHDPreview` feature flag using the [`az feature register`][az-feature-register] command.

    ```azurecli-interactive
    az feature register --namespace "Microsoft.ContainerService" --name "GPUDedicatedVHDPreview"
    ```

    It takes a few minutes for the status to show *Registered*.

4. Verify the registration status using the [`az feature show`][az-feature-show] command.

    ```azurecli-interactive
    az feature show --namespace "Microsoft.ContainerService" --name "GPUDedicatedVHDPreview"
    ```

5. When the status reflects *Registered*, refresh the registration of the *Microsoft.ContainerService* resource provider using the [`az provider register`][az-provider-register] command.

    ```azurecli-interactive
    az provider register --namespace Microsoft.ContainerService
    ```

    Now that you updated your cluster to use the AKS GPU image, you can add a node pool for GPU nodes to your cluster.

6. Add a node pool using the [`az aks nodepool add`][az-aks-nodepool-add] command.

    ```azurecli-interactive
    az aks nodepool add \
        --resource-group myResourceGroup \
        --cluster-name myAKSCluster \
        --name gpunp \
        --node-count 1 \
        --node-vm-size Standard_NC6s_v3 \
        --node-taints sku=gpu:NoSchedule \
        --aks-custom-headers UseGPUDedicatedVHD=true \
        --enable-cluster-autoscaler \
        --min-count 1 \
        --max-count 3
    ```

    This command adds a node pool named *gpunp* to *myAKSCluster* in *myResourceGroup* and uses parameters to configure the following node pool settings:

    * `--node-vm-size`: Sets the VM size for the node in the node pool to *Standard_NC6s_v3*.
    * `--node-taints`: Specifies a *sku=gpu:NoSchedule* taint on the node pool.
    * `--aks-custom-headers`: Specifies a specialized AKS GPU image, *UseGPUDedicatedVHD=true*. If your GPU SKU requires generation 2 VMs, use *--aks-custom-headers UseGPUDedicatedVHD=true,usegen2vm=true* instead.
    * `--enable-cluster-autoscaler`: Enables the cluster autoscaler.
    * `--min-count`: Configures the cluster autoscaler to maintain a minimum of one node in the node pool.
    * `--max-count`: Configures the cluster autoscaler to maintain a maximum of three nodes in the node pool.

    > [!NOTE]
    > Taints and VM sizes can only be set for node pools during node pool creation, but you can update autoscaler settings at any time.

7. Now that you successfully created a node pool using the GPU image, you can check that your [GPUs are schedulable](#confirm-that-gpus-are-schedulable) and [run a GPU workload](#run-a-gpu-enabled-workload).
## Confirm that GPUs are schedulable
After creating your cluster, confirm that GPUs are schedulable in Kubernetes.
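As a quick end-to-end check once GPUs are schedulable, a minimal pod manifest along these lines requests one GPU and tolerates the *sku=gpu:NoSchedule* taint applied to the node pool earlier. This is a hypothetical example; the pod name is arbitrary and the CUDA image tag is an assumption, not taken from this article:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-check
spec:
  restartPolicy: OnFailure
  tolerations:
  # Matches the sku=gpu:NoSchedule taint set on the node pool earlier.
  - key: "sku"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"
  containers:
  - name: cuda
    image: nvidia/cuda:12.3.1-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
```

Submit it with `kubectl apply -f`, then check the pod logs for `nvidia-smi` output to confirm the GPU is visible inside the container.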
