
Commit 979ea54

Merge pull request #249868 from schaffererin/multi-instance-gpu-freshness
MIG node pool in AKS freshness pass
2 parents 6a3f310 + 264df68 commit 979ea54

File tree

2 files changed

+146
-125
lines changed


articles/aks/TOC.yml

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -293,7 +293,7 @@
293293
href: manage-node-pools.md
294294
- name: Add an Azure Spot node pool
295295
href: spot-node-pool.md
296-
- name: Multi-instance GPU Node pool
296+
- name: Multi-instance GPU node pool
297297
href: gpu-multi-instance.md
298298
- name: Node pool snapshot
299299
href: node-pool-snapshot.md

articles/aks/gpu-multi-instance.md

Lines changed: 145 additions & 124 deletions
@@ -1,169 +1,182 @@
11
---
2-
title: Multi-instance GPU Node pool
3-
description: Learn how to create a Multi-instance GPU Node pool and schedule tasks on it
2+
title: Create a multi-instance GPU node pool in Azure Kubernetes Service (AKS)
3+
description: Learn how to create a multi-instance GPU node pool in Azure Kubernetes Service (AKS).
44
ms.topic: article
5-
ms.date: 1/24/2022
5+
ms.date: 08/30/2023
66
ms.author: juda
77
---
88

9-
# Multi-instance GPU Node pool
9+
# Create a multi-instance GPU node pool in Azure Kubernetes Service (AKS)
1010

11-
Nvidia's A100 GPU can be divided in up to seven independent instances. Each instance has their own memory and Stream Multiprocessor (SM). For more information on the Nvidia A100, follow [Nvidia A100 GPU][Nvidia A100 GPU].
11+
Nvidia's A100 GPU can be divided into up to seven independent instances. Each instance has its own memory and Streaming Multiprocessor (SM). For more information on the Nvidia A100, see [Nvidia A100 GPU][Nvidia A100 GPU].
1212

13-
This article will walk you through how to create a multi-instance GPU node pool on Azure Kubernetes Service clusters and schedule tasks.
13+
This article walks you through how to create a multi-instance GPU node pool in an Azure Kubernetes Service (AKS) cluster.
1414

15-
## GPU Instance Profile
15+
## Prerequisites
1616

17-
GPU Instance Profiles define how a GPU will be partitioned. The following table shows the available GPU Instance Profile for the `Standard_ND96asr_v4`
17+
* An Azure account with an active subscription. If you don't have one, you can [create an account for free](https://azure.microsoft.com/free/?WT.mc_id=A261C142F).
18+
* Azure CLI version 2.2.0 or later installed and configured. Run `az --version` to find the version. If you need to install or upgrade, see [Install Azure CLI][install-azure-cli].
19+
* The Kubernetes command-line client, [kubectl](https://kubernetes.io/docs/reference/kubectl/), installed and configured. If you use Azure Cloud Shell, `kubectl` is already installed. If you want to install it locally, you can use the [`az aks install-cli`][az-aks-install-cli] command.
20+
* Helm v3 installed and configured. For more information, see [Installing Helm](https://helm.sh/docs/intro/install/).
1821

22+
## GPU instance profiles
1923

20-
| Profile Name | Fraction of SM |Fraction of Memory | Number of Instances created |
24+
GPU instance profiles define how GPUs are partitioned. The following table shows the available GPU instance profiles for the `Standard_ND96asr_v4` size:
25+
26+
| Profile name | Fraction of SM | Fraction of memory | Number of instances created |
2127
|--|--|--|--|
2228
| MIG 1g.5gb | 1/7 | 1/8 | 7 |
2329
| MIG 2g.10gb | 2/7 | 2/8 | 3 |
2430
| MIG 3g.20gb | 3/7 | 4/8 | 2 |
2531
| MIG 4g.20gb | 4/7 | 4/8 | 1 |
2632
| MIG 7g.40gb | 7/7 | 8/8 | 1 |
2733

28-
As an example, the GPU Instance Profile of `MIG 1g.5gb` indicates that each GPU instance will have 1g SM(Computing resource) and 5gb memory. In this case, the GPU will be partitioned into seven instances.
34+
As an example, the GPU instance profile of `MIG 1g.5gb` indicates that each GPU instance has 1g of SM (compute resources) and 5 GB of memory. In this case, the GPU is partitioned into seven instances.
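As a sanity check on these fractions: the `Standard_ND96asr_v4` size carries eight physical A100 GPUs, so the `MIG1g` profile (seven instances per GPU) yields 56 schedulable GPU instances per node, which matches the allocatable count shown later in this article. A quick sketch of the arithmetic:

```shell
# Standard_ND96asr_v4 ships with 8 physical A100 GPUs.
# The MIG1g profile splits each GPU into 7 instances.
gpus_per_node=8
instances_per_gpu=7
echo $((gpus_per_node * instances_per_gpu))   # prints 56
```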
2935

30-
The available GPU Instance Profiles available for this instance size are `MIG1g`, `MIG2g`, `MIG3g`, `MIG4g`, `MIG7g`
36+
The GPU instance profiles available for this instance size are `MIG1g`, `MIG2g`, `MIG3g`, `MIG4g`, and `MIG7g`.
3137

3238
> [!IMPORTANT]
33-
> The applied GPU Instance Profile cannot be changed after node pool creation.
34-
39+
> You can't change the applied GPU instance profile after node pool creation.
3540
3641
## Create an AKS cluster
37-
To get started, create a resource group and an AKS cluster. If you already have a cluster, you can skip this step. Follow the example below to the resource group name `myresourcegroup` in the `southcentralus` region:
3842

39-
```azurecli-interactive
40-
az group create --name myresourcegroup --location southcentralus
41-
```
43+
1. Create an Azure resource group using the [`az group create`][az-group-create] command.
44+
45+
```azurecli-interactive
46+
az group create --name myResourceGroup --location southcentralus
47+
```
48+
49+
2. Create an AKS cluster using the [`az aks create`][az-aks-create] command.
4250
43-
```azurecli-interactive
44-
az aks create \
45-
--resource-group myresourcegroup \
46-
--name migcluster\
47-
--node-count 1
48-
```
51+
```azurecli-interactive
52+
az aks create \
53+
--resource-group myResourceGroup \
54+
--name myAKSCluster \
55+
--node-count 1
56+
```
4957
5058
## Create a multi-instance GPU node pool
5159
52-
You can choose to either use the `az` command line or http request to the ARM API to create the node pool
53-
54-
### Azure CLI
55-
If you're using command line, use the `az aks nodepool add` command to create the node pool and specify the GPU instance profile through `--gpu-instance-profile`
56-
```
57-
58-
az aks nodepool add \
59-
--name mignode \
60-
--resource-group myresourcegroup \
61-
--cluster-name migcluster \
62-
--node-vm-size Standard_ND96asr_v4 \
63-
--gpu-instance-profile MIG1g
64-
```
65-
66-
### HTTP request
67-
68-
If you're using http request, you can place GPU instance profile in the request body:
69-
```
70-
{
71-
"properties": {
72-
"count": 1,
73-
"vmSize": "Standard_ND96asr_v4",
74-
"type": "VirtualMachineScaleSets",
75-
"gpuInstanceProfile": "MIG1g"
60+
You can use either the Azure CLI or an HTTP request to the ARM API to create the node pool.
61+
62+
### [Azure CLI](#tab/azure-cli)
63+
64+
* Create a multi-instance GPU node pool using the [`az aks nodepool add`][az-aks-nodepool-add] command and specify the GPU instance profile.
65+
66+
```azurecli-interactive
67+
az aks nodepool add \
68+
--name mignode \
69+
--resource-group myResourceGroup \
70+
--cluster-name myAKSCluster \
71+
--node-vm-size Standard_ND96asr_v4 \
72+
--gpu-instance-profile MIG1g
73+
```
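To confirm the profile was applied, you can query it back from the new node pool with `az aks nodepool show` — a quick check sketch using the same names as the command above:

```shell
# Query the GPU instance profile applied to the node pool.
az aks nodepool show \
    --resource-group myResourceGroup \
    --cluster-name myAKSCluster \
    --name mignode \
    --query gpuInstanceProfile \
    --output tsv
```

If the profile was applied, the command returns `MIG1g`.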
74+
75+
### [HTTP request](#tab/http-request)
76+
77+
* Create a multi-instance GPU node pool by placing the GPU instance profile in the request body.
78+
79+
```http
80+
{
81+
"properties": {
82+
"count": 1,
83+
"vmSize": "Standard_ND96asr_v4",
84+
"type": "VirtualMachineScaleSets",
85+
"gpuInstanceProfile": "MIG1g"
86+
}
7687
}
77-
}
78-
```
88+
```
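One way to send this body to the ARM API is with `az rest`, which handles authentication for you. The following is a sketch — the subscription ID is a placeholder, the body is assumed to be saved locally as `mignodepool.json`, and the resource names match the CLI example:

```shell
# Placeholder values -- substitute your own subscription ID, resource group,
# cluster name, and node pool name.
SUBSCRIPTION_ID="00000000-0000-0000-0000-000000000000"
URL="https://management.azure.com/subscriptions/${SUBSCRIPTION_ID}/resourceGroups/myResourceGroup/providers/Microsoft.ContainerService/managedClusters/myAKSCluster/agentPools/mignode?api-version=2021-08-01"

# PUT the request body (saved as mignodepool.json) to create the node pool.
az rest --method put --url "$URL" --body @mignodepool.json
```

Note the `api-version=2021-08-01` in the URL; `gpuInstanceProfile` isn't recognized by older API versions, as the troubleshooting section below mentions.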
89+
90+
---
91+
92+
## Determine multi-instance GPU (MIG) strategy
93+
94+
Before you install the Nvidia plugins, you need to specify which multi-instance GPU (MIG) strategy to use for GPU partitioning: *Single strategy* or *Mixed strategy*. The two strategies don't affect how you execute CPU workloads; they only affect how GPU resources are displayed.
7995
96+
* **Single strategy**: The single strategy treats every GPU instance as a GPU. If you use this strategy, the GPU resources are displayed as `nvidia.com/gpu: 1`.
97+
* **Mixed strategy**: The mixed strategy exposes the GPU instances and the GPU instance profile. If you use this strategy, the GPU resources are displayed as `nvidia.com/mig-1g.5gb: 1`.
8098
99+
## Install the NVIDIA device plugin and GPU feature discovery
81100
101+
1. Set your MIG strategy as an environment variable. You can use either the single or the mixed strategy.
82102
83-
## Run tasks using kubectl
103+
```azurecli-interactive
104+
# Single strategy
105+
export MIG_STRATEGY=single
106+
107+
# Mixed strategy
108+
export MIG_STRATEGY=mixed
109+
```
84110
85-
### MIG strategy
86-
Before you install the Nvidia plugins, you need to specify which strategy to use for GPU partitioning.
111+
2. Add the Nvidia device plugin and GPU feature discovery helm repos using the `helm repo add` and `helm repo update` commands.
87112
88-
The two strategies "Single" and "Mixed" won't affect how you execute CPU workloads, but how GPU resources will be displayed.
113+
```azurecli-interactive
114+
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
115+
helm repo add nvgfd https://nvidia.github.io/gpu-feature-discovery
116+
helm repo update
117+
```
89118
90-
- Single Strategy
119+
3. Install the Nvidia device plugin using the `helm install` command.
91120
92-
The single strategy treats every GPU instance as a GPU. If you're using this strategy, the GPU resources will be displayed as:
121+
```azurecli-interactive
122+
helm install \
123+
--version=0.7.0 \
124+
--generate-name \
125+
--set migStrategy=${MIG_STRATEGY} \
126+
nvdp/nvidia-device-plugin
127+
```
93128
94-
```
95-
nvidia.com/gpu: 1
96-
```
129+
4. Install GPU feature discovery using the `helm install` command.
97130
98-
- Mixed Strategy
131+
```azurecli-interactive
132+
helm install \
133+
--version=0.2.0 \
134+
--generate-name \
135+
--set migStrategy=${MIG_STRATEGY} \
136+
nvgfd/gpu-feature-discovery
137+
```
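Once both charts are installed, the device plugin and feature discovery pods should be running on the MIG node. One way to spot them — a sketch, since the pod names carry the generated Helm release names and the namespace depends on the chart defaults:

```shell
# List all pods and filter for the two Nvidia components installed above.
kubectl get pods --all-namespaces | grep -E 'nvidia-device-plugin|gpu-feature-discovery'
```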
99138
100-
The mixed strategy will expose the GPU instances and the GPU instance profile. If you use this strategy, the GPU resource will be displayed as:
139+
## Confirm multi-instance GPU capability
101140
102-
```
103-
nvidia.com/mig1g.5gb: 1
104-
```
141+
1. Configure `kubectl` to connect to your AKS cluster using the [`az aks get-credentials`][az-aks-get-credentials] command.
105142
106-
### Install the NVIDIA device plugin and GPU feature discovery
143+
```azurecli-interactive
144+
az aks get-credentials --resource-group myResourceGroup --name myAKSCluster
145+
```
107146
108-
Set your MIG Strategy
109-
```
110-
export MIG_STRATEGY=single
111-
```
112-
or
113-
```
114-
export MIG_STRATEGY=mixed
115-
```
147+
2. Verify the connection to your cluster using the `kubectl get` command, which returns a list of the cluster nodes.
116148
117-
Install the Nvidia device plugin and GPU feature discovery using helm
149+
```azurecli-interactive
150+
kubectl get nodes -o wide
151+
```
118152
119-
```
120-
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
121-
helm repo add nvgfd https://nvidia.github.io/gpu-feature-discovery
122-
helm repo update #do not forget to update the helm repo
123-
```
153+
3. Confirm the node has multi-instance GPU capability using the `kubectl describe node` command. The following example command describes the node named *mignode*, which uses MIG1g as the GPU instance profile.
124154
125-
```
126-
helm install \
127-
--version=0.7.0 \
128-
--generate-name \
129-
--set migStrategy=${MIG_STRATEGY} \
130-
nvdp/nvidia-device-plugin
131-
```
155+
```azurecli-interactive
156+
kubectl describe node mignode
157+
```
132158
133-
```
134-
helm install \
135-
--version=0.2.0 \
136-
--generate-name \
137-
--set migStrategy=${MIG_STRATEGY} \
138-
nvgfd/gpu-feature-discovery
139-
```
159+
Your output should resemble the following example output:
140160
161+
```output
162+
# Single strategy output
163+
Allocatable:
164+
nvidia.com/gpu: 56
141165
142-
### Confirm multi-instance GPU capability
143-
As an example, if you used MIG1g as the GPU instance profile, confirm the node has multi-instance GPU capability by running:
144-
```
145-
kubectl describe node mignode
146-
```
147-
If you're using single strategy, you'll see:
148-
```
149-
Allocatable:
150-
nvidia.com/gpu: 56
151-
```
152-
If you're using mixed strategy, you'll see:
153-
```
154-
Allocatable:
155-
nvidia.com/mig-1g.5gb: 56
156-
```
166+
# Mixed strategy output
167+
Allocatable:
168+
nvidia.com/mig-1g.5gb: 56
169+
```
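`kubectl describe node` prints a long report; to pull out just the GPU-related allocatable resources, you can filter the output — a sketch against the *mignode* name used above (the resource name differs by strategy):

```shell
# Show only the nvidia.com/gpu or nvidia.com/mig-* resource lines.
kubectl describe node mignode | grep -E 'nvidia.com/(gpu|mig)'
```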
157170
158-
### Schedule work
171+
## Schedule work
159172
160173
The following examples are based on the CUDA base image version 12.1.1 for Ubuntu 22.04, tagged as `12.1.1-base-ubuntu22.04`.
161174
162-
- Single strategy
175+
### Single strategy
163176
164177
1. Create a file named `single-strategy-example.yaml` and copy in the following manifest.
165178
166-
```yaml
179+
```yaml
167180
apiVersion: v1
168181
kind: Pod
169182
metadata:
@@ -181,17 +194,17 @@ The following examples are based on cuda base image version 12.1.1 for Ubuntu22.
181194
182195
2. Deploy the application using the `kubectl apply` command and specify the name of your YAML manifest.
183196
184-
```
197+
```azurecli-interactive
185198
kubectl apply -f single-strategy-example.yaml
186199
```
187-
200+
188201
3. Verify the allocated GPU devices using the `kubectl exec` command. This command returns a list of the GPU devices visible to the pod.
189202
190-
```
203+
```azurecli-interactive
191204
kubectl exec nvidia-single -- nvidia-smi -L
192205
```
193206
194-
Your output should resemble the following example output:
207+
The following example resembles output showing successfully created deployments and services:
195208
196209
```output
197210
GPU 0: NVIDIA A100 40GB PCIe (UUID: GPU-48aeb943-9458-4282-da24-e5f49e0db44b)
@@ -204,7 +217,7 @@ The following examples are based on cuda base image version 12.1.1 for Ubuntu22.
204217
MIG 1g.5gb Device 6: (UUID: MIG-37e055e8-8890-567f-a646-ebf9fde3ce7a)
205218
```
206219
207-
- Mixed mode strategy
220+
### Mixed strategy
208221
209222
1. Create a file named `mixed-strategy-example.yaml` and copy in the following manifest.
210223
@@ -226,33 +239,41 @@ The following examples are based on cuda base image version 12.1.1 for Ubuntu22.
226239
227240
2. Deploy the application using the `kubectl apply` command and specify the name of your YAML manifest.
228241
229-
```
242+
```azurecli-interactive
230243
kubectl apply -f mixed-strategy-example.yaml
231244
```
232-
245+
233246
3. Verify the allocated GPU devices using the `kubectl exec` command. This command returns a list of the GPU devices visible to the pod.
234247
235-
```
248+
```azurecli-interactive
236249
kubectl exec nvidia-mixed -- nvidia-smi -L
237250
```
238251
239-
Your output should resemble the following example output:
252+
The following example resembles output showing successfully created deployments and services:
240253
241254
```output
242255
GPU 0: NVIDIA A100 40GB PCIe (UUID: GPU-48aeb943-9458-4282-da24-e5f49e0db44b)
243256
MIG 1g.5gb Device 0: (UUID: MIG-fb42055e-9e53-5764-9278-438605a3014c)
244257
```
245258
246259
> [!IMPORTANT]
247-
> The "latest" tag for CUDA images has been deprecated on Docker Hub.
248-
> Please refer to [NVIDIA's repository](https://hub.docker.com/r/nvidia/cuda/tags) for the latest images and corresponding tags
260+
> The `latest` tag for CUDA images has been deprecated on Docker Hub. Refer to [NVIDIA's repository](https://hub.docker.com/r/nvidia/cuda/tags) for the latest images and corresponding tags.
249261
250262
## Troubleshooting
251-
- If you do not see multi-instance GPU capability after the node pool has been created, confirm the API version is not older than 2021-08-01.
252263
253-
<!-- LINKS - internal -->
264+
If you don't see multi-instance GPU capability after creating the node pool, confirm the API version isn't older than *2021-08-01*.
265+
266+
## Next steps
254267
268+
For more information on AKS node pools, see [Manage node pools for a cluster in AKS](./manage-node-pools.md).
269+
270+
<!-- LINKS - internal -->
271+
[az-group-create]: /cli/azure/group#az_group_create
272+
[az-aks-create]: /cli/azure/aks#az_aks_create
273+
[az-aks-nodepool-add]: /cli/azure/aks/nodepool#az_aks_nodepool_add
274+
[install-azure-cli]: /cli/azure/install-azure-cli
275+
[az-aks-install-cli]: /cli/azure/aks#az_aks_install_cli
276+
[az-aks-get-credentials]: /cli/azure/aks#az_aks_get_credentials
255277
256278
<!-- LINKS - external-->
257279
[Nvidia A100 GPU]:https://www.nvidia.com/en-us/data-center/a100/
258-
