---
title: Create a multi-instance GPU node pool in Azure Kubernetes Service (AKS)
description: Learn how to create a multi-instance GPU node pool in Azure Kubernetes Service (AKS).
ms.topic: article
ms.date: 08/30/2023
ms.author: juda
---

# Create a multi-instance GPU node pool in Azure Kubernetes Service (AKS)

Nvidia's A100 GPU can be divided into up to seven independent instances. Each instance has its own memory and Stream Multiprocessor (SM). For more information on the Nvidia A100, see [Nvidia A100 GPU][Nvidia A100 GPU].

This article walks you through how to create a multi-instance GPU node pool in an Azure Kubernetes Service (AKS) cluster.

## Prerequisites

* An Azure account with an active subscription. If you don't have one, you can [create an account for free](https://azure.microsoft.com/free/?WT.mc_id=A261C142F).
* Azure CLI version 2.2.0 or later installed and configured. Run `az --version` to find the version. If you need to install or upgrade, see [Install Azure CLI][install-azure-cli].
* The Kubernetes command-line client, [kubectl](https://kubernetes.io/docs/reference/kubectl/), installed and configured. If you use Azure Cloud Shell, `kubectl` is already installed. If you want to install it locally, you can use the [`az aks install-cli`][az-aks-install-cli] command.
* Helm v3 installed and configured. For more information, see [Installing Helm](https://helm.sh/docs/intro/install/).

## GPU instance profiles

GPU instance profiles define how GPUs are partitioned. The following table shows the available GPU instance profiles for the `Standard_ND96asr_v4`:

| Profile name | Fraction of SM | Fraction of memory | Number of instances created |
|--|--|--|--|
| MIG 1g.5gb | 1/7 | 1/8 | 7 |
| MIG 2g.10gb | 2/7 | 2/8 | 3 |
| MIG 3g.20gb | 3/7 | 4/8 | 2 |
| MIG 4g.20gb | 4/7 | 4/8 | 1 |
| MIG 7g.40gb | 7/7 | 8/8 | 1 |

As an example, the GPU instance profile of `MIG 1g.5gb` indicates that each GPU instance gets 1g of SM (computing resource) and 5gb of memory, so the GPU is partitioned into seven instances. Because the `Standard_ND96asr_v4` size contains eight A100 GPUs, applying this profile yields 8 x 7 = 56 schedulable GPU instances per node.

The available GPU instance profiles for this instance size include `MIG1g`, `MIG2g`, `MIG3g`, `MIG4g`, and `MIG7g`.

> [!IMPORTANT]
> You can't change the applied GPU instance profile after node pool creation.
## Create an AKS cluster

To get started, create a resource group and an AKS cluster. If you already have a cluster, you can skip these steps.

1. Create an Azure resource group using the [`az group create`][az-group-create] command.

    ```azurecli-interactive
    az group create --name myResourceGroup --location southcentralus
    ```

2. Create an AKS cluster using the [`az aks create`][az-aks-create] command.

    ```azurecli-interactive
    az aks create \
        --resource-group myResourceGroup \
        --name myAKSCluster \
        --node-count 1
    ```
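
3. To run the `kubectl` and `helm` commands later in this article, you also need credentials for the new cluster. The following sketch uses the `az aks get-credentials` command with the names chosen above:

    ```azurecli-interactive
    # Merge the cluster's credentials into your local kubeconfig
    az aks get-credentials --resource-group myResourceGroup --name myAKSCluster
    ```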

## Create a multi-instance GPU node pool

You can use either the Azure CLI or an HTTP request to the ARM API to create the node pool.

### [Azure CLI](#tab/azure-cli)

* Create a multi-instance GPU node pool using the [`az aks nodepool add`][az-aks-nodepool-add] command and specify the GPU instance profile.

    ```azurecli-interactive
    az aks nodepool add \
        --name mignode \
        --resource-group myResourceGroup \
        --cluster-name myAKSCluster \
        --node-vm-size Standard_ND96asr_v4 \
        --gpu-instance-profile MIG1g
    ```
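
    To double-check that the profile was applied, you can read it back from the new node pool. The following sketch uses the `az aks nodepool show` command with the names from the command above:

    ```azurecli-interactive
    # Returns "MIG1g" once the node pool is created
    az aks nodepool show \
        --resource-group myResourceGroup \
        --cluster-name myAKSCluster \
        --name mignode \
        --query gpuInstanceProfile
    ```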

### [HTTP request](#tab/http-request)

* Create a multi-instance GPU node pool by placing the GPU instance profile in the request body.

    ```http
    {
        "properties": {
            "count": 1,
            "vmSize": "Standard_ND96asr_v4",
            "type": "VirtualMachineScaleSets",
            "gpuInstanceProfile": "MIG1g"
        }
    }
    ```
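
    This body is the payload of a `PUT` request to the agent pools endpoint of the ARM API. The following sketch shows the shape of the full request, using a placeholder subscription ID and the resource names from this article; use an `api-version` of *2021-08-01* or later:

    ```http
    PUT https://management.azure.com/subscriptions/{subscriptionId}/resourceGroups/myResourceGroup/providers/Microsoft.ContainerService/managedClusters/myAKSCluster/agentPools/mignode?api-version=2021-08-01
    Content-Type: application/json
    ```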

---

## Determine multi-instance GPU (MIG) strategy

Before you install the Nvidia plugins, you need to specify which multi-instance GPU (MIG) strategy to use for GPU partitioning: *Single strategy* or *Mixed strategy*. The two strategies don't affect how you execute CPU workloads, but they do affect how GPU resources are displayed.

* **Single strategy**: The single strategy treats every GPU instance as a GPU. If you use this strategy, the GPU resources are displayed as `nvidia.com/gpu: 1`.
* **Mixed strategy**: The mixed strategy exposes the GPU instances and the GPU instance profile. If you use this strategy, the GPU resources are displayed as `nvidia.com/mig-1g.5gb: 1`.
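
For example, here's how a request for one GPU instance looks in a pod spec under each strategy. This is a hypothetical fragment for illustration; complete manifests appear later in this article.

```yaml
# Hypothetical pod spec fragment for illustration only
resources:
  limits:
    nvidia.com/gpu: 1           # Single strategy: request any GPU instance
    # nvidia.com/mig-1g.5gb: 1  # Mixed strategy: request a specific MIG profile
```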

## Install the NVIDIA device plugin and GPU feature discovery

1. Set your MIG strategy as an environment variable. You can use either single or mixed strategy.

    ```azurecli-interactive
    # Single strategy
    export MIG_STRATEGY=single

    # Mixed strategy
    export MIG_STRATEGY=mixed
    ```

2. Add the Nvidia device plugin and GPU feature discovery helm repos using the `helm repo add` and `helm repo update` commands.

    ```azurecli-interactive
    helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
    helm repo add nvgfd https://nvidia.github.io/gpu-feature-discovery
    helm repo update # do not forget to update the helm repo
    ```

3. Install the Nvidia device plugin using the `helm install` command and pass in the MIG strategy you set earlier.

    ```azurecli-interactive
    helm install \
    --version=0.7.0 \
    --generate-name \
    --set migStrategy=${MIG_STRATEGY} \
    nvdp/nvidia-device-plugin
    ```

4. Install GPU feature discovery using the `helm install` command.

    ```azurecli-interactive
    helm install \
    --version=0.2.0 \
    --generate-name \
    --set migStrategy=${MIG_STRATEGY} \
    nvgfd/gpu-feature-discovery
    ```

5. Confirm the node has multi-instance GPU capability using the `kubectl describe node` command. The following example command describes the node named *mignode*, which uses MIG1g as the GPU instance profile.

    ```azurecli-interactive
    kubectl describe node mignode
    ```

    Your output should resemble the following example output:

    ```output
    # Single strategy output
    Allocatable:
        nvidia.com/gpu: 56

    # Mixed strategy output
    Allocatable:
        nvidia.com/mig-1g.5gb: 56
    ```
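
    GPU feature discovery also applies `nvidia.com/*` labels to the node. To spot-check them, you can filter the node's labels; the following is a sketch assuming the node name *mignode* (exact label keys depend on your gpu-feature-discovery version):

    ```azurecli-interactive
    # List the nvidia.com/* labels applied to the node
    kubectl get node mignode -o jsonpath='{.metadata.labels}' | tr ',' '\n' | grep nvidia.com
    ```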

## Schedule work

The following examples are based on the CUDA base image version 12.1.1 for Ubuntu 22.04, tagged as `12.1.1-base-ubuntu22.04`.

### Single strategy

1. Create a file named `single-strategy-example.yaml` and copy in the following manifest.

    ```yaml
    apiVersion: v1
    kind: Pod
    metadata:
      name: nvidia-single
    spec:
      containers:
      - name: nvidia-single
        image: nvidia/cuda:12.1.1-base-ubuntu22.04
        command: ["/bin/sh"]
        args: ["-c","sleep 1000"]
        resources:
          limits:
            "nvidia.com/gpu": 1
    ```

2. Deploy the application using the `kubectl apply` command and specify the name of your YAML manifest.

    ```azurecli-interactive
    kubectl apply -f single-strategy-example.yaml
    ```

3. Verify the allocated GPU devices using the `kubectl exec` command. This command lists the GPU and MIG devices visible to the pod.

    ```azurecli-interactive
    kubectl exec nvidia-single -- nvidia-smi -L
    ```

    Your output should resemble the following example output, which lists the GPU and its MIG devices (the UUIDs in your output will differ):
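
    ```output
    GPU 0: NVIDIA A100 40GB PCIe (UUID: GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx)
      MIG 1g.5gb Device 0: (UUID: MIG-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx)
    ```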
> The "latest" tag for CUDA images has been deprecated on Docker Hub.
248
-
> Please refer to [NVIDIA's repository](https://hub.docker.com/r/nvidia/cuda/tags) for the latest images and corresponding tags
260
+
> The `latest` tag for CUDA images has been deprecated on Docker Hub. Please refer to [NVIDIA's repository](https://hub.docker.com/r/nvidia/cuda/tags) for the latest images and corresponding tags.
249
261

## Troubleshooting

If you don't see multi-instance GPU capability after creating the node pool, confirm the API version isn't older than *2021-08-01*.
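
One way to check is to read the node pool back through the ARM API with a recent API version. The following is a sketch using the `az rest` command (`{subscriptionId}` is substituted automatically from your active subscription):

```azurecli-interactive
# Returns "MIG1g" when the node pool was created with a MIG-aware API version
az rest --method get \
    --url "https://management.azure.com/subscriptions/{subscriptionId}/resourceGroups/myResourceGroup/providers/Microsoft.ContainerService/managedClusters/myAKSCluster/agentPools/mignode?api-version=2021-08-01" \
    --query "properties.gpuInstanceProfile"
```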

## Next steps

For more information on AKS node pools, see [Manage node pools for a cluster in AKS](./manage-node-pools.md).

<!-- LINKS - internal -->
[install-azure-cli]: /cli/azure/install-azure-cli
[az-aks-install-cli]: /cli/azure/aks#az-aks-install-cli
[az-group-create]: /cli/azure/group#az-group-create
[az-aks-create]: /cli/azure/aks#az-aks-create
[az-aks-nodepool-add]: /cli/azure/aks/nodepool#az-aks-nodepool-add

<!-- LINKS - external -->
[Nvidia A100 GPU]: https://www.nvidia.com/en-us/data-center/a100/