articles/aks/gpu-cluster.md
Lines changed: 22 additions & 7 deletions
@@ -81,7 +81,7 @@ AKS provides a fully configured AKS image containing the [NVIDIA device plugin f
Now that you updated your cluster to use the AKS GPU image, you can add a node pool for GPU nodes to your cluster.

-* Add a node pool using the [`az aks nodepool add`][az-aks-nodepool-add] command. The following example command adds a node pool named *gpunp* to the *myAKSCluster* in the *myResourceGroup* resource group, sets the VM size for the node in the node pool to *Standard_NC6*, enables the cluster autoscaler, configures the cluster autoscaler to maintain a minimum of one node and a maximum of three nodes in the node pool, and specifies a specialized AKS GPU image and a *sku=gpu:NoSchedule* taint on the new node pool:
+* Add a node pool using the [`az aks nodepool add`][az-aks-nodepool-add] command.
```azurecli-interactive
az aks nodepool add \
@@ -97,16 +97,23 @@ Now that you updated your cluster to use the AKS GPU image, you can add a node p
--max-count 3
```

+The previous example command adds a node pool named *gpunp* to *myAKSCluster* in *myResourceGroup* and uses parameters to configure the following node pool settings:
+
+* `--node-vm-size`: Sets the VM size for the node in the node pool to *Standard_NC6*.
+* `--node-taints`: Specifies a *sku=gpu:NoSchedule* taint on the node pool.
+* `--aks-custom-headers`: Specifies a specialized AKS GPU image, *UseGPUDedicatedVHD=true*. If your GPU sku requires generation 2 VMs, use *--aks-custom-headers UseGPUDedicatedVHD=true,usegen2vm=true* instead.
+* `--enable-cluster-autoscaler`: Enables the cluster autoscaler.
+* `--min-count`: Configures the cluster autoscaler to maintain a minimum of one node in the node pool.
+* `--max-count`: Configures the cluster autoscaler to maintain a maximum of three nodes in the node pool.
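
Because the diff view collapses most of the command, the following is a minimal sketch of how the parameters listed above might fit together in one call. The resource names are the article's example values. Passing them through `--resource-group`, `--cluster-name`, and `--name` is an assumption, since the collapsed lines aren't visible here, and `run_or_print` is a hypothetical helper that only prints the command unless `RUN=1` is set.

```bash
#!/usr/bin/env sh
# Sketch only: combines the flags described in the parameter list.
# Assumption: the collapsed part of the command passes the example names
# through --resource-group, --cluster-name, and --name.
set -eu

# Hypothetical helper: print the command by default, execute it when RUN=1.
run_or_print() {
  if [ "${RUN:-0}" = "1" ]; then
    "$@"
  else
    printf '%s\n' "$*"
  fi
}

run_or_print az aks nodepool add \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name gpunp \
  --node-vm-size Standard_NC6 \
  --node-taints sku=gpu:NoSchedule \
  --aks-custom-headers UseGPUDedicatedVHD=true \
  --enable-cluster-autoscaler \
  --min-count 1 \
  --max-count 3
```

Running the script as-is prints the assembled command for review. Running it with `RUN=1` in a shell that's signed in to the Azure CLI would submit the request.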
+
> [!NOTE]
->
-> * A taint and VM size can only be set for node pools during node pool creation, but the autoscaler settings can be updated at any time.
-> * If your GPU sku requires generation 2 VMs, instead use *--aks-custom-headers UseGPUDedicatedVHD=true,usegen2vm=true*.
+> Taints and VM sizes can only be set for node pools during node pool creation, but you can update autoscaler settings at any time.
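
As an illustration of the note above, changing the autoscaler range later might look like the following sketch. It uses `az aks nodepool update` with `--update-cluster-autoscaler`. The new maximum of five nodes is a hypothetical value, not one taken from the article, and `run_or_print` is a hypothetical dry-run helper that prints the command unless `RUN=1` is set.

```bash
#!/usr/bin/env sh
# Sketch only: adjusts the autoscaler bounds on the existing gpunp node pool.
# Assumption: the new maximum (5) is an illustrative value.
set -eu

# Hypothetical helper: print the command by default, execute it when RUN=1.
run_or_print() {
  if [ "${RUN:-0}" = "1" ]; then
    "$@"
  else
    printf '%s\n' "$*"
  fi
}

run_or_print az aks nodepool update \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name gpunp \
  --update-cluster-autoscaler \
  --min-count 1 \
  --max-count 5
```

The taint and VM size don't appear in this command because, per the note, they can only be set when the node pool is created.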
### Manually install the NVIDIA device plugin
You can deploy a DaemonSet for the NVIDIA device plugin, which runs a pod on each node to provide the required drivers for the GPUs.

-1. Add a node pool to your cluster using the [`az aks nodepool add`][az-aks-nodepool-add] command. The following example command adds a node pool named *gpunp* to the *myAKSCluster* in the *myResourceGroup* resource group, sets the VM size for the nodes in the node pool to *Standard_NC6*, enables the cluster autoscaler, configures the cluster autoscaler to maintain a minimum of one node and a maximum of three nodes in the node pool, and specifies a *sku=gpu:NoSchedule* taint for the node pool.
+1. Add a node pool to your cluster using the [`az aks nodepool add`][az-aks-nodepool-add] command.
```azurecli-interactive
az aks nodepool add \
@@ -121,8 +128,16 @@ You can deploy a DaemonSet for the NVIDIA device plugin, which runs a pod on eac
--max-count 3
```

+The previous example command adds a node pool named *gpunp* to *myAKSCluster* in *myResourceGroup* and uses parameters to configure the following node pool settings:
+
+* `--node-vm-size`: Sets the VM size for the node in the node pool to *Standard_NC6*.
+* `--node-taints`: Specifies a *sku=gpu:NoSchedule* taint on the node pool.
+* `--enable-cluster-autoscaler`: Enables the cluster autoscaler.
+* `--min-count`: Configures the cluster autoscaler to maintain a minimum of one node in the node pool.
+* `--max-count`: Configures the cluster autoscaler to maintain a maximum of three nodes in the node pool.
+
> [!NOTE]
-> A taint and VM size can only be set for node pools during node pool creation, but the autoscaler settings can be updated at any time.
+> Taints and VM sizes can only be set for node pools during node pool creation, but you can update autoscaler settings at any time.
2. Create a namespace using the [`kubectl create namespace`][kubectl-create] command.
@@ -351,7 +366,7 @@ To see the GPU in action, you can schedule a GPU-enabled workload with the appro
| Metric name | Metric dimension (tags) | Description |
|---|---|---|
-| containerGpuDutyCycle | `container.azm.ms/clusterId`, `container.azm.ms/clusterName`, `containerName`, `gpuId`, `gpuModel`, `gpuVendor`| Percentage of time over the past sample period (60 seconds) during which GPU was busy/actively processing for a container. Duty cycle is a number between 1 and 100. |
+| containerGpuDutyCycle | `container.azm.ms/clusterId`, `container.azm.ms/clusterName`, `containerName`, `gpuId`, `gpuModel`, `gpuVendor`| Percentage of time over the past sample period (60 seconds) during which GPU was busy/actively processing for a container. Duty cycle is a number between 1 and 100. |
| containerGpuLimits | `container.azm.ms/clusterId`, `container.azm.ms/clusterName`, `containerName` | Each container can specify limits as one or more GPUs. It is not possible to request or limit a fraction of a GPU. |
| containerGpuRequests | `container.azm.ms/clusterId`, `container.azm.ms/clusterName`, `containerName` | Each container can request one or more GPUs. It is not possible to request or limit a fraction of a GPU. |
| containerGpumemoryTotalBytes | `container.azm.ms/clusterId`, `container.azm.ms/clusterName`, `containerName`, `gpuId`, `gpuModel`, `gpuVendor` | Amount of GPU Memory in bytes available to use for a specific container. |