
Commit 2f46563

Freshness/editing pass for Use GPUs on AKS
1 parent 63614fa commit 2f46563


1 file changed: +22 -7 lines changed


articles/aks/gpu-cluster.md

Lines changed: 22 additions & 7 deletions
@@ -81,7 +81,7 @@ AKS provides a fully configured AKS image containing the [NVIDIA device plugin f

 Now that you updated your cluster to use the AKS GPU image, you can add a node pool for GPU nodes to your cluster.

-* Add a node pool using the [`az aks nodepool add`][az-aks-nodepool-add] command. The following example command adds a node pool named *gpunp* to the *myAKSCluster* in the *myResourceGroup* resource group, sets the VM size for the node in the node pool to *Standard_NC6*, enables the cluster autoscaler, configures the cluster autoscaler to maintain a minimum of one node and a maximum of three nodes in the node pool, and specifies a specialized AKS GPU image and a *sku=gpu:NoSchedule* taint on the new node pool:
+* Add a node pool using the [`az aks nodepool add`][az-aks-nodepool-add] command.

 ```azurecli-interactive
 az aks nodepool add \
@@ -97,16 +97,23 @@ Now that you updated your cluster to use the AKS GPU image, you can add a node p
 --max-count 3
 ```

+The previous example command adds a node pool named *gpunp* to *myAKSCluster* in *myResourceGroup* and uses parameters to configure the following node pool settings:
+
+* `--node-vm-size`: Sets the VM size for the node in the node pool to *Standard_NC6*.
+* `--node-taints`: Specifies a *sku=gpu:NoSchedule* taint on the node pool.
+* `--aks-custom-headers`: Specifies a specialized AKS GPU image, *UseGPUDedicatedVHD=true*. If your GPU sku requires generation 2 VMs, use *--aks-custom-headers UseGPUDedicatedVHD=true,usegen2vm=true* instead.
+* `--enable-cluster-autoscaler`: Enables the cluster autoscaler.
+* `--min-count`: Configures the cluster autoscaler to maintain a minimum of one node in the node pool.
+* `--max-count`: Configures the cluster autoscaler to maintain a maximum of three nodes in the node pool.
+
 > [!NOTE]
->
-> * A taint and VM size can only be set for node pools during node pool creation, but the autoscaler settings can be updated at any time.
-> * If your GPU sku requires generation 2 VMs, instead use *--aks-custom-headers UseGPUDedicatedVHD=true,usegen2vm=true*.
+> Taints and VM sizes can only be set for node pools during node pool creation, but you can update autoscaler settings at any time.

 ### Manually install the NVIDIA device plugin

 You can deploy a DaemonSet for the NVIDIA device plugin, which runs a pod on each node to provide the required drivers for the GPUs.

-1. Add a node pool to your cluster using the [`az aks nodepool add`][az-aks-nodepool-add] command. The following example command adds a node pool named *gpunp* to the *myAKSCluster* in the *myResourceGroup* resource group, sets the VM size for the nodes in the node pool to *Standard_NC6*, enables the cluster autoscaler, configures the cluster autoscaler to maintain a minimum of one node and a maximum of three nodes in the node pool, and specifies a *sku=gpu:NoSchedule* taint for the node pool.
+1. Add a node pool to your cluster using the [`az aks nodepool add`][az-aks-nodepool-add] command.

 ```azurecli-interactive
 az aks nodepool add \
@@ -121,8 +128,16 @@ You can deploy a DaemonSet for the NVIDIA device plugin, which runs a pod on eac
 --max-count 3
 ```

+The previous example command adds a node pool named *gpunp* to *myAKSCluster* in *myResourceGroup* and uses parameters to configure the following node pool settings:
+
+* `--node-vm-size`: Sets the VM size for the node in the node pool to *Standard_NC6*.
+* `--node-taints`: Specifies a *sku=gpu:NoSchedule* taint on the node pool.
+* `--enable-cluster-autoscaler`: Enables the cluster autoscaler.
+* `--min-count`: Configures the cluster autoscaler to maintain a minimum of one node in the node pool.
+* `--max-count`: Configures the cluster autoscaler to maintain a maximum of three nodes in the node pool.
+
 > [!NOTE]
-> A taint and VM size can only be set for node pools during node pool creation, but the autoscaler settings can be updated at any time.
+> Taints and VM sizes can only be set for node pools during node pool creation, but you can update autoscaler settings at any time.

 2. Create a namespace using the [`kubectl create namespace`][kubectl-create] command.

@@ -351,7 +366,7 @@ To see the GPU in action, you can schedule a GPU-enabled workload with the appro

 | Metric name | Metric dimension (tags) | Description |
 |-------------|-------------------------|-------------|
-| containerGpuDutyCycle | `container.azm.ms/clusterId`, `container.azm.ms/clusterName`, `containerName`, `gpuId`, `gpuModel`, `gpuVendor` | Percentage of time over the past sample period (60 seconds) during which GPU was busy/actively processing for a container. Duty cycle is a number between 1 and 100. |
+| containerGpuDutyCycle | `container.azm.ms/clusterId`, `container.azm.ms/clusterName`, `containerName`, `gpuId`, `gpuModel`, `gpuVendor`| Percentage of time over the past sample period (60 seconds) during which GPU was busy/actively processing for a container. Duty cycle is a number between 1 and 100. |
 | containerGpuLimits | `container.azm.ms/clusterId`, `container.azm.ms/clusterName`, `containerName` | Each container can specify limits as one or more GPUs. It is not possible to request or limit a fraction of a GPU. |
 | containerGpuRequests | `container.azm.ms/clusterId`, `container.azm.ms/clusterName`, `containerName` | Each container can request one or more GPUs. It is not possible to request or limit a fraction of a GPU. |
 | containerGpumemoryTotalBytes | `container.azm.ms/clusterId`, `container.azm.ms/clusterName`, `containerName`, `gpuId`, `gpuModel`, `gpuVendor` | Amount of GPU Memory in bytes available to use for a specific container. |
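
The reworded note in this change says that autoscaler settings, unlike taints and VM sizes, can be updated after a node pool exists. The following is a minimal sketch of what such an update might look like, assuming the article's example names (*myResourceGroup*, *myAKSCluster*, *gpunp*) and an illustrative new maximum of five nodes. The helper function `autoscaler_update_cmd` is hypothetical and not part of the article; it only prints an `az aks nodepool update` command for review and doesn't run it.

```shell
# Hypothetical helper: prints (does not run) the Azure CLI command that
# changes the cluster autoscaler bounds on an existing node pool.
# Arguments: resource group, cluster name, node pool name, min count, max count.
autoscaler_update_cmd() {
  rg="$1"; cluster="$2"; pool="$3"; min="$4"; max="$5"

  # Reject bounds where the minimum exceeds the maximum.
  if [ "$min" -gt "$max" ]; then
    echo "min-count ($min) must not exceed max-count ($max)" >&2
    return 1
  fi

  printf 'az aks nodepool update --resource-group %s --cluster-name %s --name %s --update-cluster-autoscaler --min-count %s --max-count %s\n' \
    "$rg" "$cluster" "$pool" "$min" "$max"
}

# Example values taken from the article; the new maximum of 5 is illustrative.
autoscaler_update_cmd myResourceGroup myAKSCluster gpunp 1 5
```

To apply the change, review the printed command and run it in a shell that's signed in to the Azure CLI. Printing first keeps the sketch safe to try without touching a real cluster.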
