Commit 07bdf89

helm gpu-operator instead of ClusterResourceSet
1 parent 945ee51 commit 07bdf89

File tree

14 files changed: +127 −15327 lines

docs/book/src/topics/gpu.md

Lines changed: 43 additions & 37 deletions
@@ -39,31 +39,17 @@ Apply the manifest from the previous step to your management cluster to have CAP
 workload cluster:
 
 ```bash
-$ kubectl apply -f azure-gpu-cluster.yaml --server-side
+$ kubectl apply -f azure-gpu-cluster.yaml
 cluster.cluster.x-k8s.io/azure-gpu serverside-applied
 azurecluster.infrastructure.cluster.x-k8s.io/azure-gpu serverside-applied
 kubeadmcontrolplane.controlplane.cluster.x-k8s.io/azure-gpu-control-plane serverside-applied
 azuremachinetemplate.infrastructure.cluster.x-k8s.io/azure-gpu-control-plane serverside-applied
 machinedeployment.cluster.x-k8s.io/azure-gpu-md-0 serverside-applied
 azuremachinetemplate.infrastructure.cluster.x-k8s.io/azure-gpu-md-0 serverside-applied
 kubeadmconfigtemplate.bootstrap.cluster.x-k8s.io/azure-gpu-md-0 serverside-applied
-clusterresourceset.addons.cluster.x-k8s.io/crs-gpu-operator serverside-applied
-configmap/nvidia-clusterpolicy-crd serverside-applied
-configmap/nvidia-gpu-operator-components serverside-applied
-clusterresourceset.addons.cluster.x-k8s.io/azure-gpu-crs-0 serverside-applied
 ```
 
-<aside class="note">
-
-<h1> Note </h1>
-
-`--server-side` is used in `kubectl apply` because a config map created as part of this cluster exceeds the annotations size limit.
-More on server side apply can be found [here](https://kubernetes.io/docs/reference/using-api/server-side-apply/)
-
-</aside>
-
-Wait until the cluster and nodes are finished provisioning. The GPU nodes may take several minutes
-to provision, since each one must install drivers and supporting software.
+Wait until the cluster and nodes are finished provisioning...
 
 ```bash
 $ kubectl get cluster azure-gpu
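Not part of this commit, but a handy companion to the wait above: provisioning progress can be watched from the management cluster. A minimal sketch, assuming `clusterctl` is installed and your current kubeconfig points at the management cluster:

```bash
# Watch CAPI Machine objects until the control plane and GPU workers reach Running.
$ kubectl get machines -w
# Or print a one-shot readiness tree for the whole cluster:
$ clusterctl describe cluster azure-gpu
```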
@@ -75,38 +61,58 @@ azure-gpu-control-plane-t94nm azure:////subscriptions/<subscription_id>/resou
 azure-gpu-md-0-f6b88dd78-vmkph   azure:////subscriptions/<subscription_id>/resourceGroups/azure-gpu/providers/Microsoft.Compute/virtualMachines/azure-gpu-md-0-gcc8v   Running   v1.22.1
 ```
 
-Install a [CNI](https://cluster-api.sigs.k8s.io/user/quick-start.html#deploy-a-cni-solution) of your choice.
-Once the nodes are `Ready`, run the following commands against the workload cluster to check if all the `gpu-operator` resources are installed:
+... and then you can install a [CNI](https://cluster-api.sigs.k8s.io/user/quick-start.html#deploy-a-cni-solution) of your choice.
+
+Once all nodes are `Ready`, install the official NVIDIA gpu-operator via Helm.
+
+### Install nvidia gpu-operator Helm chart
+
+If you don't have `helm`, installation instructions for your environment can be found [here](https://helm.sh).
+
+First, grab the kubeconfig from your newly created cluster and save it to a file:
+
+```bash
+$ clusterctl get kubeconfig azure-gpu > ./azure-gpu-cluster.conf
+```
+
+Now we can use Helm to install the official chart:
 
 ```bash
-$ clusterctl get kubeconfig azure-gpu > azure-gpu-cluster.conf
-$ export KUBECONFIG=azure-gpu-cluster.conf
-$ kubectl get pods | grep gpu-operator
-default   gpu-operator-1612821988-node-feature-discovery-master-664dnsmww   1/1   Running   0   107m
-default   gpu-operator-1612821988-node-feature-discovery-worker-64mcz       1/1   Running   0   107m
-default   gpu-operator-1612821988-node-feature-discovery-worker-h5rws       1/1   Running   0   107m
-$ kubectl get pods -n gpu-operator-resources
-NAME                                       READY   STATUS      RESTARTS   AGE
-gpu-feature-discovery-66d4f                1/1     Running     0          2s
-nvidia-container-toolkit-daemonset-lxpkx   1/1     Running     0          3m11s
-nvidia-dcgm-exporter-wwnsw                 1/1     Running     0          5s
-nvidia-device-plugin-daemonset-lpdwz       1/1     Running     0          13s
-nvidia-device-plugin-validation            0/1     Completed   0          10s
-nvidia-driver-daemonset-w6lpb              1/1     Running     0          3m16s
+$ helm install --kubeconfig ./azure-gpu-cluster.conf --repo https://helm.ngc.nvidia.com/nvidia gpu-operator --generate-name
 ```
 
+The installation of GPU drivers via gpu-operator will take several minutes. Coffee or tea may be appropriate at this time.
+
+After a time, you may run the following command against the workload cluster to check if all the `gpu-operator` resources are installed:
+
+```bash
+$ kubectl --kubeconfig ./azure-gpu-cluster.conf get pods -o wide | grep 'gpu\|nvidia'
+NAMESPACE   NAME                                                               READY   STATUS      RESTARTS   AGE     IP               NODE                            NOMINATED NODE   READINESS GATES
+default     gpu-feature-discovery-r6zgh                                        1/1     Running     0          7m21s   192.168.132.75   azure-gpu-md-0-gcc8v            <none>           <none>
+default     gpu-operator-1674686292-node-feature-discovery-master-79d8pbcg6   1/1     Running     0          8m15s   192.168.96.7     azure-gpu-control-plane-nnb57   <none>           <none>
+default     gpu-operator-1674686292-node-feature-discovery-worker-g9dj2       1/1     Running     0          8m15s   192.168.132.66   gpu-md-0-gcc8v                  <none>           <none>
+default     gpu-operator-95b545d6f-rmlf2                                       1/1     Running     0          8m15s   192.168.132.67   gpu-md-0-gcc8v                  <none>           <none>
+default     nvidia-container-toolkit-daemonset-hstgw                           1/1     Running     0          7m21s   192.168.132.70   gpu-md-0-gcc8v                  <none>           <none>
+default     nvidia-cuda-validator-pdmkl                                        0/1     Completed   0          3m47s   192.168.132.74   azure-gpu-md-0-gcc8v            <none>           <none>
+default     nvidia-dcgm-exporter-wjm7p                                         1/1     Running     0          7m21s   192.168.132.71   azure-gpu-md-0-gcc8v            <none>           <none>
+default     nvidia-device-plugin-daemonset-csv6k                               1/1     Running     0          7m21s   192.168.132.73   azure-gpu-md-0-gcc8v            <none>           <none>
+default     nvidia-device-plugin-validator-gxzt2                               0/1     Completed   0          2m49s   192.168.132.76   azure-gpu-md-0-gcc8v            <none>           <none>
+default     nvidia-driver-daemonset-zww52                                      1/1     Running     0          7m46s   192.168.132.68   azure-gpu-md-0-gcc8v            <none>           <none>
+default     nvidia-operator-validator-kjr6m                                    1/1     Running     0          7m21s   192.168.132.72   azure-gpu-md-0-gcc8v            <none>           <none>
+```
+
+You should see all pods in either a state of `Running` or `Completed`. If that is the case, then you know the driver installation and GPU node configuration is successful.
+
 Then run the following commands against the workload cluster to verify that the
 [NVIDIA device plugin](https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/master/nvidia-device-plugin.yml)
 has initialized and the `nvidia.com/gpu` resource is available:
 
 ```bash
-$ kubectl -n kube-system get po | grep nvidia
-kube-system   nvidia-device-plugin-daemonset-d5dn6   1/1   Running   0   16m
-$ kubectl get nodes
+$ kubectl --kubeconfig ./azure-gpu-cluster.conf get nodes
 NAME                            STATUS   ROLES    AGE   VERSION
 azure-gpu-control-plane-nnb57   Ready    master   42m   v1.22.1
 azure-gpu-md-0-gcc8v            Ready    <none>   38m   v1.22.1
-$ kubectl get node azure-gpu-md-0-gcc8v -o jsonpath={.status.allocatable} | jq
+$ kubectl --kubeconfig ./azure-gpu-cluster.conf get node azure-gpu-md-0-gcc8v -o jsonpath={.status.allocatable} | jq
 {
   "attachable-volumes-azure-disk": "12",
   "cpu": "6",
@@ -140,7 +146,7 @@ spec:
      limits:
         nvidia.com/gpu: 1 # requesting 1 GPU
 EOF
-$ kubectl apply -f cuda-vector-add.yaml
+$ kubectl --kubeconfig ./azure-gpu-cluster.conf apply -f cuda-vector-add.yaml
 ```
 
 The container will download, run, and perform a [CUDA](https://developer.nvidia.com/cuda-zone)
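Not part of this commit, but once that test pod finishes, its result can be checked directly; a sketch, reusing the same kubeconfig and the `cuda-vector-add` pod name from the manifest above:

```bash
# The pod should reach status Completed when the CUDA sample exits cleanly.
$ kubectl --kubeconfig ./azure-gpu-cluster.conf get po cuda-vector-add
# The sample writes its result to stdout; inspect it via the pod logs.
$ kubectl --kubeconfig ./azure-gpu-cluster.conf logs cuda-vector-add
```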

hack/fetch-nvidia-resources.sh

Lines changed: 0 additions & 60 deletions
This file was deleted.
