@@ -39,31 +39,17 @@ Apply the manifest from the previous step to your management cluster to have CAP
workload cluster:

```bash
- $ kubectl apply -f azure-gpu-cluster.yaml --server-side
+ $ kubectl apply -f azure-gpu-cluster.yaml
cluster.cluster.x-k8s.io/azure-gpu serverside-applied
azurecluster.infrastructure.cluster.x-k8s.io/azure-gpu serverside-applied
kubeadmcontrolplane.controlplane.cluster.x-k8s.io/azure-gpu-control-plane serverside-applied
azuremachinetemplate.infrastructure.cluster.x-k8s.io/azure-gpu-control-plane serverside-applied
machinedeployment.cluster.x-k8s.io/azure-gpu-md-0 serverside-applied
azuremachinetemplate.infrastructure.cluster.x-k8s.io/azure-gpu-md-0 serverside-applied
kubeadmconfigtemplate.bootstrap.cluster.x-k8s.io/azure-gpu-md-0 serverside-applied
- clusterresourceset.addons.cluster.x-k8s.io/crs-gpu-operator serverside-applied
- configmap/nvidia-clusterpolicy-crd serverside-applied
- configmap/nvidia-gpu-operator-components serverside-applied
- clusterresourceset.addons.cluster.x-k8s.io/azure-gpu-crs-0 serverside-applied
```

- <aside class="note">
-
- <h1>Note</h1>
-
- `--server-side` is used in `kubectl apply` because a config map created as part of this cluster exceeds the annotations size limit.
- More on server side apply can be found [here](https://kubernetes.io/docs/reference/using-api/server-side-apply/)
-
- </aside>
-
- Wait until the cluster and nodes are finished provisioning. The GPU nodes may take several minutes
- to provision, since each one must install drivers and supporting software.
+ Wait until the cluster and nodes are finished provisioning...

```bash
$ kubectl get cluster azure-gpu
@@ -75,38 +61,58 @@ azure-gpu-control-plane-t94nm azure:////subscriptions/<subscription_id>/resou
azure-gpu-md-0-f6b88dd78-vmkph azure:////subscriptions/<subscription_id>/resourceGroups/azure-gpu/providers/Microsoft.Compute/virtualMachines/azure-gpu-md-0-gcc8v Running v1.22.1
```

- Install a [CNI](https://cluster-api.sigs.k8s.io/user/quick-start.html#deploy-a-cni-solution) of your choice.
- Once the nodes are `Ready`, run the following commands against the workload cluster to check if all the `gpu-operator` resources are installed:
+ ... and then you can install a [CNI](https://cluster-api.sigs.k8s.io/user/quick-start.html#deploy-a-cni-solution) of your choice.
+
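+ As a minimal sketch, assuming you pick Calico and have already fetched the workload cluster kubeconfig (see the next section), the install is a single apply. The manifest URL and version here are illustrative; follow the linked quick start for the manifest that matches your setup:
+
+ ```bash
+ $ kubectl --kubeconfig ./azure-gpu-cluster.conf apply -f https://raw.githubusercontent.com/projectcalico/calico/v3.26.1/manifests/calico.yaml
+ ```
+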
+ Once all nodes are `Ready`, install the official NVIDIA gpu-operator via Helm.
+
+ ### Install the NVIDIA gpu-operator Helm chart
+
+ If you don't have `helm`, installation instructions for your environment can be found [here](https://helm.sh).
+
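+ For example, one common route is the official installer script (a sketch based on the Helm docs; review the script before running it):
+
+ ```bash
+ # Download and run the Helm 3 install script.
+ $ curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3.sh
+ $ chmod 700 get_helm.sh
+ $ ./get_helm.sh
+ ```
+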
+ First, grab the kubeconfig from your newly created cluster and save it to a file:
+
+ ```bash
+ $ clusterctl get kubeconfig azure-gpu > ./azure-gpu-cluster.conf
+ ```
+
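+ The commands below pass `--kubeconfig` explicitly so your default context stays on the management cluster. If you prefer, you can instead export it for the current shell session (assuming a POSIX shell):
+
+ ```bash
+ $ export KUBECONFIG=$(pwd)/azure-gpu-cluster.conf
+ ```
+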
+ Now we can use Helm to install the official chart:

```bash
- $ clusterctl get kubeconfig azure-gpu > azure-gpu-cluster.conf
- $ export KUBECONFIG=azure-gpu-cluster.conf
- $ kubectl get pods | grep gpu-operator
- default gpu-operator-1612821988-node-feature-discovery-master-664dnsmww 1/1 Running 0 107m
- default gpu-operator-1612821988-node-feature-discovery-worker-64mcz 1/1 Running 0 107m
- default gpu-operator-1612821988-node-feature-discovery-worker-h5rws 1/1 Running 0 107m
- $ kubectl get pods -n gpu-operator-resources
- NAME READY STATUS RESTARTS AGE
- gpu-feature-discovery-66d4f 1/1 Running 0 2s
- nvidia-container-toolkit-daemonset-lxpkx 1/1 Running 0 3m11s
- nvidia-dcgm-exporter-wwnsw 1/1 Running 0 5s
- nvidia-device-plugin-daemonset-lpdwz 1/1 Running 0 13s
- nvidia-device-plugin-validation 0/1 Completed 0 10s
- nvidia-driver-daemonset-w6lpb 1/1 Running 0 3m16s
+ $ helm install --kubeconfig ./azure-gpu-cluster.conf --repo https://helm.ngc.nvidia.com/nvidia gpu-operator --generate-name
```

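+ As a quick sanity check that the release was registered (the operator pods themselves take longer to come up), you can list releases on the workload cluster:
+
+ ```bash
+ $ helm --kubeconfig ./azure-gpu-cluster.conf list
+ ```
+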
+ The installation of GPU drivers via gpu-operator will take several minutes. Coffee or tea may be appropriate at this time.
+
+ After a few minutes, run the following command against the workload cluster to check whether all the `gpu-operator` resources are installed:
+
+ ```bash
+ $ kubectl --kubeconfig ./azure-gpu-cluster.conf get pods -o wide | grep 'gpu\|nvidia'
+ NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
+ default gpu-feature-discovery-r6zgh 1/1 Running 0 7m21s 192.168.132.75 azure-gpu-md-0-gcc8v <none> <none>
+ default gpu-operator-1674686292-node-feature-discovery-master-79d8pbcg6 1/1 Running 0 8m15s 192.168.96.7 azure-gpu-control-plane-nnb57 <none> <none>
+ default gpu-operator-1674686292-node-feature-discovery-worker-g9dj2 1/1 Running 0 8m15s 192.168.132.66 gpu-md-0-gcc8v <none> <none>
+ default gpu-operator-95b545d6f-rmlf2 1/1 Running 0 8m15s 192.168.132.67 gpu-md-0-gcc8v <none> <none>
+ default nvidia-container-toolkit-daemonset-hstgw 1/1 Running 0 7m21s 192.168.132.70 gpu-md-0-gcc8v <none> <none>
+ default nvidia-cuda-validator-pdmkl 0/1 Completed 0 3m47s 192.168.132.74 azure-gpu-md-0-gcc8v <none> <none>
+ default nvidia-dcgm-exporter-wjm7p 1/1 Running 0 7m21s 192.168.132.71 azure-gpu-md-0-gcc8v <none> <none>
+ default nvidia-device-plugin-daemonset-csv6k 1/1 Running 0 7m21s 192.168.132.73 azure-gpu-md-0-gcc8v <none> <none>
+ default nvidia-device-plugin-validator-gxzt2 0/1 Completed 0 2m49s 192.168.132.76 azure-gpu-md-0-gcc8v <none> <none>
+ default nvidia-driver-daemonset-zww52 1/1 Running 0 7m46s 192.168.132.68 azure-gpu-md-0-gcc8v <none> <none>
+ default nvidia-operator-validator-kjr6m 1/1 Running 0 7m21s 192.168.132.72 azure-gpu-md-0-gcc8v <none> <none>
+ ```
+
+ You should see all pods in either a state of `Running` or `Completed`. If that is the case, the driver installation and GPU node configuration were successful.
+
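+ If you would rather block until the operator reports healthy than re-run the command above, here is a sketch using `kubectl wait`; the `app=nvidia-operator-validator` label is an assumption based on current chart defaults, so adjust the selector for your chart version:
+
+ ```bash
+ # Wait up to 10 minutes for the operator's validator pod to become Ready.
+ $ kubectl --kubeconfig ./azure-gpu-cluster.conf wait pod -l app=nvidia-operator-validator --for=condition=Ready --timeout=600s
+ ```
+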
Then run the following commands against the workload cluster to verify that the
[NVIDIA device plugin](https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/master/nvidia-device-plugin.yml)
has initialized and the `nvidia.com/gpu` resource is available:

```bash
- $ kubectl -n kube-system get po | grep nvidia
- kube-system nvidia-device-plugin-daemonset-d5dn6 1/1 Running 0 16m
- $ kubectl get nodes
+ $ kubectl --kubeconfig ./azure-gpu-cluster.conf get nodes
NAME STATUS ROLES AGE VERSION
azure-gpu-control-plane-nnb57 Ready master 42m v1.22.1
azure-gpu-md-0-gcc8v Ready <none> 38m v1.22.1
- $ kubectl get node azure-gpu-md-0-gcc8v -o jsonpath={.status.allocatable} | jq
+ $ kubectl --kubeconfig ./azure-gpu-cluster.conf get node azure-gpu-md-0-gcc8v -o jsonpath={.status.allocatable} | jq
{
  "attachable-volumes-azure-disk": "12",
  "cpu": "6",
@@ -140,7 +146,7 @@ spec:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
EOF
- $ kubectl apply -f cuda-vector-add.yaml
+ $ kubectl --kubeconfig ./azure-gpu-cluster.conf apply -f cuda-vector-add.yaml
```

The container will download, run, and perform a [CUDA](https://developer.nvidia.com/cuda-zone)