Commit 07bdf89

helm gpu-operator instead of ClusterResourceSet
1 parent 945ee51 commit 07bdf89

File tree

14 files changed: +127 −15327 lines

docs/book/src/topics/gpu.md

Lines changed: 43 additions & 37 deletions
@@ -39,31 +39,17 @@ Apply the manifest from the previous step to your management cluster to have CAP
 workload cluster:
 
 ```bash
-$ kubectl apply -f azure-gpu-cluster.yaml --server-side
+$ kubectl apply -f azure-gpu-cluster.yaml
 cluster.cluster.x-k8s.io/azure-gpu serverside-applied
 azurecluster.infrastructure.cluster.x-k8s.io/azure-gpu serverside-applied
 kubeadmcontrolplane.controlplane.cluster.x-k8s.io/azure-gpu-control-plane serverside-applied
 azuremachinetemplate.infrastructure.cluster.x-k8s.io/azure-gpu-control-plane serverside-applied
 machinedeployment.cluster.x-k8s.io/azure-gpu-md-0 serverside-applied
 azuremachinetemplate.infrastructure.cluster.x-k8s.io/azure-gpu-md-0 serverside-applied
 kubeadmconfigtemplate.bootstrap.cluster.x-k8s.io/azure-gpu-md-0 serverside-applied
-clusterresourceset.addons.cluster.x-k8s.io/crs-gpu-operator serverside-applied
-configmap/nvidia-clusterpolicy-crd serverside-applied
-configmap/nvidia-gpu-operator-components serverside-applied
-clusterresourceset.addons.cluster.x-k8s.io/azure-gpu-crs-0 serverside-applied
 ```
 
-<aside class="note">
-
-<h1> Note </h1>
-
-`--server-side` is used in `kubectl apply` because a config map created as part of this cluster exceeds the annotations size limit.
-More on server side apply can be found [here](https://kubernetes.io/docs/reference/using-api/server-side-apply/)
-
-</aside>
-
-Wait until the cluster and nodes are finished provisioning. The GPU nodes may take several minutes
-to provision, since each one must install drivers and supporting software.
+Wait until the cluster and nodes are finished provisioning...
 
 ```bash
 $ kubectl get cluster azure-gpu
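Not part of this commit, but a handy companion to the wait above: provisioning progress can be watched from the management cluster. A minimal sketch, assuming `clusterctl` is installed and your current kubeconfig points at the management cluster:

```bash
# Watch CAPI Machine objects until the control plane and GPU workers reach Running.
$ kubectl get machines -w
# Or print a one-shot readiness tree for the whole cluster:
$ clusterctl describe cluster azure-gpu
```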
@@ -75,38 +61,58 @@ azure-gpu-control-plane-t94nm azure:////subscriptions/<subscription_id>/resou
 azure-gpu-md-0-f6b88dd78-vmkph   azure:////subscriptions/<subscription_id>/resourceGroups/azure-gpu/providers/Microsoft.Compute/virtualMachines/azure-gpu-md-0-gcc8v   Running   v1.22.1
 ```
 
-Install a [CNI](https://cluster-api.sigs.k8s.io/user/quick-start.html#deploy-a-cni-solution) of your choice.
-Once the nodes are `Ready`, run the following commands against the workload cluster to check if all the `gpu-operator` resources are installed:
+... and then you can install a [CNI](https://cluster-api.sigs.k8s.io/user/quick-start.html#deploy-a-cni-solution) of your choice.
+
+Once all nodes are `Ready`, install the official NVIDIA gpu-operator via Helm.
+
+### Install nvidia gpu-operator Helm chart
+
+If you don't have `helm`, installation instructions for your environment can be found [here](https://helm.sh).
+
+First, grab the kubeconfig from your newly created cluster and save it to a file:
+
+```bash
+$ clusterctl get kubeconfig azure-gpu > ./azure-gpu-cluster.conf
+```
+
+Now we can use Helm to install the official chart:
 
 ```bash
-$ clusterctl get kubeconfig azure-gpu > azure-gpu-cluster.conf
-$ export KUBECONFIG=azure-gpu-cluster.conf
-$ kubectl get pods | grep gpu-operator
-default   gpu-operator-1612821988-node-feature-discovery-master-664dnsmww   1/1   Running   0   107m
-default   gpu-operator-1612821988-node-feature-discovery-worker-64mcz       1/1   Running   0   107m
-default   gpu-operator-1612821988-node-feature-discovery-worker-h5rws       1/1   Running   0   107m
-$ kubectl get pods -n gpu-operator-resources
-NAME                                       READY   STATUS      RESTARTS   AGE
-gpu-feature-discovery-66d4f                1/1     Running     0          2s
-nvidia-container-toolkit-daemonset-lxpkx   1/1     Running     0          3m11s
-nvidia-dcgm-exporter-wwnsw                 1/1     Running     0          5s
-nvidia-device-plugin-daemonset-lpdwz       1/1     Running     0          13s
-nvidia-device-plugin-validation            0/1     Completed   0          10s
-nvidia-driver-daemonset-w6lpb              1/1     Running     0          3m16s
+$ helm install --kubeconfig ./azure-gpu-cluster.conf --repo https://helm.ngc.nvidia.com/nvidia gpu-operator --generate-name
 ```
 
+The installation of GPU drivers via gpu-operator will take several minutes. Coffee or tea may be appropriate at this time.
+
+After a time, you may run the following command against the workload cluster to check if all the `gpu-operator` resources are installed:
+
+```bash
+$ kubectl --kubeconfig ./azure-gpu-cluster.conf get pods -o wide | grep 'gpu\|nvidia'
+NAMESPACE   NAME                                                               READY   STATUS      RESTARTS   AGE     IP               NODE                            NOMINATED NODE   READINESS GATES
+default     gpu-feature-discovery-r6zgh                                        1/1     Running     0          7m21s   192.168.132.75   azure-gpu-md-0-gcc8v            <none>           <none>
+default     gpu-operator-1674686292-node-feature-discovery-master-79d8pbcg6   1/1     Running     0          8m15s   192.168.96.7     azure-gpu-control-plane-nnb57   <none>           <none>
+default     gpu-operator-1674686292-node-feature-discovery-worker-g9dj2       1/1     Running     0          8m15s   192.168.132.66   gpu-md-0-gcc8v                  <none>           <none>
+default     gpu-operator-95b545d6f-rmlf2                                       1/1     Running     0          8m15s   192.168.132.67   gpu-md-0-gcc8v                  <none>           <none>
+default     nvidia-container-toolkit-daemonset-hstgw                           1/1     Running     0          7m21s   192.168.132.70   gpu-md-0-gcc8v                  <none>           <none>
+default     nvidia-cuda-validator-pdmkl                                        0/1     Completed   0          3m47s   192.168.132.74   azure-gpu-md-0-gcc8v            <none>           <none>
+default     nvidia-dcgm-exporter-wjm7p                                         1/1     Running     0          7m21s   192.168.132.71   azure-gpu-md-0-gcc8v            <none>           <none>
+default     nvidia-device-plugin-daemonset-csv6k                               1/1     Running     0          7m21s   192.168.132.73   azure-gpu-md-0-gcc8v            <none>           <none>
+default     nvidia-device-plugin-validator-gxzt2                               0/1     Completed   0          2m49s   192.168.132.76   azure-gpu-md-0-gcc8v            <none>           <none>
+default     nvidia-driver-daemonset-zww52                                      1/1     Running     0          7m46s   192.168.132.68   azure-gpu-md-0-gcc8v            <none>           <none>
+default     nvidia-operator-validator-kjr6m                                    1/1     Running     0          7m21s   192.168.132.72   azure-gpu-md-0-gcc8v            <none>           <none>
+```
+
+You should see all pods in either a state of `Running` or `Completed`. If that is the case, then you know the driver installation and GPU node configuration is successful.
+
 Then run the following commands against the workload cluster to verify that the
 [NVIDIA device plugin](https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/master/nvidia-device-plugin.yml)
 has initialized and the `nvidia.com/gpu` resource is available:
 
 ```bash
-$ kubectl -n kube-system get po | grep nvidia
-kube-system   nvidia-device-plugin-daemonset-d5dn6   1/1   Running   0   16m
-$ kubectl get nodes
+$ kubectl --kubeconfig ./azure-gpu-cluster.conf get nodes
 NAME                            STATUS   ROLES    AGE   VERSION
 azure-gpu-control-plane-nnb57   Ready    master   42m   v1.22.1
 azure-gpu-md-0-gcc8v            Ready    <none>   38m   v1.22.1
-$ kubectl get node azure-gpu-md-0-gcc8v -o jsonpath={.status.allocatable} | jq
+$ kubectl --kubeconfig ./azure-gpu-cluster.conf get node azure-gpu-md-0-gcc8v -o jsonpath={.status.allocatable} | jq
 {
   "attachable-volumes-azure-disk": "12",
   "cpu": "6",
@@ -140,7 +146,7 @@ spec:
      limits:
         nvidia.com/gpu: 1 # requesting 1 GPU
 EOF
-$ kubectl apply -f cuda-vector-add.yaml
+$ kubectl --kubeconfig ./azure-gpu-cluster.conf apply -f cuda-vector-add.yaml
 ```
 
 The container will download, run, and perform a [CUDA](https://developer.nvidia.com/cuda-zone)
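Not part of this commit, but once that test pod finishes, its result can be checked directly; a sketch, reusing the same kubeconfig and the `cuda-vector-add` pod name from the manifest above:

```bash
# The pod should reach status Completed when the CUDA sample exits cleanly.
$ kubectl --kubeconfig ./azure-gpu-cluster.conf get po cuda-vector-add
# The sample writes its result to stdout; inspect it via the pod logs.
$ kubectl --kubeconfig ./azure-gpu-cluster.conf logs cuda-vector-add
```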

hack/fetch-nvidia-resources.sh

Lines changed: 0 additions & 60 deletions
This file was deleted.
