
Commit 1cc9638

Merge pull request #4543 from ovh/ava-gpu-app
Update deploying GPU app doc to be compliant with Ubuntu 22 nodes migration
2 parents f38497e + c55039c

File tree

15 files changed: +647 −1652 lines changed

pages/platform/kubernetes-k8s/deploying-gpu-application/guide.de-de.md

Lines changed: 44 additions & 111 deletions
@@ -3,10 +3,10 @@ title: Deploying a GPU application on OVHcloud Managed Kubernetes Service
 slug: deploying-gpu-application
 excerpt: 'Find out how to deploy a GPU application on OVHcloud Managed Kubernetes'
 section: GPU
-order: 0
 routes:
-    canonical: 'https://docs.ovh.com/gb/en/kubernetes/deploying-gpu-application/'
-updated: 2022-02-16
+    canonical: https://docs.ovh.com/gb/en/kubernetes/deploying-gpu-application/
+order: 0
+updated: 2023-04-26
 ---

 <style>
@@ -31,7 +31,7 @@ updated: 2022-02-16
 }
 </style>

-**Last updated February 16, 2022.**
+**Last updated April 26th, 2023.**

 ## Objective

@@ -121,14 +121,18 @@ For this tutorial we are using the [NVIDIA GPU Operator Helm chart](https://gith

 Add the NVIDIA Helm repository:

+> [!primary]
+>
+> The Nvidia Helm chart has moved. If you already added a repo with the name `nvidia`, you can remove it: `helm repo remove nvidia`.
+
 ```bash
-helm repo add nvidia https://nvidia.github.io/gpu-operator
+helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
 helm repo update
 ```

 This will add the NVIDIA repository and update all of your repositories:

-<pre class="console"><code>$ helm repo add nvidia https://nvidia.github.io/gpu-operator
+<pre class="console"><code>$ helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
 helm repo update
 "nvidia" has been added to your repositories
 Hang tight while we grab the latest from your chart repositories...
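
As a quick sanity check after switching to the new repository URL, you can confirm that the `gpu-operator` chart resolves from the freshly added `nvidia` repo before installing it (a minimal sketch, not taken from the guide itself):

```bash
# List the gpu-operator chart as seen from the newly added "nvidia" repository;
# an empty result usually means "helm repo update" has not been run yet
helm search repo nvidia/gpu-operator
```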
@@ -146,37 +150,47 @@ helm install gpu-operator nvidia/gpu-operator -n gpu-operator --create-namespace
 You should have a GPU operator installed and running:

 <pre class="console"><code>$ helm install gpu-operator nvidia/gpu-operator -n gpu-operator --create-namespace --wait
+
 NAME: gpu-operator
-LAST DEPLOYED: Thu Dec 23 15:27:25 2021
+LAST DEPLOYED: Tue Apr 25 09:59:59 2023
 NAMESPACE: gpu-operator
 STATUS: deployed
 REVISION: 1
 TEST SUITE: None

 $ kubectl get pod -n gpu-operator
 NAME READY STATUS RESTARTS AGE
-gpu-feature-discovery-n7tv8 1/1 Running 0 3m35s
-gpu-feature-discovery-xddz2 1/1 Running 0 3m35s
-gpu-operator-bb886b456-llmlg 1/1 Running 0 5m31s
-gpu-operator-node-feature-discovery-master-58d884d5cc-lxkb8 1/1 Running 0 5m31s
-gpu-operator-node-feature-discovery-worker-9pqqq 1/1 Running 0 4m27s
-gpu-operator-node-feature-discovery-worker-s5zj9 1/1 Running 0 4m20s
-nvidia-container-toolkit-daemonset-424mm 1/1 Running 0 3m36s
-nvidia-container-toolkit-daemonset-dqlw9 1/1 Running 0 3m36s
-nvidia-cuda-validator-5dzf7 0/1 Completed 0 76s
-nvidia-cuda-validator-zp9vd 0/1 Completed 0 95s
-nvidia-dcgm-4bstw 1/1 Running 0 3m36s
-nvidia-dcgm-4t7zd 1/1 Running 0 3m36s
-nvidia-dcgm-exporter-rhtbj 1/1 Running 1 3m35s
-nvidia-dcgm-exporter-ttq2t 1/1 Running 0 3m35s
-nvidia-device-plugin-daemonset-f8vht 1/1 Running 0 3m36s
-nvidia-device-plugin-daemonset-lt9xr 1/1 Running 0 3m36s
-nvidia-device-plugin-validator-gj86p 0/1 Completed 0 28s
-nvidia-device-plugin-validator-w2vz4 0/1 Completed 0 37s
-nvidia-driver-daemonset-2mcft 1/1 Running 0 3m36s
-nvidia-driver-daemonset-v9pv9 1/1 Running 0 3m36s
-nvidia-operator-validator-g6fbm 1/1 Running 0 3m36s
-nvidia-operator-validator-xctsp 1/1 Running 0 3m36s
+gpu-feature-discovery-8xzzw 1/1 Running 0 22m
+gpu-feature-discovery-kxtlh 1/1 Running 0 22m
+gpu-feature-discovery-wdvr7 1/1 Running 0 22m
+gpu-operator-689dbf694b-clz7f 1/1 Running 0 23m
+gpu-operator-node-feature-discovery-master-7db9bfdd5b-9w2hj 1/1 Running 0 23m
+gpu-operator-node-feature-discovery-worker-2wpmm 1/1 Running 0 23m
+gpu-operator-node-feature-discovery-worker-4bsn7 1/1 Running 0 23m
+gpu-operator-node-feature-discovery-worker-9klx5 1/1 Running 0 23m
+gpu-operator-node-feature-discovery-worker-gn62n 1/1 Running 0 23m
+gpu-operator-node-feature-discovery-worker-hdzpx 1/1 Running 0 23m
+nvidia-container-toolkit-daemonset-hvx6x 1/1 Running 0 22m
+nvidia-container-toolkit-daemonset-lhmxn 1/1 Running 0 22m
+nvidia-container-toolkit-daemonset-tjrb2 1/1 Running 0 22m
+nvidia-cuda-validator-fcfwn 0/1 Completed 0 18m
+nvidia-cuda-validator-mdbml 0/1 Completed 0 18m
+nvidia-cuda-validator-sv979 0/1 Completed 0 17m
+nvidia-dcgm-exporter-fvn8h 1/1 Running 0 22m
+nvidia-dcgm-exporter-mt5qh 1/1 Running 0 22m
+nvidia-dcgm-exporter-n65kl 1/1 Running 0 22m
+nvidia-device-plugin-daemonset-hwc95 1/1 Running 0 22m
+nvidia-device-plugin-daemonset-wr5td 1/1 Running 0 22m
+nvidia-device-plugin-daemonset-zzzkm 1/1 Running 0 22m
+nvidia-device-plugin-validator-4k5wd 0/1 Completed 0 17m
+nvidia-device-plugin-validator-rjkzd 0/1 Completed 0 17m
+nvidia-device-plugin-validator-swdrr 0/1 Completed 0 17m
+nvidia-driver-daemonset-2jsmv 1/1 Running 0 22m
+nvidia-driver-daemonset-5zq44 1/1 Running 0 22m
+nvidia-driver-daemonset-v6qgx 1/1 Running 0 22m
+nvidia-operator-validator-kk6nd 1/1 Running 0 22m
+nvidia-operator-validator-m9p9k 1/1 Running 0 22m
+nvidia-operator-validator-s6czx 1/1 Running 0 22m
 </code></pre>

 ### Verify GPU Operator Install
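
Besides checking that the operator Pods are `Running`, you can also verify that the GPU worker nodes now advertise the `nvidia.com/gpu` resource. The command below is a minimal sketch, not taken from the guide; the `GPU` column name is arbitrary:

```bash
# Show how many NVIDIA GPUs each node reports as allocatable;
# nodes where the operator is not ready yet show <none>
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
```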
@@ -215,7 +229,7 @@ spec:
   restartPolicy: OnFailure
   containers:
   - name: cuda-vectoradd
-    image: "nvidia/samples:vectoradd-cuda11.2.1"
+    image: "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1"
     resources:
       limits:
         nvidia.com/gpu: 1
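
With the sample image now pointing to `nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1`, re-running the example only requires re-applying the manifest and reading the Pod logs. The snippet below is a minimal sketch: the manifest file name `my-gpu-app.yml` and the Pod name `cuda-vectoradd` are assumptions based on the container name shown in the diff:

```bash
# Re-create the sample Pod with the updated CUDA image, watch it start,
# then read its output (it should end with "Done")
kubectl apply -f my-gpu-app.yml -n default       # manifest file name assumed
kubectl get pod cuda-vectoradd -n default -w     # Pod name assumed to match the container name
kubectl logs cuda-vectoradd -n default
```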
@@ -261,87 +275,6 @@ Done

 Our first GPU workload is just started up and has done its task in our OVHcloud Managed Kubernetes cluster.

-### Running Load Test GPU Application
-
-After deploying your first application using GPU, you can now run a load test GPU application.
-
-To do that you have to use the `nvidia-smi` (System Management Interface) in any container with the proper runtime.
-
-To see this in action, create a `my-load-gpu-pod.yml` YAML manifest file with the following content:
-
-```yaml
-apiVersion: v1
-kind: Pod
-metadata:
-  name: dcgmproftester
-spec:
-  restartPolicy: OnFailure
-  containers:
-  - name: dcgmproftester
-    image: nvidia/samples:dcgmproftester-2.0.10-cuda11.0-ubuntu18.04
-    args: ["--no-dcgm-validation", "-t 1004", "-d 240"]
-    resources:
-      limits:
-        nvidia.com/gpu: 1
-    securityContext:
-      capabilities:
-        add: ["SYS_ADMIN"]
-```
-
-Apply it:
-
-```bash
-kubectl apply -f my-load-gpu-pod.yml -n default
-```
-
-And watch the Pod startup:
-
-```bash
-kubectl get pod -n default -w
-```
-
-This will create a Pod using the Nvidia `dcgmproftester` to generate a test GPU load:
-
-<pre class="console"><code>$ kubectl apply -f my-load-gpu-pod.yml -n default
-pod/dcgmproftester created
-
-$ kubectl get po -w
-NAME READY STATUS RESTARTS AGE
-...
-dcgmproftester 1/1 Running 0 7s
-</code></pre>
-
-Then, execute into the pod:
-
-```bash
-kubectl exec -it dcgmproftester -n default -- nvidia-smi
-```
-
-<pre class="console"><code>$ kubectl exec -it dcgmproftester -- nvidia-smi
-
-Fri Dec 24 13:36:50 2021
-+-----------------------------------------------------------------------------+
-| NVIDIA-SMI 470.82.01 Driver Version: 470.82.01 CUDA Version: 11.4 |
-|-------------------------------+----------------------+----------------------+
-| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
-| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
-| | | MIG M. |
-|===============================+======================+======================|
-| 0 Tesla V100-PCIE... On | 00000000:00:07.0 Off | 0 |
-| N/A 47C P0 214W / 250W | 491MiB / 16160MiB | 79% Default |
-| | | N/A |
-+-------------------------------+----------------------+----------------------+
-
-+-----------------------------------------------------------------------------+
-| Processes: |
-| GPU GI CI PID Type Process name GPU Memory |
-| ID ID Usage |
-|=============================================================================|
-+-----------------------------------------------------------------------------+
-</code></pre>
-
-You can see your test load under `GPU-Util` (third column), along with other information such as `Memory-Usage` (second column).
-
 ## Go further

 To learn more about using your Kubernetes cluster the practical way, we invite you to look at our [OVHcloud Managed Kubernetes documentation](../).
