@@ -3,10 +3,10 @@ title: Deploying a GPU application on OVHcloud Managed Kubernetes Service
slug: deploying-gpu-application
excerpt: 'Find out how to deploy a GPU application on OVHcloud Managed Kubernetes'
section: GPU
- order: 0
routes:
- canonical: 'https://docs.ovh.com/gb/en/kubernetes/deploying-gpu-application/'
- updated: 2022-02-16
+ canonical: https://docs.ovh.com/gb/en/kubernetes/deploying-gpu-application/
+ order: 0
+ updated: 2023-04-26
---

<style>
@@ -31,7 +31,7 @@ updated: 2022-02-16
}
</style>

- **Last updated February 16, 2022.**
+ **Last updated April 26, 2023.**

## Objective

@@ -121,14 +121,18 @@ For this tutorial we are using the [NVIDIA GPU Operator Helm chart](https://gith

Add the NVIDIA Helm repository:

+ > [!primary]
+ >
+ > The NVIDIA Helm chart has moved, so if you have already added a repository named `nvidia`, you can remove it with: `helm repo remove nvidia`.
+
```bash
- helm repo add nvidia https://nvidia.github.io/gpu-operator
+ helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
```

This will add the NVIDIA repository and update all of your repositories:

- <pre class="console"><code>$ helm repo add nvidia https://nvidia.github.io/gpu-operator
+ <pre class="console"><code>$ helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
"nvidia" has been added to your repositories
Hang tight while we grab the latest from your chart repositories...
@@ -146,37 +150,47 @@ helm install gpu-operator nvidia/gpu-operator -n gpu-operator --create-namespace
You should have a GPU operator installed and running:

<pre class="console"><code>$ helm install gpu-operator nvidia/gpu-operator -n gpu-operator --create-namespace --wait
+
NAME: gpu-operator
- LAST DEPLOYED: Thu Dec 23 15:27:25 2021
+ LAST DEPLOYED: Tue Apr 25 09:59:59 2023
NAMESPACE: gpu-operator
STATUS: deployed
REVISION: 1
TEST SUITE: None

$ kubectl get pod -n gpu-operator
NAME READY STATUS RESTARTS AGE
- gpu-feature-discovery-n7tv8 1/1 Running 0 3m35s
- gpu-feature-discovery-xddz2 1/1 Running 0 3m35s
- gpu-operator-bb886b456-llmlg 1/1 Running 0 5m31s
- gpu-operator-node-feature-discovery-master-58d884d5cc-lxkb8 1/1 Running 0 5m31s
- gpu-operator-node-feature-discovery-worker-9pqqq 1/1 Running 0 4m27s
- gpu-operator-node-feature-discovery-worker-s5zj9 1/1 Running 0 4m20s
- nvidia-container-toolkit-daemonset-424mm 1/1 Running 0 3m36s
- nvidia-container-toolkit-daemonset-dqlw9 1/1 Running 0 3m36s
- nvidia-cuda-validator-5dzf7 0/1 Completed 0 76s
- nvidia-cuda-validator-zp9vd 0/1 Completed 0 95s
- nvidia-dcgm-4bstw 1/1 Running 0 3m36s
- nvidia-dcgm-4t7zd 1/1 Running 0 3m36s
- nvidia-dcgm-exporter-rhtbj 1/1 Running 1 3m35s
- nvidia-dcgm-exporter-ttq2t 1/1 Running 0 3m35s
- nvidia-device-plugin-daemonset-f8vht 1/1 Running 0 3m36s
- nvidia-device-plugin-daemonset-lt9xr 1/1 Running 0 3m36s
- nvidia-device-plugin-validator-gj86p 0/1 Completed 0 28s
- nvidia-device-plugin-validator-w2vz4 0/1 Completed 0 37s
- nvidia-driver-daemonset-2mcft 1/1 Running 0 3m36s
- nvidia-driver-daemonset-v9pv9 1/1 Running 0 3m36s
- nvidia-operator-validator-g6fbm 1/1 Running 0 3m36s
- nvidia-operator-validator-xctsp 1/1 Running 0 3m36s
+ gpu-feature-discovery-8xzzw 1/1 Running 0 22m
+ gpu-feature-discovery-kxtlh 1/1 Running 0 22m
+ gpu-feature-discovery-wdvr7 1/1 Running 0 22m
+ gpu-operator-689dbf694b-clz7f 1/1 Running 0 23m
+ gpu-operator-node-feature-discovery-master-7db9bfdd5b-9w2hj 1/1 Running 0 23m
+ gpu-operator-node-feature-discovery-worker-2wpmm 1/1 Running 0 23m
+ gpu-operator-node-feature-discovery-worker-4bsn7 1/1 Running 0 23m
+ gpu-operator-node-feature-discovery-worker-9klx5 1/1 Running 0 23m
+ gpu-operator-node-feature-discovery-worker-gn62n 1/1 Running 0 23m
+ gpu-operator-node-feature-discovery-worker-hdzpx 1/1 Running 0 23m
+ nvidia-container-toolkit-daemonset-hvx6x 1/1 Running 0 22m
+ nvidia-container-toolkit-daemonset-lhmxn 1/1 Running 0 22m
+ nvidia-container-toolkit-daemonset-tjrb2 1/1 Running 0 22m
+ nvidia-cuda-validator-fcfwn 0/1 Completed 0 18m
+ nvidia-cuda-validator-mdbml 0/1 Completed 0 18m
+ nvidia-cuda-validator-sv979 0/1 Completed 0 17m
+ nvidia-dcgm-exporter-fvn8h 1/1 Running 0 22m
+ nvidia-dcgm-exporter-mt5qh 1/1 Running 0 22m
+ nvidia-dcgm-exporter-n65kl 1/1 Running 0 22m
+ nvidia-device-plugin-daemonset-hwc95 1/1 Running 0 22m
+ nvidia-device-plugin-daemonset-wr5td 1/1 Running 0 22m
+ nvidia-device-plugin-daemonset-zzzkm 1/1 Running 0 22m
+ nvidia-device-plugin-validator-4k5wd 0/1 Completed 0 17m
+ nvidia-device-plugin-validator-rjkzd 0/1 Completed 0 17m
+ nvidia-device-plugin-validator-swdrr 0/1 Completed 0 17m
+ nvidia-driver-daemonset-2jsmv 1/1 Running 0 22m
+ nvidia-driver-daemonset-5zq44 1/1 Running 0 22m
+ nvidia-driver-daemonset-v6qgx 1/1 Running 0 22m
+ nvidia-operator-validator-kk6nd 1/1 Running 0 22m
+ nvidia-operator-validator-m9p9k 1/1 Running 0 22m
+ nvidia-operator-validator-s6czx 1/1 Running 0 22m
</code></pre>
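
Before moving on to the verification workload in the next section, you can quickly confirm that the operator has exposed GPU resources to Kubernetes. A minimal check, assuming your GPU node pool has already joined the cluster:

```bash
# Once the device plugin is running, each GPU node should advertise the
# "nvidia.com/gpu" resource in its Capacity and Allocatable sections.
kubectl describe nodes | grep -i "nvidia.com/gpu"
```

A non-zero `nvidia.com/gpu` count under `Capacity` and `Allocatable` means Pods can now request GPUs through resource limits.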
### Verify GPU Operator Install
@@ -215,7 +229,7 @@ spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vectoradd
-     image: "nvidia/samples:vectoradd-cuda11.2.1"
+     image: "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1"
      resources:
        limits:
          nvidia.com/gpu: 1
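
The hunk above only shows part of the Pod manifest. If you prefer to recreate the whole object in one step, a sketch built around the updated image could look like this (the Pod's `metadata.name` and the `default` namespace are assumptions mirroring NVIDIA's sample, not values shown in this diff):

```bash
# Apply a complete cuda-vectoradd Pod inline; the manifest mirrors the
# fragment above, with the assumed metadata filled in.
cat <<'EOF' | kubectl apply -n default -f -
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vectoradd
      image: "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1"
      resources:
        limits:
          nvidia.com/gpu: 1
EOF
```

Once the Pod reaches `Completed`, `kubectl logs cuda-vectoradd` should end with the `Done` line shown further below.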
@@ -261,87 +275,6 @@ Done

Our first GPU workload has just started up and completed its task in our OVHcloud Managed Kubernetes cluster.
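
When you are done with the test, you can clean up the Pod (assuming it was created as `cuda-vectoradd` in the `default` namespace, as in the sketch above):

```bash
# Remove the completed test Pod so it no longer shows up in "kubectl get pod".
kubectl delete pod cuda-vectoradd -n default
```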
- ### Running Load Test GPU Application
-
- After deploying your first application using GPU, you can now run a load test GPU application.
-
- To do that you have to use the `nvidia-smi` (System Management Interface) in any container with the proper runtime.
-
- To see this in action, create a `my-load-gpu-pod.yml` YAML manifest file with the following content:
-
- ```yaml
- apiVersion: v1
- kind: Pod
- metadata:
-   name: dcgmproftester
- spec:
-   restartPolicy: OnFailure
-   containers:
-     - name: dcgmproftester
-       image: nvidia/samples:dcgmproftester-2.0.10-cuda11.0-ubuntu18.04
-       args: ["--no-dcgm-validation", "-t 1004", "-d 240"]
-       resources:
-         limits:
-           nvidia.com/gpu: 1
-       securityContext:
-         capabilities:
-           add: ["SYS_ADMIN"]
- ```
-
- Apply it:
-
- ```bash
- kubectl apply -f my-load-gpu-pod.yml -n default
- ```
-
- And watch the Pod startup:
-
- ```bash
- kubectl get pod -n default -w
- ```
-
- This will create a Pod using the Nvidia `dcgmproftester` to generate a test GPU load:
-
- <pre class="console"><code>$ kubectl apply -f my-load-gpu-pod.yml -n default
- pod/dcgmproftester created
-
- $ kubectl get po -w
- NAME READY STATUS RESTARTS AGE
- ...
- dcgmproftester 1/1 Running 0 7s
- </code></pre>
-
- Then, execute into the pod:
-
- ```bash
- kubectl exec -it dcgmproftester -- nvidia-smi -n default
- ```
-
- <pre class="console"><code>$ kubectl exec -it dcgmproftester -- nvidia-smi
-
- Fri Dec 24 13:36:50 2021
- +-----------------------------------------------------------------------------+
- | NVIDIA-SMI 470.82.01 Driver Version: 470.82.01 CUDA Version: 11.4 |
- |-------------------------------+----------------------+----------------------+
- | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
- | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
- | | | MIG M. |
- |===============================+======================+======================|
- | 0 Tesla V100-PCIE... On | 00000000:00:07.0 Off | 0 |
- | N/A 47C P0 214W / 250W | 491MiB / 16160MiB | 79% Default |
- | | | N/A |
- +-------------------------------+----------------------+----------------------+
-
- +-----------------------------------------------------------------------------+
- | Processes: |
- | GPU GI CI PID Type Process name GPU Memory |
- | ID ID Usage |
- |=============================================================================|
- +-----------------------------------------------------------------------------+
- </code></pre>
-
- You can see your test load under `GPU-Util` (third column), along with other information such as `Memory-Usage` (second column).
-
## Go further

To learn more about using your Kubernetes cluster the practical way, we invite you to look at our [OVHcloud Managed Kubernetes documentation](../).