@@ -16,215 +16,90 @@ description: Configure and schedule GPUs for use as a resource by nodes in a clu
{{< feature-state state="beta" for_k8s_version="v1.10" >}}

<!--
-Kubernetes includes **experimental** support for managing AMD and NVIDIA GPUs
+Kubernetes includes **experimental** support for managing GPUs
(graphical processing units) across several nodes.

-This page describes how users can consume GPUs across different Kubernetes versions
-and the current limitations.
+This page describes how users can consume GPUs, and outlines
+some of the limitations in the implementation.
-->
-Kubernetes 支持对节点上的 AMD 和 NVIDIA GPU(图形处理单元)进行管理,目前处于**实验**状态。
+Kubernetes 支持对若干节点上的 GPU(图形处理单元)进行管理,目前处于**实验**状态。

-本页介绍用户如何在不同的 Kubernetes 版本中使用 GPU,以及当前存在的一些限制。
+本页介绍用户如何使用 GPU 以及当前存在的一些限制。

<!-- body -->

<!--
## Using device plugins

-Kubernetes implements {{< glossary_tooltip text="Device Plugins" term_id="device-plugin" >}}
+Kubernetes implements {{< glossary_tooltip text="device plugins" term_id="device-plugin" >}}
to let Pods access specialized hardware features such as GPUs.
-
-As an administrator, you have to install GPU drivers from the corresponding
-hardware vendor on the nodes and run the corresponding device plugin from the
-GPU vendor:
-->
## 使用设备插件 {#using-device-plugins}

-Kubernetes 实现了{{< glossary_tooltip text="设备插件(Device Plugins)" term_id="device-plugin" >}}
+Kubernetes 实现了{{< glossary_tooltip text="设备插件(Device Plugin)" term_id="device-plugin" >}}
以允许 Pod 访问类似 GPU 这类特殊的硬件功能特性。

-作为集群管理员,你要在节点上安装来自对应硬件厂商的 GPU 驱动程序,并运行
-来自 GPU 厂商的对应的设备插件。
+{{% thirdparty-content %}}

-* [AMD](#deploying-amd-gpu-device-plugin)
-* [NVIDIA](#deploying-nvidia-gpu-device-plugin)
+<!--
+As an administrator, you have to install GPU drivers from the corresponding
+hardware vendor on the nodes and run the corresponding device plugin from the
+GPU vendor:
+-->
+作为集群管理员,你要在节点上安装来自对应硬件厂商的 GPU 驱动程序,并运行来自
+GPU 厂商的对应设备插件。

+* [AMD](https://github.com/RadeonOpenCompute/k8s-device-plugin#deployment)
+* [Intel](https://intel.github.io/intel-device-plugins-for-kubernetes/cmd/gpu_plugin/README.html)
+* [NVIDIA](https://github.com/NVIDIA/k8s-device-plugin#quick-start)
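+
+<!--
+For example, a device plugin typically runs as a DaemonSet. A sketch, using a
+released manifest from the NVIDIA plugin repository (the version tag here is
+illustrative; see the vendor links above for current deployment instructions):
+-->
+例如,设备插件通常以 DaemonSet 的形式运行。下面是一个示意,
+使用的是 NVIDIA 插件仓库中发布的清单;其中的版本号仅作示意,
+实际部署方式请以上面厂商链接中的说明为准:
+
+```shell
+kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta4/nvidia-device-plugin.yml
+
+# 部署完成后,GPU 资源会出现在节点的 Capacity/Allocatable 中
+kubectl describe node node1
+```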
<!--
-When the above conditions are true, Kubernetes will expose `amd.com/gpu` or
-`nvidia.com/gpu` as a schedulable resource.
+Once you have installed the plugin, your cluster exposes a custom schedulable
+resource such as `amd.com/gpu` or `nvidia.com/gpu`.

You can consume these GPUs from your containers by requesting
-`<vendor>.com/gpu` the same way you request `cpu` or `memory`.
-However, there are some limitations in how you specify the resource requirements
-when using GPUs:
+the custom GPU resource, the same way you request `cpu` or `memory`.
+However, there are some limitations in how you specify the resource
+requirements for custom devices.
-->
-当以上条件满足时,Kubernetes 将暴露 `amd.com/gpu` 或 `nvidia.com/gpu` 为
-可调度的资源。
+一旦你安装了插件,你的集群就会暴露一个自定义可调度的资源,例如 `amd.com/gpu` 或 `nvidia.com/gpu`。

-你可以通过请求 `<vendor>.com/gpu` 资源来使用 GPU 设备,就像你为 CPU
-和内存所做的那样。
-不过,使用 GPU 时,在如何指定资源需求这个方面还是有一些限制的:
+你可以通过请求这个自定义的 GPU 资源在你的容器中使用这些 GPU,其请求方式与请求 `cpu` 或 `memory` 时相同。
+不过,在如何指定自定义设备的资源请求方面存在一些限制。

<!--
-- GPUs are only supposed to be specified in the `limits` section, which means:
-  * You can specify GPU `limits` without specifying `requests` because
-    Kubernetes will use the limit as the request value by default.
-  * You can specify GPU in both `limits` and `requests` but these two values
-    must be equal.
-  * You cannot specify GPU `requests` without specifying `limits`.
-- Containers (and Pods) do not share GPUs. There's no overcommitting of GPUs.
-- Each container can request one or more GPUs. It is not possible to request a
-  fraction of a GPU.
+GPUs are only supposed to be specified in the `limits` section, which means:
+* You can specify GPU `limits` without specifying `requests`, because
+  Kubernetes will use the limit as the request value by default.
+* You can specify GPU in both `limits` and `requests` but these two values
+  must be equal.
+* You cannot specify GPU `requests` without specifying `limits`.
-->
-- GPU 只能设置在 `limits` 部分,这意味着:
-  * 你可以指定 GPU 的 `limits` 而不指定其 `requests`,Kubernetes 将使用限制
-    值作为默认的请求值;
+- GPU 只能在 `limits` 部分指定,这意味着:
+  * 你可以指定 GPU 的 `limits` 而不指定其 `requests`,因为 Kubernetes 将默认使用限制
+    值作为请求值。
  * 你可以同时指定 `limits` 和 `requests`,不过这两个值必须相等。
  * 你不可以仅指定 `requests` 而不指定 `limits`。
-- 容器(以及 Pod)之间是不共享 GPU 的。GPU 也不可以过量分配(Overcommitting)。
-- 每个容器可以请求一个或者多个 GPU,但是用小数值来请求部分 GPU 是不允许的。

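+<!--
+For example, this fragment (using the hypothetical resource name
+`gpu-vendor.example/example-gpu`) sets `requests` and `limits` to the
+same value, which is the only way `requests` may be specified:
+-->
+例如,下面的片段(使用假想的资源名 `gpu-vendor.example/example-gpu`)
+将 `requests` 和 `limits` 设置为相同的值;这是唯一被允许的指定 `requests` 的方式:
+
+```yaml
+resources:
+  requests:
+    gpu-vendor.example/example-gpu: 1 # 必须与 limits 中的值相等
+  limits:
+    gpu-vendor.example/example-gpu: 1
+```
+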
<!--
-Here's an example:
+Here's an example manifest for a Pod that requests a GPU:
-->
-这里是一个例子:
+以下是一个 Pod 请求 GPU 的示例清单:

```yaml
apiVersion: v1
kind: Pod
metadata:
-  name: cuda-vector-add
+  name: example-vector-add
spec:
  restartPolicy: OnFailure
  containers:
-    - name: cuda-vector-add
-      # https://github.com/kubernetes/kubernetes/blob/v1.7.11/test/images/nvidia-cuda/Dockerfile
-      image: "registry.k8s.io/cuda-vector-add:v0.1"
+    - name: example-vector-add
+      image: "registry.example/example-vector-add:v42"
      resources:
        limits:
-          nvidia.com/gpu: 1 # requesting 1 GPU
-```
-
-<!--
-### Deploying AMD GPU device plugin
-
-The [official AMD GPU device plugin](https://github.com/RadeonOpenCompute/k8s-device-plugin)
-has the following requirements:
--->
-### 部署 AMD GPU 设备插件 {#deploying-amd-gpu-device-plugin}
-
-[官方的 AMD GPU 设备插件](https://github.com/RadeonOpenCompute/k8s-device-plugin)有以下要求:
-
-<!--
-- Kubernetes nodes have to be pre-installed with AMD GPU Linux driver.
-
-To deploy the AMD device plugin once your cluster is running and the above
-requirements are satisfied:
--->
-- Kubernetes 节点必须预先安装 AMD GPU 的 Linux 驱动。
-
-如果你的集群已经启动并且满足上述要求的话,可以这样部署 AMD 设备插件:
-
-```shell
-kubectl create -f https://raw.githubusercontent.com/RadeonOpenCompute/k8s-device-plugin/v1.10/k8s-ds-amdgpu-dp.yaml
+          gpu-vendor.example/example-gpu: 1 # 请求 1 个 GPU
```

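+<!--
+Assuming you save that manifest as `gpu-pod.yaml` (the filename is just an
+example), you can create the Pod and check where it was scheduled:
+-->
+假设你将上面的清单保存为 `gpu-pod.yaml`(文件名仅为示例),
+你可以创建该 Pod 并查看它被调度到了哪个节点:
+
+```shell
+kubectl apply -f gpu-pod.yaml
+kubectl get pod example-vector-add -o wide
+```
+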
-<!--
-You can report issues with this third-party device plugin by logging an issue in
-[RadeonOpenCompute/k8s-device-plugin](https://github.com/RadeonOpenCompute/k8s-device-plugin).
--->
-你可以到 [RadeonOpenCompute/k8s-device-plugin](https://github.com/RadeonOpenCompute/k8s-device-plugin)
-项目报告有关此设备插件的问题。
-
-<!--
-### Deploying NVIDIA GPU device plugin
-
-There are currently two device plugin implementations for NVIDIA GPUs:
--->
-### 部署 NVIDIA GPU 设备插件 {#deploying-nvidia-gpu-device-plugin}
-
-对于 NVIDIA GPU,目前存在两种设备插件的实现:
-
-<!--
-#### Official NVIDIA GPU device plugin
-
-The [official NVIDIA GPU device plugin](https://github.com/NVIDIA/k8s-device-plugin)
-has the following requirements:
--->
-#### 官方的 NVIDIA GPU 设备插件
-
-[官方的 NVIDIA GPU 设备插件](https://github.com/NVIDIA/k8s-device-plugin)有以下要求:
-
-<!--
-- Kubernetes nodes have to be pre-installed with NVIDIA drivers.
-- Kubernetes nodes have to be pre-installed with [nvidia-docker 2.0](https://github.com/NVIDIA/nvidia-docker)
-- Kubelet must use Docker as its container runtime
-- `nvidia-container-runtime` must be configured as the [default runtime](https://github.com/NVIDIA/k8s-device-plugin#preparing-your-gpu-nodes)
-  for Docker, instead of runc.
-- The version of the NVIDIA drivers must match the constraint ~= 384.81.
-
-To deploy the NVIDIA device plugin once your cluster is running and the above
-requirements are satisfied:
--->
-- Kubernetes 的节点必须预先安装了 NVIDIA 驱动
-- Kubernetes 的节点必须预先安装 [nvidia-docker 2.0](https://github.com/NVIDIA/nvidia-docker)
-- Kubelet 的容器运行时必须使用 Docker
-- Docker 的[默认运行时](https://github.com/NVIDIA/k8s-device-plugin#preparing-your-gpu-nodes)必须设置为
-  `nvidia-container-runtime`,而不是 `runc`。
-- NVIDIA 驱动程序的版本必须匹配 ~= 384.81
-
-如果你的集群已经启动并且满足上述要求的话,可以这样部署 NVIDIA 设备插件:
-
-```shell
-kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta4/nvidia-device-plugin.yml
-```
-
-<!--
-You can report issues with this third-party device plugin by logging an issue in
-[NVIDIA/k8s-device-plugin](https://github.com/NVIDIA/k8s-device-plugin).
--->
-你可以通过在 [NVIDIA/k8s-device-plugin](https://github.com/NVIDIA/k8s-device-plugin) 中记录问题来报告此第三方设备插件的问题。
-
-<!--
-#### NVIDIA GPU device plugin used by GCE
-
-The [NVIDIA GPU device plugin used by GCE](https://github.com/GoogleCloudPlatform/container-engine-accelerators/tree/master/cmd/nvidia_gpu)
-doesn't require using nvidia-docker and should work with any container runtime
-that is compatible with the Kubernetes Container Runtime Interface (CRI). It's tested
-on [Container-Optimized OS](https://cloud.google.com/container-optimized-os/)
-and has experimental code for Ubuntu from 1.9 onwards.
--->
-#### GCE 中使用的 NVIDIA GPU 设备插件
-
-[GCE 使用的 NVIDIA GPU 设备插件](https://github.com/GoogleCloudPlatform/container-engine-accelerators/tree/master/cmd/nvidia_gpu)并不要求使用 nvidia-docker,并且对于任何实现了 Kubernetes CRI 的容器运行时,都应该能够使用。这一实现已经在 [Container-Optimized OS](https://cloud.google.com/container-optimized-os/) 上进行了测试,并且在 1.9 版本之后会有对于 Ubuntu 的实验性代码。
-
-<!--
-You can use the following commands to install the NVIDIA drivers and device plugin:
--->
-你可以使用下面的命令来安装 NVIDIA 驱动以及设备插件:
-
-```shell
-# 在 Container-Optimized OS 上安装 NVIDIA 驱动:
-kubectl create -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/stable/daemonset.yaml
-
-# 在 Ubuntu 上安装 NVIDIA 驱动(实验性质):
-kubectl create -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/stable/nvidia-driver-installer/ubuntu/daemonset.yaml
-
-# 安装设备插件:
-kubectl create -f https://raw.githubusercontent.com/kubernetes/kubernetes/release-1.12/cluster/addons/device-plugins/nvidia-gpu/daemonset.yaml
-```
-
-<!--
-You can report issues with using or deploying this third-party device plugin by logging an issue in
-[GoogleCloudPlatform/container-engine-accelerators](https://github.com/GoogleCloudPlatform/container-engine-accelerators).
-
-Google publishes its own [instructions](https://cloud.google.com/kubernetes-engine/docs/how-to/gpus) for using NVIDIA GPUs on GKE.
--->
-你可以通过在 [GoogleCloudPlatform/container-engine-accelerators](https://github.com/GoogleCloudPlatform/container-engine-accelerators)
-中记录问题来报告使用或部署此第三方设备插件的问题。
-
-关于如何在 GKE 上使用 NVIDIA GPU,Google 也提供自己的[指令](https://cloud.google.com/kubernetes-engine/docs/how-to/gpus)。
-
<!--
## Clusters containing different types of GPUs
@@ -234,20 +109,26 @@ to schedule pods to appropriate nodes.

For example:
-->
-## 集群内存在不同类型的 GPU
+## 集群内存在不同类型的 GPU {#clusters-containing-different-types-of-gpus}

-如果集群内部的不同节点上有不同类型的 NVIDIA GPU,那么你可以使用
-[节点标签和节点选择器](/zh-cn/docs/tasks/configure-pod-container/assign-pods-nodes/)
-来将 pod 调度到合适的节点上。
+如果集群内部的不同节点上有不同类型的 NVIDIA GPU,
+那么你可以使用[节点标签和节点选择器](/zh-cn/docs/tasks/configure-pod-container/assign-pods-nodes/)来将
+Pod 调度到合适的节点上。

例如:

```shell
# 为你的节点加上它们所拥有的加速器类型的标签
-kubectl label nodes <node-with-k80> accelerator=nvidia-tesla-k80
-kubectl label nodes <node-with-p100> accelerator=nvidia-tesla-p100
+kubectl label nodes node1 accelerator=example-gpu-x100
+kubectl label nodes node2 accelerator=other-gpu-k915
```

+<!--
+That label key `accelerator` is just an example; you can use
+a different label key if you prefer.
+-->
+这个标签键 `accelerator` 只是一个例子;如果你愿意,可以使用不同的标签键。
+
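+<!--
+A Pod can then use a `nodeSelector` to target nodes with a specific GPU type.
+This is a minimal sketch that reuses the example Pod from earlier, assuming
+the `accelerator` label key and the example label values shown above:
+-->
+随后,Pod 可以通过 `nodeSelector` 来选择带有特定 GPU 类型的节点。
+下面是一个最小示意,复用了前面的示例 Pod,
+并假设使用上文所示的 `accelerator` 标签键和示例标签值:
+
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  name: example-vector-add
+spec:
+  restartPolicy: OnFailure
+  containers:
+    - name: example-vector-add
+      image: "registry.example/example-vector-add:v42"
+      resources:
+        limits:
+          gpu-vendor.example/example-gpu: 1
+  nodeSelector:
+    accelerator: example-gpu-x100 # 只调度到带有此标签的节点
+```
+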
<!--
## Automatic node labelling {#node-labeller}
-->
@@ -280,7 +161,6 @@ At the moment, that controller can add labels for:
* CZ - Carrizo
* AI - Arctic Islands
* RV - Raven
-Example result:
-->
* 设备 ID (-device-id)
* VRAM 大小 (-vram)
@@ -296,26 +176,23 @@ Example result:
* AI - Arctic Islands
* RV - Raven

-示例:
-
```shell
kubectl describe node cluster-node-23
```

```
-Name:               cluster-node-23
-Roles:              <none>
-Labels:             beta.amd.com/gpu.cu-count.64=1
-                    beta.amd.com/gpu.device-id.6860=1
-                    beta.amd.com/gpu.family.AI=1
-                    beta.amd.com/gpu.simd-count.256=1
-                    beta.amd.com/gpu.vram.16G=1
-                    beta.kubernetes.io/arch=amd64
-                    beta.kubernetes.io/os=linux
-                    kubernetes.io/hostname=cluster-node-23
-Annotations:        kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
-                    node.alpha.kubernetes.io/ttl: 0
-                    …
+Name:               cluster-node-23
+Roles:              <none>
+Labels:             beta.amd.com/gpu.cu-count.64=1
+                    beta.amd.com/gpu.device-id.6860=1
+                    beta.amd.com/gpu.family.AI=1
+                    beta.amd.com/gpu.simd-count.256=1
+                    beta.amd.com/gpu.vram.16G=1
+                    kubernetes.io/arch=amd64
+                    kubernetes.io/os=linux
+                    kubernetes.io/hostname=cluster-node-23
+Annotations:        node.alpha.kubernetes.io/ttl: 0
+                    …
```

<!--
@@ -337,12 +214,17 @@ spec:
      resources:
        limits:
          nvidia.com/gpu: 1
-  nodeSelector:
-    accelerator: nvidia-tesla-p100 # or nvidia-tesla-k80 etc.
+  affinity:
+    nodeAffinity:
+      requiredDuringSchedulingIgnoredDuringExecution:
+        nodeSelectorTerms:
+          - matchExpressions:
+              - key: beta.amd.com/gpu.family.AI # Arctic Islands GPU 系列
+                operator: Exists
```

<!--
-This will ensure that the pod will be scheduled to a node that has the GPU type
+This ensures that the Pod will be scheduled to a node that has the GPU type
you specified.
-->
这能够保证 Pod 能够被调度到你所指定类型的 GPU 的节点上去。