Commit 552d0ac

[zh] Sync scheduling-gpus.md
1 parent 9d625f4 commit 552d0ac

1 file changed
content/zh-cn/docs/tasks/manage-gpus/scheduling-gpus.md

Lines changed: 73 additions & 191 deletions
@@ -16,215 +16,90 @@ description: Configure and schedule GPUs for use as a resource by nodes in a clu
 {{< feature-state state="beta" for_k8s_version="v1.10" >}}
 
 <!--
-Kubernetes includes **experimental** support for managing AMD and NVIDIA GPUs
+Kubernetes includes **experimental** support for managing GPUs
 (graphical processing units) across several nodes.
 
-This page describes how users can consume GPUs across different Kubernetes versions
-and the current limitations.
+This page describes how users can consume GPUs, and outlines
+some of the limitations in the implementation.
 -->
-Kubernetes 支持对节点上的 AMD 和 NVIDIA GPU(图形处理单元)进行管理,目前处于**实验**状态。
+Kubernetes 支持对若干节点上的 GPU(图形处理单元)进行管理,目前处于**实验**状态。
 
-本页介绍用户如何在不同的 Kubernetes 版本中使用 GPU,以及当前存在的一些限制。
+本页介绍用户如何使用 GPU 以及当前存在的一些限制。
 
 <!-- body -->
 
 <!--
 ## Using device plugins
 
-Kubernetes implements {{< glossary_tooltip text="Device Plugins" term_id="device-plugin" >}}
+Kubernetes implements {{< glossary_tooltip text="device plugins" term_id="device-plugin" >}}
 to let Pods access specialized hardware features such as GPUs.
-
-As an administrator, you have to install GPU drivers from the corresponding
-hardware vendor on the nodes and run the corresponding device plugin from the
-GPU vendor:
 -->
 ## 使用设备插件 {#using-device-plugins}
 
-Kubernetes 实现了{{< glossary_tooltip text="设备插件(Device Plugins)" term_id="device-plugin" >}}
+Kubernetes 实现了{{< glossary_tooltip text="设备插件(Device Plugin)" term_id="device-plugin" >}}
 以允许 Pod 访问类似 GPU 这类特殊的硬件功能特性。
 
-作为集群管理员,你要在节点上安装来自对应硬件厂商的 GPU 驱动程序,并运行
-来自 GPU 厂商的对应的设备插件。
+{{% thirdparty-content %}}
 
-* [AMD](#deploying-amd-gpu-device-plugin)
-* [NVIDIA](#deploying-nvidia-gpu-device-plugin)
+<!--
+As an administrator, you have to install GPU drivers from the corresponding
+hardware vendor on the nodes and run the corresponding device plugin from the
+GPU vendor:
+-->
+作为集群管理员,你要在节点上安装来自对应硬件厂商的 GPU 驱动程序,并运行来自
+GPU 厂商的对应设备插件。
 
+* [AMD](https://github.com/RadeonOpenCompute/k8s-device-plugin#deployment)
+* [Intel](https://intel.github.io/intel-device-plugins-for-kubernetes/cmd/gpu_plugin/README.html)
+* [NVIDIA](https://github.com/NVIDIA/k8s-device-plugin#quick-start)
 <!--
-When the above conditions are true, Kubernetes will expose `amd.com/gpu` or
-`nvidia.com/gpu` as a schedulable resource.
+Once you have installed the plugin, your cluster exposes a custom schedulable
+resource such as `amd.com/gpu` or `nvidia.com/gpu`.
 
 You can consume these GPUs from your containers by requesting
-`<vendor>.com/gpu` the same way you request `cpu` or `memory`.
-However, there are some limitations in how you specify the resource requirements
-when using GPUs:
+the custom GPU resource, the same way you request `cpu` or `memory`.
+However, there are some limitations in how you specify the resource
+requirements for custom devices.
 -->
-当以上条件满足时,Kubernetes 将暴露 `amd.com/gpu` 或 `nvidia.com/gpu` 为
-可调度的资源。
+一旦你安装了插件,你的集群就会暴露一个自定义可调度的资源,例如 `amd.com/gpu` 或 `nvidia.com/gpu`。
 
-你可以通过请求 `<vendor>.com/gpu` 资源来使用 GPU 设备,就像你为 CPU
-和内存所做的那样。
-不过,使用 GPU 时,在如何指定资源需求这个方面还是有一些限制的:
+你可以通过请求这个自定义的 GPU 资源在你的容器中使用这些 GPU,其请求方式与请求 `cpu` 或 `memory` 时相同。
+不过,在如何指定自定义设备的资源请求方面存在一些限制。
 
 <!--
-- GPUs are only supposed to be specified in the `limits` section, which means:
-  * You can specify GPU `limits` without specifying `requests` because
-    Kubernetes will use the limit as the request value by default.
-  * You can specify GPU in both `limits` and `requests` but these two values
-    must be equal.
-  * You cannot specify GPU `requests` without specifying `limits`.
-- Containers (and Pods) do not share GPUs. There's no overcommitting of GPUs.
-- Each container can request one or more GPUs. It is not possible to request a
-  fraction of a GPU.
+GPUs are only supposed to be specified in the `limits` section, which means:
+* You can specify GPU `limits` without specifying `requests`, because
+  Kubernetes will use the limit as the request value by default.
+* You can specify GPU in both `limits` and `requests` but these two values
+  must be equal.
+* You cannot specify GPU `requests` without specifying `limits`.
 -->
-- GPU 只能设置在 `limits` 部分,这意味着:
-  * 你可以指定 GPU 的 `limits` 而不指定其 `requests`,Kubernetes 将使用限制
-    值作为默认的请求值;
+- GPU 只能在 `limits` 部分指定,这意味着:
+  * 你可以指定 GPU 的 `limits` 而不指定其 `requests`,因为 Kubernetes 将默认使用限制
+    值作为请求值。
   * 你可以同时指定 `limits` 和 `requests`,不过这两个值必须相等。
   * 你不可以仅指定 `requests` 而不指定 `limits`。
-- 容器(以及 Pod)之间是不共享 GPU 的。GPU 也不可以过量分配(Overcommitting)。
-- 每个容器可以请求一个或者多个 GPU,但是用小数值来请求部分 GPU 是不允许的。
 
 <!--
-Here's an example:
+Here's an example manifest for a Pod that requests a GPU:
 -->
-这里是一个例子:
+以下是一个 Pod 请求 GPU 的示例清单:
 
 ```yaml
 apiVersion: v1
 kind: Pod
 metadata:
-  name: cuda-vector-add
+  name: example-vector-add
 spec:
   restartPolicy: OnFailure
   containers:
-    - name: cuda-vector-add
-      # https://github.com/kubernetes/kubernetes/blob/v1.7.11/test/images/nvidia-cuda/Dockerfile
-      image: "registry.k8s.io/cuda-vector-add:v0.1"
+    - name: example-vector-add
+      image: "registry.example/example-vector-add:v42"
       resources:
         limits:
-          nvidia.com/gpu: 1 # requesting 1 GPU
-```
-
-<!--
-### Deploying AMD GPU device plugin
-
-The [official AMD GPU device plugin](https://github.com/RadeonOpenCompute/k8s-device-plugin)
-has the following requirements:
--->
-### 部署 AMD GPU 设备插件 {#deploying-amd-gpu-device-plugin}
-
-[官方的 AMD GPU 设备插件](https://github.com/RadeonOpenCompute/k8s-device-plugin)有以下要求:
-
-<!--
-- Kubernetes nodes have to be pre-installed with AMD GPU Linux driver.
-
-To deploy the AMD device plugin once your cluster is running and the above
-requirements are satisfied:
--->
-- Kubernetes 节点必须预先安装 AMD GPU 的 Linux 驱动。
-
-如果你的集群已经启动并且满足上述要求的话,可以这样部署 AMD 设备插件:
-
-```shell
-kubectl create -f https://raw.githubusercontent.com/RadeonOpenCompute/k8s-device-plugin/v1.10/k8s-ds-amdgpu-dp.yaml
+          gpu-vendor.example/example-gpu: 1 # 请求 1 个 GPU
 ```
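
For a quick smoke test of the new example — a minimal sketch, not part of this commit, assuming you saved the manifest above as `example-vector-add.yaml` and that a device plugin exposing `gpu-vendor.example/example-gpu` is already running:

```shell
# Submit the example Pod
kubectl apply -f example-vector-add.yaml

# Watch it get scheduled and run to completion
kubectl get pod example-vector-add --watch

# Check how many GPUs a node advertises
# (the resource name depends on the installed device plugin)
kubectl describe node <node-name> | grep -A 2 'Capacity'
```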
 
-<!--
-You can report issues with this third-party device plugin by logging an issue in
-[RadeonOpenCompute/k8s-device-plugin](https://github.com/RadeonOpenCompute/k8s-device-plugin).
--->
-你可以到 [RadeonOpenCompute/k8s-device-plugin](https://github.com/RadeonOpenCompute/k8s-device-plugin)
-项目报告有关此设备插件的问题。
-
-<!--
-### Deploying NVIDIA GPU device plugin
-
-There are currently two device plugin implementations for NVIDIA GPUs:
--->
-### 部署 NVIDIA GPU 设备插件 {#deploying-nvidia-gpu-device-plugin}
-
-对于 NVIDIA GPU,目前存在两种设备插件的实现:
-
-<!--
-#### Official NVIDIA GPU device plugin
-
-The [official NVIDIA GPU device plugin](https://github.com/NVIDIA/k8s-device-plugin)
-has the following requirements:
--->
-#### 官方的 NVIDIA GPU 设备插件
-
-[官方的 NVIDIA GPU 设备插件](https://github.com/NVIDIA/k8s-device-plugin) 有以下要求:
-
-<!--
-- Kubernetes nodes have to be pre-installed with NVIDIA drivers.
-- Kubernetes nodes have to be pre-installed with [nvidia-docker 2.0](https://github.com/NVIDIA/nvidia-docker)
-- Kubelet must use Docker as its container runtime
-- `nvidia-container-runtime` must be configured as the [default runtime](https://github.com/NVIDIA/k8s-device-plugin#preparing-your-gpu-nodes)
-  for Docker, instead of runc.
-- The version of the NVIDIA drivers must match the constraint ~= 384.81.
-
-To deploy the NVIDIA device plugin once your cluster is running and the above
-requirements are satisfied:
--->
-- Kubernetes 的节点必须预先安装了 NVIDIA 驱动
-- Kubernetes 的节点必须预先安装 [nvidia-docker 2.0](https://github.com/NVIDIA/nvidia-docker)
-- Kubelet 的容器运行时必须使用 Docker
-- Docker 的[默认运行时](https://github.com/NVIDIA/k8s-device-plugin#preparing-your-gpu-nodes)必须设置为
-  `nvidia-container-runtime`,而不是 `runc`。
-- NVIDIA 驱动程序的版本必须匹配 ~= 384.81
-
-如果你的集群已经启动并且满足上述要求的话,可以这样部署 NVIDIA 设备插件:
-
-```shell
-kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta4/nvidia-device-plugin.yml
-```
-
-<!--
-You can report issues with this third-party device plugin by logging an issue in
-[NVIDIA/k8s-device-plugin](https://github.com/NVIDIA/k8s-device-plugin).
--->
-你可以通过在 [NVIDIA/k8s-device-plugin](https://github.com/NVIDIA/k8s-device-plugin) 中记录问题来报告此第三方设备插件的问题。
-
-<!--
-#### NVIDIA GPU device plugin used by GCE
-
-The [NVIDIA GPU device plugin used by GCE](https://github.com/GoogleCloudPlatform/container-engine-accelerators/tree/master/cmd/nvidia_gpu)
-doesn't require using nvidia-docker and should work with any container runtime
-that is compatible with the Kubernetes Container Runtime Interface (CRI). It's tested
-on [Container-Optimized OS](https://cloud.google.com/container-optimized-os/)
-and has experimental code for Ubuntu from 1.9 onwards.
--->
-#### GCE 中使用的 NVIDIA GPU 设备插件
-
-[GCE 使用的 NVIDIA GPU 设备插件](https://github.com/GoogleCloudPlatform/container-engine-accelerators/tree/master/cmd/nvidia_gpu) 并不要求使用 nvidia-docker,并且对于任何实现了 Kubernetes CRI 的容器运行时,都应该能够使用。这一实现已经在 [Container-Optimized OS](https://cloud.google.com/container-optimized-os/) 上进行了测试,并且在 1.9 版本之后会有对于 Ubuntu 的实验性代码。
-
-<!--
-You can use the following commands to install the NVIDIA drivers and device plugin:
--->
-你可以使用下面的命令来安装 NVIDIA 驱动以及设备插件:
-
-```shell
-# 在 Container-Optimized OS 上安装 NVIDIA 驱动:
-kubectl create -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/stable/daemonset.yaml
-
-# 在 Ubuntu 上安装 NVIDIA 驱动 (实验性质):
-kubectl create -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/stable/nvidia-driver-installer/ubuntu/daemonset.yaml
-
-# 安装设备插件:
-kubectl create -f https://raw.githubusercontent.com/kubernetes/kubernetes/release-1.12/cluster/addons/device-plugins/nvidia-gpu/daemonset.yaml
-```
-
-<!--
-You can report issues with using or deploying this third-party device plugin by logging an issue in
-[GoogleCloudPlatform/container-engine-accelerators](https://github.com/GoogleCloudPlatform/container-engine-accelerators).
-
-Google publishes its own [instructions](https://cloud.google.com/kubernetes-engine/docs/how-to/gpus) for using NVIDIA GPUs on GKE.
--->
-你可以通过在 [GoogleCloudPlatform/container-engine-accelerators](https://github.com/GoogleCloudPlatform/container-engine-accelerators)
-中记录问题来报告使用或部署此第三方设备插件的问题。
-
-关于如何在 GKE 上使用 NVIDIA GPU,Google 也提供自己的[指令](https://cloud.google.com/kubernetes-engine/docs/how-to/gpus)。
-
 <!--
 ## Clusters containing different types of GPUs
 
@@ -234,20 +109,26 @@ to schedule pods to appropriate nodes.
 
 For example:
 -->
-## 集群内存在不同类型的 GPU
+## 集群内存在不同类型的 GPU {#clusters-containing-different-types-of-gpus}
 
-如果集群内部的不同节点上有不同类型的 NVIDIA GPU,那么你可以使用
-[节点标签和节点选择器](/zh-cn/docs/tasks/configure-pod-container/assign-pods-nodes/)
-来将 pod 调度到合适的节点上。
+如果集群内部的不同节点上有不同类型的 NVIDIA GPU,
+那么你可以使用[节点标签和节点选择器](/zh-cn/docs/tasks/configure-pod-container/assign-pods-nodes/)来将
+Pod 调度到合适的节点上。
 
 例如:
 
 ```shell
 # 为你的节点加上它们所拥有的加速器类型的标签
-kubectl label nodes <node-with-k80> accelerator=nvidia-tesla-k80
-kubectl label nodes <node-with-p100> accelerator=nvidia-tesla-p100
+kubectl label nodes node1 accelerator=example-gpu-x100
+kubectl label nodes node2 accelerator=other-gpu-k915
 ```
 
+<!--
+That label key `accelerator` is just an example; you can use
+a different label key if you prefer.
+-->
+这个标签键 `accelerator` 只是一个例子;如果你愿意,可以使用不同的标签键。
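
A Pod can then opt into one of these node pools with a matching `nodeSelector`. Below is a minimal sketch, not part of this commit; the image name and the `gpu-vendor.example/example-gpu` resource are placeholders in the spirit of the examples above:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-gpu-workload
spec:
  restartPolicy: OnFailure
  containers:
    - name: example-gpu-workload
      image: "registry.example/example-gpu-workload:v1"
      resources:
        limits:
          gpu-vendor.example/example-gpu: 1
  nodeSelector:
    accelerator: example-gpu-x100 # only nodes carrying this label are eligible
```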
+
 <!--
 ## Automatic node labelling {#node-labeller}
 -->
@@ -280,7 +161,6 @@ At the moment, that controller can add labels for:
 * CZ - Carrizo
 * AI - Arctic Islands
 * RV - Raven
-Example result:
 -->
 * 设备 ID (-device-id)
 * VRAM 大小 (-vram)
@@ -296,26 +176,23 @@ Example result:
 * AI - Arctic Islands
 * RV - Raven
 
-示例:
-
 ```shell
 kubectl describe node cluster-node-23
 ```
 
 ```
-Name:               cluster-node-23
-Roles:              <none>
-Labels:             beta.amd.com/gpu.cu-count.64=1
-                    beta.amd.com/gpu.device-id.6860=1
-                    beta.amd.com/gpu.family.AI=1
-                    beta.amd.com/gpu.simd-count.256=1
-                    beta.amd.com/gpu.vram.16G=1
-                    beta.kubernetes.io/arch=amd64
-                    beta.kubernetes.io/os=linux
-                    kubernetes.io/hostname=cluster-node-23
-Annotations:        kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
-                    node.alpha.kubernetes.io/ttl: 0
-
+Name:               cluster-node-23
+Roles:              <none>
+Labels:             beta.amd.com/gpu.cu-count.64=1
+                    beta.amd.com/gpu.device-id.6860=1
+                    beta.amd.com/gpu.family.AI=1
+                    beta.amd.com/gpu.simd-count.256=1
+                    beta.amd.com/gpu.vram.16G=1
+                    kubernetes.io/arch=amd64
+                    kubernetes.io/os=linux
+                    kubernetes.io/hostname=cluster-node-23
+Annotations:        node.alpha.kubernetes.io/ttl: 0
+
 ```
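
Because the node labeller exposes each GPU property as its own label, nodes can be filtered directly with a label selector. A small sketch, assuming labels like those in the output above:

```shell
# List nodes whose GPU is from the Arctic Islands family
kubectl get nodes -l beta.amd.com/gpu.family.AI=1

# List nodes advertising 16G of VRAM
kubectl get nodes -l beta.amd.com/gpu.vram.16G=1
```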
 
 <!--
@@ -337,12 +214,17 @@ spec:
       resources:
         limits:
           nvidia.com/gpu: 1
-  nodeSelector:
-    accelerator: nvidia-tesla-p100 # or nvidia-tesla-k80 etc.
+  affinity:
+    nodeAffinity:
+      requiredDuringSchedulingIgnoredDuringExecution:
+        nodeSelectorTerms:
+          - matchExpressions:
+              - key: beta.amd.com/gpu.family.AI # Arctic Islands GPU 系列
+                operator: Exists
 ```
 
 <!--
-This will ensure that the pod will be scheduled to a node that has the GPU type
+This ensures that the Pod will be scheduled to a node that has the GPU type
 you specified.
 -->
 这能够保证 Pod 能够被调度到你所指定类型的 GPU 的节点上去。