@@ -100,22 +100,29 @@ spec:
```

<!--
- ## Clusters containing different types of GPUs
+ ## Manage clusters with different types of GPUs

If different nodes in your cluster have different types of GPUs, then you
can use [Node Labels and Node Selectors](/docs/tasks/configure-pod-container/assign-pods-nodes/)
to schedule pods to appropriate nodes.

For example:
-->
- ## Clusters containing different types of GPUs {#clusters-containing-different-types-of-gpus}
+ ## Manage clusters with different types of GPUs {#manage-clusters-with-different-types-of-gpus}

If different nodes in your cluster have different types of NVIDIA GPUs,
then you can use [node labels and node selectors](/zh-cn/docs/tasks/configure-pod-container/assign-pods-nodes/)
to schedule Pods onto appropriate nodes.

For example:

+ <!--
+ ```shell
+ # Label your nodes with the accelerator type they have.
+ kubectl label nodes node1 accelerator=example-gpu-x100
+ kubectl label nodes node2 accelerator=other-gpu-k915
+ ```
+ -->
```shell
# Label your nodes with the accelerator type they have.
kubectl label nodes node1 accelerator=example-gpu-x100
@@ -134,18 +141,92 @@ a different label key if you prefer.
## Automatic node labelling {#node-labeller}

<!--
- If you're using AMD GPU devices, you can deploy
- [Node Labeller](https://github.com/RadeonOpenCompute/k8s-device-plugin/tree/master/cmd/k8s-node-labeller).
- Node Labeller is a {{< glossary_tooltip text="controller" term_id="controller" >}} that automatically
- labels your nodes with GPU device properties.
+ As an administrator, you can automatically discover and label all your GPU-enabled nodes
+ by deploying Kubernetes [Node Feature Discovery](https://github.com/kubernetes-sigs/node-feature-discovery) (NFD).
+ NFD detects the hardware features that are available on each node in a Kubernetes cluster.
+ Typically, NFD is configured to advertise those features as node labels, but NFD can also add extended resources, annotations, and node taints.
+ NFD is compatible with all [supported versions](/releases/version-skew-policy/#supported-versions) of Kubernetes.
+ By default, NFD creates the [feature labels](https://kubernetes-sigs.github.io/node-feature-discovery/master/usage/features.html) for the detected features.
+ Administrators can also use NFD to taint nodes with specific features, so that only Pods that request those features can be scheduled on those nodes.
+ -->
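+ As an administrator, you can deploy Kubernetes
+ [Node Feature Discovery](https://github.com/kubernetes-sigs/node-feature-discovery) (NFD)
+ to automatically discover and label all of your GPU-enabled nodes. NFD detects the hardware features that are available on each node in a Kubernetes cluster.
+ Typically, NFD is configured to advertise those features as node labels, but NFD can also add extended resources, annotations, and node taints.
+ NFD is compatible with all [supported versions](/zh-cn/releases/version-skew-policy/#supported-versions) of Kubernetes.
+ By default, NFD creates [feature labels](https://kubernetes-sigs.github.io/node-feature-discovery/master/usage/features.html) for the detected features.
+ Administrators can also use NFD to taint nodes that have specific features, so that only Pods that request those features can be scheduled onto those nodes.
+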
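On the Pod side, tainted GPU nodes are typically paired with a matching toleration. The following is a minimal sketch, assuming a node carries the hypothetical taint `gpu-vendor.example/gpu=present:NoSchedule`. That taint key and value and the Pod name are illustrative assumptions and are not defined by NFD or by this page; the image and resource name are the placeholders already used in this page.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-gpu-toleration   # hypothetical name
spec:
  tolerations:
    # Assumed taint on the GPU node: gpu-vendor.example/gpu=present:NoSchedule
    - key: "gpu-vendor.example/gpu"
      operator: "Equal"
      value: "present"
      effect: "NoSchedule"
  restartPolicy: OnFailure
  containers:
    - name: example-vector-add
      image: "registry.example/example-vector-add:v42"
      resources:
        limits:
          gpu-vendor.example/example-gpu: 1   # requesting 1 GPU
```

A toleration only permits scheduling onto the tainted node; it does not attract the Pod to it. Combine it with a node selector or node affinity if the Pod must land on a GPU node.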
+ <!--
+ You also need a plugin for NFD that adds appropriate labels to your nodes; these might be generic
+ labels or they could be vendor-specific. Your GPU vendor may provide a third-party
+ plugin for NFD; check their documentation for more details.
+ -->
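+ You also need a plugin for NFD that adds appropriate labels to your nodes;
+ these labels might be generic or vendor-specific. Your GPU vendor may provide a third-party plugin for NFD;
+ check their documentation for more details.
+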
+ <!--
+ {{< highlight yaml "linenos=false,hl_lines=6-18" >}}
+ apiVersion: v1
+ kind: Pod
+ metadata:
+   name: example-vector-add
+ spec:
+   # You can use Kubernetes node affinity to schedule this Pod onto a node
+   # that provides the kind of GPU that its container needs in order to work
+   affinity:
+     nodeAffinity:
+       requiredDuringSchedulingIgnoredDuringExecution:
+         nodeSelectorTerms:
+         - matchExpressions:
+           - key: "gpu.gpu-vendor.example/installed-memory"
+             operator: Gt # (greater than)
+             values: ["40535"]
+           - key: "feature.node.kubernetes.io/pci-10.present" # NFD Feature label
+             operator: In
+             values: ["true"] # (optional) only schedule on nodes with PCI device 10
+   restartPolicy: OnFailure
+   containers:
+     - name: example-vector-add
+       image: "registry.example/example-vector-add:v42"
+       resources:
+         limits:
+           gpu-vendor.example/example-gpu: 1 # requesting 1 GPU
+ {{< /highlight >}}
+ -->
+ {{< highlight yaml "linenos=false,hl_lines=6-18" >}}
+ apiVersion: v1
+ kind: Pod
+ metadata:
+   name: example-vector-add
+ spec:
+   # You can use Kubernetes node affinity to schedule this Pod onto a node that provides the kind of GPU its container needs
+   affinity:
+     nodeAffinity:
+       requiredDuringSchedulingIgnoredDuringExecution:
+         nodeSelectorTerms:
+         - matchExpressions:
+           - key: "gpu.gpu-vendor.example/installed-memory"
+             operator: Gt # (greater than)
+             values: ["40535"]
+           - key: "feature.node.kubernetes.io/pci-10.present" # NFD feature label
+             operator: In
+             values: ["true"] # (optional) only schedule onto nodes with PCI device 10
+   restartPolicy: OnFailure
+   containers:
+     - name: example-vector-add
+       image: "registry.example/example-vector-add:v42"
+       resources:
+         limits:
+           gpu-vendor.example/example-gpu: 1 # requesting 1 GPU
+ {{< /highlight >}}
+
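If the GPU memory threshold is a preference rather than a hard requirement, the same label can be used in a soft affinity rule instead. The following is a minimal sketch that reuses the placeholder label key, image, and resource name from the manifest above; the Pod name and the weight are assumptions chosen for illustration.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-vector-add-preferred   # hypothetical name
spec:
  affinity:
    nodeAffinity:
      # Soft rule: the scheduler favors matching nodes but may still
      # place the Pod elsewhere if none are available.
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100   # assumed weight; valid values are 1 to 100
          preference:
            matchExpressions:
              - key: "gpu.gpu-vendor.example/installed-memory"
                operator: Gt
                values: ["40535"]
  restartPolicy: OnFailure
  containers:
    - name: example-vector-add
      image: "registry.example/example-vector-add:v42"
      resources:
        limits:
          gpu-vendor.example/example-gpu: 1   # requesting 1 GPU
```

The resource limit still restricts the Pod to nodes that advertise `gpu-vendor.example/example-gpu`; only the memory preference becomes optional.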
+ <!--
+ #### GPU vendor implementations

- Similar functionality for NVIDIA is provided by
- [GPU feature discovery](https://github.com/NVIDIA/gpu-feature-discovery/blob/main/README.md).
+ - [Intel](https://intel.github.io/intel-device-plugins-for-kubernetes/cmd/gpu_plugin/README.html)
+ - [NVIDIA](https://github.com/NVIDIA/gpu-feature-discovery/#readme)
-->
- If you're using AMD GPU devices, you can deploy
- [Node Labeller](https://github.com/RadeonOpenCompute/k8s-device-plugin/tree/master/cmd/k8s-node-labeller),
- a {{< glossary_tooltip text="controller" term_id="controller" >}}
- that automatically labels your nodes with GPU device properties.
+ #### GPU vendor implementations

- For NVIDIA GPUs, [GPU feature discovery](https://github.com/NVIDIA/gpu-feature-discovery/blob/main/README.md)
- provides similar functionality.
+ - [Intel](https://intel.github.io/intel-device-plugins-for-kubernetes/cmd/gpu_plugin/README.html)
+ - [NVIDIA](https://github.com/NVIDIA/gpu-feature-discovery/#readme)