---
title: Good practices for Dynamic Resource Allocation as a Cluster Admin
content_type: concept
weight: 60
---

<!-- overview -->

This page describes good practices when configuring a Kubernetes cluster
utilizing Dynamic Resource Allocation (DRA). These instructions are for cluster
administrators.

<!-- body -->

## Separate permissions to DRA related APIs

DRA is orchestrated through a number of different APIs. Use authorization tools
(like RBAC, or another solution) to control access to the right APIs depending
on the persona of your user.

In general, DeviceClasses and ResourceSlices should be restricted to admins and
the DRA drivers. Cluster operators that will be deploying Pods with claims will
need access to the ResourceClaim and ResourceClaimTemplate APIs; both of these
APIs are namespace scoped.
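
As a sketch, a namespaced RBAC Role for an operator persona that deploys Pods
with claims could look like the following; the role name, namespace, and verb
list are illustrative assumptions, not requirements.

```yaml
# A minimal sketch of a Role for operators who deploy Pods with claims.
# The metadata values and verbs are illustrative; trim them to your needs.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: dra-claim-user        # hypothetical name
  namespace: team-workloads   # hypothetical namespace
rules:
- apiGroups: ["resource.k8s.io"]
  resources: ["resourceclaims", "resourceclaimtemplates"]
  verbs: ["get", "list", "watch", "create", "delete"]
```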

## DRA driver deployment and maintenance

DRA drivers are third-party applications that run on each node of your cluster
to interface with the hardware of that node and Kubernetes' native DRA
components. The installation procedure depends on the driver you choose, but it
is likely deployed as a DaemonSet to all or a selection of the nodes (using
node selectors or similar mechanisms) in your cluster.
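
For illustration, a driver deployment restricted to nodes that actually have
the hardware might use a node selector like the sketch below; the label, image,
and names are hypothetical and will differ per driver.

```yaml
# Sketch: deploy a DRA driver only to nodes labeled as having the hardware.
# All names, labels, and the image reference are placeholders.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: example-dra-driver
  namespace: dra-driver
spec:
  selector:
    matchLabels:
      app: example-dra-driver
  template:
    metadata:
      labels:
        app: example-dra-driver
    spec:
      nodeSelector:
        example.com/accelerator-present: "true"  # hypothetical node label
      containers:
      - name: driver
        image: registry.example.com/dra-driver:v1.0.0  # placeholder image
```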

### Use drivers with seamless upgrade if available

DRA drivers implement the [`kubeletplugin` package
interface](https://pkg.go.dev/k8s.io/dynamic-resource-allocation/kubeletplugin).
Your driver may support seamless upgrades by implementing a property of this
interface that allows two versions of the same DRA driver to coexist for a
short time. This is only available for kubelet versions 1.33 and above and may
not be supported by your driver for heterogeneous clusters with attached nodes
running older versions of Kubernetes; check your driver's documentation to be
sure.
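
If your driver does support this, one way to let two versions of the driver
coexist briefly is a surging rolling update on its DaemonSet, as in the sketch
below; whether this is appropriate depends on your driver, so treat it as an
assumption to verify against the driver's documentation.

```yaml
# Sketch: start the new driver Pod on a node before the old one terminates,
# so both versions overlap briefly during the upgrade.
updateStrategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 1
    maxUnavailable: 0
```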

If seamless upgrades are available for your situation, consider using them to
minimize scheduling delays when your driver updates.

If you cannot use seamless upgrades, you may observe the following during
driver downtime for upgrades:

* Pods cannot start unless the claims they depend on were already prepared for
  use.
* Cleanup after the last Pod which used a claim is delayed until the driver is
  available again. The Pod is not marked as terminated, which prevents the
  resources used by the Pod from being reused for other Pods.
* Running Pods will continue to run.

### Confirm your DRA driver exposes a liveness probe and utilize it

Your DRA driver likely implements a gRPC socket for health checks as part of
DRA driver good practices. The easiest way to utilize this gRPC socket is to
configure it as a liveness probe for the DaemonSet that deploys your DRA
driver. Your driver's documentation or deployment tooling may already include
this, but if you are building your configuration separately, or are not running
your DRA driver as a Kubernetes Pod, be sure that your orchestration tooling
restarts the DRA driver on failed health checks to this gRPC socket. Doing so
minimizes any accidental downtime of the DRA driver and gives it more
opportunities to self heal, reducing scheduling delays and troubleshooting
time.
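
As a sketch, assuming the driver serves gRPC health checks on port 51515 (a
made-up value; check your driver's documentation for the real port), the driver
container in the DaemonSet could declare:

```yaml
# Sketch of a gRPC liveness probe on the DRA driver container.
# The port is hypothetical; use the one your driver actually exposes.
livenessProbe:
  grpc:
    port: 51515
  initialDelaySeconds: 10
  periodSeconds: 10
  failureThreshold: 3
```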

### When draining a node, drain the DRA driver as late as possible

The DRA driver is responsible for unpreparing any devices that were allocated
to Pods, and if the DRA driver is
{{< glossary_tooltip text="drained" term_id="drain" >}} before Pods with claims
have been deleted, it will not be able to finalize its cleanup. If you
implement custom drain logic for nodes, consider checking that there are no
allocated or reserved ResourceClaims or ResourceClaimTemplates before
terminating the DRA driver itself.

## Monitor and tune components for higher load, especially in high scale environments

The control plane component `kube-scheduler` and the internal ResourceClaim
controller orchestrated by the `kube-controller-manager` component do the heavy
lifting during scheduling of Pods with claims, based on metadata stored in the
DRA APIs. Compared to non-DRA scheduled Pods, the number of API server calls
and the memory and CPU utilization needed by these components are increased for
Pods using DRA claims. In addition, node local components like the DRA driver
and the kubelet utilize DRA APIs to allocate the hardware request at Pod
sandbox creation time. Especially in high scale environments where clusters
have many nodes, and/or deploy many workloads that heavily utilize DRA defined
resource claims, the cluster administrator should configure the relevant
components to anticipate the increased load.

Mistuned components can have direct or snowballing effects, causing different
symptoms during the Pod lifecycle. If the `kube-scheduler` component's QPS and
burst configurations are too low, the scheduler might quickly identify a
suitable node for a Pod but take longer to bind the Pod to that node. With DRA,
during Pod scheduling, the QPS and burst parameters in the client-go
configuration within `kube-controller-manager` are critical.

The specific values to which you tune your cluster depend on a variety of
factors, like the number of nodes and Pods, the rate of Pod creation, and
churn, even in non-DRA environments; see the [SIG-Scalability README on
Kubernetes scalability
thresholds](https://github.com/kubernetes/community/blob/master/sig-scalability/configs-and-limits/thresholds.md)
for more information. In scale tests performed against a DRA enabled cluster
with 100 nodes, involving 720 long-lived pods (90% saturation) and 80 churn
pods (10% churn, 10 times), with a job creation QPS of 10,
`kube-controller-manager` QPS could be set to as low as 75 and burst to 150 to
meet equivalent metric targets for non-DRA deployments. At this lower bound, it
was observed that the client side rate limiter was triggered often enough to
protect the API server from explosive bursts, but was high enough that Pod
startup SLOs were not impacted. While this is a good starting point, you can
get a better idea of how to tune the different components that have the biggest
effect on DRA performance for your deployment by monitoring the metrics
described below.
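
If you manage the control plane manifests yourself, the client-go QPS and burst
for `kube-controller-manager` are set with its `--kube-api-qps` and
`--kube-api-burst` flags. The fragment below is a sketch of a kubeadm-style
static Pod manifest using the values from the scale test above; treat the
numbers as a starting point to validate against your own measurements.

```yaml
# Sketch: fragment of /etc/kubernetes/manifests/kube-controller-manager.yaml.
# Only the two flags below change; other flags stay as they are.
spec:
  containers:
  - name: kube-controller-manager
    command:
    - kube-controller-manager
    - --kube-api-qps=75
    - --kube-api-burst=150
```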

### `kube-controller-manager` metrics

The following metrics look closely at the internal ResourceClaim controller
managed by the `kube-controller-manager` component.

* Workqueue Add Rate: Monitor
  `sum(rate(workqueue_adds_total{name="resource_claim"}[5m]))` to gauge how
  quickly items are added to the ResourceClaim controller.
* Workqueue Depth: Track
  `sum(workqueue_depth{endpoint="kube-controller-manager", name="resource_claim"})`
  to identify any backlogs in the ResourceClaim controller.
* Workqueue Work Duration: Observe
  `histogram_quantile(0.99, sum(rate(workqueue_work_duration_seconds_bucket{name="resource_claim"}[5m])) by (le))`
  to understand the speed at which the ResourceClaim controller processes work.
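
To keep these queries at hand, you can codify them as Prometheus recording and
alerting rules; the sketch below is one way to do that, with the rule names and
the backlog threshold chosen arbitrarily for illustration.

```yaml
# Sketch of Prometheus rules wrapping the queries above.
# The rule names and the depth threshold (100) are illustrative.
groups:
- name: dra-resourceclaim-controller
  rules:
  - record: dra:resource_claim:workqueue_add_rate
    expr: sum(rate(workqueue_adds_total{name="resource_claim"}[5m]))
  - record: dra:resource_claim:work_duration_p99
    expr: |
      histogram_quantile(0.99,
        sum(rate(workqueue_work_duration_seconds_bucket{name="resource_claim"}[5m])) by (le))
  - alert: ResourceClaimWorkqueueBacklog
    expr: sum(workqueue_depth{endpoint="kube-controller-manager", name="resource_claim"}) > 100
    for: 10m
```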

If you are experiencing a low Workqueue Add Rate, a high Workqueue Depth,
and/or a high Workqueue Work Duration, this suggests the controller isn't
performing optimally. Consider tuning parameters like QPS, burst, and
CPU/memory configurations.

If you are experiencing a high Workqueue Add Rate and a high Workqueue Depth,
but a reasonable Workqueue Work Duration, this indicates the controller is
processing work, but concurrency might be insufficient. Concurrency is
hardcoded in the controller, so as a cluster administrator, you can tune for
this by reducing the Pod creation QPS, so that the add rate to the
ResourceClaim workqueue is more manageable.

### `kube-scheduler` metrics

The following scheduler metrics are high level metrics that aggregate
performance across all Pods scheduled, not just those using DRA. It is
important to note that the end-to-end metrics are ultimately influenced by the
`kube-controller-manager`'s performance in creating ResourceClaims from
ResourceClaimTemplates in deployments that heavily use ResourceClaimTemplates.

* Scheduler End-to-End Duration: Monitor
  `histogram_quantile(0.99, sum(increase(scheduler_pod_scheduling_sli_duration_seconds_bucket[5m])) by (le))`.
* Scheduler Algorithm Latency: Track
  `histogram_quantile(0.99, sum(increase(scheduler_scheduling_algorithm_duration_seconds_bucket[5m])) by (le))`.

### `kubelet` metrics

When a Pod bound to a node must have a ResourceClaim satisfied, the kubelet
calls the `NodePrepareResources` and `NodeUnprepareResources` methods of the
DRA driver. You can observe this behavior from the kubelet's point of view with
the following metrics.

* Kubelet NodePrepareResources: Monitor
  `histogram_quantile(0.99, sum(rate(dra_operations_duration_seconds_bucket{operation_name="PrepareResources"}[5m])) by (le))`.
* Kubelet NodeUnprepareResources: Track
  `histogram_quantile(0.99, sum(rate(dra_operations_duration_seconds_bucket{operation_name="UnprepareResources"}[5m])) by (le))`.

### DRA kubeletplugin operations

DRA drivers implement the [`kubeletplugin` package
interface](https://pkg.go.dev/k8s.io/dynamic-resource-allocation/kubeletplugin),
which surfaces its own metrics for the underlying gRPC operations
`NodePrepareResources` and `NodeUnprepareResources`. You can observe this
behavior from the point of view of the internal kubeletplugin with the
following metrics.

* DRA kubeletplugin gRPC NodePrepareResources operation: Observe
  `histogram_quantile(0.99, sum(rate(dra_grpc_operations_duration_seconds_bucket{method_name=~".*NodePrepareResources"}[5m])) by (le))`.
* DRA kubeletplugin gRPC NodeUnprepareResources operation: Observe
  `histogram_quantile(0.99, sum(rate(dra_grpc_operations_duration_seconds_bucket{method_name=~".*NodeUnprepareResources"}[5m])) by (le))`.

## {{% heading "whatsnext" %}}

* [Learn more about DRA](/docs/concepts/scheduling-eviction/dynamic-resource-allocation)
