Commit 6251128

Merge pull request #42640 from windsonsea/nodestatus
[zh] sync /reference/node/node-status.md

1 file changed: +268 −0 lines
---
content_type: reference
title: Node Status
weight: 80
---

<!-- overview -->

The status of a [node](/docs/concepts/architecture/nodes/) in Kubernetes is a critical
aspect of managing a Kubernetes cluster. In this article, we'll cover the basics of
monitoring and maintaining node status to ensure a healthy and stable cluster.
## Node status fields

A Node's status contains the following information:

* [Addresses](#addresses)
* [Conditions](#condition)
* [Capacity and Allocatable](#capacity)
* [Info](#info)
You can use `kubectl` to view a Node's status and other details:

```shell
kubectl describe node <insert-node-name-here>
```

Each section of the output is described below.
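If you only need a quick summary rather than the full `describe` output, a JSONPath query can pull out a single status field across all nodes. A minimal sketch, not part of the original page:

```shell
# One line per node: the node name and the status of its Ready condition
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}'
```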
## Addresses

The usage of these fields varies depending on your cloud provider or bare metal configuration.

* HostName: The hostname as reported by the node's kernel. Can be overridden via the kubelet
  `--hostname-override` parameter.
* ExternalIP: Typically the IP address of the node that is externally routable (available from
  outside the cluster).
* InternalIP: Typically the IP address of the node that is routable only within the cluster.
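To see which addresses the kubelet actually registered, you can read `.status.addresses` directly; a small sketch:

```shell
# Print each node's name followed by its registered address list
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.addresses}{"\n"}{end}'
```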
## Conditions {#condition}

The `conditions` field describes the status of all `Running` nodes. Examples of conditions include:

{{< table caption = "Node conditions, and a description of when each condition applies." >}}
| Node Condition       | Description |
|----------------------|-------------|
| `Ready` | `True` if the node is healthy and ready to accept pods, `False` if the node is not healthy and is not accepting pods, and `Unknown` if the node controller has not heard from the node in the last `node-monitor-grace-period` (default is 40 seconds) |
| `DiskPressure` | `True` if pressure exists on the disk size—that is, if the disk capacity is low; otherwise `False` |
| `MemoryPressure` | `True` if pressure exists on the node memory—that is, if the node memory is low; otherwise `False` |
| `PIDPressure` | `True` if pressure exists on the processes—that is, if there are too many processes on the node; otherwise `False` |
| `NetworkUnavailable` | `True` if the network for the node is not correctly configured, otherwise `False` |
{{< /table >}}
{{< note >}}
If you use command-line tools to print details of a cordoned Node, the Condition includes
`SchedulingDisabled`. `SchedulingDisabled` is not a Condition in the Kubernetes API; instead,
cordoned nodes are marked Unschedulable in their spec.
{{< /note >}}
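You can observe this yourself on a test cluster; in this sketch, `worker-1` is a placeholder node name:

```shell
# Cordon a node, then confirm that only .spec.unschedulable changed
kubectl cordon worker-1
kubectl get node worker-1 -o jsonpath='{.spec.unschedulable}{"\n"}'   # prints: true
kubectl uncordon worker-1
```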
In the Kubernetes API, a node's condition is represented as part of the `.status`
of the Node resource. For example, the following JSON structure describes a healthy node:

```json
"conditions": [
  {
    "type": "Ready",
    "status": "True",
    "reason": "KubeletReady",
    "message": "kubelet is posting ready status",
    "lastHeartbeatTime": "2019-06-05T18:38:35Z",
    "lastTransitionTime": "2019-06-05T11:41:27Z"
  }
]
```
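To pull the same structure from a live cluster, read `.status.conditions` directly; a sketch assuming `jq` is installed and `worker-1` is a placeholder name:

```shell
# Dump the conditions array of one node as pretty-printed JSON
kubectl get node worker-1 -o json | jq '.status.conditions'
```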
When problems occur on nodes, the Kubernetes control plane automatically creates
[taints](/docs/concepts/scheduling-eviction/taint-and-toleration/) that match the conditions
affecting the node. An example of this is when the `status` of the Ready condition
remains `Unknown` or `False` for longer than the kube-controller-manager's `NodeMonitorGracePeriod`,
which defaults to 40 seconds. This will cause either a `node.kubernetes.io/unreachable` taint, for an `Unknown` status,
or a `node.kubernetes.io/not-ready` taint, for a `False` status, to be added to the Node.
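To check whether the control plane has applied any of these taints to a node, inspect `.spec.taints`; a sketch with a placeholder node name:

```shell
# List the taints currently set on a node; empty output means no taints
kubectl get node worker-1 -o jsonpath='{.spec.taints}{"\n"}'
```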
These taints affect pending pods as the scheduler takes the Node's taints into consideration when
assigning a pod to a Node. Existing pods scheduled to the node may be evicted due to the application
of `NoExecute` taints. Pods may also have {{< glossary_tooltip text="tolerations" term_id="toleration" >}} that let
them schedule to and continue running on a Node even though it has a specific taint.
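As an illustration of the toleration mechanism (this manifest is not part of the original page; the Pod name and image are placeholders), the following Pod tolerates the `node.kubernetes.io/not-ready` taint for two minutes before being evicted:

```shell
# Apply a Pod whose toleration delays NoExecute eviction for 120 seconds
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: toleration-demo
spec:
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9
  tolerations:
  - key: "node.kubernetes.io/not-ready"
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 120
EOF
```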
See [Taint Based Evictions](/docs/concepts/scheduling-eviction/taint-and-toleration/#taint-based-evictions) and
[Taint Nodes by Condition](/docs/concepts/scheduling-eviction/taint-and-toleration/#taint-nodes-by-condition)
for more details.
## Capacity and Allocatable {#capacity}

Describes the resources available on the node: CPU, memory, and the maximum
number of pods that can be scheduled onto the node.

The fields in the capacity block indicate the total amount of resources that a
Node has. The allocatable block indicates the amount of resources on a
Node that is available to be consumed by normal Pods.

You may read more about capacity and allocatable resources while learning how
to [reserve compute resources](/docs/tasks/administer-cluster/reserve-compute-resources/#node-allocatable)
on a Node.
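The gap between the two blocks is what has been reserved for the system and for Kubernetes daemons. A sketch that prints both side by side, assuming `jq` is available:

```shell
# Show CPU and memory as capacity/allocatable pairs, one node per line
kubectl get nodes -o json | jq -r '.items[]
  | "\(.metadata.name): cpu \(.status.capacity.cpu)/\(.status.allocatable.cpu), memory \(.status.capacity.memory)/\(.status.allocatable.memory)"'
```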
## Info

Describes general information about the node, such as kernel version, Kubernetes
version (kubelet and kube-proxy version), container runtime details, and which
operating system the node uses.
The kubelet gathers this information from the node and publishes it into
the Kubernetes API.
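This information lives under `.status.nodeInfo`; a sketch with a placeholder node name:

```shell
# Print the nodeInfo block: kernel, OS, kubelet and runtime versions, and more
kubectl get node worker-1 -o jsonpath='{.status.nodeInfo}{"\n"}'
```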
## Heartbeats

Heartbeats, sent by Kubernetes nodes, help your cluster determine the
availability of each node, and take action when failures are detected.
For nodes there are two forms of heartbeats:

* updates to the `.status` of a Node
* [Lease](/docs/concepts/architecture/leases/) objects
  within the `kube-node-lease`
  {{< glossary_tooltip term_id="namespace" text="namespace">}}.
  Each Node has an associated Lease object.
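Each node's Lease is named after the node itself, so you can inspect the heartbeat objects directly:

```shell
# List the per-node heartbeat Leases; each one is named after its node
kubectl get leases -n kube-node-lease
```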
Compared to updates to `.status` of a Node, a Lease is a lightweight resource.
Using Leases for heartbeats reduces the performance impact of these updates
for large clusters.

The kubelet is responsible for creating and updating the `.status` of Nodes,
and for updating their related Leases.
- The kubelet updates the node's `.status` either when there is a change in status
  or if there has been no update for a configured interval. The default interval
  for `.status` updates to Nodes is 5 minutes, which is much longer than the 40
  second default timeout for unreachable nodes.
- The kubelet creates and then updates its Lease object every 10 seconds
  (the default update interval). Lease updates occur independently from
  updates to the Node's `.status`. If the Lease update fails, the kubelet retries,
  using exponential backoff that starts at 200 milliseconds and is capped at 7 seconds.
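To watch a node's heartbeat in real time, poll the Lease's `renewTime` field, which advances on each beat; a sketch with a placeholder node name:

```shell
# Poll the Lease's renewTime; it should advance roughly every 10 seconds
while true; do
  kubectl get lease worker-1 -n kube-node-lease -o jsonpath='{.spec.renewTime}{"\n"}'
  sleep 5
done
```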
