---
content_type: reference
title: 节点状态
weight: 80
---
<!--
content_type: reference
title: Node Status
weight: 80
-->

<!-- overview -->

<!--
The status of a [node](/docs/concepts/architecture/nodes/) in Kubernetes is a critical
aspect of managing a Kubernetes cluster. In this article, we'll cover the basics of
monitoring and maintaining node status to ensure a healthy and stable cluster.
-->
在 Kubernetes 中,[节点](/zh-cn/docs/concepts/architecture/nodes/)的状态是管理 Kubernetes
集群的一个关键方面。在本文中,我们将简要介绍如何监控和维护节点状态以确保集群的健康和稳定。

<!--
## Node status fields

A Node's status contains the following information:

* [Addresses](#addresses)
* [Conditions](#condition)
* [Capacity and Allocatable](#capacity)
* [Info](#info)
-->
## 节点状态字段 {#node-status-fields}

一个节点的状态包含以下信息:

* [地址(Addresses)](#addresses)
* [状况(Condition)](#condition)
* [容量与可分配(Capacity and Allocatable)](#capacity)
* [信息(Info)](#info)

<!--
You can use `kubectl` to view a Node's status and other details:

```shell
kubectl describe node <insert-node-name-here>
```

Each section of the output is described below.
-->
你可以使用 `kubectl` 来查看节点状态和其他细节信息:

```shell
kubectl describe node <节点名称>
```

下面对输出的每个部分进行详细描述。

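<!--
As a supplement, you can also use a JSONPath expression with `kubectl` to read
just the `.status` of a Node. The node name below is a placeholder:
-->
作为补充,你也可以结合 JSONPath 表达式使用 `kubectl`,只读取节点的 `.status` 部分。
下面命令中的节点名称仅为占位符:

```shell
# 以 JSON 形式输出指定节点的整个 .status
kubectl get node <节点名称> -o jsonpath='{.status}'
```
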
<!--
## Addresses

The usage of these fields varies depending on your cloud provider or bare metal configuration.
-->
## 地址 {#addresses}

这些字段的用法取决于你的云服务商或者物理机配置。

<!--
* HostName: The hostname as reported by the node's kernel. Can be overridden via the kubelet
  `--hostname-override` parameter.
* ExternalIP: Typically the IP address of the node that is externally routable (available from
  outside the cluster).
* InternalIP: Typically the IP address of the node that is routable only within the cluster.
-->
* HostName:由节点的内核报告的主机名。可以通过 kubelet 的 `--hostname-override` 参数覆盖。
* ExternalIP:通常是节点的可外部路由(从集群外可访问)的 IP 地址。
* InternalIP:通常是节点的仅可在集群内部路由的 IP 地址。

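<!--
For example, to list only the address information, you can iterate over
`.status.addresses`; the node name is a placeholder:
-->
例如,要只列出地址信息,可以遍历 `.status.addresses`;其中的节点名称为占位符:

```shell
# 逐行打印每个地址的类型与取值
kubectl get node <节点名称> -o jsonpath='{range .status.addresses[*]}{.type}{"\t"}{.address}{"\n"}{end}'
```
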
<!--
## Conditions {#condition}

The `conditions` field describes the status of all `Running` nodes. Examples of conditions include:
-->
## 状况 {#condition}

`conditions` 字段描述了所有 `Running` 节点的状况。状况的示例包括:

<!--
{{< table caption = "Node conditions, and a description of when each condition applies." >}}
| Node Condition | Description |
|----------------------|-------------|
| `Ready` | `True` if the node is healthy and ready to accept pods, `False` if the node is not healthy and is not accepting pods, and `Unknown` if the node controller has not heard from the node in the last `node-monitor-grace-period` (default is 40 seconds) |
| `DiskPressure` | `True` if pressure exists on the disk size—that is, if the disk capacity is low; otherwise `False` |
| `MemoryPressure` | `True` if pressure exists on the node memory—that is, if the node memory is low; otherwise `False` |
| `PIDPressure` | `True` if pressure exists on the processes—that is, if there are too many processes on the node; otherwise `False` |
| `NetworkUnavailable` | `True` if the network for the node is not correctly configured, otherwise `False` |
{{< /table >}}
-->
{{< table caption = "节点状况及每种状况适用场景的描述" >}}
| 节点状况 | 描述 |
|----------------|-------------|
| `Ready` | 如果节点是健康的并已经准备好接收 Pod 则为 `True`;`False` 表示节点不健康而且不能接收 Pod;`Unknown` 表示节点控制器在最近 `node-monitor-grace-period` 期间(默认 40 秒)没有收到节点的消息 |
| `DiskPressure` | `True` 表示节点存在磁盘空间压力,即磁盘可用量低;否则为 `False` |
| `MemoryPressure` | `True` 表示节点存在内存压力,即节点内存可用量低;否则为 `False` |
| `PIDPressure` | `True` 表示节点存在进程压力,即节点上进程过多;否则为 `False` |
| `NetworkUnavailable` | `True` 表示节点网络配置不正确;否则为 `False` |
{{< /table >}}

{{< note >}}
<!--
If you use command-line tools to print details of a cordoned Node, the Condition includes
`SchedulingDisabled`. `SchedulingDisabled` is not a Condition in the Kubernetes API; instead,
cordoned nodes are marked Unschedulable in their spec.
-->
如果使用命令行工具来打印已保护(Cordoned)节点的细节,其中的 Condition 字段可能包括
`SchedulingDisabled`。`SchedulingDisabled` 不是 Kubernetes API 中定义的 Condition;
相反,被保护起来的节点在其规约(spec)中被标记为不可调度(Unschedulable)。
{{< /note >}}

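<!--
For example, the following is a sketch of cordoning a node and then checking the
effect on its spec; the node name is a placeholder:
-->
例如,下面简要演示如何保护(Cordon)一个节点并检查其规约所受的影响;
其中的节点名称为占位符:

```shell
# 将节点标记为不可调度
kubectl cordon <节点名称>

# 查看节点规约中的 unschedulable 字段,预期输出为 true
kubectl get node <节点名称> -o jsonpath='{.spec.unschedulable}'

# 恢复节点的可调度性
kubectl uncordon <节点名称>
```
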
<!--
In the Kubernetes API, a node's condition is represented as part of the `.status`
of the Node resource. For example, the following JSON structure describes a healthy node:
-->
在 Kubernetes API 中,节点的状况表示为 Node 资源的 `.status` 的一部分。
例如,以下 JSON 结构描述了一个健康节点:

```json
"conditions": [
  {
    "type": "Ready",
    "status": "True",
    "reason": "KubeletReady",
    "message": "kubelet is posting ready status",
    "lastHeartbeatTime": "2019-06-05T18:38:35Z",
    "lastTransitionTime": "2019-06-05T11:41:27Z"
  }
]
```

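<!--
To check just the Ready condition of a node, you can filter the conditions with
a JSONPath expression; the node name is a placeholder:
-->
如果只想查看节点的 Ready 状况,可以用 JSONPath 表达式对状况进行过滤;
其中的节点名称为占位符:

```shell
# 输出 Ready 状况的 status 字段,取值为 True、False 或 Unknown
kubectl get node <节点名称> -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
```
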
<!--
When problems occur on nodes, the Kubernetes control plane automatically creates
[taints](/docs/concepts/scheduling-eviction/taint-and-toleration/) that match the conditions
affecting the node. An example of this is when the `status` of the Ready condition
remains `Unknown` or `False` for longer than the kube-controller-manager's `NodeMonitorGracePeriod`,
which defaults to 40 seconds. This will cause either a `node.kubernetes.io/unreachable` taint, for an `Unknown` status,
or a `node.kubernetes.io/not-ready` taint, for a `False` status, to be added to the Node.
-->
当节点上出现问题时,Kubernetes 控制面会自动创建与影响节点的状况对应的
[污点](/zh-cn/docs/concepts/scheduling-eviction/taint-and-toleration/)。
例如,当 Ready 状况的 `status` 保持 `Unknown` 或 `False` 的时间长于
kube-controller-manager 的 `NodeMonitorGracePeriod`(默认为 40 秒)时,
节点会被添加 `node.kubernetes.io/unreachable` 污点(对应 `Unknown` 状态)或
`node.kubernetes.io/not-ready` 污点(对应 `False` 状态)。

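<!--
For example, you can check whether such taints are present on a node;
the node name is a placeholder:
-->
例如,你可以检查节点上是否存在这类污点;其中的节点名称为占位符:

```shell
# 以 JSON 形式输出节点上的污点列表;如果节点没有污点,则没有输出
kubectl get node <节点名称> -o jsonpath='{.spec.taints}'
```
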
<!--
These taints affect pending pods as the scheduler takes the Node's taints into consideration when
assigning a pod to a Node. Existing pods scheduled to the node may be evicted due to the application
of `NoExecute` taints. Pods may also have {{< glossary_tooltip text="tolerations" term_id="toleration" >}} that let
them schedule to and continue running on a Node even though it has a specific taint.
-->
这些污点会影响悬决的 Pod,因为调度器在将 Pod 分配到节点时会考虑节点的污点。
已经调度到该节点的 Pod 可能会由于 `NoExecute` 污点的施加而被驱逐。
Pod 还可以设置{{< glossary_tooltip text="容忍度" term_id="toleration" >}},
使得这些 Pod 仍然能够调度到且继续运行在设置了特定污点的节点上。

<!--
See [Taint Based Evictions](/docs/concepts/scheduling-eviction/taint-and-toleration/#taint-based-evictions) and
[Taint Nodes by Condition](/docs/concepts/scheduling-eviction/taint-and-toleration/#taint-nodes-by-condition)
for more details.
-->
进一步的细节可参阅[基于污点的驱逐](/zh-cn/docs/concepts/scheduling-eviction/taint-and-toleration/#taint-based-evictions)
和[根据状况为节点设置污点](/zh-cn/docs/concepts/scheduling-eviction/taint-and-toleration/#taint-nodes-by-condition)。

<!--
## Capacity and Allocatable {#capacity}

Describes the resources available on the node: CPU, memory, and the maximum
number of pods that can be scheduled onto the node.
-->
## 容量(Capacity)与可分配(Allocatable) {#capacity}

这两个值描述节点上的可用资源:CPU、内存和可以调度到节点上的 Pod 的个数上限。

<!--
The fields in the capacity block indicate the total amount of resources that a
Node has. The allocatable block indicates the amount of resources on a
Node that is available to be consumed by normal Pods.
-->
`capacity` 块中的字段标示节点拥有的资源总量。
`allocatable` 块指示节点上可供普通 Pod 使用的资源量。

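<!--
For example, you can compare the two blocks for a given node;
the node name is a placeholder:
-->
例如,你可以对比某个给定节点的这两组数值;其中的节点名称为占位符:

```shell
# 先后输出节点的资源总量与可供普通 Pod 使用的资源量
kubectl get node <节点名称> -o jsonpath='{.status.capacity}{"\n"}{.status.allocatable}{"\n"}'
```
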
<!--
You may read more about capacity and allocatable resources while learning how
to [reserve compute resources](/docs/tasks/administer-cluster/reserve-compute-resources/#node-allocatable)
on a Node.
-->
你可以通过学习如何在节点上[预留计算资源](/zh-cn/docs/tasks/administer-cluster/reserve-compute-resources/#node-allocatable)
来进一步了解有关容量和可分配资源的信息。

<!--
## Info

Describes general information about the node, such as kernel version, Kubernetes
version (kubelet and kube-proxy version), container runtime details, and which
operating system the node uses.
The kubelet gathers this information from the node and publishes it into
the Kubernetes API.
-->
## 信息(Info) {#info}

Info 指的是节点的一般信息,如内核版本、Kubernetes 版本(`kubelet` 和 `kube-proxy` 版本)、
容器运行时详细信息,以及节点使用的操作系统。
`kubelet` 从节点收集这些信息并将其发布到 Kubernetes API。

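<!--
This information is surfaced in `.status.nodeInfo`; for example
(the node name is a placeholder):
-->
这些信息呈现在 `.status.nodeInfo` 中,例如(节点名称为占位符):

```shell
# 以 JSON 形式输出节点的内核版本、kubelet 版本、操作系统等信息
kubectl get node <节点名称> -o jsonpath='{.status.nodeInfo}'
```
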
<!--
## Heartbeats

Heartbeats, sent by Kubernetes nodes, help your cluster determine the
availability of each node, and to take action when failures are detected.
-->
## 心跳 {#heartbeats}

Kubernetes 节点发送的心跳帮助你的集群确定每个节点的可用性,并在检测到故障时采取行动。

<!--
For nodes there are two forms of heartbeats:

* updates to the `.status` of a Node
* [Lease](/docs/concepts/architecture/leases/) objects
  within the `kube-node-lease`
  {{< glossary_tooltip term_id="namespace" text="namespace">}}.
  Each Node has an associated Lease object.
-->
对于节点,有两种形式的心跳:

* 更新节点的 `.status`
* `kube-node-lease` {{< glossary_tooltip term_id="namespace" text="名字空间">}}中的
  [Lease(租约)](/zh-cn/docs/concepts/architecture/leases/)对象。
  每个节点都有一个关联的 Lease 对象。

<!--
Compared to updates to `.status` of a Node, a Lease is a lightweight resource.
Using Leases for heartbeats reduces the performance impact of these updates
for large clusters.

The kubelet is responsible for creating and updating the `.status` of Nodes,
and for updating their related Leases.
-->
与节点的 `.status` 更新相比,Lease 是一种轻量级资源。
在大型集群中,使用 Lease 来表达心跳可以减少这些更新对性能的影响。

kubelet 负责创建和更新节点的 `.status`,以及更新它们对应的 Lease。

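<!--
For example, you can inspect the Lease object that corresponds to a node;
the Lease has the same name as the node, which is a placeholder below:
-->
例如,你可以查看与某节点对应的 Lease 对象;Lease 与节点同名,
下面命令中的节点名称为占位符:

```shell
# 查看 kube-node-lease 名字空间中与节点同名的 Lease 对象
kubectl get lease <节点名称> -n kube-node-lease -o yaml
```
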
<!--
- The kubelet updates the node's `.status` either when there is a change in status
  or if there has been no update for a configured interval. The default interval
  for `.status` updates to Nodes is 5 minutes, which is much longer than the 40
  second default timeout for unreachable nodes.
- The kubelet creates and then updates its Lease object every 10 seconds
  (the default update interval). Lease updates occur independently from
  updates to the Node's `.status`. If the Lease update fails, the kubelet retries,
  using exponential backoff that starts at 200 milliseconds and is capped at 7 seconds.
-->
- 当节点状态发生变化时,或者在配置的时间间隔内没有更新时,kubelet 会更新 `.status`。
  `.status` 更新的默认间隔为 5 分钟(比节点不可达事件的 40 秒默认超时时间长很多)。
- kubelet 会创建 Lease 对象,并在之后每 10 秒(默认更新间隔)更新该对象。
  Lease 的更新独立于节点的 `.status` 更新而发生。
  如果 Lease 的更新操作失败,kubelet 会采用指数回退机制,从 200 毫秒开始重试,
  最长重试间隔为 7 秒钟。