
Commit 1b8eeb5

Tim Bannister committed
Update the node concept
Modernise the page by:

- rewording to follow the style guide
- adding some glossary tooltips
- linking to new-style API reference
- linking to Safely Drain a Node

plus general tweaks.
1 parent a54e81e commit 1b8eeb5

File tree

  • content/en/docs/concepts/architecture

1 file changed (+76, -46 lines)

content/en/docs/concepts/architecture/nodes.md

Lines changed: 76 additions & 46 deletions
@@ -122,6 +122,9 @@ To mark a Node unschedulable, run:
 kubectl cordon $NODENAME
 ```

+See [Safely Drain a Node](/docs/tasks/administer-cluster/safely-drain-node/)
+for more details.
+
 {{< note >}}
 Pods that are part of a {{< glossary_tooltip term_id="daemonset" >}} tolerate
 being run on an unschedulable Node. DaemonSets typically provide node-local services
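
The new cross-reference covers draining in depth. As a rough sketch (the node name `node-1` is a placeholder), the usual cordon-and-drain workflow looks something like this:

```shell
# Mark the node unschedulable so that no new Pods are placed on it
kubectl cordon node-1

# Evict the Pods already running there; DaemonSet-managed Pods are skipped
# because the DaemonSet controller would immediately recreate them
kubectl drain node-1 --ignore-daemonsets

# Once maintenance is finished, make the node schedulable again
kubectl uncordon node-1
```
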
@@ -162,8 +165,8 @@ The `conditions` field describes the status of all `Running` nodes. Examples of
 | Node Condition | Description |
 |----------------------|-------------|
 | `Ready` | `True` if the node is healthy and ready to accept pods, `False` if the node is not healthy and is not accepting pods, and `Unknown` if the node controller has not heard from the node in the last `node-monitor-grace-period` (default is 40 seconds) |
-| `DiskPressure` | `True` if pressure exists on the disk size--that is, if the disk capacity is low; otherwise `False` |
-| `MemoryPressure` | `True` if pressure exists on the node memory--that is, if the node memory is low; otherwise `False` |
+| `DiskPressure` | `True` if pressure exists on the disk size—that is, if the disk capacity is low; otherwise `False` |
+| `MemoryPressure` | `True` if pressure exists on the node memory—that is, if the node memory is low; otherwise `False` |
 | `PIDPressure` | `True` if pressure exists on the processes—that is, if there are too many processes on the node; otherwise `False` |
 | `NetworkUnavailable` | `True` if the network for the node is not correctly configured, otherwise `False` |
 {{< /table >}}
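
These conditions are part of every Node's `.status` and can be inspected directly. A minimal sketch, assuming a node named `node-1`:

```shell
# Summarize every node and its Ready condition
kubectl get nodes

# Show the full condition table for one node, including DiskPressure,
# MemoryPressure, PIDPressure and NetworkUnavailable
kubectl describe node node-1

# Or read the raw conditions array straight from the Node's .status
kubectl get node node-1 -o jsonpath='{.status.conditions}'
```
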
@@ -174,7 +177,8 @@ If you use command-line tools to print details of a cordoned Node, the Condition
 cordoned nodes are marked Unschedulable in their spec.
 {{< /note >}}

-The node condition is represented as a JSON object. For example, the following structure describes a healthy node:
+In the Kubernetes API, a node's condition is represented as part of the `.status`
+of the Node resource. For example, the following JSON structure describes a healthy node:

 ```json
 "conditions": [
@@ -189,7 +193,17 @@ The node condition is represented as a JSON object. For example, the following s
 ]
 ```

-If the Status of the Ready condition remains `Unknown` or `False` for longer than the `pod-eviction-timeout` (an argument passed to the {{< glossary_tooltip text="kube-controller-manager" term_id="kube-controller-manager" >}}), then all the Pods on the node are scheduled for deletion by the node controller. The default eviction timeout duration is **five minutes**. In some cases when the node is unreachable, the API server is unable to communicate with the kubelet on the node. The decision to delete the pods cannot be communicated to the kubelet until communication with the API server is re-established. In the meantime, the pods that are scheduled for deletion may continue to run on the partitioned node.
+If the `status` of the Ready condition remains `Unknown` or `False` for longer
+than the `pod-eviction-timeout` (an argument passed to the
+{{< glossary_tooltip text="kube-controller-manager" term_id="kube-controller-manager"
+>}}), then the [node controller](#node-controller) triggers
+{{< glossary_tooltip text="API-initiated eviction" term_id="api-eviction" >}}
+for all Pods assigned to that node. The default eviction timeout duration is
+**five minutes**.
+In some cases when the node is unreachable, the API server is unable to communicate
+with the kubelet on the node. The decision to delete the pods cannot be communicated to
+the kubelet until communication with the API server is re-established. In the meantime,
+the pods that are scheduled for deletion may continue to run on the partitioned node.

 The node controller does not force delete pods until it is confirmed that they have stopped
 running in the cluster. You can see the pods that might be running on an unreachable node as
@@ -199,10 +213,12 @@ may need to delete the node object by hand. Deleting the node object from Kubern
 all the Pod objects running on the node to be deleted from the API server and frees up their
 names.

-The node lifecycle controller automatically creates
-[taints](/docs/concepts/scheduling-eviction/taint-and-toleration/) that represent conditions.
+When problems occur on nodes, the Kubernetes control plane automatically creates
+[taints](/docs/concepts/scheduling-eviction/taint-and-toleration/) that match the conditions
+affecting the node.
 The scheduler takes the Node's taints into consideration when assigning a Pod to a Node.
-Pods can also have tolerations which let them tolerate a Node's taints.
+Pods can also have {{< glossary_tooltip text="tolerations" term_id="toleration" >}} that let
+them run on a Node even though it has a specific taint.

 See [Taint Nodes by Condition](/docs/concepts/scheduling-eviction/taint-and-toleration/#taint-nodes-by-condition)
 for more details.
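
To illustrate the taint and toleration relationship described above (the node name and the taint key/value are placeholders, not values Kubernetes sets itself):

```shell
# List the taints currently set on a node
kubectl get node node-1 -o jsonpath='{.spec.taints}'

# Add a NoSchedule taint by hand; only Pods with a matching toleration
# can be scheduled onto the node while the taint is present
kubectl taint nodes node-1 example-key=example-value:NoSchedule

# Remove the taint again (the trailing "-" deletes it)
kubectl taint nodes node-1 example-key=example-value:NoSchedule-
```
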
@@ -222,10 +238,43 @@ on a Node.

 ### Info

-Describes general information about the node, such as kernel version, Kubernetes version (kubelet and kube-proxy version), Docker version (if used), and OS name.
-This information is gathered by Kubelet from the node.
+Describes general information about the node, such as kernel version, Kubernetes
+version (kubelet and kube-proxy version), container runtime details, and which
+operating system the node uses.
+The kubelet gathers this information from the node and publishes it into
+the Kubernetes API.
+
+## Heartbeats
+
+Heartbeats, sent by Kubernetes nodes, help your cluster determine the
+availability of each node, and take action when failures are detected.
+
+For nodes there are two forms of heartbeats:

-### Node controller
+* updates to the `.status` of a Node
+* [Lease](/docs/reference/kubernetes-api/cluster-resources/lease-v1/) objects
+  within the `kube-node-lease`
+  {{< glossary_tooltip term_id="namespace" text="namespace">}}.
+  Each Node has an associated Lease object.
+
+Compared to updates to `.status` of a Node, a Lease is a lightweight resource.
+Using Leases for heartbeats reduces the performance impact of these updates
+for large clusters.
+
+The kubelet is responsible for creating and updating the `.status` of Nodes,
+and for updating their related Leases.
+
+- The kubelet updates the node's `.status` either when there is a change in status
+  or if there has been no update for a configured interval. The default interval
+  for `.status` updates to Nodes is 5 minutes, which is much longer than the 40
+  second default timeout for unreachable nodes.
+- The kubelet creates and then updates its Lease object every 10 seconds
+  (the default update interval). Lease updates occur independently from
+  updates to the Node's `.status`. If the Lease update fails, the kubelet retries,
+  using exponential backoff that starts at 200 milliseconds and is capped at 7 seconds.
+
+
+## Node controller

 The node {{< glossary_tooltip text="controller" term_id="controller" >}} is a
 Kubernetes control plane component that manages various aspects of nodes.
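
The Lease-based heartbeat described in the added section can be observed on a live cluster; a brief sketch, assuming a node named `node-1`:

```shell
# Each node has a matching Lease object in the kube-node-lease namespace
kubectl get leases --namespace kube-node-lease

# spec.renewTime records the most recent heartbeat from that node's kubelet
kubectl get lease node-1 --namespace kube-node-lease -o yaml
```
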
@@ -241,47 +290,26 @@ controller deletes the node from its list of nodes.

 The third is monitoring the nodes' health. The node controller is
 responsible for:
-- Updating the NodeReady condition of NodeStatus to ConditionUnknown when a node
-  becomes unreachable, as the node controller stops receiving heartbeats for some
-  reason such as the node being down.
-- Evicting all the pods from the node using graceful termination if
-  the node continues to be unreachable. The default timeouts are 40s to start
-  reporting ConditionUnknown and 5m after that to start evicting pods.
+- In the case that a node becomes unreachable, updating the NodeReady condition
+  within the Node's `.status`. In this case the node controller sets the
+  NodeReady condition to `ConditionUnknown`.
+- If a node remains unreachable: triggering
+  [API-initiated eviction](/docs/concepts/scheduling-eviction/api-eviction/)
+  for all of the Pods on the unreachable node. By default, the node controller
+  waits 5 minutes between marking the node as `ConditionUnknown` and submitting
+  the first eviction request.

 The node controller checks the state of each node every `--node-monitor-period` seconds.

-#### Heartbeats
-
-Heartbeats, sent by Kubernetes nodes, help determine the availability of a node.
-
-There are two forms of heartbeats: updates of `NodeStatus` and the
-[Lease object](/docs/reference/generated/kubernetes-api/{{< param "version" >}}/#lease-v1-coordination-k8s-io).
-Each Node has an associated Lease object in the `kube-node-lease`
-{{< glossary_tooltip term_id="namespace" text="namespace">}}.
-Lease is a lightweight resource, which improves the performance
-of the node heartbeats as the cluster scales.
-
-The kubelet is responsible for creating and updating the `NodeStatus` and
-a Lease object.
-
-- The kubelet updates the `NodeStatus` either when there is change in status
-  or if there has been no update for a configured interval. The default interval
-  for `NodeStatus` updates is 5 minutes, which is much longer than the 40 second default
-  timeout for unreachable nodes.
-- The kubelet creates and then updates its Lease object every 10 seconds
-  (the default update interval). Lease updates occur independently from the
-  `NodeStatus` updates. If the Lease update fails, the kubelet retries with
-  exponential backoff starting at 200 milliseconds and capped at 7 seconds.
-
-#### Reliability
+### Rate limits on eviction

 In most cases, the node controller limits the eviction rate to
 `--node-eviction-rate` (default 0.1) per second, meaning it won't evict pods
 from more than 1 node per 10 seconds.

 The node eviction behavior changes when a node in a given availability zone
 becomes unhealthy. The node controller checks what percentage of nodes in the zone
-are unhealthy (NodeReady condition is ConditionUnknown or ConditionFalse) at
+are unhealthy (NodeReady condition is `ConditionUnknown` or `ConditionFalse`) at
 the same time:
 - If the fraction of unhealthy nodes is at least `--unhealthy-zone-threshold`
   (default 0.55), then the eviction rate is reduced.
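
The intervals and rates in this section correspond to kube-controller-manager flags. A hedged sketch of setting them explicitly (the values shown are the defaults quoted in the text, plus an assumed 5s monitoring period; this is not tuning advice):

```shell
kube-controller-manager \
  --node-monitor-period=5s \
  --node-monitor-grace-period=40s \
  --pod-eviction-timeout=5m0s \
  --node-eviction-rate=0.1 \
  --unhealthy-zone-threshold=0.55
```
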
@@ -293,23 +321,25 @@ the same time:
 The reason these policies are implemented per availability zone is because one
 availability zone might become partitioned from the master while the others remain
 connected. If your cluster does not span multiple cloud provider availability zones,
-then there is only one availability zone (i.e. the whole cluster).
+then the eviction mechanism does not take per-zone unavailability into account.

 A key reason for spreading your nodes across availability zones is so that the
 workload can be shifted to healthy zones when one entire zone goes down.
 Therefore, if all nodes in a zone are unhealthy, then the node controller evicts at
 the normal rate of `--node-eviction-rate`. The corner case is when all zones are
-completely unhealthy (i.e. there are no healthy nodes in the cluster). In such a
-case, the node controller assumes that there is some problem with master
-connectivity and stops all evictions until some connectivity is restored.
+completely unhealthy (none of the nodes in the cluster are healthy). In such a
+case, the node controller assumes that there is some problem with connectivity
+between the control plane and the nodes, and doesn't perform any evictions.
+(If there has been an outage and some nodes reappear, the node controller does
+evict pods from the remaining nodes that are unhealthy or unreachable).

 The node controller is also responsible for evicting pods running on nodes with
 `NoExecute` taints, unless those pods tolerate that taint.
 The node controller also adds {{< glossary_tooltip text="taints" term_id="taint" >}}
 corresponding to node problems like node unreachable or not ready. This means
 that the scheduler won't place Pods onto unhealthy nodes.

-### Node capacity
+## Resource capacity tracking {#node-capacity}

 Node objects track information about the Node's resource capacity: for example, the amount
 of memory available and the number of CPUs.
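
Those capacity figures live under the Node's `.status` and are easy to inspect; a small sketch, again assuming a node named `node-1`:

```shell
# Capacity: the total resources reported by the node
kubectl get node node-1 -o jsonpath='{.status.capacity}'

# Allocatable: the portion of capacity that is available for Pods
kubectl get node node-1 -o jsonpath='{.status.allocatable}'

# Or view both, along with current requests and limits, in a readable form
kubectl describe node node-1
```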
