content/en/docs/concepts/architecture/nodes.md
39 additions & 36 deletions
@@ -31,7 +31,7 @@ The [components](/docs/concepts/overview/components/#node-components) on a node
There are two main ways to have Nodes added to the {{< glossary_tooltip text="API server" term_id="kube-apiserver" >}}:

1. The kubelet on a node self-registers to the control plane
-2. You, or another human user, manually add a Node object
+2. You (or another human user) manually add a Node object

After you create a Node object, or the kubelet on a node self-registers, the
control plane checks whether the new Node object is valid. For example, if you
@@ -52,8 +52,8 @@ try to create a Node from the following JSON manifest:

Kubernetes creates a Node object internally (the representation). Kubernetes checks
that a kubelet has registered to the API server that matches the `metadata.name`
-field of the Node. If the node is healthy (if all necessary services are running),
-it is eligible to run a Pod. Otherwise, that node is ignored for any cluster activity
+field of the Node. If the node is healthy (i.e. all necessary services are running),
+then it is eligible to run a Pod. Otherwise, that node is ignored for any cluster activity
until it becomes healthy.

{{< note >}}
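For reference (file context, not part of this change): the JSON manifest that the hunk above refers to is a minimal Node object, roughly along these lines; the node name and label value are illustrative placeholders:

```json
{
  "kind": "Node",
  "apiVersion": "v1",
  "metadata": {
    "name": "10.240.79.157",
    "labels": {
      "name": "my-first-k8s-node"
    }
  }
}
```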
@@ -96,14 +96,14 @@ You can create and modify Node objects using
When you want to create Node objects manually, set the kubelet flag `--register-node=false`.

You can modify Node objects regardless of the setting of `--register-node`.
-For example, you can set labels on an existing Node, or mark it unschedulable.
+For example, you can set labels on an existing Node or mark it unschedulable.

You can use labels on Nodes in conjunction with node selectors on Pods to control
scheduling. For example, you can constrain a Pod to only be eligible to run on
a subset of the available nodes.

Marking a node as unschedulable prevents the scheduler from placing new pods onto
-that Node, but does not affect existing Pods on the Node. This is useful as a
+that Node but does not affect existing Pods on the Node. This is useful as a
preparatory step before a node reboot or other maintenance.

To mark a Node unschedulable, run:
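The command itself falls outside this hunk's context; for reference, cordoning a node is done with `kubectl` (the node name is a placeholder):

```shell
kubectl cordon $NODENAME
```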
@@ -179,14 +179,14 @@ The node condition is represented as a JSON object. For example, the following s
]
```

-If the Status of the Ready condition remains `Unknown` or `False` for longer than the `pod-eviction-timeout` (an argument passed to the {{< glossary_tooltip text="kube-controller-manager" term_id="kube-controller-manager" >}}), all the Pods on the node are scheduled for deletion by the node controller. The default eviction timeout duration is **five minutes**. In some cases when the node is unreachable, the API server is unable to communicate with the kubelet on the node. The decision to delete the pods cannot be communicated to the kubelet until communication with the API server is re-established. In the meantime, the pods that are scheduled for deletion may continue to run on the partitioned node.
+If the Status of the Ready condition remains `Unknown` or `False` for longer than the `pod-eviction-timeout` (an argument passed to the {{< glossary_tooltip text="kube-controller-manager" term_id="kube-controller-manager" >}}), then all the Pods on the node are scheduled for deletion by the node controller. The default eviction timeout duration is **five minutes**. In some cases when the node is unreachable, the API server is unable to communicate with the kubelet on the node. The decision to delete the pods cannot be communicated to the kubelet until communication with the API server is re-established. In the meantime, the pods that are scheduled for deletion may continue to run on the partitioned node.

The node controller does not force delete pods until it is confirmed that they have stopped
running in the cluster. You can see the pods that might be running on an unreachable node as
being in the `Terminating` or `Unknown` state. In cases where Kubernetes cannot deduce from the
underlying infrastructure if a node has permanently left a cluster, the cluster administrator
-may need to delete the node object by hand. Deleting the node object from Kubernetes causes
-all the Pod objects running on the node to be deleted from the API server, and frees up their
+may need to delete the node object by hand. Deleting the node object from Kubernetes causes
+all the Pod objects running on the node to be deleted from the API server and frees up their
names.

The node lifecycle controller automatically creates
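For reference (file context, not part of this change): the JSON structure the hunk header refers to is the `conditions` array in the Node's `.status`; a healthy node's Ready condition looks roughly like this, with illustrative timestamps:

```json
"conditions": [
  {
    "type": "Ready",
    "status": "True",
    "reason": "KubeletReady",
    "message": "kubelet is posting ready status",
    "lastHeartbeatTime": "2019-06-05T18:38:35Z",
    "lastTransitionTime": "2019-06-05T11:41:27Z"
  }
]
```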
@@ -199,7 +199,7 @@ for more details.

### Capacity and Allocatable {#capacity}

-Describes the resources available on the node: CPU, memory and the maximum
+Describes the resources available on the node: CPU, memory, and the maximum
number of pods that can be scheduled onto the node.

The fields in the capacity block indicate the total amount of resources that a
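For reference (not part of this change): capacity and allocatable are reported under the Node's `.status`; a minimal sketch with illustrative values:

```yaml
status:
  capacity:
    cpu: "4"
    memory: 16Gi
    pods: "110"
  allocatable:
    cpu: "3920m"
    memory: 14Gi
    pods: "110"
```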
@@ -225,18 +225,19 @@ CIDR block to the node when it is registered (if CIDR assignment is turned on).

The second is keeping the node controller's internal list of nodes up to date with
the cloud provider's list of available machines. When running in a cloud
-environment, whenever a node is unhealthy, the node controller asks the cloud
+environment and whenever a node is unhealthy, the node controller asks the cloud
provider if the VM for that node is still available. If not, the node
controller deletes the node from its list of nodes.

The third is monitoring the nodes' health. The node controller is
-responsible for updating the NodeReady condition of NodeStatus to
-ConditionUnknown when a node becomes unreachable (i.e. the node controller stops
-receiving heartbeats for some reason, for example due to the node being down), and then later evicting
-all the pods from the node (using graceful termination) if the node continues
-to be unreachable. (The default timeouts are 40s to start reporting
-ConditionUnknown and 5m after that to start evicting pods.) The node controller
-checks the state of each node every `--node-monitor-period` seconds.
+responsible for:
+- Updating the NodeReady condition of NodeStatus to ConditionUnknown when a node
+  becomes unreachable, as the node controller stops receiving heartbeats for some
+  reason such as the node being down.
+- Evicting all the pods from the node using graceful termination if
+  the node continues to be unreachable. The default timeouts are 40s to start
+  reporting ConditionUnknown and 5m after that to start evicting pods.
+The node controller checks the state of each node every `--node-monitor-period` seconds.

#### Heartbeats
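For reference (not part of this change): the monitoring period and the timeouts mentioned above correspond to kube-controller-manager flags; shown here with their documented defaults, for illustration only:

```shell
kube-controller-manager \
  --node-monitor-period=5s \
  --node-monitor-grace-period=40s \
  --pod-eviction-timeout=5m0s
```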
@@ -252,13 +253,14 @@ of the node heartbeats as the cluster scales.
The kubelet is responsible for creating and updating the `NodeStatus` and
a Lease object.

-- The kubelet updates the `NodeStatus` either when there is change in status,
+- The kubelet updates the `NodeStatus` either when there is change in status
  or if there has been no update for a configured interval. The default interval
-  for `NodeStatus` updates is 5 minutes (much longer than the 40 second default
-  timeout for unreachable nodes).
+  for `NodeStatus` updates is 5 minutes, which is much longer than the 40 second default
+  timeout for unreachable nodes.
- The kubelet creates and then updates its Lease object every 10 seconds
  (the default update interval). Lease updates occur independently from the
-  `NodeStatus` updates. If the Lease update fails, the kubelet retries with exponential backoff starting at 200 milliseconds and capped at 7 seconds.
+  `NodeStatus` updates. If the Lease update fails, the kubelet retries with
+  exponential backoff starting at 200 milliseconds and capped at 7 seconds.

#### Reliability
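For reference (not part of this change): the Lease object the kubelet renews lives in the `kube-node-lease` namespace; a minimal sketch, with an illustrative node name and renew time:

```yaml
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  name: example-node
  namespace: kube-node-lease
spec:
  holderIdentity: example-node
  leaseDurationSeconds: 40
  renewTime: "2021-04-01T10:00:00.000000Z"
```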
@@ -269,23 +271,24 @@ from more than 1 node per 10 seconds.
The node eviction behavior changes when a node in a given availability zone
becomes unhealthy. The node controller checks what percentage of nodes in the zone
are unhealthy (NodeReady condition is ConditionUnknown or ConditionFalse) at
-the same time. If the fraction of unhealthy nodes is at least
-`--unhealthy-zone-threshold` (default 0.55) then the eviction rate is reduced:
-if the cluster is small (i.e. has less than or equal to
-`--large-cluster-size-threshold` nodes - default 50) then evictions are
-stopped, otherwise the eviction rate is reduced to
-`--secondary-node-eviction-rate` (default 0.01) per second. The reason these
-policies are implemented per availability zone is because one availability zone
-might become partitioned from the master while the others remain connected. If
-your cluster does not span multiple cloud provider availability zones, then
-there is only one availability zone (the whole cluster).
+the same time:
+- If the fraction of unhealthy nodes is at least `--unhealthy-zone-threshold`
+  (default 0.55), then the eviction rate is reduced.
+- If the cluster is small (i.e. has less than or equal to
+  `--large-cluster-size-threshold` nodes - default 50), then evictions are stopped.
+- Otherwise, the eviction rate is reduced to `--secondary-node-eviction-rate`
+  (default 0.01) per second.
+The reason these policies are implemented per availability zone is because one
+availability zone might become partitioned from the master while the others remain
+connected. If your cluster does not span multiple cloud provider availability zones,
+then there is only one availability zone (i.e. the whole cluster).

A key reason for spreading your nodes across availability zones is so that the
workload can be shifted to healthy zones when one entire zone goes down.
-Therefore, if all nodes in a zone are unhealthy then the node controller evicts at
+Therefore, if all nodes in a zone are unhealthy, then the node controller evicts at
the normal rate of `--node-eviction-rate`. The corner case is when all zones are
completely unhealthy (i.e. there are no healthy nodes in the cluster). In such a
-case, the node controller assumes that there's some problem with master
+case, the node controller assumes that there is some problem with master
connectivity and stops all evictions until some connectivity is restored.

The node controller is also responsible for evicting pods running on nodes with
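For reference (not part of this change): the thresholds and rates discussed in this hunk are also kube-controller-manager flags; shown here with their documented defaults, for illustration only:

```shell
kube-controller-manager \
  --node-eviction-rate=0.1 \
  --secondary-node-eviction-rate=0.01 \
  --unhealthy-zone-threshold=0.55 \
  --large-cluster-size-threshold=50
```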
@@ -303,8 +306,8 @@ eligible for, effectively removing incoming load balancer traffic from the cordo

### Node capacity

-Node objects track information about the Node's resource capacity (for example: the amount
-of memory available, and the number of CPUs).
+Node objects track information about the Node's resource capacity: for example, the amount
+of memory available and the number of CPUs.
Nodes that [self register](#self-registration-of-nodes) report their capacity during
registration. If you [manually](#manual-node-administration) add a Node, then
you need to set the node's capacity information when you add it.
@@ -338,7 +341,7 @@ for more information.
If you have enabled the `GracefulNodeShutdown`[feature gate](/docs/reference/command-line-tools-reference/feature-gates/), then the kubelet attempts to detect the node system shutdown and terminates pods running on the node.
Kubelet ensures that pods follow the normal [pod termination process](/docs/concepts/workloads/pods/pod-lifecycle/#pod-termination) during the node shutdown.

-When the `GracefulNodeShutdown` feature gate is enabled, kubelet uses [systemd inhibitor locks](https://www.freedesktop.org/wiki/Software/systemd/inhibit/) to delay the node shutdown with a given duration. During a shutdown kubelet terminates pods in two phases:
+When the `GracefulNodeShutdown` feature gate is enabled, kubelet uses [systemd inhibitor locks](https://www.freedesktop.org/wiki/Software/systemd/inhibit/) to delay the node shutdown with a given duration. During a shutdown, kubelet terminates pods in two phases:

1. Terminate regular pods running on the node.
2. Terminate [critical pods](/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/#marking-pod-as-critical) running on the node.
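For reference (not part of this change): graceful node shutdown is tuned through the kubelet configuration file; a minimal sketch, assuming the `shutdownGracePeriod` and `shutdownGracePeriodCriticalPods` fields and illustrative durations:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  GracefulNodeShutdown: true
shutdownGracePeriod: 30s
shutdownGracePeriodCriticalPods: 10s
```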