* [Migrate Docker Engine nodes from dockershim to cri-dockerd](/docs/tasks/administer-cluster/migrating-from-dockershim/migrate-dockershim-dockerd/)
* [Migrating telemetry and security agents from dockershim](/docs/tasks/administer-cluster/migrating-from-dockershim/migrating-telemetry-and-security-agents/)
content/en/docs/tasks/debug/debug-cluster/_index.md (61 additions, 48 deletions)

@@ -36,7 +36,13 @@ kubectl cluster-info dump
### Example: debugging a down/unreachable node

Sometimes when debugging it can be useful to look at the status of a node -- for example, because
you've noticed strange behavior of a Pod that's running on the node, or to find out why a Pod
won't schedule onto the node. As with Pods, you can use `kubectl describe node` and `kubectl get
node -o yaml` to retrieve detailed information about nodes. For example, here's what you'll see if
a node is down (disconnected from the network, or kubelet dies and won't restart, etc.). Notice
the events that show the node is NotReady, and also notice that the pods are no longer running
(they are evicted after five minutes of NotReady status).

```shell
kubectl get nodes
```
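
If a node shows up as `NotReady` or is missing entirely, a reasonable next step (a sketch;
`kube-worker-1` is a placeholder node name) is to inspect that node directly:

```shell
# Show conditions, capacity, allocated resources, and recent events for the node
kubectl describe node kube-worker-1

# Retrieve the full Node object as YAML for closer inspection
kubectl get node kube-worker-1 -o yaml
```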
@@ -222,60 +228,63 @@ of the relevant log files. On systemd-based systems, you may need to use `journ
### Control Plane nodes

* `/var/log/kube-apiserver.log` - API Server, responsible for serving the API
* `/var/log/kube-scheduler.log` - Scheduler, responsible for making scheduling decisions
* `/var/log/kube-controller-manager.log` - a component that runs most Kubernetes built-in
  {{<glossary_tooltip text="controllers" term_id="controller">}}, with the notable exception of scheduling
  (the kube-scheduler handles scheduling).

### Worker Nodes

* `/var/log/kubelet.log` - logs from the kubelet, responsible for running containers on the node
* `/var/log/kube-proxy.log` - logs from `kube-proxy`, which is responsible for directing traffic to Service endpoints
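
As noted above, on systemd-based systems these components may log to the journal rather than to
files under `/var/log`. A minimal sketch (unit names can vary with how the node was provisioned):

```shell
# Follow kubelet logs on a systemd-based node
journalctl -u kubelet -f

# Show kube-proxy logs since the last boot
# (only applies if kube-proxy runs as a systemd unit rather than as a Pod)
journalctl -u kube-proxy -b
```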
## Cluster failure modes
This is an incomplete list of things that could go wrong, and how to adjust your cluster setup to mitigate the problems.
### Contributing causes

- VM(s) shutdown
- Network partition within cluster, or between cluster and users
- Crashes in Kubernetes software
- Data loss or unavailability of persistent storage (e.g. GCE PD or AWS EBS volume)
- Operator error, for example misconfigured Kubernetes software or application software
### Specific scenarios

- API server VM shutdown or apiserver crashing
  - Results
    - unable to stop, update, or start new pods, services, replication controllers
    - existing pods and services should continue to work normally, unless they depend on the Kubernetes API
- API server backing storage lost
  - Results
    - the kube-apiserver component fails to start successfully and become healthy
    - kubelets will not be able to reach it but will continue to run the same pods and provide the same service proxying
    - manual recovery or recreation of apiserver state necessary before apiserver is restarted
- Supporting services (node controller, replication controller manager, scheduler, etc) VM shutdown or crashes
  - currently those are colocated with the apiserver, and their unavailability has similar consequences as apiserver
  - in future, these will be replicated as well and may not be co-located
  - they do not have their own persistent state
- Individual node (VM or physical machine) shuts down
  - Results
    - pods on that Node stop running
- Network partition
  - Results
    - partition A thinks the nodes in partition B are down; partition B thinks the apiserver is down.
      (Assuming the master VM ends up in partition A.)
- Kubelet software fault
  - Results
    - crashing kubelet cannot start new pods on the node
    - kubelet might delete the pods or not
    - node marked unhealthy
    - replication controllers start new pods elsewhere
- Cluster operator error
  - Results
    - loss of pods, services, etc
    - loss of apiserver backing store
    - users unable to read API
    - etc.
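
When triaging one of these scenarios, for example a suspected apiserver outage, it can help to
probe the control plane from a workstation. A minimal sketch, assuming `kubectl` is already
configured for the cluster:

```shell
# Ask the API server whether it considers itself ready, with per-check detail
kubectl get --raw='/readyz?verbose'

# List nodes; NotReady or Unknown status points at node or network problems
kubectl get nodes
```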
### Mitigations
@@ -308,9 +317,13 @@ This is an incomplete list of things that could go wrong, and how to adjust your
## {{% heading "whatsnext" %}}

* Learn about the metrics available in the [Resource Metrics Pipeline](resource-metrics-pipeline)
* Discover additional tools for [monitoring resource usage](resource-usage-monitoring)
* Use Node Problem Detector to [monitor node health](monitor-node-health)
* Use `crictl` to [debug Kubernetes nodes](crictl)
* Get more information about [Kubernetes auditing](audit)
* Use `telepresence` to [develop and debug services locally](local-debugging)

```powershell
Get-NetAdapter | ? Name -Like "vEthernet (Ethernet*"
```

Often it is worthwhile to modify the [InterfaceName](https://github.com/microsoft/SDN/blob/master/Kubernetes/flannel/start.ps1#L7)
parameter of the `start.ps1` script, in cases where the host's network adapter isn't "Ethernet".
Otherwise, consult the output of the `start-kubelet.ps1` script to see if there are errors during virtual network creation.
1. DNS resolution is not properly working
@@ -112,9 +112,11 @@ content_type: concept
1. `kubectl port-forward` fails with "unable to do port forwarding: wincat not found"

   This was implemented in Kubernetes 1.15 by including `wincat.exe` in the pause infrastructure container
   `mcr.microsoft.com/oss/kubernetes/pause:3.6`.

   Be sure to use a supported version of Kubernetes.
   If you would like to build your own pause infrastructure container be sure to include
   [wincat](https://github.com/kubernetes/kubernetes/tree/master/build/pause/windows/wincat).
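
   For reference, the command that surfaces this error is an ordinary port-forward
   (a sketch; `my-pod` and the ports are placeholder values):

   ```shell
   # Forward local port 8080 to port 80 of a pod running on a Windows node
   kubectl port-forward pod/my-pod 8080:80
   ```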