
Commit 891a689

Merge pull request #44105 from windsonsea/shooty
Clean up troubleshooting-kubeadm.md
2 parents b3d0f3e + b231bcf commit 891a689

File tree

content/en/docs/setup/production-environment/tools/kubeadm/troubleshooting-kubeadm.md

1 file changed: 82 additions, 33 deletions
@@ -73,7 +73,8 @@ If you see the following warnings while running `kubeadm init`
 [preflight] WARNING: ethtool not found in system path
 ```
 
-Then you may be missing `ebtables`, `ethtool` or a similar executable on your node. You can install them with the following commands:
+Then you may be missing `ebtables`, `ethtool` or a similar executable on your node.
+You can install them with the following commands:
 
 - For Ubuntu/Debian users, run `apt install ebtables ethtool`.
 - For CentOS/Fedora users, run `yum install ebtables ethtool`.
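As a quick sanity check before running `kubeadm init`, a small POSIX loop can report which of the executables mentioned above are present on the node. This is an illustrative sketch, not part of the kubeadm docs, and the tool list here is an assumption:

```shell
# Illustrative sketch: check this node for the executables named in the
# kubeadm preflight warnings above (the list here is an assumption).
missing=0
for tool in ebtables ethtool; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "found: $tool"
  else
    echo "missing: $tool"
    missing=$((missing + 1))
  fi
done
echo "$missing executable(s) missing"
```

If anything is reported missing, install it with the distribution commands listed above.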
@@ -90,9 +91,9 @@ This may be caused by a number of problems. The most common are:
 
 - network connection problems. Check that your machine has full network connectivity before continuing.
 - the cgroup driver of the container runtime differs from that of the kubelet. To understand how to
-  configure it properly see [Configuring a cgroup driver](/docs/tasks/administer-cluster/kubeadm/configure-cgroup-driver/).
+  configure it properly, see [Configuring a cgroup driver](/docs/tasks/administer-cluster/kubeadm/configure-cgroup-driver/).
 - control plane containers are crashlooping or hanging. You can check this by running `docker ps`
-  and investigating each container by running `docker logs`. For other container runtime see
+  and investigating each container by running `docker logs`. For other container runtime, see
   [Debugging Kubernetes nodes with crictl](/docs/tasks/debug/debug-cluster/crictl/).
 
 ## kubeadm blocks when removing managed containers
@@ -144,10 +145,12 @@ provider. Please contact the author of the Pod Network add-on to find out whethe
 
 Calico, Canal, and Flannel CNI providers are verified to support HostPort.
 
-For more information, see the [CNI portmap documentation](https://github.com/containernetworking/plugins/blob/master/plugins/meta/portmap/README.md).
+For more information, see the
+[CNI portmap documentation](https://github.com/containernetworking/plugins/blob/master/plugins/meta/portmap/README.md).
 
-If your network provider does not support the portmap CNI plugin, you may need to use the [NodePort feature of
-services](/docs/concepts/services-networking/service/#type-nodeport) or use `HostNetwork=true`.
+If your network provider does not support the portmap CNI plugin, you may need to use the
+[NodePort feature of services](/docs/concepts/services-networking/service/#type-nodeport)
+or use `HostNetwork=true`.
 
 ## Pods are not accessible via their Service IP
 
@@ -157,9 +160,10 @@ services](/docs/concepts/services-networking/service/#type-nodeport) or use `Hos
   add-on provider to get the latest status of their support for hairpin mode.
 
 - If you are using VirtualBox (directly or via Vagrant), you will need to
-  ensure that `hostname -i` returns a routable IP address. By default the first
+  ensure that `hostname -i` returns a routable IP address. By default, the first
   interface is connected to a non-routable host-only network. A work around
-  is to modify `/etc/hosts`, see this [Vagrantfile](https://github.com/errordeveloper/k8s-playground/blob/22dd39dfc06111235620e6c4404a96ae146f26fd/Vagrantfile#L11)
+  is to modify `/etc/hosts`, see this
+  [Vagrantfile](https://github.com/errordeveloper/k8s-playground/blob/22dd39dfc06111235620e6c4404a96ae146f26fd/Vagrantfile#L11)
   for an example.
 
 ## TLS certificate errors
@@ -175,6 +179,7 @@ Unable to connect to the server: x509: certificate signed by unknown authority (
   regenerate a certificate if necessary. The certificates in a kubeconfig file
   are base64 encoded. The `base64 --decode` command can be used to decode the certificate
   and `openssl x509 -text -noout` can be used for viewing the certificate information.
+
 - Unset the `KUBECONFIG` environment variable using:
 
   ```sh
@@ -190,15 +195,16 @@ Unable to connect to the server: x509: certificate signed by unknown authority (
 - Another workaround is to overwrite the existing `kubeconfig` for the "admin" user:
 
   ```sh
-mv $HOME/.kube $HOME/.kube.bak
+  mv $HOME/.kube $HOME/.kube.bak
   mkdir $HOME/.kube
   sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
   sudo chown $(id -u):$(id -g) $HOME/.kube/config
   ```
 
 ## Kubelet client certificate rotation fails {#kubelet-client-cert}
 
-By default, kubeadm configures a kubelet with automatic rotation of client certificates by using the `/var/lib/kubelet/pki/kubelet-client-current.pem` symlink specified in `/etc/kubernetes/kubelet.conf`.
+By default, kubeadm configures a kubelet with automatic rotation of client certificates by using the
+`/var/lib/kubelet/pki/kubelet-client-current.pem` symlink specified in `/etc/kubernetes/kubelet.conf`.
 If this rotation process fails you might see errors such as `x509: certificate has expired or is not yet valid`
 in kube-apiserver logs. To fix the issue you must follow these steps:
 
@@ -231,24 +237,34 @@ The following error might indicate that something was wrong in the pod network:
 Error from server (NotFound): the server could not find the requested resource
 ```
 
-- If you're using flannel as the pod network inside Vagrant, then you will have to specify the default interface name for flannel.
+- If you're using flannel as the pod network inside Vagrant, then you will have to
+  specify the default interface name for flannel.
 
-  Vagrant typically assigns two interfaces to all VMs. The first, for which all hosts are assigned the IP address `10.0.2.15`, is for external traffic that gets NATed.
+  Vagrant typically assigns two interfaces to all VMs. The first, for which all hosts
+  are assigned the IP address `10.0.2.15`, is for external traffic that gets NATed.
 
-  This may lead to problems with flannel, which defaults to the first interface on a host. This leads to all hosts thinking they have the same public IP address. To prevent this, pass the `--iface eth1` flag to flannel so that the second interface is chosen.
+  This may lead to problems with flannel, which defaults to the first interface on a host.
+  This leads to all hosts thinking they have the same public IP address. To prevent this,
+  pass the `--iface eth1` flag to flannel so that the second interface is chosen.
 
 ## Non-public IP used for containers
 
-In some situations `kubectl logs` and `kubectl run` commands may return with the following errors in an otherwise functional cluster:
+In some situations `kubectl logs` and `kubectl run` commands may return with the
+following errors in an otherwise functional cluster:
 
 ```console
 Error from server: Get https://10.19.0.41:10250/containerLogs/default/mysql-ddc65b868-glc5m/mysql: dial tcp 10.19.0.41:10250: getsockopt: no route to host
 ```
 
-- This may be due to Kubernetes using an IP that can not communicate with other IPs on the seemingly same subnet, possibly by policy of the machine provider.
-- DigitalOcean assigns a public IP to `eth0` as well as a private one to be used internally as anchor for their floating IP feature, yet `kubelet` will pick the latter as the node's `InternalIP` instead of the public one.
+- This may be due to Kubernetes using an IP that can not communicate with other IPs on
+  the seemingly same subnet, possibly by policy of the machine provider.
+- DigitalOcean assigns a public IP to `eth0` as well as a private one to be used internally
+  as anchor for their floating IP feature, yet `kubelet` will pick the latter as the node's
+  `InternalIP` instead of the public one.
 
-  Use `ip addr show` to check for this scenario instead of `ifconfig` because `ifconfig` will not display the offending alias IP address. Alternatively an API endpoint specific to DigitalOcean allows to query for the anchor IP from the droplet:
+  Use `ip addr show` to check for this scenario instead of `ifconfig` because `ifconfig` will
+  not display the offending alias IP address. Alternatively an API endpoint specific to
+  DigitalOcean allows to query for the anchor IP from the droplet:
 
   ```sh
   curl http://169.254.169.254/metadata/v1/interfaces/public/0/anchor_ipv4/address
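To illustrate why `ip addr show` reveals the alias address, here is a sketch that extracts interface/address pairs from `ip -o addr show`-style output. The sample text is made up; the second `eth0` entry plays the role of the anchor IP described above:

```shell
# Illustrative sample of `ip -o addr show` output; the second eth0
# entry is the kind of alias/anchor IP kubelet may wrongly pick up.
sample='1: lo    inet 127.0.0.1/8 scope host lo
2: eth0    inet 203.0.113.10/24 scope global eth0
2: eth0    inet 10.19.0.41/16 scope global eth0'

# Print interface name and address for every entry; a second address
# on the same interface is the scenario to look for.
addrs=$(echo "$sample" | awk '{print $2, $4}')
echo "$addrs"
```

On a real node you would run `ip -o addr show` directly; an interface carrying more than one address is worth comparing against the node's reported `InternalIP`.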
@@ -270,12 +286,13 @@ Error from server: Get https://10.19.0.41:10250/containerLogs/default/mysql-ddc6
 
 ## `coredns` pods have `CrashLoopBackOff` or `Error` state
 
-If you have nodes that are running SELinux with an older version of Docker you might experience a scenario
-where the `coredns` pods are not starting. To solve that you can try one of the following options:
+If you have nodes that are running SELinux with an older version of Docker, you might experience a scenario
+where the `coredns` pods are not starting. To solve that, you can try one of the following options:
 
 - Upgrade to a [newer version of Docker](/docs/setup/production-environment/container-runtimes/#docker).
 
 - [Disable SELinux](https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/security-enhanced_linux/sect-security-enhanced_linux-enabling_and_disabling_selinux-disabling_selinux).
+
 - Modify the `coredns` deployment to set `allowPrivilegeEscalation` to `true`:
 
 ```bash
@@ -284,7 +301,8 @@ kubectl -n kube-system get deployment coredns -o yaml | \
 kubectl apply -f -
 ```
 
-Another cause for CoreDNS to have `CrashLoopBackOff` is when a CoreDNS Pod deployed in Kubernetes detects a loop. [A number of workarounds](https://github.com/coredns/coredns/tree/master/plugin/loop#troubleshooting-loops-in-kubernetes-clusters)
+Another cause for CoreDNS to have `CrashLoopBackOff` is when a CoreDNS Pod deployed in Kubernetes detects a loop.
+[A number of workarounds](https://github.com/coredns/coredns/tree/master/plugin/loop#troubleshooting-loops-in-kubernetes-clusters)
 are available to avoid Kubernetes trying to restart the CoreDNS Pod every time CoreDNS detects the loop and exits.
 
 {{< warning >}}
@@ -300,7 +318,7 @@ If you encounter the following error:
 rpc error: code = 2 desc = oci runtime error: exec failed: container_linux.go:247: starting container process caused "process_linux.go:110: decoding init error from pipe caused \"read parent: connection reset by peer\""
 ```
 
-this issue appears if you run CentOS 7 with Docker 1.13.1.84.
+This issue appears if you run CentOS 7 with Docker 1.13.1.84.
 This version of Docker can prevent the kubelet from executing into the etcd container.
 
 To work around the issue, choose one of these options:
@@ -344,6 +362,7 @@ to pick up the node's IP address properly and has knock-on effects to the proxy
 load balancers.
 
 The following error can be seen in kube-proxy Pods:
+
 ```
 server.go:610] Failed to retrieve node IP: host IP unknown; known addresses: []
 proxier.go:340] invalid nodeIP, initializing kube-proxy with 127.0.0.1 as nodeIP
@@ -352,8 +371,26 @@ proxier.go:340] invalid nodeIP, initializing kube-proxy with 127.0.0.1 as nodeIP
 A known solution is to patch the kube-proxy DaemonSet to allow scheduling it on control-plane
 nodes regardless of their conditions, keeping it off of other nodes until their initial guarding
 conditions abate:
+
 ```
-kubectl -n kube-system patch ds kube-proxy -p='{ "spec": { "template": { "spec": { "tolerations": [ { "key": "CriticalAddonsOnly", "operator": "Exists" }, { "effect": "NoSchedule", "key": "node-role.kubernetes.io/control-plane" } ] } } } }'
+kubectl -n kube-system patch ds kube-proxy -p='{
+  "spec": {
+    "template": {
+      "spec": {
+        "tolerations": [
+          {
+            "key": "CriticalAddonsOnly",
+            "operator": "Exists"
+          },
+          {
+            "effect": "NoSchedule",
+            "key": "node-role.kubernetes.io/control-plane"
+          }
+        ]
+      }
+    }
+  }
+}'
 ```
 
 The tracking issue for this problem is [here](https://github.com/kubernetes/kubeadm/issues/1027).
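Before applying the DaemonSet patch above to a live cluster, the JSON payload can be sanity-checked locally. This sketch assumes `python3` is on the PATH; it only parses the payload and counts the tolerations the patch would add:

```shell
# The same JSON payload as in the kubectl patch above (assumption:
# python3 is available; this only validates the JSON locally).
patch='{ "spec": { "template": { "spec": { "tolerations": [ { "key": "CriticalAddonsOnly", "operator": "Exists" }, { "effect": "NoSchedule", "key": "node-role.kubernetes.io/control-plane" } ] } } } }'

# Parse the payload and count the tolerations it would add.
count=$(echo "$patch" | python3 -c 'import json, sys; d = json.load(sys.stdin); print(len(d["spec"]["template"]["spec"]["tolerations"]))')
echo "tolerations in patch: $count"
```

A parse error here means the payload would be rejected by `kubectl patch` as well, so this is a cheap way to catch quoting mistakes.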
@@ -365,12 +402,15 @@ For [flex-volume support](https://github.com/kubernetes/community/blob/ab55d85/c
 Kubernetes components like the kubelet and kube-controller-manager use the default path of
 `/usr/libexec/kubernetes/kubelet-plugins/volume/exec/`, yet the flex-volume directory _must be writeable_
 for the feature to work.
-(**Note**: FlexVolume was deprecated in the Kubernetes v1.23 release)
 
-To workaround this issue you can configure the flex-volume directory using the kubeadm
+{{< note >}}
+FlexVolume was deprecated in the Kubernetes v1.23 release.
+{{< /note >}}
+
+To workaround this issue, you can configure the flex-volume directory using the kubeadm
 [configuration file](/docs/reference/config-api/kubeadm-config.v1beta3/).
 
-On the primary control-plane Node (created using `kubeadm init`) pass the following
+On the primary control-plane Node (created using `kubeadm init`), pass the following
 file using `--config`:
 
 ```yaml
@@ -402,7 +442,10 @@ be advised that this is modifying a design principle of the Linux distribution.
 
 ## `kubeadm upgrade plan` prints out `context deadline exceeded` error message
 
-This error message is shown when upgrading a Kubernetes cluster with `kubeadm` in the case of running an external etcd. This is not a critical bug and happens because older versions of kubeadm perform a version check on the external etcd cluster. You can proceed with `kubeadm upgrade apply ...`.
+This error message is shown when upgrading a Kubernetes cluster with `kubeadm` in
+the case of running an external etcd. This is not a critical bug and happens because
+older versions of kubeadm perform a version check on the external etcd cluster.
+You can proceed with `kubeadm upgrade apply ...`.
 
 This issue is fixed as of version 1.19.
 
@@ -422,6 +465,7 @@ can be used insecurely by passing the `--kubelet-insecure-tls` to it. This is no
 If you want to use TLS between the metrics-server and the kubelet there is a problem,
 since kubeadm deploys a self-signed serving certificate for the kubelet. This can cause the following errors
 on the side of the metrics-server:
+
 ```
 x509: certificate signed by unknown authority
 x509: certificate is valid for IP-foo not IP-bar
@@ -438,6 +482,7 @@ Only applicable to upgrading a control plane node with a kubeadm binary v1.28.3
 where the node is currently managed by kubeadm versions v1.28.0, v1.28.1 or v1.28.2.
 
 Here is the error message you may encounter:
+
 ```
 [upgrade/etcd] Failed to upgrade etcd: couldn't upgrade control plane. kubeadm has tried to recover everything into the earlier state. Errors faced: static Pod hash for component etcd on Node kinder-upgrade-control-plane-1 did not change after 5m0s: timed out waiting for the condition
 [upgrade/etcd] Waiting for previous etcd to become available
@@ -454,16 +499,19 @@ k8s.io/kubernetes/cmd/kubeadm/app/phases/upgrade.performEtcdStaticPodUpgrade
 ...
 ```
 
-The reason for this failure is that the affected versions generate an etcd manifest file with unwanted defaults in the PodSpec.
-This will result in a diff from the manifest comparison, and kubeadm will expect a change in the Pod hash, but the kubelet will never update the hash.
+The reason for this failure is that the affected versions generate an etcd manifest file with
+unwanted defaults in the PodSpec. This will result in a diff from the manifest comparison,
+and kubeadm will expect a change in the Pod hash, but the kubelet will never update the hash.
 
 There are two way to workaround this issue if you see it in your cluster:
+
 - The etcd upgrade can be skipped between the affected versions and v1.28.3 (or later) by using:
-```shell
-kubeadm upgrade {apply|node} [version] --etcd-upgrade=false
-```
 
-This is not recommended in case a new etcd version was introduced by a later v1.28 patch version.
+  ```shell
+  kubeadm upgrade {apply|node} [version] --etcd-upgrade=false
+  ```
+
+  This is not recommended in case a new etcd version was introduced by a later v1.28 patch version.
 
 - Before upgrade, patch the manifest for the etcd static pod, to remove the problematic defaulted attributes:
 
@@ -509,4 +557,5 @@ This is not recommended in case a new etcd version was introduced by a later v1.
       path: /etc/kubernetes/pki/etcd
 ```
 
-More information can be found in the [tracking issue](https://github.com/kubernetes/kubeadm/issues/2927) for this bug.
+More information can be found in the
+[tracking issue](https://github.com/kubernetes/kubeadm/issues/2927) for this bug.
