Issue description
I installed a cluster in the eu-central region, with 3 control-plane nodes spread across HEL, FSN and NBG, as well as 3 agent nodes.
I used the latest release v2.1.6.
After the installation, all nodes were added to the cluster and became healthy, except for the one agent node in NBG. That node stays "unhealthy".
Debugging steps
I started debugging on the node and found a lot of these errors in the journalctl logs:
May 22 20:58:59 k3s-worker-nbg1-wtc k3s[1350]: I0522 20:58:59.658425 1350 status_manager.go:667] "Failed to get status for pod" podUID=44c48f1e-ec34-4c48-b9a8-e108e4efd480 pod="kube-system/cilium-lqhgn" err="Get \"https://127.0.0.1:6444/api/v1/namespaces/kube-system/pods/cilium-lqhgn\": net/http: TLS handshake timeout"
May 22 20:59:00 k3s-worker-nbg1-wtc k3s[1350]: E0522 20:59:00.084778 1350 kubelet.go:2373] "Container runtime network not ready" networkReady="NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized"
May 22 20:59:01 k3s-worker-nbg1-wtc k3s[1350]: W0522 20:59:01.179363 1350 reflector.go:424] object-"kube-system"/"hubble-server-certs": failed to list *v1.Secret: Get "https://127.0.0.1:6444/api/v1/namespaces/kube-system/secrets?fieldSelector=metadata.name%3Dhubble-server-certs&resourceVersion=1680277": net/http: TLS handshake timeout
May 22 20:59:01 k3s-worker-nbg1-wtc k3s[1350]: I0522 20:59:01.179528 1350 trace.go:219] Trace[1662194913]: "Reflector ListAndWatch" name:object-"kube-system"/"hubble-server-certs" (22-May-2023 20:58:51.177) (total time: 10002ms):
May 22 20:59:01 k3s-worker-nbg1-wtc k3s[1350]: Trace[1662194913]: ---"Objects listed" error:Get "https://127.0.0.1:6444/api/v1/namespaces/kube-system/secrets?fieldSelector=metadata.name%3Dhubble-server-certs&resourceVersion=1680277": net/http: TLS handshake timeout 10002ms (20:59:01.179)
May 22 20:59:01 k3s-worker-nbg1-wtc k3s[1350]: Trace[1662194913]: [10.002200076s] [10.002200076s] END
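Since all of these requests go to 127.0.0.1:6444, which as far as I understand is the k3s agent's local load balancer in front of the control-plane apiservers, I also checked what is listening there and which upstream servers it knows about. The load-balancer state file path below is what a default k3s agent install uses; it may differ if the data dir was changed:
# What is listening on the local apiserver proxy port?
ss -tlnp | grep 6444
# Which control-plane endpoints does the agent load balancer currently know about?
# (default k3s agent data dir assumed; adjust the path if yours differs)
cat /var/lib/rancher/k3s/agent/etc/k3s-agent-load-balancer.json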
systemctl status k3s-agent shows the service as running, but the recent log lines also contain TLS handshake timeout errors when connecting to https://127.0.0.1:6444/
● k3s-agent.service - Lightweight Kubernetes
Loaded: loaded (/etc/systemd/system/k3s-agent.service; enabled; preset: disabled)
Active: active (running) since Fri 2023-05-19 00:55:22 UTC; 3 days ago
Docs: https://k3s.io
Process: 1346 ExecStartPre=/bin/sh -xc ! /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service (code=exited, status=0/SUCCESS)
Process: 1348 ExecStartPre=/sbin/modprobe br_netfilter (code=exited, status=0/SUCCESS)
Process: 1349 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
Main PID: 1350 (k3s-agent)
Tasks: 84
CPU: 3h 32min 3.727s
CGroup: /system.slice/k3s-agent.service
├─ 1350 "/usr/local/bin/k3s agent"
├─ 8161 containerd -c /var/lib/rancher/k3s/agent/etc/containerd/config.toml -a /run/k3s/containerd/containerd.sock --state /run/k3s/containe>
├─ 8351 /var/lib/rancher/k3s/data/feeeb9b2f9234f89a72104f4e1c25b6a2ffe117ddaadbe6791cf09885153bdc3/bin/containerd-shim-runc-v2 -namespace k8>
├─ 8485 /var/lib/rancher/k3s/data/feeeb9b2f9234f89a72104f4e1c25b6a2ffe117ddaadbe6791cf09885153bdc3/bin/containerd-shim-runc-v2 -namespace k8>
├─13762 /var/lib/rancher/k3s/data/feeeb9b2f9234f89a72104f4e1c25b6a2ffe117ddaadbe6791cf09885153bdc3/bin/containerd-shim-runc-v2 -namespace k8>
└─13949 /var/lib/rancher/k3s/data/feeeb9b2f9234f89a72104f4e1c25b6a2ffe117ddaadbe6791cf09885153bdc3/bin/containerd-shim-runc-v2 -namespace k8>
May 22 21:04:11 k3s-worker-nbg1-wtc k3s[1350]: W0522 21:04:11.356131 1350 reflector.go:424] k8s.io/[email protected]/tools/cache/reflector.go:169>
May 22 21:04:11 k3s-worker-nbg1-wtc k3s[1350]: I0522 21:04:11.356251 1350 trace.go:219] Trace[94676488]: "Reflector ListAndWatch" name:k8s.io/client-g>
May 22 21:04:11 k3s-worker-nbg1-wtc k3s[1350]: Trace[94676488]: ---"Objects listed" error:Get "https://127.0.0.1:6444/api/v1/nodes?fieldSelector=metadata>
May 22 21:04:11 k3s-worker-nbg1-wtc k3s[1350]: Trace[94676488]: [10.001875034s] [10.001875034s] END
May 22 21:04:11 k3s-worker-nbg1-wtc k3s[1350]: E0522 21:04:11.356274 1350 reflector.go:140] k8s.io/[email protected]/tools/cache/reflector.go:169>
May 22 21:04:11 k3s-worker-nbg1-wtc k3s[1350]: E0522 21:04:11.452264 1350 pod_workers.go:965] "Error syncing pod, skipping" err="network is not ready:>
May 22 21:04:11 k3s-worker-nbg1-wtc k3s[1350]: E0522 21:04:11.452747 1350 pod_workers.go:965] "Error syncing pod, skipping" err="network is not ready:>
May 22 21:04:11 k3s-worker-nbg1-wtc k3s[1350]: I0522 21:04:11.713382 1350 status_manager.go:667] "Failed to get status for pod" podUID=364e4d33-8d3d-4>
May 22 21:04:13 k3s-worker-nbg1-wtc k3s[1350]: E0522 21:04:13.452769 1350 pod_workers.go:965] "Error syncing pod, skipping" err="network is not ready:>
May 22 21:04:13 k3s-worker-nbg1-wtc k3s[1350]: E0522 21:04:13.452903 1350 pod_workers.go:965] "Error syncing pod, skipping" err="network is not ready:>
I manually tried to curl the same URL, but that failed as well (connection reset by peer):
k3s-worker-nbg1-wtc:~# curl -vk "https://127.0.0.1:6444/api/v1/namespaces/kube-system/secrets?fieldSelector=metadata.name%3Dhubble-server-certs&resourceVersion=1680277"
*   Trying 127.0.0.1:6444...
* Connected to 127.0.0.1 (127.0.0.1) port 6444 (#0)
* ALPN: offers h2,http/1.1
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* Recv failure: Connection reset by peer
* OpenSSL SSL_connect: Connection reset by peer in connection to 127.0.0.1:6444
* Closing connection 0
curl: (35) Recv failure: Connection reset by peer
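To rule out the local proxy, testing a control-plane apiserver directly on port 6443 from this node should show whether the problem is the proxy itself or the network path to the servers; the IP below is just a placeholder for one of the control-plane nodes' private addresses, and an HTTP 401/403 response would be fine since it proves the TLS handshake completes:
# Replace 10.0.0.101 with the private IP of one of the control-plane nodes (placeholder)
curl -vk --connect-timeout 10 https://10.0.0.101:6443/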
So I went on and compared the settings in /etc/rancher/k3s/config.yaml with one of the other nodes, but they are identical (apart from the node's own node-ip).
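For reference, this is roughly how I compared them (the healthy node's hostname below is just an example):
# Pull the config from a healthy agent and diff it against the local one
# (k3s-worker-fsn1-abc is an example hostname for one of the working agents)
ssh k3s-worker-fsn1-abc 'cat /etc/rancher/k3s/config.yaml' > /tmp/config-healthy.yaml
diff /etc/rancher/k3s/config.yaml /tmp/config-healthy.yaml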
I checked the routes and subnets in the Hetzner Cloud console, and everything looks fine there.
I even spun up another node in NBG, but it ended up in the same state.
However, spinning up another node in FSN worked fine.
Questions
Are there any firewalls or OS-side settings that could block this?
Does anyone else have this issue on nodes in NBG?
How can it be that the node registered itself with the server, but is then no longer able to connect to the kube-apiserver on port 6443 or 6444?
Does the subnet range of the nodes, 10.0.0.0/16, conflict with some address space of the CNI or something else (I use Cilium with encryption enabled)?
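Regarding the firewall and CIDR questions, these are the checks I have in mind: the firewall check on the NBG node itself, the kubectl commands from any machine with working cluster access. The cilium-config ConfigMap name assumes a standard Cilium install:
# Any host-level firewall rules that could drop or reject traffic to 6443/6444?
iptables -S | grep -iE 'drop|reject'
# Pod CIDR assigned to each node vs. the 10.0.0.0/16 node network
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.podCIDR}{"\n"}{end}'
# CIDRs Cilium is configured with (assumes the standard cilium-config ConfigMap in kube-system)
kubectl -n kube-system get configmap cilium-config -o yaml | grep -i cidr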