Subsequent control plane nodes in HA clusters sometimes fail to kubeadm join #5687

@nojnhuh

Which jobs are flaky:
https://storage.googleapis.com/k8s-triage/index.html?pr=1&text=Timed%20out%20waiting%20for%20%5Cd%2B%20control%20plane%20machines%20to%20exist&job=cluster-api-provider-azure
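The triage link above carries its search filters in the query string. As a rough illustration only, the sketch below decodes that URL with the Python standard library and checks the decoded `text` filter, treated here as a regular expression, against a failure message. The sample message is a hypothetical line made up to fit the pattern, not one copied from a real run, and `triage_filters` is an illustrative helper name.

```python
import re
from urllib.parse import parse_qs, urlsplit

# The triage URL from this issue, split across lines for readability.
TRIAGE_URL = (
    "https://storage.googleapis.com/k8s-triage/index.html"
    "?pr=1&text=Timed%20out%20waiting%20for%20%5Cd%2B%20control%20plane"
    "%20machines%20to%20exist&job=cluster-api-provider-azure"
)


def triage_filters(url: str) -> dict[str, str]:
    """Return the percent-decoded query parameters of a triage URL."""
    query = parse_qs(urlsplit(url).query)
    # Each parameter appears once in this URL, so keep the first value.
    return {key: values[0] for key, values in query.items()}


filters = triage_filters(TRIAGE_URL)
pattern = re.compile(filters["text"])

# Hypothetical failure message, invented to illustrate the match.
sample = "Timed out waiting for 3 control plane machines to exist"

print(filters["text"])
print(filters["job"])
print(bool(pattern.search(sample)))
```

The decoded filter contains `\d+`, so the same query covers clusters with any number of expected control plane machines.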

Which tests are flaky:

Testgrid link:

Reason for failure (if possible):
During kubeadm join, the joining control plane node hits a timeout when connecting to the existing etcd members.

Failures usually occur during "[etcd] Adding etcd member" or "[etcd] Promoting a learner as a voting member". For example, the logs from one run:

[2025-06-09 01:42:48] I0609 01:42:48.646536    1609 local.go:157] [etcd] Adding etcd member: https://10.0.0.6:2380
[2025-06-09 01:42:50] context deadline exceeded
[2025-06-09 01:42:50] error creating local etcd static pod manifest file
[2025-06-09 01:42:50] k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/join.runEtcdPhase
[2025-06-09 01:42:50] 	k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/join/controlplanejoin.go:171
[2025-06-09 01:42:50] k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run.func1
[2025-06-09 01:42:50] 	k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow/runner.go:261
[2025-06-09 01:42:50] k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).visitAll
[2025-06-09 01:42:50] 	k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow/runner.go:450
[2025-06-09 01:42:50] k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run
[2025-06-09 01:42:50] 	k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow/runner.go:234
[2025-06-09 01:42:50] k8s.io/kubernetes/cmd/kubeadm/app/cmd.newCmdJoin.func1
[2025-06-09 01:42:50] 	k8s.io/kubernetes/cmd/kubeadm/app/cmd/join.go:185
[2025-06-09 01:42:50] github.com/spf13/cobra.(*Command).execute
[2025-06-09 01:42:50] 	github.com/spf13/[email protected]/command.go:985
[2025-06-09 01:42:50] github.com/spf13/cobra.(*Command).ExecuteC
[2025-06-09 01:42:50] 	github.com/spf13/[email protected]/command.go:1117
[2025-06-09 01:42:50] github.com/spf13/cobra.(*Command).Execute
[2025-06-09 01:42:50] 	github.com/spf13/[email protected]/command.go:1041
[2025-06-09 01:42:50] k8s.io/kubernetes/cmd/kubeadm/app.Run
[2025-06-09 01:42:50] 	k8s.io/kubernetes/cmd/kubeadm/app/kubeadm.go:47
[2025-06-09 01:42:50] main.main
[2025-06-09 01:42:50] 	k8s.io/kubernetes/cmd/kubeadm/kubeadm.go:25
[2025-06-09 01:42:50] runtime.main
[2025-06-09 01:42:50] 	runtime/proc.go:272
[2025-06-09 01:42:50] runtime.goexit
[2025-06-09 01:42:50] 	runtime/asm_amd64.s:1700
[2025-06-09 01:42:50] error execution phase etcd-join
[2025-06-09 01:42:50] k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run.func1
[2025-06-09 01:42:50] 	k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow/runner.go:262
[2025-06-09 01:42:50] k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).visitAll
[2025-06-09 01:42:50] 	k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow/runner.go:450
[2025-06-09 01:42:50] k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run
[2025-06-09 01:42:50] 	k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow/runner.go:234
[2025-06-09 01:42:50] k8s.io/kubernetes/cmd/kubeadm/app/cmd.newCmdJoin.func1
[2025-06-09 01:42:50] 	k8s.io/kubernetes/cmd/kubeadm/app/cmd/join.go:185
[2025-06-09 01:42:50] github.com/spf13/cobra.(*Command).execute
[2025-06-09 01:42:50] 	github.com/spf13/[email protected]/command.go:985
[2025-06-09 01:42:50] github.com/spf13/cobra.(*Command).ExecuteC
[2025-06-09 01:42:50] 	github.com/spf13/[email protected]/command.go:1117
[2025-06-09 01:42:50] github.com/spf13/cobra.(*Command).Execute
[2025-06-09 01:42:50] 	github.com/spf13/[email protected]/command.go:1041
[2025-06-09 01:42:50] k8s.io/kubernetes/cmd/kubeadm/app.Run
[2025-06-09 01:42:50] 	k8s.io/kubernetes/cmd/kubeadm/app/kubeadm.go:47
[2025-06-09 01:42:50] main.main
[2025-06-09 01:42:50] 	k8s.io/kubernetes/cmd/kubeadm/kubeadm.go:25
[2025-06-09 01:42:50] runtime.main
[2025-06-09 01:42:50] 	runtime/proc.go:272
[2025-06-09 01:42:50] runtime.goexit
[2025-06-09 01:42:50] 	runtime/asm_amd64.s:1700
[2025-06-09 01:42:50] 2025-06-09 01:42:50,651 - cc_scripts_user.py[WARNING]: Failed to run module scripts_user (scripts in /var/lib/cloud/instance/scripts)
[2025-06-09 01:42:50] 2025-06-09 01:42:50,651 - util.py[WARNING]: Running module scripts_user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_scripts_user.py'>) failed
[2025-06-09 01:42:52] Cloud-init v. 24.3.1-0ubuntu0~24.04.2 finished at Mon, 09 Jun 2025 01:42:52 +0000. Datasource DataSourceAzure [seed=/dev/sr0].  Up 65.88 seconds
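To check which etcd step a given run failed in, one option is to scan the boot log for the last `[etcd]` message printed before `context deadline exceeded`. The following is a minimal sketch under assumptions: the function name and the regular expression are illustrative, the log format is inferred only from the excerpt above, and the sample lines are copied from that excerpt.

```python
import re

# Matches the step text after "[etcd] ", up to an optional trailing ":".
ETCD_STEP = re.compile(r"\[etcd\] (?P<step>[^:]+)")
TIMEOUT = "context deadline exceeded"


def last_etcd_step_before_timeout(lines):
    """Return the last [etcd] step logged before the timeout, or None.

    None is returned when no timeout line is found, or when the timeout
    appears before any [etcd] message.
    """
    last_step = None
    for line in lines:
        if TIMEOUT in line:
            return last_step
        match = ETCD_STEP.search(line)
        if match:
            last_step = match.group("step").strip()
    return None


# Lines copied from the log excerpt in this issue.
sample = [
    "[2025-06-09 01:42:48] I0609 01:42:48.646536    1609 local.go:157] "
    "[etcd] Adding etcd member: https://10.0.0.6:2380",
    "[2025-06-09 01:42:50] context deadline exceeded",
    "[2025-06-09 01:42:50] error creating local etcd static pod manifest file",
]

print(last_etcd_step_before_timeout(sample))  # prints "Adding etcd member"
```

Running this over the logs of several failed jobs would give a rough count of how often each of the two steps is the one that times out.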

Anything else we need to know:

/kind flake

Labels: kind/flake (Categorizes issue or PR as related to a flaky test.)

Status: Done