Etcd member stuck as learner during bootstrap #330

@ArnoldVanN

Description

Hi, I've been using this module without issue, but I recently tried to disable kube-proxy replacement on my staging cluster in order to run Istio, bricking it in the process.
After completely tearing down the cluster and trying to bootstrap again, I ran into two major issues. The first was 403s from several registries, which seems to have been fixed by self-hosting mirrors.
After destroying and bootstrapping once more, the 2nd node that gets created gets stuck as a learner, even after re-enabling kube-proxy replacement:

k8s-staging-control-2   etcd                   Running   Fail     10m16s ago    Health check failed: etcdserver: rpc not supported for learner

talosctl etcd members -n k8s-staging-control-1
NODE                    ID                 HOSTNAME                PEER URLS                CLIENT URLS              LEARNER
k8s-staging-control-1   8a94def43f7181ef   k8s-staging-control-2   https://10.0.64.2:2380   https://10.0.64.2:2379   true
k8s-staging-control-1   c555db3b0993326f   k8s-staging-control-1   https://10.0.64.1:2380   https://10.0.64.1:2379   false

Meanwhile, the 3rd of the 3 control plane nodes gets stuck in Preparing:

k8s-staging-control-3   etcd                   Preparing   ?        14m31s ago    Running pre state

And after some time fails with this error:

k8s-staging-control-3   etcd         Failed    ?        28s ago       Failed to run pre stage: failed to build initial etcd cluster: failed to build cluster arguments: 2 error(s) occurred:
                        error adding member: etcdserver: too many learner members in cluster
                        timeout

Something I did notice is that both nodes log "bootstrap request received" and try to spin up etcd at almost exactly the same time. I'm not experienced enough to know whether that's expected.

I've tried manually removing the learner, resetting/rebooting nodes, stripping down my config, etc., all to no effect.
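For reference, this is roughly what I ran to try to clear the stuck learner (per the talosctl docs; note the remove-member argument form, hostname vs. hex member ID, may vary by Talos version):

talosctl -n 10.0.64.1 etcd members                              # confirm which member is stuck as a learner
talosctl -n 10.0.64.1 etcd remove-member k8s-staging-control-2  # remove the stuck learner
talosctl -n 10.0.64.2 reset --graceful=false --reboot           # wipe the affected node so it rejoins from scratch

The learner reappears after the node rejoins.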

Expected Behavior

The 2nd etcd node should be promoted to a voting member, with the 3rd node joining afterwards.

Actual Behavior

The 2nd node gets stuck as a learner, and the 3rd node's etcd service fails with "error adding member: etcdserver: too many learner members in cluster".

Minimal Module Configuration

module "kubernetes" {
  source  = "hcloud-k8s/kubernetes/hcloud"
  version = "3.22.0"

  # cluster_delete_protection = false

  cluster_name = "k8s-staging"
  hcloud_token = var.hcloud_token
  kube_api_hostname = "k8s-staging-control-1"

  control_plane_nodepools = [
    { name = "control", type = "cx33", location = "fsn1", count = 3 }
  ]

  cert_manager_enabled  = true

  longhorn_enabled = true
  longhorn_default_storage_class = true

  talos_image_extensions = ["siderolabs/tailscale"]
  control_plane_config_patches = [
    {
      apiVersion = "v1alpha1"
      kind       = "ExtensionServiceConfig"
      name       = "tailscale"
      environment = [
        "TS_AUTHKEY=${var.tailscale_authkey}"
      ]
    }
  ]

  talos_registries = {
    mirrors = {
    // mirrors pointing to a tailscale IP
    }
  }

  // required for Istio
  # cilium_kube_proxy_replacement_enabled = false
  cilium_socket_lb_host_namespace_only_enabled = true
  cilium_helm_values = {
    cni = {
      exclusive = false
      # chainingMode = "none" // prevents istio/cilium from infinitely overwriting the cni config file
    }
    # devices: "eth+" // set this manually so talos doesn't try to run eBPF datapath on the tailscale interface
    # // recommended by cilium
    # bpfClockProbe = true
    # bpf = {
    #   distributedLRU = {
    #     enabled = true
    #   }
    #   mapDynamicSizeRatio = 0.08
    # }
  }
}

Relevant Output

Apologies for the spam; I'm not sure which information is most relevant.

etcd logs for node 2:


k8s-staging-control-2: {"level":"warn","ts":"2026-02-19T22:50:50.612583Z","caller":"embed/config_logging.go:188","msg":"rejected connection on client endpoint","remote-addr":"[::1]:55098","server-name":"localhost","error":"EOF"}
k8s-staging-control-2: {"level":"error","ts":"2026-02-19T22:50:51.504366Z","caller":"etcdserver/server.go:2090","msg":"Validation on configuration change failed","shouldApplyV3":true,"error":"membership: too many learner members in cluster","stacktrace":"go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyConfChange\n\tgo.etcd.io/etcd/server/v3/etcdserver/server.go:2090\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).apply\n\tgo.etcd.io/etcd/server/v3/etcdserver/server.go:1918\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyEntries\n\tgo.etcd.io/etcd/server/v3/etcdserver/server.go:1210\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyAll\n\tgo.etcd.io/etcd/server/v3/etcdserver/server.go:985\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).run.func6\n\tgo.etcd.io/etcd/server/v3/etcdserver/server.go:855\ngo.etcd.io/etcd/pkg/v3/schedule.job.Do\n\tgo.etcd.io/etcd/pkg/v3@v3.6.5/schedule/schedule.go:41\ngo.etcd.io/etcd/pkg/v3/schedule.(*fifo).executeJob\n\tgo.etcd.io/etcd/pkg/v3@v3.6.5/schedule/schedule.go:206\ngo.etcd.io/etcd/pkg/v3/schedule.(*fifo).run\n\tgo.etcd.io/etcd/pkg/v3@v3.6.5/schedule/schedule.go:187"}
k8s-staging-control-2: {"level":"info","ts":"2026-02-19T22:50:51.504468Z","logger":"raft","caller":"v3@v3.6.0/raft.go:1981","msg":"8a94def43f7181ef switched to configuration voters=(14219512445102404207) learners=(9985851414405022191)"}


etcd logs for node 1:


k8s-staging-control-1: {"level":"info","ts":"2026-02-19T22:52:34.208319Z","caller":"etcdserver/server.go:1768","msg":"applied a configuration change through raft","local-member-id":"c555db3b0993326f","raft-conf-change":"ConfChangeAddLearnerNode","raft-conf-change-node-id":"1ff6a2882d6adc2b"}
k8s-staging-control-1: {"level":"info","ts":"2026-02-19T22:52:34.673636Z","caller":"etcdserver/corrupt.go:278","msg":"starting compact hash check","local-member-id":"c555db3b0993326f","timeout":"7s"}
k8s-staging-control-1: {"level":"info","ts":"2026-02-19T22:52:34.673724Z","caller":"etcdserver/corrupt.go:294","msg":"finished compaction hash check","number-of-hashes-checked":0}
k8s-staging-control-1: {"level":"error","ts":"2026-02-19T22:52:37.575944Z","caller":"etcdserver/server.go:2090","msg":"Validation on configuration change failed","shouldApplyV3":true,"error":"membership: too many learner members in cluster","stacktrace":"go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyConfChange\n\tgo.etcd.io/etcd/server/v3/etcdserver/server.go:2090\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).apply\n\tgo.etcd.io/etcd/server/v3/etcdserver/server.go:1918\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyEntries\n\tgo.etcd.io/etcd/server/v3/etcdserver/server.go:1210\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyAll\n\tgo.etcd.io/etcd/server/v3/etcdserver/server.go:985\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).run.func6\n\tgo.etcd.io/etcd/server/v3/etcdserver/server.go:855\ngo.etcd.io/etcd/pkg/v3/schedule.job.Do\n\tgo.etcd.io/etcd/pkg/v3@v3.6.5/schedule/schedule.go:41\ngo.etcd.io/etcd/pkg/v3/schedule.(*fifo).executeJob\n\tgo.etcd.io/etcd/pkg/v3@v3.6.5/schedule/schedule.go:206\ngo.etcd.io/etcd/pkg/v3/schedule.(*fifo).run\n\tgo.etcd.io/etcd/pkg/v3@v3.6.5/schedule/schedule.go:187"}
k8s-staging-control-1: {"level":"info","ts":"2026-02-19T22:52:37.576237Z","logger":"raft","caller":"v3@v3.6.0/raft.go:1981","msg":"c555db3b0993326f switched to configuration voters=(14219512445102404207) learners=(9985851414405022191)"}
k8s-staging-control-1: {"level":"info","ts":"2026-02-19T22:52:37.576329Z","caller":"etcdserver/server.go:1768","msg":"applied a configuration change through raft","local-member-id":"c555db3b0993326f","raft-conf-change":"ConfChangeAddLearnerNode","raft-conf-change-node-id":"cc8d0464717ad9b0"}
k8s-staging-control-1: {"level":"error","ts":"2026-02-19T22:52:41.523087Z","caller":"etcdserver/server.go:2090","msg":"Validation on configuration change failed","shouldApplyV3":true,"error":"membership: too many learner members in cluster","stacktrace":"go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyConfChange\n\tgo.etcd.io/etcd/server/v3/etcdserver/server.go:2090\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).apply\n\tgo.etcd.io/etcd/server/v3/etcdserver/server.go:1918\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyEntries\n\tgo.etcd.io/etcd/server/v3/etcdserver/server.go:1210\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyAll\n\tgo.etcd.io/etcd/server/v3/etcdserver/server.go:985\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).run.func6\n\tgo.etcd.io/etcd/server/v3/etcdserver/server.go:855\ngo.etcd.io/etcd/pkg/v3/schedule.job.Do\n\tgo.etcd.io/etcd/pkg/v3@v3.6.5/schedule/schedule.go:41\ngo.etcd.io/etcd/pkg/v3/schedule.(*fifo).executeJob\n\tgo.etcd.io/etcd/pkg/v3@v3.6.5/schedule/schedule.go:206\ngo.etcd.io/etcd/pkg/v3/schedule.(*fifo).run\n\tgo.etcd.io/etcd/pkg/v3@v3.6.5/schedule/schedule.go:187"}
k8s-staging-control-1: {"level":"info","ts":"2026-02-19T22:52:41.523184Z","logger":"raft","caller":"v3@v3.6.0/raft.go:1981","msg":"c555db3b0993326f switched to configuration voters=(14219512445102404207) learners=(9985851414405022191)"}
k8s-staging-control-1: {"level":"info","ts":"2026-02-19T22:52:41.523232Z","caller":"etcdserver/server.go:1768","msg":"applied a configuration change through raft","local-member-id":"c555db3b0993326f","raft-conf-change":"ConfChangeAddLearnerNode","raft-conf-change-node-id":"2b1f3ea31b9f1e59"}

node 1 dmesg

k8s-staging-control-1: user: warning: [2026-02-19T22:28:19.556557486Z]: [talos] controller failed {"component": "controller-runtime", "controller": "k8s.ManifestApplyController", "error": "2 errors occurred:\n\t* error creating mapping for object talos.dev/v1alpha1/ServiceAccount/talos-cloud-controller-manager-talos-secrets: no matches for kind \"ServiceAccount\" in version \"talos.dev/v1alpha1\"\n\t* error creating mapping for object talos.dev/v1alpha1/ServiceAccount/talos-backup-secrets: no matches for kind \"ServiceAccount\" in version \"talos.dev/v1alpha1\"\n\n"}
k8s-staging-control-1: user: warning: [2026-02-19T22:28:22.845324486Z]: [talos] controller failed {"component": "controller-runtime", "controller": "k8s.KubeletStaticPodController", "error": "error refreshing pod status: error fetching pod status: Get \"https://127.0.0.1:10250/pods/?timeout=30s\": remote error: tls: internal error"}
k8s-staging-control-1: user: warning: [2026-02-19T22:28:38.397148486Z]: [talos] controller failed {"component": "controller-runtime", "controller": "k8s.KubeletStaticPodController", "error": "error refreshing pod status: error fetching pod status: Get \"https://127.0.0.1:10250/pods/?timeout=30s\": remote error: tls: internal error"}
k8s-staging-control-1: user: warning: [2026-02-19T22:28:50.394325486Z]: [talos] found private network for private vip (alias IP) {"component": "controller-runtime", "controller": "network.OperatorVIPConfigController", "vip": "10.0.64.126", "network_id": 11955308}
k8s-staging-control-1: user: warning: [2026-02-19T22:28:50.421065486Z]: [talos] found private network for private vip (alias IP) {"component": "controller-runtime", "controller": "network.OperatorVIPConfigController", "vip": "10.0.64.126", "network_id": 11955308}
k8s-staging-control-1: user: warning: [2026-02-19T22:28:50.578930486Z]: [talos] found private network for private vip (alias IP) {"component": "controller-runtime", "controller": "network.OperatorVIPConfigController", "vip": "10.0.64.126", "network_id": 11955308}
k8s-staging-control-1: user: warning: [2026-02-19T22:28:50.616463486Z]: [talos] found private network for private vip (alias IP) {"component": "controller-runtime", "controller": "network.OperatorVIPConfigController", "vip": "10.0.64.126", "network_id": 11955308}
k8s-staging-control-1: user: warning: [2026-02-19T22:28:52.022936486Z]: [talos] new diagnostic {"component": "controller-runtime", "controller": "runtime.DiagnosticsLoggerController", "id": "kubelet-csr", "message": "kubelet server certificate rotation is enabled, but CSR is not approved", "details": ["kubelet API error: remote error: tls: internal error", "pending CSRs: csr-x9z5c"], "url": "https://talos.dev/diagnostic/kubelet-csr"}
k8s-staging-control-1: user: warning: [2026-02-19T22:28:52.164848486Z]: [talos] found private network for private vip (alias IP) {"component": "controller-runtime", "controller": "network.OperatorVIPConfigController", "vip": "10.0.64.126", "network_id": 11955308}
k8s-staging-control-1: user: warning: [2026-02-19T22:28:52.199201486Z]: [talos] found private network for private vip (alias IP) {"component": "controller-runtime", "controller": "network.OperatorVIPConfigController", "vip": "10.0.64.126", "network_id": 11955308}
k8s-staging-control-1: user: warning: [2026-02-19T22:28:53.064244486Z]: [talos] created talos.dev/v1alpha1/ServiceAccount/talos-cloud-controller-manager-talos-secrets {"component": "controller-runtime", "controller": "k8s.ManifestApplyController"}
k8s-staging-control-1: user: warning: [2026-02-19T22:28:53.661009486Z]: [talos] controller failed {"component": "controller-runtime", "controller": "k8s.KubeletStaticPodController", "error": "error refreshing pod status: error fetching pod status: Get \"https://127.0.0.1:10250/pods/?timeout=30s\": remote error: tls: internal error"}
k8s-staging-control-1: kern: warning: [2026-02-19T22:28:53.779351486Z]: virtio_net virtio1 eth0: XDP request 5 queues but max is 1. XDP_TX and XDP_REDIRECT will operate in a slower locked tx mode.
 SUBSYSTEM=virtio
 DEVICE=+virtio:virtio1
k8s-staging-control-1: kern:    info: [2026-02-19T22:28:55.160668486Z]: eth0: renamed from tmp926ca
k8s-staging-control-1: user: warning: [2026-02-19T22:28:55.209634486Z]: [talos] found private network for private vip (alias IP) {"component": "controller-runtime", "controller": "network.OperatorVIPConfigController", "vip": "10.0.64.126", "network_id": 11955308}
k8s-staging-control-1: user: warning: [2026-02-19T22:28:55.246877486Z]: [talos] found private network for private vip (alias IP) {"component": "controller-runtime", "controller": "network.OperatorVIPConfigController", "vip": "10.0.64.126", "network_id": 11955308}
k8s-staging-control-1: user: warning: [2026-02-19T22:28:56.777506486Z]: [talos] machine is running and ready {"component": "controller-runtime", "controller": "runtime.MachineStatusController"}
k8s-staging-control-1: kern:    info: [2026-02-19T22:28:57.277683486Z]: eth0: renamed from tmpe8ca1
k8s-staging-control-1: user: warning: [2026-02-19T22:28:57.278080486Z]: [talos] found private network for private vip (alias IP) {"component": "controller-runtime", "controller": "network.OperatorVIPConfigController", "vip": "10.0.64.126", "network_id": 11955308}
k8s-staging-control-1: user: warning: [2026-02-19T22:28:57.321235486Z]: [talos] found private network for private vip (alias IP) {"component": "controller-runtime", "controller": "network.OperatorVIPConfigController", "vip": "10.0.64.126", "network_id": 11955308}
k8s-staging-control-1: kern:    info: [2026-02-19T22:28:57.334619486Z]: eth0: renamed from tmpaf114

Confirmation

  • I checked existing issues, discussions, and the web for similar problems

Metadata

Labels: bug (Something isn't working)