remove-etcd-node is unable to remove node when the target node is shutdown/unreachable #12986

@ICHx

Description

What happened?

TASK [remove-node/remove-etcd-node : Remove member from cluster] *****************************
fatal: [timy-rm-c7 -> timy-rm-c4]: FAILED! => {"msg": "The conditional check 'etcd_removed_nodes != []' failed. The error was: An unhandled exception occurred while templating '{{ (etcd_members.stdout | from_json).members | selectattr('peerURLs.0', '==', etcd_peer_url) }}'. Error was a <class 'ansible.errors.AnsibleError'>, original message: An unhandled exception occurred while templating 'https://{{ etcd_access_address | ansible.utils.ipwrap }}:2380'. Error was a <class 'ansible.errors.AnsibleFilterError'>, original message: Unrecognized type <<class 'ansible.template.native_helpers.AnsibleUndefined'>> for ipwrap filter <value>

The error appears to be in '/home/rocky/k8s-deploy/roles/remove-node/remove-etcd-node/tasks/main.yml': line 38, column 11, but may
be elsewhere in the file depending on the exact syntax problem.

The offending line appears to be:

      # This should always have at most one member, since the etcd_peer_url should be unique in the etcd cluster
    when: etcd_removed_nodes != []
          ^ here
"}

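etcd_access_address is defined as "{{ hostvars[inventory_hostname]['main_access_ip'] }}". When the node is turned off, main_access_ip is undefined, so the ipwrap filter in etcd_peer_url receives an undefined value and templating fails.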

What did you expect to happen?

The node should be removable from the cluster using the IP from the hosts file, even when the node is unreachable.

How can we reproduce it (as minimally and precisely as possible)?

  1. Create the cluster with the hosts file shown under "Full inventory with variables".
  2. Shut down c7 intentionally, then try to remove the node from the cluster with the extra vars reset_nodes=false allow_ungraceful_removal=true.

OS

RHEL 8

Version of Ansible

[rocky@k8s-node-001 ~]$ ansible-playbook --version
ansible-playbook [core 2.16.15]
config file = None
configured module search path = ['/home/rocky/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
ansible python module location = /home/rocky/k8s-deploy/ansible-py3/ansible
ansible collection location = /home/rocky/.ansible/collections:/usr/share/ansible/collections
executable location = /home/rocky/k8s-deploy/ansible-playbook-py3
python version = 3.12.12 (main, Jan 6 2026, 17:15:29) [GCC 8.5.0 20210514 (Red Hat 8.5.0-28)] (/usr/bin/python3.12)
jinja version = 3.1.6
libyaml = True

Version of Python

3.12.12

Version of Kubespray (commit)

22fb8f8

Network plugin used

flannel

Full inventory with variables

[all:vars]
ansible_user='rocky'
ansible_ssh_private_key_file='/home/rocky/k8s-deploy/xxx-us-west-2.pem'
#ansible_ssh_pass=''
ansible_become=true
ansible_become_method='sudo'
#ansible_become_pass=''
# Uncomment and set ansible_python_interpreter to '/usr/libexec/platform-python' when OS is RockyLinux 8
ansible_python_interpreter='/usr/libexec/platform-python'
# Uncomment and configure 'ip' variable to bind K8s services on a different IP than the default network interface IP
[all]
timy-rm-c1 ansible_host=10.10.1.171
timy-rm-c2 ansible_host=10.10.1.208
timy-rm-c3 ansible_host=10.10.1.229
timy-rm-c4 ansible_host=10.10.1.227
timy-rm-c5 ansible_host=10.10.1.141
timy-rm-c6 ansible_host=10.10.1.31
timy-rm-c7 ansible_host=10.10.1.201
[etcd]
timy-rm-c1
timy-rm-c2
timy-rm-c3
timy-rm-c4
timy-rm-c5
timy-rm-c6
timy-rm-c7
[kube_control_plane]
timy-rm-c1
timy-rm-c2
timy-rm-c3
timy-rm-c4
timy-rm-c5
timy-rm-c6
timy-rm-c7
[kube_node]
timy-rm-c1
timy-rm-c2
timy-rm-c3
timy-rm-c4
timy-rm-c5
timy-rm-c6
timy-rm-c7
[kube_ingress]
timy-rm-c1
timy-rm-c2
timy-rm-c3
timy-rm-c4
timy-rm-c5
timy-rm-c6
timy-rm-c7
[vos_generic_node]
timy-rm-c1
timy-rm-c2
timy-rm-c3
timy-rm-c4
timy-rm-c5
timy-rm-c6
timy-rm-c7
[vos_controller_node]
[vos_egress_node]
[vos_ingest_node]
[pgdb]
timy-rm-c3
timy-rm-c4
timy-rm-c5
[prometheus]
# DO NOT MODIFY BELOW UNLESS YOU KNOW WHAT YOU ARE DOING
[kube_node:children]
kube_ingress
vos_node
pgdb
prometheus
[vos_node:children]
vos_controller_node
vos_egress_node
vos_ingest_node
[vos_controller_node:children]
vos_generic_node
[vos_egress_node:children]
vos_generic_node
[vos_ingest_node:children]
vos_generic_node

Command used to invoke ansible

ansible-playbook -vb remove-node.yml -e node=timy-rm-c7 -e 'reset_nodes=false allow_ungraceful_removal=true' --flush-cache

Output of ansible run

Our modified task:

---
# Such that first etcd can be deleted
- name: Set delegated_etcd_host
  set_fact: delegated_etcd_host="{{ groups['etcd'] | difference([(node | default(''))]) | first }}"

- name: Remove etcd member from cluster
  environment:
    ETCDCTL_API: "3"
    ETCDCTL_CERT: "{{ kube_cert_dir + '/etcd/server.crt' if etcd_deployment_type == 'kubeadm' else etcd_cert_dir + '/admin-' + delegated_etcd_host + '.pem' }}"
    ETCDCTL_KEY: "{{ kube_cert_dir + '/etcd/server.key' if etcd_deployment_type == 'kubeadm' else etcd_cert_dir + '/admin-' + delegated_etcd_host + '-key.pem' }}"
    ETCDCTL_CACERT: "{{ kube_cert_dir + '/etcd/ca.crt' if etcd_deployment_type == 'kubeadm' else etcd_cert_dir + '/ca.pem' }}"
    ETCDCTL_ENDPOINTS: "https://127.0.0.1:2379"
  #delegate_to: "{{ groups['etcd'] | first }}"
  delegate_to: "{{ delegated_etcd_host }}"
  block:
  - name: Lookup members infos
    command: "{{ bin_dir }}/etcdctl member list -w json"
    register: etcd_members
    changed_when: false
    check_mode: false
    tags:
    - facts
  - name: Debug etcdctl member list
    debug:
      msg: "{{ etcd_members.stdout_lines }}"
  - name: Remove member from cluster
    command:
      argv:
      - "{{ bin_dir }}/etcdctl"
      - member
      - remove
      # Merge https://github.com/kubernetes-sigs/kubespray/commit/22fb8f8c988954eecb2ab1bf36dbfb0d13668327
      #- "{{ '%x' | format(((etcd_members.stdout | from_json).members | selectattr('peerURLs.0', '==', etcd_peer_url))[0].ID) }}"
      - "{{ '%x' | format(etcd_removed_nodes[0].ID) }}"
    vars:
      etcd_removed_nodes: "{{ (etcd_members.stdout | from_json).members | selectattr('peerURLs.0', '==', etcd_peer_url) }}"
      # This should always have at most one member, since the etcd_peer_url should be unique in the etcd cluster
    when: etcd_removed_nodes != []
    register: etcd_removal_output
    changed_when: "'Removed member' in etcd_removal_output.stdout"
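
For readers tracing the failing conditional, the lookup that etcd_removed_nodes performs can be sketched outside Ansible. This is a minimal illustration in Python, not Kubespray code: find_member_id_hex is a hypothetical helper, and the sample JSON only imitates the general shape of `etcdctl member list -w json` output (the member IDs and names are invented). It suggests that the match depends only on the peer URL string, so the lookup itself should not need the target node to be reachable once that URL can be built.

```python
import json


def find_member_id_hex(member_list_json, peer_url):
    """Return the hex ID of the member whose first peer URL matches, or None."""
    members = json.loads(member_list_json).get("members", [])
    # Mirrors selectattr('peerURLs.0', '==', etcd_peer_url) in the task above.
    matches = [
        m for m in members
        if (m.get("peerURLs") or [None])[0] == peer_url
    ]
    if not matches:
        # Corresponds to `when: etcd_removed_nodes != []` evaluating to false.
        return None
    # Mirrors "{{ '%x' | format(etcd_removed_nodes[0].ID) }}" in the task.
    return "%x" % matches[0]["ID"]


# Hypothetical sample data; IDs and names are invented for illustration.
SAMPLE = json.dumps({
    "members": [
        {"ID": 255, "name": "etcd4", "peerURLs": ["https://10.10.1.227:2380"]},
        {"ID": 4096, "name": "etcd7", "peerURLs": ["https://10.10.1.201:2380"]},
    ]
})

print(find_member_id_hex(SAMPLE, "https://10.10.1.201:2380"))
```

If the peer URL cannot be templated at all, as in the error reported here, this lookup never runs, which is why the failure surfaces on the `when:` line rather than in the etcdctl call.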

Error output: identical to the failure shown under "What happened?" above.

Anything else we need to know

Behaviour before commit a0261a11aa4c18b53a6b733996e5df5241169314, in roles/kubespray-defaults/defaults/main/main.yml:

# Vars for pointing to etcd endpoints
etcd_address: "{{ ip | default(fallback_ip) }}"
etcd_access_address: "{{ access_ip | default(etcd_address) }}"
etcd_events_access_address: "{{ access_ip | default(etcd_address) }}"
etcd_peer_url: "https://{{ etcd_access_address }}:2380"

New behaviour, in the same file:

# Vars for pointing to etcd endpoints
etcd_address: "{{ hostvars[inventory_hostname]['main_ip'] }}"
etcd_access_address: "{{ hostvars[inventory_hostname]['main_access_ip'] }}"
etcd_events_access_address: "{{ hostvars[inventory_hostname]['main_access_ip'] }}"
etcd_peer_url: "https://{{ etcd_access_address | ansible.utils.ipwrap }}:2380"
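
To make the difference between the two definitions concrete, here is a hedged Python sketch of how the peer URL is assembled. It is an illustration only: build_peer_url and pick_address are hypothetical helpers, the bracket wrapping merely approximates what ansible.utils.ipwrap does for IPv6 addresses, and falling back to ansible_host is an assumption that matches the expectation stated in this report, not something Kubespray is known to do.

```python
import ipaddress


def pick_address(hostvars):
    """Return the first defined address, in a hypothetical fallback order."""
    for key in ("main_access_ip", "access_ip", "ip", "ansible_host"):
        value = hostvars.get(key)
        if value:
            return value
    return None


def build_peer_url(address):
    """Build https://<address>:2380, wrapping IPv6 literals in brackets."""
    if address is None:
        # Analogous to ipwrap receiving an undefined value in the reported error.
        raise ValueError("no address defined for this host")
    try:
        ip = ipaddress.ip_address(address)
    except ValueError:
        ip = None  # Not an IP literal (for example a hostname): use it unchanged.
    host = f"[{address}]" if ip is not None and ip.version == 6 else address
    return f"https://{host}:2380"


# Assumption: for an unreachable node, only inventory variables are available.
offline_node = {"ansible_host": "10.10.1.201"}
print(build_peer_url(pick_address(offline_node)))
```

With only ansible_host available, the sketch still yields a usable peer URL, whereas a definition that relies solely on main_access_ip has nothing to template.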

Labels: RHEL 8, kind/bug
