cluster_formation.consul.domain_suffix being ignored with use_longname #4220

@namachieli

Summary

Unable to cluster dynamically using Consul service discovery with hostname lookup and "use_longname" enabled. I have troubleshot this down to a name resolution issue (details below).

NOTE
All hostnames have been obfuscated as <rabbit#>. I am aware of the significance of hostnames versus node names.

Also, there may be a discrepancy between a given rabbit# and a specific IP. I had to destroy and rebuild between 2a and 2b, so any inconsistency is likely from that. I have progressed troubleshooting past IP addresses, reachability, etc.

Configuration

rabbitmq.conf
cluster_formation.consul.include_nodes_with_warnings = true

cluster_formation.peer_discovery_backend        = consul
cluster_formation.consul.host                   = <ip address>
cluster_formation.consul.svc                    = rabbitmq
cluster_formation.consul.svc_addr_auto          = true
cluster_formation.consul.svc_addr_use_nodename  = true
cluster_formation.consul.use_longname           = true
cluster_formation.consul.scheme                 = http
cluster_formation.consul.domain_suffix          = consul
cluster_partition_handling                      = autoheal
cluster_formation.node_cleanup.only_log_warning = true
enabled_plugins
[rabbitmq_management,rabbitmq_peer_discovery_consul].
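
For reference, the outcome expected from this configuration can be sketched as below. This is only an assumption about how `use_longname` and `domain_suffix` are meant to combine (peer hostname, then `.node.`, then the suffix); it is not taken from the plugin's source, and `expected_node_name` is a hypothetical helper, not a RabbitMQ command.

```shell
#!/bin/sh
# Hypothetical helper, not part of RabbitMQ.
# Assumption: with use_longname = true, a discovered peer should be
# addressed as "<hostname>.node.<domain_suffix>".
expected_node_name() {
  host="$1"     # short hostname as registered in Consul
  suffix="$2"   # value of cluster_formation.consul.domain_suffix
  printf 'rabbit@%s.node.%s\n' "$host" "$suffix"
}

# "rabbit0" stands in for the obfuscated <rabbit0> hostname.
expected_node_name rabbit0 consul
```

In the logs further down, peers show up as rabbit@<rabbit0> with no suffix, which is the mismatch this issue is about.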

Troubleshooting

Telnet to 5672 to confirm that the rabbit containers can open connections to each other (confirmed in the logs)

Telnet from rabbit1 to rabbit0
<rabbit1:>/# telnet <hostname rabbit0>.node.consul 5672
Trying 10.x.x.101...
Connected to <hostname rabbit0>.node.consul.
Escape character is '^]'.
12345
Connection closed by foreign host.
rabbitmq logs on rabbit0
2022-03-02 20:30:20.638548+00:00 [info] <0.7441.0> accepting AMQP connection <0.7441.0> (10.x.x.82:44964 -> 172.17.0.2:5672)
2022-03-02 20:30:25.146542+00:00 [erro] <0.7441.0> closing AMQP connection <0.7441.0> (10.x.x.82:44964 -> 172.17.0.2:5672):
2022-03-02 20:30:25.146542+00:00 [erro] <0.7441.0> {handshake_timeout,handshake}

The same was done for TCP 15672, 25672, and 4369, in both directions.
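
The per-port checks can be scripted rather than run by hand with telnet. The following is a minimal sketch, assuming `nc` (netcat) with `-z` and `-w` support is installed in the container, which may not be the case for the stock image; `check_ports` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: probe the four ports used in this setup on one host.
# Assumes an `nc` binary that supports -z (scan only) and -w (timeout).
check_ports() {
  host="$1"
  # AMQP, management UI, inter-node distribution, epmd
  for port in 5672 15672 25672 4369; do
    if nc -z -w 2 "$host" "$port"; then
      echo "$host:$port open"
    else
      echo "$host:$port unreachable"
    fi
  done
}

# Example (run from rabbit1; replace the placeholder with the real name):
#   check_ports "<hostname rabbit0>.node.consul"
```

This only confirms TCP reachability, which is the same thing the telnet test shows; it says nothing about name resolution of the short hostname.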

1 (Failing) Attempt to reset rabbit1 and cluster

stop, reset, start
<rabbit1:>/# rabbitmqctl stop_app && rabbitmqctl reset && rabbitmqctl start_app
rabbitmq logs on rabbit1 (clustering failing)
2022-03-02 20:18:39.137566+00:00 [info] <0.9129.0> Running boot step database defined by app rabbit
2022-03-02 20:18:39.138082+00:00 [info] <0.9129.0> Node database directory at /var/lib/rabbitmq/mnesia/rabbit@<rabbit1> is empty. Assuming we need to join an existing cluster or initialise from scratch...
2022-03-02 20:18:39.138155+00:00 [info] <0.9129.0> Configured peer discovery backend: rabbit_peer_discovery_consul
2022-03-02 20:18:39.138532+00:00 [info] <0.9129.0> Will try to lock with peer discovery backend rabbit_peer_discovery_consul
2022-03-02 20:18:39.157275+00:00 [info] <0.9129.0> All discovered existing cluster peers: rabbit@<rabbit0>, rabbit@<rabbit2>
2022-03-02 20:18:39.157350+00:00 [info] <0.9129.0> Peer nodes we can cluster with: rabbit@<rabbit0>, rabbit@<rabbit2>
2022-03-02 20:18:47.158279+00:00 [warn] <0.9129.0> Could not auto-cluster with node rabbit@<rabbit0>: {badrpc,
2022-03-02 20:18:47.158279+00:00 [warn] <0.9129.0>                                                                             nodedown}
/etc/resolv.conf
nameserver 169.254.1.53
options edns0 trust-ad
search <autosearch.domain>

DNSMASQ logs showing the query received
Note that only the hostname with the auto-search domain appended is ever attempted; .node.consul is specifically NOT tried

TCPDUMP showing DNS queries attempted
<rabbit1>:~$ sudo tcpdump udp port 53 --interface docker0
20:54:37.227151 IP 172.17.0.2.53032 > <rabbit1>.domain: 46741+ [1au] A? <rabbit2>.<autosearch.domain>. (70)
20:54:42.231462 IP 172.17.0.2.53032 > <rabbit1>.domain: 46741+ [1au] A? <rabbit2>.<autosearch.domain>. (70)
20:54:45.228519 IP 172.17.0.2.34334 > <rabbit1>.domain: 61467+ [1au] A? <rabbit0>.<autosearch.domain>. (70)
20:54:50.231460 IP 172.17.0.2.34334 > <rabbit1>.domain: 61467+ [1au] A? <rabbit0>.<autosearch.domain>. (70)
20:54:53.730430 IP 172.17.0.2.51814 > <rabbit1>.domain: 53238+ [1au] A? <rabbit2>.<autosearch.domain>. (70)
20:54:58.735626 IP 172.17.0.2.51814 > <rabbit1>.domain: 53238+ [1au] A? <rabbit2>.<autosearch.domain>. (70)
20:55:01.731624 IP 172.17.0.2.40696 > <rabbit1>.domain: 18513+ [1au] A? <rabbit0>.<autosearch.domain>. (70)
20:55:06.736776 IP 172.17.0.2.40696 > <rabbit1>.domain: 18513+ [1au] A? <rabbit0>.<autosearch.domain>. (70)
20:55:10.233289 IP 172.17.0.2.37892 > <rabbit1>.domain: 6287+ [1au] A? <rabbit2>.<autosearch.domain>. (70)
20:55:10.744937 IP <rabbit1>.domain > 172.17.0.2.34334: 61467 NXDomain 0/0/1 (70)
20:55:10.745115 IP <rabbit1>.domain > 172.17.0.2.40696: 18513 NXDomain 0/0/1 (70)
rabbit0.node.consul does resolve if used
20:57:52.463076 IP 172.17.0.2.60604 > <rabbit1>.domain: 31859+ A? <rabbit2>.node.consul. (63)
20:57:52.464935 IP <rabbit1>.domain > 172.17.0.2.60604: 31859* 1/0/1 A 10.x.x.101 (115)
20:57:52.465267 IP 172.17.0.2.42540 > <rabbit1>.domain: 59909+ AAAA? <rabbit2>.node.consul. (63)
20:57:52.466754 IP <rabbit1>.domain > 172.17.0.2.42540: 59909* 0/0/1 (99)
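
The queries above are consistent with the short hostname being expanded only through the resolv.conf search list. A small sketch of that expansion, assuming the usual stub-resolver behavior for a name without dots (each search domain is appended in order); `search_candidates` is a hypothetical helper, and `rabbit0` / `autosearch.domain` stand in for the obfuscated values.

```shell
#!/bin/sh
# Hypothetical helper: list the names that would be queried for a short
# (dot-less) hostname, given the "search" domains from /etc/resolv.conf.
search_candidates() {
  name="$1"
  shift
  for domain in "$@"; do
    printf '%s.%s\n' "$name" "$domain"
  done
}

# With the resolv.conf shown above, only one candidate exists:
search_candidates rabbit0 autosearch.domain

# With node.consul added to the search list, the Consul name is tried too:
search_candidates rabbit0 node.consul autosearch.domain
```

Since the peer name handed to the resolver carries no `.node.consul` part, the lookup can only succeed if the search list supplies it.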

2a (Working) add entries to /etc/hosts for rabbit0 and rabbit1 hostnames mutually

rabbit0
<rabbit0:>/# cat /etc/hosts | grep <rabbit1>
10.x.x.82      <rabbit1>
rabbit1
<rabbit1:>/# cat /etc/hosts | grep <rabbit0>
10.x.x.101      <rabbit0>
Attempt to reset and cluster
<rabbit1>:/# rabbitmqctl stop_app && rabbitmqctl reset && rabbitmqctl start_app
rabbitmq logs on rabbit1 (clustering working)
2022-03-02 20:43:56.517199+00:00 [info] <0.10846.0> Running boot step database defined by app rabbit
2022-03-02 20:43:56.517778+00:00 [info] <0.10846.0> Node database directory at /var/lib/rabbitmq/mnesia/rabbit@<rabbit1> is empty. Assuming we need to join an existing cluster or initialise from scratch...
2022-03-02 20:43:56.517865+00:00 [info] <0.10846.0> Configured peer discovery backend: rabbit_peer_discovery_consul
2022-03-02 20:43:56.518024+00:00 [info] <0.10846.0> Will try to lock with peer discovery backend rabbit_peer_discovery_consul
2022-03-02 20:43:56.536229+00:00 [info] <0.10846.0> All discovered existing cluster peers: rabbit@<rabbit0>, rabbit@<rabbit2>
2022-03-02 20:43:56.536303+00:00 [info] <0.10846.0> Peer nodes we can cluster with: rabbit@<rabbit0>, rabbit@<rabbit2>
2022-03-02 20:43:56.546445+00:00 [info] <0.10846.0> Node 'rabbit@<rabbit0>' selected for auto-clustering
2022-03-02 20:43:56.562305+00:00 [info] <0.10846.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2022-03-02 20:43:56.562456+00:00 [info] <0.10846.0> Successfully synced tables from a peer
2022-03-02 20:43:56.795333+00:00 [info] <0.10846.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2022-03-02 20:43:56.795483+00:00 [info] <0.10846.0> Successfully synced tables from a peer
2022-03-02 20:43:56.795533+00:00 [warn] <0.10846.0> Feature flags: the previous instance of this node must have failed to write the `feature_flags` file at `/var/lib/rabbitmq/mnesia/rabbit@<rabbit1>-feature_flags`:
2022-03-02 20:43:56.795585+00:00 [warn] <0.10846.0> Feature flags:   - list of previously disabled feature flags now marked as such: [empty_basic_get_metric]
2022-03-02 20:43:56.799312+00:00 [info] <0.10846.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2022-03-02 20:43:56.799459+00:00 [info] <0.10846.0> Successfully synced tables from a peer
2022-03-02 20:43:56.815939+00:00 [info] <0.10846.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2022-03-02 20:43:56.816170+00:00 [info] <0.10846.0> Successfully synced tables from a peer
2022-03-02 20:43:56.822951+00:00 [info] <0.10846.0> Setting up a table for connection tracking on this node: 'tracked_connection_on_node_rabbit@<rabbit1>'
2022-03-02 20:43:56.829340+00:00 [info] <0.10846.0> Setting up a table for per-vhost connection counting on this node: 'tracked_connection_per_vhost_on_node_rabbit@<rabbit1>'
2022-03-02 20:43:56.842231+00:00 [info] <0.10846.0> Setting up a table for per-user connection counting on this node: 'tracked_connection_table_per_user_on_node_rabbit@<rabbit1>'
2022-03-02 20:43:56.843712+00:00 [info] <0.10846.0> Will register with peer discovery backend rabbit_peer_discovery_consul

2b (Working) Modify search domains to auto complete node.consul

/etc/resolv.conf
nameserver 169.254.1.53
options edns0 trust-ad
search service.consul node.consul consul <autosearch.domain>
Test DNS resolution
<rabbit1>:/# nslookup <rabbit0>
Server:         169.254.1.53
Address:        169.254.1.53#53

Name:   <rabbit0>.node.consul
Address: 10.x.x.82
Attempt to reset and cluster
<rabbit1>:/# rabbitmqctl stop_app && rabbitmqctl reset && rabbitmqctl start_app
TCPDUMP showing DNS queries attempted
<rabbit1>:~$ sudo tcpdump udp port 53 --interface docker0
22:53:20.934533 IP 172.17.0.2.49236 > <rabbit1>.domain: 36644+ [1au] A? <rabbit0>.service.consul. (77)
22:53:20.935991 IP <rabbit1>.domain > 172.17.0.2.49236: 36644 NXDomain* 0/1/1 (127)
22:53:20.936080 IP 172.17.0.2.46483 > <rabbit1>.domain: 43459+ [1au] A? <rabbit0>.node.consul. (74)
22:53:20.937571 IP <rabbit1>.domain > 172.17.0.2.46483: 43459* 1/0/2 A 10.x.x.82 (126)
22:53:20.937908 IP 172.17.0.2.52939 > <rabbit1>.domain: 29495+ [1au] A? <rabbit0>.service.consul. (77)
22:53:20.939311 IP <rabbit1>.domain > 172.17.0.2.52939: 29495 NXDomain* 0/1/1 (127)
22:53:20.939427 IP 172.17.0.2.57120 > <rabbit1>.domain: 8183+ [1au] A? <rabbit0>.node.consul. (74)
22:53:20.940935 IP <rabbit1>.domain > 172.17.0.2.57120: 8183* 1/0/2 A 10.x.x.82 (126)
rabbitmq logs on rabbit1 (clustering working)
2022-03-02 22:53:20.915774+00:00 [info] <0.5139.0> Running boot step database defined by app rabbit
2022-03-02 22:53:20.916447+00:00 [info] <0.5139.0> Node database directory at /var/lib/rabbitmq/mnesia/rabbit@<rabbit1> is empty. Assuming we need to join an existing cluster or initialise from scratch...
2022-03-02 22:53:20.916560+00:00 [info] <0.5139.0> Configured peer discovery backend: rabbit_peer_discovery_consul
2022-03-02 22:53:20.916709+00:00 [info] <0.5139.0> Will try to lock with peer discovery backend rabbit_peer_discovery_consul
2022-03-02 22:53:20.933669+00:00 [info] <0.5139.0> All discovered existing cluster peers: rabbit@<rabbit0>, rabbit@<rabbit2>
2022-03-02 22:53:20.933723+00:00 [info] <0.5139.0> Peer nodes we can cluster with: rabbit@<rabbit0>, rabbit@<rabbit2>
2022-03-02 22:53:20.956197+00:00 [info] <0.5139.0> Node 'rabbit@<rabbit0>' selected for auto-clustering
2022-03-02 22:53:20.977140+00:00 [info] <0.5139.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2022-03-02 22:53:20.977343+00:00 [info] <0.5139.0> Successfully synced tables from a peer
2022-03-02 22:53:21.314738+00:00 [info] <0.5139.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2022-03-02 22:53:21.314902+00:00 [info] <0.5139.0> Successfully synced tables from a peer
2022-03-02 22:53:21.314948+00:00 [warn] <0.5139.0> Feature flags: the previous instance of this node must have failed to write the `feature_flags` file at `/var/lib/rabbitmq/mnesia/rabbit@<rabbit1>-feature_flags`:
2022-03-02 22:53:21.314993+00:00 [warn] <0.5139.0> Feature flags:   - list of previously disabled feature flags now marked as such: [empty_basic_get_metric]
2022-03-02 22:53:21.319400+00:00 [info] <0.5139.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2022-03-02 22:53:21.319528+00:00 [info] <0.5139.0> Successfully synced tables from a peer
2022-03-02 22:53:21.335570+00:00 [info] <0.5139.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2022-03-02 22:53:21.335777+00:00 [info] <0.5139.0> Successfully synced tables from a peer
2022-03-02 22:53:21.344668+00:00 [info] <0.5139.0> Setting up a table for connection tracking on this node: 'tracked_connection_on_node_rabbit@<rabbit1>'
2022-03-02 22:53:21.352602+00:00 [info] <0.5139.0> Setting up a table for per-vhost connection counting on this node: 'tracked_connection_per_vhost_on_node_rabbit@<rabbit1>'
2022-03-02 22:53:21.360686+00:00 [info] <0.5139.0> Setting up a table for per-user connection counting on this node: 'tracked_connection_table_per_user_on_node_rabbit@<rabbit1>'
2022-03-02 22:53:21.362206+00:00 [info] <0.5139.0> Will register with peer discovery backend rabbit_peer_discovery_consul

Replication

The following Nomad job is used to build everything from stock docker images. DNS on the docker parent host has DNSMasq configured as a selective resolver with a dummy interface to forward to consul or systemd-resolved based on the domain.

Nomad Job to build Cluster

rabbitmq.cluster.nomad
job "rabbitmq" {
  datacenters = ["us-west-2"]
  type        = "service"

  group "cluster" {
    count = 3

    update {
      max_parallel = 1
    }

    network {
      mode = "host"
      # https://stackoverflow.com/questions/63601913/nomad-and-port-mapping
      port "amqp" { static = 5672 }
      port "ui" { static = 15672 }
      port "epmd" { static = 4369 }
      port "internode" { static = 25672 }
    }

    task "rabbitmq" {
      driver = "docker"

      config {
        image    = "rabbitmq:3.9-management"
        hostname = attr.unique.hostname

        # https://stackoverflow.com/questions/63601913/nomad-and-port-mapping
        ports = ["amqp", "ui", "epmd", "internode"]

        mount {
          type     = "bind"
          source   = "local/rabbitmq.conf"
          target   = "/etc/rabbitmq/rabbitmq.conf"
          readonly = false
        }

        mount {
          type     = "bind"
          source   = "local/enabled_plugins"
          target   = "/etc/rabbitmq/enabled_plugins"
          readonly = false
        }
      }

      env {
        RABBITMQ_ERLANG_COOKIE = "ADUMMYSTRINGFORNOW"
        RABBITMQ_DEFAULT_USER  = "test"
        RABBITMQ_DEFAULT_PASS  = "test"
      }

      service {
        name = "rabbitmq-ui"
        port = "ui"
        tags = ["rabbitmq-ui", "urlprefix-/rabbitmq-ui"]

        check {
          type     = "tcp"
          interval = "10s"
          timeout  = "2s"
        }
      }

      template {
        destination   = "local/enabled_plugins"
        change_mode   = "signal"
        change_signal = "SIGHUP"
        data          = <<-EOF
          [rabbitmq_management,rabbitmq_peer_discovery_consul].
        EOF
      }

      template {
        destination   = "local/rabbitmq.conf"
        change_mode   = "signal"
        change_signal = "SIGHUP"
        data          = <<-EOF
          # https://www.rabbitmq.com/configure.html
          # https://www.rabbitmq.com/clustering.html#node-names
          # https://www.rabbitmq.com/cluster-formation.html#peer-discovery-consul
          # https://github.com/rabbitmq/rabbitmq-server/blob/master/deps/rabbit/docs/rabbitmq.conf.example

          cluster_formation.consul.include_nodes_with_warnings = true

          cluster_formation.peer_discovery_backend        = consul
          cluster_formation.consul.host                   = {{ env "attr.unique.network.ip-address" }}
          cluster_formation.consul.svc                    = rabbitmq
          cluster_formation.consul.svc_addr_auto          = true
          cluster_formation.consul.svc_addr_use_nodename  = true
          cluster_formation.consul.use_longname           = true
          cluster_formation.consul.scheme                 = http
          cluster_formation.consul.domain_suffix          = consul
          cluster_partition_handling                      = autoheal
          cluster_formation.node_cleanup.only_log_warning = true
        EOF
      }
    }
  }
}

DNS Masq configuration

/etc/dnsmasq.d/default
port=53
server=127.0.0.53
bind-interfaces
/etc/dnsmasq.d/consul
server=/consul/169.254.1.53#8600
listen-address=169.254.1.53
interface=consul0
provisioning script used to install/configure DNS Masq on Ubuntu 20.04
# Make Dummy Int configs
sudo sh -c "cat <<EOF >> /etc/systemd/network/consul0.netdev
[NetDev]
Name=consul0
Kind=dummy
EOF"

sudo sh -c "cat <<EOF >> /etc/systemd/network/consul0.network
[Match]
Name=consul0

[Network]
Address=169.254.1.53
EOF"

# Restart to pick up new int
sudo systemctl restart systemd-networkd && sleep 1

# Install and configure dnsmasq
sudo apt-get -qq -y install dnsmasq
sudo sed -i "s/nameserver 127.0.0.53/nameserver 169.254.1.53/" /etc/resolv.conf

Workaround

While I'm not a fan of having to modify the search suffixes to include node.consul, that is a scalable and simple workaround for the time being.
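
One way to apply this workaround is sketched below. It is a text filter only: it reads resolv.conf content on stdin and prints the modified content, leaving it to the caller to decide how to write the file (in a container, /etc/resolv.conf is often managed by the runtime). `add_consul_search` is a hypothetical helper; the list of domains matches step 2b.

```shell
#!/bin/sh
# Hypothetical helper: put the Consul domains at the front of the "search"
# line. If the input has no "search" line, one is appended at the end.
# Not idempotent: running it twice repeats the Consul domains.
add_consul_search() {
  awk -v extra="service.consul node.consul consul" '
    $1 == "search" {
      $1 = ""              # drop the keyword, keep the existing domains
      sub(/^ +/, "")
      print "search " extra " " $0
      seen = 1
      next
    }
    { print }
    END { if (!seen) print "search " extra }
  '
}

# Example with a made-up existing search domain:
printf 'nameserver 169.254.1.53\nsearch example.internal\n' | add_consul_search
```

Every container in the job needs the same change, so this would have to be applied at provisioning time or through the container runtime's DNS search options.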
