Description
Summary
Unable to cluster dynamically using Consul service discovery with hostname lookup and "use_longname". I have fully troubleshot this down to a name-resolution issue (details below).
NOTE
All hostnames have been obfuscated to <rabbit#>. I am aware of the significance of hostnames and/vs node names.
Also, there may be discrepancies between a given rabbit# and a specific IP. I had to destroy/rebuild between 2a and 2b, so any inconsistency is likely from that. I have progressed troubleshooting past IP addresses, reachability, etc.
Configuration
rabbitmq.conf
cluster_formation.consul.include_nodes_with_warnings = true
cluster_formation.peer_discovery_backend = consul
cluster_formation.consul.host = <ip address>
cluster_formation.consul.svc = rabbitmq
cluster_formation.consul.svc_addr_auto = true
cluster_formation.consul.svc_addr_use_nodename = true
cluster_formation.consul.use_longname = true
cluster_formation.consul.scheme = http
cluster_formation.consul.domain_suffix = consul
cluster_partition_handling = autoheal
cluster_formation.node_cleanup.only_log_warning = true
enabled_plugins
[rabbitmq_management,rabbitmq_peer_discovery_consul].
Troubleshooting
Telnet to 5672 to confirm the rabbit containers can open connections; confirmed in the logs.
Telnet from rabbit1 to rabbit0
<rabbit1:>/# telnet <hostname rabbit0>.node.consul 5672
Trying 10.x.x.101...
Connected to <hostname rabbit0>.node.consul.
Escape character is '^]'.
12345
Connection closed by foreign host.
rabbitmq logs on rabbit0
2022-03-02 20:30:20.638548+00:00 [info] <0.7441.0> accepting AMQP connection <0.7441.0> (10.x.x.82:44964 -> 172.17.0.2:5672)
2022-03-02 20:30:25.146542+00:00 [erro] <0.7441.0> closing AMQP connection <0.7441.0> (10.x.x.82:44964 -> 172.17.0.2:5672):
2022-03-02 20:30:25.146542+00:00 [erro] <0.7441.0> {handshake_timeout,handshake}
The same was done for TCP 15672, 25672, and 4369... also in both directions.
1 (Failing) Attempt to reset rabbit1 and cluster
stop, reset, start
<rabbit1:>/# rabbitmqctl stop_app && rabbitmqctl reset && rabbitmqctl start_app
rabbitmq logs on rabbit1 (clustering failing)
2022-03-02 20:18:39.137566+00:00 [info] <0.9129.0> Running boot step database defined by app rabbit
2022-03-02 20:18:39.138082+00:00 [info] <0.9129.0> Node database directory at /var/lib/rabbitmq/mnesia/rabbit@<rabbit1> is empty. Assuming we need to join an existing cluster or initialise from scratch...
2022-03-02 20:18:39.138155+00:00 [info] <0.9129.0> Configured peer discovery backend: rabbit_peer_discovery_consul
2022-03-02 20:18:39.138532+00:00 [info] <0.9129.0> Will try to lock with peer discovery backend rabbit_peer_discovery_consul
2022-03-02 20:18:39.157275+00:00 [info] <0.9129.0> All discovered existing cluster peers: rabbit@<rabbit0>, rabbit@<rabbit2>
2022-03-02 20:18:39.157350+00:00 [info] <0.9129.0> Peer nodes we can cluster with: rabbit@<rabbit0>, rabbit@<rabbit2>
2022-03-02 20:18:47.158279+00:00 [warn] <0.9129.0> Could not auto-cluster with node rabbit@<rabbit0>: {badrpc,
2022-03-02 20:18:47.158279+00:00 [warn] <0.9129.0> nodedown}
/etc/resolv.conf
nameserver 169.254.1.53
options edns0 trust-ad
search <autosearch.domain>
dnsmasq logs showing the query received
Note that only the hostname with the auto-appended search domain is ever attempted, and specifically NOT .node.consul.
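This matches standard stub-resolver search-list behavior: a bare hostname (no dots) is expanded with each `search` suffix in order, and `.node.consul` is never tried because it is not in the list. A rough simulation of that expansion (simplified; it ignores some glibc `ndots` edge cases and is for illustration only):

```python
def candidate_queries(name: str, search: list[str], ndots: int = 1) -> list[str]:
    """Approximate the order in which a stub resolver tries names.

    A name with fewer than `ndots` dots (and no trailing dot) is first
    expanded with each search suffix; the literal name is tried last.
    A trailing dot suppresses search-list expansion entirely.
    """
    if name.endswith("."):
        return [name.rstrip(".")]
    expanded = [f"{name}.{suffix}" for suffix in search]
    if name.count(".") >= ndots:
        return [name] + expanded
    return expanded + [name]

# Failing case: only <autosearch.domain> is searched, so the .node.consul
# name is never generated.
print(candidate_queries("rabbit0", ["autosearch.domain"]))
# Working case (2b below): node.consul added to the search list, so
# rabbit0.service.consul is tried (NXDomain), then rabbit0.node.consul.
print(candidate_queries("rabbit0", ["service.consul", "node.consul", "consul", "autosearch.domain"]))
```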
TCPDUMP showing DNS queries attempted
<rabbit1>:~$ sudo tcpdump udp port 53 --interface docker0
20:54:37.227151 IP 172.17.0.2.53032 > <rabbit1>.domain: 46741+ [1au] A? <rabbit2>.<autosearch.domain>. (70)
20:54:42.231462 IP 172.17.0.2.53032 > <rabbit1>.domain: 46741+ [1au] A? <rabbit2>.<autosearch.domain>. (70)
20:54:45.228519 IP 172.17.0.2.34334 > <rabbit1>.domain: 61467+ [1au] A? <rabbit0>.<autosearch.domain>. (70)
20:54:50.231460 IP 172.17.0.2.34334 > <rabbit1>.domain: 61467+ [1au] A? <rabbit0>.<autosearch.domain>. (70)
20:54:53.730430 IP 172.17.0.2.51814 > <rabbit1>.domain: 53238+ [1au] A? <rabbit2>.<autosearch.domain>. (70)
20:54:58.735626 IP 172.17.0.2.51814 > <rabbit1>.domain: 53238+ [1au] A? <rabbit2>.<autosearch.domain>. (70)
20:55:01.731624 IP 172.17.0.2.40696 > <rabbit1>.domain: 18513+ [1au] A? <rabbit0>.<autosearch.domain>. (70)
20:55:06.736776 IP 172.17.0.2.40696 > <rabbit1>.domain: 18513+ [1au] A? <rabbit0>.<autosearch.domain>. (70)
20:55:10.233289 IP 172.17.0.2.37892 > <rabbit1>.domain: 6287+ [1au] A? <rabbit2>.<autosearch.domain>. (70)
20:55:10.744937 IP <rabbit1>.domain > 172.17.0.2.34334: 61467 NXDomain 0/0/1 (70)
20:55:10.745115 IP <rabbit1>.domain > 172.17.0.2.40696: 18513 NXDomain 0/0/1 (70)
rabbit0.node.consul does resolve if used
20:57:52.463076 IP 172.17.0.2.60604 > <rabbit1>.domain: 31859+ A? <rabbit2>.node.consul. (63)
20:57:52.464935 IP <rabbit1>.domain > 172.17.0.2.60604: 31859* 1/0/1 A 10.x.x.101 (115)
20:57:52.465267 IP 172.17.0.2.42540 > <rabbit1>.domain: 59909+ AAAA? <rabbit2>.node.consul. (63)
20:57:52.466754 IP <rabbit1>.domain > 172.17.0.2.42540: 59909* 0/0/1 (99)
2a (Working) Add entries to /etc/hosts for rabbit0 and rabbit1 hostnames mutually
rabbit0
<rabbit0:>/# cat /etc/hosts | grep <rabbit1>
10.x.x.82 <rabbit1>
rabbit1
<rabbit1:>/# cat /etc/hosts | grep <rabbit0>
10.x.x.101 <rabbit0>
Attempt to reset and cluster
<rabbit1>:/# rabbitmqctl stop_app && rabbitmqctl reset && rabbitmqctl start_app
rabbitmq logs on rabbit1 (clustering working)
2022-03-02 20:43:56.517199+00:00 [info] <0.10846.0> Running boot step database defined by app rabbit
2022-03-02 20:43:56.517778+00:00 [info] <0.10846.0> Node database directory at /var/lib/rabbitmq/mnesia/rabbit@<rabbit1> is empty. Assuming we need to join an existing cluster or initialise from scratch...
2022-03-02 20:43:56.517865+00:00 [info] <0.10846.0> Configured peer discovery backend: rabbit_peer_discovery_consul
2022-03-02 20:43:56.518024+00:00 [info] <0.10846.0> Will try to lock with peer discovery backend rabbit_peer_discovery_consul
2022-03-02 20:43:56.536229+00:00 [info] <0.10846.0> All discovered existing cluster peers: rabbit@<rabbit0>, rabbit@<rabbit2>
2022-03-02 20:43:56.536303+00:00 [info] <0.10846.0> Peer nodes we can cluster with: rabbit@<rabbit0>, rabbit@<rabbit2>
2022-03-02 20:43:56.546445+00:00 [info] <0.10846.0> Node 'rabbit@<rabbit0>' selected for auto-clustering
2022-03-02 20:43:56.562305+00:00 [info] <0.10846.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2022-03-02 20:43:56.562456+00:00 [info] <0.10846.0> Successfully synced tables from a peer
2022-03-02 20:43:56.795333+00:00 [info] <0.10846.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2022-03-02 20:43:56.795483+00:00 [info] <0.10846.0> Successfully synced tables from a peer
2022-03-02 20:43:56.795533+00:00 [warn] <0.10846.0> Feature flags: the previous instance of this node must have failed to write the `feature_flags` file at `/var/lib/rabbitmq/mnesia/rabbit@<rabbit1>-feature_flags`:
2022-03-02 20:43:56.795585+00:00 [warn] <0.10846.0> Feature flags: - list of previously disabled feature flags now marked as such: [empty_basic_get_metric]
2022-03-02 20:43:56.799312+00:00 [info] <0.10846.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2022-03-02 20:43:56.799459+00:00 [info] <0.10846.0> Successfully synced tables from a peer
2022-03-02 20:43:56.815939+00:00 [info] <0.10846.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2022-03-02 20:43:56.816170+00:00 [info] <0.10846.0> Successfully synced tables from a peer
2022-03-02 20:43:56.822951+00:00 [info] <0.10846.0> Setting up a table for connection tracking on this node: 'tracked_connection_on_node_rabbit@<rabbit1>'
2022-03-02 20:43:56.829340+00:00 [info] <0.10846.0> Setting up a table for per-vhost connection counting on this node: 'tracked_connection_per_vhost_on_node_rabbit@<rabbit1>'
2022-03-02 20:43:56.842231+00:00 [info] <0.10846.0> Setting up a table for per-user connection counting on this node: 'tracked_connection_table_per_user_on_node_rabbit@<rabbit1>'
2022-03-02 20:43:56.843712+00:00 [info] <0.10846.0> Will register with peer discovery backend rabbit_peer_discovery_consul
2b (Working) Modify search domains to auto-complete node.consul
/etc/resolv.conf
nameserver 169.254.1.53
options edns0 trust-ad
search service.consul node.consul consul <autosearch.domain>
Test DNS resolution
<rabbit1>:/# nslookup <rabbit0>
Server: 169.254.1.53
Address: 169.254.1.53#53
Name: <rabbit0>.node.consul
Address: 10.x.x.82
Attempt to reset and cluster
<rabbit1>:/# rabbitmqctl stop_app && rabbitmqctl reset && rabbitmqctl start_app
TCPDUMP showing DNS queries attempted
<rabbit1>:~$ sudo tcpdump udp port 53 --interface docker0
22:53:20.934533 IP 172.17.0.2.49236 > <rabbit1>.domain: 36644+ [1au] A? <rabbit0>.service.consul. (77)
22:53:20.935991 IP <rabbit1>.domain > 172.17.0.2.49236: 36644 NXDomain* 0/1/1 (127)
22:53:20.936080 IP 172.17.0.2.46483 > <rabbit1>.domain: 43459+ [1au] A? <rabbit0>.node.consul. (74)
22:53:20.937571 IP <rabbit1>.domain > 172.17.0.2.46483: 43459* 1/0/2 A 10.x.x.82 (126)
22:53:20.937908 IP 172.17.0.2.52939 > <rabbit1>.domain: 29495+ [1au] A? <rabbit0>.service.consul. (77)
22:53:20.939311 IP <rabbit1>.domain > 172.17.0.2.52939: 29495 NXDomain* 0/1/1 (127)
22:53:20.939427 IP 172.17.0.2.57120 > <rabbit1>.domain: 8183+ [1au] A? <rabbit0>.node.consul. (74)
22:53:20.940935 IP <rabbit1>.domain > 172.17.0.2.57120: 8183* 1/0/2 A 10.x.x.82 (126)
rabbitmq logs on rabbit1 (clustering working)
2022-03-02 22:53:20.915774+00:00 [info] <0.5139.0> Running boot step database defined by app rabbit
2022-03-02 22:53:20.916447+00:00 [info] <0.5139.0> Node database directory at /var/lib/rabbitmq/mnesia/rabbit@<rabbit1> is empty. Assuming we need to join an existing cluster or initialise from scratch...
2022-03-02 22:53:20.916560+00:00 [info] <0.5139.0> Configured peer discovery backend: rabbit_peer_discovery_consul
2022-03-02 22:53:20.916709+00:00 [info] <0.5139.0> Will try to lock with peer discovery backend rabbit_peer_discovery_consul
2022-03-02 22:53:20.933669+00:00 [info] <0.5139.0> All discovered existing cluster peers: rabbit@<rabbit0>, rabbit@<rabbit2>
2022-03-02 22:53:20.933723+00:00 [info] <0.5139.0> Peer nodes we can cluster with: rabbit@<rabbit0>, rabbit@<rabbit2>
2022-03-02 22:53:20.956197+00:00 [info] <0.5139.0> Node 'rabbit@<rabbit0>' selected for auto-clustering
2022-03-02 22:53:20.977140+00:00 [info] <0.5139.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2022-03-02 22:53:20.977343+00:00 [info] <0.5139.0> Successfully synced tables from a peer
2022-03-02 22:53:21.314738+00:00 [info] <0.5139.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2022-03-02 22:53:21.314902+00:00 [info] <0.5139.0> Successfully synced tables from a peer
2022-03-02 22:53:21.314948+00:00 [warn] <0.5139.0> Feature flags: the previous instance of this node must have failed to write the `feature_flags` file at `/var/lib/rabbitmq/mnesia/rabbit@<rabbit1>-feature_flags`:
2022-03-02 22:53:21.314993+00:00 [warn] <0.5139.0> Feature flags: - list of previously disabled feature flags now marked as such: [empty_basic_get_metric]
2022-03-02 22:53:21.319400+00:00 [info] <0.5139.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2022-03-02 22:53:21.319528+00:00 [info] <0.5139.0> Successfully synced tables from a peer
2022-03-02 22:53:21.335570+00:00 [info] <0.5139.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2022-03-02 22:53:21.335777+00:00 [info] <0.5139.0> Successfully synced tables from a peer
2022-03-02 22:53:21.344668+00:00 [info] <0.5139.0> Setting up a table for connection tracking on this node: 'tracked_connection_on_node_rabbit@<rabbit1>'
2022-03-02 22:53:21.352602+00:00 [info] <0.5139.0> Setting up a table for per-vhost connection counting on this node: 'tracked_connection_per_vhost_on_node_rabbit@<rabbit1>'
2022-03-02 22:53:21.360686+00:00 [info] <0.5139.0> Setting up a table for per-user connection counting on this node: 'tracked_connection_table_per_user_on_node_rabbit@<rabbit1>'
2022-03-02 22:53:21.362206+00:00 [info] <0.5139.0> Will register with peer discovery backend rabbit_peer_discovery_consul
Replication
The following Nomad job builds everything from stock Docker images. On the Docker parent host, dnsmasq is configured as a selective resolver bound to a dummy interface, forwarding queries to either Consul or systemd-resolved based on the domain.
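The selective-forwarding rule in the dnsmasq config below (`server=/consul/169.254.1.53#8600`) sends any query ending in `.consul` to Consul's DNS port and everything else to the default upstream. A toy model of that routing decision (not dnsmasq code, just the matching rule it applies here):

```python
def pick_upstream(qname: str) -> str:
    """Route a query the way the dnsmasq config in this issue does:
    anything under (or equal to) the 'consul' domain goes to Consul DNS
    on 169.254.1.53:8600; everything else goes to systemd-resolved."""
    CONSUL_UPSTREAM = "169.254.1.53#8600"
    DEFAULT_UPSTREAM = "127.0.0.53#53"
    labels = qname.rstrip(".").split(".")
    if labels[-1] == "consul":
        return CONSUL_UPSTREAM
    return DEFAULT_UPSTREAM

print(pick_upstream("rabbit0.node.consul"))        # 169.254.1.53#8600
print(pick_upstream("rabbit0.autosearch.domain"))  # 127.0.0.53#53
```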
Nomad Job to build Cluster
rabbitmq.cluster.nomad
job "rabbitmq" {
datacenters = ["us-west-2"]
type = "service"
group "cluster" {
count = 3
update {
max_parallel = 1
}
network {
mode = "host"
# https://stackoverflow.com/questions/63601913/nomad-and-port-mapping
port "amqp" { static = 5672 }
port "ui" { static = 15672 }
port "epmd" { static = 4369 }
port "internode" { static = 25672 }
}
task "rabbitmq" {
driver = "docker"
config {
image = "rabbitmq:3.9-management"
hostname = attr.unique.hostname
# https://stackoverflow.com/questions/63601913/nomad-and-port-mapping
ports = ["amqp", "ui", "epmd", "internode"]
mount {
type = "bind"
source = "local/rabbitmq.conf"
target = "/etc/rabbitmq/rabbitmq.conf"
readonly = false
}
mount {
type = "bind"
source = "local/enabled_plugins"
target = "/etc/rabbitmq/enabled_plugins"
readonly = false
}
}
env {
RABBITMQ_ERLANG_COOKIE = "ADUMMYSTRINGFORNOW"
RABBITMQ_DEFAULT_USER = "test"
RABBITMQ_DEFAULT_PASS = "test"
}
service {
name = "rabbitmq-ui"
port = "ui"
tags = ["rabbitmq-ui", "urlprefix-/rabbitmq-ui"]
check {
type = "tcp"
interval = "10s"
timeout = "2s"
}
}
template {
destination = "local/enabled_plugins"
change_mode = "signal"
change_signal = "SIGHUP"
data = <<-EOF
[rabbitmq_management,rabbitmq_peer_discovery_consul].
EOF
}
template {
destination = "local/rabbitmq.conf"
change_mode = "signal"
change_signal = "SIGHUP"
data = <<-EOF
# https://www.rabbitmq.com/configure.html
# https://www.rabbitmq.com/clustering.html#node-names
# https://www.rabbitmq.com/cluster-formation.html#peer-discovery-consul
# https://github.com/rabbitmq/rabbitmq-server/blob/master/deps/rabbit/docs/rabbitmq.conf.example
cluster_formation.consul.include_nodes_with_warnings = true
cluster_formation.peer_discovery_backend = consul
cluster_formation.consul.host = {{ env "attr.unique.network.ip-address" }}
cluster_formation.consul.svc = rabbitmq
cluster_formation.consul.svc_addr_auto = true
cluster_formation.consul.svc_addr_use_nodename = true
cluster_formation.consul.use_longname = true
cluster_formation.consul.scheme = http
cluster_formation.consul.domain_suffix = consul
cluster_partition_handling = autoheal
cluster_formation.node_cleanup.only_log_warning = true
EOF
}
}
}
}
dnsmasq configuration
/etc/dnsmasq.d/default
port=53
server=127.0.0.53
bind-interfaces
/etc/dnsmasq.d/consul
server=/consul/169.254.1.53#8600
listen-address=169.254.1.53
interface=consul0
Provisioning script used to install/configure dnsmasq on Ubuntu 20.04
# Make Dummy Int configs
sudo sh -c "cat <<EOF >> /etc/systemd/network/consul0.netdev
[NetDev]
Name=consul0
Kind=dummy
EOF"
sudo sh -c "cat <<EOF >> /etc/systemd/network/consul0.network
[Match]
Name=consul0
[Network]
Address=169.254.1.53
EOF"
# Restart to pick up new int
sudo systemctl restart systemd-networkd && sleep 1
# Install configure dnsmasq
sudo apt-get -qq -y install dnsmasq
sudo sed -i "s/nameserver 127.0.0.53/nameserver 169.254.1.53/" /etc/resolv.conf
Workaround
While I'm not a fan of having to modify the search suffixes to include node.consul, it is a simple and scalable workaround for the time being.
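If editing the host's resolv.conf ever becomes a problem, a possible alternative (untested in this setup; it assumes a Nomad version that supports the `dns` block inside `network`, and may interact with Docker's own DNS handling) would be to push the search list into the allocation instead of the host:

```hcl
network {
  mode = "host"
  # Sketch only: set the consul suffixes in the task's resolv.conf
  # rather than editing the Docker host's search list.
  dns {
    servers  = ["169.254.1.53"]
    searches = ["node.consul", "service.consul", "consul"]
  }
  port "amqp"      { static = 5672 }
  port "ui"        { static = 15672 }
  port "epmd"      { static = 4369 }
  port "internode" { static = 25672 }
}
```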