cluster_formation.consul.domain_suffix being ignored with use_longname #4229

namachieli · 2022-03-02T23:58:41Z

namachieli
Mar 2, 2022

Summary

Unable to cluster dynamically using Consul service discovery with hostname lookup and "use_longname". I have troubleshot this down to a name resolution issue (details below).

NOTE
All hostnames have been obfuscated away into <rabbit#>. I am aware of the significance of hostnames and/vs node names.

Also, there may be discrepancies between rabbit# and a specific IP. I had to destroy/rebuild in between 2a and 2b, so any inconsistency is likely from that. I have progressed troubleshooting past IP addresses, reachability, etc.

Configuration

rabbitmq.conf

cluster_formation.consul.include_nodes_with_warnings = true

cluster_formation.peer_discovery_backend        = consul
cluster_formation.consul.host                   = <ip address>
cluster_formation.consul.svc                    = rabbitmq
cluster_formation.consul.svc_addr_auto          = true
cluster_formation.consul.svc_addr_use_nodename  = true
cluster_formation.consul.use_longname           = true
cluster_formation.consul.scheme                 = http
cluster_formation.consul.domain_suffix          = consul
cluster_partition_handling                      = autoheal
cluster_formation.node_cleanup.only_log_warning = true

enabled_plugins

[rabbitmq_management,rabbitmq_peer_discovery_consul].

Troubleshooting

telnet to 5672 to confirm rabbit containers can open connections, confirmed in logs

Telnet from rabbit1 to rabbit0

<rabbit1:>/# telnet <hostname rabbit0>.node.consul 5672
Trying 10.x.x.101...
Connected to <hostname rabbit0>.node.consul.
Escape character is '^]'.
12345
Connection closed by foreign host.

rabbitmq logs on rabbit0

2022-03-02 20:30:20.638548+00:00 [info] <0.7441.0> accepting AMQP connection <0.7441.0> (10.x.x.82:44964 -> 172.17.0.2:5672)
2022-03-02 20:30:25.146542+00:00 [erro] <0.7441.0> closing AMQP connection <0.7441.0> (10.x.x.82:44964 -> 172.17.0.2:5672):
2022-03-02 20:30:25.146542+00:00 [erro] <0.7441.0> {handshake_timeout,handshake}

Same done for TCP 15672, 25672, and 4369... also in both directions.
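
The same reachability check can be scripted instead of repeated by hand with telnet. This is a minimal sketch, not part of the original report: the function name check_ports is made up for illustration, and only the four port numbers come from the thread.

```python
import socket

# Ports probed in the thread: AMQP, management UI, inter-node distribution, epmd.
RABBIT_PORTS = (5672, 15672, 25672, 4369)


def check_ports(host, ports=RABBIT_PORTS, timeout=2.0):
    """Return {port: True/False} for whether a TCP connection to host:port opens."""
    results = {}
    for port in ports:
        try:
            # create_connection resolves the host and completes the TCP handshake.
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            # Covers refused connections, timeouts, and name resolution failures.
            results[port] = False
    return results
```

Like telnet, this only shows that a TCP connection opens. It does not show that Erlang distribution or AMQP handshakes succeed.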

1 (Failing) Attempt to reset rabbit1 and cluster

stop, reset, start

<rabbit1:>/# rabbitmqctl stop_app && rabbitmqctl reset && rabbitmqctl start_app

rabbitmq logs on rabbit1 (clustering failing)

2022-03-02 20:18:39.137566+00:00 [info] <0.9129.0> Running boot step database defined by app rabbit
2022-03-02 20:18:39.138082+00:00 [info] <0.9129.0> Node database directory at /var/lib/rabbitmq/mnesia/rabbit@<rabbit1> is empty. Assuming we need to join an existing cluster or initialise from scratch...
2022-03-02 20:18:39.138155+00:00 [info] <0.9129.0> Configured peer discovery backend: rabbit_peer_discovery_consul
2022-03-02 20:18:39.138532+00:00 [info] <0.9129.0> Will try to lock with peer discovery backend rabbit_peer_discovery_consul
2022-03-02 20:18:39.157275+00:00 [info] <0.9129.0> All discovered existing cluster peers: rabbit@<rabbit0>, rabbit@<rabbit2>
2022-03-02 20:18:39.157350+00:00 [info] <0.9129.0> Peer nodes we can cluster with: rabbit@<rabbit0>, rabbit@<rabbit2>
2022-03-02 20:18:47.158279+00:00 [warn] <0.9129.0> Could not auto-cluster with node rabbit@<rabbit0>: {badrpc,
2022-03-02 20:18:47.158279+00:00 [warn] <0.9129.0>                                                                             nodedown}

/etc/resolv.conf

nameserver 169.254.1.53
options edns0 trust-ad
search <autosearch.domain>

DNSMASQ Logs showing query received
Note that only the hostname with the autocomplete domain is ever attempted, specifically NOT .node.consul

TCPDUMP showing DNS queries attempted

<rabbit1>:~$ sudo tcpdump udp port 53 --interface docker0
20:54:37.227151 IP 172.17.0.2.53032 > <rabbit1>.domain: 46741+ [1au] A? <rabbit2>.<autosearch.domain>. (70)
20:54:42.231462 IP 172.17.0.2.53032 > <rabbit1>.domain: 46741+ [1au] A? <rabbit2>.<autosearch.domain>. (70)
20:54:45.228519 IP 172.17.0.2.34334 > <rabbit1>.domain: 61467+ [1au] A? <rabbit0>.<autosearch.domain>. (70)
20:54:50.231460 IP 172.17.0.2.34334 > <rabbit1>.domain: 61467+ [1au] A? <rabbit0>.<autosearch.domain>. (70)
20:54:53.730430 IP 172.17.0.2.51814 > <rabbit1>.domain: 53238+ [1au] A? <rabbit2>.<autosearch.domain>. (70)
20:54:58.735626 IP 172.17.0.2.51814 > <rabbit1>.domain: 53238+ [1au] A? <rabbit2>.<autosearch.domain>. (70)
20:55:01.731624 IP 172.17.0.2.40696 > <rabbit1>.domain: 18513+ [1au] A? <rabbit0>.<autosearch.domain>. (70)
20:55:06.736776 IP 172.17.0.2.40696 > <rabbit1>.domain: 18513+ [1au] A? <rabbit0>.<autosearch.domain>. (70)
20:55:10.233289 IP 172.17.0.2.37892 > <rabbit1>.domain: 6287+ [1au] A? <rabbit2>.<autosearch.domain>. (70)
20:55:10.744937 IP <rabbit1>.domain > 172.17.0.2.34334: 61467 NXDomain 0/0/1 (70)
20:55:10.745115 IP <rabbit1>.domain > 172.17.0.2.40696: 18513 NXDomain 0/0/1 (70)

rabbit0.node.consul does resolve if used

20:57:52.463076 IP 172.17.0.2.60604 > <rabbit1>.domain: 31859+ A? <rabbit2>.node.consul. (63)
20:57:52.464935 IP <rabbit1>.domain > 172.17.0.2.60604: 31859* 1/0/1 A 10.x.x.101 (115)
20:57:52.465267 IP 172.17.0.2.42540 > <rabbit1>.domain: 59909+ AAAA? <rabbit2>.node.consul. (63)
20:57:52.466754 IP <rabbit1>.domain > 172.17.0.2.42540: 59909* 0/0/1 (99)

2a (Working) add entries to /etc/hosts for rabbit0 and rabbit1 hostnames mutually

rabbit0

<rabbit0:>/# cat /etc/hosts | grep <rabbit1>
10.x.x.82      <rabbit1>

rabbit1

<rabbit1:>/# cat /etc/hosts | grep <rabbit0>
10.x.x.101      <rabbit0>

Attempt to reset and cluster

<rabbit1>:/# rabbitmqctl stop_app && rabbitmqctl reset && rabbitmqctl start_app

rabbitmq logs on rabbit1 (clustering working)

2022-03-02 20:43:56.517199+00:00 [info] <0.10846.0> Running boot step database defined by app rabbit
2022-03-02 20:43:56.517778+00:00 [info] <0.10846.0> Node database directory at /var/lib/rabbitmq/mnesia/rabbit@<rabbit1> is empty. Assuming we need to join an existing cluster or initialise from scratch...
2022-03-02 20:43:56.517865+00:00 [info] <0.10846.0> Configured peer discovery backend: rabbit_peer_discovery_consul
2022-03-02 20:43:56.518024+00:00 [info] <0.10846.0> Will try to lock with peer discovery backend rabbit_peer_discovery_consul
2022-03-02 20:43:56.536229+00:00 [info] <0.10846.0> All discovered existing cluster peers: rabbit@<rabbit0>, rabbit@<rabbit2>
2022-03-02 20:43:56.536303+00:00 [info] <0.10846.0> Peer nodes we can cluster with: rabbit@<rabbit0>, rabbit@<rabbit2>
2022-03-02 20:43:56.546445+00:00 [info] <0.10846.0> Node 'rabbit@<rabbit0>' selected for auto-clustering
2022-03-02 20:43:56.562305+00:00 [info] <0.10846.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2022-03-02 20:43:56.562456+00:00 [info] <0.10846.0> Successfully synced tables from a peer
2022-03-02 20:43:56.795333+00:00 [info] <0.10846.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2022-03-02 20:43:56.795483+00:00 [info] <0.10846.0> Successfully synced tables from a peer
2022-03-02 20:43:56.795533+00:00 [warn] <0.10846.0> Feature flags: the previous instance of this node must have failed to write the `feature_flags` file at `/var/lib/rabbitmq/mnesia/rabbit@<rabbit1>-feature_flags`:
2022-03-02 20:43:56.795585+00:00 [warn] <0.10846.0> Feature flags:   - list of previously disabled feature flags now marked as such: [empty_basic_get_metric]
2022-03-02 20:43:56.799312+00:00 [info] <0.10846.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2022-03-02 20:43:56.799459+00:00 [info] <0.10846.0> Successfully synced tables from a peer
2022-03-02 20:43:56.815939+00:00 [info] <0.10846.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2022-03-02 20:43:56.816170+00:00 [info] <0.10846.0> Successfully synced tables from a peer
2022-03-02 20:43:56.822951+00:00 [info] <0.10846.0> Setting up a table for connection tracking on this node: 'tracked_connection_on_node_rabbit@<rabbit1>'
2022-03-02 20:43:56.829340+00:00 [info] <0.10846.0> Setting up a table for per-vhost connection counting on this node: 'tracked_connection_per_vhost_on_node_rabbit@<rabbit1>'
2022-03-02 20:43:56.842231+00:00 [info] <0.10846.0> Setting up a table for per-user connection counting on this node: 'tracked_connection_table_per_user_on_node_rabbit@<rabbit1>'
2022-03-02 20:43:56.843712+00:00 [info] <0.10846.0> Will register with peer discovery backend rabbit_peer_discovery_consul

2b (Working) Modify search domains to auto complete node.consul

/etc/resolv.conf

nameserver 169.254.1.53
options edns0 trust-ad
search service.consul node.consul consul <autosearch.domain>

Test DNS resolution

<rabbit1>:/# nslookup <rabbit0>
Server:         169.254.1.53
Address:        169.254.1.53#53

Name:   <rabbit0>.node.consul
Address: 10.x.x.82

Attempt to reset and cluster

<rabbit1>:/# rabbitmqctl stop_app && rabbitmqctl reset && rabbitmqctl start_app

TCPDUMP showing DNS queries attempted

<rabbit1>:~$ sudo tcpdump udp port 53 --interface docker0
22:53:20.934533 IP 172.17.0.2.49236 > <rabbit1>.domain: 36644+ [1au] A? <rabbit0>.service.consul. (77)
22:53:20.935991 IP <rabbit1>.domain > 172.17.0.2.49236: 36644 NXDomain* 0/1/1 (127)
22:53:20.936080 IP 172.17.0.2.46483 > <rabbit1>.domain: 43459+ [1au] A? <rabbit0>.node.consul. (74)
22:53:20.937571 IP <rabbit1>.domain > 172.17.0.2.46483: 43459* 1/0/2 A 10.x.x.82 (126)
22:53:20.937908 IP 172.17.0.2.52939 > <rabbit1>.domain: 29495+ [1au] A? <rabbit0>.service.consul. (77)
22:53:20.939311 IP <rabbit1>.domain > 172.17.0.2.52939: 29495 NXDomain* 0/1/1 (127)
22:53:20.939427 IP 172.17.0.2.57120 > <rabbit1>.domain: 8183+ [1au] A? <rabbit0>.node.consul. (74)
22:53:20.940935 IP <rabbit1>.domain > 172.17.0.2.57120: 8183* 1/0/2 A 10.x.x.82 (126)
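
The query order in the capture above matches how a resolv.conf-style stub resolver expands a short name through the search list. The sketch below models that expansion under the usual glibc rules (ndots defaults to 1). The function name search_candidates and the example host name are illustrative assumptions, and a real resolver stops at the first name that returns an answer.

```python
def search_candidates(name, search_domains, ndots=1):
    """List the names a resolv.conf-style stub resolver would try, in order."""
    if name.endswith("."):
        # A trailing dot marks the name as absolute: no search list is applied.
        return [name.rstrip(".")]
    expanded = [f"{name}.{domain}" for domain in search_domains]
    if name.count(".") >= ndots:
        # Enough dots: try the name as given first, then the search list.
        return [name] + expanded
    # Too few dots: walk the search list first, then try the bare name.
    return expanded + [name]
```

For a hypothetical short name rabbit0 and the search list service.consul node.consul consul, the candidates come out as rabbit0.service.consul, rabbit0.node.consul, rabbit0.consul, and finally rabbit0. That matches the capture, where service.consul is queried before node.consul.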

rabbitmq logs on rabbit1 (clustering working)

2022-03-02 22:53:20.915774+00:00 [info] <0.5139.0> Running boot step database defined by app rabbit
2022-03-02 22:53:20.916447+00:00 [info] <0.5139.0> Node database directory at /var/lib/rabbitmq/mnesia/rabbit@<rabbit1> is empty. Assuming we need to join an existing cluster or initialise from scratch...
2022-03-02 22:53:20.916560+00:00 [info] <0.5139.0> Configured peer discovery backend: rabbit_peer_discovery_consul
2022-03-02 22:53:20.916709+00:00 [info] <0.5139.0> Will try to lock with peer discovery backend rabbit_peer_discovery_consul
2022-03-02 22:53:20.933669+00:00 [info] <0.5139.0> All discovered existing cluster peers: rabbit@<rabbit0>, rabbit@<rabbit2>
2022-03-02 22:53:20.933723+00:00 [info] <0.5139.0> Peer nodes we can cluster with: rabbit@<rabbit0>, rabbit@<rabbit2>
2022-03-02 22:53:20.956197+00:00 [info] <0.5139.0> Node 'rabbit@<rabbit0>' selected for auto-clustering
2022-03-02 22:53:20.977140+00:00 [info] <0.5139.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2022-03-02 22:53:20.977343+00:00 [info] <0.5139.0> Successfully synced tables from a peer
2022-03-02 22:53:21.314738+00:00 [info] <0.5139.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2022-03-02 22:53:21.314902+00:00 [info] <0.5139.0> Successfully synced tables from a peer
2022-03-02 22:53:21.314948+00:00 [warn] <0.5139.0> Feature flags: the previous instance of this node must have failed to write the `feature_flags` file at `/var/lib/rabbitmq/mnesia/rabbit@<rabbit1>-feature_flags`:
2022-03-02 22:53:21.314993+00:00 [warn] <0.5139.0> Feature flags:   - list of previously disabled feature flags now marked as such: [empty_basic_get_metric]
2022-03-02 22:53:21.319400+00:00 [info] <0.5139.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2022-03-02 22:53:21.319528+00:00 [info] <0.5139.0> Successfully synced tables from a peer
2022-03-02 22:53:21.335570+00:00 [info] <0.5139.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2022-03-02 22:53:21.335777+00:00 [info] <0.5139.0> Successfully synced tables from a peer
2022-03-02 22:53:21.344668+00:00 [info] <0.5139.0> Setting up a table for connection tracking on this node: 'tracked_connection_on_node_rabbit@<rabbit1>'
2022-03-02 22:53:21.352602+00:00 [info] <0.5139.0> Setting up a table for per-vhost connection counting on this node: 'tracked_connection_per_vhost_on_node_rabbit@<rabbit1>'
2022-03-02 22:53:21.360686+00:00 [info] <0.5139.0> Setting up a table for per-user connection counting on this node: 'tracked_connection_table_per_user_on_node_rabbit@<rabbit1>'
2022-03-02 22:53:21.362206+00:00 [info] <0.5139.0> Will register with peer discovery backend rabbit_peer_discovery_consul

Replication

The following Nomad job is used to build everything from stock docker images. DNS on the docker parent host has DNSMasq configured as a selective resolver with a dummy interface to forward to consul or systemd-resolved based on the domain.

Nomad Job to build Cluster

rabbitmq.cluster.nomad

job "rabbitmq" {
  datacenters = ["us-west-2"]
  type        = "service"

  group "cluster" {
    count = 3

    update {
      max_parallel = 1
    }

    network {
      mode = "host"
      # https://stackoverflow.com/questions/63601913/nomad-and-port-mapping
      port "amqp" { static = 5672 }
      port "ui" { static = 15672 }
      port "epmd" { static = 4369 }
      port "internode" { static = 25672 }
    }

    task "rabbitmq" {
      driver = "docker"

      config {
        image    = "rabbitmq:3.9-management"
        hostname = attr.unique.hostname

        # https://stackoverflow.com/questions/63601913/nomad-and-port-mapping
        ports = ["amqp", "ui", "epmd", "internode"]

        mount {
          type     = "bind"
          source   = "local/rabbitmq.conf"
          target   = "/etc/rabbitmq/rabbitmq.conf"
          readonly = false
        }

        mount {
          type     = "bind"
          source   = "local/enabled_plugins"
          target   = "/etc/rabbitmq/enabled_plugins"
          readonly = false
        }
      }

      env {
        RABBITMQ_ERLANG_COOKIE = "ADUMMYSTRINGFORNOW"
        RABBITMQ_DEFAULT_USER  = "test"
        RABBITMQ_DEFAULT_PASS  = "test"
      }

      service {
        name = "rabbitmq-ui"
        port = "ui"
        tags = ["rabbitmq-ui", "urlprefix-/rabbitmq-ui"]

        check {
          type     = "tcp"
          interval = "10s"
          timeout  = "2s"
        }
      }

      template {
        destination   = "local/enabled_plugins"
        change_mode   = "signal"
        change_signal = "SIGHUP"
        data          = <<-EOF
          [rabbitmq_management,rabbitmq_peer_discovery_consul].
        EOF
      }

      template {
        destination   = "local/rabbitmq.conf"
        change_mode   = "signal"
        change_signal = "SIGHUP"
        data          = <<-EOF
          # https://www.rabbitmq.com/configure.html
          # https://www.rabbitmq.com/clustering.html#node-names
          # https://www.rabbitmq.com/cluster-formation.html#peer-discovery-consul
          # https://github.com/rabbitmq/rabbitmq-server/blob/master/deps/rabbit/docs/rabbitmq.conf.example

          cluster_formation.consul.include_nodes_with_warnings = true

          cluster_formation.peer_discovery_backend        = consul
          cluster_formation.consul.host                   = {{ env "attr.unique.network.ip-address" }}
          cluster_formation.consul.svc                    = rabbitmq
          cluster_formation.consul.svc_addr_auto          = true
          cluster_formation.consul.svc_addr_use_nodename  = true
          cluster_formation.consul.use_longname           = true
          cluster_formation.consul.scheme                 = http
          cluster_formation.consul.domain_suffix          = consul
          cluster_partition_handling                      = autoheal
          cluster_formation.node_cleanup.only_log_warning = true
        EOF
      }
    }
  }
}

DNS Masq configuration

/etc/dnsmasq.d/default

port=53
server=127.0.0.53
bind-interfaces

/etc/dnsmasq.d/consul

server=/consul/169.254.1.53#8600
listen-address=169.254.1.53
interface=consul0
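
The effect of these two dnsmasq files can be modelled as a lookup from query name to upstream server. This is only a sketch of the selective forwarding idea, not dnsmasq's implementation. The function name pick_upstream is hypothetical, and the addresses are copied from the config above.

```python
# Mirrors the two dnsmasq files above: queries under "consul" go to the Consul
# DNS port, everything else goes to the default upstream (systemd-resolved).
DOMAIN_UPSTREAMS = {"consul": "169.254.1.53#8600"}
DEFAULT_UPSTREAM = "127.0.0.53"


def pick_upstream(qname, domain_upstreams=DOMAIN_UPSTREAMS, default=DEFAULT_UPSTREAM):
    """Pick the upstream for a query name, preferring the longest matching domain."""
    labels = qname.rstrip(".").lower().split(".")
    # Try the longest suffixes first: a.b.consul, then b.consul, then consul.
    for start in range(len(labels)):
        suffix = ".".join(labels[start:])
        if suffix in domain_upstreams:
            return domain_upstreams[suffix]
    return default
```

Under this model a name such as <rabbit0>.node.consul is forwarded to Consul, while <rabbit0>.<autosearch.domain> goes to the default upstream, where it ends in NXDomain as seen in the first capture.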

provisioning script used to install/configure DNS Masq on Ubuntu 20.04

# Make Dummy Int configs
sudo sh -c "cat <<EOF >> /etc/systemd/network/consul0.netdev
[NetDev]
Name=consul0
Kind=dummy
EOF"

sudo sh -c "cat <<EOF >> /etc/systemd/network/consul0.network
[Match]
Name=consul0

[Network]
Address=169.254.1.53
EOF"

# Restart to pick up new int
sudo systemctl restart systemd-networkd && sleep 1

# Install configure dnsmasq
sudo apt-get -qq -y install dnsmasq
sudo sed -i "s/nameserver 127.0.0.53/nameserver 169.254.1.53/" /etc/resolv.conf

Workaround

While I'm not a fan of having to modify the search suffixes to include node.consul, it is a scalable and simple workaround for the time being.

Answered by lukebakken

Mar 3, 2022

lukebakken · 2022-03-03T15:39:28Z

lukebakken
Mar 3, 2022
Maintainer

Thanks for the comprehensive set of information. However, I don't see where you have set RABBITMQ_USE_LONGNAME=true in the environment, or USE_LONGNAME=true in rabbitmq-env.conf -

https://www.rabbitmq.com/clustering.html#node-names

I am converting this to a discussion as there is as of yet no evidence of a bug. We can convert back to an issue if necessary. GitHub will close and lock this issue but will provide a link to the discussion.

8 replies

lukebakken Mar 3, 2022
Maintainer

> What is the reason that one parameter needs to be defined twice in two different locations?

That's just how it is. USE_LONGNAME tells the Erlang VM to use FQDN names when doing host resolution for distributed Erlang - https://www.erlang.org/doc/reference_manual/distributed.html

The cluster_formation.consul.use_longname setting only configures the Consul peer discovery plugin. I suppose the latter could read the global setting. I'll create an issue for that.
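
Because the two settings live in different files, it is easy to enable only one of them. The following is a rough consistency check, assuming both files use simple key = value lines. The function names are made up for this sketch, and it does not look at the RABBITMQ_USE_LONGNAME environment variable.

```python
def parse_key_values(text):
    """Parse simple 'key = value' lines, skipping blanks and '#' comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip().strip('"')
    return settings


def longname_mismatch(env_conf_text, rabbitmq_conf_text):
    """Return True when only one of the two long-name settings is enabled."""
    env = parse_key_values(env_conf_text)    # contents of rabbitmq-env.conf
    conf = parse_key_values(rabbitmq_conf_text)  # contents of rabbitmq.conf
    vm_longname = env.get("USE_LONGNAME", "false").lower() == "true"
    plugin_longname = (
        conf.get("cluster_formation.consul.use_longname", "false").lower() == "true"
    )
    return vm_longname != plugin_longname
```

With the configuration from the original report (use_longname set only in rabbitmq.conf), this check would flag a mismatch.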

lukebakken Mar 3, 2022
Maintainer

> Are there other params that need to be defined in both locations?

Probably not, but we're here to assist if you run into something else.

namachieli Mar 3, 2022
Author

Cool, well at least I have the answer I need to progress :) Thank you for the assistance!

lukebakken Mar 3, 2022
Maintainer

rabbitmq/rabbitmq-website#1366

#4230

Thanks again for providing such a complete set of information for us to help you.

namachieli Mar 3, 2022
Author

Thank you for being a responsive and active maintainer! :)

namachieli · 2022-03-03T19:03:19Z

namachieli
Mar 3, 2022
Author

So I've added the discussed file

:/# cat /etc/rabbitmq/rabbitmq-env.conf 
USE_LONGNAME=true

And after resetting, this STILL is not working. Looking at the attempted resolution, the parameters configured for Consul in /etc/rabbitmq/rabbitmq.conf are still not used, and the lookup uses '.localdomain' rather than what is stated in https://www.rabbitmq.com/cluster-formation.html#peer-discovery-consul

Node Name Suffixes
If node name is computed and long node names are used, it is possible to append a 
suffix to node names retrieved from Consul. The format is .node.{domain_suffix}. 
This can be useful in environments with DNS conventions, e.g. when all service 
nodes are organized in a separate subdomain.

The line cluster_formation.consul.domain_suffix = consul should be instructing this plugin to append .node.{domain_suffix}, which is configured as consul (i.e. hostname.node.consul).
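
To make the documented format concrete, here is a small model of the rule quoted above. It is an illustration of the documentation text only, not the plugin's actual code. The function name suffixed_host is hypothetical, and leaving short names unchanged is an assumption drawn from the "long node names are used" wording.

```python
def suffixed_host(host, use_longname, domain_suffix=None):
    """Model of the documented rule for host names retrieved from Consul.

    With long names and a suffix configured, append .node.{domain_suffix};
    otherwise return the host unchanged (assumption for the short-name case).
    """
    if use_longname and domain_suffix:
        return f"{host}.node.{domain_suffix}"
    return host
```

Under this reading, a peer retrieved from Consul as rabbit0 (hypothetical name) with domain_suffix = consul would become rabbit0.node.consul. The documentation only describes names retrieved from Consul, so this says nothing about how the local node names itself.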

The capture below shows that this is never attempted. Instead, .localdomain is appended, followed by the search domain from /etc/resolv.conf.

:~$ sudo tcpdump udp port 53 --interface docker0
18:40:35.270542 IP 172.17.0.2.55689 > <>.domain: 7733+ [1au] A? <>.localdomain. (65)
18:40:36.187860 IP 172.17.0.2.55756 > <>.domain: 57314+ [1au] A?<>.<searchdomain>. (61)
18:40:36.907767 IP <>.domain > 172.17.0.2.56537: 10459 NXDomain 0/0/1 (61)
18:40:36.907921 IP <>.domain > 172.17.0.2.36319: 47695 NXDomain 0/0/1 (61)

I also tried setting the env variable directly with export RABBITMQ_USE_LONGNAME=true just to be sure, same result (as expected).

I am likely missing some config element somewhere or I misunderstand how this works, but it is not apparent to me from scouring documentation.

What DOES work, however, is simply adding domain node.consul to /etc/resolv.conf just after search <domain>, and ALSO not including USE_LONGNAME=true in /etc/rabbitmq/rabbitmq-env.conf.

Hopefully this helps someone.

11 replies

lukebakken Mar 4, 2022
Maintainer

OK thanks. That is quite the complicated setup! I'll let you know how it goes. I think I'm going to start with my "simple local" setup to see if I can reproduce it with a minimum number of moving parts.

namachieli Mar 4, 2022
Author

Sounds good. Yeah, it is complicated for a single RabbitMQ cluster... but not for what it is actually doing when it's not stripped down :)

lukebakken Mar 8, 2022
Maintainer

I've been working on the .NET client but this issue is the next in my queue. Thanks for being patient.

namachieli Mar 8, 2022
Author

Thank you for being excellent, no rush.

lukebakken Mar 10, 2022
Maintainer

Keep an eye here - https://github.com/lukebakken/rabbitmq-server-4229

I'm in the process of getting an env set up locally. I'll script everything out.

lukebakken · 2022-03-14T19:38:28Z

lukebakken
Mar 14, 2022
Maintainer

@namachieli here is a relatively simple example using Docker Compose, Consul and long names:

https://github.com/lukebakken/docker-rabbitmq-cluster/blob/master/docker-compose.yml

Next step is to see about getting domain_suffix in there.

0 replies

lukebakken · 2022-03-14T19:57:48Z

lukebakken
Mar 14, 2022
Maintainer

@namachieli actually I think I have a theory as to what is going on. When you build your RabbitMQ cluster, you are not explicitly setting the node names, which means that RabbitMQ computes them from what the system-level network functions return.

The following shows what happens, in essence, when you set USE_LONGNAME to true (the equivalent of the -name argument below) but the OS does not return a FQDN to the Erlang VM. The following is from my local workstation:

$ ping shostakovich.localdomain
PING shostakovich.localdomain (192.168.1.5) 56(84) bytes of data.
64 bytes from shostakovich.localdomain (192.168.1.5): icmp_seq=1 ttl=64 time=0.018 ms

$ erl -name [email protected]
Erlang/OTP 24 [erts-12.3] [source] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:1] [jit]

Eshell V12.3  (abort with ^G)
([email protected])1> net_adm:localhost().
"shostakovich"

Note that net_adm:localhost() returns the short host name, which is what my system is set to:

$ hostname
shostakovich

If you do NOT explicitly specify the RabbitMQ node name (via NODENAME=... in rabbitmq-env.conf or the RABBITMQ_NODENAME=... env var) it will default to whatever the system returns. My guess is in your environment they are the short names. Note that the Consul plugin does NOT set your node names! It only provides a means for nodes to discover each other.
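
The defaulting described above can be sketched in a few lines. This is a simplified model rather than RabbitMQ's startup logic: the function names are made up, and the rabbit@ prefix is taken from the node names in the logs earlier in the thread.

```python
import socket


def effective_node_name(explicit_nodename=None, system_hostname=None, prefix="rabbit"):
    """Return the explicit node name if given, else prefix@<system hostname>."""
    if explicit_nodename:
        return explicit_nodename
    # Fall back to whatever the operating system reports as the host name.
    host = system_hostname if system_hostname is not None else socket.gethostname()
    return f"{prefix}@{host}"


def looks_fully_qualified(node_name):
    """True when the host part after '@' contains a dot."""
    host = node_name.split("@", 1)[-1]
    return "." in host
```

With the workstation example above, the default would be rabbit@shostakovich, which is not fully qualified, so long names would need an explicit NODENAME.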

In the case of my example project, the docker containers running RabbitMQ have "short" host names set up so I must explicitly use NODENAME=rabbit@rabbitmq... in rabbitmq-env.conf to set the correct long name. These long names work because docker's internal DNS system resolves them.

In your case, the short names work because you append the DNS suffix for registering the services in Consul as well as in your DNS system.

I'm going to open a PR against your repo with some changes. I can't guarantee they'll work because I'm not 100% sure how everything fits together.

5 replies

namachieli Mar 15, 2022
Author

Ok, I'm pretty sure I follow. I could see how being explicit here is preferable over allowing discovery processes to guess.

namachieli Mar 15, 2022
Author

Oh, and I will take your fixes provided and try them in my environment tomorrow (and disable the dns workaround) and report back. I forgot that etiquette dictates I should have mentioned I would do that 😄

ip-sf Mar 15, 2022

I've added your changes to my environment (I used the job exactly as it rests in your PR), and everything works as you expected. The cluster forms quickly with no issue!
I also made sure to disable domain node.consul before testing, and then rebuilt the whole environment fresh.

I'm curious about why the following config line doesn't work as I would have expected, per the documentation?
cluster_formation.consul.domain_suffix = consul (disabled for verification per your changes)

Is it likely a bug with the Consul discovery plugin? Or is it because the setting only appends to the node name, which doesn't form correctly when left to auto-discovery?

If the latter, I would assume I could set the node name without the FQDN part (node.consul) and then re-enable domain_suffix, and it may work that way?

lukebakken Mar 15, 2022
Maintainer

Now that I have a better understanding of how all these pieces fit together...

cluster_formation.consul.domain_suffix = consul

Here's the purpose of the above setting. If your DNS is configured such that .consul is the "TLD" for your domain, and your nodes are named such that their names do NOT end in .consul, and you're using long node names, the setting is necessary.

Or, if you're using short node names but for some weird reason they won't resolve in DNS without a suffix. I.e. node rabbit@rabbit1 tries to look up rabbit@rabbit2 but rabbit2 is really rabbit2.consul in DNS. Hm even then it won't work because you'd have to add .consul in your DNS search domains.

I think it makes much more sense for your nodes to use the full FQDN name that will resolve in your DNS system and skip the domain_suffix setting altogether.
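
One way to sanity-check that advice before starting a cluster is to confirm that the host part of each intended node name resolves. The sketch below assumes node names of the form name@host. The function name unresolvable_hosts is hypothetical, and the resolver is injectable so the check can be tried without real DNS.

```python
import socket


def unresolvable_hosts(node_names, resolve=socket.gethostbyname):
    """Return the host parts of node names that the given resolver cannot resolve."""
    failed = []
    for node in node_names:
        host = node.split("@", 1)[-1]
        try:
            resolve(host)
        except OSError:
            # socket.gaierror (resolution failure) is a subclass of OSError.
            failed.append(host)
    return failed
```

An empty result means every host part resolved through the system resolver, including any search domains it applies.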

This has been a very useful exercise for me because I've never delved deep into the cluster peer discovery plugins. I forgot that these plugins do not name your nodes but only provide a way to look up other cluster members.

Your nodes can only be named from the hostname on which they run or via the RABBITMQ_NODENAME variable. Full stop!

Thanks and have a great rest of your week.

ip-sf Mar 15, 2022

Fantastic insight, thank you. This was very helpful for me as well. I appreciate the discourse and the explanations!

I think we can call this ordeal resolved.

lukebakken · 2022-03-14T20:13:08Z

lukebakken
Mar 14, 2022
Maintainer

namachieli/hashistack-rabbitmq-discussion-4229#1

0 replies

ip-sf · 2022-03-15T20:26:31Z

ip-sf
Mar 15, 2022

For the sake of parity, and for when the linked repo is inevitably destroyed, hopefully this helps the next person...

This is the Nomad job spec that successfully loads and clusters using Consul.

Nomad Job Spec for RabbitMQ Cluster with Consul

Nomad 1.2.6
Consul 1.11.4
RabbitMQ 3.9.13
Erlang 24.3
RabbitMQ Docker rabbitmq:3.9-management
Nomad Host: Ubuntu 20.04 based from Official AMI ami-0892d3c7ee96c0bf7

job "rabbitmq" {
  datacenters = ["us-west-2"]
  type        = "service"

  group "cluster" {
    count = 3

    update {
      max_parallel = 1
    }

    network {
      mode = "host"
      # https://stackoverflow.com/questions/63601913/nomad-and-port-mapping
      port "amqp" { static = 5672 }
      port "ui" { static = 15672 }
      port "epmd" { static = 4369 }
      port "internode" { static = 25672 }
    }

    task "rabbitmq" {
      driver = "docker"

      config {
        image    = "rabbitmq:3.9-management"
        hostname = attr.unique.hostname

        ports = ["amqp", "ui", "epmd", "internode"]

        mount {
          type     = "bind"
          source   = "local/rabbitmq-env.conf"
          target   = "/etc/rabbitmq/rabbitmq-env.conf"
          readonly = true
        }

        mount {
          type     = "bind"
          source   = "local/rabbitmq.conf"
          target   = "/etc/rabbitmq/rabbitmq.conf"
          readonly = true
        }

        mount {
          type     = "bind"
          source   = "local/enabled_plugins"
          target   = "/etc/rabbitmq/enabled_plugins"
          readonly = false
        }
      }

      env {
        RABBITMQ_ERLANG_COOKIE = "ADUMMYSTRINGFORNOW"
        RABBITMQ_DEFAULT_USER  = "test"
        RABBITMQ_DEFAULT_PASS  = "test"
        RABBITMQ_USE_LONGNAME  = true # https://github.com/rabbitmq/rabbitmq-server/discussions/4229
      }

      service {
        name = "rabbitmq-ui"
        port = "ui"
        tags = ["rabbitmq-ui", "urlprefix-/rabbitmq-ui"]

        check {
          type     = "tcp"
          interval = "10s"
          timeout  = "2s"
        }
      }

      template {
        destination   = "local/enabled_plugins"
        change_mode   = "signal"
        change_signal = "SIGHUP"
        data          = <<-EOF
          [rabbitmq_management,rabbitmq_peer_discovery_consul].
        EOF
      }

      template {
        destination   = "local/rabbitmq-env.conf"
        change_mode   = "signal"
        change_signal = "SIGHUP"
        data          = <<-EOF
          USE_LONGNAME=true
          NODENAME="rabbit@$(hostname).node.consul"
        EOF
      }

      template {
        destination   = "local/rabbitmq.conf"
        change_mode   = "signal"
        change_signal = "SIGHUP"
        data          = <<-EOF
          # https://www.rabbitmq.com/configure.html
          # https://www.rabbitmq.com/clustering.html#node-names
          # https://www.rabbitmq.com/cluster-formation.html#peer-discovery-consul
          # https://github.com/rabbitmq/rabbitmq-server/blob/master/deps/rabbit/docs/rabbitmq.conf.example
          cluster_partition_handling                           = autoheal
          cluster_formation.consul.include_nodes_with_warnings = true
          cluster_formation.peer_discovery_backend             = consul
          cluster_formation.consul.host                        = {{ env "attr.unique.network.ip-address" }}
          cluster_formation.consul.svc                         = rabbitmq
          cluster_formation.consul.svc_addr_auto               = true
          cluster_formation.consul.svc_addr_use_nodename       = true
          cluster_formation.consul.use_longname                = true
          cluster_formation.consul.scheme                      = http
          cluster_formation.node_cleanup.only_log_warning      = true
        EOF
      }
    }
  }
}

0 replies
