cluster_formation.consul.domain_suffix being ignored with use_longname #4229
Summary

Unable to cluster dynamically using Consul service discovery with hostname lookup and "use_longname". I have troubleshot this down to a name resolution issue (details below).

NOTE: There may be a discrepancy between rabbit# and a specific IP. I had to destroy and rebuild between 2a and 2b, so any inconsistency is likely from that. My troubleshooting has already progressed past IP addresses, reachability, etc.

Configuration

rabbitmq.conf

cluster_formation.consul.include_nodes_with_warnings = true
cluster_formation.peer_discovery_backend = consul
cluster_formation.consul.host = <ip address>
cluster_formation.consul.svc = rabbitmq
cluster_formation.consul.svc_addr_auto = true
cluster_formation.consul.svc_addr_use_nodename = true
cluster_formation.consul.use_longname = true
cluster_formation.consul.scheme = http
cluster_formation.consul.domain_suffix = consul
cluster_partition_handling = autoheal
cluster_formation.node_cleanup.only_log_warning = true

enabled_plugins

[rabbitmq_management,rabbitmq_peer_discovery_consul].

Troubleshooting

Telnet to 5672 to confirm the rabbit containers can open connections, confirmed in logs.

Telnet from rabbit1 to rabbit0

<rabbit1:>/# telnet <hostname rabbit0>.node.consul 5672
Trying 10.x.x.101...
Connected to <hostname rabbit0>.node.consul.
Escape character is '^]'.
12345
Connection closed by foreign host.

rabbitmq logs on rabbit0

2022-03-02 20:30:20.638548+00:00 [info] <0.7441.0> accepting AMQP connection <0.7441.0> (10.x.x.82:44964 -> 172.17.0.2:5672)
2022-03-02 20:30:25.146542+00:00 [erro] <0.7441.0> closing AMQP connection <0.7441.0> (10.x.x.82:44964 -> 172.17.0.2:5672):
2022-03-02 20:30:25.146542+00:00 [erro] <0.7441.0> {handshake_timeout,handshake}

The same was done for TCP 15672, 25672, and 4369, in both directions.

1 (Failing) Attempt to reset rabbit1 and cluster

stop, reset, start

<rabbit1:>/# rabbitmqctl stop_app && rabbitmqctl reset && rabbitmqctl start_app

rabbitmq logs on rabbit1 (clustering failing)

2022-03-02 20:18:39.137566+00:00 [info] <0.9129.0> Running boot step database defined by app rabbit
2022-03-02 20:18:39.138082+00:00 [info] <0.9129.0> Node database directory at /var/lib/rabbitmq/mnesia/rabbit@<rabbit1> is empty. Assuming we need to join an existing cluster or initialise from scratch...
2022-03-02 20:18:39.138155+00:00 [info] <0.9129.0> Configured peer discovery backend: rabbit_peer_discovery_consul
2022-03-02 20:18:39.138532+00:00 [info] <0.9129.0> Will try to lock with peer discovery backend rabbit_peer_discovery_consul
2022-03-02 20:18:39.157275+00:00 [info] <0.9129.0> All discovered existing cluster peers: rabbit@<rabbit0>, rabbit@<rabbit2>
2022-03-02 20:18:39.157350+00:00 [info] <0.9129.0> Peer nodes we can cluster with: rabbit@<rabbit0>, rabbit@<rabbit2>
2022-03-02 20:18:47.158279+00:00 [warn] <0.9129.0> Could not auto-cluster with node rabbit@<rabbit0>: {badrpc,
2022-03-02 20:18:47.158279+00:00 [warn] <0.9129.0> nodedown}

/etc/resolv.conf

nameserver 169.254.1.53
options edns0 trust-ad
search <autosearch.domain>

DNSMASQ logs showing query received

TCPDUMP showing DNS queries attempted

<rabbit1>:~$ sudo tcpdump udp port 53 --interface docker0
20:54:37.227151 IP 172.17.0.2.53032 > <rabbit1>.domain: 46741+ [1au] A? <rabbit2>.<autosearch.domain>. (70)
20:54:42.231462 IP 172.17.0.2.53032 > <rabbit1>.domain: 46741+ [1au] A? <rabbit2>.<autosearch.domain>. (70)
20:54:45.228519 IP 172.17.0.2.34334 > <rabbit1>.domain: 61467+ [1au] A? <rabbit0>.<autosearch.domain>. (70)
20:54:50.231460 IP 172.17.0.2.34334 > <rabbit1>.domain: 61467+ [1au] A? <rabbit0>.<autosearch.domain>. (70)
20:54:53.730430 IP 172.17.0.2.51814 > <rabbit1>.domain: 53238+ [1au] A? <rabbit2>.<autosearch.domain>. (70)
20:54:58.735626 IP 172.17.0.2.51814 > <rabbit1>.domain: 53238+ [1au] A? <rabbit2>.<autosearch.domain>. (70)
20:55:01.731624 IP 172.17.0.2.40696 > <rabbit1>.domain: 18513+ [1au] A? <rabbit0>.<autosearch.domain>. (70)
20:55:06.736776 IP 172.17.0.2.40696 > <rabbit1>.domain: 18513+ [1au] A? <rabbit0>.<autosearch.domain>. (70)
20:55:10.233289 IP 172.17.0.2.37892 > <rabbit1>.domain: 6287+ [1au] A? <rabbit2>.<autosearch.domain>. (70)
20:55:10.744937 IP <rabbit1>.domain > 172.17.0.2.34334: 61467 NXDomain 0/0/1 (70)
20:55:10.745115 IP <rabbit1>.domain > 172.17.0.2.40696: 18513 NXDomain 0/0/1 (70)

rabbit0.node.consul does resolve if used

20:57:52.463076 IP 172.17.0.2.60604 > <rabbit1>.domain: 31859+ A? <rabbit2>.node.consul. (63)
20:57:52.464935 IP <rabbit1>.domain > 172.17.0.2.60604: 31859* 1/0/1 A 10.x.x.101 (115)
20:57:52.465267 IP 172.17.0.2.42540 > <rabbit1>.domain: 59909+ AAAA? <rabbit2>.node.consul. (63)
20:57:52.466754 IP <rabbit1>.domain > 172.17.0.2.42540: 59909* 0/0/1 (99)

2a (Working) Add entries to /etc/hosts for rabbit0 and rabbit1 hostnames mutually

rabbit0

<rabbit0:>/# cat /etc/hosts | grep <rabbit1>
10.x.x.82 <rabbit1>

rabbit1

<rabbit1:>/# cat /etc/hosts | grep <rabbit0>
10.x.x.101 <rabbit0>

Attempt to reset and cluster

<rabbit1>:/# rabbitmqctl stop_app && rabbitmqctl reset && rabbitmqctl start_app

rabbitmq logs on rabbit1 (clustering working)

2022-03-02 20:43:56.517199+00:00 [info] <0.10846.0> Running boot step database defined by app rabbit
2022-03-02 20:43:56.517778+00:00 [info] <0.10846.0> Node database directory at /var/lib/rabbitmq/mnesia/rabbit@<rabbit1> is empty. Assuming we need to join an existing cluster or initialise from scratch...
2022-03-02 20:43:56.517865+00:00 [info] <0.10846.0> Configured peer discovery backend: rabbit_peer_discovery_consul
2022-03-02 20:43:56.518024+00:00 [info] <0.10846.0> Will try to lock with peer discovery backend rabbit_peer_discovery_consul
2022-03-02 20:43:56.536229+00:00 [info] <0.10846.0> All discovered existing cluster peers: rabbit@<rabbit0>, rabbit@<rabbit2>
2022-03-02 20:43:56.536303+00:00 [info] <0.10846.0> Peer nodes we can cluster with: rabbit@<rabbit0>, rabbit@<rabbit2>
2022-03-02 20:43:56.546445+00:00 [info] <0.10846.0> Node 'rabbit@<rabbit0>' selected for auto-clustering
2022-03-02 20:43:56.562305+00:00 [info] <0.10846.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2022-03-02 20:43:56.562456+00:00 [info] <0.10846.0> Successfully synced tables from a peer
2022-03-02 20:43:56.795333+00:00 [info] <0.10846.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2022-03-02 20:43:56.795483+00:00 [info] <0.10846.0> Successfully synced tables from a peer
2022-03-02 20:43:56.795533+00:00 [warn] <0.10846.0> Feature flags: the previous instance of this node must have failed to write the `feature_flags` file at `/var/lib/rabbitmq/mnesia/rabbit@<rabbit1>-feature_flags`:
2022-03-02 20:43:56.795585+00:00 [warn] <0.10846.0> Feature flags: - list of previously disabled feature flags now marked as such: [empty_basic_get_metric]
2022-03-02 20:43:56.799312+00:00 [info] <0.10846.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2022-03-02 20:43:56.799459+00:00 [info] <0.10846.0> Successfully synced tables from a peer
2022-03-02 20:43:56.815939+00:00 [info] <0.10846.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2022-03-02 20:43:56.816170+00:00 [info] <0.10846.0> Successfully synced tables from a peer
2022-03-02 20:43:56.822951+00:00 [info] <0.10846.0> Setting up a table for connection tracking on this node: 'tracked_connection_on_node_rabbit@<rabbit1>'
2022-03-02 20:43:56.829340+00:00 [info] <0.10846.0> Setting up a table for per-vhost connection counting on this node: 'tracked_connection_per_vhost_on_node_rabbit@<rabbit1>'
2022-03-02 20:43:56.842231+00:00 [info] <0.10846.0> Setting up a table for per-user connection counting on this node: 'tracked_connection_table_per_user_on_node_rabbit@<rabbit1>'
2022-03-02 20:43:56.843712+00:00 [info] <0.10846.0> Will register with peer discovery backend rabbit_peer_discovery_consul

2b (Working) Modify search domains to auto complete node.consul

/etc/resolv.conf

nameserver 169.254.1.53
options edns0 trust-ad
search service.consul node.consul consul <autosearch.domain>

Test DNS resolution

<rabbit1>:/# nslookup <rabbit0>
Server: 169.254.1.53
Address: 169.254.1.53#53
Name: <rabbit0>.node.consul
Address: 10.x.x.82

Attempt to reset and cluster

<rabbit1>:/# rabbitmqctl stop_app && rabbitmqctl reset && rabbitmqctl start_app

TCPDUMP showing DNS queries attempted

<rabbit1>:~$ sudo tcpdump udp port 53 --interface docker0
22:53:20.934533 IP 172.17.0.2.49236 > <rabbit1>.domain: 36644+ [1au] A? <rabbit0>.service.consul. (77)
22:53:20.935991 IP <rabbit1>.domain > 172.17.0.2.49236: 36644 NXDomain* 0/1/1 (127)
22:53:20.936080 IP 172.17.0.2.46483 > <rabbit1>.domain: 43459+ [1au] A? <rabbit0>.node.consul. (74)
22:53:20.937571 IP <rabbit1>.domain > 172.17.0.2.46483: 43459* 1/0/2 A 10.x.x.82 (126)
22:53:20.937908 IP 172.17.0.2.52939 > <rabbit1>.domain: 29495+ [1au] A? <rabbit0>.service.consul. (77)
22:53:20.939311 IP <rabbit1>.domain > 172.17.0.2.52939: 29495 NXDomain* 0/1/1 (127)
22:53:20.939427 IP 172.17.0.2.57120 > <rabbit1>.domain: 8183+ [1au] A? <rabbit0>.node.consul. (74)
22:53:20.940935 IP <rabbit1>.domain > 172.17.0.2.57120: 8183* 1/0/2 A 10.x.x.82 (126)

rabbitmq logs on rabbit1 (clustering working)

2022-03-02 22:53:20.915774+00:00 [info] <0.5139.0> Running boot step database defined by app rabbit
2022-03-02 22:53:20.916447+00:00 [info] <0.5139.0> Node database directory at /var/lib/rabbitmq/mnesia/rabbit@<rabbit1> is empty. Assuming we need to join an existing cluster or initialise from scratch...
2022-03-02 22:53:20.916560+00:00 [info] <0.5139.0> Configured peer discovery backend: rabbit_peer_discovery_consul
2022-03-02 22:53:20.916709+00:00 [info] <0.5139.0> Will try to lock with peer discovery backend rabbit_peer_discovery_consul
2022-03-02 22:53:20.933669+00:00 [info] <0.5139.0> All discovered existing cluster peers: rabbit@<rabbit0>, rabbit@<rabbit2>
2022-03-02 22:53:20.933723+00:00 [info] <0.5139.0> Peer nodes we can cluster with: rabbit@<rabbit0>, rabbit@<rabbit2>
2022-03-02 22:53:20.956197+00:00 [info] <0.5139.0> Node 'rabbit@<rabbit0>' selected for auto-clustering
2022-03-02 22:53:20.977140+00:00 [info] <0.5139.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2022-03-02 22:53:20.977343+00:00 [info] <0.5139.0> Successfully synced tables from a peer
2022-03-02 22:53:21.314738+00:00 [info] <0.5139.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2022-03-02 22:53:21.314902+00:00 [info] <0.5139.0> Successfully synced tables from a peer
2022-03-02 22:53:21.314948+00:00 [warn] <0.5139.0> Feature flags: the previous instance of this node must have failed to write the `feature_flags` file at `/var/lib/rabbitmq/mnesia/rabbit@<rabbit1>-feature_flags`:
2022-03-02 22:53:21.314993+00:00 [warn] <0.5139.0> Feature flags: - list of previously disabled feature flags now marked as such: [empty_basic_get_metric]
2022-03-02 22:53:21.319400+00:00 [info] <0.5139.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2022-03-02 22:53:21.319528+00:00 [info] <0.5139.0> Successfully synced tables from a peer
2022-03-02 22:53:21.335570+00:00 [info] <0.5139.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2022-03-02 22:53:21.335777+00:00 [info] <0.5139.0> Successfully synced tables from a peer
2022-03-02 22:53:21.344668+00:00 [info] <0.5139.0> Setting up a table for connection tracking on this node: 'tracked_connection_on_node_rabbit@<rabbit1>'
2022-03-02 22:53:21.352602+00:00 [info] <0.5139.0> Setting up a table for per-vhost connection counting on this node: 'tracked_connection_per_vhost_on_node_rabbit@<rabbit1>'
2022-03-02 22:53:21.360686+00:00 [info] <0.5139.0> Setting up a table for per-user connection counting on this node: 'tracked_connection_table_per_user_on_node_rabbit@<rabbit1>'
2022-03-02 22:53:21.362206+00:00 [info] <0.5139.0> Will register with peer discovery backend rabbit_peer_discovery_consul

Replication

The following Nomad job is used to build everything from stock Docker images. DNS on the Docker parent host has DNSMasq configured as a selective resolver with a dummy interface to forward to Consul or

Nomad Job to build Cluster

rabbitmq.cluster.nomad

job "rabbitmq" {
  datacenters = ["us-west-2"]
  type = "service"
  group "cluster" {
    count = 3
    update {
      max_parallel = 1
    }
    network {
      mode = "host"
      # https://stackoverflow.com/questions/63601913/nomad-and-port-mapping
      port "amqp" { static = 5672 }
      port "ui" { static = 15672 }
      port "epmd" { static = 4369 }
      port "internode" { static = 25672 }
    }
    task "rabbitmq" {
      driver = "docker"
      config {
        image = "rabbitmq:3.9-management"
        hostname = attr.unique.hostname
        # https://stackoverflow.com/questions/63601913/nomad-and-port-mapping
        ports = ["amqp", "ui", "epmd", "internode"]
        mount {
          type = "bind"
          source = "local/rabbitmq.conf"
          target = "/etc/rabbitmq/rabbitmq.conf"
          readonly = false
        }
        mount {
          type = "bind"
          source = "local/enabled_plugins"
          target = "/etc/rabbitmq/enabled_plugins"
          readonly = false
        }
      }
      env {
        RABBITMQ_ERLANG_COOKIE = "ADUMMYSTRINGFORNOW"
        RABBITMQ_DEFAULT_USER = "test"
        RABBITMQ_DEFAULT_PASS = "test"
      }
      service {
        name = "rabbitmq-ui"
        port = "ui"
        tags = ["rabbitmq-ui", "urlprefix-/rabbitmq-ui"]
        check {
          type = "tcp"
          interval = "10s"
          timeout = "2s"
        }
      }
      template {
        destination = "local/enabled_plugins"
        change_mode = "signal"
        change_signal = "SIGHUP"
        data = <<-EOF
          [rabbitmq_management,rabbitmq_peer_discovery_consul].
        EOF
      }
      template {
        destination = "local/rabbitmq.conf"
        change_mode = "signal"
        change_signal = "SIGHUP"
        data = <<-EOF
          # https://www.rabbitmq.com/configure.html
          # https://www.rabbitmq.com/clustering.html#node-names
          # https://www.rabbitmq.com/cluster-formation.html#peer-discovery-consul
          # https://github.com/rabbitmq/rabbitmq-server/blob/master/deps/rabbit/docs/rabbitmq.conf.example
          cluster_formation.consul.include_nodes_with_warnings = true
          cluster_formation.peer_discovery_backend = consul
          cluster_formation.consul.host = {{ env "attr.unique.network.ip-address" }}
          cluster_formation.consul.svc = rabbitmq
          cluster_formation.consul.svc_addr_auto = true
          cluster_formation.consul.svc_addr_use_nodename = true
          cluster_formation.consul.use_longname = true
          cluster_formation.consul.scheme = http
          cluster_formation.consul.domain_suffix = consul
          cluster_partition_handling = autoheal
          cluster_formation.node_cleanup.only_log_warning = true
        EOF
      }
    }
  }
}

DNS Masq configuration

/etc/dnsmasq.d/default

port=53
server=127.0.0.53
bind-interfaces

/etc/dnsmasq.d/consul

server=/consul/169.254.1.53#8600
listen-address=169.254.1.53
interface=consul0

Provisioning script used to install/configure DNS Masq on Ubuntu 20.04

# Make Dummy Int configs
sudo sh -c "cat <<EOF >> /etc/systemd/network/consul0.netdev
[NetDev]
Name=consul0
Kind=dummy
EOF"
sudo sh -c "cat <<EOF >> /etc/systemd/network/consul0.network
[Match]
Name=consul0
[Network]
Address=169.254.1.53
EOF"
# Restart to pick up new int
sudo systemctl restart systemd-networkd && sleep 1
# Install configure dnsmasq
sudo apt-get -qq -y install dnsmasq
sudo sed -i "s/nameserver 127.0.0.53/nameserver 169.254.1.53/" /etc/resolv.conf

Workaround

While I'm not a fan of having to modify the search suffixes to include
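To make the search-suffix workaround from 2b easier to reason about, here is a minimal sketch that lists the fully qualified names a short hostname expands to, in the order of the search line. The function name search_candidates is hypothetical, and this is a simplification: a real resolver stops at the first answer, and its handling of names that already contain dots (the ndots option) is not modelled here.

```shell
# Sketch only: print each "<host>.<search domain>" candidate, in order,
# from the `search` line of a resolv.conf-style file.
# `search_candidates` is a hypothetical helper, not part of any tool.
search_candidates() (
  host=$1
  conf=${2:-/etc/resolv.conf}
  # If several `search` lines exist, keep only the last one.
  domains=$(awk '$1 == "search" { $1 = ""; line = $0 } END { print line }' "$conf")
  for d in $domains; do
    printf '%s.%s\n' "$host" "$d"
  done
)
```

With the resolv.conf from 2b, calling it for rabbit0 would list the service.consul candidate first and the node.consul candidate second, which matches the order of the queries in the tcpdump capture. Each candidate can then be checked by hand, for example with getent hosts.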
Replies: 6 comments 24 replies
Thanks for the comprehensive set of information. However, I don't see where you have set RABBITMQ_USE_LONGNAME=true in the environment, or USE_LONGNAME=true in rabbitmq-env.conf: https://www.rabbitmq.com/clustering.html#node-names
I am converting this to a discussion as there is as yet no evidence of a bug. We can convert back to an issue if necessary. GitHub will close and lock this issue but will provide a link to the discussion.
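As a rough way to check the two places mentioned in this thread where long names can be switched on (RABBITMQ_USE_LONGNAME in the environment, USE_LONGNAME in rabbitmq-env.conf), something like the following could be run inside the container. The function name longname_enabled is hypothetical, the default path is the one used by the bind mounts in this thread, and the check only looks for the literal value true.

```shell
# Sketch only: report where long names appear to be enabled.
# `longname_enabled` is a hypothetical helper; it matches the literal
# value "true" and nothing else.
longname_enabled() (
  env_file=${1:-/etc/rabbitmq/rabbitmq-env.conf}
  # 1. Environment variable form.
  if [ "${RABBITMQ_USE_LONGNAME:-}" = "true" ]; then
    echo "environment"
    exit 0
  fi
  # 2. rabbitmq-env.conf form (no RABBITMQ_ prefix inside the file).
  if [ -r "$env_file" ] &&
     grep -Eq '^[[:space:]]*USE_LONGNAME=true[[:space:]]*$' "$env_file"; then
    echo "$env_file"
    exit 0
  fi
  echo "not set"
  exit 1
)
```

It prints where the setting was found and returns non-zero when neither form is present, so it can be used in a shell conditional.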
So I've added the discussed file
And after resetting, this STILL is not working. Looking at the attempted resolution, the configured parameters for consul in
The line Shown below is that this is never attempted. appended is
I also tried setting the env variable directly with
I am likely missing some config element somewhere, or I misunderstand how this works, but it is not apparent to me from scouring the documentation.
What DOES work, however, is simply adding
Hopefully this helps someone.
@namachieli here is a relatively simple example using Docker Compose, Consul and long names: https://github.com/lukebakken/docker-rabbitmq-cluster/blob/master/docker-compose.yml
Next step is to see about getting
@namachieli actually I think I have a theory as to what is going on. When you build your RabbitMQ cluster, you are not explicitly setting the node names, which means that RabbitMQ computes them from what the system-level network functions return. The following shows what happens, in essence, when you set
Note that
If you do NOT explicitly specify the RabbitMQ node name (via
In the case of my example project, the Docker containers running RabbitMQ have "short" host names set up, so I must explicitly use
In your case, the short names work because you append the DNS suffix for registering the services in Consul as well as in your DNS system.
I'm going to open a PR against your repo with some changes. I can't guarantee they'll work because I'm not 100% sure how everything fits together.
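To illustrate what setting the node name explicitly could look like, here is a small sketch that builds a long node name from the container's short host name, assuming the host.node.consul naming seen in the DNS captures earlier in this thread. The helper names long_node_name and is_long_name are hypothetical. The second helper only checks that the host part contains a dot, which is what Erlang expects of long names as far as I know.

```shell
# Sketch only: derive rabbit@<short host>.node.<suffix>.
# Both helpers are hypothetical names used for illustration.
long_node_name() (
  host=${1:-$(hostname)}
  suffix=${2:-consul}
  short=${host%%.*}   # drop any domain part the host name already has
  printf 'rabbit@%s.node.%s\n' "$short" "$suffix"
)

# Succeeds when the host part after "@" contains at least one dot.
is_long_name() {
  case ${1#*@} in
    *.*) return 0 ;;
    *) return 1 ;;
  esac
}
```

A name produced this way is only useful if it also resolves from every other node, so it is worth checking the host part with getent hosts or nslookup before relying on it.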
For the sake of parity, and for when the linked repo is inevitably destroyed: hopefully this helps the next person. This is the Nomad job spec that successfully loads and clusters using Consul.
Nomad Job Spec for RabbitMQ Cluster with Consul
job "rabbitmq" {
  datacenters = ["us-west-2"]
  type = "service"
  group "cluster" {
    count = 3
    update {
      max_parallel = 1
    }
    network {
      mode = "host"
      # https://stackoverflow.com/questions/63601913/nomad-and-port-mapping
      port "amqp" { static = 5672 }
      port "ui" { static = 15672 }
      port "epmd" { static = 4369 }
      port "internode" { static = 25672 }
    }
    task "rabbitmq" {
      driver = "docker"
      config {
        image = "rabbitmq:3.9-management"
        hostname = attr.unique.hostname
        ports = ["amqp", "ui", "epmd", "internode"]
        mount {
          type = "bind"
          source = "local/rabbitmq-env.conf"
          target = "/etc/rabbitmq/rabbitmq-env.conf"
          readonly = true
        }
        mount {
          type = "bind"
          source = "local/rabbitmq.conf"
          target = "/etc/rabbitmq/rabbitmq.conf"
          readonly = true
        }
        mount {
          type = "bind"
          source = "local/enabled_plugins"
          target = "/etc/rabbitmq/enabled_plugins"
          readonly = false
        }
      }
      env {
        RABBITMQ_ERLANG_COOKIE = "ADUMMYSTRINGFORNOW"
        RABBITMQ_DEFAULT_USER = "test"
        RABBITMQ_DEFAULT_PASS = "test"
        RABBITMQ_USE_LONGNAME = true # https://github.com/rabbitmq/rabbitmq-server/discussions/4229
      }
      service {
        name = "rabbitmq-ui"
        port = "ui"
        tags = ["rabbitmq-ui", "urlprefix-/rabbitmq-ui"]
        check {
          type = "tcp"
          interval = "10s"
          timeout = "2s"
        }
      }
      template {
        destination = "local/enabled_plugins"
        change_mode = "signal"
        change_signal = "SIGHUP"
        data = <<-EOF
          [rabbitmq_management,rabbitmq_peer_discovery_consul].
        EOF
      }
      template {
        destination = "local/rabbitmq-env.conf"
        change_mode = "signal"
        change_signal = "SIGHUP"
        data = <<-EOF
          USE_LONGNAME=true
          NODENAME="rabbit@$(hostname).node.consul"
        EOF
      }
      template {
        destination = "local/rabbitmq.conf"
        change_mode = "signal"
        change_signal = "SIGHUP"
        data = <<-EOF
          # https://www.rabbitmq.com/configure.html
          # https://www.rabbitmq.com/clustering.html#node-names
          # https://www.rabbitmq.com/cluster-formation.html#peer-discovery-consul
          # https://github.com/rabbitmq/rabbitmq-server/blob/master/deps/rabbit/docs/rabbitmq.conf.example
          cluster_partition_handling = autoheal
          cluster_formation.consul.include_nodes_with_warnings = true
          cluster_formation.peer_discovery_backend = consul
          cluster_formation.consul.host = {{ env "attr.unique.network.ip-address" }}
          cluster_formation.consul.svc = rabbitmq
          cluster_formation.consul.svc_addr_auto = true
          cluster_formation.consul.svc_addr_use_nodename = true
          cluster_formation.consul.use_longname = true
          cluster_formation.consul.scheme = http
          cluster_formation.node_cleanup.only_log_warning = true
        EOF
      }
    }
  }
}
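For anyone verifying a deployment like this, one possible check is to pull the peer host names out of the "All discovered existing cluster peers" log line shown earlier in the thread and confirm that each one resolves from inside the container. The sketch below only does the extraction. The function name peer_hosts is hypothetical, and it assumes the log line format shown in this thread.

```shell
# Sketch only: read RabbitMQ log lines on stdin and print the host part
# of every peer listed on "All discovered existing cluster peers" lines.
# `peer_hosts` is a hypothetical helper.
peer_hosts() {
  sed -n 's/.*All discovered existing cluster peers: //p' |
    tr ',' '\n' |
    sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' -e 's/^[^@]*@//' |
    sort -u
}
```

The output can be piped into a loop that runs getent hosts on each name. Any name that fails to resolve is a likely cause of the nodedown error seen in the failing attempt.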