Description
Summary
Unable to cluster dynamically using Consul service discovery with hostname lookup and "use_longname". I have fully troubleshot this down to a name-resolution issue (details below).
NOTE
All hostnames have been obfuscated to <rabbit#>. I am aware of the significance of hostnames and/vs node names.
Also, there may be discrepancies between a given rabbit# and a specific IP. I had to destroy/rebuild between 2a and 2b, so any inconsistency is likely from that. I have progressed troubleshooting past IP addresses, reachability, etc.
Configuration
rabbitmq.conf
cluster_formation.consul.include_nodes_with_warnings = true
cluster_formation.peer_discovery_backend = consul
cluster_formation.consul.host = <ip address>
cluster_formation.consul.svc = rabbitmq
cluster_formation.consul.svc_addr_auto = true
cluster_formation.consul.svc_addr_use_nodename = true
cluster_formation.consul.use_longname = true
cluster_formation.consul.scheme = http
cluster_formation.consul.domain_suffix = consul
cluster_partition_handling = autoheal
cluster_formation.node_cleanup.only_log_warning = true
enabled_plugins
[rabbitmq_management,rabbitmq_peer_discovery_consul].
Troubleshooting
Telnet to 5672 to confirm the rabbit containers can open connections; confirmed in the logs.
Telnet from rabbit1 to rabbit0
<rabbit1:>/# telnet <hostname rabbit0>.node.consul 5672
Trying 10.x.x.101...
Connected to <hostname rabbit0>.node.consul.
Escape character is '^]'.
12345
Connection closed by foreign host.
rabbitmq logs on rabbit0
2022-03-02 20:30:20.638548+00:00 [info] <0.7441.0> accepting AMQP connection <0.7441.0> (10.x.x.82:44964 -> 172.17.0.2:5672)
2022-03-02 20:30:25.146542+00:00 [erro] <0.7441.0> closing AMQP connection <0.7441.0> (10.x.x.82:44964 -> 172.17.0.2:5672):
2022-03-02 20:30:25.146542+00:00 [erro] <0.7441.0> {handshake_timeout,handshake}
The same was done for TCP 15672, 25672, and 4369... also in both directions.
1 (Failing) Attempt to reset rabbit1 and cluster
stop, reset, start
<rabbit1:>/# rabbitmqctl stop_app && rabbitmqctl reset && rabbitmqctl start_app
rabbitmq logs on rabbit1 (clustering failing)
2022-03-02 20:18:39.137566+00:00 [info] <0.9129.0> Running boot step database defined by app rabbit
2022-03-02 20:18:39.138082+00:00 [info] <0.9129.0> Node database directory at /var/lib/rabbitmq/mnesia/rabbit@<rabbit1> is empty. Assuming we need to join an existing cluster or initialise from scratch...
2022-03-02 20:18:39.138155+00:00 [info] <0.9129.0> Configured peer discovery backend: rabbit_peer_discovery_consul
2022-03-02 20:18:39.138532+00:00 [info] <0.9129.0> Will try to lock with peer discovery backend rabbit_peer_discovery_consul
2022-03-02 20:18:39.157275+00:00 [info] <0.9129.0> All discovered existing cluster peers: rabbit@<rabbit0>, rabbit@<rabbit2>
2022-03-02 20:18:39.157350+00:00 [info] <0.9129.0> Peer nodes we can cluster with: rabbit@<rabbit0>, rabbit@<rabbit2>
2022-03-02 20:18:47.158279+00:00 [warn] <0.9129.0> Could not auto-cluster with node rabbit@<rabbit0>: {badrpc,
2022-03-02 20:18:47.158279+00:00 [warn] <0.9129.0> nodedown}
/etc/resolv.conf
nameserver 169.254.1.53
options edns0 trust-ad
search <autosearch.domain>
dnsmasq logs showing the query received
Note that only the hostname with the auto-appended search domain is ever attempted, and specifically NOT .node.consul.
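This matches standard stub-resolver search-list behavior: a bare hostname (no dots) is expanded with each `search` suffix in order, and `.node.consul` is never tried because it is not in the list. A rough simulation of that expansion (simplified; it ignores some glibc `ndots` edge cases and is for illustration only):

```python
def candidate_queries(name: str, search: list[str], ndots: int = 1) -> list[str]:
    """Approximate the order in which a stub resolver tries names.

    A name with fewer than `ndots` dots (and no trailing dot) is first
    expanded with each search suffix; the literal name is tried last.
    A trailing dot suppresses search-list expansion entirely.
    """
    if name.endswith("."):
        return [name.rstrip(".")]
    expanded = [f"{name}.{suffix}" for suffix in search]
    if name.count(".") >= ndots:
        return [name] + expanded
    return expanded + [name]

# Failing case: only <autosearch.domain> is searched, so the .node.consul
# name is never generated.
print(candidate_queries("rabbit0", ["autosearch.domain"]))
# Working case (2b below): node.consul added to the search list, so
# rabbit0.service.consul is tried (NXDomain), then rabbit0.node.consul.
print(candidate_queries("rabbit0", ["service.consul", "node.consul", "consul", "autosearch.domain"]))
```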
TCPDUMP showing DNS queries attempted
<rabbit1>:~$ sudo tcpdump udp port 53 --interface docker0
20:54:37.227151 IP 172.17.0.2.53032 > <rabbit1>.domain: 46741+ [1au] A? <rabbit2>.<autosearch.domain>. (70)
20:54:42.231462 IP 172.17.0.2.53032 > <rabbit1>.domain: 46741+ [1au] A? <rabbit2>.<autosearch.domain>. (70)
20:54:45.228519 IP 172.17.0.2.34334 > <rabbit1>.domain: 61467+ [1au] A? <rabbit0>.<autosearch.domain>. (70)
20:54:50.231460 IP 172.17.0.2.34334 > <rabbit1>.domain: 61467+ [1au] A? <rabbit0>.<autosearch.domain>. (70)
20:54:53.730430 IP 172.17.0.2.51814 > <rabbit1>.domain: 53238+ [1au] A? <rabbit2>.<autosearch.domain>. (70)
20:54:58.735626 IP 172.17.0.2.51814 > <rabbit1>.domain: 53238+ [1au] A? <rabbit2>.<autosearch.domain>. (70)
20:55:01.731624 IP 172.17.0.2.40696 > <rabbit1>.domain: 18513+ [1au] A? <rabbit0>.<autosearch.domain>. (70)
20:55:06.736776 IP 172.17.0.2.40696 > <rabbit1>.domain: 18513+ [1au] A? <rabbit0>.<autosearch.domain>. (70)
20:55:10.233289 IP 172.17.0.2.37892 > <rabbit1>.domain: 6287+ [1au] A? <rabbit2>.<autosearch.domain>. (70)
20:55:10.744937 IP <rabbit1>.domain > 172.17.0.2.34334: 61467 NXDomain 0/0/1 (70)
20:55:10.745115 IP <rabbit1>.domain > 172.17.0.2.40696: 18513 NXDomain 0/0/1 (70)
rabbit0.node.consul does resolve if used
20:57:52.463076 IP 172.17.0.2.60604 > <rabbit1>.domain: 31859+ A? <rabbit2>.node.consul. (63)
20:57:52.464935 IP <rabbit1>.domain > 172.17.0.2.60604: 31859* 1/0/1 A 10.x.x.101 (115)
20:57:52.465267 IP 172.17.0.2.42540 > <rabbit1>.domain: 59909+ AAAA? <rabbit2>.node.consul. (63)
20:57:52.466754 IP <rabbit1>.domain > 172.17.0.2.42540: 59909* 0/0/1 (99)
2a (Working) Add entries to /etc/hosts for rabbit0 and rabbit1 hostnames mutually
rabbit0
<rabbit0:>/# cat /etc/hosts | grep <rabbit1>
10.x.x.82 <rabbit1>
rabbit1
<rabbit1:>/# cat /etc/hosts | grep <rabbit0>
10.x.x.101 <rabbit0>
Attempt to reset and cluster
<rabbit1>:/# rabbitmqctl stop_app && rabbitmqctl reset && rabbitmqctl start_app
rabbitmq logs on rabbit1 (clustering working)
2022-03-02 20:43:56.517199+00:00 [info] <0.10846.0> Running boot step database defined by app rabbit
2022-03-02 20:43:56.517778+00:00 [info] <0.10846.0> Node database directory at /var/lib/rabbitmq/mnesia/rabbit@<rabbit1> is empty. Assuming we need to join an existing cluster or initialise from scratch...
2022-03-02 20:43:56.517865+00:00 [info] <0.10846.0> Configured peer discovery backend: rabbit_peer_discovery_consul
2022-03-02 20:43:56.518024+00:00 [info] <0.10846.0> Will try to lock with peer discovery backend rabbit_peer_discovery_consul
2022-03-02 20:43:56.536229+00:00 [info] <0.10846.0> All discovered existing cluster peers: rabbit@<rabbit0>, rabbit@<rabbit2>
2022-03-02 20:43:56.536303+00:00 [info] <0.10846.0> Peer nodes we can cluster with: rabbit@<rabbit0>, rabbit@<rabbit2>
2022-03-02 20:43:56.546445+00:00 [info] <0.10846.0> Node 'rabbit@<rabbit0>' selected for auto-clustering
2022-03-02 20:43:56.562305+00:00 [info] <0.10846.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2022-03-02 20:43:56.562456+00:00 [info] <0.10846.0> Successfully synced tables from a peer
2022-03-02 20:43:56.795333+00:00 [info] <0.10846.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2022-03-02 20:43:56.795483+00:00 [info] <0.10846.0> Successfully synced tables from a peer
2022-03-02 20:43:56.795533+00:00 [warn] <0.10846.0> Feature flags: the previous instance of this node must have failed to write the `feature_flags` file at `/var/lib/rabbitmq/mnesia/rabbit@<rabbit1>-feature_flags`:
2022-03-02 20:43:56.795585+00:00 [warn] <0.10846.0> Feature flags: - list of previously disabled feature flags now marked as such: [empty_basic_get_metric]
2022-03-02 20:43:56.799312+00:00 [info] <0.10846.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2022-03-02 20:43:56.799459+00:00 [info] <0.10846.0> Successfully synced tables from a peer
2022-03-02 20:43:56.815939+00:00 [info] <0.10846.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2022-03-02 20:43:56.816170+00:00 [info] <0.10846.0> Successfully synced tables from a peer
2022-03-02 20:43:56.822951+00:00 [info] <0.10846.0> Setting up a table for connection tracking on this node: 'tracked_connection_on_node_rabbit@<rabbit1>'
2022-03-02 20:43:56.829340+00:00 [info] <0.10846.0> Setting up a table for per-vhost connection counting on this node: 'tracked_connection_per_vhost_on_node_rabbit@<rabbit1>'
2022-03-02 20:43:56.842231+00:00 [info] <0.10846.0> Setting up a table for per-user connection counting on this node: 'tracked_connection_table_per_user_on_node_rabbit@<rabbit1>'
2022-03-02 20:43:56.843712+00:00 [info] <0.10846.0> Will register with peer discovery backend rabbit_peer_discovery_consul
2b (Working) Modify search domains to auto-complete node.consul
/etc/resolv.conf
nameserver 169.254.1.53
options edns0 trust-ad
search service.consul node.consul consul <autosearch.domain>
Test DNS resolution
<rabbit1>:/# nslookup <rabbit0>
Server: 169.254.1.53
Address: 169.254.1.53#53
Name: <rabbit0>.node.consul
Address: 10.x.x.82
Attempt to reset and cluster
<rabbit1>:/# rabbitmqctl stop_app && rabbitmqctl reset && rabbitmqctl start_app
TCPDUMP showing DNS queries attempted
<rabbit1>:~$ sudo tcpdump udp port 53 --interface docker0
22:53:20.934533 IP 172.17.0.2.49236 > <rabbit1>.domain: 36644+ [1au] A? <rabbit0>.service.consul. (77)
22:53:20.935991 IP <rabbit1>.domain > 172.17.0.2.49236: 36644 NXDomain* 0/1/1 (127)
22:53:20.936080 IP 172.17.0.2.46483 > <rabbit1>.domain: 43459+ [1au] A? <rabbit0>.node.consul. (74)
22:53:20.937571 IP <rabbit1>.domain > 172.17.0.2.46483: 43459* 1/0/2 A 10.x.x.82 (126)
22:53:20.937908 IP 172.17.0.2.52939 > <rabbit1>.domain: 29495+ [1au] A? <rabbit0>.service.consul. (77)
22:53:20.939311 IP <rabbit1>.domain > 172.17.0.2.52939: 29495 NXDomain* 0/1/1 (127)
22:53:20.939427 IP 172.17.0.2.57120 > <rabbit1>.domain: 8183+ [1au] A? <rabbit0>.node.consul. (74)
22:53:20.940935 IP <rabbit1>.domain > 172.17.0.2.57120: 8183* 1/0/2 A 10.x.x.82 (126)
rabbitmq logs on rabbit1 (clustering working)
2022-03-02 22:53:20.915774+00:00 [info] <0.5139.0> Running boot step database defined by app rabbit
2022-03-02 22:53:20.916447+00:00 [info] <0.5139.0> Node database directory at /var/lib/rabbitmq/mnesia/rabbit@<rabbit1> is empty. Assuming we need to join an existing cluster or initialise from scratch...
2022-03-02 22:53:20.916560+00:00 [info] <0.5139.0> Configured peer discovery backend: rabbit_peer_discovery_consul
2022-03-02 22:53:20.916709+00:00 [info] <0.5139.0> Will try to lock with peer discovery backend rabbit_peer_discovery_consul
2022-03-02 22:53:20.933669+00:00 [info] <0.5139.0> All discovered existing cluster peers: rabbit@<rabbit0>, rabbit@<rabbit2>
2022-03-02 22:53:20.933723+00:00 [info] <0.5139.0> Peer nodes we can cluster with: rabbit@<rabbit0>, rabbit@<rabbit2>
2022-03-02 22:53:20.956197+00:00 [info] <0.5139.0> Node 'rabbit@<rabbit0>' selected for auto-clustering
2022-03-02 22:53:20.977140+00:00 [info] <0.5139.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2022-03-02 22:53:20.977343+00:00 [info] <0.5139.0> Successfully synced tables from a peer
2022-03-02 22:53:21.314738+00:00 [info] <0.5139.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2022-03-02 22:53:21.314902+00:00 [info] <0.5139.0> Successfully synced tables from a peer
2022-03-02 22:53:21.314948+00:00 [warn] <0.5139.0> Feature flags: the previous instance of this node must have failed to write the `feature_flags` file at `/var/lib/rabbitmq/mnesia/rabbit@<rabbit1>-feature_flags`:
2022-03-02 22:53:21.314993+00:00 [warn] <0.5139.0> Feature flags: - list of previously disabled feature flags now marked as such: [empty_basic_get_metric]
2022-03-02 22:53:21.319400+00:00 [info] <0.5139.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2022-03-02 22:53:21.319528+00:00 [info] <0.5139.0> Successfully synced tables from a peer
2022-03-02 22:53:21.335570+00:00 [info] <0.5139.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2022-03-02 22:53:21.335777+00:00 [info] <0.5139.0> Successfully synced tables from a peer
2022-03-02 22:53:21.344668+00:00 [info] <0.5139.0> Setting up a table for connection tracking on this node: 'tracked_connection_on_node_rabbit@<rabbit1>'
2022-03-02 22:53:21.352602+00:00 [info] <0.5139.0> Setting up a table for per-vhost connection counting on this node: 'tracked_connection_per_vhost_on_node_rabbit@<rabbit1>'
2022-03-02 22:53:21.360686+00:00 [info] <0.5139.0> Setting up a table for per-user connection counting on this node: 'tracked_connection_table_per_user_on_node_rabbit@<rabbit1>'
2022-03-02 22:53:21.362206+00:00 [info] <0.5139.0> Will register with peer discovery backend rabbit_peer_discovery_consul
Replication
The following Nomad job builds everything from stock Docker images. On the Docker parent host, dnsmasq is configured as a selective resolver bound to a dummy interface, forwarding queries to either Consul or systemd-resolved based on the domain.
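The selective-forwarding rule in the dnsmasq config below (`server=/consul/169.254.1.53#8600`) sends any query ending in `.consul` to Consul's DNS port and everything else to the default upstream. A toy model of that routing decision (not dnsmasq code, just the matching rule it applies here):

```python
def pick_upstream(qname: str) -> str:
    """Route a query the way the dnsmasq config in this issue does:
    anything under (or equal to) the 'consul' domain goes to Consul DNS
    on 169.254.1.53:8600; everything else goes to systemd-resolved."""
    CONSUL_UPSTREAM = "169.254.1.53#8600"
    DEFAULT_UPSTREAM = "127.0.0.53#53"
    labels = qname.rstrip(".").split(".")
    if labels[-1] == "consul":
        return CONSUL_UPSTREAM
    return DEFAULT_UPSTREAM

print(pick_upstream("rabbit0.node.consul"))        # 169.254.1.53#8600
print(pick_upstream("rabbit0.autosearch.domain"))  # 127.0.0.53#53
```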
Nomad Job to build Cluster
rabbitmq.cluster.nomad
job "rabbitmq" {
datacenters = ["us-west-2"]
type = "service"
group "cluster" {
count = 3
update {
max_parallel = 1
}
network {
mode = "host"
# https://stackoverflow.com/questions/63601913/nomad-and-port-mapping
port "amqp" { static = 5672 }
port "ui" { static = 15672 }
port "epmd" { static = 4369 }
port "internode" { static = 25672 }
}
task "rabbitmq" {
driver = "docker"
config {
image = "rabbitmq:3.9-management"
hostname = attr.unique.hostname
# https://stackoverflow.com/questions/63601913/nomad-and-port-mapping
ports = ["amqp", "ui", "epmd", "internode"]
mount {
type = "bind"
source = "local/rabbitmq.conf"
target = "/etc/rabbitmq/rabbitmq.conf"
readonly = false
}
mount {
type = "bind"
source = "local/enabled_plugins"
target = "/etc/rabbitmq/enabled_plugins"
readonly = false
}
}
env {
RABBITMQ_ERLANG_COOKIE = "ADUMMYSTRINGFORNOW"
RABBITMQ_DEFAULT_USER = "test"
RABBITMQ_DEFAULT_PASS = "test"
}
service {
name = "rabbitmq-ui"
port = "ui"
tags = ["rabbitmq-ui", "urlprefix-/rabbitmq-ui"]
check {
type = "tcp"
interval = "10s"
timeout = "2s"
}
}
template {
destination = "local/enabled_plugins"
change_mode = "signal"
change_signal = "SIGHUP"
data = <<-EOF
[rabbitmq_management,rabbitmq_peer_discovery_consul].
EOF
}
template {
destination = "local/rabbitmq.conf"
change_mode = "signal"
change_signal = "SIGHUP"
data = <<-EOF
# https://www.rabbitmq.com/configure.html
# https://www.rabbitmq.com/clustering.html#node-names
# https://www.rabbitmq.com/cluster-formation.html#peer-discovery-consul
# https://github.com/rabbitmq/rabbitmq-server/blob/master/deps/rabbit/docs/rabbitmq.conf.example
cluster_formation.consul.include_nodes_with_warnings = true
cluster_formation.peer_discovery_backend = consul
cluster_formation.consul.host = {{ env "attr.unique.network.ip-address" }}
cluster_formation.consul.svc = rabbitmq
cluster_formation.consul.svc_addr_auto = true
cluster_formation.consul.svc_addr_use_nodename = true
cluster_formation.consul.use_longname = true
cluster_formation.consul.scheme = http
cluster_formation.consul.domain_suffix = consul
cluster_partition_handling = autoheal
cluster_formation.node_cleanup.only_log_warning = true
EOF
}
}
}
}
dnsmasq configuration
/etc/dnsmasq.d/default
port=53
server=127.0.0.53
bind-interfaces
/etc/dnsmasq.d/consul
server=/consul/169.254.1.53#8600
listen-address=169.254.1.53
interface=consul0
Provisioning script used to install/configure dnsmasq on Ubuntu 20.04
# Make Dummy Int configs
sudo sh -c "cat <<EOF >> /etc/systemd/network/consul0.netdev
[NetDev]
Name=consul0
Kind=dummy
EOF"
sudo sh -c "cat <<EOF >> /etc/systemd/network/consul0.network
[Match]
Name=consul0
[Network]
Address=169.254.1.53
EOF"
# Restart to pick up new int
sudo systemctl restart systemd-networkd && sleep 1
# Install configure dnsmasq
sudo apt-get -qq -y install dnsmasq
sudo sed -i "s/nameserver 127.0.0.53/nameserver 169.254.1.53/" /etc/resolv.conf
Workaround
While I'm not a fan of having to modify the search suffixes to include node.consul, it is a simple and scalable workaround for the time being.
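If editing the host's resolv.conf ever becomes a problem, a possible alternative (untested in this setup; it assumes a Nomad version that supports the `dns` block inside `network`, and may interact with Docker's own DNS handling) would be to push the search list into the allocation instead of the host:

```hcl
network {
  mode = "host"
  # Sketch only: set the consul suffixes in the task's resolv.conf
  # rather than editing the Docker host's search list.
  dns {
    servers  = ["169.254.1.53"]
    searches = ["node.consul", "service.consul", "consul"]
  }
  port "amqp"      { static = 5672 }
  port "ui"        { static = 15672 }
  port "epmd"      { static = 4369 }
  port "internode" { static = 25672 }
}
```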