Inability to view nodes in management UI upgrading from 3.12 to 3.13 (Docker Swarm, Consul clustering) #10661
Replies: 3 comments · 23 replies
-
We cannot suggest much without seeing logs from all nodes. We do not guess in this community. After upgrades, management UI users usually must reset their browser cache.
-
Virtual hosts cannot be "distributed" across nodes; a virtual host exists on all cluster nodes at once. CLI and HTTP API operations that create a virtual host explicitly wait for its state to be initialized on all nodes before returning. RabbitMQ supports multiple queue leader replica placement strategies for classic queues. For quorum queues there is much less control because Raft leader election, by design, does not guarantee that any specific node will be selected.
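For illustration, a classic-queue leader placement strategy can be selected in rabbitmq.conf. This is a minimal sketch using the key name from the 3.12/3.13 era; check the docs for your version:

```
# rabbitmq.conf: place new classic queue leaders on the node that
# currently hosts the fewest leaders, spreading them across the cluster.
queue_master_locator = min-masters
```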
-
Thank you for sharing the logs! I see that Consul returns an empty list for about 27 seconds, then a list with only node 3 in it until the 30-second timeout. Because that list does not contain node 1, node 1 continues as a standalone node after the timeout. Do you have logs from 3.12 to compare? Perhaps a longer peer discovery timeout would help here. If you perform the same HTTP query as RabbitMQ once all nodes have booted, what does it return?
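If you want to try a longer discovery window, the retry behaviour is controlled by the cluster formation settings in rabbitmq.conf. A sketch with example values only:

```
# rabbitmq.conf: retry peer discovery more times with a longer pause.
# 60 attempts x 2000 ms gives Consul ~2 minutes to return a stable list.
cluster_formation.discovery_retry_limit = 60
cluster_formation.discovery_retry_interval = 2000
```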
-
@dumbbell I shared the API result from Consul when all nodes were up. I might not have understood your point completely; I would appreciate it if you could clarify a bit. In both 3.12 and 3.13, the nodes come up at roughly the same time. In 3.12 it also gets a timeout error on the first request, but on the subsequent request it can retrieve the configuration and create the cluster. What can I do for a longer peer discovery timeout?

v3.12 rabbitmq-02 logs
-
Based on your reply, this is probably not a timeout problem. I don't know how Consul works, but it looks like the reply to the HTTP query changes over time. Do you know the condition(s) under which a node is returned by Consul? For example, does Consul return only the nodes that have already started?
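One way to observe this is to poll Consul's health endpoint while the containers boot. A rough sketch; the service name rabbitmq and the consul:8500 address are placeholders for whatever your cluster_formation.consul settings use:

```
# Print the nodes Consul currently returns, every 2 seconds.
while true; do
  date
  curl -s http://consul:8500/v1/health/service/rabbitmq \
    | grep -o '"Node":"[^"]*"'
  sleep 2
done
```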
-
Let me expand my answer a bit. The peer discovery logic changed significantly in 3.13 (see #9797). In particular, it now requires that the list of nodes returned by the peer discovery backend contains the node that queried it. That’s why I’m asking how the discovered nodes list grows.
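A quick way to check whether a given node would pass that requirement (illustrative; adjust the service name and the mapping from container hostname to Consul node name):

```
# If this prints nothing, the querying node is absent from the discovered
# list, which is the condition 3.13 treats as "do not form a cluster yet".
curl -s http://consul:8500/v1/health/service/rabbitmq | grep "$(hostname)"
```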
-
In both scenarios, when the Consul service starts, I receive an empty array.
-
@michaelklishin @dumbbell I resolved the issue by adding the following two lines to my configuration file:
-
The confusing part for me is that in v3.13 it starts with a GET request, while in v3.12 it starts with a PUT request. In v3.13, if a node tries to fetch the RabbitMQ services from Consul before registering itself, Consul returns an empty array (as we can see). The service that finishes its 60 attempts first then registers itself with a PUT request (I can see that it sends a PUT request after the GET requests are completed), allowing the other services, which are still retrying, to find it. I think this could be the reason why the workaround above helped.
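To make the ordering concrete, the difference can be expressed as the Consul HTTP calls involved. The endpoints below are from Consul's public HTTP API; the exact requests RabbitMQ issues may differ in detail:

```
# v3.13 as observed: discovery (GET) happens before any registration (PUT)
curl -s http://consul:8500/v1/health/service/rabbitmq       # -> [] at first
curl -s -X PUT -d @service.json \
  http://consul:8500/v1/agent/service/register              # only later

# v3.12 as observed: a PUT comes first, so the node is already
# visible by the time discovery runs
curl -s -X PUT -d '{"Name":"rabbitmq"}' \
  http://consul:8500/v1/session/create
curl -s http://consul:8500/v1/health/service/rabbitmq
```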
-
That’s interesting. From what you say, it appears that Consul adds a node to its list after the node issues that PUT request. I looked at the docs of Consul Sessions, which is what is behind that PUT: a session is required to acquire a lock in Consul. The RabbitMQ Consul peer discovery backend abuses this; the node becomes visible to discovery as a side effect of creating the session for the lock.

This was fine in 3.12 because the lock was always acquired before querying discovered nodes, so the session was created first. In 3.13, the lock is only acquired if the node wants to join another one, thus after discovered nodes were queried, and this never happens for the first node, which joins nobody. That fixes locking issues that 3.12 had. However, the Consul backend breaks because the registration side effect of acquiring the lock is gone for that first node.

There is a design problem around that Consul backend that I need to think about.
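For reference, this is roughly what the session-then-lock dance looks like against Consul's API (a sketch; the key and session names are made up):

```
# 1. Create a session. This is the PUT that, per the observation above,
#    has the side effect of making the node visible to discovery.
SESSION=$(curl -s -X PUT -d '{"Name":"rabbitmq-peer-discovery"}' \
  http://consul:8500/v1/session/create | grep -o '[0-9a-f-]\{36\}')

# 2. Acquire the startup lock with that session (Consul KV acquire).
curl -s -X PUT \
  "http://consul:8500/v1/kv/rabbitmq/startup_lock?acquire=$SESSION"
```

In 3.13, step 1 only happens on nodes that decide to join a peer, so the first node never becomes visible this way.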
-
So, the configuration I thought of as a solution to the problem is not actually a solution. If there's an issue opened about this, I'd like to follow it. I hope that it won't lead to the discontinuation of support for Consul.
-
@dumbbell in the worst-case scenario we can make the Consul implementation re-register all known nodes one way or another. There aren't clusters with hundreds of nodes, so this would be wasteful but should still work. Another idea I have is registering that way until we reach the target node count hint, something we originally designed for less trivial cases where knowing the expected number of nodes might help.
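For reference, that hint is exposed as a cluster formation setting in rabbitmq.conf (the value is an example):

```
# rabbitmq.conf: the number of nodes the cluster is expected to have.
cluster_formation.target_cluster_size_hint = 3
```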
-
For the record, I created issue #10760 to track this problem. I'm working on a patch, but need to be careful because the peer discovery workflow has to change a bit.
This discussion was converted from issue #10661 on March 04, 2024 14:23.
-
Describe the bug
After upgrading from 3.12 to 3.13 with the same configuration (clustering with Consul in Docker), the nodes are no longer visible in the management UI. Additionally, virtual hosts are only created on a single node, so I can only reach them intermittently, depending on which node serves the request. Refreshing the management UI changes the cluster name displayed (e.g., from rabbitmq-01 to rabbitmq-02).
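A quick way to confirm the split from the command line: run this inside each container; in a healthy cluster, every node reports the same set of peers:

```
rabbitmqctl cluster_status
```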
Reproduction steps
I use haproxy:2.8 and consul:1.15, and I did not change their versions or configurations in either scenario. The steps I followed:

1. Deployed the RabbitMQ (3.12-management) image with Docker Swarm.
2. Upgraded to the (3.13-management) image with the same config file.

Config file:
Expected behavior
I expected:
Additional context
No response