Unblock peer discovery on boot when attempting connectivity to unreachable cluster nodes #14743
Proposed Changes
Hello Team! 👋
We're seeing some peer-discovery issues in 4.x, which we've managed to fix with this patch.
When a new node starts up and `rabbit_peer_discovery:sync_desired_cluster()` is called (via the `rabbit_db` boot step), it makes a number of attempts to connect to the configured cluster nodes, as per `discovery_retry_limit` and `discovery_retry_interval`. This can hang for long periods if peer nodes are unreachable (e.g. in a 7-node cluster, up to 30 minutes using the defaults `DEFAULT_DISCOVERY_RETRY_COUNT=30` and `DEFAULT_DISCOVERY_RETRY_INTERVAL_MS=1000`).

The peer node connection attempts in `sync_desired_cluster/0` make use of `erpc_call/5`, which in turn uses a 10-second timeout when nodes are unreachable. This delays each connection attempt beyond the configured limits and intervals. Only a single connection attempt per `erpc` call is required at this point during boot, when `sync_desired_cluster/0` is carried out.

Reproduce with (using 6 fake/non-existent peer nodes):
Each `erpc_call/5` call imposes an additional 10-second delay, per connection retry, per unreachable node.

We want peer discovery to respect the `discovery_retry_limit` and `discovery_retry_interval` configs, and the node attempting connectivity to proceed as a standalone node, i.e. get to this point at the expected time:

Also, if a node is first reachable and then becomes unreachable during `query_node_props2`, calls like the ones below imply an unavoidable 10-second wait for each `erpc_call/5` call, i.e. 40 seconds in total:

As a result, this is making rabbitmq-4.x unusable in some environments with such cluster rollouts. This patch removes the 10-second timeout by setting it to `0`, which allows peer discovery to respect the configured `discovery_retry_limit` and `discovery_retry_interval`.

NOTE: We only see the 10-second timeout as necessary and useful when node props are being acquired from successfully connected nodes, on the first erpc_call. If this takes longer than 10 seconds then something is definitely wrong and we can't proceed. That said, we could also remove this 10-second wait as well.
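For reviewers who want to sanity-check the 30-minute figure, here is a rough back-of-the-envelope model in Python. It is only a sketch: the function name and the formula are illustrative assumptions, not code from RabbitMQ. It assumes every retry round contacts each unreachable peer once, waits out the full per-call timeout, and then sleeps for the retry interval.

```python
def worst_case_discovery_seconds(unreachable_peers, retry_limit,
                                 retry_interval_ms, rpc_timeout_s):
    """Rough upper bound on time spent in peer discovery during boot.

    Hypothetical model (not RabbitMQ code): each retry round waits
    rpc_timeout_s for every unreachable peer, then sleeps for the
    configured retry interval.
    """
    per_round = unreachable_peers * rpc_timeout_s + retry_interval_ms / 1000
    return retry_limit * per_round


# 7-node cluster, so 6 unreachable peers, with the defaults quoted above.
with_timeout = worst_case_discovery_seconds(6, 30, 1000, 10)
without_timeout = worst_case_discovery_seconds(6, 30, 1000, 0)

print(with_timeout / 60)   # 30.5 (minutes) with the 10-second timeout
print(without_timeout)     # 30.0 (seconds) once the timeout is set to 0
```

Under these assumptions, the per-call timeout accounts for almost all of the boot delay, while the configured limit and interval alone add up to about 30 seconds.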
Please take a look. We are keen to have this bug fix available to enable the use of 4.x in certain environments.
Types of Changes
What types of changes does your code introduce to this project?
Put an `x` in the boxes that apply
Checklist
Put an `x` in the boxes that apply. You can also fill these out after creating the PR.
This is simply a reminder of what we are going to look for before merging your code.
`CONTRIBUTING.md` document
Further Comments
If this is a relatively large or complex change, kick off the discussion by explaining why you chose the solution
you did and what alternatives you considered, etc.