Unblock peer discovery on boot when attempting connectivity to unreachable cluster nodes #14743

Ayanda-D · 2025-10-16T16:19:30Z

Proposed Changes

Hello Team! 👋

We're seeing some peer-discovery issues in 4.x, which we've managed to fix with this patch.

When a new node starts-up and rabbit_peer_discovery:sync_desired_cluster() is called (via rabbit_db bootstep), a number of attempts are carried out to connect to the configured cluster nodes as per discovery_retry_limit and discovery_retry_interval, which can hang for long periods of time if peer nodes are unreachable (e.g. 7 node cluster, up to 30-minutes using default DEFAULT_DISCOVERY_RETRY_COUNT=30 and DEFAULT_DISCOVERY_RETRY_INTERVAL_MS=1000).

The peer node connection attempts in sync_desired_cluster/0 make use of an erpc_call/5 which in turn uses a 10-second timeout when nodes are unreachable. This further delays each connection attempt beyond desired configured limits and intervals. Only single connection attempts on these erpc calls are only required at this point during boot when sync_desired_cluster/0 is carried out.

Reproduce with (using 6 fake/non-existent peer nodes):

gmake run-broker

> application:set_env(rabbit, cluster_nodes, ['rabbit@host1', 'rabbit@host2', 'rabbit@host3', 'rabbit@host4', 'rabbit@host5', 'rabbit@host6']).

> application:set_env(rabbit, cluster_formation, [{peer_discovery_backend, rabbit_peer_discovery_classic_config}, {discovery_retry_limit, 5}, {discovery_retry_interval, 10}]).

> rabbit_peer_discovery:sync_desired_cluster().

Each erpc_call/5 call imposes an additional 10-second delay, per connection retry, per un-reachable node.

We want these peer discovery to respect discovery_retry_limit and discovery_retry_interval configs and the node attempting connectivity to proceed as standalone node, i.e get to this point at the expected time:

2025-10-16 16:31:19.842369+01:00 [error] <0.138.0> Peer discovery: could not discover and join another node; proceeding as a standalone node

Also, if a node is first reachable then becomes unreachable on query_node_props2, calls like this below, imply an unavoidable 10-second wait for each erpc_call/5 call i.e. 40-seconds in total:

get_node_start_time(Node, FromNode) ->
    try
        erpc_call(Node,rabbit_boot_state, get_start_time, [], FromNode)
    catch
        error:{exception, _, _} ->
            NativeStartTime = erpc_call(
                                Node, erlang, system_info, [start_time],
                                FromNode),
            TimeOffset = erpc_call(Node, erlang, time_offset, [], FromNode),
            SystemStartTime = NativeStartTime + TimeOffset,
            StartTime = erpc_call(
                          Node, erlang, convert_time_unit,
                          [SystemStartTime, native, microsecond], FromNode),
            StartTime
    end.

As a result this is making rabbitmq-4.x unusable in some environments with such cluster rollouts. This patch removes this 10-second timeout, sets it to 0 to allow peer discover to respect the configured discovery_retry_limit and discovery_retry_interval.

NOTE: We only see a 10-second timeout necessary and useful when node-props are being acquired from successfully connected nodes, on the first erpc_call - if this takes longer than 10-seconds then something definitely wrong and we cant proceed. Although we could also remove this 10-second wait as well.

Please take a look, we are keen to having this bug-fix available to enable use of 4.x in certain environments.

Types of Changes

What types of changes does your code introduce to this project?
Put an x in the boxes that apply

Bug fix (non-breaking change which fixes issue #NNNN)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause an observable behavior change in existing systems)
Documentation improvements (corrections, new content, etc)
Cosmetic change (whitespace, formatting, etc)
Build system and/or CI

Checklist

Put an x in the boxes that apply.
You can also fill these out after creating the PR.
This is simply a reminder of what we are going to look for before merging your code.

Mandatory: I (or my employer/client) have have signed the CA (see https://github.com/rabbitmq/cla)
I have read the CONTRIBUTING.md document
I have added tests that prove my fix is effective or that my feature works
All tests pass locally with my changes
If relevant, I have added necessary documentation to https://github.com/rabbitmq/rabbitmq-website
If relevant, I have added this change to the first version(s) in release-notes that I expect to introduce it

Further Comments

If this is a relatively large or complex change, kick off the discussion by explaining why you chose the solution
you did and what alternatives you considered, etc.

…er-nodes are carried out only once on each configured rabbit.discovery_retry_limit loop. The erpc_call/5 abstracts a maximum 10-second timeout which can further delay connection attempts beyond desired configured limits and intervals for nodes which are remain unreacheable. Only single connection attempts are required at this point, i.e. on the rabbit_peer_discovery:sync_desired_cluster/1 phase.

and unreacheable configured cluster nodes

Ayanda-D added 2 commits October 16, 2025 16:04

update comment in peer_discovery_classic_config_SUITE on nonexistent

3e1ad6d

and unreacheable configured cluster nodes

michaelklishin merged commit 98d963b into rabbitmq:main Oct 18, 2025
290 of 291 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Unblock peer discovery on boot when attempting connectivity to unreachable cluster nodes #14743

Unblock peer discovery on boot when attempting connectivity to unreachable cluster nodes #14743

Uh oh!

Ayanda-D commented Oct 16, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Unblock peer discovery on boot when attempting connectivity to unreachable cluster nodes #14743

Unblock peer discovery on boot when attempting connectivity to unreachable cluster nodes #14743

Uh oh!

Conversation

Ayanda-D commented Oct 16, 2025

Proposed Changes

Types of Changes

Checklist

Further Comments

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants