Skip to content

Conversation

@Ayanda-D
Copy link
Contributor

Proposed Changes

Hello Team! 👋

We're seeing some peer-discovery issues in 4.x, which we've managed to fix with this patch.

When a new node starts-up and rabbit_peer_discovery:sync_desired_cluster() is called (via rabbit_db bootstep), a number of attempts are carried out to connect to the configured cluster nodes as per discovery_retry_limit and discovery_retry_interval, which can hang for long periods of time if peer nodes are unreachable (e.g. 7 node cluster, up to 30-minutes using default DEFAULT_DISCOVERY_RETRY_COUNT=30 and DEFAULT_DISCOVERY_RETRY_INTERVAL_MS=1000).

The peer node connection attempts in sync_desired_cluster/0 make use of an erpc_call/5 which in turn uses a 10-second timeout when nodes are unreachable. This further delays each connection attempt beyond desired configured limits and intervals. Only single connection attempts on these erpc calls are only required at this point during boot when sync_desired_cluster/0 is carried out.

Reproduce with (using 6 fake/non-existent peer nodes):

gmake run-broker

> application:set_env(rabbit, cluster_nodes, ['rabbit@host1', 'rabbit@host2', 'rabbit@host3', 'rabbit@host4', 'rabbit@host5', 'rabbit@host6']).

> application:set_env(rabbit, cluster_formation, [{peer_discovery_backend, rabbit_peer_discovery_classic_config}, {discovery_retry_limit, 5}, {discovery_retry_interval, 10}]).

> rabbit_peer_discovery:sync_desired_cluster().

Each erpc_call/5 call imposes an additional 10-second delay, per connection retry, per un-reachable node.

We want these peer discovery to respect discovery_retry_limit and discovery_retry_interval configs and the node attempting connectivity to proceed as standalone node, i.e get to this point at the expected time:

2025-10-16 16:31:19.842369+01:00 [error] <0.138.0> Peer discovery: could not discover and join another node; proceeding as a standalone node

Also, if a node is first reachable then becomes unreachable on query_node_props2, calls like this below, imply an unavoidable 10-second wait for each erpc_call/5 call i.e. 40-seconds in total:

get_node_start_time(Node, FromNode) ->
    try
        erpc_call(Node,rabbit_boot_state, get_start_time, [], FromNode)
    catch
        error:{exception, _, _} ->
            NativeStartTime = erpc_call(
                                Node, erlang, system_info, [start_time],
                                FromNode),
            TimeOffset = erpc_call(Node, erlang, time_offset, [], FromNode),
            SystemStartTime = NativeStartTime + TimeOffset,
            StartTime = erpc_call(
                          Node, erlang, convert_time_unit,
                          [SystemStartTime, native, microsecond], FromNode),
            StartTime
    end.

As a result this is making rabbitmq-4.x unusable in some environments with such cluster rollouts. This patch removes this 10-second timeout, sets it to 0 to allow peer discover to respect the configured discovery_retry_limit and discovery_retry_interval.

NOTE: We only see a 10-second timeout necessary and useful when node-props are being acquired from successfully connected nodes, on the first erpc_call - if this takes longer than 10-seconds then something definitely wrong and we cant proceed. Although we could also remove this 10-second wait as well.

Please take a look, we are keen to having this bug-fix available to enable use of 4.x in certain environments.

Types of Changes

What types of changes does your code introduce to this project?
Put an x in the boxes that apply

  • Bug fix (non-breaking change which fixes issue #NNNN)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause an observable behavior change in existing systems)
  • Documentation improvements (corrections, new content, etc)
  • Cosmetic change (whitespace, formatting, etc)
  • Build system and/or CI

Checklist

Put an x in the boxes that apply.
You can also fill these out after creating the PR.
This is simply a reminder of what we are going to look for before merging your code.

  • Mandatory: I (or my employer/client) have have signed the CA (see https://github.com/rabbitmq/cla)
  • I have read the CONTRIBUTING.md document
  • I have added tests that prove my fix is effective or that my feature works
  • All tests pass locally with my changes
  • If relevant, I have added necessary documentation to https://github.com/rabbitmq/rabbitmq-website
  • If relevant, I have added this change to the first version(s) in release-notes that I expect to introduce it

Further Comments

If this is a relatively large or complex change, kick off the discussion by explaining why you chose the solution
you did and what alternatives you considered, etc.

…er-nodes

are carried out only once on each configured rabbit.discovery_retry_limit loop.
The erpc_call/5 abstracts a maximum 10-second timeout which can further delay
connection attempts beyond desired configured limits and intervals for nodes
which are remain unreacheable. Only single connection attempts are required at
this point, i.e. on the rabbit_peer_discovery:sync_desired_cluster/1 phase.
@michaelklishin michaelklishin merged commit 98d963b into rabbitmq:main Oct 18, 2025
290 of 291 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants