replicaset: introduce replica connection recovery service#643

Merged: Gerold103 merged 5 commits into tarantool:master from mrForza:gh-632-connection-is-closed-during-init-connection on Feb 23, 2026

@mrForza (Contributor) commented Feb 12, 2026

Before this patch the replicaset module had no service responsible for
recreating closed replica connections. This led to the following bugs:

  1. The replicaset closed the initial netbox connection during the name
    check (vconnect stage) when a storage error happened and didn't try
    to reconnect. This led to alerts on the router and, as a result, to
    problems in high-level products built on vshard.
  2. If the connection to a replica was closed manually or due to a
    network error, the only way to recreate it was to call one of the
    replicaset's methods such as call, callrw and so on. However, when
    callro is invoked, a connection to the master can't be recreated due
    to an internal bug in this method.

This patch introduces a new replicaset service, replica_conn_recovery,
which tries to reconnect to "closed" replicas every RECONNECT_TIMEOUT
seconds. Once the closed connection is restored, the service goes into
"idling" mode with an infinite timeout until this connection is closed
again. This keeps all replica connections in a healthy state and, as a
result, fixes these bugs.

Part of #642
Part of #632

NO_DOC=bugfix

@mrForza mrForza requested review from Gerold103 and Serpentian and removed request for Serpentian February 12, 2026 17:09
@mrForza mrForza force-pushed the gh-632-connection-is-closed-during-init-connection branch from f1810e9 to f2e5892 Compare February 12, 2026 17:24
@Gerold103 (Collaborator) left a comment

Thanks for coming back so fast with a new patchset 🤯!

I have a few questions. Also, it looks like GH automatically took the first commit's description for the PR description, marking #632 as "part of" rather than "closes".

@mrForza mrForza force-pushed the gh-632-connection-is-closed-during-init-connection branch from f2e5892 to ddf737e Compare February 17, 2026 18:42
@mrForza mrForza requested a review from Gerold103 February 17, 2026 19:46
@Serpentian (Collaborator) left a comment

Thank you for the patchset. Almost there) Only small comments

@Serpentian Serpentian assigned mrForza and unassigned Serpentian Feb 18, 2026
@mrForza mrForza force-pushed the gh-632-connection-is-closed-during-init-connection branch from ddf737e to 410cbdd Compare February 18, 2026 21:00
@mrForza mrForza requested a review from Serpentian February 18, 2026 21:18
@Serpentian (Collaborator) left a comment

Well done, very good patchset and testing! Very small nits are left)

@Serpentian Serpentian assigned mrForza and unassigned Serpentian Feb 19, 2026
@mrForza mrForza force-pushed the gh-632-connection-is-closed-during-init-connection branch from 410cbdd to 88a2f8c Compare February 20, 2026 10:30
@mrForza mrForza assigned Gerold103 and unassigned Gerold103 and mrForza Feb 20, 2026
@Gerold103 (Collaborator) left a comment

Nice! I like how simple it looks now. Just one nit for a test.

Before this patch the `replicaset` module had no service responsible
for recreating closed replica connections. This led to the following
bugs:
1) The replicaset closed the initial netbox connection during the name
   check (`vconnect` stage) when a storage error happened and didn't try
   to reconnect. As a result, the router couldn't restore the connection
   to the failed replica and consequently couldn't make requests until
   restart.
2) If the connection to a replica was closed manually or due to a
   network error, the only way to recreate it was to call one of the
   replicaset's methods such as `call`, `callrw` and so on. However,
   when `callro` is invoked, a connection to the master can't be
   recreated.

This patch introduces a new replicaset service, `replica_conn_recovery`,
which tries to reconnect to "closed" replicas every `RECONNECT_TIMEOUT`
seconds. Once the closed connection is restored, the service goes into
"idling" mode with an infinite timeout until this connection is closed
again. This keeps all replica connections in a healthy state and
consequently:
1) fixes the 1st bug, as the router doesn't need its own logic for
   recreating closed connections.
2) fixes the 2nd bug, as we don't need to manually invoke the
   replicaset's methods to restore closed connections to replicas.
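
The behaviour described above can be pictured with a minimal Python sketch. It is not the vshard implementation, which is written in Lua and is not shown in this message. `Replica`, `recovery_step` and `try_connect` are hypothetical names introduced only for illustration; `RECONNECT_TIMEOUT` and its 0.5 second value are taken from the patchset. The sketch assumes the service idles only when every replica is connected.

```python
import math

RECONNECT_TIMEOUT = 0.5  # seconds, as stated in the patchset


class Replica:
    """Hypothetical stand-in for a replica with a connection flag."""

    def __init__(self, name, connected=True):
        self.name = name
        self.connected = connected


def recovery_step(replicas, try_connect):
    """Run one pass of the (sketched) recovery service.

    Tries to reconnect every closed replica using the caller-supplied
    try_connect(replica) -> bool, and returns how long the service
    should sleep before the next pass.
    """
    all_healthy = True
    for replica in replicas:
        if replica.connected:
            continue
        replica.connected = try_connect(replica)
        if not replica.connected:
            all_healthy = False
    # "Idling" mode: nothing left to recover, so wait without a deadline.
    return math.inf if all_healthy else RECONNECT_TIMEOUT
```

In this model a pass that leaves at least one replica disconnected returns `RECONNECT_TIMEOUT`, and a pass that restores everything returns an infinite timeout, which corresponds to the "idling" mode.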

Part of tarantool#642
Part of tarantool#632

NO_DOC=bugfix

Before this patch the replica connection recovery service tried to
recreate a failed connection to a replica every `RECONNECT_TIMEOUT`
(0.5) seconds. This could heavily load the cluster if the number of
routers and failed replicas reached hundreds or thousands.

To avoid this, we add a backoff interval to the replica connection
recovery service in order to reduce the number of requests the router
sends to replicas with closed connections. If the failed replica's
connection was closed due to one of the 4 "backoff" errors
(`STORAGE_IS_DISABLED`, `NO_SUCH_PROC`, `AccessDenied` or
`INSTANCE_NAME_MISMATCH`), the service waits
`replica.backoff_ts + REPLICA_BACKOFF_INTERVAL - fiber_clock()` seconds.
In all other cases the service waits `RECONNECT_TIMEOUT` seconds.
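
The wait-time rule above can be sketched in Python as follows. This is an illustration, not the Lua code from the patch. The function name `next_wait` and its parameters are hypothetical, the value of `REPLICA_BACKOFF_INTERVAL` is an assumption (the message does not state it), and clamping the result at zero is also an assumption about how an already expired backoff would be handled.

```python
RECONNECT_TIMEOUT = 0.5          # seconds, as stated in the message
REPLICA_BACKOFF_INTERVAL = 5.0   # assumed value, not given in the message

# The four "backoff" errors named in the message.
BACKOFF_ERRORS = frozenset({
    "STORAGE_IS_DISABLED",
    "NO_SUCH_PROC",
    "AccessDenied",
    "INSTANCE_NAME_MISMATCH",
})


def next_wait(error_name, backoff_ts, now):
    """Return how long the recovery service waits before retrying.

    error_name: name of the error that closed the connection.
    backoff_ts: time at which the replica entered backoff.
    now:        current monotonic time (fiber_clock() in the message).
    """
    if error_name in BACKOFF_ERRORS:
        remaining = backoff_ts + REPLICA_BACKOFF_INTERVAL - now
        # Assumption: never return a negative timeout.
        return max(0.0, remaining)
    return RECONNECT_TIMEOUT
```

With the assumed interval of 5 seconds, a replica that entered backoff at time 100 and is checked at time 102 would be retried after 3 more seconds, while a connection closed by any other error would be retried after `RECONNECT_TIMEOUT`.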

Closes tarantool#642

NO_DOC=bugfix

In this patch we modify the `test_vconnect_no_result` and
`test_vconnect_check_no_future` tests by moving the main replicaset
logic into the replica's closure so that the replicaset's logs end up in
the replica's log file. It is needed for a later patch in which we will
grep vconnect's logs from the replica's log file.

Needed for tarantool#632

NO_DOC=refactoring
NO_TEST=refactoring

This patch adds logging of the fact that the connection was closed due
to a storage error in the `conn_vconnect_check_or_close` and
`conn_vconnect_wait_or_close` functions.

Closes tarantool#632

NO_DOC=bugfix

This patch fixes the typo `FAILOVER_FOWN_TIMEOUT` in the comment before
the `replica_check_health` function.

NO_DOC=refactoring
NO_TEST=refactoring
@mrForza mrForza force-pushed the gh-632-connection-is-closed-during-init-connection branch from 88a2f8c to a844f13 Compare February 22, 2026 15:18
@mrForza mrForza requested a review from Gerold103 February 22, 2026 15:39
@mrForza mrForza assigned Gerold103 and unassigned Gerold103 Feb 22, 2026
@Gerold103 (Collaborator) left a comment

Thanks, @mrForza , amazing work and great patience from your side to address all our comments 🙏!

@Gerold103 Gerold103 merged commit cc7d103 into tarantool:master Feb 23, 2026
11 checks passed

3 participants