replicaset: introduce replica connection recovery service by mrForza · Pull Request #643 · tarantool/vshard

mrForza · 2026-02-12T14:19:41Z

Before this patch the replicaset module had no service responsible for
recreating closed replica connections. This led to the following bugs:

1) The replicaset closed the initial netbox connection during the name
   check (vconnect stage) when a storage error happened and didn't try
   to reconnect. This led to alerts on the router and, as a result, to
   problems in high-level products built on vshard.
2) If the connection to a replica was closed manually or due to a
   network error, the only way to recreate it was to call one of the
   replicaset's methods such as call, callrw and so on. However, when
   callro is invoked, a connection to the master can't be recreated due
   to an internal bug in this method.

This patch introduces a new replicaset service, replica_conn_recovery,
which tries to reconnect to "closed" replicas every RECONNECT_TIMEOUT
seconds. Once the closed connection is restored, the service goes into
"idling" mode with an infinite timeout until the connection is closed
again. This keeps all replica connections in a healthy state and, as a
result, fixes these bugs.

Part of #642
Part of #632

NO_DOC=bugfix

Gerold103

Thanks for coming back so fast with a new patchset 🤯!

I have a few questions. Also, in the PR description it looks like GH automatically took the first commit's description, marking #632 as "part of" rather than "closes".

vshard/replicaset.lua

Serpentian

Thank you for the patchset. Almost there) Only small comments

test/replicaset-luatest/replicaset_3_test.lua

vshard/replicaset.lua

test/replicaset-luatest/vconnect_test.lua

Serpentian

Well done, very good patchset and testing! Very small nits are left)

test/replicaset-luatest/vconnect_test.lua

test/replicaset-luatest/replicaset_3_test.lua

Gerold103

Nice! I like how simple it looks now. Just one nit for a test.

test/replicaset-luatest/replicaset_3_test.lua

Before this patch the `replicaset` module had no service responsible for recreating closed replica connections. This led to the following bugs:

1) The replicaset closed the initial netbox connection during the name check (`vconnect` stage) when a storage error happened and didn't try to reconnect. As a result the router couldn't restore the connection to the failed replica and consequently couldn't make requests until restart.
2) If the connection to a replica was closed manually or due to a network error, the only way to recreate it was to call one of the replicaset's methods such as `call`, `callrw` and so on. However, when `callro` is invoked, a connection to the master can't be recreated.

This patch introduces a new replicaset service, `replica_conn_recovery`, which tries to reconnect to "closed" replicas every `RECONNECT_TIMEOUT` seconds. Once the closed connection is restored, the service goes into "idling" mode with an infinite timeout until the connection is closed again. This keeps all replica connections in a healthy state and consequently:

1) fixes the 1st bug, as the router doesn't need its own logic for recreating closed connections;
2) fixes the 2nd bug, as we don't need to manually invoke the replicaset's methods to restore closed connections to replicas.

Part of tarantool#642
Part of tarantool#632

NO_DOC=bugfix
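The reconnect-then-idle behaviour described in this commit message can be modelled in a few lines. This is only an illustrative Python sketch, not vshard's Lua implementation: the `Replica` class, its `connect` callable and the `recovery_step` function are hypothetical names, and the 0.5 value for `RECONNECT_TIMEOUT` is taken from the follow-up commit message.

```python
import math

RECONNECT_TIMEOUT = 0.5  # seconds, as quoted in the follow-up commit message


class Replica:
    """Hypothetical stand-in for a replica with a netbox-like connection."""

    def __init__(self, name, connect):
        self.name = name
        self.connect = connect  # assumed callable: returns True on success
        self.is_closed = True


def recovery_step(replicas):
    """Run one pass of the recovery service and return the next sleep time.

    Every closed connection gets one reconnect attempt. If anything is still
    closed, the caller should retry after RECONNECT_TIMEOUT; otherwise the
    service idles (infinite timeout) until it is woken up again.
    """
    any_closed = False
    for replica in replicas:
        if replica.is_closed:
            replica.is_closed = not replica.connect()
        any_closed = any_closed or replica.is_closed
    return RECONNECT_TIMEOUT if any_closed else math.inf
```

In a real service the returned value would be used as the timeout of a wait that is also interrupted when some connection gets closed; that wake-up mechanism is left out of this sketch.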

Before this patch the replica connection recovery service tried to recreate a failed connection to a replica every `RECONNECT_TIMEOUT` (0.5) seconds. This could heavily load the cluster if the number of routers and failed replicas reached hundreds or thousands.

To avoid this, we add a backoff interval to the replica connection recovery service in order to decrease the number of requests the router sends to replicas with closed connections. If the failed replica's connection was closed due to one of the 4 "backoff" errors (`STORAGE_IS_DISABLED`, `NO_SUCH_PROC`, `AccessDenied` or `INSTANCE_NAME_MISMATCH`), the service will wait `replica.backoff_ts + REPLICA_BACKOFF_INTERVAL - fiber_clock()` seconds. In other cases the service will wait `RECONNECT_TIMEOUT` seconds.

Closes tarantool#642

NO_DOC=bugfix
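The timeout selection described above can be written out as a small function. This is a hedged Python sketch rather than the actual Lua code: `next_wait` and its parameters are hypothetical, the value of `REPLICA_BACKOFF_INTERVAL` is an assumption (the commit message does not state it), and clamping the result at zero is an addition of this sketch.

```python
import time

RECONNECT_TIMEOUT = 0.5          # seconds, from the commit message
REPLICA_BACKOFF_INTERVAL = 5.0   # assumed value, not given in the source

# The four "backoff" errors named in the commit message.
BACKOFF_ERRORS = {
    "STORAGE_IS_DISABLED",
    "NO_SUCH_PROC",
    "AccessDenied",
    "INSTANCE_NAME_MISMATCH",
}


def next_wait(close_reason, backoff_ts, now=None):
    """Return how long the recovery service should wait before retrying.

    close_reason: name of the error that closed the connection.
    backoff_ts:   clock value at which the replica entered backoff.
    now:          current clock value; defaults to a monotonic clock.
    """
    if now is None:
        now = time.monotonic()
    if close_reason in BACKOFF_ERRORS:
        # Wait out the rest of the backoff interval. The max() clamp is
        # this sketch's own guard against a negative timeout.
        return max(0.0, backoff_ts + REPLICA_BACKOFF_INTERVAL - now)
    return RECONNECT_TIMEOUT
```

For example, with the assumed 5-second interval, a replica that entered backoff at clock value 100.0 and is checked at 102.0 would be retried after 3 more seconds, while a plain network error would be retried after `RECONNECT_TIMEOUT`.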

In this patch we modify the `test_vconnect_no_result` and `test_vconnect_check_no_future` tests by moving the main replicaset logic into the replica's closure so that the replicaset's logs end up in the replica's log file. This is needed for a later patch in which we will grep vconnect's logs from the replica's log file.

Needed for tarantool#632

NO_DOC=refactoring
NO_TEST=refactoring

This patch adds a log message saying that the connection was closed due to a storage error in the `conn_vconnect_check_or_close` and `conn_vconnect_wait_or_close` functions.

Closes tarantool#632

NO_DOC=bugfix

This patch fixes the typo `FAILOVER_FOWN_TIMEOUT` in the comment before the `replica_check_health` function.

NO_DOC=refactoring
NO_TEST=refactoring

Gerold103

Thanks, @mrForza, amazing work and great patience on your side in addressing all our comments 🙏!

mrForza assigned Serpentian and Gerold103 Feb 12, 2026

mrForza requested review from Gerold103 and Serpentian and removed request for Serpentian February 12, 2026 17:09

mrForza force-pushed the gh-632-connection-is-closed-during-init-connection branch from f1810e9 to f2e5892 Compare February 12, 2026 17:24

mrForza linked an issue Feb 12, 2026 that may be closed by this pull request

Connection is closed during name check on initial connection, when retryable error happens #632

Closed

Gerold103 reviewed Feb 13, 2026

mrForza force-pushed the gh-632-connection-is-closed-during-init-connection branch from f2e5892 to ddf737e Compare February 17, 2026 18:42

mrForza linked an issue Feb 17, 2026 that may be closed by this pull request

replicaset: router's connection to replica can't be recreated by rs:callro #642

Closed

mrForza unassigned Gerold103 and Serpentian Feb 17, 2026

mrForza requested a review from Gerold103 February 17, 2026 19:46

mrForza assigned Gerold103 and Serpentian Feb 17, 2026

Serpentian reviewed Feb 18, 2026

Serpentian assigned mrForza and unassigned Serpentian Feb 18, 2026

mrForza force-pushed the gh-632-connection-is-closed-during-init-connection branch from ddf737e to 410cbdd Compare February 18, 2026 21:00

mrForza unassigned Gerold103 and mrForza Feb 18, 2026

mrForza requested a review from Serpentian February 18, 2026 21:18

mrForza assigned Gerold103 and Serpentian Feb 18, 2026

Serpentian approved these changes Feb 19, 2026

Serpentian assigned mrForza and unassigned Serpentian Feb 19, 2026

Serpentian added the full-ci label Feb 19, 2026

mrForza force-pushed the gh-632-connection-is-closed-during-init-connection branch from 410cbdd to 88a2f8c Compare February 20, 2026 10:30

mrForza assigned Gerold103 and unassigned Gerold103 and mrForza Feb 20, 2026

Gerold103 reviewed Feb 20, 2026

mrForza added 5 commits February 22, 2026 18:16

replicaset: log connection closing in vconnect functions

9b83145

replicaset: fix typo in replica_check_health's comment

a844f13

mrForza force-pushed the gh-632-connection-is-closed-during-init-connection branch from 88a2f8c to a844f13 Compare February 22, 2026 15:18

mrForza requested a review from Gerold103 February 22, 2026 15:39

mrForza assigned Gerold103 and unassigned Gerold103 Feb 22, 2026

Gerold103 approved these changes Feb 23, 2026

Gerold103 merged commit cc7d103 into tarantool:master Feb 23, 2026
11 checks passed
