replicaset: introduce replica connection recovery service #643
Merged
Gerold103 merged 5 commits into tarantool:master on Feb 23, 2026
f1810e9 to f2e5892
f2e5892 to ddf737e
Serpentian
reviewed
Feb 18, 2026
Collaborator
Serpentian
left a comment
Thank you for the patchset. Almost there) Only small comments
ddf737e to 410cbdd
Serpentian
approved these changes
Feb 19, 2026
Collaborator
Serpentian
left a comment
Well done, very good patchset and testing! Very small nits are left)
410cbdd to 88a2f8c
Gerold103
reviewed
Feb 20, 2026
Collaborator
Gerold103
left a comment
Nice! I like how simple it looks now. Just one nit for a test.
Before this patch the `replicaset` module didn't have any service responsible for recreating closed replica connections. This led to the following bugs:

1) The replicaset closed an initial netbox connection during name check (`vconnect` stage) when a storage error happened and didn't try to reconnect. As a result, the router couldn't restore the connection to the failed replica and consequently couldn't make requests until restart.
2) If the connection to a replica was closed manually or due to a network error, the only way to recreate it was to call one of the replicaset's methods such as `call`, `callrw` and so on. However, when we invoke `callro`, a connection to master can't be recreated.

This patch introduces a new replicaset service, `replica_conn_recovery`, which tries to reconnect to "closed" replicas every `RECONNECT_TIMEOUT` seconds. If the closed connection is restored, the service goes into "idling" mode with an infinite timeout until this connection is closed again. This keeps all replica connections in a healthy state and consequently:

1) fixes the 1st bug, as the router doesn't need its own logic for recreating closed connections;
2) fixes the 2nd bug, as we don't need to manually invoke replicaset methods to restore closed connections to replicas.

Part of tarantool#642
Part of tarantool#632

NO_DOC=bugfix
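The reconnect-or-idle behaviour described in this commit can be modelled roughly as follows. This is only a Python sketch for illustration, not the Lua code from the patch: the `Replica` class, its `reachable` flag and the `recovery_step` function are hypothetical names, and only `RECONNECT_TIMEOUT` comes from the commit messages.

```python
import math

RECONNECT_TIMEOUT = 0.5  # seconds; the value quoted in the follow-up commit


class Replica:
    """Hypothetical stand-in for a replica with a netbox-like connection."""

    def __init__(self, name, reachable=True):
        self.name = name
        self.reachable = reachable  # can a new connection be established?
        self.connected = False      # current connection state

    def connect(self):
        # A real service would create a new connection here.
        self.connected = self.reachable
        return self.connected


def recovery_step(replicas):
    """Run one service iteration and return the sleep time before the next.

    Closed connections are retried. If any replica is still closed, the
    service retries after RECONNECT_TIMEOUT; otherwise it idles with an
    infinite timeout until something wakes it up.
    """
    all_healthy = True
    for replica in replicas:
        if not replica.connected and not replica.connect():
            all_healthy = False
    return math.inf if all_healthy else RECONNECT_TIMEOUT
```

In this model the caller is expected to wake the service when a connection gets closed; how the real service is woken up is not described in the commit message.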
Before this patch the replica connection recovery service tried to recreate a failed connection to a replica every `RECONNECT_TIMEOUT` (0.5) seconds. This could heavily load the cluster if the number of routers and failed replicas reached hundreds or thousands.

To avoid that, we add a backoff interval to the replica connection recovery service, which decreases the number of requests the router sends to replicas with closed connections. If the failed replica's connection was closed due to one of the 4 "backoff" errors (`STORAGE_IS_DISABLED`, `NO_SUCH_PROC`, `AccessDenied` or `INSTANCE_NAME_MISMATCH`), the service waits `replica.backoff_ts + REPLICA_BACKOFF_INTERVAL - fiber_clock()` seconds. In other cases the service waits `RECONNECT_TIMEOUT` seconds.

Closes tarantool#642

NO_DOC=bugfix
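The wait-time rule above can be written down as a small function. Again this is a Python sketch rather than the patch's Lua code: `reconnect_delay` is a hypothetical name, the value of `REPLICA_BACKOFF_INTERVAL` is an assumption (the commit message does not state it), and clamping the result at zero is an addition of this sketch.

```python
RECONNECT_TIMEOUT = 0.5         # seconds, as stated in the commit message
REPLICA_BACKOFF_INTERVAL = 5.0  # seconds; assumed value, not from the source

# The four "backoff" errors listed in the commit message.
BACKOFF_ERRORS = frozenset({
    "STORAGE_IS_DISABLED",
    "NO_SUCH_PROC",
    "AccessDenied",
    "INSTANCE_NAME_MISMATCH",
})


def reconnect_delay(error_name, backoff_ts, now):
    """Return how long to wait before the next reconnect attempt.

    error_name -- name of the error that closed the connection
    backoff_ts -- time at which the replica entered backoff
    now        -- current monotonic time (fiber_clock() in the original)
    """
    if error_name in BACKOFF_ERRORS:
        remaining = backoff_ts + REPLICA_BACKOFF_INTERVAL - now
        # Clamp at zero so an expired backoff means "retry immediately";
        # this clamp is an assumption of the sketch.
        return max(0.0, remaining)
    return RECONNECT_TIMEOUT
```

For example, with the assumed 5 second interval, a replica that entered backoff at `t = 100` and is checked at `t = 102` would be retried 3 seconds later, while a plain network failure would be retried after `RECONNECT_TIMEOUT`.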
In this patch we modify the `test_vconnect_no_result` and `test_vconnect_check_no_future` tests by moving the main replicaset logic into the replica's closure, so that the replicaset's logs end up in the replica's log file. This is needed for a further patch in which we will grep vconnect's logs from the replica's log file.

Needed for tarantool#632

NO_DOC=refactoring
NO_TEST=refactoring
This patch adds logging of the fact that the connection was closed due to a storage error in the `conn_vconnect_check_or_close` and `conn_vconnect_wait_or_close` functions.

Closes tarantool#632

NO_DOC=bugfix
This patch fixes the typo `FAILOVER_FOWN_TIMEOUT` in the comment before the `replica_check_health` function.

NO_DOC=refactoring
NO_TEST=refactoring
88a2f8c to a844f13
Before this patch the `replicaset` module didn't have any service responsible for recreating closed replica connections. This led to the following bugs:

1) The replicaset closed an initial netbox connection during name check (`vconnect` stage) when a storage error happened and didn't try to reconnect. This bug led to alerts on the router and, as a result, to problems in high-level products that use vshard.
2) If the connection to a replica was closed manually or due to a network error, the only way to recreate it was to call one of the replicaset's methods such as `call`, `callrw` and so on. However, when we invoke `callro`, a connection to master can't be recreated due to an internal bug of this method.

This patch introduces a new replicaset service, `replica_conn_recovery`, which tries to reconnect to "closed" replicas every `RECONNECT_TIMEOUT` seconds. If the closed connection is restored, the service goes into "idling" mode with an infinite timeout until this connection is closed again. This keeps all replica connections in a healthy state and, as a result, fixes these bugs.

Part of #642
Part of #632

NO_DOC=bugfix