storage: introduce on_master_enable service#646

Open
mrForza wants to merge 3 commits intotarantool:masterfrom
mrForza:gh-214-stray-tcp-doubled-buckets

Conversation

Contributor

@mrForza mrForza commented Mar 20, 2026

Before this patch the rebalancer and recovery services could start right after a master switch (caused by auto master detection or manual reconfiguration), before the master had time to sync its vclock with the other replicas in the replicaset. This could lead to doubled buckets, as described in the "Doubled buckets RFC".

To fix it we introduce a new storage service, on_master_enable. When the master changes in a replicaset, this service is triggered and waits until the newly elected master syncs its vclock with the other replicas. The other storage services, rebalancer and recovery, can't start until on_master_enable sets M.buckets_are_in_sync.

Closes #214

NO_TEST=bugfix
NO_DOC=bugfix
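The gating described above could be sketched roughly like this (only on_master_enable and M.buckets_are_in_sync come from the patch; the helper names and loop bodies are assumptions, not the actual implementation):

```lua
-- Sketch only: how the new service could gate rebalancer/recovery.
local fiber = require('fiber')

-- Runs after this instance becomes master. `vclock_is_synced` is a
-- hypothetical helper standing in for the real vclock wait.
local function on_master_enable_f()
    while not vclock_is_synced() do
        fiber.sleep(0.1)
    end
    -- Only now may services that mutate _bucket proceed.
    M.buckets_are_in_sync = true
end

-- Services such as recovery check the flag before acting.
local function recovery_f()
    while true do
        if M.buckets_are_in_sync then
            recovery_step() -- hypothetical
        end
        fiber.sleep(1)
    end
end
```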

@mrForza mrForza force-pushed the gh-214-stray-tcp-doubled-buckets branch 2 times, most recently from be90d04 to 2f96b14 Compare March 21, 2026 11:57
if not M.on_master_enable_fiber or
M.on_master_enable_fiber:status() == 'dead' then
M.on_master_enable_fiber =
util.reloadable_fiber_new('vshard.on_master_enable',
Collaborator

We don't need a reloadable fiber here. Please take a look at the util.reloadable_fiber_new implementation and at M.module_version; it doesn't make sense when you don't have:

    while M.module_version == module_version do

Contributor Author

@mrForza mrForza Mar 25, 2026

Let's leave it as is. If we create a service fiber without using reloadable_fiber_new, we can face a reload_evolution/storage.test failure.

Also, all router and storage services use reloadable_fiber_new; maybe it is good to make the on_master_enable fiber creation consistent with the other services.

Collaborator

You won't need that crutchy cancel of the self fiber if the fiber is not reloadable.

> Let's leave it as is. If we create a service fiber without using reloadable_fiber_new, we can face a reload_evolution/storage.test failure.

Why does it fail?

> Also, all router and storage services use reloadable_fiber_new; maybe it is good to make the on_master_enable fiber creation consistent with the other services.

Those are services constantly working in a loop; our new service is not.

@mrForza mrForza force-pushed the gh-214-stray-tcp-doubled-buckets branch 2 times, most recently from 83458b3 to 396cc20 Compare March 25, 2026 13:08
@mrForza mrForza requested a review from Serpentian March 25, 2026 14:43
@mrForza mrForza assigned Serpentian and unassigned mrForza Mar 25, 2026
vardir = vardir,
clear_test_cfg_options = clear_test_cfg_options,
info_assert_alert = info_assert_alert,
bucket_move = bucket_move,
Collaborator

Please prefix the functions with storage_. All storage-related functions are named that way.

Contributor Author

fixed


local function bucket_move(src_storage, dest_storage, bucket_id)
src_storage:exec(function(bucket_id, replicaset_id)
t.helpers.retrying({timeout = 60}, function()
Collaborator

wait_timeout is the default for such functions; no need to hardcode 60. Same in the bucket_wait_transfer function.

Contributor Author

fixed

local function bucket_wait_transfer(src_storage, dest_storage, bucket_id)
src_storage:exec(function(bucket_id)
t.helpers.retrying({timeout = 10}, function()
t.assert_equals(box.space._bucket:select(bucket_id), {})
Collaborator

Nit: get would be better; you don't need to select over a unique primary key.

Contributor Author

fixed

@@ -846,6 +846,31 @@ local function info_assert_alert(alerts, alert_name)
    t.fail(('There is no %s in alerts'):format(alert_name))
Collaborator

Nit: the first and second commits are not refactoring (the reason given for NO_TEST and NO_DOC). Refactoring refers to vshard code refactoring; these commits are just test changes.

Contributor Author

fixed

info_assert_alert = info_assert_alert,
bucket_move = bucket_move,
bucket_wait_transfer = bucket_wait_transfer,
storage_wait_pairsync = storage_wait_pairsync,
Collaborator

Can't you use vtest.cluster_wait_fullsync which is already exported?

Contributor Author

Yes, agree

if not down or (down.status == 'stopped' or
not vclock_lesseq(vclock, down.vclock)) then
if not down or down.status == 'stopped' or
not util.vclock_compare(vclock, down.vclock, comparator) then
Collaborator

We're calling a function which is not defined in the current commit.

Contributor Author

fixed
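For context, a comparator-parameterized vclock comparison like the one used above could look roughly like this (a sketch only; the actual util.vclock_compare in the patch may differ):

```lua
-- Sketch: compare two vclocks component-wise with `comparator`.
-- Returns true only if comparator(a[id] or 0, b[id] or 0) holds for
-- every replica id present in either vclock.
local function vclock_compare(a, b, comparator)
    local ids = {}
    for id in pairs(a) do ids[id] = true end
    for id in pairs(b) do ids[id] = true end
    for id in pairs(ids) do
        if not comparator(a[id] or 0, b[id] or 0) then
            return false
        end
    end
    return true
end

-- The old vclock_lesseq(a, b) would then be expressible as:
-- vclock_compare(a, b, function(x, y) return x <= y end)
```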

for _, replica in ipairs(box.info.replication) do
-- The current vclock may be changed between iterations. We need to
-- track the most recent one.
local vclock = box.info.vclock
Collaborator

We cannot use such a function in vshard.storage.sync. That function is supposed to wait until all changes from the current node are on all other instances; if some instances are lagging and the current node constantly writes, the sync will never exit, since we constantly update the vclock.

However, this approach can be used in the newly created service, since we expect the service to be started on the master while all other nodes cannot write new transactions, so sooner or later it will end.

I don't see any good approaches to reuse the wait_lsn function in all places:

  1. The vclock cannot be updated on every iteration; when an instance becomes leader, it must synchronously wait for the old service to die before starting the new one. I don't like the synchronous waiting part here.

  2. The vclock becomes an argument of wait_lsn. The service constantly retries the wait_lsn part until success, passing the current vclock each time. In that solution there's no sense in the wait_ part, since we would have to call wait_lsn with a really small timeout.

Instead, I propose to move the loop iteration from wait_lsn into a separate function, pass the comparator and vclock there, and use it in both wait_lsn and your newly created function. In wait_lsn we'll pass the same vclock on every iteration; in the new service, box.info.vclock (updated on every iteration).
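The proposed split might look roughly like this (a sketch with assumed names, not the actual patch):

```lua
-- Sketch: one shared iteration, parameterized by the vclock source.
-- Returns true when `vclock` is replicated to every live downstream
-- according to `comparator`.
local function vclock_is_replicated(vclock, comparator)
    for _, replica in ipairs(box.info.replication) do
        local down = replica.downstream
        if replica.id ~= box.info.id then
            if not down or down.status == 'stopped' or
               not util.vclock_compare(vclock, down.vclock, comparator) then
                return false
            end
        end
    end
    return true
end

-- wait_lsn: retry with a fixed vclock snapshot taken once.
-- on_master_enable: re-read box.info.vclock on every retry instead.
```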

Contributor Author

I fixed this issue with minimal changes. Now we can pass a comparable vclock to storage_wait_vclock_template. If a vclock is passed, we use it in the comparison with downstream.vclock; otherwise we use box.info.vclock of the current storage on every loop iteration.


M.recovery_fiber =
util.reloadable_fiber_new('vshard.recovery', M, 'recovery_f')
end
else
Collaborator

It's not guaranteed that on_master_enable_fiber will wake up sooner than all the other rebalancer-related fibers, so it may happen that when they start, the variable buckets_are_in_sync is still true due to the old check. I'd expect the variable to be set to false if the instance becomes non-master.

You can easily test it with manual wakeups of the fibers if you want to.

Contributor Author

done

@Serpentian Serpentian assigned mrForza and unassigned Serpentian Mar 26, 2026
mrForza added 3 commits March 31, 2026 17:48
Before this patch the `bucket_move` and `bucket_wait_transfer` helper
functions were used only in `storage_1_1_1_test`. However, these helpers
can also be useful in future patches (e.g. in gh-214).

This patch moves `bucket_move` and `bucket_wait_transfer` into the `vtest`
module so that we can use them in other tests.

Needed for tarantool#214

NO_TEST=test
NO_DOC=test
Before this patch we compared vclocks only in the `wait_lsn` function in
the storage module. However, in future patches (e.g. gh-214) we will need
to do this in tests as well. Also, in gh-214 we will use very similar
vclock-waiting logic but with the opposite sign: all vclock components of
the current storage should be "greater than or equal to" the components of
the replicas' vclocks instead of "less than or equal to".

To avoid code duplication we unify the vclock comparison process and
transform `vclock_lesseq` into a more general `vclock_compare` function,
which allows us to perform different vclock comparisons via a comparator.
We move this function into the `util` vshard module.

We also transform `wait_lsn` into `storage_wait_vclock_replicated`. This
function does a similar thing to `wait_lsn`, but the main logic has
migrated into `storage_wait_vclock_template`, which is responsible for
waiting until the passed vclock satisfies the comparator condition.

Needed for tarantool#214

NO_TEST=refactoring
NO_DOC=refactoring
Before this patch the `rebalancer` and `recovery` services could start
right after a master switch (caused by `auto` master detection or manual
reconfiguration), before the master had time to sync its vclock with the
other replicas in the replicaset. This could lead to doubled buckets, as
described in the "Doubled buckets RFC".

To fix it we introduce a new storage service, `on_master_enable`. When the
master changes in a replicaset, this service is triggered and waits until
the newly elected master syncs its vclock with the other replicas. The
other storage services, `rebalancer` and `recovery`, can't start until
`on_master_enable` sets `M.buckets_are_in_sync`.

We also change `storage/storage.test`, `storage/recovery.test`,
`storage-luatest/log_verbosity_2_2_test` and `router/router.test` so
that they don't fail. Now the `rebalancer` and `recovery` services
don't start immediately after a master switch, which can shake some tests.

Part of tarantool#214

NO_TEST=bugfix
NO_DOC=bugfix
@mrForza mrForza force-pushed the gh-214-stray-tcp-doubled-buckets branch from 396cc20 to 78cf3e9 Compare March 31, 2026 14:56
@mrForza mrForza requested a review from Serpentian April 1, 2026 10:14
@mrForza mrForza assigned Serpentian and unassigned mrForza Apr 1, 2026


Development

Successfully merging this pull request may close these issues.

Stray TCP message with big delay may duplicate a bucket
