Improve logging of rebalancer and recovery #586
Conversation
Force-pushed from 6b1057d to 64cc837
Serpentian left a comment
These are the comments for the first two commits; more comments are coming later) Thank you for working on this, good logging is crucial and allows us to investigate what happened during incidents.
Force-pushed from 5a8b3f8 to f5c25f7
Serpentian left a comment
Oh, shit. I forgot to send the last message of the review, I'm very sorry.
Force-pushed from 04c506f to ccff54f
Force-pushed from 46add65 to a1c095b
Force-pushed from 2a9cea8 to 69ca01a
Force-pushed from 7f9a3f5 to dfcbdb3
Force-pushed from a16c606 to 16b39a2
Serpentian left a comment
Thank you a lot! We've done extensive work on making the rebalancing logs informative. They will greatly assist with incident analysis.
Gerold103 left a comment
Thanks for the patch! Great that you have taken on this task. These logs really needed some polishing.
 if not M.is_rebalancer_active or rebalancing_is_in_progress() then
-    return
+    return nil, lerror.make('Rebalancer is not active or is in progress')
 end
 local _bucket = box.space._bucket
 local status_index = _bucket.index.status
 local repl_id = M.this_replica and M.this_replica.id or '_'
 if #status_index:select({BSENDING}, {limit = 1}) > 0 then
-    return
+    return nil, lerror.vshard(lerror.code.BUCKET_INVALID_STATE,
+                              repl_id, 'sending')
 end
 if #status_index:select({BRECEIVING}, {limit = 1}) > 0 then
-    return
+    return nil, lerror.vshard(lerror.code.BUCKET_INVALID_STATE,
+                              repl_id, 'receiving')
 end
 if #status_index:select({BGARBAGE}, {limit = 1}) > 0 then
-    return
+    return nil, lerror.vshard(lerror.code.BUCKET_INVALID_STATE,
+                              repl_id, 'garbage')
 end
The prefix BUCKET_ suggests that this is a bucket-centric error, as if one specific bucket is bad, isn't found, or similar. But this is actually a storage-centric or rebalancer-centric error. It signals that something can't be done because of the storage's state.
Also, we already have a generic code-less error above here - 'Rebalancer is not active or is in progress'.
I suggest not inventing anything new here and just returning lerror.make() again with a formatted string. Anyway, you never use this new error code for anything except tests.
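For illustration, a minimal sketch of that suggestion, reusing the code-less error style already present in this function; the message text and its formatting are made up here, not taken from the patch:

    -- Return a plain formatted error instead of introducing a new error code.
    return nil, lerror.make(string.format(
        'Rebalancer cannot proceed: replica %s has "%s" buckets',
        repl_id, 'sending'))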
rebalancer-centric error
I renamed the error to REBALANCER_INVALID_STATE - 'Replica %s has %s buckets during rebalancing'.
I suggest not to invent anything new here
If we use lerror.make to log errors about non-active buckets during rebalancing, we can face an issue. I have already discussed this issue with @Serpentian, and we have decided to keep this new errcode. Maybe in this thread we will come to a different decision.
Oh, I see, thanks for explaining. But why then do we still use lerror.make() for the other error returned from the same function?
 end
 local _bucket = box.space._bucket
 local status_index = _bucket.index.status
 local repl_id = M.this_replica and M.this_replica.id or '_'
How can M.this_replica or its .id field be nil?
M.this_replica or its id can be nil in unit tests (e.g. unit/rebalancer) where we don't initialize the replica object.
Can you then make these tests have a proper ID? If this is easy to do, I would say this is preferable to making the production code have special checks for tests.
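A hypothetical sketch of that idea: initialize the replica object in the unit test setup instead of special-casing nil in the production code. The way the test reaches the internal module and the id value are assumptions, not code from unit/rebalancer:

    -- In the test setup, give the storage module a minimal replica object so
    -- repl_id resolves to a real id instead of the '_' placeholder.
    local M = require('vshard.storage').internal
    M.this_replica = {id = 'replica_under_test'}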
Force-pushed from 16b39a2 to 9623c34
Before this patch, "Finish bucket recovery step ..." logs were printed at the end of recovery even if no buckets were successfully recovered, which led to unnecessary log records. This patch fixes the issue by adding an additional check for the number of recovered buckets. Part of tarantool#212 NO_DOC=bugfix
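A hypothetical sketch of the described check; the counter name and the exact log text are illustrative, not copied from the patch:

    -- Log the recovery step only when at least one bucket was actually recovered.
    if recovered_count > 0 then
        log.info('Finish bucket recovery step, %d buckets are recovered',
                 recovered_count)
    end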
This patch introduces logging of the ids of buckets which were recovered during the recovery stage of the storage. Part of tarantool#212 NO_DOC=bugfix
This patch adds logging of rebalancer routes. The log file now includes information about the source storage, the number of buckets, and the destination storage where the buckets will be moved. Since the logging of sent routes in the rebalancer service has changed, we update the `rebalancer/rebalancer.test.lua` and `rebalancer/stress_add_remove_several_rs.test.lua` tests. Part of tarantool#212 NO_DOC=bugfix
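A hypothetical sketch of the kind of route log line described above; the wording, variable names, and route fields are assumptions, not the patch's actual code:

    -- For every planned route, log the source, the bucket count, and the destination.
    for src_id, route in pairs(routes) do
        log.info('Rebalancer: %d buckets will be moved from %s to %s',
                 route.bucket_count, src_id, route.dst_id)
    end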
Before this patch, the function `rebalancer_request_state` returned only nil in case of errors (e.g. the presence of SENDING, RECEIVING, GARBAGE buckets, or a non-active rebalancer). This made it inconsistent with other rebalancer functions. Now, in case of errors we return `(nil, err)` instead of `nil`, which helps us propagate a meaningful error in the rebalancer service. In addition, we introduce a new vshard error for cases when the storage has some non-active buckets during rebalancing - `BUCKET_INVALID_STATE`. Also we remove the "Some buckets are not active" log from `rebalancer_service_f` because a meaningful error about non-active buckets is already returned from `rebalancer_download_states`. NO_TEST=refactoring NO_DOC=refactoring
Before this patch, the function `rebalancer_download_states` didn't return information about the replicaset from which the states could not be downloaded. As a result, the errors returned from `rebalancer_download_states` lacked valuable information about the unhealthy replicaset. Now we add replicaset.id into the error which is returned from `rebalancer_download_states`, which helps us propagate replicaset.id to the rebalancer service. Closes tarantool#212 NO_DOC=bugfix
Force-pushed from 9623c34 to 62c138e
Gerold103 left a comment
💪✍️
 vtest.cluster_new(g, global_cfg)
 vtest.cluster_bootstrap(g, global_cfg)
 vtest.cluster_rebalancer_disable(g)
Why did you enable the rebalancer in the first commit? It doesn't affect the rebalancer, no?
 vtest.cluster_exec_each_master(g, function()
     box.schema.create_space('test_space')
     box.space.test_space:format({
         {name = 'pk', type = 'unsigned'},
         {name = 'bucket_id', type = 'unsigned'},
     })
     box.space.test_space:create_index('primary', {parts = {'pk'}})
     box.space.test_space:create_index(
         'bucket_id', {parts = {'bucket_id'}, unique = false})
 end)
This is added in the first commit and is not used here. Let's either move it to the commit which needs these things, or remove it entirely.
 t.helpers.retrying({}, function()
     -- The bucket on src_storage must be in "sending" state. The
     -- recovery service on src_storage should not erase this bucket.
     t.assert_equals(box.space._bucket:select(bucket_id)[1].status,
                     'sending')
 end)
Do you really need this in "retrying"? AFAIU, the bucket is guaranteed to be in SENDING state right after your bucket_send() above fails, which means this loop could be just an assertion. Right?
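If that's right, a sketch of the simplification would be just the assertion itself, without the retrying wrapper (same check as in the quoted code):

    -- The bucket is already in "sending" state right after bucket_send() fails,
    -- so no retry loop is needed.
    t.assert_equals(box.space._bucket:select(bucket_id)[1].status, 'sending')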
 t.helpers.retrying({}, function()
     -- The bucket on src_storage must be in "sending" state. The
     -- recovery service on src_storage should not erase this bucket.
     t.assert_equals(box.space._bucket:select(bucket_id)[1].status,
When you select a single tuple by a unique index, better use :get().
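A minimal sketch of the difference: :select() returns an array of tuples and needs the [1] indexing, while :get() on a unique index returns the single tuple (or nil) directly:

    -- With :select() the single matching tuple has to be unpacked from an array.
    local status = box.space._bucket:select(bucket_id)[1].status
    -- With :get() the tuple comes back directly when the key is unique.
    local status = box.space._bucket:get(bucket_id).status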
 end)
 local hanged_bucket_id_1 = vtest.storage_first_bucket(g.replica_1_a)
 start_partial_bucket_move(g.replica_1_a, g.replica_2_a, hanged_bucket_id_1)
 local hanged_bucket_id_2 = vtest.storage_first_bucket(g.replica_1_a)
hanged -> hung.
-    -- Reduce spam of "Finish bucket recovery step" logs in recovery
-    -- service (gh-212).
+    -- Reduce spam of "Finish bucket recovery step" logs and add logging of
+    -- recovered buckets in recovery service (gh-212).
It is actually nice how you made this message evolve together with the commits.
 local function move_bucket(src_storage, dest_storage, bucket_id)
     src_storage:exec(function(bucket_id, replicaset_id)
         local res, err = ivshard.storage.bucket_send(bucket_id, replicaset_id)
Please, give it a long timeout. Otherwise this test will be flaky.
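A sketch of the suggested change; the timeout value here is an assumption, just something generous enough for slow CI machines:

    -- Pass an explicit, generous timeout so the send does not flake under load.
    local res, err = ivshard.storage.bucket_send(
        bucket_id, replicaset_id, {timeout = 120})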
 local moved_bucket_from_3 = vtest.storage_first_bucket(g.replica_3_a)
 move_bucket(g.replica_3_a, g.replica_1_a, moved_bucket_from_3)
 t.helpers.retrying({timeout = 60}, function()
     g.replica_1_a:exec(function() ivshard.storage.rebalancer_wakeup() end)
If this is the test case for which you have enabled the rebalancer in the first commit, then please 1) enable the rebalancer here, and 2) disable it at the end of the test case.
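A sketch of the suggested structure; vtest.cluster_rebalancer_enable() is assumed here to be the counterpart of the disable helper used in the test setup:

    -- Turn the rebalancer on only for the test case that needs it.
    vtest.cluster_rebalancer_enable(g)
    -- ... the part of the test case that relies on the rebalancer running ...
    vtest.cluster_rebalancer_disable(g)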
 -- in rare cases we accidentally can get the wrong one. For example we wait for
 -- "receiving" status, but get "garbage" due to some previous rebalancer error.
 wait_rebalancer_state('Error during downloading rebalancer states:.*' .. \
                       'REBALANCER_INVALID_STATE', test_run)               \
The commit message still mentions BUCKET_INVALID_STATE.
Before this patch, "Finish bucket recovery step ..." logs were printed at
the end of recovery even if no buckets were successfully recovered, which led
to unnecessary log entries. This patch fixes the issue by adding an
additional check for the number of recovered buckets.
Closes #212
NO_DOC=bugfix