fix(master): avoid rebalancing to disconnected CS by dmga44 · Pull Request #763 · leil-io/saunafs

dmga44 · 2026-02-24T15:51:36Z

When looking for a server to replicate a part to during rebalancing,
the current implementation uses the sortedServers_ or
labeledSortedServers_ variables. These containers may hold stale
chunkserver entries that were recently switched to KILL mode, or
entries that have been completely released. Those entries must not be
used as destinations for the replicated parts, otherwise the master may
crash. This PR fixes those issues.

The test test_kill_cs_while_writing_small_files should stop being flaky
after the merge.

The master crash was one of the causes of the test failure.
Another possible outcome of replications targeting disconnected
CSs is that the replication read counter of the source CS (the
remaining alive CS) remains increased indefinitely, thus blocking future
replications that need to retrieve data from that server. This very
dangerous behavior was also causing failures in the previously
mentioned test.
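
The stuck-counter failure mode can be illustrated with a minimal sketch. Every name below (ServerState, tryStartReplication, finishReplication, kMaxReadsPerServer) is hypothetical and not taken from the SaunaFS source. The sketch only assumes that the master increments a per-source counter when a replication starts and decrements it when the destination reports the result, so a destination that is already disconnected would never trigger the decrement.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the master's per-chunkserver state.
struct ServerState {
    bool connected = true;
    std::uint32_t replicationReadCounter = 0;  // reads in flight from this server
};

// Illustrative limit; the real limit and its name are not given in the PR.
constexpr std::uint32_t kMaxReadsPerServer = 2;

// Returns true if a replication from `source` to `destination` was started.
// The counter is taken only when the destination is still connected, so a
// later completion report can always release it.
bool tryStartReplication(ServerState &source, const ServerState &destination) {
    if (!destination.connected) {
        return false;  // nobody would ever report completion
    }
    if (source.replicationReadCounter >= kMaxReadsPerServer) {
        return false;  // source is already saturated
    }
    ++source.replicationReadCounter;
    return true;
}

// Called when the destination reports that the replication finished or failed.
void finishReplication(ServerState &source) {
    assert(source.replicationReadCounter > 0);
    --source.replicationReadCounter;
}
```

Without the `connected` check, each attempt against a dead destination would consume one slot of the source's counter until no further replications could read from it.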

To the reviewers, please check Gemini's comments.

Signed-off-by: Dave <dave@leil.io>

gemini-code-assist · 2026-02-24T15:51:52Z

Summary of Changes

Hello @dmga44, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the robustness of the master server's chunk rebalancing mechanism. It ensures that the system avoids attempting to replicate data to chunkservers that have been marked for termination, thereby preventing errors and improving the reliability of data distribution, especially during failure scenarios or planned shutdowns.

Highlights

Rebalancing Logic Improvement: Modified the chunk rebalancing logic to prevent selecting chunkservers that are in 'KILL' mode as destinations for replicated parts, ensuring data is not sent to disconnected or terminating servers.
Chunkserver Status Check: Introduced a new utility function, 'matocsserv_is_killed', to explicitly determine if a chunkserver is currently in 'KILL' mode.
Flaky Test Fix: Addressed a potential source of flakiness in 'test_kill_cs_while_writing_small_files' by ensuring rebalancing correctly avoids killed chunkservers, leading to more stable test results.

Changelog

src/master/chunks.cc
- Added a check within 'rebalanceChunkParts' to skip chunkservers in 'KILL' mode.
- Removed an extraneous empty line at the end of 'doChunkJobs'.
src/master/matocsserv.cc
- Implemented the 'matocsserv_is_killed' function to check the 'ChunkserverConnectionMode'.
src/master/matocsserv.h
- Declared the 'matocsserv_is_killed' function with a brief description.


¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

This pull request fixes a bug where the master could attempt to rebalance chunk parts to a chunkserver that is in the process of being disconnected (in KILL mode). The change introduces a new helper function, matocsserv_is_killed, to check if a chunkserver is killed and uses it within rebalanceChunkParts to skip such servers. The implementation is correct and effectively addresses the issue. I have added one comment suggesting a null pointer check in the new helper function to improve its robustness as a public API, referencing a general rule about handling null arguments gracefully.
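
A null-safe version of such a helper might look like the sketch below. Only the names matocsserv_is_killed and ChunkserverConnectionMode::KILL come from this PR. The ChunkserverEntry type, the NORMAL enumerator, the function signature, and the choice to treat a null entry as killed are assumptions made for illustration.

```cpp
#include <cassert>

// KILL is named in the PR; NORMAL is an assumed placeholder for other modes.
enum class ChunkserverConnectionMode { NORMAL, KILL };

// Hypothetical, heavily simplified chunkserver entry.
struct ChunkserverEntry {
    ChunkserverConnectionMode mode = ChunkserverConnectionMode::NORMAL;
};

// Assumed signature. A null entry is reported as killed so that callers
// never select it as a replication destination.
bool matocsserv_is_killed(const ChunkserverEntry *entry) {
    if (entry == nullptr) {
        return true;
    }
    return entry->mode == ChunkserverConnectionMode::KILL;
}

// Usage example.
void exampleUsage() {
    ChunkserverEntry alive;
    ChunkserverEntry dying;
    dying.mode = ChunkserverConnectionMode::KILL;
    assert(!matocsserv_is_killed(&alive));
    assert(matocsserv_is_killed(&dying));
    assert(matocsserv_is_killed(nullptr));
}
```

Returning true for a null pointer is the conservative choice for a destination filter; a caller that needs to tell "missing" from "killed" would need a different interface.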

src/master/matocsserv.cc

Copilot

Pull request overview

This PR prevents the master’s chunk rebalancing logic from selecting chunkservers that have transitioned to KILL mode after the cached server orderings (sortedServers_ / labeledSortedServers_) were computed, addressing replication attempts to disconnected servers and aiming to reduce flakiness in kill-during-write scenarios.

Changes:

Add a new matocsserv_is_killed() API to query whether a chunkserver is in KILL mode.
Skip KILL-mode chunkservers as destinations during ChunkWorker::rebalanceChunkParts().
Minor whitespace cleanup in chunks.cc.
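
As a rough illustration of the destination filter described above, the sketch below walks a cached ordering and skips KILL-mode entries. ServerEntry, isKilled and pickRebalanceDestination are hypothetical names, and the real ChunkWorker::rebalanceChunkParts() applies further criteria that are not shown in this PR page.

```cpp
#include <cassert>
#include <vector>

// KILL is named in the PR; NORMAL is an assumed placeholder for other modes.
enum class ChunkserverConnectionMode { NORMAL, KILL };

// Hypothetical, simplified chunkserver entry.
struct ServerEntry {
    ChunkserverConnectionMode mode = ChunkserverConnectionMode::NORMAL;
};

// Stand-in for the matocsserv_is_killed() accessor added by the PR.
bool isKilled(const ServerEntry *entry) {
    return entry == nullptr || entry->mode == ChunkserverConnectionMode::KILL;
}

// Returns the first entry of the cached ordering that is not in KILL mode,
// or nullptr when no usable destination remains.
ServerEntry *pickRebalanceDestination(const std::vector<ServerEntry *> &sortedServers) {
    for (ServerEntry *server : sortedServers) {
        if (isKilled(server)) {
            continue;  // stale entry: switched to KILL after the list was built
        }
        return server;
    }
    return nullptr;
}

// Usage example.
void exampleUsage() {
    ServerEntry killed;
    killed.mode = ChunkserverConnectionMode::KILL;
    ServerEntry healthy;
    std::vector<ServerEntry *> cached{&killed, &healthy};
    assert(pickRebalanceDestination(cached) == &healthy);
}
```

A mode check like this only helps while the entry is still allocated. Entries that have been completely released cannot be inspected safely, so they have to be dropped from the cached list itself.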

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File	Description
`src/master/matocsserv.h`	Declares `matocsserv_is_killed()` accessor for chunkserver mode.
`src/master/matocsserv.cc`	Implements `matocsserv_is_killed()` based on `ChunkserverConnectionMode::KILL`.
`src/master/chunks.cc`	Filters out killed servers when choosing rebalance destinations from cached sorted server lists.

src/master/chunks.cc

rolysr

LGTM.

dmga44 · 2026-02-24T17:23:32Z

@gemini-code-assist please review again.

Copilot

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.

gemini-code-assist

Code Review

The pull request effectively addresses the issue of rebalancing chunks to disconnected or killed chunkservers. The introduction of the updateSortedServers method in ChunkWorker and the gSortedServersNeedsRefresh flag, along with its associated helper functions, provides a robust mechanism to ensure that the list of available chunkservers is always up-to-date. The checks for matocsserv_is_killed in the rebalancing logic directly prevent the problematic behavior, which is a critical correctness improvement. The changes are well-encapsulated and align with the existing codebase patterns.
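
One way such a refresh flag could work is sketched below. The names gSortedServersNeedsRefresh, updateSortedServers and ChunkWorker appear in the review above; everything else is assumed, including markSortedServersStale(), the use of integer server ids, and the point at which the flag is set and cleared.

```cpp
#include <cassert>
#include <vector>

// Flag named in the review. Starting it as true (so the first call builds
// the list) is an assumption of this sketch.
static bool gSortedServersNeedsRefresh = true;

// Hypothetical hook, assumed to be called when a chunkserver switches to
// KILL mode or its entry is released.
void markSortedServersStale() { gSortedServersNeedsRefresh = true; }

class ChunkWorker {
public:
    // Rebuilds the cached ordering only when something invalidated it.
    // `connected` stands in for however the master lists live servers.
    void updateSortedServers(const std::vector<int> &connected) {
        if (!gSortedServersNeedsRefresh) {
            return;  // cache is still valid
        }
        sortedServers_ = connected;
        gSortedServersNeedsRefresh = false;
    }

    const std::vector<int> &sortedServers() const { return sortedServers_; }

private:
    std::vector<int> sortedServers_;  // server ids instead of entry pointers
};

// Usage example.
void exampleUsage() {
    ChunkWorker worker;
    std::vector<int> live{1, 2, 3};
    worker.updateSortedServers(live);  // first call builds the cache
    live = {1, 3};                     // server 2 disconnects
    markSortedServersStale();
    worker.updateSortedServers(live);  // cache is rebuilt without server 2
    assert(worker.sortedServers().size() == 2);
}
```

Rebuilding on demand keeps released entries out of the cached list, which a per-entry mode check alone cannot guarantee.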

src/master/chunks.cc

src/master/matocsserv.cc

When looking for a server to replicate a part to during rebalancing, the current implementation uses the sortedServers_ or labeledSortedServers_ variables. These containers may hold stale chunkserver entries that were recently switched to KILL mode, or entries that have been completely released. Those entries must not be used as destinations for the replicated parts, otherwise the master may crash. This commit fixes those issues. The test test_kill_cs_while_writing_small_files should stop being flaky after the merge. The master crash was one of the causes of the test failure. Another possible outcome of replications targeting disconnected CSs is that the replication read counter of the source CS (the remaining alive CS) remains increased indefinitely, thus blocking future replications that need to retrieve data from that server. This very dangerous behavior was also causing failures in the previously mentioned test. Signed-off-by: Dave <dave@leil.io>

lgsilva3087

LGTM

dmga44 self-assigned this Feb 24, 2026

dmga44 requested review from GigaCronos, antuan96314, Copilot, lgsilva3087, ralcolea, rolysr and uristdwarf February 24, 2026 15:52

Copilot started reviewing on behalf of dmga44 February 24, 2026 15:52

gemini-code-assist bot reviewed Feb 24, 2026

Copilot AI reviewed Feb 24, 2026

dmga44 force-pushed the fix-flaky-kill-cs-when-writing branch from 3ff3a4d to 15df136 February 24, 2026 15:59

rolysr approved these changes Feb 24, 2026

dmga44 force-pushed the fix-flaky-kill-cs-when-writing branch from 15df136 to 986f1b4 February 24, 2026 17:22

dmga44 requested a review from Copilot February 24, 2026 17:23

Copilot started reviewing on behalf of dmga44 February 24, 2026 17:23

Copilot AI reviewed Feb 24, 2026

gemini-code-assist bot reviewed Feb 24, 2026

dmga44 requested a review from rolysr February 24, 2026 17:31

dmga44 force-pushed the fix-flaky-kill-cs-when-writing branch from 986f1b4 to 79c796d February 24, 2026 23:01

uristdwarf approved these changes Feb 25, 2026

lgsilva3087 approved these changes Feb 25, 2026

dmga44 merged commit b6ebcdf into dev Feb 25, 2026
11 checks passed

dmga44 deleted the fix-flaky-kill-cs-when-writing branch February 25, 2026 10:06
