fix(master): avoid rebalancing to disconnected CS #763

Merged
dmga44 merged 1 commit into dev from fix-flaky-kill-cs-when-writing
Feb 25, 2026

@dmga44 dmga44 commented Feb 24, 2026

When looking for a server to replicate a part to during rebalancing,
the current implementation uses the sortedServers_ or
labeledSortedServers_ variables. These containers may hold stale
chunkserver entries: entries recently switched to KILL mode, or
entries that have been completely released. Those entries must not be
used as destinations for replicated parts, even in cases where the
master does not crash. This PR fixes those issues.
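
A minimal sketch of the idea described above, not the actual master code. The types `Mode`, `Server` and `CachedServers` and the function `pickDestination` are hypothetical; only the notions of a cached sorted list, KILL mode and released entries come from the description. The sketch assumes the cache holds weak references, which may not match how the real containers store entries.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical stand-ins; the real master types are not shown on this page.
enum class Mode { Active, Kill };

struct Server {
    int id;
    Mode mode;
};

// Assumption of this sketch: the cache holds weak references, so a released
// server shows up as an expired entry instead of a dangling pointer.
using CachedServers = std::vector<std::weak_ptr<Server>>;

// Returns the first cached server that still exists and is not in KILL mode,
// or nullptr when every cached entry is stale.
std::shared_ptr<Server> pickDestination(const CachedServers &sortedServers) {
    for (const std::weak_ptr<Server> &entry : sortedServers) {
        std::shared_ptr<Server> server = entry.lock();
        if (!server) {
            continue;  // entry was completely released
        }
        if (server->mode == Mode::Kill) {
            continue;  // switched to KILL mode after the list was sorted
        }
        return server;
    }
    return nullptr;
}

// Usage: the first two cached entries go stale after the cache is built.
void example() {
    auto a = std::make_shared<Server>(Server{1, Mode::Active});
    auto b = std::make_shared<Server>(Server{2, Mode::Active});
    auto c = std::make_shared<Server>(Server{3, Mode::Active});
    CachedServers cache{a, b, c};

    a->mode = Mode::Kill;  // killed but still allocated
    b.reset();             // released entirely

    assert(pickDestination(cache) == c);
}
```

The point of the sketch is that the staleness check happens at selection time, when the cached ordering is read, rather than relying on the ordering having been rebuilt since the last disconnect.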

The test test_kill_cs_while_writing_small_files should stop being flaky
after the merge.

The master crash was one of the causes of the test failure.
Another possible outcome of replications targeting disconnected
CSs is that the replication read counter of the source CS (the
remaining alive CS) remains incremented indefinitely, thus blocking
future replications that need to retrieve data from those servers. This
very dangerous behavior was also causing failures in the previously
mentioned test.
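
The counter problem can be illustrated with a toy model. Everything here is hypothetical: `Source`, `Destination`, `startReplication`, `onReplicationFinished` and the `readLimit` value are assumptions made for illustration, and the real master may account for replication reads differently. The sketch only assumes that the counter is incremented when a replication starts and decremented when the destination reports completion.

```cpp
#include <cassert>

// Toy model, not the master's real bookkeeping.
struct Source {
    int replicationReads = 0;  // replications currently reading from this CS
    int readLimit = 1;         // assumed per-server limit
};

struct Destination {
    bool connected;
};

// Tries to start a replication from src to dst. With checkDestination set to
// false the function models the behavior described above as problematic.
bool startReplication(Source &src, const Destination &dst, bool checkDestination) {
    if (src.replicationReads >= src.readLimit) {
        return false;  // source is saturated
    }
    if (checkDestination && !dst.connected) {
        return false;  // never reserve the source for a dead destination
    }
    ++src.replicationReads;  // released only by onReplicationFinished()
    return true;
}

// Assumed to run when the destination reports completion. A disconnected
// destination never reports, so the counter is never decremented.
void onReplicationFinished(Source &src) {
    assert(src.replicationReads > 0);
    --src.replicationReads;
}

// Usage: without the check, one replication to a dead CS blocks the source.
void example() {
    Source src;
    Destination dead{false};
    Destination alive{true};

    assert(startReplication(src, dead, false));    // counter is now stuck at 1
    assert(!startReplication(src, alive, false));  // later replications refused
}
```

In this model the leaked increment is never undone, which matches the blocking behavior described in the PR text. Rejecting the dead destination before touching the counter avoids the leak.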

To the reviewers, please check Gemini's comments.

Signed-off-by: Dave <dave@leil.io>

@gemini-code-assist

Summary of Changes

Hello @dmga44, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the robustness of the master server's chunk rebalancing mechanism. It ensures that the system avoids attempting to replicate data to chunkservers that have been marked for termination, thereby preventing errors and improving the reliability of data distribution, especially during failure scenarios or planned shutdowns.

Highlights

  • Rebalancing Logic Improvement: Modified the chunk rebalancing logic to prevent selecting chunkservers that are in 'KILL' mode as destinations for replicated parts, ensuring data is not sent to disconnected or terminating servers.
  • Chunkserver Status Check: Introduced a new utility function, 'matocsserv_is_killed', to explicitly determine if a chunkserver is currently in 'KILL' mode.
  • Flaky Test Fix: Addressed a potential source of flakiness in 'test_kill_cs_while_writing_small_files' by ensuring rebalancing correctly avoids killed chunkservers, leading to more stable test results.
Changelog
  • src/master/chunks.cc
    • Added a check within 'rebalanceChunkParts' to skip chunkservers in 'KILL' mode.
    • Removed an extraneous empty line at the end of 'doChunkJobs'.
  • src/master/matocsserv.cc
    • Implemented the 'matocsserv_is_killed' function to check the 'ChunkserverConnectionMode'.
  • src/master/matocsserv.h
    • Declared the 'matocsserv_is_killed' function with a brief description.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request fixes a bug where the master could attempt to rebalance chunk parts to a chunkserver that is in the process of being disconnected (in KILL mode). The change introduces a new helper function, matocsserv_is_killed, to check if a chunkserver is killed and uses it within rebalanceChunkParts to skip such servers. The implementation is correct and effectively addresses the issue. I have added one comment suggesting a null pointer check in the new helper function to improve its robustness as a public API, referencing a general rule about handling null arguments gracefully.
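
A sketch of what a null-tolerant variant of such a helper could look like. The real signature of matocsserv_is_killed is not shown on this page, so `ServerEntry` and `isKilledSketch` are hypothetical names, and `PLACEHOLDER_ACTIVE` is a made-up enumerator standing in for whatever non-KILL modes exist. Treating a null entry as killed is a design choice of this sketch, not something the review prescribes.

```cpp
#include <cassert>

// Only the KILL enumerator is named in this PR; the other is a placeholder.
enum class ChunkserverConnectionMode { PLACEHOLDER_ACTIVE, KILL };

// Hypothetical entry type standing in for the master's chunkserver record.
struct ServerEntry {
    ChunkserverConnectionMode mode;
};

// Null-tolerant check: a missing entry is reported as killed, so a caller
// that skips killed servers also skips missing ones.
bool isKilledSketch(const ServerEntry *entry) {
    if (entry == nullptr) {
        return true;
    }
    return entry->mode == ChunkserverConnectionMode::KILL;
}

// Usage: a rebalancing loop could skip any entry for which this is true.
void example() {
    ServerEntry killed{ChunkserverConnectionMode::KILL};
    assert(isKilledSketch(&killed));
    assert(isKilledSketch(nullptr));
}
```

Returning true for a null pointer keeps the caller's skip logic to a single condition. The alternative of returning false would require every caller to check for null separately.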

Copilot AI left a comment

Pull request overview

This PR prevents the master’s chunk rebalancing logic from selecting chunkservers that have transitioned to KILL mode after the cached server orderings (sortedServers_ / labeledSortedServers_) were computed, addressing replication attempts to disconnected servers and aiming to reduce flakiness in kill-during-write scenarios.

Changes:

  • Add a new matocsserv_is_killed() API to query whether a chunkserver is in KILL mode.
  • Skip KILL-mode chunkservers as destinations during ChunkWorker::rebalanceChunkParts().
  • Minor whitespace cleanup in chunks.cc.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File                      Description
src/master/matocsserv.h   Declares matocsserv_is_killed() accessor for chunkserver mode.
src/master/matocsserv.cc  Implements matocsserv_is_killed() based on ChunkserverConnectionMode::KILL.
src/master/chunks.cc      Filters out killed servers when choosing rebalance destinations from cached sorted server lists.

@dmga44 dmga44 force-pushed the fix-flaky-kill-cs-when-writing branch from 3ff3a4d to 15df136 on February 24, 2026 15:59

@rolysr rolysr left a comment

LGTM.

@dmga44 dmga44 force-pushed the fix-flaky-kill-cs-when-writing branch from 15df136 to 986f1b4 on February 24, 2026 17:22
@dmga44 dmga44 requested a review from Copilot February 24, 2026 17:23

dmga44 commented Feb 24, 2026

@gemini-code-assist please review again.

Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

The pull request effectively addresses the issue of rebalancing chunks to disconnected or killed chunkservers. The introduction of the updateSortedServers method in ChunkWorker and the gSortedServersNeedsRefresh flag, along with its associated helper functions, provides a robust mechanism to ensure that the list of available chunkservers is always up-to-date. The checks for matocsserv_is_killed in the rebalancing logic directly prevent the problematic behavior, which is a critical correctness improvement. The changes are well-encapsulated and align with the existing codebase patterns.
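
The refresh mechanism mentioned in this review can be pictured as a dirty flag. The sketch below is an assumption about the general pattern, not the merged code: `Registry`, `Worker`, `markKilled` and `sortedIds` are hypothetical, the `needsRefresh` member merely plays the role of the gSortedServersNeedsRefresh flag named above, and sorting by a `usage` field is an invented ordering criterion.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Server {
    int id;
    double usage;  // invented sort key; lower sorts first in this sketch
    bool killed;
};

// Hypothetical registry of chunkservers known to the master.
struct Registry {
    std::vector<Server> servers;
    bool needsRefresh = true;  // stands in for the flag named in the review

    void markKilled(int id) {
        for (Server &s : servers) {
            if (s.id == id) {
                s.killed = true;
                needsRefresh = true;  // cached orderings are now stale
            }
        }
    }
};

class Worker {
public:
    explicit Worker(Registry &registry) : registry_(registry) {}

    // Rebuilds the cached ordering only when the registry was modified.
    void updateSortedServers() {
        if (!registry_.needsRefresh) {
            return;
        }
        std::vector<Server> alive;
        for (const Server &s : registry_.servers) {
            if (!s.killed) alive.push_back(s);
        }
        std::sort(alive.begin(), alive.end(),
                  [](const Server &a, const Server &b) { return a.usage < b.usage; });
        sortedIds_.clear();
        for (const Server &s : alive) sortedIds_.push_back(s.id);
        registry_.needsRefresh = false;
    }

    const std::vector<int> &sortedIds() const { return sortedIds_; }

private:
    Registry &registry_;
    std::vector<int> sortedIds_;
};

// Usage: killing a server invalidates the cache; the next update drops it.
void example() {
    Registry reg;
    reg.servers = {{1, 0.9, false}, {2, 0.1, false}, {3, 0.5, false}};
    Worker worker(reg);

    worker.updateSortedServers();
    assert((worker.sortedIds() == std::vector<int>{2, 3, 1}));

    reg.markKilled(2);
    worker.updateSortedServers();
    assert((worker.sortedIds() == std::vector<int>{3, 1}));
}
```

A flag like this keeps the common path cheap, since the ordering is rebuilt only after a change. A per-entry check at selection time still helps if a server can be killed between the refresh and the selection.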

@dmga44 dmga44 requested a review from rolysr February 24, 2026 17:31
When looking for a server to replicate a part to during rebalancing,
the current implementation uses the sortedServers_ or
labeledSortedServers_ variables. These containers may hold stale
chunkserver entries: entries recently switched to KILL mode, or
entries that have been completely released. Those entries must not be
used as destinations for replicated parts, even in cases where the
master does not crash. This commit fixes those issues.

The test test_kill_cs_while_writing_small_files should stop being flaky
after the merge.

The master crash was one of the causes of the test failure.
Another possible outcome of replications targeting disconnected
CSs is that the replication read counter of the source CS (the
remaining alive CS) remains incremented indefinitely, thus blocking
future replications that need to retrieve data from those servers. This
very dangerous behavior was also causing failures in the previously
mentioned test.

Signed-off-by: Dave <dave@leil.io>
@dmga44 dmga44 force-pushed the fix-flaky-kill-cs-when-writing branch from 986f1b4 to 79c796d on February 24, 2026 23:01

@lgsilva3087 lgsilva3087 left a comment

LGTM

@dmga44 dmga44 merged commit b6ebcdf into dev Feb 25, 2026
11 checks passed
@dmga44 dmga44 deleted the fix-flaky-kill-cs-when-writing branch February 25, 2026 10:06

5 participants