fix(master): avoid rebalancing to disconnected CS#763
Conversation
Summary of ChangesHello @dmga44, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request enhances the robustness of the master server's chunk rebalancing mechanism. It ensures that the system avoids attempting to replicate data to chunkservers that have been marked for termination, thereby preventing errors and improving the reliability of data distribution, especially during failure scenarios or planned shutdowns. Highlights
Changelog
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request fixes a bug where the master could attempt to rebalance chunk parts to a chunkserver that is in the process of being disconnected (in KILL mode). The change introduces a new helper function, matocsserv_is_killed, to check if a chunkserver is killed and uses it within rebalanceChunkParts to skip such servers. The implementation is correct and effectively addresses the issue. I have added one comment suggesting a null pointer check in the new helper function to improve its robustness as a public API, referencing a general rule about handling null arguments gracefully.
There was a problem hiding this comment.
Pull request overview
This PR prevents the master’s chunk rebalancing logic from selecting chunkservers that have transitioned to KILL mode after the cached server orderings (sortedServers_ / labeledSortedServers_) were computed, addressing replication attempts to disconnected servers and aiming to reduce flakiness in kill-during-write scenarios.
Changes:
- Add a new
matocsserv_is_killed()API to query whether a chunkserver is inKILLmode. - Skip
KILL-mode chunkservers as destinations duringChunkWorker::rebalanceChunkParts(). - Minor whitespace cleanup in
chunks.cc.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
src/master/matocsserv.h |
Declares matocsserv_is_killed() accessor for chunkserver mode. |
src/master/matocsserv.cc |
Implements matocsserv_is_killed() based on ChunkserverConnectionMode::KILL. |
src/master/chunks.cc |
Filters out killed servers when choosing rebalance destinations from cached sorted server lists. |
3ff3a4d to
15df136
Compare
15df136 to
986f1b4
Compare
|
@gemini-code-assist please review again. |
There was a problem hiding this comment.
Code Review
The pull request effectively addresses the issue of rebalancing chunks to disconnected or killed chunkservers. The introduction of the updateSortedServers method in ChunkWorker and the gSortedServersNeedsRefresh flag, along with its associated helper functions, provides a robust mechanism to ensure that the list of available chunkservers is always up-to-date. The checks for matocsserv_is_killed in the rebalancing logic directly prevent the problematic behavior, which is a critical correctness improvement. The changes are well-encapsulated and align with the existing codebase patterns.
At the instant of looking for a server to replicate a part when rebalancing, the current implementation uses the sortedServers_ or labeledSortedServers_ variables. These containers may have stale chunkserver entries that have been recently switched to KILL mode or entries completely released. Those entries must not be used as a destination for the replicate parts, if the master does not crashes. These commit fixes those issues. The test test_kill_cs_while_writing_small_files should stop being flaky after the merge. The master crash was one of the causes for the falure of the test. Another possible outcome of those replication targetting disconnected CSs is that the replication read counter of the source CS (the remaining alive CS) remain increased indefinately, thus blocking future replications that needed to retrieve data from those servers. This very dangerous behavior was also causing failures in the previously mentioned test. Signed-off-by: Dave <dave@leil.io>
986f1b4 to
79c796d
Compare
At the instant of looking for a server to replicate a part when
rebalancing, the current implementation uses the sortedServers_ or
labeledSortedServers_ variables. These containers may have stale
chunkserver entries that have been recently switched to KILL mode
or entries completely released. Those entries must not be used as a
destination for the replicate parts, if the master does not crashes.
This PR fixes those issues.
The test test_kill_cs_while_writing_small_files should stop being flaky
after the merge.
The master crash was one of the causes for the failure of the test.
Another possible outcome of those replication targetting disconnected
CSs is that the replication read counter of the source CS (the
remaining alive CS) remain increased indefinately, thus blocking future
replications that needed to retrieve data from those servers. This very
dangerous behavior was also causing failures in the previously
mentioned test.
To the reviewers, please check Gemini's comments.
Signed-off-by: Dave dave@leil.io