fix: avoid stopping CS with pending replies#756

Merged
dmga44 merged 1 commit into dev from fix-cs-stopping-too-soon
Feb 20, 2026

@dmga44
Collaborator

@dmga44 dmga44 commented Feb 20, 2026

We noticed that the CS could gracefully stop without sending all the expected responses. The missing responses could cause the master to invalidate some chunk parts even though everything was expected to be fine.

The cause is the span of time between the instant a worker thread (from a jobPool) takes a new job and the instant the job is processed. During that window the job is not counted by getJobCount, so the jobPool may appear empty while some jobs still need processing. This happened in the jobPools attending requests from clients, and a similar issue affected the jobPools attending requests from the master.

The solution improves the accuracy of the stop conditions to account for this window.

Signed-off-by: Dave <dave@leil.io>
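
The window described above can be reproduced without threads. The following is a minimal sketch, not the project's JobPool: `ToyPool`, `WindowObservation`, and `observeWindow` are hypothetical names invented for illustration. It assumes only what the description states, namely that a worker removes the job from the queue before the job is processed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <functional>
#include <utility>

// Hypothetical single-threaded model of a job pool (not the real JobPool).
struct ToyPool {
    std::deque<std::function<void()>> queue;  // jobs still waiting for a worker
    uint32_t unprocessed = 0;                 // jobs added but not yet finished

    void add(std::function<void()> job) {
        queue.push_back(std::move(job));
        ++unprocessed;
    }
    // A worker removes the job from the queue before running it.
    std::function<void()> take() {
        auto job = std::move(queue.front());
        queue.pop_front();
        return job;
    }
    void finish() { --unprocessed; }
    std::size_t queuedCount() const { return queue.size(); }
};

// What an observer sees while the worker holds the job.
struct WindowObservation {
    std::size_t queued;
    uint32_t unprocessed;
    bool replySent;
};

WindowObservation observeWindow() {
    ToyPool pool;
    bool replySent = false;
    pool.add([&replySent] { replySent = true; });

    auto job = pool.take();  // the worker now owns the job; no reply yet
    WindowObservation seen{pool.queuedCount(), pool.unprocessed, replySent};

    job();          // the reply is produced only here
    pool.finish();  // and only now does the job stop being counted
    return seen;
}
```

Inside the window the queue depth is zero although the reply has not been sent, while the add-to-finish counter still reports one pending job. A stop condition based only on queue depth would pass at that point.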

@gemini-code-assist

Summary of Changes

Hello @dmga44, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a critical issue where the chunkserver could prematurely terminate during graceful shutdown, leading to unfulfilled requests and potential data inconsistencies. The fix enhances the job tracking mechanism within the JobPool by introducing an atomic counter for unprocessed jobs and refining the conditions under which the chunkserver determines it is safe to exit. This ensures that all operations are completed before shutdown, thereby improving system reliability and data integrity.

Highlights

  • Accurate Job Tracking: Introduced an unprocessedJobs_ atomic counter in the JobPool class to precisely track jobs from the moment they are added until their callbacks are fully processed, addressing a previous inaccuracy where jobs picked up by worker threads were not counted.
  • Robust Termination Conditions: Added a new JobPool::isEmpty() method that checks both the jobsQueue and the unprocessedJobs_ counter, providing a more reliable indicator of whether the job pool is truly empty. This prevents premature chunkserver shutdowns.
  • Improved Chunkserver Exit Logic: Updated the masterconn_canexit and mainNetworkThreadCanExit functions to leverage the new JobPool::isEmpty() or getUnprocessedJobCount() methods, ensuring that the chunkserver only gracefully stops when all pending replies and jobs have been handled.
Changelog
  • src/chunkserver/bgjobs.cc
    • Incremented unprocessedJobs_ when a job is added to the queue.
    • Added getUnprocessedJobCount() to retrieve the count of jobs not yet processed.
    • Implemented isEmpty() to check if the job queue and unprocessed jobs are both empty.
    • Decremented unprocessedJobs_ when a job's callback is executed and it is removed from the hash.
  • src/chunkserver/bgjobs.h
    • Declared getUnprocessedJobCount() and isEmpty() methods in the JobPool class.
    • Added a std::atomic<uint32_t> unprocessedJobs_ member to JobPool to track jobs.
  • src/chunkserver/masterconn.cc
    • Modified masterconn_canexit to use the new JobPool::isEmpty() method for both job pools.
    • Removed the direct registration of masterconn_canexit with eventloop_canexitregister as it is now part of a combined exit check.
  • src/chunkserver/masterconn.h
    • Declared the masterconn_canexit() function.
  • src/chunkserver/network_main_thread.cc
    • Included masterconn.h for master connection functionalities.
    • Changed mainNetworkThreadCanExit to return a bool and simplified its logic.
    • Introduced a new canExit() function that combines the exit conditions of mainNetworkThreadCanExit() and masterconn_canexit().
    • Registered the new combined canExit() function with eventloop_canexitregister.
  • src/chunkserver/network_worker_thread.cc
    • Updated updateAndCheckTerminationStatus to use bgJobPool_->getUnprocessedJobCount() instead of bgJobPool_->getJobCount() for more accurate termination checks.
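
As a rough illustration of the counter described in this changelog, here is a threaded sketch. It is an assumption-laden simplification, not the code in bgjobs.cc: `SketchJobPool` and `runAndCountReplies` are hypothetical, there is a single worker, and the counter is decremented right after the job runs rather than when a callback is removed from a hash. Only the names `unprocessedJobs_` and `getUnprocessedJobCount()` are taken from the summary.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Hypothetical pool with one worker thread (not the project's JobPool).
class SketchJobPool {
public:
    SketchJobPool() : worker_([this] { run(); }) {}
    ~SketchJobPool() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        cv_.notify_all();
        worker_.join();
    }
    void addJob(std::function<void()> job) {
        unprocessedJobs_.fetch_add(1);  // counted from the moment it is added
        {
            std::lock_guard<std::mutex> lock(mutex_);
            jobs_.push(std::move(job));
        }
        cv_.notify_one();
    }
    uint32_t getUnprocessedJobCount() const { return unprocessedJobs_.load(); }

private:
    void run() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cv_.wait(lock, [this] { return stopping_ || !jobs_.empty(); });
                if (jobs_.empty()) return;  // stopping and fully drained
                job = std::move(jobs_.front());
                jobs_.pop();  // queue may look empty from here on
            }
            job();
            unprocessedJobs_.fetch_sub(1);  // uncounted only after it ran
        }
    }
    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> jobs_;
    bool stopping_ = false;
    std::atomic<uint32_t> unprocessedJobs_{0};
    std::thread worker_;  // declared last so the other members exist first
};

// Adds jobCount jobs, waits on the counter, and reports how many replied.
int runAndCountReplies(int jobCount) {
    std::atomic<int> replies{0};
    SketchJobPool pool;
    for (int i = 0; i < jobCount; ++i) {
        pool.addJob([&replies] { replies.fetch_add(1); });
    }
    while (pool.getUnprocessedJobCount() != 0) std::this_thread::yield();
    return replies.load();
}
```

Because the caller waits on the add-to-finish counter rather than on queue depth, every job has produced its reply by the time the wait ends.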
Activity
  • The pull request was authored by dmga44.
  • The changes focus on enhancing the graceful shutdown mechanism of the chunkserver to prevent data inconsistencies.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a mechanism to more accurately track pending jobs in JobPool to prevent premature shutdowns. The introduction of unprocessedJobs_ is a good approach to account for in-flight jobs. The changes are applied to the shutdown logic for both client-facing worker threads and the master connection.

My review identifies a critical issue in the implementation of JobPool::isEmpty() which, if left unaddressed, would mean the problem is not fully solved for the master connection shutdown logic. This issue involves potential data races and incorrect mutex usage, which violates the guideline for handling concurrently updated shared resources. I've provided suggestions to correct this. With that fix, the PR should effectively address the reported issue.

@dmga44
Collaborator Author

dmga44 commented Feb 20, 2026

@gemini-code-assist

Added a new JobPool::isEmpty() method that checks both the jobsQueue and the unprocessedJobs_ counter, providing a more reliable indicator of whether the job pool is truly empty.

isEmpty does not check the unprocessedJobs_ counter, only the jobsQueue and statusQueue.

@gemini-code-assist

You are absolutely correct, @dmga44! My apologies for the inaccuracy in the summary. You're right that JobPool::isEmpty() specifically checks jobsQueue and statusQueue, not unprocessedJobs_ directly.

The unprocessedJobs_ counter is indeed crucial for the overall termination logic, as it's used by functions like mainNetworkThreadCanExit() to ensure all jobs are fully processed before shutdown. My summary incorrectly conflated the checks within isEmpty() with the broader use of unprocessedJobs_ in the chunkserver's exit conditions.

Thank you for pointing out this important detail and helping me to be more precise!

Copilot AI left a comment

Pull request overview

This PR aims to prevent the chunkserver from gracefully stopping while there are still pending replies by making shutdown “can-exit” conditions more accurate across network worker job pools and the master connection job pools.

Changes:

  • Update network worker termination logic to use a new “unprocessed job” counter instead of queue depth.
  • Consolidate eventloop can-exit checks into a single function that gates exit on both network thread termination and master connection readiness.
  • Introduce JobPool::getUnprocessedJobCount() and JobPool::isEmpty() and adapt masterconn_canexit() accordingly.
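
The consolidated can-exit check could look roughly like the sketch below. Everything in it is hypothetical: `ShutdownState`, `ExitRegistry`, and the helper functions are invented stand-ins, and the real `eventloop_canexitregister` signature is not shown in this conversation, so it is not reproduced here.

```cpp
#include <array>
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for the state behind the two real checks.
struct ShutdownState {
    bool networkThreadsTerminated = false;
    bool masterJobPoolsDrained = false;
};

bool networkCanExit(const ShutdownState& s) { return s.networkThreadsTerminated; }
bool masterConnCanExit(const ShutdownState& s) { return s.masterJobPoolsDrained; }

// One combined gate: exit only when both conditions hold. With &&, the
// master-connection check runs only after the network check has passed.
bool canExit(const ShutdownState& s) {
    return networkCanExit(s) && masterConnCanExit(s);
}

// Toy registry that mimics an event loop polling registered exit checks.
class ExitRegistry {
public:
    void registerCheck(std::function<bool()> check) {
        checks_.push_back(std::move(check));
    }
    bool allAgree() const {
        for (const auto& check : checks_) {
            if (!check()) return false;
        }
        return true;
    }

private:
    std::vector<std::function<bool()>> checks_;
};

// Polls the registry at three points of a simulated shutdown.
std::array<bool, 3> shutdownTrace() {
    ShutdownState state;
    ExitRegistry registry;
    registry.registerCheck([&state] { return canExit(state); });

    std::array<bool, 3> trace{};
    trace[0] = registry.allAgree();  // nothing has finished yet
    state.networkThreadsTerminated = true;
    trace[1] = registry.allAgree();  // master job pools are still busy
    state.masterJobPoolsDrained = true;
    trace[2] = registry.allAgree();  // both conditions satisfied
    return trace;
}
```

Registering a single combined function keeps the order of evaluation explicit in one place. Whether that ordering is the reason the PR merged the two checks is not stated here, so treat it as a property of this sketch only.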

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| src/chunkserver/network_worker_thread.cc | Uses getUnprocessedJobCount() to avoid terminating while work is still in-flight. |
| src/chunkserver/network_main_thread.cc | Wraps can-exit logic to combine network + masterconn exit readiness; registers new handler. |
| src/chunkserver/masterconn.h | Exposes masterconn_canexit() as bool. |
| src/chunkserver/masterconn.cc | Changes can-exit logic to rely on job-pool emptiness helpers and removes direct eventloop registration. |
| src/chunkserver/bgjobs.h | Adds new JobPool APIs and an unprocessed job counter field. |
| src/chunkserver/bgjobs.cc | Implements unprocessed job tracking and a new isEmpty() helper. |

@dmga44 dmga44 force-pushed the fix-cs-stopping-too-soon branch from ab1d4bc to a651659 February 20, 2026 09:48
@dmga44 dmga44 requested a review from Copilot February 20, 2026 10:02
@dmga44
Copy link
Collaborator Author

dmga44 commented Feb 20, 2026

@gemini-code-assist Please review again.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

The pull request effectively addresses the issue of premature chunk server termination by introducing an unprocessedJobs_ counter to accurately track jobs that have been added but not yet fully processed. The network_worker_thread.cc correctly integrates the new allJobsProcessed() method for its termination checks, which is a significant improvement. However, there's a critical inconsistency in the masterconn_canexit() function, which uses JobPool::isEmpty() instead of JobPool::allJobsProcessed(). This choice reintroduces the original problem for master-related job pools, as isEmpty() does not account for 'in-flight' jobs. Additionally, the memory ordering used for unprocessedJobs_ operations should be strengthened to ensure proper synchronization and visibility of job completion side effects during termination, aligning with best practices for concurrent shared resources.
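
The memory-ordering remark can be illustrated with a release decrement paired with an acquire load. This is a sketch of the general C++ technique, not the ordering the PR ended up using, which is not shown in this conversation. `CompletionTracker` and `waitForReply` are hypothetical; `allJobsProcessed` borrows the method name mentioned in the review.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>

// Hypothetical tracker (not the project's JobPool).
struct CompletionTracker {
    std::atomic<uint32_t> unprocessed{0};
    int lastReply = 0;  // plain, non-atomic data written by the worker

    void jobAdded() { unprocessed.fetch_add(1, std::memory_order_relaxed); }

    void jobFinished(int reply) {
        lastReply = reply;  // side effect of finishing the job
        // Release: publishes the write above to whoever acquires the counter.
        unprocessed.fetch_sub(1, std::memory_order_release);
    }

    bool allJobsProcessed() const {
        // Acquire: after observing zero, the worker's writes are visible.
        return unprocessed.load(std::memory_order_acquire) == 0;
    }
};

// Runs one job on a worker thread and reads its result after the check passes.
int waitForReply(int reply) {
    CompletionTracker tracker;
    tracker.jobAdded();  // happens before the worker thread is created
    std::thread worker([&tracker, reply] { tracker.jobFinished(reply); });

    while (!tracker.allJobsProcessed()) std::this_thread::yield();
    int seen = tracker.lastReply;  // ordered after the worker's write

    worker.join();
    return seen;
}
```

With relaxed ordering on both sides, reading `lastReply` after the loop would be a data race. The release/acquire pair is what makes the termination check also a guarantee that the job's side effects are visible.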

Copilot AI left a comment

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.

@dmga44 dmga44 force-pushed the fix-cs-stopping-too-soon branch from a651659 to 85ac1c5 February 20, 2026 13:39
Collaborator

@rolysr rolysr left a comment

LGTM. Nice catch again.

Contributor

@lgsilva3087 lgsilva3087 left a comment

LGTM

Contributor

@ralcolea ralcolea left a comment

Great job @dmga44! 👍

The CS could gracefully stop without sending all the expected
responses. The missing responses could cause the master to invalidate
some chunk parts even though everything was expected to be fine.

The cause is the span of time between the instant a worker thread (from
a jobPool) takes a new job and the instant the job is processed. During
that window the job is not counted by getJobCount, so the jobPool may
appear empty while some jobs still need processing. This happened in
the jobPools attending requests from clients, and a similar issue
affected the jobPools attending requests from the master.

The solution improves the accuracy of the stop conditions to account
for this window.

Signed-off-by: Dave <dave@leil.io>
@dmga44 dmga44 force-pushed the fix-cs-stopping-too-soon branch from 85ac1c5 to 916137c February 20, 2026 16:26
@dmga44 dmga44 merged commit 4263673 into dev Feb 20, 2026
11 checks passed
@dmga44 dmga44 deleted the fix-cs-stopping-too-soon branch February 20, 2026 19:21

5 participants