fix: avoid stopping CS with pending replies by dmga44 · Pull Request #756 · leil-io/saunafs

dmga44 · 2026-02-20T09:30:17Z

We noticed that the CS could stop gracefully without sending all the expected responses. Those missing responses could cause the master to invalidate some chunk parts even though everything was expected to be fine.

The cause is the window between the instant a worker thread (from a jobPool) takes a new job and the instant that job is processed. During that window the job is not counted by getJobCount, so the jobPool may appear empty while some jobs still need processing. This was happening in the jobPools serving requests from clients, and a similar issue affected the jobPools serving requests from the master.
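The window described above can be walked through on a single thread. This is a minimal sketch, not the SaunaFS code: `NaivePool`, `takeJob`, `WindowResult` and `demonstrateWindow` are hypothetical names, and only the idea of a queue-depth check like getJobCount comes from the description.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <optional>
#include <utility>

// Hypothetical, simplified stand-in for a job pool (not the real JobPool).
class NaivePool {
public:
    void addJob(std::function<void()> job) { queue_.push_back(std::move(job)); }

    // Queue-depth check: counts only jobs that are still queued.
    std::size_t getJobCount() const { return queue_.size(); }

    // First step of a worker: take a job off the queue.
    std::optional<std::function<void()>> takeJob() {
        if (queue_.empty()) return std::nullopt;
        auto job = std::move(queue_.front());
        queue_.pop_front();
        return job;
    }

private:
    std::deque<std::function<void()>> queue_;
};

struct WindowResult {
    std::size_t countWhileInFlight;  // what a shutdown check would read
    bool replySentBeforeCheck;       // whether the job had already run
};

// Replays the sequence: job taken, shutdown check, job processed.
WindowResult demonstrateWindow() {
    NaivePool pool;
    bool replySent = false;
    pool.addJob([&replySent] { replySent = true; });

    auto job = pool.takeJob();  // the worker has picked the job up
    assert(job.has_value());
    WindowResult result{pool.getJobCount(), replySent};  // check lands here
    (*job)();                   // the job is processed only afterwards
    return result;
}
```

In this sketch the check reads a count of zero although the reply has not been produced yet, which is the situation the fix is meant to rule out.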

The solution makes the stop conditions more accurate by taking that window into account.

Signed-off-by: Dave <dave@leil.io>

gemini-code-assist · 2026-02-20T09:30:36Z

Summary of Changes

Hello @dmga44, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a critical issue where the chunkserver could prematurely terminate during graceful shutdown, leading to unfulfilled requests and potential data inconsistencies. The fix enhances the job tracking mechanism within the JobPool by introducing an atomic counter for unprocessed jobs and refining the conditions under which the chunkserver determines it is safe to exit. This ensures that all operations are completed before shutdown, thereby improving system reliability and data integrity.

Highlights

Accurate Job Tracking: Introduced an unprocessedJobs_ atomic counter in the JobPool class to precisely track jobs from the moment they are added until their callbacks are fully processed, addressing a previous inaccuracy where jobs picked up by worker threads were not counted.
Robust Termination Conditions: Added a new JobPool::isEmpty() method that checks both the jobsQueue and the unprocessedJobs_ counter, providing a more reliable indicator of whether the job pool is truly empty. This prevents premature chunkserver shutdowns.
Improved Chunkserver Exit Logic: Updated the masterconn_canexit and mainNetworkThreadCanExit functions to leverage the new JobPool::isEmpty() or getUnprocessedJobCount() methods, ensuring that the chunkserver only gracefully stops when all pending replies and jobs have been handled.
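As an illustration of the counter described in the first highlight, here is a minimal sketch under stated assumptions. Only `unprocessedJobs_` and `getUnprocessedJobCount()` are names taken from the summary; `CountingPool`, `addJob` and `runOne` are hypothetical, and the decrement is placed right after the job body as a simplification of "when the callback is executed".

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <deque>
#include <functional>
#include <mutex>
#include <utility>

// Hypothetical pool that counts a job from enqueue until it has fully run.
class CountingPool {
public:
    void addJob(std::function<void()> job) {
        // Count the job before it becomes visible to any worker.
        unprocessedJobs_.fetch_add(1);
        std::lock_guard<std::mutex> lock(mutex_);
        queue_.push_back(std::move(job));
    }

    // Runs one queued job, if any. Returns false when the queue is empty.
    bool runOne() {
        std::function<void()> job;
        {
            std::lock_guard<std::mutex> lock(mutex_);
            if (queue_.empty()) return false;
            job = std::move(queue_.front());
            queue_.pop_front();
        }
        job();  // the queue may already be empty here; the counter is not
        assert(unprocessedJobs_.load() > 0);
        unprocessedJobs_.fetch_sub(1);  // job is done only now
        return true;
    }

    std::uint32_t getUnprocessedJobCount() const {
        return unprocessedJobs_.load();
    }

private:
    std::mutex mutex_;
    std::deque<std::function<void()>> queue_;
    std::atomic<std::uint32_t> unprocessedJobs_{0};
};
```

A job that inspects the counter while it is running would still see itself counted, so a termination check based on this counter cannot fire while work is in flight.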

Changelog

src/chunkserver/bgjobs.cc
- Incremented unprocessedJobs_ when a job is added to the queue.
- Added getUnprocessedJobCount() to retrieve the count of jobs not yet processed.
- Implemented isEmpty() to check if the job queue and unprocessed jobs are both empty.
- Decremented unprocessedJobs_ when a job's callback is executed and it is removed from the hash.
src/chunkserver/bgjobs.h
- Declared getUnprocessedJobCount() and isEmpty() methods in the JobPool class.
- Added a std::atomic<uint32_t> unprocessedJobs_ member to JobPool to track jobs.
src/chunkserver/masterconn.cc
- Modified masterconn_canexit to use the new JobPool::isEmpty() method for both job pools.
- Removed the direct registration of masterconn_canexit with eventloop_canexitregister as it is now part of a combined exit check.
src/chunkserver/masterconn.h
- Declared the masterconn_canexit() function.
src/chunkserver/network_main_thread.cc
- Included masterconn.h for master connection functionalities.
- Changed mainNetworkThreadCanExit to return a bool and simplified its logic.
- Introduced a new canExit() function that combines the exit conditions of mainNetworkThreadCanExit() and masterconn_canexit().
- Registered the new combined canExit() function with eventloop_canexitregister.
src/chunkserver/network_worker_thread.cc
- Updated updateAndCheckTerminationStatus to use bgJobPool_->getUnprocessedJobCount() instead of bgJobPool_->getJobCount() for more accurate termination checks.
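The combined exit check listed for network_main_thread.cc can be pictured roughly as follows. This is a sketch only: the real functions read module state and are registered with the event loop, whereas here `ShutdownState`, `networkCanExit` and `masterConnCanExit` are hypothetical stand-ins that take an explicit snapshot so the example is self-contained.

```cpp
#include <cstdint>

// Hypothetical snapshot of the state the two exit checks depend on.
struct ShutdownState {
    bool networkWorkersTerminated;        // network worker threads are done
    std::uint32_t masterJobsUnprocessed;  // master-facing jobs not yet finished
};

// Stand-in for the network-side check.
bool networkCanExit(const ShutdownState& s) {
    return s.networkWorkersTerminated;
}

// Stand-in for the master-connection check.
bool masterConnCanExit(const ShutdownState& s) {
    return s.masterJobsUnprocessed == 0;
}

// Single predicate: exit only when both sides report they are ready.
bool canExit(const ShutdownState& s) {
    return networkCanExit(s) && masterConnCanExit(s);
}
```

Registering one combined predicate instead of two separate ones keeps the decision in a single place, so neither side can allow the process to exit on its own.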

Activity

The pull request was authored by dmga44.
The changes focus on enhancing the graceful shutdown mechanism of the chunkserver to prevent data inconsistencies.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request introduces a mechanism to more accurately track pending jobs in JobPool to prevent premature shutdowns. The introduction of unprocessedJobs_ is a good approach to account for in-flight jobs. The changes are applied to the shutdown logic for both client-facing worker threads and the master connection.

My review identifies a critical issue in the implementation of JobPool::isEmpty() which, if left unaddressed, would mean the problem is not fully solved for the master connection shutdown logic. This issue involves potential data races and incorrect mutex usage, which violates the guideline for handling concurrently updated shared resources. I've provided suggestions to correct this. With that fix, the PR should effectively address the reported issue.

src/chunkserver/bgjobs.cc

src/chunkserver/bgjobs.h

dmga44 · 2026-02-20T09:34:39Z

@gemini-code-assist

Added a new JobPool::isEmpty() method that checks both the jobsQueue and the unprocessedJobs_ counter, providing a more reliable indicator of whether the job pool is truly empty.

isEmpty does not check the unprocessedJobs_ counter, only the jobsQueue and statusQueue.

gemini-code-assist · 2026-02-20T09:34:54Z

You are absolutely correct, @dmga44! My apologies for the inaccuracy in the summary. You're right that JobPool::isEmpty() specifically checks jobsQueue and statusQueue, not unprocessedJobs_ directly.

The unprocessedJobs_ counter is indeed crucial for the overall termination logic, as it's used by functions like mainNetworkThreadCanExit() to ensure all jobs are fully processed before shutdown. My summary incorrectly conflated the checks within isEmpty() with the broader use of unprocessedJobs_ in the chunkserver's exit conditions.

Thank you for pointing out this important detail and helping me to be more precise!

Copilot

Pull request overview

This PR aims to prevent the chunkserver from gracefully stopping while there are still pending replies by making shutdown “can-exit” conditions more accurate across network worker job pools and the master connection job pools.

Changes:

Update network worker termination logic to use a new “unprocessed job” counter instead of queue depth.
Consolidate eventloop can-exit checks into a single function that gates exit on both network thread termination and master connection readiness.
Introduce JobPool::getUnprocessedJobCount() and JobPool::isEmpty() and adapt masterconn_canexit() accordingly.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.

Show a summary per file

File	Description
src/chunkserver/network_worker_thread.cc	Uses `getUnprocessedJobCount()` to avoid terminating while work is still in-flight.
src/chunkserver/network_main_thread.cc	Wraps can-exit logic to combine network + masterconn exit readiness; registers new handler.
src/chunkserver/masterconn.h	Exposes `masterconn_canexit()` as `bool`.
src/chunkserver/masterconn.cc	Changes can-exit logic to rely on job-pool emptiness helpers and removes direct eventloop registration.
src/chunkserver/bgjobs.h	Adds new JobPool APIs and an unprocessed job counter field.
src/chunkserver/bgjobs.cc	Implements unprocessed job tracking and a new `isEmpty()` helper.

src/chunkserver/bgjobs.cc

src/chunkserver/bgjobs.h

src/chunkserver/masterconn.cc

src/chunkserver/bgjobs.cc

dmga44 · 2026-02-20T10:03:04Z

@gemini-code-assist Please review again.

gemini-code-assist

Code Review

The pull request effectively addresses the issue of premature chunk server termination by introducing an unprocessedJobs_ counter to accurately track jobs that have been added but not yet fully processed. The network_worker_thread.cc correctly integrates the new allJobsProcessed() method for its termination checks, which is a significant improvement. However, there's a critical inconsistency in the masterconn_canexit() function, which uses JobPool::isEmpty() instead of JobPool::allJobsProcessed(). This choice reintroduces the original problem for master-related job pools, as isEmpty() does not account for 'in-flight' jobs. Additionally, the memory ordering used for unprocessedJobs_ operations should be strengthened to ensure proper synchronization and visibility of job completion side effects during termination, aligning with best practices for concurrent shared resources.
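The memory-ordering remark can be illustrated with a release/acquire pair. The review does not show which ordering the PR finally uses, so this sketch simply assumes a release decrement and an acquire load. `waitForJobThenReadReply` and the local variables are hypothetical names.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>

// Hypothetical illustration: a worker finishes one job while the caller
// waits for the unprocessed-job counter to reach zero.
int waitForJobThenReadReply() {
    std::atomic<std::uint32_t> unprocessedJobs{1};
    int reply = 0;  // plain, non-atomic data written by the job

    std::thread worker([&] {
        reply = 42;  // side effect of the job
        // Release: the write to reply is published before the count drops.
        unprocessedJobs.fetch_sub(1, std::memory_order_release);
    });

    // Acquire: once zero is observed, the job's writes are visible here.
    while (unprocessedJobs.load(std::memory_order_acquire) != 0) {
        std::this_thread::yield();
    }
    int observed = reply;  // safe to read without a lock after the acquire

    worker.join();
    return observed;
}
```

With relaxed ordering on both sides the counter itself would still reach zero, but nothing would guarantee that the job's other writes are visible to the thread deciding to shut down.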

src/chunkserver/masterconn.cc

src/chunkserver/bgjobs.cc

Copilot

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.

src/chunkserver/masterconn.cc

src/chunkserver/bgjobs.cc

src/chunkserver/network_main_thread.cc

src/chunkserver/bgjobs.cc

src/chunkserver/bgjobs.h

rolysr

LGTM. Nice catch again.

lgsilva3087

LGTM

ralcolea

Great job @dmga44! 👍

dmga44 self-assigned this Feb 20, 2026

dmga44 requested review from GigaCronos, antuan96314, Copilot, lgsilva3087, ralcolea, rolysr and uristdwarf February 20, 2026 09:32

gemini-code-assist bot reviewed Feb 20, 2026

src/chunkserver/bgjobs.cc (resolved)

src/chunkserver/bgjobs.h (resolved)

Copilot started reviewing on behalf of dmga44 February 20, 2026 09:32

Copilot AI reviewed Feb 20, 2026

dmga44 force-pushed the fix-cs-stopping-too-soon branch from ab1d4bc to a651659 Compare February 20, 2026 09:48

dmga44 requested a review from Copilot February 20, 2026 10:02

Copilot started reviewing on behalf of dmga44 February 20, 2026 10:03

gemini-code-assist bot reviewed Feb 20, 2026

src/chunkserver/masterconn.cc (outdated, resolved)

src/chunkserver/bgjobs.cc (resolved)

src/chunkserver/bgjobs.cc (resolved)

src/chunkserver/bgjobs.cc (resolved)

Copilot AI reviewed Feb 20, 2026

src/chunkserver/masterconn.cc (resolved)

src/chunkserver/bgjobs.cc (resolved)

src/chunkserver/network_main_thread.cc (outdated, resolved)

src/chunkserver/bgjobs.cc (resolved)

src/chunkserver/bgjobs.h (resolved)

dmga44 force-pushed the fix-cs-stopping-too-soon branch from a651659 to 85ac1c5 Compare February 20, 2026 13:39

rolysr approved these changes Feb 20, 2026

lgsilva3087 approved these changes Feb 20, 2026

ralcolea approved these changes Feb 20, 2026

dmga44 force-pushed the fix-cs-stopping-too-soon branch from 85ac1c5 to 916137c Compare February 20, 2026 16:26

dmga44 merged commit 4263673 into dev Feb 20, 2026
11 checks passed

dmga44 deleted the fix-cs-stopping-too-soon branch February 20, 2026 19:21

5 participants
