fix: avoid stopping CS with pending replies (#756)
Conversation
Summary of Changes: Hello @dmga44, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request resolves a critical issue where the chunkserver could prematurely terminate during graceful shutdown, leading to unfulfilled requests and potential data inconsistencies. The fix enhances the job tracking mechanism within the `JobPool`.
Code Review
This pull request introduces a mechanism to more accurately track pending jobs in JobPool to prevent premature shutdowns. The introduction of unprocessedJobs_ is a good approach to account for in-flight jobs. The changes are applied to the shutdown logic for both client-facing worker threads and the master connection.
My review identifies a critical issue in the implementation of JobPool::isEmpty() which, if left unaddressed, would mean the problem is not fully solved for the master connection shutdown logic. This issue involves potential data races and incorrect mutex usage, which violates the guideline for handling concurrently updated shared resources. I've provided suggestions to correct this. With that fix, the PR should effectively address the reported issue.
`isEmpty()` does not check the `unprocessedJobs_` counter, only the `jobsQueue` and `statusQueue`.
You are absolutely correct, @dmga44! My apologies for the inaccuracy in the summary. You're right that `isEmpty()` does not account for the `unprocessedJobs_` counter. Thank you for pointing out this important detail and helping me to be more precise!
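To make the correction above concrete, here is a minimal sketch of an `isEmpty()` that also accounts for in-flight jobs, with all state read under one mutex. The member names (`unprocessedJobs_`, `jobsQueue_`, `statusQueue_`) follow the review discussion; everything else is illustrative, not the actual SaunaFS code.

```cpp
#include <cstddef>
#include <deque>
#include <mutex>

// Illustrative sketch only: member names follow the review comments,
// not the actual SaunaFS sources.
class JobPool {
public:
    void addJob(int jobId) {
        std::lock_guard<std::mutex> lock(mutex_);
        jobsQueue_.push_back(jobId);
        ++unprocessedJobs_;  // counted from submission, not from dequeue
    }

    int takeJob() {
        std::lock_guard<std::mutex> lock(mutex_);
        int jobId = jobsQueue_.front();
        jobsQueue_.pop_front();
        return jobId;  // unprocessedJobs_ stays up: the job is in flight
    }

    void finishJob(int jobId) {
        std::lock_guard<std::mutex> lock(mutex_);
        statusQueue_.push_back(jobId);  // reply still pending delivery
    }

    void drainStatus() {
        std::lock_guard<std::mutex> lock(mutex_);
        while (!statusQueue_.empty()) {
            statusQueue_.pop_front();
            --unprocessedJobs_;  // job done only once its reply is drained
        }
    }

    bool isEmpty() const {
        std::lock_guard<std::mutex> lock(mutex_);
        // The fix under discussion: check the counter too, so jobs that
        // were dequeued but not yet finished keep the pool "non-empty".
        return jobsQueue_.empty() && statusQueue_.empty()
               && unprocessedJobs_ == 0;
    }

private:
    mutable std::mutex mutex_;
    std::deque<int> jobsQueue_;
    std::deque<int> statusQueue_;
    std::size_t unprocessedJobs_ = 0;
};
```

With this shape, a pool whose queues are both empty still reports non-empty while a worker holds a dequeued job, which is exactly the window the shutdown check was missing.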
Pull request overview
This PR aims to prevent the chunkserver from gracefully stopping while there are still pending replies by making shutdown “can-exit” conditions more accurate across network worker job pools and the master connection job pools.
Changes:
- Update network worker termination logic to use a new “unprocessed job” counter instead of queue depth.
- Consolidate eventloop can-exit checks into a single function that gates exit on both network thread termination and master connection readiness.
- Introduce `JobPool::getUnprocessedJobCount()` and `JobPool::isEmpty()`, and adapt `masterconn_canexit()` accordingly.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.
Summary per file:
| File | Description |
|---|---|
| src/chunkserver/network_worker_thread.cc | Uses getUnprocessedJobCount() to avoid terminating while work is still in-flight. |
| src/chunkserver/network_main_thread.cc | Wraps can-exit logic to combine network + masterconn exit readiness; registers new handler. |
| src/chunkserver/masterconn.h | Exposes masterconn_canexit() as bool. |
| src/chunkserver/masterconn.cc | Changes can-exit logic to rely on job-pool emptiness helpers and removes direct eventloop registration. |
| src/chunkserver/bgjobs.h | Adds new JobPool APIs and an unprocessed job counter field. |
| src/chunkserver/bgjobs.cc | Implements unprocessed job tracking and a new isEmpty() helper. |
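The consolidated can-exit check described in the table can be pictured as follows. This is a hedged sketch: the flag variables and the helper name `mainloop_canexit` are stand-ins, not the actual SaunaFS symbols; only `masterconn_canexit` comes from the PR.

```cpp
// Illustrative stand-ins for the two exit conditions the PR combines;
// these flags are not the actual SaunaFS symbols.
static bool networkThreadsTerminated = false;
static bool masterJobPoolsDrained = false;

// Per the PR, masterconn_canexit() now returns bool.
bool masterconn_canexit() { return masterJobPoolsDrained; }

// Single can-exit gate registered with the eventloop: both conditions
// must hold before the chunkserver is allowed to stop.
bool mainloop_canexit() {
    return networkThreadsTerminated && masterconn_canexit();
}
```

Gating on the conjunction means neither a lingering network worker nor a pending master reply can be abandoned by an early exit.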
Force-pushed from ab1d4bc to a651659.
@gemini-code-assist Please review again.
Code Review
The pull request effectively addresses the issue of premature chunk server termination by introducing an unprocessedJobs_ counter to accurately track jobs that have been added but not yet fully processed. The network_worker_thread.cc correctly integrates the new allJobsProcessed() method for its termination checks, which is a significant improvement. However, there's a critical inconsistency in the masterconn_canexit() function, which uses JobPool::isEmpty() instead of JobPool::allJobsProcessed(). This choice reintroduces the original problem for master-related job pools, as isEmpty() does not account for 'in-flight' jobs. Additionally, the memory ordering used for unprocessedJobs_ operations should be strengthened to ensure proper synchronization and visibility of job completion side effects during termination, aligning with best practices for concurrent shared resources.
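On the memory-ordering point, the usual pattern is a release decrement on job completion paired with an acquire load in the termination check, so that a shutdown thread observing a zero count also observes every side effect (such as replies sent) performed before the decrement. The sketch below is illustrative; only the counter name `unprocessedJobs_` is taken from the PR.

```cpp
#include <atomic>
#include <thread>

std::atomic<int> unprocessedJobs_{0};
int repliesSent = 0;  // stands in for a job's visible side effect

void worker() {
    repliesSent = 1;                                           // side effect first
    unprocessedJobs_.fetch_sub(1, std::memory_order_release);  // then publish
}

bool canExit() {
    // Pairs with the worker's release: reading 0 here guarantees that
    // every write made before the decrement is visible to this thread.
    return unprocessedJobs_.load(std::memory_order_acquire) == 0;
}

int runDemo() {
    unprocessedJobs_.fetch_add(1, std::memory_order_relaxed);
    std::thread t(worker);
    while (!canExit()) { std::this_thread::yield(); }
    t.join();
    return repliesSent;  // 1, by the acquire/release pairing
}
```

With relaxed ordering on both sides, the zero count could in principle become visible before the reply's side effects, which is exactly what the review warns against during termination.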
Force-pushed from a651659 to 85ac1c5.
It was noticed that the CS could gracefully stop without sending all the expected responses. Those missing responses could cause the master to invalidate some chunk parts when everything was expected to be OK.

The cause of this behavior is the span of time between the instant a worker thread (from a jobPool) picks up a new job and the instant the job is processed. During that period, the job is not counted by getJobCount, so the jobPool may appear empty while there are still jobs in need of processing. This was happening in the jobPools attending requests from clients, and a similar issue affected the jobPools attending requests from the master.

The solution improves the accuracy of the stop conditions to account for this window.

Signed-off-by: Dave <dave@leil.io>
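The window described in the commit message can be demonstrated deterministically. In this sketch, `getJobCount` is the only identifier taken from the message; the rest of the names are assumptions for illustration. Between `takeJob()` and job completion, the queue-depth check reads zero while an unprocessed-job counter still reports pending work.

```cpp
#include <cstddef>
#include <deque>

// Single-threaded illustration of the race window; names other than
// getJobCount are assumptions, not the SaunaFS API.
class JobPool {
public:
    void addJob(int id) { queue_.push_back(id); ++unprocessed_; }
    int takeJob() {
        int id = queue_.front();
        queue_.pop_front();
        return id;  // job is now "in flight": dequeued but not processed
    }
    void jobDone() { --unprocessed_; }

    std::size_t getJobCount() const { return queue_.size(); }  // old check
    std::size_t getUnprocessedJobCount() const { return unprocessed_; }  // new check

private:
    std::deque<int> queue_;
    std::size_t unprocessed_ = 0;
};
```

A shutdown condition built on `getJobCount()` would fire inside that window; one built on `getUnprocessedJobCount()` waits until the in-flight job is actually done.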
Force-pushed from 85ac1c5 to 916137c.