fix: improve CS behavior with very low bgjobscnt#760
Conversation
Summary of ChangesHello @dmga44, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses a critical performance issue where chunkserver network workers could block indefinitely when attempting to add jobs to a full Highlights
Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
The pull request introduces significant changes to the ProducerConsumerQueue and ChunkserverEntry classes to improve the chunkserver's behavior under low BGJOBSCNT_PER_NETWORK_WORKER settings. The put function in ProducerConsumerQueue is now non-blocking, allowing the queue's maxSize to be violated, which is a critical change in its behavior. The ChunkserverEntry now checks if the job pool is full after high-level operations and closes the connection if it is, transitioning to a new IOFinish state. The renaming of WriteFinish to IOFinish better reflects its broader use. The changes in pcqueue_unittest.cc reflect the new non-blocking put behavior by using tryPut in tests. The new test test_cs_very_low_bgjobscnt.sh validates the improved behavior in a low job count scenario. Overall, the changes address a potential responsiveness issue and enhance the system's resilience, but the modification of ProducerConsumerQueue::put to be non-blocking and potentially violate maxSize should be carefully considered for its implications across the codebase.
There was a problem hiding this comment.
Pull request overview
This PR improves the chunkserver's behavior under very low BGJOBSCNT_PER_NETWORK_WORKER settings by making the ProducerConsumerQueue::put() method non-blocking. Previously, when the job pool was full, network workers would block on put() calls, causing severe responsiveness degradation.
Changes:
- Modified ProducerConsumerQueue::put() to be non-blocking and allow maxSize violations
- Added backpressure mechanism by checking job pool fullness after completing high-level operations and transitioning to IOFinish state if full
- Renamed WriteFinish state to IOFinish to better reflect its broader usage
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 11 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/test_suites/ShortSystemTests/test_cs_very_low_bgjobscnt.sh | New test to verify system behavior with very low background job counts |
| src/common/pcqueue_unittest.cc | Updated unit tests to use tryPut() instead of put() where appropriate |
| src/common/pcqueue.h | Changed put() signature to void, removed notFull_ condition variable, updated documentation |
| src/common/pcqueue.cc | Removed blocking logic from put(), removed notFull_ notifications from get methods, updated sizeLeft() calculation |
| src/chunkserver/bgjobs.h | Added isFull() method to JobPool interface |
| src/chunkserver/bgjobs.cc | Implemented isFull() method delegating to jobsQueue->isFull() |
| src/chunkserver/chunkserver_entry.h | Renamed WriteFinish to IOFinish, updated documentation |
| src/chunkserver/chunkserver_entry.cc | Updated all WriteFinish references to IOFinish, added job pool fullness check in writeEnd() |
| src/chunkserver/network_worker_thread.cc | Updated state references from WriteFinish to IOFinish |
| src/chunkserver/chunk_high_level_ops.cc | Changed read operations to use IOFinish on errors, added job pool fullness checks |
6b44c59 to
af7b7e2
Compare
af7b7e2 to
654e9b8
Compare
The current implementation of ProducerConsumerQueue::put blocks the caller thread until the queue is not full. One of such callers are the network workers of the chunkserver. This can cause a significant decrease in the system responsiveness when the parameter BGJOBSCNT_PER_NETWORK_WORKER is low given its workload. The changes proposed are the following: - make the put function non-blocking, i.e the maxSize parameter of the pcqueues can be violated. - check if the job pool is full at the instant of finishing some high level operation. If the job pool is full close the connection (csentry) after sending pending statuses or continue using the connection (idle state). The intended idea is to allow the pcqueue to have its maxSize limit violated but not much. The other instances of that class don't set any element limit. Side changes: - rename the WriteFinish state to IOFinish, to better reflect it now comprises other cases. Signed-off-by: Dave <dave@leil.io>
654e9b8 to
406e0c6
Compare
The current implementation of ProducerConsumerQueue::put blocks the caller thread until the queue is not full. One of such callers are the network workers of the chunkserver. This can cause a significant decrease in the system responsiveness when the parameter BGJOBSCNT_PER_NETWORK_WORKER is low given its workload.
The changes proposed are the following:
pcqueues can be violated.
level operation. If the job pool is full close the connection (csentry) after sending pending statuses or continue using the connection (idle state).
The intended idea is to allow the pcqueue to have its maxSize limit violated but not much. The other instances of that class don't set any element limit.
Side changes:
comprises other cases.
A new test was added to check the expected behavior.
Signed-off-by: Dave dave@leil.io