fix: improve CS behavior with very low bgjobscnt #760

Merged
dmga44 merged 1 commit into dev from fix-cs-behavior-when-very-low-bgjobscnt on Feb 24, 2026

@dmga44 (Collaborator) commented Feb 22, 2026

The current implementation of ProducerConsumerQueue::put blocks the calling thread until the queue is no longer full. The chunkserver's network workers are among those callers, so system responsiveness can drop significantly when the parameter BGJOBSCNT_PER_NETWORK_WORKER is low relative to the workload.

The changes proposed are the following:

  • make the put function non-blocking, i.e., the maxSize parameter of the
    pcqueues can be violated.
  • check whether the job pool is full at the moment a high-level
    operation finishes. If the job pool is full, close the connection (csentry) after sending pending statuses; otherwise, keep using the connection (Idle state).

The intent is to let the pcqueue exceed its maxSize limit, but only by a small margin. The other instances of that class don't set any element limit.
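
As a rough illustration of the queue semantics described above, here is a minimal sketch, not the actual ProducerConsumerQueue code. The class name BoundedQueueSketch, the int payload, and the tryGet signature are assumptions made for this example; only the put/tryPut split and the maxSize idea come from the description.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <mutex>

// Hypothetical stand-in for a bounded producer/consumer queue.
// It only illustrates "put never blocks, tryPut respects the limit".
class BoundedQueueSketch {
public:
    explicit BoundedQueueSketch(std::size_t maxSize) : maxSize_(maxSize) {}

    // Non-blocking: always appends, even if the queue is already at or
    // beyond maxSize_, so the limit can be exceeded.
    void put(int value) {
        std::lock_guard<std::mutex> lock(mutex_);
        items_.push_back(value);
    }

    // Respects the limit: refuses the element when the queue is full.
    bool tryPut(int value) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (items_.size() >= maxSize_) {
            return false;
        }
        items_.push_back(value);
        return true;
    }

    // Non-blocking pop; returns false when there is nothing to take.
    bool tryGet(int &out) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (items_.empty()) {
            return false;
        }
        out = items_.front();
        items_.pop_front();
        return true;
    }

    // Full means "at or above the limit", since put may overshoot it.
    bool isFull() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return items_.size() >= maxSize_;
    }

    std::size_t size() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return items_.size();
    }

private:
    mutable std::mutex mutex_;
    std::deque<int> items_;
    std::size_t maxSize_;
};
```

With a limit of 2, two tryPut calls succeed and a third is refused, while a subsequent put still appends and leaves the queue holding three elements. Callers are then expected to notice the full queue and back off, which is what the second bullet above is about.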

Side changes:

  • rename the WriteFinish state to IOFinish, to better reflect that it now
    covers other cases.

A new test was added to check the expected behavior.

Signed-off-by: Dave <dave@leil.io>

@dmga44 dmga44 self-assigned this Feb 22, 2026
@gemini-code-assist

Summary of Changes

Hello @dmga44, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a performance issue where chunkserver network workers could block while attempting to add jobs to a full ProducerConsumerQueue, particularly under configurations with a low BGJOBSCNT_PER_NETWORK_WORKER. The primary solution involves making the ProducerConsumerQueue::put operation non-blocking, allowing it to temporarily exceed its defined capacity. To manage this overflow and prevent system overload, high-level chunkserver operations now explicitly check the job pool's fullness upon completion. Depending on the job pool's state, connections are either closed via the IOFinish state (renamed from WriteFinish) after pending statuses are sent, or returned to the Idle state, improving system responsiveness under heavy load with constrained background job resources.

Highlights

  • Non-blocking ProducerConsumerQueue::put: The ProducerConsumerQueue::put method has been modified to be non-blocking, preventing caller threads from stalling when the queue is full. This allows the queue's maxSize to be temporarily exceeded, with tryPut still respecting the limit.
  • Conditional Connection Closure in Chunkserver: Chunkserver connections (csentry) will now check if the job pool is full after completing high-level operations. If full, the connection transitions to the IOFinish state (a renamed WriteFinish state) to be closed after sending pending statuses; otherwise, it returns to the Idle state.
  • State Renaming: The WriteFinish state in chunkserver components has been renamed to IOFinish to more accurately reflect its broader use for I/O completion and connection closure, not just write errors.
  • JobPool Fullness Check: A new isFull() method has been added to the JobPool class, allowing for explicit checks of the job queue's capacity.
  • ProducerConsumerQueue Refinements: The ProducerConsumerQueue::sizeLeft() calculation was updated to correctly report available space even if currentSize temporarily exceeds maxSize. The notFull_ condition variable was removed as put is no longer blocking.
  • New System Test: A new system test has been added to validate chunkserver behavior under very low background job counts (BGJOBSCNT_PER_NETWORK_WORKER), ensuring stability and responsiveness in such scenarios.
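
The sizeLeft() point in the highlights can be illustrated with a small sketch. It assumes the counters are unsigned, in which case a plain maxSize - currentSize would wrap around once the queue overshoots its limit. The function names below are hypothetical and the sketch assumes a positive limit; it is not the code from pcqueue.cc.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: remaining capacity, clamped at zero so that an
// overfull queue reports "no space left" instead of a wrapped-around value.
std::size_t sizeLeftSketch(std::size_t currentSize, std::size_t maxSize) {
    return currentSize >= maxSize ? 0 : maxSize - currentSize;
}

// Hypothetical helper: a queue counts as full when it is at or above the
// limit, because a non-blocking put may push it past maxSize.
bool isFullSketch(std::size_t currentSize, std::size_t maxSize) {
    return currentSize >= maxSize;
}
```

For example, with a limit of 3 and 5 queued elements, sizeLeftSketch returns 0 and isFullSketch returns true.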

Changelog
  • src/chunkserver/bgjobs.cc
    • Added isFull() method to JobPool to check if its internal jobsQueue is full.
  • src/chunkserver/bgjobs.h
    • Declared the new isFull() method in the JobPool class.
  • src/chunkserver/chunk_high_level_ops.cc
    • Changed state transition from Idle to IOFinish in readFinishedCallback to close the connection.
    • Modified readContinue to transition to IOFinish if the worker job pool is full, otherwise to Idle.
    • Updated state transitions from WriteFinish to IOFinish in updateUsingWriteStatusAndReply.
    • Updated comments to reflect the IOFinish state.
  • src/chunkserver/chunkserver_entry.cc
    • Replaced all occurrences of State::WriteFinish with State::IOFinish in error handling and state transitions.
    • Introduced a check for workerJobPool()->isFull() after writeEnd to conditionally transition to State::IOFinish instead of State::Idle.
  • src/chunkserver/chunkserver_entry.h
    • Renamed the WriteFinish enum member to IOFinish and updated its description.
    • Updated the documentation for fwdError to refer to IOFinish.
  • src/chunkserver/network_worker_thread.cc
    • Updated switch statements and if conditions to use ChunkserverEntry::State::IOFinish instead of WriteFinish.
  • src/common/pcqueue.cc
    • Adjusted sizeLeft() calculation to handle currentSize_ potentially exceeding maxSize_.
    • Modified put() method to be void and removed the blocking notFull_.wait() and the return true/false logic.
    • Removed notFull_.notify_one() calls from get() and tryGet().
  • src/common/pcqueue.h
    • Updated class and put() method comments to clarify non-blocking behavior and maxSize handling.
    • Changed put() method signature to void.
    • Removed the std::condition_variable notFull_ member.
  • src/common/pcqueue_unittest.cc
    • Replaced queue.put() calls with queue.tryPut() in relevant test cases to align with the new non-blocking put behavior.
  • tests/test_suites/ShortSystemTests/test_cs_very_low_bgjobscnt.sh
    • Added a new shell script to test chunkserver behavior with a very low BGJOBSCNT_PER_NETWORK_WORKER setting, involving writing and reading multiple files.

@gemini-code-assist bot left a comment

Code Review

The pull request introduces significant changes to the ProducerConsumerQueue and ChunkserverEntry classes to improve the chunkserver's behavior under low BGJOBSCNT_PER_NETWORK_WORKER settings. The put function in ProducerConsumerQueue is now non-blocking, allowing the queue's maxSize to be violated, which is a critical change in its behavior. The ChunkserverEntry now checks if the job pool is full after high-level operations and closes the connection if it is, transitioning to the IOFinish state. The renaming of WriteFinish to IOFinish better reflects its broader use. The changes in pcqueue_unittest.cc reflect the new non-blocking put behavior by using tryPut in tests. The new test test_cs_very_low_bgjobscnt.sh validates the improved behavior in a low job count scenario. Overall, the changes address a potential responsiveness issue and enhance the system's resilience, but the modification of ProducerConsumerQueue::put to be non-blocking and potentially violate maxSize should be carefully considered for its implications across the codebase.

Copilot AI left a comment

Pull request overview

This PR improves the chunkserver's behavior under very low BGJOBSCNT_PER_NETWORK_WORKER settings by making the ProducerConsumerQueue::put() method non-blocking. Previously, when the job pool was full, network workers would block on put() calls, causing severe responsiveness degradation.

Changes:

  • Modified ProducerConsumerQueue::put() to be non-blocking and allow maxSize violations
  • Added backpressure mechanism by checking job pool fullness after completing high-level operations and transitioning to IOFinish state if full
  • Renamed WriteFinish state to IOFinish to better reflect its broader usage
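
The backpressure check listed above boils down to a single decision taken when a high-level operation completes. The sketch below uses hypothetical types (StateSketch, JobPoolSketch) and a hypothetical function name; only the Idle and IOFinish state names and the isFull() check come from this pull request.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical subset of the connection states mentioned in this PR.
enum class StateSketch { Idle, IOFinish };

// Hypothetical job pool that only tracks how many jobs are queued.
struct JobPoolSketch {
    std::size_t queuedJobs;
    std::size_t maxJobs;

    // At or above the limit counts as full, since put may overshoot it.
    bool isFull() const { return queuedJobs >= maxJobs; }
};

// Decide what a connection does once a high-level operation completes:
// a full pool leads to IOFinish (send pending statuses, then close),
// otherwise the connection goes back to Idle and can be reused.
StateSketch nextStateAfterOperation(const JobPoolSketch &pool) {
    return pool.isFull() ? StateSketch::IOFinish : StateSketch::Idle;
}
```

Because the check runs only when an operation finishes, the queue can overshoot its limit by the jobs already in flight, which matches the stated goal of allowing the limit to be exceeded by a small margin.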

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 11 comments.

| File | Description |
| --- | --- |
| tests/test_suites/ShortSystemTests/test_cs_very_low_bgjobscnt.sh | New test to verify system behavior with very low background job counts |
| src/common/pcqueue_unittest.cc | Updated unit tests to use tryPut() instead of put() where appropriate |
| src/common/pcqueue.h | Changed put() signature to void, removed notFull_ condition variable, updated documentation |
| src/common/pcqueue.cc | Removed blocking logic from put(), removed notFull_ notifications from get methods, updated sizeLeft() calculation |
| src/chunkserver/bgjobs.h | Added isFull() method to JobPool interface |
| src/chunkserver/bgjobs.cc | Implemented isFull() method delegating to jobsQueue->isFull() |
| src/chunkserver/chunkserver_entry.h | Renamed WriteFinish to IOFinish, updated documentation |
| src/chunkserver/chunkserver_entry.cc | Updated all WriteFinish references to IOFinish, added job pool fullness check in writeEnd() |
| src/chunkserver/network_worker_thread.cc | Updated state references from WriteFinish to IOFinish |
| src/chunkserver/chunk_high_level_ops.cc | Changed read operations to use IOFinish on errors, added job pool fullness checks |

@dmga44 dmga44 force-pushed the fix-cs-behavior-when-very-low-bgjobscnt branch 2 times, most recently from 6b44c59 to af7b7e2 on February 22, 2026 17:19
@ralcolea (Contributor) left a comment

Great job @dmga44! 👍 💪 🚀

@lgsilva3087 (Contributor) left a comment

LGTM

@dmga44 dmga44 force-pushed the fix-cs-behavior-when-very-low-bgjobscnt branch from af7b7e2 to 654e9b8 on February 23, 2026 16:49
@dmga44 dmga44 force-pushed the fix-cs-behavior-when-very-low-bgjobscnt branch from 654e9b8 to 406e0c6 on February 23, 2026 22:00
@dmga44 dmga44 merged commit 364242a into dev Feb 24, 2026
11 checks passed
@dmga44 dmga44 deleted the fix-cs-behavior-when-very-low-bgjobscnt branch February 24, 2026 12:07