Conversation
Summary of Changes

Hello @ralcolea, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request resolves a stability issue in uRaft TCP status queries that arose from blocking operations on the main thread, which caused delays and sporadic failures for status clients. By introducing a dedicated thread for status serving and implementing a cached response mechanism, the system can now handle status requests efficiently and reliably. This change prevents contention with core Raft operations, ensuring the status endpoint remains responsive even under load.
Code Review
This pull request is a solid improvement for stabilizing the uRaft TCP status queries. Decoupling the status serving via a dedicated thread and caching the response is a well-thought-out architectural change that should resolve the reported flakiness. The implementation is generally of high quality, with good attention to threading, resource management, and error handling. I've identified one critical issue that would cause a compilation failure and another area where code can be simplified for better maintainability. After addressing these points, this should be an excellent contribution.
Pull request overview
This PR stabilizes the uRaft TCP status query endpoint by decoupling status serving from Raft execution, preventing sporadic failures when clients disconnect before receiving responses.
Changes:
- Introduced a dedicated io_context and thread for TCP status handling, isolating it from the main Raft operations
- Implemented a frequently-refreshed immutable cache of status responses to serve connections instantly without per-request generation
- Improved error handling and socket lifecycle management throughout the status server
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| src/uraft/uraftstatus.h | Added dedicated threading infrastructure (io_context, work guard, thread), atomic status cache, cache refresh timer, and deleted copy/move operations |
| src/uraft/uraftstatus.cc | Implemented status caching mechanism, dedicated status thread, improved error logging, refactored status generation with defensive coding, and enhanced socket lifecycle management |
Force-pushed from 6324953 to 9cbf123 (Compare)
/gemini review
Code Review
This pull request effectively addresses the instability of TCP status queries by decoupling the status server from the main Raft logic. The introduction of a dedicated io_context and thread for status handling, along with a cached status response, is a solid architectural improvement that eliminates contention and potential blocking issues. The code quality is high, with robust resource management, improved error handling, and good use of modern C++ features. The changes make the uRaft status endpoint more reliable and performant under load. The AI-generated comment regarding redundant error_code.clear() calls is valid and has been retained.
Force-pushed from 0bb3117 to e02d8c4 (Compare)
/gemini review
Code Review
This pull request significantly refactors the uRaft status server to improve stability and responsiveness. By introducing a dedicated I/O context and thread for status handling, and implementing a frequently refreshed, immutable cache for status responses, the changes effectively decouple status serving from Raft execution. This addresses the issue of blocking operations in the main thread causing delays and sporadic query failures. The refactoring also includes improved socket lifecycle management and more robust error reporting. Overall, these are positive and well-considered architectural changes.
Force-pushed from e02d8c4 to 33a00b1 (Compare)
/gemini review
Code Review
This pull request effectively addresses the instability of uRaft TCP status queries by decoupling the status server from the main Raft logic. The introduction of a dedicated io_context and thread for status handling, along with a frequently refreshed status cache, is a solid architectural improvement that should eliminate the contention and timing issues. The code is well-structured, with robust error handling for network operations and a clean separation of concerns. The suggestion to optimize status string generation aligns with our guidelines for performance-critical sections.
Force-pushed from 33a00b1 to f4f7712 (Compare)
/gemini review
Code Review
This pull request effectively addresses the instability of TCP status queries by decoupling the status serving logic from the main Raft execution thread. The introduction of a dedicated io_context and thread for status handling, combined with a frequently refreshed, immutable cache for the status response, is a robust solution that eliminates the contention and timing issues of the previous implementation.
The changes are well-executed:
- A new `status_io_context` and `status_thread_` correctly isolate the status server.
- The use of `std::atomic<std::shared_ptr<...>>` with acquire-release semantics for `cached_status_` provides an elegant and correct lock-free mechanism for sharing status data between threads.
- Error handling for socket operations, especially for `accept`, is now much more comprehensive and resilient, with clear classification of errors and appropriate retry strategies (immediate, backoff, or stop).
- Resource management is improved, with proper cleanup of threads and Asio objects in the destructor, and better encapsulation in the header files (e.g., making members private, deleting copy/move operations).
- The implementation of `storeData` is made more efficient and safer by using `fmt::format_to` and fixing a potential integer underflow bug.
Overall, this is a high-quality change that significantly improves the stability and robustness of the uRaft status endpoint.
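The integer-underflow class of bug mentioned for `storeData` can be illustrated in isolation. This is a hypothetical sketch (`kHeaderSize` and the helper names are illustrative, not uRaft's actual code): subtracting a header size from a smaller unsigned length wraps around to a huge value instead of signalling an error.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kHeaderSize = 4;  // assumed fixed header length

// Unsafe: when total < kHeaderSize the subtraction wraps to a huge value.
std::size_t payload_size_unsafe(std::size_t total) {
    return total - kHeaderSize;
}

// Safe: validate before subtracting, report failure explicitly.
bool payload_size_safe(std::size_t total, std::size_t &out) {
    if (total < kHeaderSize) return false;
    out = total - kHeaderSize;
    return true;
}
```

A caller that then allocates or copies `payload_size_unsafe(2)` bytes would attempt an enormous operation, which is why validating before the subtraction matters.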
Force-pushed from f4f7712 to 1f0fa70 (Compare)
/gemini review
Code Review
This pull request significantly improves the stability of the uRaft TCP status endpoint by decoupling status serving from the main Raft logic. The introduction of a dedicated thread, an io_context, and a cached status response is a robust solution to the previously observed random query failures. The error handling has been made much more resilient, and the overall code quality is high. My review includes a few suggestions to further enhance code clarity and resource management in the new implementation.
Force-pushed from 1f0fa70 to 7a085e7 (Compare)
/gemini review
Code Review
This pull request significantly improves the stability and performance of the uRaft TCP status server. By introducing a dedicated thread and io_context for status requests, along with a frequently updated cache, it effectively decouples status serving from the main Raft logic. This prevents blocking operations in the Raft thread from affecting status queries. The error handling for socket operations is now much more robust, with detailed classification of accept errors and proper backoff strategies to prevent CPU spinning. The implementation also shows great attention to detail with modern C++ practices, such as improved resource management in destructors, efficient string formatting with fmt, and correct use of atomics for thread-safe data sharing.
I have one minor suggestion to improve code clarity in the destructor's error handling. Overall, this is an excellent and well-executed refactoring.
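The accept-error classification with backoff described above might look roughly like this. The `AcceptAction` enum name appears in the reviewed diff, but the specific errno sets are assumptions for illustration, not uRaft's actual mapping:

```cpp
#include <cassert>
#include <cerrno>

enum class AcceptAction { ImmediateRetry, RetryWithBackoff, Stop };

AcceptAction classifyAcceptError(int err) {
    switch (err) {
    case 0:
    case ECONNABORTED:  // client vanished between the accept queue and accept()
    case EINTR:         // interrupted syscall, harmless to retry at once
        return AcceptAction::ImmediateRetry;
    case EMFILE:        // descriptor/buffer exhaustion: retrying immediately
    case ENFILE:        // would spin the CPU, so back off via a timer
    case ENOBUFS:
    case ENOMEM:
        return AcceptAction::RetryWithBackoff;
    default:            // e.g. EBADF after shutdown: stop accepting entirely
        return AcceptAction::Stop;
    }
}
```

The backoff case is what prevents the CPU-spinning failure mode: resource exhaustion errors repeat instantly if retried in a tight loop.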
Force-pushed from 7a085e7 to 85f729c (Compare)
Force-pushed from 85f729c to 95b1ecd (Compare)
```cpp
#include <fmt/format.h>

#include "uraftstatus.h"
namespace {
```
Consider an empty line before namespace.
Nice catch 👍
Applied.
src/uraft/uraftstatus.cc
Outdated
```cpp
	syslog(priority, "uRaftStatus: %s: %s (%d)", what, error.message().c_str(), error.value());
}

void shutdownAndClose(boost::asio::ip::tcp::socket &socket) {
```
Consider shutDownAndCloseSocket
I liked your proposal. Updated.
```cpp
}

void uRaftStatusConnection::set_data(std::shared_ptr<const std::vector<uint8_t>> data) {
	data_ = std::move(data);
```
Consider adding this comment:
```cpp
// Safe to move: the parameter is passed by value (not a reference), so the caller's
// shared_ptr is unaffected. The underlying vector data remains valid until this
// connection destroys its reference, even if the cache is refreshed with new data.
```
Very nice suggestion. This is not straightforward.
Thanks for pointing it out! 👍
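The property discussed in this thread can be checked directly. A simplified sketch (the `Connection` type here is illustrative, standing in for `uRaftStatusConnection`): moving from a by-value `shared_ptr` parameter steals the caller's *copy*, leaving the caller's own pointer and refcount untouched.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

struct Connection {
    std::shared_ptr<const std::vector<uint8_t>> data_;

    // Pass by value: the caller's shared_ptr is copied into `data`,
    // and std::move then transfers only that copy into data_.
    void set_data(std::shared_ptr<const std::vector<uint8_t>> data) {
        data_ = std::move(data);
    }
};
```

After `set_data`, both the caller and the connection hold a reference, so the underlying vector outlives any cache refresh for as long as either side needs it.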
```cpp
	boost::system::error_code result_error;
	conn->socket().set_option(boost::asio::ip::tcp::no_delay(true), result_error);
	logAsioError(LOG_WARNING, "set_option(TCP_NODELAY) failed", result_error);
```
Consider adding this comment:
```cpp
// The atomic load increases the shared_ptr ref count, ensuring the data remains valid
// until the connection completes.
```
Great suggestion! 👍
This comment is really important. It explains why it's always safe to rely on the value of cached_status_.
src/uraft/uraftstatus.cc
Outdated
```cpp
	case AcceptAction::ImmediateRetry:
		acceptConnection();
		break;
```
Consider removing the empty lines, as in other code sections.
src/uraft/uraftstatus.cc
Outdated
```cpp
	}

	acceptConnection();
	switch(classifyAcceptError(error)) {
```
Format this line please.
```cpp
	boost::asio::deadline_timer status_cache_timer_;
	boost::asio::deadline_timer retry_timer_;
	std::atomic<std::shared_ptr<const std::vector<uint8_t>>> cached_status_;
```
Consider documenting this variable.
Nice suggestion! 👍
Documentation was added for `cached_status_` and the other attributes related to this change.
Consider including in the tests a scenario with multiple concurrent connections.
Nice suggestion 👍
I will update the test in the Ansible repo (test-fix-uraft-tcp-status) to run in multiple servers and generate more concurrent connections.
```diff
@@ -1,20 +1,100 @@
-#include "common/platform.h"
+#include "common/platform.h" // IWYU pragma: keep
```
I think that copyright should be added on top of this file.
The copyright is missing for some uRaft files like uraftstatus.{h,cc}, uraftcontroller.{h,cc}, and uraft.{h,cc}.
However, this change will be done in a dedicated PR.
Previously, the uRaft process executed blocking helper operations (e.g. fork(), poll(), read()) inside runCommand(), all on the same io_context thread that also handled TCP status accept/write operations. As a result, the status handler could be delayed under load, and short-lived clients (such as nc) could disconnect before a response was written. This made the TCP status endpoint sensitive to contention and timing, leading to sporadic query failures.

This commit restructures the uRaft status server to decouple status serving from Raft execution, eliminating the random failures observed during TCP status queries.

Key changes:
- Introduce a dedicated io_context and thread for TCP status handling.
- Maintain a frequently refreshed, immutable cache of the serialized status response.
- Serve incoming status connections directly from the cached snapshot instead of generating responses per connection.
- Improve socket lifecycle handling and error reporting.
- Replace boost::format() by fmt::format() in storeData() function.

This commit fixes LS-282.

Signed-off-by: Crash <crash@leil.io>
Force-pushed from 95b1ecd to e97b110 (Compare)
After running some tests with different commands and the `nc` client, the failures are always reported when using a command like `echo | nc localhost 9428`.

🔥 What happens when using `echo | nc`

Client behavior: `echo` writes a single newline, so `nc` sends it, reaches EOF on stdin, and has nothing more to send; it can close its side of the connection almost immediately.

Now the server accepts the connection and writes the status response.

Here's the key: because the client has already sent its newline and may close right away, the server's write races with the client's disconnect. If the client closes very quickly after sending the newline, the full sequence may be: connect, send newline, close, and only then the server's write.

When that happens, the response never reaches the client, so `grep state` finds nothing and the iteration is reported as a failure.

🎯 This is a classic TCP race. When a client connects, sends data, and closes immediately, the server may observe the data, an EOF, or a connection reset, depending on timing. If the server writes after the client is already gone, the write fails (broken pipe or connection reset) and the response is lost.
The proof is that there are commands that do not produce the original failures:

```shell
# Original command that produces failures
while sleep 0.01; do echo | nc localhost 9428 | grep state -q || echo fail $(date +%T); done

# Successful command
while sleep 0.01; do nc localhost 9428 < /dev/null | grep state -q || echo fail $(date +%T); done

# Another successful command
while sleep 0.01; do echo -n "" | nc localhost 9428 | grep state -q || echo fail $(date +%T); done
```
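The write-after-close outcome behind these failures can be reproduced in isolation. This is a POSIX-only sketch (not uRaft code) using a Unix-domain socketpair: once the peer has closed, a write on a stream socket fails with EPIPE instead of delivering the response.

```cpp
#include <cassert>
#include <cerrno>
#include <csignal>
#include <sys/socket.h>
#include <unistd.h>

// Returns the errno produced by writing after the peer closed (0 on success,
// -1 if the socketpair could not be created).
int write_after_peer_close() {
    signal(SIGPIPE, SIG_IGN);  // surface EPIPE as an errno, not a fatal signal
    int fds[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0) return -1;
    close(fds[1]);                        // the "client" disconnects early
    char byte = 's';
    ssize_t n = write(fds[0], &byte, 1);  // the "server" writes its response
    int err = (n < 0) ? errno : 0;
    close(fds[0]);
    return err;
}
```

Over TCP the same race can also surface as ECONNRESET rather than EPIPE, depending on whether the client's RST has already been processed; either way the response never reaches the client.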
For testing the new feature, a `test-fix-uraft-tcp-status` branch was created in the Ansible repo. This branch modifies the `uraft_deadlock_multiple_elections` scenario to validate the uRaft TCP status stability. The new test is expected to fail for SaunaFS v5.6.0 and should pass for the branches including the fix.
Signed-off-by: Crash <crash@leil.io>