
Conversation

@JacksonYao287 (Contributor) commented Jan 7, 2026

Expose nuraft_mesg to the upper layer, so that it can register its own RPC calls by calling bind_data_service_request.

@JacksonYao287 force-pushed the add-interface-to-add-client-data-rpc branch 2 times, most recently from bf220c9 to b069b44 on January 8, 2026 00:54
// becoming a leader.

RD_LOGD(NO_TRACE_ID, "become_leader_cb: setting traffic_ready_lsn from {} to {}", current_gate, new_gate);
m_listener->on_become_leader(m_group_id);
Contributor

Just curious about the purpose of on_become_xxx. Should we move it to the beginning of become_xxx_cb to call the upper callback first, similar to how handle_commit does?

Contributor Author

on_become_leader and on_become_follower will be used for HomeObject scrubbing. When scrubbing is ongoing and a leader switch happens, the old leader becomes a follower, so on_become_follower is called on that node, where we stop the scrubbing thread that requests scrub results from the other two members. One of the followers becomes the leader, so on_become_leader is called on that node, where we read the scrub superblk and start the scrubbing thread to request scrub results from the other two members.

Should we move it to the beginning
No need to do this now. What we want is to be notified of the leader-change event; it does not matter whether the callback is invoked at the beginning or the end.
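
For context, a minimal sketch of how a HomeObject-side listener could use these callbacks (the callback names come from this diff and discussion; the exact signatures, the ReplDevListener subclass, the scrubber member, and its start/stop helpers are assumptions, and the other listener overrides are omitted):

// Hedged sketch only; not code from this PR.
class ScrubAwareListener : public homestore::ReplDevListener {
public:
    void on_become_leader(homestore::group_id_t group_id) override {
        // New leader: read the scrub superblk and start the scrubbing thread that
        // requests scrub results from the other two members.
        m_scrubber.start(group_id);
    }

    void on_become_follower(homestore::group_id_t group_id) override {
        // Old leader stepping down: stop the scrubbing thread.
        m_scrubber.stop(group_id);
    }

    // ... remaining ReplDevListener overrides omitted for brevity ...

private:
    Scrubber m_scrubber; // hypothetical scrubbing driver in HomeObject
};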

@xiaoxichen (Collaborator) commented Jan 8, 2026

Can we do a POC regarding the required communication primitives? I am not sure whether it is right to break the HS/HO boundary here, or whether adding a generic "USER" handler would be good enough.

We can do the POC of HO in the HS UT (which makes it easier to implement ReplDevListener); that would give the team a better understanding of the requirements.

@JacksonYao287 (Contributor Author)

I am now working on the POC code with this change to see if it is workable.

@JacksonYao287 force-pushed the add-interface-to-add-client-data-rpc branch from b069b44 to 2ee228f on January 8, 2026 12:08
@codecov-commenter commented Jan 8, 2026

⚠️ Please install the Codecov GitHub app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 18.75000% with 13 lines in your changes missing coverage. Please review.
✅ Project coverage is 48.48%. Comparing base (1a0cef8) to head (5ec0d54).
⚠️ Report is 307 commits behind head on master.

Files with missing lines | Patch % | Lines
src/lib/replication/repl_dev/raft_repl_dev.cpp | 0.00% | 6 Missing and 1 partial ⚠️
src/lib/replication/repl_dev/solo_repl_dev.h | 0.00% | 6 Missing ⚠️
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #846      +/-   ##
==========================================
- Coverage   56.51%   48.48%   -8.03%     
==========================================
  Files         108      110       +2     
  Lines       10300    12772    +2472     
  Branches     1402     6137    +4735     
==========================================
+ Hits         5821     6193     +372     
+ Misses       3894     2493    -1401     
- Partials      585     4086    +3501     

☔ View full report in Codecov by Sentry.

@JacksonYao287 force-pushed the add-interface-to-add-client-data-rpc branch 2 times, most recently from d8fa192 to 0915f3e on January 15, 2026 07:53
@JacksonYao287 force-pushed the add-interface-to-add-client-data-rpc branch from 0915f3e to 5ec0d54 on January 15, 2026 08:03
@JacksonYao287 (Contributor Author) commented Jan 15, 2026

This PR adds three APIs to repl_dev (exposing the nuraft_mesg data service to the upper layer):
1. add_data_rpc_service: register an RPC call with a name
2. data_request_unidirectional: send unidirectional data with a request name through the nuraft_mesg data service
3. data_request_bidirectional: send bidirectional data with a request name through the nuraft_mesg data service
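
A hedged usage sketch of these three APIs (only the API names and the dest parameter/return types are visible in this PR; the remaining parameter order, the payload type, and the result handling below are assumptions):

// Sketch only: parameter order and payload type are assumed, not taken from this PR.
// 1. Register a named data RPC handler; a generic lambda is used so the sketch does
//    not commit to the exact data_service_request_handler_t signature.
repl_dev->add_data_rpc_service("scrub_result_rpc",
    [](auto const& incoming_buf, auto& rpc_data) {
        // deserialize incoming_buf, do the work, and reply through rpc_data
    });

// 2. One-way send of a payload to a destination (fire and forget).
repl_dev->data_request_unidirectional(dest, "scrub_result_rpc", payload);

// 3. Request/response send; the reply is delivered asynchronously
//    (assuming a folly::SemiFuture-style result, as nuraft_mesg's async results are).
repl_dev->data_request_bidirectional(dest, "scrub_result_rpc", payload)
    .deferValue([](auto&& response) {
        // compare the remote scrub result against the local one
    });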

I have written a POC to test this in the HomeObject layer, where the deserialized VSMs are transferred through the data channel and then compared with the leader's copy. Please see
https://github.com/JacksonYao287/HomeObject/blob/a9928f1dbe1dd4695f6154c964ee5052b2a62a72/src/lib/homestore_backend/tests/hs_scrubber_tests.cpp#L25

You can build homestore using this PR to create a local homestore Conan repo, then use the following branch to build HomeObject, and run homestore_scrubber_test to see how it works.
https://github.com/JacksonYao287/HomeObject/tree/add-separate-data-rpc

@xiaoxichen @Besroy any suggestions on adding these three APIs?

data_service_request_handler_t const& request_handler) = 0;

// send a unidirectional data rpc to dest with request_name and cli_buf
virtual nuraft_mesg::NullAsyncResult data_request_unidirectional(nuraft_mesg::destination_t const& dest,
Collaborator

As a generic interface we probably cannot use nuraft_mesg::XXXX; we can use either

replica_member_info

or

replica_id_t

Contributor Author

using destination_t = std::variant< peer_id_t, role_regex, svr_id_t >

destination_t supports several types of input params, which makes it more adaptable; maybe another upper layer (not HomeObject) wants to use svr_id_t or role_regex.

Collaborator

We can re-encapsulate or redefine a type, that is fine... but we should not expose a nuraft type; it breaks the boundary.
HO doesn't need to care whether HS uses nuraft_mesg for data_request_unidirectional, its own gRPC, or HTTP, TCP, FTP, whatever.
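
For illustration, one hedged way to picture that re-encapsulation (all names here are hypothetical, not from this PR): homestore owns a small destination type and translates it internally to whatever its transport expects, so the upper layer never includes a nuraft_mesg header.

#include <variant>

// Hypothetical homestore-owned destination type (sketch only).
struct to_leader_t {};  // "route to the current leader" selector
using data_rpc_destination_t = std::variant< homestore::replica_id_t, to_leader_t >;

// Inside RaftReplDev this would be converted to the transport's own destination
// type (currently nuraft_mesg::destination_t); the conversion is omitted here.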

Contributor Author

Agree, and actually I have already tried to expose a clean API, but it is a little complicated.

Coming to (HTTP, TCP, FTP, whatever): when a data transfer fails, we need to tell the upper layer the detailed failure info (why it failed), so IMO there is no need for such a high-level re-encapsulation. Since we actually use gRPC to transfer data in the bottom layer, when a gRPC error happens we should return it to the upper layer so that it has a clear idea of why gRPC failed.

However, in nuraft_mesg, data_service_request_bidirectional never returns a gRPC error; it always converts the gRPC error to a nuraft error with grpc_status_to_nuraft_code, and I am confused about the necessity of that. As a result, at the homestore level, when an error happens we can only get a nuraft-typed error from data_service_request_bidirectional. So if we do encapsulation here, we have three choices:

  1. Do not re-encapsulate; return it directly as it is.

  2. Add a new function to convert the nuraft error back to a gRPC error, but this will lose some detailed information. For example:

    case ::grpc::StatusCode::INVALID_ARGUMENT:
    case ::grpc::StatusCode::UNIMPLEMENTED:
    case ::grpc::StatusCode::UNAUTHENTICATED:
    case ::grpc::StatusCode::PERMISSION_DENIED:
    case ::grpc::StatusCode::RESOURCE_EXHAUSTED:
    case ::grpc::StatusCode::OUT_OF_RANGE:
        return nuraft::cmd_result_code::BAD_REQUEST;

If we want to convert nuraft::cmd_result_code::BAD_REQUEST back to a gRPC error, which gRPC status code should we choose?

  3. Add a crude conversion that lumps all the nuraft error types under a single umbrella error type with no detailed info; then the upper layer only knows the transfer failed, but has no idea why.

For now, I chose option 1, although it does not look ideal.

Collaborator

Well, each layer is expected to lose some information. For example, when you call an HTTP service and get a 500, how do you expect to know the detailed error inside the 500? Is it that the GW cannot talk to Nudata, or that the GW cannot reach the DM?

Another example: you get EIO when your disk IO fails; the specific device error code is hidden.

When an underlying error is important, that is the time to create a specific error mapping for that error. We can add a log line in the error converter if needed.

It is even fine if we use the same type and directly return what upstream returns; just don't directly expose the upstream type.
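
A minimal sketch of that kind of mapping (hypothetical names; the homestore-owned error enum and the converter below are illustrations, not part of this PR):

// Hypothetical homestore-owned error type so the nuraft type stays behind the boundary.
enum class data_rpc_err { success, bad_request, not_leader, timeout, failed };

data_rpc_err to_data_rpc_err(nuraft::cmd_result_code code) {
    switch (code) {
    case nuraft::cmd_result_code::OK: return data_rpc_err::success;
    case nuraft::cmd_result_code::BAD_REQUEST: return data_rpc_err::bad_request;
    case nuraft::cmd_result_code::NOT_LEADER: return data_rpc_err::not_leader;
    case nuraft::cmd_result_code::TIMEOUT: return data_rpc_err::timeout;
    default:
        // Keep the underlying detail in the log even though the returned type is coarser.
        LOGERROR("data rpc failed, nuraft cmd_result_code={}", static_cast< int >(code));
        return data_rpc_err::failed;
    }
}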


bool RaftReplDev::add_data_rpc_service(std::string const& request_name,
data_service_request_handler_t const& request_handler) {
return m_msg_mgr.bind_data_service_request(request_name, m_group_id, request_handler);
Collaborator

Not sure; I am a bit concerned this will be a problem during shutdown: HO will be destructed first, and then the handler becomes invalid.

It is not major atm.

Contributor Author

This is a good point. But actually, we call homestore::shutdown() in HSHomeObject::shutdown(), which means HomeObject is not destructed until the homestore shutdown returns.
