Skip to content

Split storage sync P2P protocol#6262

Closed
martintomazic wants to merge 6 commits intomasterfrom
martin/feature/split-storagesync-p2p
Closed

Split storage sync P2P protocol#6262
martintomazic wants to merge 6 commits intomasterfrom
martin/feature/split-storagesync-p2p

Conversation

@martintomazic
Copy link
Copy Markdown
Contributor

@martintomazic martintomazic commented Jul 10, 2025

Closes #5751, blocked by #6270.

Likely solves root cause of #5750.

Open for discussion:

  1. E2e test to be more confident in integration part, even though existing integration test was helpful at catching bugs.
    • Cons: Additional config parameter, e2e test, and wrapping on the node side that delegates to appropriate client.
    • Alternatives:
      • Adapt oasis tesnet runner to receive path to "legacy" binary (would also require to change CI).
      • Bootstrap a few nodes on the testnet and manually monitor p2p dashboard before release.
  2. Extending rpc client (458f76f) with fallback protocol is likely the way to go.
    • Cons: Since existing storage sync has few methods, it might be simpler for the new clients to take the legacy client and fallback to it manually, avoiding the need for this change and additional fallback manager.
  3. Should new protocols alias legacy protocol definitions/methods given they are semantically equivalent?
    • Pros: Less duplication && less type fighting.
    • Cons1: When removing the legacy protocol, things will have to be copied again (double work).
    • Cons2: In a way not aliasing protects against accidental changes in one manifesting in other.

@netlify
Copy link
Copy Markdown

netlify bot commented Jul 10, 2025

Deploy Preview for oasisprotocol-oasis-core canceled.

Name Link
🔨 Latest commit 96be6bd
🔍 Latest deploy log https://app.netlify.com/projects/oasisprotocol-oasis-core/deploys/6882189d76fc1800085fd0f1

@martintomazic martintomazic force-pushed the martin/feature/split-storagesync-p2p branch 3 times, most recently from e7dc2b2 to 230747b Compare July 14, 2025 11:12
Copy link
Copy Markdown
Contributor Author

@martintomazic martintomazic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to reach consensus on 1...3 from the context before proceeding.

Personally, I think only e2e test would be simplest and make me most confident. There would be some boilerplate though + more resources on the CI, although it will be removed eventually.

Otherwise totally in for making our code testable via mocks etc...

Comment on lines +147 to +149
// Test diff sync protocol.
diffClient := diffsync.NewClient(host, chainContext, runtimeID)
time.Sleep(1 * time.Second)
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is already flaky in the CI. Ideally there should be explicit synchronization or increase the timeout.

Let's see if we opt for e2e only with config.

@martintomazic martintomazic marked this pull request as ready for review July 14, 2025 11:42
@martintomazic martintomazic requested a review from kostko July 17, 2025 07:58
func init() {
peermgmt.RegisterNodeHandler(&peermgmt.NodeHandlerBundle{
ProtocolsFn: func(n *node.Node, chainContext string) []core.ProtocolID {
if !n.HasRoles(node.RoleComputeWorker | node.RoleStorageRPC) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably we should add observer nodes as well (also elsewhere).

func (bm *backendMock) GetCheckpoints(_ context.Context, request *checkpoint.GetCheckpointsRequest) ([]*checkpoint.Metadata, error) {
cpVersion = request.Version
root = node.Root{
// Existing bug: storagesync p2p server does not set namespace, nor does checkpoint.ChunkProvider validates it.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if this should be considered a bug, because the protocol is always per runtime so the namespace is implicit.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, I will simply ignore this like other part of the code do.

Possibly I could add a comment to the api request definition that this filed is redundant/not used.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good idea.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we use cbor codec for grpc, omitting unused field should be both forward and backward compatible? So possibly we could remove it altogether (separate PR of course). Else I will just leave a comment.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean remove it? The same root structure is used in places where it needs the namespace.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would just leave a comment for now.

Copy link
Copy Markdown
Contributor Author

@martintomazic martintomazic Jul 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean remove it? The same root structure is used in places where it needs the namespace.

I meant removing the namespace from request (backend) not from root:

Namespace common.Namespace `json:"namespace"`

Update: or update all callsites to pass it, add new field to storage/checkpoint sync request and validate in the backend (omitting validation in case of zero value not to be breaking).

Will go with comment for this PR as this would be out of scope. :)

@kostko
Copy link
Copy Markdown
Member

kostko commented Jul 17, 2025

Likely solves root cause of #5750.

Note that as I see it currently, you always register for the checkpoint sync protocol, even if the checkpointer is disabled? In this case, this alone will not solve the above issue.

@kostko
Copy link
Copy Markdown
Member

kostko commented Jul 17, 2025

E2e test to be more confident in integration part, even though existing integration test was helpful at catching bugs.

Yeah that could become complicated fast and the e2e suite likely needs a major refresh so I would do it separately.

Should new protocols alias legacy protocol definitions/methods given they are semantically equivalent?

I think copying is the safer option.

@martintomazic
Copy link
Copy Markdown
Contributor Author

martintomazic commented Jul 17, 2025

Note that as I see it currently, you always register for the checkpoint sync protocol, even if the checkpointer is disabled? In this case, this alone will not solve the above issue.

Nice catch!

I guess correct solution to it is the complementary solution from the #5750 (comment)

RegisterProtocol in addition to triggering tracking and peer management for the given protocol (fetch part), is also responsible for triggering background advertisement (serve part). Ideally, advertisement would be RegisterProtocolServer responsibility.
We could keep it simple and start advertising only after sufficient number of checkpoints. Likely we could generalize this and propagate generic readiness closure up to advertisement's retry.

^^ or independent and then only register if checkpointer is enabled.

update: This has to be independent method. E.g. registering light client protocol server should not start advertisement as peers are found here by other means.

@martintomazic martintomazic force-pushed the martin/feature/split-storagesync-p2p branch 2 times, most recently from a9bd6b5 to 462a7eb Compare July 18, 2025 14:30
@martintomazic
Copy link
Copy Markdown
Contributor Author

martintomazic commented Jul 18, 2025

Let's get #6270 before this one to also tackle #6262 (comment) here.

Will handle remaining threads later on.

@martintomazic martintomazic marked this pull request as draft July 18, 2025 14:38

// NewClient creates a new RPC client for the given protocol.
func NewClient(h host.Host, p protocol.ID) Client {
func NewClient(h host.Host, p protocol.ID, fallback ...protocol.ID) Client {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you need to accept pids []protocol.ID because go-libp2p implementation doesn't guarantee that the first ID will be used if possible.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this just syntactic sugar for slice? The main problem as you point out is that interface does not promise order.

In practice, implementation favors protocols from left to right.

Ideally, we should not rely on this. If we had many methods as part of our protocol likely this would be the only sensible approach. But we don't, so maybe we can make it simpler.

}
}

func (s *service) handleGetDiff(ctx context.Context, request *GetDiffRequest) (*GetDiffResponse, error) {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
func (s *service) handleGetDiff(ctx context.Context, request *GetDiffRequest) (*GetDiffResponse, error) {
func (s *service) handleGetDiff(ctx context.Context, req *GetDiffRequest) (*GetDiffResponse, error) {

}

// NewServer creates a new storage diff protocol server.
func NewServer(chainContext string, runtimeID common.Namespace, backend api.Backend) rpc.Server {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To avoid returning interface, I would create function NewService and let the caller create rpc.Server. Or create 2 functions.

)

// Client is a checkpoints sync protocol client.
type Client interface {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this interface?

// Warning: This client only registers the checkpoint sync protocol with the P2P
// service. To enable advertisement of the legacy protocol, it must be registered
// separately.
func NewClient(p2p rpc.P2P, chainContext string, runtimeID common.Namespace) Client {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move constructor after struct type definition.

// separately.
func NewClient(p2p rpc.P2P, chainContext string, runtimeID common.Namespace) Client {
pid := protocol.NewRuntimeProtocolID(chainContext, runtimeID, CheckpointSyncProtocolID, CheckpointSyncProtocolVersion)
fallbackPid := sync.GetStorageSyncProtocolID(chainContext, runtimeID)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would also prepare PR which will clean these things.

}

func (c *client) getBestPeers(opts ...rpc.BestPeersOption) []core.PeerID {
return append(c.mgr.GetBestPeers(opts...), c.fallbackMgr.GetBestPeers(opts...)...)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if this will work as expected. Maybe we should have just 2 clients, everyone with its own protocol ID and then we can switch to the legacy one in methods like GetCheckpointChunk if we don't get an answer from the first one.

Copy link
Copy Markdown
Contributor Author

@martintomazic martintomazic Jul 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why? I think the test confirms it works, but indeed this is probably simpler and thus better, given that we only have 3 methods to fallback manually. Will think a bit after rebasing this PR on top of #6270.

In general this would also enable proper e2e testing (we could remove existing integration test). E2e test can be registered to not execute by default and used later as sanity again when removing support for both protocols.

Probably to make this work we should refactor existing storage and checkpoint sync to avoid returning custom Checkpoint type that wraps metadata and slice of peer feedbacks, but instead return slice of metadata with slice of slices of peer feedback. See. Or handle this type discrepancy on the call site. Uglier I would want but doable

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tried this in new PR #6277

Checkpoints []*checkpoint.Metadata `json:"checkpoints,omitempty"`
}

// Constants related to the GetCheckpointChunk method.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Every constant should have its own comment.

@martintomazic martintomazic force-pushed the martin/feature/split-storagesync-p2p branch from 462a7eb to 32a6ad1 Compare July 23, 2025 13:34
@martintomazic
Copy link
Copy Markdown
Contributor Author

Will consider simplifying with 2 clients again and then I tag you all for a final review :).

Avoid global state to enable writting unit tests with
multiple peers in the same process.
Legacy storage sync protocol is still advertised and served
to enable seamless rolling upgrades of the network.
Maybe I will replace this with e2e alltogether.
To be consistent I should probably extend other
p2p clients. On the other hand I dislike methods
for testing only.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Separate storage checkpoint sync p2p protocol

3 participants