143615: raft: replace MsgStorageApply protocol r=tbg a=pav-kv
Raft interaction with storage mainly consists of two protocols: `MsgStorageAppend` and `MsgStorageApply`. This PR focuses on the latter. The protocol is replaced with a more ergonomic, strongly-typed API, which also gives the user direct control of the apply flow.
---
The old flow:
1. If there are unapplied committed entries, `RawNode.Ready()` constructs a `MsgStorageApply` including some/all of these entries.
2. The `MsgStorageApply` is included in `Ready.Messages`.
3. The application filters it out from other messages, and acts on it.
4. When the job is done, the `RawNode` must be notified. The `MsgStorageApply` contains a `MsgStorageApplyResp` in its `Responses` field, which must be stepped back to `RawNode`.
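As a rough illustration of steps 2-4, the sketch below models the message-based flow with toy types. `Message`, `MsgType`, `splitApply`, and `handleApply` are hypothetical stand-ins written for this example, not the actual `raftpb` or `RawNode` definitions. Only the shape of the protocol is meant to match: the apply message travels among ordinary messages and carries its own response.

```go
package main

import "fmt"

// MsgType is a toy stand-in for the raft message type enum.
type MsgType int

const (
	MsgApp MsgType = iota
	MsgStorageApply
	MsgStorageApplyResp
)

// Message is a simplified, hypothetical model of a raft message. Only the
// fields needed to show the old apply protocol are included.
type Message struct {
	Type      MsgType
	Entries   []string
	Responses []Message
}

// splitApply separates MsgStorageApply messages from the rest of the
// messages (step 3 of the old flow).
func splitApply(msgs []Message) (apply, rest []Message) {
	for _, m := range msgs {
		if m.Type == MsgStorageApply {
			apply = append(apply, m)
		} else {
			rest = append(rest, m)
		}
	}
	return apply, rest
}

// handleApply applies the entries carried by m and returns the responses
// that the caller must step back into the node (step 4 of the old flow).
func handleApply(m Message, applyFn func(string)) []Message {
	for _, e := range m.Entries {
		applyFn(e)
	}
	return m.Responses
}

func main() {
	msgs := []Message{
		{Type: MsgApp},
		{
			Type:      MsgStorageApply,
			Entries:   []string{"a", "b"},
			Responses: []Message{{Type: MsgStorageApplyResp}},
		},
	}
	apply, rest := splitApply(msgs)
	var applied []string
	for _, m := range apply {
		resps := handleApply(m, func(e string) { applied = append(applied, e) })
		fmt.Println("responses to step back:", len(resps))
	}
	fmt.Println("applied:", applied, "other messages:", len(rest))
}
```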
Downsides of this approach:
- The decision to fetch entries from storage and construct a `MsgStorageApply` is made by raft. This necessitated some form of built-in flow control, but we never ended up using it: #143576. It would be better to have the ability to control this flow from outside raft.
- `MsgStorageApply` construction and the associated IO happen under `Replica.{raftMu+mu}` during the `Ready()` call. We would like to avoid IO while holding `Replica.mu`: #140235.
- The raft messaging system is slightly abused with this approach. All raft messages contain the extra `Responses` [field](https://github.com/cockroachdb/cockroach/blob/acb74317523b0a0849d827a968167a9243e2bb88/pkg/raft/raftpb/raft.proto#L109-L112) that only the two storage APIs use. Most raft messages are external and don't require ordering or delivery guarantees, while the `MsgStorageAppend/Apply` protocols require strict ordering.
- Storage interaction being encoded in a `raftpb.Message` with many unused fields is simply confusing and not a great API. The preference is to have dedicated types with stronger/narrower semantics.
---
The downsides are addressed by the new flow:
1. `RawNode.Ready()` includes a `Ready.Committed LogSpan`, to signify the committed but not applied span of the log.
2. The caller obtains `RawNode.LogSnapshot()` and is free to fetch and apply either the entire committed span or a prefix of it. It can also do so after releasing `Replica.mu`.
3. Once the entries are applied, the caller notifies `RawNode` using the `RawNode.AckApplied()` method.
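The new flow can be sketched with a toy model, shown below. The `node` type, its `Committed`, `Slice`, and `AckApplied` methods, the `LogSpan` field names, and `applyPrefix` are all assumptions made for this example. They mimic the roles of `Ready.Committed`, `RawNode.LogSnapshot()`, and `RawNode.AckApplied()` but do not reproduce their real signatures. The point is that the caller, not raft, decides how much of the committed span to apply at a time.

```go
package main

import "fmt"

// LogSpan is a hypothetical stand-in for the span type named in this PR.
// Here it denotes the log indices in the half-open interval (After, Last].
type LogSpan struct{ After, Last uint64 }

// Entry is a simplified log entry.
type Entry struct {
	Index uint64
	Data  string
}

// node is a toy stand-in for RawNode. Only apply bookkeeping is modelled.
type node struct {
	log       []Entry // log[i] holds the entry with Index i+1
	committed uint64
	applied   uint64
}

// Committed plays the role of Ready.Committed: the committed but not yet
// applied span of the log.
func (n *node) Committed() LogSpan {
	return LogSpan{After: n.applied, Last: n.committed}
}

// Slice plays the role of reading entries through a log snapshot.
func (n *node) Slice(s LogSpan) []Entry { return n.log[s.After:s.Last] }

// AckApplied records that entries up to and including index were applied.
// Indices that do not advance applied, or exceed committed, are ignored.
func (n *node) AckApplied(index uint64) {
	if index > n.applied && index <= n.committed {
		n.applied = index
	}
}

// applyPrefix applies at most max entries from the committed span, then
// acknowledges them. The caller controls the batch size.
func applyPrefix(n *node, max uint64) []string {
	span := n.Committed()
	if span.Last-span.After > max {
		span.Last = span.After + max
	}
	var out []string
	for _, e := range n.Slice(span) {
		out = append(out, e.Data)
	}
	if span.Last > span.After {
		n.AckApplied(span.Last)
	}
	return out
}

func main() {
	n := &node{
		log:       []Entry{{1, "a"}, {2, "b"}, {3, "c"}, {4, "d"}, {5, "e"}},
		committed: 4,
		applied:   1,
	}
	fmt.Println(applyPrefix(n, 2), n.applied) // applies indices 2 and 3
	fmt.Println(applyPrefix(n, 2), n.applied) // applies index 4
}
```

In this sketch the batch limit is a plain argument to `applyPrefix`, which reflects the design goal stated above: flow control lives outside raft, and fetching entries can happen without holding the locks that guard the node's state.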
---
Part of #143652
Related to #124440
Epic: CRDB-46488
Release note: none
Co-authored-by: Pavel Kalinnikov <[email protected]>