CIP-0164 | Refine Leios protocols based on buidlerfest discussions#1167
rkuhn wants to merge 1 commit into cardano-foundation:master
Conversation
coot
left a comment
There is a semantic problem in the LeiosAnnounce mini-protocol. There are two MsgLeiosAnnounceRequestNext messages which start with different agencies: one starts with the agency on the client, the other starts with agency on the server. The meaning of the first is fine - the client asks for a number of announcements - but the latter means that the server is asking itself for a number of announcements.
I don't think we actually need the MsgLeiosAnnounceRequestNext :: StBusy -> StBusy. The client can just use protocol pipelining to send MsgLeiosAnnounceRequestNext :: StClient -> StBusy ahead of receiving all responses from its previous request.
Hi @coot, sorry for the misunderstanding, I should have tried to figure out a different color right away: the RequestNext(N) message is only sent by the client; in state StBusy both sides have agency. This is the same as with pipelining, I just prefer to make it explicit and specify it rather than leaving it implicit. Please do read the RS background I linked to. The intended usage is that the client will ask for 1000 items (which I think is roughly the right depth for votes), and while receiving them it will request the next 100 after the first 100 are received. The goal is to always have sufficient demand signalled at the server so that it can send right away. This is important because we need low latency for tiny messages. The protocol still places a strict upper bound on the required receive buffer size (roughly 100-200 kB for 1000 votes, as I've been told). Yes, a similar result could be achieved by round-robin multiplexing 1000 instances of the protocol onto that mini-protocol's subchannel, but it would be a lot less efficient (more messages sent, memory and processing overhead at both client and server).
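The demand-signalling scheme described here (ask for 1000 up front, top up by 100 as batches arrive) can be sketched as a small pure function. This is only an illustration; the function name and the refill policy are invented, not part of the CIP:

```python
# Sketch of Reactive-Streams-style demand accounting on the client side.
# Names and the refill policy are illustrative, not from the CIP.

REFILL = 100  # top-up batch size

def on_receive(outstanding: int, received: int) -> tuple[int, int]:
    """Given demand already signalled to the server but not yet consumed,
    and the number of items received since the last top-up, return the
    new outstanding demand and how much new demand to signal now."""
    if received >= REFILL:
        return outstanding - received + REFILL, REFILL
    return outstanding - received, 0
```

Starting from an initial request of 1000, `on_receive(1000, 100)` yields `(1000, 100)`: demand at the server never drops below 900, so it can keep sending without waiting a round trip.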
That goes against the whole principle of the state-agency / typed-protocols scheme, I'm afraid. I understand what you're aiming for - it's similar to a TCP window - but I think, as @coot says, the existing pipelining does most of what you want, and for votes the responses can carry multiple anyway.
Thanks @rkuhn for bringing this & the discussion into the CIP stream. Mainly the CIP editors will be looking for consensus among Leios architects & agreement from other stakeholders and merge this if & when that appears to happen... so please feel free to point out the details that warrant the most consideration & that should be resolved before we merge.
Next step would be to confirm this as a proper update at the next CIP meeting in Triage (https://hackmd.io/@cip-editors/131) so please anyone for or against these changes would be welcome to attend for a light introduction (not a full review)... just enough to establish validity & general consensus behind the update.
cc @will-break-it @ch1bo @nfrisby @bwbush @WhatisRT @nhenin @dnadales @perturbing @jpraynaud
@rphair I specifically asked for having the conversation here on a PR and expect some more discussion on this over the next few weeks, before we seek consensus and merge it into the CIP. Does this work for you and the CIP editors?
of course @ch1bo - there's no rush & the editors would be following your (plural) lead & waiting for you to post your own conclusions at your own pace. We'll just have a look at the CIP meeting next week so all the editors know what's ahead & not do anything with it until the stakeholding reviewers provide their consensus.
- use Reactive Streams semantics for bounded push of block announcements and votes
- remove votes offer & request communication cycles to cut down latency
- remove probably premature optimisation of block / txn request by just listing txn indices
- split LeiosNotify into LeiosAnnounce (for EBs), LeiosVotes, LeiosBlockNotify (for when EB and/or txns are available upstream) to allow independent treatment in the muxer and remove or reduce head-of-line blocking

We might also want to allocate N2N mini-protocol IDs in this CIP because multiple teams are starting to play with this spec and might want to check interoperability.

Signed-off-by: Roland Kuhn <rk@rkuhn.info>
Force-pushed from 5de19c1 to a04b10d
The protocol is still typed and can still be usefully formalised, but yes, modelling the real behaviour does require mixed choice, which is not currently available in the machinery used in the Haskell implementation. This is semantically already true of the existing protocols, where in StCanAwait of chain sync the initiator may send RequestNext even though it has “no agency” — this is what you call pipelining, but since all “sessions” share the same communication channel without tagging their messages, I think it is fair to consider those pipelining instances just a matter of perspective, like the choice of gauge in quantum field theory.

What I’d like to achieve here is that we define a useful network protocol befitting the Leios information flow requirements, not moulded to or limited by the existing code structure of other mini-protocols in a particular Cardano node implementation. It is also quite easy to conform to my proposed protocol using the existing Haskell machinery by applying the pipelining approach to run a number of instances of the protocol with a fixed choice of N (e.g. N=1 to get normal request–response). This requires slightly more resources within the node but saves complexity cost in the Haskell codebase by reusing existing infrastructure.
My proposal changes MsgLeiosVote to carry exactly one vote because then the message has a predictable cost. Allowing a list of votes within a single message means that the responder can cause unpredictable resource usage — in my proposal an initiator may ask for 1000 votes, after which it expects to receive no more than 1000 votes (until it asks for more later). I appreciate that this deviates from the trodden path in Cardano, and I’d like to ask you to consider this with an open mind given that Reactive Streams have been around for more than a decade in the JDK, meaning that this communication principle is well proven.
> size of the configured TCP send and receive buffers.
> While the node is catching up with the chain after a restart, it will see Praos
> blocks referencing EBs and use the MsgLeiosMultiBlockRequest to get not only
While we want a multi-block request to bulk download everything on catch-up, we still need the body-only request/response (previously MsgLeiosBlockRequest and MsgLeiosBlock). While this might be generalized to multiple bodies, a caught-up node would not download the full block closure, as only in the worst case are all txs unique and not already known by the client.
@nfrisby You also discussed a "cancel" semantics of this protocol last week? That would be an alternative to having a two-stage download: it could be used to always request the full closure and cancel / override with a desired closure subset via MsgLeiosBlockTxsRequest (or even switch to a "fresher" block download).
My idea here is to deliver the block body for MsgLeiosBlockTxsRequest(hash, []) and an array of transactions for MsgLeiosBlockTxsRequest(hash, [<txHashes>]).
Cancellation of an ongoing transfer doesn’t seem fruitful (I am happy to be proven wrong by experiment!) because the RTT-bandwidth-product P will typically not be much smaller than the maximum response size: the responder can only stop sending upon receiving the cancellation, meaning after sending the cancellation the initiator will still receive P bytes.
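The argument against cancellation can be checked with back-of-the-envelope arithmetic. The figures below are illustrative assumptions, not numbers from the CIP:

```python
# After the initiator sends a cancellation, roughly one RTT-bandwidth
# product P of data is already committed to the wire and will still
# arrive. Example figures are illustrative only.

def in_flight_bytes(rtt_seconds: float, bytes_per_second: float) -> float:
    """Data the initiator still receives after sending a cancellation."""
    return rtt_seconds * bytes_per_second

# e.g. 100 ms RTT at 10 MB/s: ~1 MB arrives after the cancel was sent,
# comparable to a maximum-size response, so cancelling buys little.
p = in_flight_bytes(0.1, 10_000_000)
```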
Brief discussion at the CIP meeting today introduced the dialogue here & confirmed my earlier #1167 (review) suggestion that the existing discussion here simply take its natural course until apparently resolved by the participants.
nfrisby
left a comment
Happy to see another contributor to the CIP! 👏
Even with fait accompli and local sortition (so that some votes are bigger than others), I think it's plausible that micromanaging votes isn't worthwhile, so just sending them instead of offering them individually is very plausibly a latency optimization.
For the other changes, it's not yet clear to me that they're improvements; I commented on those parts of the PR in this review.
> The purpose of this first protocol is to diffuse block announcements as fast as
> possible throughout the network. Since these announcements are small and
https://www.reactive-streams.org/ says:
The main goal of Reactive Streams is to govern the exchange of stream data across an asynchronous boundary—think passing elements on to another thread or thread-pool—while ensuring that the receiving side is not forced to buffer arbitrary amounts of data
The way we have achieved that so far in the Cardano node is to use the typed-protocols framework combined with what we call "mini protocol pipelining." I'm hesitant to explicitly bake that pipelining into the definition of the mini protocol itself (since it's technically just an optimization). But I do think the CIP text should probably discuss the mini protocol pipelining in more detail than the existing (on main) "Because the client only has agency in one state, it can pipeline its requests for the sake of latency hiding" sentence.
See this excerpt from page 18 of https://ouroboros-network.cardano.intersectmbo.org/pdfs/network-design/network-design.pdf for a brief discussion.
You could also watch Duncan's 2019 presentation here, https://www.youtube.com/watch?v=kkynmgwa7gE. He starts discussing mini protocol pipelining at the 35m55s mark.
My summary:
- Every single message in the LeiosNotify mini protocol (from the `main` branch) is tiny. So we don't really care that they're technically different sizes, and "not forced to buffer arbitrary amounts of messages" is a fine substitute for "not forced to buffer arbitrary amounts of data" for LeiosNotify.
- The agency status is always trivial: the client sends one message, then the server sends a response, and loop. This makes it easy to write the pipelined version of this mini protocol, in which the client sends 1000 MsgLeiosNotificationRequestNext messages at the start of the connection and then conceptually sends another each time it receives a reply (or smooths that out with a low-high watermark, etc).
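The low-high watermark variant mentioned above can be sketched as a tiny policy function. The watermark values and names are invented for illustration:

```python
# Hypothetical low/high watermark policy for the pipelined client:
# keep up to HIGH requests outstanding, and top up in one burst
# whenever the outstanding count falls to LOW. Values are illustrative.

LOW, HIGH = 100, 1000

def top_up(outstanding: int) -> int:
    """How many new MsgLeiosNotificationRequestNext messages to pipeline."""
    return HIGH - outstanding if outstanding <= LOW else 0
```

This smooths the "send one request per reply" pattern into occasional bursts while keeping the same bound on outstanding work.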
Right now, the only distinction I see between the content on the main branch and this CIP in terms of the LeiosNotify mini protocol(s) is:
- You're batching the 1000 and/or 100 MsgLeiosNotifyRequestNext messages into one message. Since we are expecting 1000s of messages for each EB, it might be worth adding this kind of "batching" in `typed-protocols` (could be a mux-level "trick" in the decoding, maybe?). It could look something like the explicit token counting you have on this PR.
- You've separated block announcements, vote announcements, and block offers into separate mini protocols. I suppose that might be useful so that the requests for each could be counted differently? But do we need to count them separately? (FWIW, I don't think this would already be easy to do on top of `typed-protocols`.)
- You've also inlined votes into their announcement: nodes just send the vote without offering it first. If votes are barely bigger than their names, then that seems plausible to me.

edit: I forgot to add "Is that right or am I overlooking some difference in the LeiosNotify changes?"
Thanks for the review, @nfrisby! Your summary of the changes to LeiosNotify is correct. The main motivation for request batching is to conserve resources because Leios involves many more notifications than Praos. I’ll respond to @coot’s comment below regarding how to model that in terms of the existing protocol framework, we seem to be converging (and I’d love to have a more formal description of pipelining in this particular context, which I’m happy to contribute as well).
The reasoning behind splitting the notify protocol into three pieces is that we’ll have three data streams that we need to back-pressure, and if my understanding is correct then the latency requirements and processing behaviour are different:
- block announcements shall be disseminated as quickly as possible
- votes are processed differently and thus may experience back-pressure while announcements do not
- block offers (incl. closures) are processed again differently, hence the separate back-pressure signal
This separation also simplifies the sending side because it can just send messages down separately back-pressured channels and let the existing multiplexer do its job. Otherwise, another level of multiplexing needs to be implemented to prioritise the block announcements.
> | Client→ | MsgLeiosMultiBlockRequest | list of EB hashes | Requests the EBs and all referenced transactions for the given EB hashes |
> | ←Server | MsgLeiosBlock | EB block, list of transactions | A block requested in the previous MsgLeiosMultiBlockRequest |
> | ←Server | MsgLeiosNoMoreBlocks | $\emptyset$ | All blocks from the previous MsgLeiosMultiBlockRequest have been delivered |
> | Client→ | MsgLeiosBlockTxsRequest | EB hash, list of integers | For the referenced EB, request a list of transactions identified by their sequence number within that EB |
I'm guessing this change is the one the PR description describes as
remove probably premature optimisation of block / txn request by just listing txn indices
Is that right?
An EB might contain ~15000 txs (that's 512 kB divided by 34+2 B), and a node might need to request the majority of them. With either design, the node has two choices.
- Request the whole EB.
- Or request whichever txs are actually needed by sending a request for those tx positions.
There will be some sweet spot, but since individual txs can be 16 kB, the sweet spot plausibly involves requesting the vast majority of txs in the EB but not all of them.
Roaring Bitmap versus Simple Integer Sequence
Assuming CBOR, the following table is the size of each integer encoding.
| Interval | CBOR Bytes per Number | Size of Interval |
|---|---|---|
| 0 - 23 | 1 | 24 |
| 24 - 255 | 2 | 232 |
| 256 - 14999 | 3 | 14744 |
The encoding of the sequence 0 - 14999 would be 1×24 + 2×232 + 3×14744 = 44720 B.
If we used plain bytes instead of CBOR integers (i.e. just one big CBOR bytestring), the size would be closer to 30 kB instead of 45 kB.
As a roaring bitmap, a 15000 tx request would instead be 234 full Word64 bitmaps plus 1 partial Word64 bitmap that only has 15000 - 64×234 = 24 bits set.
Each Word64 bitmap also gets an index, which would be 0 - 234 in this case, which would be 1×24 + 2×211 = 446 B in CBOR. The 234 full Word64s would be 9×234 = 2106 B. The 1 partial Word64 would either be 5 or 9; let's call it 9 B. That's a total of 2561 B. (Closer to 2000 B without CBOR, but nbd.)
So the largest request without roaring bitmaps is ~45 kB---we could reduce that to ~30 kB---and the largest request with a roaring bitmap is ~3 kB.
That's the upshot of the roaring bitmap complexity.
Is it worth it? If you juxtapose it against the size of an EB, then saving merely 27 kB per EB seems like a distraction.
But if you compare it to "all requests are tiny" versus "some requests are about half the size of a Praos block", then the roaring bitmap is a very localized complexity cost that makes the design simpler to reason about in various scenarios. For example, it would justify simpler ingress buffer management: you can limit the count of requests rather than their total byte size, and even in the worst case that would remain true. It guarantees that individual requests are small---approximately 2 Praos headers---so we don't need to dedicate any complexity to worrying about their overhead.
That's the argument in favor of roaring bitmaps. I don't consider myself the owner here despite authoring this section originally; it's a CIP after all! But that'd be my argument for keeping the roaring bitmap: 1) it's a bit of cruft, yeah, 2) but it's isolated here and 3) it gives a very small and tight bound on the message size---even in the worst-case---which relieves cognitive load elsewhere.
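The size arithmetic in the argument above can be reproduced mechanically (a sketch using the CBOR unsigned-integer encoding lengths from the table; the helper names are invented):

```python
# Reproduce the size estimates for a worst-case request of tx positions
# 0..14999, comparing a plain CBOR integer sequence with a
# roaring-bitmap-style encoding.

def cbor_uint_bytes(n: int) -> int:
    """Encoded size of a CBOR unsigned integer (major type 0)."""
    if n < 24:
        return 1
    if n < 256:
        return 2
    if n < 65536:
        return 3
    return 5

# Plain sequence of 15000 positions as individual CBOR uints: ~45 kB.
plain_seq = sum(cbor_uint_bytes(i) for i in range(15000))

# Roaring-bitmap-style request: 235 (container index, Word64) pairs,
# each Word64 counted as a 9-byte CBOR item: ~2.5 kB.
bitmap = sum(cbor_uint_bytes(i) for i in range(235)) + 9 * 235
```

Running this gives 44720 B for the plain sequence and 2561 B for the bitmap, matching the figures in the comment.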
cc @ch1bo ^^^ that's the most time I've spent trying to explain the motivation for roaring bitmaps since the original discussions last year. I know they've been on the chopping block ever since :D. So let me know what you think.
Hehe, thanks for laying out the rationale here. I can see the appeal of requests being absolutely small. If I recall correctly, @rkuhn had an argument that any request that is orders of magnitudes smaller than the response should be fine - this would be true for both schemes.
Maybe the sweet spot lies somewhere in between? e.g. a single 15k length bitfield or that bitfield run-length encoded (so we can finally put our tech interview experience to good use)
Thanks for the explanation! One thing I don’t yet fully grasp is the expected behaviour of the network in terms of dissemination of transactions and EBs. Why would the node ever need to request the majority or even a sizeable fraction of an EB’s transactions after catching up? Shouldn’t the transactions be disseminated somewhat earlier than the EB? Of course the transactions aren’t guaranteed to be present, but I’d be surprised if the average fraction a node would ask for is larger than 10% (if that).
I’d be surprised if the average fraction a node would ask for is larger than 10% (if that).
This is the average case, where mempools are largely consistent. We have done recent R&D to analyse the "mempool fragmentation" of the Cardano network under various load points. Both empirically and simulation-based. See the January monthly review and March monthly review sessions, recordings and full notes here.
In summary, if demand is sufficiently high the fragmentation increases. IIRC the analytical results were confirmed in a simulation and match our intuition, but a real world test was not (yet) performed.
I'd propose a simpler approach, using a mixture of batching and pipelining, both currently in use. We'd have the following states (and agencies):
And the following transitions / messages:
We can saturate the server with requests by using protocol pipelining, and save on sending each request separately by batching. The client will be in control of how much batching / pipelining it is ready to accept on its ingress side. I don't know how to draw this using mermaid. We use a similar approach for batching block requests.
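As an editorial illustration of the batching idea (message names follow the discussion, but the encoding and the accounting function are invented assumptions, not the CIP's normative text):

```python
# Hypothetical model of batched request/response messages: a single
# request carries a batch size n, after which the server owes up to
# n single-vote responses.

from dataclasses import dataclass

@dataclass
class MsgLeiosVotesRequest:
    n: int          # client asks for up to n more votes in one message

@dataclass
class MsgLeiosVote:
    vote: bytes     # server delivers exactly one vote

def apply_msg(msg, outstanding: int) -> int:
    """Server-side demand accounting for one message on the wire."""
    if isinstance(msg, MsgLeiosVotesRequest):
        return outstanding + msg.n
    if isinstance(msg, MsgLeiosVote):
        assert outstanding > 0, "server must not send unrequested votes"
        return outstanding - 1
    return outstanding
```

The invariant is that `outstanding` never goes negative, which is exactly the bounded-buffer guarantee both sides rely on.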
I think the explicit state enumeration using the type-level naturals could look like this in mermaid:

```mermaid
---
title: LeiosVotes
---
graph LR
classDef client color:black,fill:PaleGreen,stroke:DarkGreen;
classDef server color:black,fill:PowderBlue,stroke:DarkBlue;
StIdle:::client --MsgDone--> StDone
StBusy1[StBusy 1]
StBusy2[StBusy 2]
StBusyEtc[StBusy ..]
StBusy1000[StBusy 1000]
class StBusy1,StBusy2,StBusyEtc,StBusy1000 server;
StBusy1 --MsgLeiosVote--> StIdle
StBusy1 --MsgLeiosVotesRequestNext--> StBusy2
StBusy2 --MsgLeiosVote--> StBusy1
StBusy2 --MsgLeiosVotesRequestNext--> StBusyEtc
StBusyEtc --MsgLeiosVote--> StBusy2
StBusyEtc --MsgLeiosVotesRequestNext--> StBusy1000
StBusy1000 --MsgLeiosVote--> StBusyEtc
```
Edit: This was not exactly the same as what @coot described above and I did another attempt here, but this is not even complete, and drawing this exhaustively is a mess, I agree :)

```mermaid
---
title: LeiosVotes
---
graph LR
classDef client color:black,fill:PaleGreen,stroke:DarkGreen;
classDef server color:black,fill:PowderBlue,stroke:DarkBlue;
StIdle:::client --MsgDone--> StDone
StBusy1[StBusy 1]
StBusy2[StBusy 2]
StBusyEtc[StBusy ..]
StBusy1000[StBusy 1000]
class StBusy1,StBusy2,StBusyEtc,StBusy1000 server;
StBusy1 --MsgLeiosVote--> StIdle
StBusy1 --MsgLeiosVotesRequest 1--> StBusy2
StBusy1 --MsgLeiosVotesRequest 999--> StBusy1000
StBusy2 --MsgLeiosVote--> StBusy1
StBusy2 --MsgLeiosVotesRequest ...--> StBusyEtc
StBusy2 --MsgLeiosVotesRequest 998--> StBusy1000
StBusyEtc --MsgLeiosVote--> StBusy2
StBusyEtc --MsgLeiosVotesRequest ...--> StBusy1000
StBusy1000 --MsgLeiosVote--> StBusyEtc
```
Yes, @coot, that’s also a possibility — I think your proposal with pipelining is behaviourally indistinguishable from reactive streams semantics (without cancellation), so any node implementation can pick its internal representation according to personal style and inclination. @ch1bo, what you sketched is a different way of implementing reactive streams semantics, which would probably be a larger change and more difficult to integrate with the existing mini-protocol machinery.

In order to ensure convergence I’d like to describe formally how I interpret pipelining. If this matches your understanding then I’ll add it to the CIP text.

Protocol pipelining with a factor N runs N instances of a mini-protocol on a single multiplexer subchannel for the given protocol ID. Each instance tracks its own state and agency as per the specification. One protocol state is marked as the switch state; the switch state must be one in which the initiator has agency. The subchannel is governed by a pair of multiplexers, one for sending and one for receiving, that behave in round-robin fashion across the N instances, starting at the first instance. Requests from the node are forwarded by the sending multiplexer to the currently selected instance, with the resulting protocol message sent to the network; whenever it has sent a message from the switch state, the send multiplexer selects the next instance. The receive multiplexer forwards received messages from the network to the currently selected protocol instance; whenever it receives a message that transitions that instance into the switch state, the receive multiplexer selects the next instance.

This implies that pipelining only works for mini-protocols which have a suitable switch state in which the initiator decides what to do next and the responder can then send one or more messages to get back to the switch state.
A protocol in which the initiator would need to send again from a different intermediate state would not support pipelining (such protocols don’t yet exist in the Ouroboros family).
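The round-robin selection rule in the description above can be modelled with a tiny function (an editorial sketch under the stated assumptions, not normative text):

```python
# Minimal model of round-robin instance selection across N pipelined
# protocol instances: the multiplexer advances to the next instance
# whenever the current one passes through the switch state.

def next_instance(n: int, current: int, at_switch_state: bool) -> int:
    """Index of the instance the multiplexer serves after this message."""
    return (current + 1) % n if at_switch_state else current
```

Both the send and the receive multiplexer apply the same rule, so they stay in lock-step as long as each instance returns to the switch state exactly once per request cycle.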
@ch1bo what I propose has this shape:
This is a wiki page which explains protocol pipelining as I use the term. Note that one doesn't need to run N instances of a protocol; one just sends requests ahead of responses (which is enough to hide latency). For example, with the above protocol, one can have this conversation (with pipelining depth 2):
Even if the client pipelines messages, the server doesn't care - it will see requests in exactly the same order as if the client weren't pipelining at all. So protocol pipelining doesn't restrict the kinds of protocols you can use (at least in the class of protocols that one can encode as diagrams with agencies).
@coot Okay, so you’re confirming that my description is correct, including the restriction that pipelining is only defined for request–response+ shaped protocols — I am not aware of the design rules or well-formedness conditions of the mini-protocols used in Cardano specs, so I’ll take the absence of other protocol shapes as specification by example.

(Just as an illustration of a protocol that would not work, take a hypothetical variant of the block fetch protocol where the initiator sends a block range request, the responder sends NoBlocks or StartStreaming, and then before each Block can be transmitted the initiator needs to explicitly ask for it. This can obviously not be pipelined because the next block range request would arrive when the responder expects a NextBlockPlease message. My naive interpretation of agency diagrams would permit such a protocol to be specified.)

I’ll update the PR soon with the aspects already agreed here. For the roaring bitmaps I still have to understand the solution space better — would it not also be a solution to pass the bitmap through zstd compression?
Yes, that's a good example of when one is not able to pipeline. To use pipelining over a message from state s to t (for s the client has agency, for t the server has agency), there must exist a state s' with client agency such that all paths from t lead to s' through states where the server has agency. This is because when pipelining the client needs to know ahead of time to which state the server will lead.
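This well-formedness condition can be made concrete with a toy reachability check. The graph encoding below is invented for illustration (and assumes the server-agency region is acyclic); it is not part of any Cardano spec:

```python
# Pipelining over a message into state t is well-defined iff every path
# from t through server-agency states reaches one and the same
# client-agency state s'.

def reachable_client_states(edges, agency, t):
    """edges: list of (src, dst) transitions; agency: state -> 'client'|'server'.
    Returns the set of first client-agency states reachable from t."""
    if agency[t] == "client":
        return {t}
    result = set()
    for src, dst in edges:
        if src == t:
            result |= reachable_client_states(edges, agency, dst)
    return result

# A chain-sync-like shape: pipelinable, all paths end in StIdle.
ok_edges = [("StBusy", "StIdle"), ("StBusy", "StStreaming"),
            ("StStreaming", "StIdle")]
ok_agency = {"StIdle": "client", "StBusy": "server", "StStreaming": "server"}
```

`reachable_client_states(ok_edges, ok_agency, "StBusy")` yields the singleton `{"StIdle"}`, so the transition into `StBusy` can be pipelined; a protocol whose busy state can end in two different client-agency states would fail the check.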

This PR resulted from a discussion today at buidlerfest, I offered to write things down and submit
here for wider discussion. The model of information flow itself remains unchanged, the refinement
only aims at making it easier to achieve low latency or high bandwidth for the different parts of
the Leios protocols, as necessary.
Summary of the changes:
- use Reactive Streams semantics for bounded push of block announcements and votes
- remove votes offer & request communication cycles to cut down latency
- remove probably premature optimisation of block / txn request by just listing txn indices
- split LeiosNotify into LeiosAnnounce (for EBs), LeiosVotes, LeiosBlockNotify (for when EB and/or txns are available upstream) to allow independent treatment in the muxer and remove or reduce head-of-line blocking
We might also want to allocate N2N mini-protocol IDs in this CIP because multiple teams are starting to play with this spec and might want to check interoperability.
Signed-off-by: Roland Kuhn <rk@rkuhn.info>