aura/slot_based: Fix effective slot deadline using relay parent offset #11453
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
But this would be correct? You probably mean the other one?
bkchr left a comment
This is the last time I look at such AI slop. IT IS YOUR JOB TO LOOK OVER THIS CODE BEFORE OPENING A PR. Next time I see such a pr, I will just close it.
The explanation is extremely hard to follow. The changes are just "wild". If I understand it correctly and it is about the relay parent offset, wouldn't it be much simpler to directly remove the relay parent offset from duration_now? Then we don't need to adjust the slot later on, and it should simplify this PR drastically.
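The simplification suggested above could look roughly like the sketch below. It is only an illustration under assumptions: the function name `offset_adjusted_slot`, its parameters, and the idea of converting the offset to time by multiplying it with a relay slot duration are hypothetical and not taken from the PR; the actual handling of `duration_now` in the slot-based collator may differ.

```rust
use std::time::Duration;

/// Hypothetical helper (not from the PR): derive the para slot from a
/// timestamp that already has the relay parent offset subtracted, so the
/// slot does not need to be adjusted afterwards.
///
/// Assumption: the offset is counted in relay blocks, and one relay block
/// corresponds to one `relay_slot_duration`.
fn offset_adjusted_slot(
    duration_now: Duration,
    relay_parent_offset: u32,
    relay_slot_duration: Duration,
    para_slot_duration: Duration,
) -> Option<u64> {
    // Time covered by the relay parent offset.
    let offset = relay_slot_duration.checked_mul(relay_parent_offset)?;
    // Shift "now" back by the offset; `None` if the chain is younger than the offset.
    let adjusted_now = duration_now.checked_sub(offset)?;
    let slot_ms = para_slot_duration.as_millis();
    if slot_ms == 0 {
        return None;
    }
    Some((adjusted_now.as_millis() / slot_ms) as u64)
}
```

With 6s slots (an assumed value) and an offset of 1, a timestamp that falls into wall slot 804 maps to para slot 803 directly.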
to none Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
hash: Option<H256>,
/// True if the collator built a block for the current relay parent, false otherwise.
///
/// This state is needed, otherwise the opportunity 1 might mark the block as
I am wondering if this is true. Opportunity 1 can only happen if we skip the last block because of slot handover. So the collator authoring the blocks for the next RP is someone else entirely anyway. If we run into this situation it should not matter whether this has triggered 🤔 .
> Opportunity 1 can only happen if we skip the last block because of slot handover. So the collator authoring the blocks for the next RP is someone else entirely anyway.
In this case, the same collator that sees opportunity 1 is the collator that will build in the next wall slot (going through all the rest of the block building opportunities). Then if we ignore the has_built flag, we are effectively short-circuiting the building of 10 blocks in wall slot 804 and para slot 803 🤔
Have added a new diagram here for the race condition: #11453 (comment)
//
// [ wall slot | para slot | next wall slot]
// opportunity 1: [ 803 | 803 | 490ms ]
// - The wall slot is behind the para slot deduced by the relay block
// - The next slot 804 arrives in 490ms leaving no room for the 1s authoring duration
// - collator must skip the building the first block for this relay block
Yep, exactly: when we first deduce para slot 803, it overlaps with wall slot 803.
This leaves us ~490ms until the next wall slot 804 arrives, and because that is within 1s of the next slot, the block gets skipped.
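To make the 490ms case concrete, here is a minimal sketch of the skip decision. The helper names and the 6s slot duration are assumptions for illustration only and are not taken from the PR; the 1s authoring duration and the 490ms remainder come from the example in the code comment.

```rust
use std::time::Duration;

/// Hypothetical helper: time left until the next wall slot boundary.
/// Assumes `slot_duration` is non-zero.
fn time_until_next_slot(now: Duration, slot_duration: Duration) -> Duration {
    let slot_ms = slot_duration.as_millis();
    let elapsed_in_slot = now.as_millis() % slot_ms;
    Duration::from_millis((slot_ms - elapsed_in_slot) as u64)
}

/// Hypothetical helper: skip the block building opportunity when the time
/// left in the current wall slot is shorter than the authoring duration.
fn should_skip_opportunity(
    now: Duration,
    slot_duration: Duration,
    authoring_duration: Duration,
) -> bool {
    time_until_next_slot(now, slot_duration) < authoring_duration
}
```

For example, with assumed 6s slots, a timestamp 5510ms into wall slot 803 leaves 490ms until slot 804, which is less than the 1s authoring duration, so the opportunity is skipped.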
I don't understand what you are saying. Your example in the code doesn't make sense to me nor your explanation. Why do we only have 490ms?
// the wall slot ticks.
// - We don't want to build on this relay parent and instead skip until the next relay
// block arrives.
struct ParentTracker {
This should be moved outside of this function (the type declaration)
has_terminated: bool,
}

let mut parent_tracker =
Isn't it enough to just check that the block number of the relay chain is strictly increasing? So, we do not build on the same block twice?
Let me know if I got the idea right:
- We use a block number that we set when finished building the 10 blocks for this slot, because we reach the 1s limit at the end of the slot (this signals we are at the end of the slot and the block is terminated)
- Therefore, when the next wall slot arrives, the relay block on which we are building must be strictly greater than the block we just set, otherwise we are building on the same stale parent
Does this guard against the first case? The scenario where the first block opportunity is skipped for the same wall slot and para slot? 🤔
Also it might have a tiny race with reorgs? Maybe I got the idea wrong:
```rust
// We use a simple number to detect when we terminated with the block production
let mut last_terminated_relay_number: Option<u32> = None;
...
let relay_parent = rp_data.relay_parent().hash();
let relay_parent_header = rp_data.relay_parent().clone();
// Could re-org with a different parent, but mostly ok since that doesn't happen that often?
if last_terminated_relay_number >= Some(relay_parent_header.number()) {
    continue;
}
...
let Some(adjusted_authoring_duration) = adjusted_authoring_duration else {
    // But this case doesn't guard against the first opportunity
    last_terminated_relay_number = Some(relay_parent_header.number());
}
```
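The same idea can be isolated into a small, self-contained guard, which may make it easier to reason about. This is a sketch under assumptions: `TerminatedGuard` and its methods are hypothetical names, not part of the PR, and the sketch only tracks relay block numbers, so it shares the reorg caveat mentioned above and does not cover the skipped first opportunity.

```rust
/// Hypothetical guard (not from the PR): remembers the highest relay block
/// number on which block production terminated.
#[derive(Default)]
struct TerminatedGuard {
    last_terminated: Option<u32>,
}

impl TerminatedGuard {
    /// Returns true when `relay_number` is not strictly greater than the
    /// last terminated relay block, i.e. the relay parent is stale.
    fn should_skip(&self, relay_number: u32) -> bool {
        self.last_terminated
            .map_or(false, |last| relay_number <= last)
    }

    /// Records that production terminated on `relay_number`.
    /// The stored number never decreases.
    fn mark_terminated(&mut self, relay_number: u32) {
        let highest = self
            .last_terminated
            .map_or(relay_number, |last| last.max(relay_number));
        self.last_terminated = Some(highest);
    }
}
```

Because only the number is compared, a reorg to a different relay block at the same height would also be skipped; comparing hashes as well would be needed to tell those apart.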

This PR addresses two issues with the collators building blocks:
When issue 1 happens, collator A builds fewer blocks than expected, degrading block times.
When issue 2 happens, collator A is competing (outside of its slot) with another collator B. Since both collators are building their blocks at roughly the same time, it is a matter of chance which one gets backed first by the relay chain.
However, in 95% of incidents, collator B's block gets backed by the relay chain.
Collator A is at the end of its slot (one block past it, in fact), while collator B has a fresh connection with the backing group. One theory is that connections might degrade over time, which could explain why collator B's block gets backed in 95% of incidents.
This exposes Issue 3 (unaddressed in this PR):
Root Cause
Because we use an RP offset=1, we only start building on top of `relay_parent=0xa052…1016 relay_parent_num=30397129` when `#30397130 (0xa052…1016 → 0x201f…fc48)` gets imported.

Node 3 was properly building 10 blocks, skipping 2. Then on the next block production opportunity, the relay block import races with building:
- `10:20:31.002` we start building the 11th block outside our slot because we still build on top of `30397128`
- `10:20:31.017` node imports relay 30397130, which would have seen best as `30397129` and detected the para slot change

Logs
Node 3
Node 1 Imports
Node 3 Imports
This has been detected using:
Testing Done
cc @sandreim @skunert