
aura/slot_based: Fix effective slot deadline using relay parent offset #11453

Open
lexnv wants to merge 11 commits into master from lexnv/fix-authoring-blocks

Conversation


@lexnv lexnv commented Mar 20, 2026

This PR addresses two issues with collators building blocks:

  • Issue 1: Skipping the first block of the slot:
    • Wall clock (slot) 803, parachain slot 803
    • Last reported slot is 803, next slot is 804, 491ms remaining => deadline = 491ms - 1000ms saturates to 0 (skips the block)
  • Issue 2: Wrongfully building the first block of the next slot:
    • Wall clock 805, parachain slot 803 (the new relay block is not yet imported, so the same relay parent yields the same, now outdated, para slot)
    • Last reported slot is 805, next slot is 806, 6s remaining => deadline = 6s - 1s = 5s => builds the block
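The deadline arithmetic behind both issues can be sketched as follows. This is a minimal illustration of the formula described above; the helper name and the 1s constant are assumptions for the sketch, not the actual cumulus code:

```rust
use std::time::Duration;

// Assumed authoring duration for the sketch (the "1s drift" above).
const AUTHORING_DURATION: Duration = Duration::from_millis(1000);

/// Returns `None` when there is not enough time left in the wall slot
/// to author a block (issue 1), `Some(deadline)` otherwise.
fn proposal_deadline(time_until_next_slot: Duration) -> Option<Duration> {
    // Saturating subtraction: 491ms - 1000ms collapses to zero.
    let deadline = time_until_next_slot.saturating_sub(AUTHORING_DURATION);
    (deadline > Duration::ZERO).then_some(deadline)
}

fn main() {
    // Issue 1: only 491ms remain in wall slot 803 => block is skipped.
    assert_eq!(proposal_deadline(Duration::from_millis(491)), None);
    // Issue 2: the wall clock moved to 805 while the para slot is still 803;
    // ~6s remain, so the check passes and the block is wrongly built.
    assert!(proposal_deadline(Duration::from_millis(5992)).is_some());
}
```

The second assertion is exactly the failure mode: the deadline check only looks at wall-clock headroom, not at whether the relay parent still matches the current slot.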

When issue 1 happens, collator A builds fewer blocks than expected, degrading block times.

When issue 2 happens, collator A is competing (outside of its slot) with another collator B. Since both collators build their blocks at roughly the same time, it is a matter of chance which one gets backed first by the relay chain.

However, in 95% of incidents, collator B's block gets backed by the relay chain.
Collator A is at the end of its slot (one block past it, in fact), while collator B has a fresh connection to the backing group. One theory is that connections degrade over time, which would explain why collator B's block gets backed in 95% of incidents.

This exposes Issue 3 (unaddressed in this PR):

  • T0: Collator A advertises its block 0xA (wrongfully built)
  • T0: Collator B advertises block 0xB at the same height
  • T1: Collator B imports block 0xA as best, then starts building 9 more blocks on this fork
  • T2: The relay chain backs 0xB, invalidating that fork and degrading block confidence

Root Cause

Because we use an RP offset=1, we only start building on top of relay_parent=0xa052…1016 relay_parent_num=30397129 when #30397130 (0xa052…1016 → 0x201f…fc48) gets imported.

Node 3 properly built 10 blocks and skipped 2. Then, at the next block production opportunity, the relay block import races with block building:

  • at 10:20:31.002 we start building the 11th block outside our slot, because we are still building on top of 30397128
  • at 10:20:31.017 the node imports relay block 30397130, which would have made 30397129 the best block and revealed the para slot change
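One way to read the race: with a relay-parent offset of 1, the wall slot may legitimately run at most one slot ahead of the para slot derived from the relay parent. A hypothetical staleness guard (illustrative only, not the PR's actual fix):

```rust
// Assumed for the sketch: offset of 1 relay block, and relay block time
// equal to the 6s para slot duration, so the wall slot may lead the
// relay-parent-derived para slot by at most one.
const RELAY_PARENT_OFFSET: u64 = 1;

/// True when the relay parent no longer matches the current wall slot,
/// i.e. the relay block carrying the slot change was not imported yet.
fn relay_parent_is_stale(wall_slot: u64, para_slot: u64) -> bool {
    wall_slot.saturating_sub(para_slot) > RELAY_PARENT_OFFSET
}

fn main() {
    // First opportunity: wall slot 803, para slot 803 => fine.
    assert!(!relay_parent_is_stale(803, 803));
    // Normal offset-1 operation: wall slot 804, para slot 803 => fine.
    assert!(!relay_parent_is_stale(804, 803));
    // 10:20:31.002: wall slot 805 but the relay parent still implies
    // para slot 803 => stale, building would be wrongful (issue 2).
    assert!(relay_parent_is_stale(805, 803));
}
```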

Logs

Node 3

// Skips first block opportunity
10:20:24.501
aura::cumulus: [Parachain] New block production opportunity. slot_duration=SlotDuration(6000) 
  aura_slot=Slot(295623803) relay_parent=0xf9700b74 relay_parent_num=30397128
    slot=Slot(295623803)
  duration_until_next_slot=491.263683ms
  Adjusted proposal duration. duration=None
  
10:20:25.002
aura::cumulus: [Parachain] New block production opportunity. slot_duration=SlotDuration(6000)
  aura_slot=Slot(295623804) relay_parent=0xf9700b74 relay_parent_num=30397128
    slot=Slot(295623803)
  duration_until_next_slot=5.992199112s 
  Adjusted proposal duration. duration=Some(493ms)

// Builds 9 more blocks

// Skips 2 blocks
10:20:30.000
aura::cumulus: [Parachain] New block production opportunity. slot_duration=SlotDuration(6000)
  aura_slot=Slot(295623804)  relay_parent=0xf9700b74 relay_parent_num=30397128
  duration_until_next_slot=993.059675ms
  Adjusted proposal duration. duration=None
  
10:20:30.500
aura::cumulus: [Parachain] New block production opportunity. slot_duration=SlotDuration(6000)
  aura_slot=Slot(295623804) relay_parent=0xf9700b74 relay_parent_num=30397128
  duration_until_next_slot=493.092078ms
  Adjusted proposal duration. duration=None

// The issue happens here: aura_slot is the wall clock, which keeps
// incrementing, while the relay parent stays fixed and still yields
// para slot 803, when in fact the relay chain has advanced and this
// node has not seen it yet.
//
// Sufficient time to build: wrongfully allows building

10:20:31.002
aura::cumulus: [Parachain] New block production opportunity. slot_duration=SlotDuration(6000)
  aura_slot=Slot(295623805) relay_parent=0xf9700b74 relay_parent_num=30397128
       slot=Slot(295623803)
  duration_until_next_slot=5.991286702s
  Adjusted proposal duration. duration=Some(492ms)

Node 1 Imports

10:20:24.419 [Relaychain] 🏆 Imported #30397129 (0xf970…0b74 → 0xa052…1016)
10:20:30.960 [Relaychain] 🏆 Imported #30397130 (0xa052…1016 → 0x201f…fc48)
10:20:36.522 [Relaychain] 🏆 Imported #30397131 (0x201f…fc48 → 0xd6b3…5527)

Node 3 Imports

10:20:24.378 [Relaychain] 🏆 Imported #30397129 (0xf970…0b74 → 0xa052…1016)
10:20:31.017 [Relaychain] 🏆 Imported #30397130 (0xa052…1016 → 0x201f…fc48)
10:20:36.524 [Relaychain] 🏆 Imported #30397131 (0x201f…fc48 → 0xd6b3…5527)

This has been detected using:

Testing Done

  • Added unit tests
  • Deployment in progress to double-check block confidence

cc @sandreim @skunert

lexnv added 5 commits March 20, 2026 12:03
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
@lexnv lexnv self-assigned this Mar 20, 2026
@lexnv lexnv added the T0-node This PR/Issue is related to the topic “node”. label Mar 20, 2026

bkchr commented Mar 20, 2026

  • T5: Relay parent backs 0xC3

But this would be correct? You probably mean the other one?

@bkchr bkchr left a comment

This is the last time I look at such AI slop. IT IS YOUR JOB TO LOOK OVER THIS CODE BEFORE OPENING A PR. Next time I see such a PR, I will just close it.

Reading the explanation, it is extremely hard to follow what is going on. The changes are just "wild". If I understand it correctly and it is about the relay parent offset, wouldn't it be much simpler to directly remove the relay parent offset from duration_now? Then we don't need to adjust the slot later on, and it should simplify this PR drastically.


lexnv commented Mar 23, 2026

Yep, I'm going to rethink the PR and update the description; I believe we can trim it down to a few lines 🙏

Over the weekend:

  • Polkadot YAP went from 97.38% to 97.81%: improved by 0.43%
  • Kusama YAP went from 98.8% to 99.66%: improved by 0.86%

lexnv added 4 commits March 23, 2026 12:32
@lexnv lexnv requested review from bkchr and skunert March 24, 2026 14:48
@skunert skunert left a comment

Looks much better now, left one comment we should quickly discuss, but then ready to approve.

Case 1 mentioned here took me some mental gymnastics to get right again, even though it's not so complicated.
I came up with this illustration to clarify:

[Image: illustration of case 1]

hash: Option<H256>,
/// True if the collator built a block for the current relay parent, false otherwise.
///
/// This state is needed, otherwise the opportunity 1 might mark the block as
Contributor:

I am wondering if this is true. Opportunity 1 can only happen if we skip the last block because of slot handover. So the collator authoring the blocks for the next RP is someone else entirely anyway. If we run into this situation it should not matter whether this has triggered 🤔 .

Contributor Author:

Opportunity 1 can only happen if we skip the last block because of slot handover. So the collator authoring the blocks for the next RP is someone else entirely anyway.

In this case, the same collator that sees opportunity 1 is the collator that will build in the next wall slot (going through all the remaining block building opportunities). If we then ignore the has_built flag, we effectively short-circuit the building of 10 blocks in wall slot 804 and para slot 803 🤔

Have added a new diagram here for the race condition: #11453 (comment)
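The per-relay-parent state being debated here can be consolidated into a sketch. The field set is assembled from the diff fragments visible in this review, so treat the names and types as assumptions:

```rust
// Sketch of the tracker state discussed in this thread; not the PR's
// actual type. The hash type is simplified to u64 for the example.
#[derive(Default)]
struct ParentTracker {
    /// Relay parent we are currently tracking (None before the first block).
    hash: Option<u64>,
    /// True if the collator built a block for the current relay parent.
    has_built: bool,
    /// True once production on this relay parent terminated
    /// (e.g. the slot handover left no time to author).
    has_terminated: bool,
}

impl ParentTracker {
    /// Reset the flags whenever a new relay parent arrives, so state from
    /// the previous parent cannot leak into the next one.
    fn on_new_relay_parent(&mut self, hash: u64) {
        if self.hash != Some(hash) {
            *self = ParentTracker { hash: Some(hash), ..Default::default() };
        }
    }
}

fn main() {
    let mut t = ParentTracker::default();
    t.on_new_relay_parent(1);
    t.has_built = true;
    t.has_terminated = true;
    // A new relay parent clears both flags.
    t.on_new_relay_parent(2);
    assert!(!t.has_built && !t.has_terminated);
}
```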


lexnv commented Mar 25, 2026

[Screenshot: race-condition diagram]

This diagram shows the race conditions:

  • (Issue 1 from the PR description): Wall clock is 803 and parachain slot is 803. Collator A attempts to build this block, but building is skipped because the next wall clock slot 804 arrives in ~490ms and we have the 1s drift
  • Collator A then builds 10 more blocks in wall clock slot 804 and parachain slot 803 (expected behavior)
  • Collator A then skips building 2 blocks because of the 1s drift (expected)
  • (Issue 2 from the PR description): Wall clock is now 805 and parachain slot is still 803 (a new relay block hasn't been imported yet). Collator A wrongfully builds this block (14th attempt in total)

//
// [ wall slot | para slot | next wall slot]
// opportunity 1: [ 803 | 803 | 490ms ]
// - The wall slot is behind the para slot deduced by the relay block
Member:
They are both the same here?

// opportunity 1: [ 803 | 803 | 490ms ]
// - The wall slot is behind the para slot deduced by the relay block
// - The next slot 804 arrives in 490ms leaving no room for the 1s authoring duration
// - collator must skip the building the first block for this relay block
Member:
What?

Contributor Author:

Yep, exactly: when we first deduce para slot 803, it overlaps wall slot 803.

This leaves us ~490ms until the next wall slot 804 arrives, and because that is within the 1s authoring duration of the next slot, the block gets skipped.
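The ~490ms figure follows from where the opportunity lands inside the 6s wall slot. A sketch of the boundary arithmetic, assuming slots are aligned to multiples of the slot duration (the alignment is an assumption for the example):

```rust
use std::time::Duration;

/// Time remaining until the next slot boundary, assuming slot boundaries
/// fall on multiples of `slot_duration_ms`.
fn time_until_next_slot(now_ms: u64, slot_duration_ms: u64) -> Duration {
    Duration::from_millis(slot_duration_ms - (now_ms % slot_duration_ms))
}

fn main() {
    // The opportunity at 10:20:24.501 fires ~5509ms into wall slot 803,
    // leaving ~491ms until slot 804 begins — less than the 1s authoring
    // duration, so the block is skipped.
    assert_eq!(time_until_next_slot(5509, 6000), Duration::from_millis(491));
}
```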

Member:
I don't understand what you are saying. Your example in the code doesn't make sense to me, nor does your explanation. Why do we only have 490ms?

// the wall slot ticks.
// - We don't want to build on this relay parent and instead skip until the next relay
// block arrives.
struct ParentTracker {
Member:
The type declaration should be moved outside of this function.

has_terminated: bool,
}

let mut parent_tracker =
Member:

Isn't it enough to just check that the block number of the relay chain is strictly increasing? So we do not build on the same block twice?

@lexnv lexnv Mar 25, 2026

Let me know if I got the idea right:

  • We record a block number when we finish building the 10 blocks for this slot, because we hit the 1s limit at the end of the slot (this signals we are at the end of the slot and production on that parent is terminated)
  • Therefore, when the next wall slot arrives, the relay block we build on must have a strictly greater number than the one we recorded; otherwise we are building on the same stale parent

Does this guard against the first case? The scenario where the first block opportunity is skipped for the same wall slot and para slot? 🤔

Also, it might have a tiny race with re-orgs? Maybe I got the idea wrong:

// We use a simple number to detect when we terminated block production.
let mut last_terminated_relay_number: Option<u32> = None;
...

let relay_parent = rp_data.relay_parent().hash();
let relay_parent_header = rp_data.relay_parent().clone();

// Could re-org with a different parent, but mostly ok since that
// doesn't happen that often?
if last_terminated_relay_number >= Some(relay_parent_header.number()) {
    continue;
}

...

// `let ... else` must diverge, hence the `continue`.
let Some(adjusted_authoring_duration) = adjusted_authoring_duration else {
    // But this case doesn't guard against the first opportunity.
    last_terminated_relay_number = Some(relay_parent_header.number());
    continue;
};

