-
Notifications
You must be signed in to change notification settings - Fork 1.2k
aura/slot_based: Fix effective slot deadline using relay parent offset #11453
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Changes from all commits
1dd1358
35ffc0d
3c730de
a963b39
e9fdc0b
68067d4
30a70de
f3a752f
fead047
dcb9520
542d7e7
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -51,7 +51,7 @@ use sp_application_crypto::AppPublic; | |
| use sp_blockchain::HeaderBackend; | ||
| use sp_consensus::Environment; | ||
| use sp_consensus_aura::AuraApi; | ||
| use sp_core::crypto::Pair; | ||
| use sp_core::{crypto::Pair, H256}; | ||
| use sp_inherents::CreateInherentDataProviders; | ||
| use sp_keystore::KeystorePtr; | ||
| use sp_runtime::{ | ||
|
|
@@ -188,6 +188,46 @@ where | |
| collator_util::Collator::<Block, P, _, _, _, _, _>::new(params) | ||
| }; | ||
|
|
||
| // Tracker helper to ensure we do not build an extra block at slot boundaries | ||
| // with edge-cases of importing blocks after the `slot_timer` ticks. | ||
| // | ||
| // The following scenarios are degrading block confidence (happening on the same instance): | ||
| // | ||
| // [ wall slot | para slot | next wall slot] | ||
| // opportunity 1: [ 803 | 803 | 490ms ] | ||
| // - The wall slot is behind the para slot deduced by the relay block | ||
| // - The next slot 804 arrives in 490ms leaving no room for the 1s authoring duration | ||
| // - collator must skip the building the first block for this relay block | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yep exactly, when we first deduce para slot 803, it is overlaped in wall slot 803 This leaves us ~490ms until next wall slot 804 arrives, and because it is within the 1s of the next slot it gets skipepd
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't understand what you are saying. Your example in the code doesn't make sense to me nor your explanation. Why do we only have 490ms? |
||
| // | ||
| // opportunity 2-10: [ 804 | 803 | 6s ] | ||
| // - This is the proper case where we have sufficient time to build and build 10 blocks | ||
| // | ||
| // opportunity 11-12: [ 804 | 803 | 990ms-500ms ] | ||
| // - At this point we skip building the last 2 blocks to give room for the 1s authoring | ||
| // duration before the next slot arrives. | ||
| // | ||
| // opportunity 13: [ 805 | 803 | 6s ] | ||
| // - The new relay block was not imported soon enough, therefore we detect the same relay | ||
| // parent that we just skipped. Since the import of a new block is usually ~15ms after | ||
| // the wall slot ticks. | ||
| // - We don't want to build on this relay parent and instead skip until the next relay | ||
| // block arrives. | ||
| struct ParentTracker { | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This should be moved outside of this function (the type declaration) |
||
| /// Hash of the relay block. | ||
| hash: Option<H256>, | ||
| /// True if the collator built a block for the current relay parent, false otherwise. | ||
| /// | ||
| /// This state is needed, otherwise the opportunity 1 might mark the block as | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I am wondering if this is true. Opportunity 1 can only happen if we skip the last block because of slot handover. So the collator authoring the blocks for the next RP is someone else entirely anyway. If we run into this situation it should not matter whether this has triggered 🤔 .
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
In this case, the same collator that sees opportunity 1 is the collator that will build in the next wall slot (going through all the rest of the block building opportunities). Then if we ignore the Have added a new diagram here for the race condition: #11453 (comment) |
||
| /// terminated wrongfully. | ||
| has_built: bool, | ||
| /// True if the collator adjusted the slot timer and decided there is no time to build | ||
| /// a block, false otherwise. This is set to true on opportunity 11. | ||
| has_terminated: bool, | ||
| } | ||
|
|
||
| let mut parent_tracker = | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Isn't it not just enough to check that the block number of the relay chain is strictly increasing? So, we do not build on the same block twice?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Let me know if I got the idea right:
Does this guard against the first case? The scenario where the first block opportunity is skipped for the same wall slot and para slot? 🤔 Also it might have a tiny race with reorgs? Maybe I got the idea wrong: // We use a simple number to detect when we terminated with the block production
let mut last_terminated_relay_number: Option<u32> = None;
...
let relay_parent = rp_data.relay_parent().hash();
let relay_parent_header = rp_data.relay_parent().clone();
// Could re-org with a different parent, but mostly ok since that doesnt happen that often?
if last_terminated_relay_number >= Some(relay_parent_header.number()) {
continue;
}
...
let Some(adjusted_authoring_duration) = adjusted_authoring_duration else {
// But this case doesnt guard against the first opportunity
last_terminated_relay_number = Some(relay_parent_header.number());
} |
||
| ParentTracker { hash: None, has_built: false, has_terminated: false }; | ||
|
|
||
| let mut relay_chain_data_cache = RelayChainDataCache::new(relay_client.clone(), para_id); | ||
| let mut connection_helper = BackingGroupConnectionHelper::new( | ||
| keystore.clone(), | ||
|
|
@@ -240,6 +280,28 @@ where | |
| let relay_parent = rp_data.relay_parent().hash(); | ||
| let relay_parent_header = rp_data.relay_parent().clone(); | ||
|
|
||
| // Reset the state of the tracker in case of new blocks or re-orgs. | ||
| if parent_tracker.hash != Some(relay_parent) { | ||
| parent_tracker = ParentTracker { | ||
| hash: Some(relay_parent), | ||
| has_built: false, | ||
| has_terminated: false, | ||
| }; | ||
| } | ||
|
|
||
| // The collator has terminated the building time, the `adjust_authoring_duration` has | ||
| // determined that there is no time left to build the block. This block must be skipped | ||
| // to avoid building blocks wrongfully when the import races with | ||
| // `wait_until_next_slot`. | ||
| if parent_tracker.has_terminated { | ||
| tracing::debug!( | ||
| target: crate::LOG_TARGET, | ||
| ?relay_parent, | ||
| "Skipping block production on terminated relay parent." | ||
| ); | ||
| continue; | ||
| } | ||
|
|
||
| let Some(parent_search_result) = | ||
| crate::collators::find_parent(relay_parent, para_id, &*para_backend, &relay_client) | ||
| .await | ||
|
|
@@ -438,6 +500,14 @@ where | |
| "Not building block due to insufficient authoring duration." | ||
| ); | ||
|
|
||
| // In practice with 12 cores, it is possible that the wall clock (aura_slot) aligns | ||
| // perfectly with the parachain slot deduced from the relay block. For those cases, | ||
| // the first block production opportunity will be skipped immediately. This field | ||
| // ensures the block is marked as terminated only after building. | ||
| if parent_tracker.has_built { | ||
| parent_tracker.has_terminated = true; | ||
| } | ||
|
|
||
| continue; | ||
| }; | ||
|
|
||
|
|
@@ -461,6 +531,9 @@ where | |
| continue; | ||
| }; | ||
|
|
||
| // Mark the collator has built a block on this relay parent. | ||
| parent_tracker.has_built = true; | ||
|
|
||
| let new_block_hash = candidate.block.header().hash(); | ||
|
|
||
| // Announce the newly built block to our peers. | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They are both the same here?