Node Support for Aura -> Babe Runtime Upgrades #1927
base: devnet-ready
Conversation
Force-pushed from 7d8246c to 483729e
Force-pushed from eaba58d to 5d2d0e2
update Cargo.lock (force-pushed from f9efe9f to c0c7938)
Do you mind expanding this?
"Warp sync only works from a fresh sync. This means, if the service switches from Aura -> Babe mid-warp-sync, the warp sync will not continue correctly."
Please consider adding more details for the changes - the high-level description helps, but it lacks the technical reasoning. Maybe even a list of the added algorithms with descriptions. Also, consider adding a description for major commit updates - for example, this PR was changed significantly compared to the previously approved version, which required full retesting and re-reviewing from the beginning.
New features (example):
- babe-switch signal that passes a message from aura consensus to babe consensus. It shuts down the aura service and starts the babe service, which in turn ....
- hybrid import queue, which checks only....
- babe configuration parameters (EPOCH_DURATION) .....
Did you consider rare situations during the consensus switch (end of the epoch, validator failure, etc.)? I wonder whether there are situations when aura produces at least one block in the presence of the babe configuration?
```rust
let import_queue = super::aura_wrapped_import_queue::import_queue(
    crate::consensus::aura_wrapped_import_queue::ImportQueueParams {
        // Aura needs the hybrid import queue, because it needs to
        // 1. Validate the first Babe block it encounters before switching into Babe
```
Why only the "first Babe block"? Shouldn't it import all blocks from the target warp sync block (the state import block) up to the head before it starts importing the block header history?
Good catch, that's an old comment from before `AuraConsensus` could import Babe blocks. I will update it.
Cargo.toml (outdated)
```toml
pallet-subtensor-swap-rpc = { default-features = false, path = "pallets/swap/rpc" }
node-subtensor-runtime = { path = "runtime", default-features = false }
pallet-admin-utils = { path = "pallets/admin-utils", default-features = false }
pallet-collective-otf = { path = "pallets/collective", default-features = false }
```
We usually use "subtensor" as the middle word for pallets. Also, why does it cause issues? Do you mind describing the conflicts?
The conflict is that these crate names are identical to crates in `polkadot-sdk`. This means that when we patch the Parity `polkadot-sdk` crates with the OTF `polkadot-sdk`, there is a risk that we will accidentally overwrite our crates with the `polkadot-sdk` versions.
By making the names of our crates unique, it is impossible for us to accidentally patch them.
I can change the name to put `-subtensor-` in the middle instead of the `-otf` suffix if you prefer.
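For illustration, a hedged sketch of the kind of `[patch]` section where the collision could occur; the branch name is a placeholder, and `pallet-collective` stands in for any workspace crate sharing a name with a polkadot-sdk crate:

```toml
# Hypothetical example: a [patch] entry redirecting polkadot-sdk crates to
# the OTF fork. If the workspace also defined its own crate named
# `pallet-collective`, this entry could silently shadow it with the fork's
# version. Renaming ours to `pallet-collective-otf` makes that impossible.
[patch."https://github.com/paritytech/polkadot-sdk"]
pallet-collective = { git = "https://github.com/opentensor/polkadot-sdk", branch = "some-branch" }
```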
@gztensor What do you suggest?
```rust
    &self,
    block: BlockCheckParams<Block>,
) -> Result<ImportResult, Self::Error> {
    self.inner_aura.check_block(block).await.map_err(Into::into)
```
Why do we check only `aura` blocks?
The Aura and Babe `check_block` implementations are identical. I will add a comment.
```rust
        break;
    }
};
tokio::time::sleep(slot_duration.as_duration()).await;
```
What happens if the actual block duration exceeds the slot duration? And in general, what if the trigger activates in an unsynchronized way after block production?
> What happens if the actual block duration exceeds the slot duration?

It will wait one more "block_duration" period of time before it checks again.

> And in general, if the trigger activates in unsynchronized way after the block production?

It is actually unlikely that all nodes will restart "perfectly" synchronised with one another. In practice with baedeker state, their restart times vary by ~1 second. This is not a problem though; even in the case of some catastrophic disruption where they restart minutes apart, as soon as 2/3 of them are online they will begin finalizing again.
Apologies that you found my description unclear. I will write up the changes in more detail. I am also happy to jump on a call with you to walk through anything, if you might find that helpful.
Warp sync does not work if you try to start on a partially synced DB, it logs an error and returns to full sync. Please let me know if there is something in particular about that you find unclear.
There are no circumstances where an Aura block could be produced after the Babe runtime upgrade. The runtime after that point is configured only for Babe.
Hi @shamil-gadelshin, I have addressed your comments and also re-written the "Update Aug 26" section of the PR to be more in-depth and include technical reasoning. Please let me know if that is more clear, and whether you have any further questions. As always, I appreciate your time and help reviewing this PR.
Thank you! I was finally able to understand the warp sync issues and some of your decisions.
I have a couple of minor issues left (crate names and the "triggered" variable). Otherwise, it looks good.
Revert "…esParams" (this reverts commit 00d5003)
Force-pushed from 9d0ea5d to 914030d
```rust
// sync is still in progress prior to switching, the warp sync will not
// complete successfully.
let syncing = sync_service.status().await.is_ok_and(|status| status.warp_sync.is_some() || status.state_sync.is_some());
if !c.authorities.is_empty() && !syncing {
```
Are there any caveats for not switching to Babe immediately?
Not anything I can think of while syncing.
After the conversation with @gztensor we confirmed that we use "-subtensor-" for such cases. @liamaharon, please rename the crates when you have time.
Force-pushed from 3a7e6d4 to 8f2b47e
Thanks @gregzaitsev and @shamil-gadelshin, I've replaced the
Phase 2 in #1887

Phase 2 additions:
- `-otf` suffix added to subtensor crates that share the same name as a polkadot-sdk crate. This is to prevent name collisions when applying patches, and to ensure we don't accidentally use the wrong crate.

Update Aug 26
I discovered there is an issue warp syncing a chain that contains an Aura to Babe migration. The warp sync would succeed up until the first Babe block is mined, and then revert to a regular sync.
Root Cause
Polkadot SDK does not allow starting a service with warp sync if the database is in a partially synced state: https://github.com/opentensor/polkadot-sdk/blob/d13f915d8a1f55af53fd51fdb4544c47badddc7e/substrate/client/network/sync/src/strategy/warp.rs#L235-L252
In the initial implementation of this PR, while a node is warp syncing, it will restart part-way through the sync when it detects the first Babe block.
At the time of the service restart, the node has a partially synced database. This causes the previously referenced code to terminate the warp sync, and fall back to a regular sync.
Solution
To resolve this issue, we must warp sync the entire chain (Aura AND Babe blocks) without restarting the service.
This is achieved in two parts:
First, I prevented the Aura service from switching to a Babe service while the node is syncing. This was achieved by adding new code here:
subtensor/node/src/consensus/aura_consensus.rs, line 222 in a234096
Second, I added support for the node to import Babe blocks while running an Aura service. This was achieved by replacing the `AuraWrappedImportQueue` with a `HybridImportQueue` that contains:
- `HybridBlockImport`, which contains inner full implementations of `AuraBlockImport` and `BabeBlockImport`
- `HybridVerifier`, which contains inner full implementations of the `AuraVerifier` and `BabeVerifier`
- an `import_queue` function that builds an `ImportQueue` implementation capable of completely importing both Aura and Babe blocks.

The Aura service is required to construct a `BabeConfiguration` to pass to the hybrid `import_queue`, so it can import the first Babe block it encounters. This required me to pull in some Babe runtime configuration from #1708 into this PR, specifically the `BABE_GENESIS_EPOCH_CONFIG` and `EPOCH_DURATION_IN_BLOCKS`.

With these runtime constants, we are able to construct what our initial Babe configuration will be while running an Aura service:
subtensor/node/src/consensus/aura_consensus.rs, lines 249 to 285 in a234096
Summary
With the two changes described in "Solution", when a node is warp syncing, it will warp sync the entire chain (all Aura and Babe blocks) and import the state entirely before it switches to running a Babe service. This resolves the root cause of the issue, which is that the node would restart mid-warp-sync.
It is important this phase is merged prior to phase 3, so node operators have time to upgrade in advance of the runtime upgrade.
Steps to simulate Aura -> Babe migration with finney state
```shell
$ git checkout node-decentralization
$ cargo b -r -p node-subtensor && cp ./target/release/wbuild/node-subtensor-runtime/node_subtensor_runtime.compact.compressed.wasm ./babe-npos.wasm
$ git checkout hybrid-node
$ rm ./target/release/node-subtensor && cargo b -r -p node-subtensor
$ cd ../baedeker-for-subtensor
$ ./localnet-baedeker.sh
```
QA Checklist
- `--initial-consensus aura` gracefully switches to Babe post-upgrade
- `--initial-consensus babe` gracefully switches to Aura pre-upgrade
- `devnet-ready` state/runtime
- `hybrid-node` state/runtime
- `baedeker-finney` state/runtime
- `devnet-ready` state/runtime
- `hybrid-node` state/runtime
- `devnet-ready` state/runtime
- `hybrid-node` state/runtime
- `baedeker-finney` state/runtime
- `devnet-ready`
- `hybrid-node`
- `devnet-ready` state/runtime
- `hybrid-node` state/runtime
- `baedeker-finney` state/runtime