Conversation

@liamaharon liamaharon commented Aug 7, 2025

Phase 2 in #1887

Phase 2 additions:

  • Applied the OTF-patched Polkadot SDK
  • Added code to automatically sync the Aura keystore with the Babe keystore
  • Added an -otf suffix to subtensor crates that share a name with a polkadot-sdk crate, to prevent name collisions when applying patches and to ensure we don't accidentally use the wrong crate
  • Fixed this vulnerability raised by @shamil-gadelshin
  • Fixed an edge case where the node would not automatically switch from the Babe service to the Aura service

Update Aug 26

I discovered an issue with warp syncing a chain that contains an Aura to Babe migration: the warp sync would proceed until the first Babe block was encountered, then revert to a regular sync.

Root Cause

Polkadot SDK does not allow starting a service with warp sync if the database is in a partially synced state: https://github.com/opentensor/polkadot-sdk/blob/d13f915d8a1f55af53fd51fdb4544c47badddc7e/substrate/client/network/sync/src/strategy/warp.rs#L235-L252

In the initial implementation of this PR, a warp-syncing node would restart part-way through the sync when it detected the first Babe block.

At the time of the service restart, the node has a partially synced database, which causes the previously referenced code to terminate the warp sync and fall back to a regular sync.

Solution

To resolve this issue, we must warp sync the entire chain (Aura AND Babe blocks) without restarting the service.

This is achieved in two parts:

First, I prevented the Aura service from switching to a Babe service while the node is syncing. This was achieved by adding new code here:

```rust
let syncing = sync_service
    .status()
    .await
    .is_ok_and(|status| status.warp_sync.is_some() || status.state_sync.is_some());
```
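To illustrate the gating decision, here is a minimal, self-contained sketch with stand-in types. `SyncStatus` below is a simplified stand-in for the status struct returned by the real sync service, not the PR's actual type:

```rust
/// Simplified stand-in for the sync service's status; the field names mirror
/// the real struct, but this is an illustrative sketch only.
pub struct SyncStatus {
    pub warp_sync: Option<()>,  // Some(_) while a warp sync is in progress
    pub state_sync: Option<()>, // Some(_) while a state sync is in progress
}

/// The Aura service may only switch to the Babe service once neither a warp
/// sync nor a state sync is in progress; restarting mid-warp-sync would leave
/// a partially synced database and force a fall back to regular sync.
pub fn may_switch_to_babe(status: Result<SyncStatus, ()>, has_babe_authorities: bool) -> bool {
    let syncing = status
        .map(|s| s.warp_sync.is_some() || s.state_sync.is_some())
        .unwrap_or(false); // mirrors `is_ok_and`: an errored status counts as not syncing
    has_babe_authorities && !syncing
}
```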

Second, I added support for the node to import Babe blocks while running an Aura service. This was achieved by replacing the AuraWrappedImportQueue with a HybridImportQueue that contains:

  • A HybridBlockImport that contains inner full implementations of AuraBlockImport and BabeBlockImport
  • A HybridVerifier that contains inner full implementations of the AuraVerifier and BabeVerifier
  • An import_queue function that builds an ImportQueue implementation capable of completely importing both Aura and Babe blocks.
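As a rough sketch of how such a hybrid component can route blocks to the right inner implementation, the following dispatches on the block's consensus engine id. The enum and functions are illustrative stand-ins, not the PR's actual types; `"aura"` and `"BABE"` are the engine ids Substrate uses in pre-runtime digests:

```rust
// Illustrative stand-ins: the real HybridVerifier wraps the full Substrate
// AuraVerifier and BabeVerifier and inspects the block's consensus digest.
#[derive(Debug, PartialEq)]
pub enum Engine {
    Aura,
    Babe,
}

/// A block header carries a pre-runtime digest tagged with a 4-byte
/// consensus engine id; "aura" and "BABE" are the real Substrate ids.
fn engine_for(pre_digest_engine_id: &[u8; 4]) -> Option<Engine> {
    match pre_digest_engine_id {
        b"aura" => Some(Engine::Aura),
        b"BABE" => Some(Engine::Babe),
        _ => None,
    }
}

/// The hybrid verifier routes each block to the matching inner verifier,
/// so Babe blocks can be imported while the Aura service is still running.
pub fn verify(pre_digest_engine_id: &[u8; 4]) -> Result<Engine, String> {
    engine_for(pre_digest_engine_id).ok_or_else(|| "unknown consensus engine".to_string())
}
```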

The Aura service is required to construct a BabeConfiguration to pass to the hybrid import_queue, so it can import the first Babe block it encounters. This required me to pull in some Babe runtime configuration from #1708 into this PR, specifically the BABE_GENESIS_EPOCH_CONFIG and EPOCH_DURATION_IN_BLOCKS.

With these runtime constants, we are able to construct what our initial Babe configuration will be while running an Aura service:

```rust
/// Returns what the Babe configuration is expected to be at the first Babe block.
///
/// This is required for the hybrid import queue, so it is ready to validate the first encountered
/// babe block(s) before switching to Babe consensus.
fn get_expected_babe_configuration<B: BlockT, C>(
    client: &C,
) -> sp_blockchain::Result<BabeConfiguration>
where
    C: AuxStore + ProvideRuntimeApi<B> + UsageProvider<B>,
    C::Api: AuraApi<B, AuraAuthorityId>,
{
    let at_hash = if client.usage_info().chain.finalized_state.is_some() {
        client.usage_info().chain.best_hash
    } else {
        client.usage_info().chain.genesis_hash
    };
    let runtime_api = client.runtime_api();
    let authorities = runtime_api
        .authorities(at_hash)?
        .into_iter()
        .map(|a| (BabeAuthorityId::from(a.into_inner()), 1))
        .collect();
    let slot_duration = runtime_api.slot_duration(at_hash)?.as_millis();
    let epoch_config = node_subtensor_runtime::BABE_GENESIS_EPOCH_CONFIG;
    let config = sp_consensus_babe::BabeConfiguration {
        slot_duration,
        epoch_length: node_subtensor_runtime::EPOCH_DURATION_IN_SLOTS,
        c: epoch_config.c,
        authorities,
        randomness: Default::default(),
        allowed_slots: epoch_config.allowed_slots,
    };
    Ok(config)
}
```
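One detail worth noting in the function above is the authority conversion: every current Aura authority is carried over as a Babe authority with an equal weight of 1. With simplified stand-in id types (the real code converts Aura public keys into Babe authority ids), that step looks like:

```rust
// Simplified stand-ins for the authority id types; illustrative only.
type AuraId = [u8; 32];
type BabeId = [u8; 32];
type BabeAuthorityWeight = u64;

/// Every Aura authority becomes a Babe authority with equal weight 1,
/// mirroring the `.map(|a| (BabeAuthorityId::from(...), 1))` step.
pub fn to_babe_authorities(aura: Vec<AuraId>) -> Vec<(BabeId, BabeAuthorityWeight)> {
    aura.into_iter().map(|a| (a, 1)).collect()
}
```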

Summary

With the two changes described in "Solution", a node that is warp syncing will warp sync the entire chain (all Aura and Babe blocks) and import the state completely before it switches to running a Babe service. This resolves the root cause of the issue, which is that the node would restart mid-warp-sync.

It is important this phase is merged prior to phase 3, so node operators have time to upgrade in advance of the runtime upgrade.

Steps to simulate Aura -> Babe migration with finney state

  1. Set up https://github.com/opentensor/baedeker-for-subtensor. Ask Greg if you have questions.
  2. Build the Babe NPoS runtime from Permissioned Babe NPoS Runtime (#1708)
$ git checkout node-decentralization
$ cargo b -r -p node-subtensor && cp ./target/release/wbuild/node-subtensor-runtime/node_subtensor_runtime.compact.compressed.wasm ./babe-npos.wasm
  3. Build the node from this branch
$ git checkout hybrid-node
$ rm ./target/release/node-subtensor && cargo b -r -p node-subtensor
  4. Run Baedeker using the node from this branch
$ cd ../baedeker-for-subtensor
$ ./localnet-baedeker.sh
  5. Upgrade to the Babe NPoS runtime. Ask Liam if you have questions.

QA Checklist

  • --initial-consensus aura gracefully switches to Babe post-upgrade
  • --initial-consensus babe gracefully switches to Aura pre-upgrade
    • devnet-ready state/runtime
    • hybrid-node state/runtime
    • baedeker-finney state/runtime
  • Babe upgrade works when enacted in early blocks
    • devnet-ready state/runtime
    • hybrid-node state/runtime
  • Babe upgrade works when enacted in later blocks (where era would be >1)
    • devnet-ready state/runtime
    • hybrid-node state/runtime
    • baedeker-finney state/runtime
  • Warp sync Aura & Babe chain works when switch happened in early blocks
    • devnet-ready
    • hybrid-node
  • Warp sync Aura & Babe chain works when switch happened in later blocks (where era would be >1)
    • devnet-ready state/runtime
    • hybrid-node state/runtime
    • baedeker-finney state/runtime

@liamaharon liamaharon mentioned this pull request Aug 7, 2025
@liamaharon liamaharon force-pushed the hybrid-node branch 2 times, most recently from 7d8246c to 483729e on August 7, 2025 23:27
@liamaharon liamaharon changed the title from "Hybrid Consensus Node + Full Support for Aura -> Babe Runtime Upgrades" to "Support for Aura -> Babe Runtime Upgrades" Aug 7, 2025
@liamaharon liamaharon marked this pull request as ready for review August 7, 2025 23:43
@liamaharon liamaharon marked this pull request as draft August 7, 2025 23:45
@liamaharon liamaharon changed the title from "Support for Aura -> Babe Runtime Upgrades" to "Node Support for Aura -> Babe Runtime Upgrades" Aug 8, 2025
@liamaharon liamaharon marked this pull request as ready for review August 8, 2025 04:38
@liamaharon liamaharon requested a review from sam0x17 August 8, 2025 04:40
@liamaharon liamaharon force-pushed the hybrid-node branch 3 times, most recently from eaba58d to 5d2d0e2 on August 8, 2025 05:17

@shamil-gadelshin shamil-gadelshin left a comment


Do you mind expanding this?

"Warp sync only works from a fresh sync. This means, if the service switches from Aura -> Babe mid-warp-sync, the warp sync will not continue correctly."

Please consider adding more details for the changes - the high-level description helps, but it lacks the technical reasoning. Maybe even a list of the added algorithms, with descriptions. Also, consider adding a description for major commit updates - for example, this PR was changed significantly compared to the previously approved version, which required full retesting and re-reviewing from the beginning.

New features (example):

  • babe-switch signal that passes a message from the aura consensus to the babe consensus. It shuts down the aura service and starts the babe service, which in turn ....
  • hybrid import queue, it checks only....
  • babe configuration parameters (EPOCH_DURATION) .....

Did you consider rare situations for the consensus switch (end of the epoch, validator failure, etc.)? I wonder whether there are situations where Aura produces at least one block in the presence of the Babe configuration?

```rust
let import_queue = super::aura_wrapped_import_queue::import_queue(
    crate::consensus::aura_wrapped_import_queue::ImportQueueParams {
        // Aura needs the hybrid import queue, because it needs to
        // 1. Validate the first Babe block it encounters before switching into Babe
```
Collaborator


Why only the "first Babe block"? Shouldn't it import all blocks from the target warp sync block (state import block) up to the head before it starts importing the block header history?


@liamaharon liamaharon Aug 28, 2025


Good catch, that's an old comment from before AuraConsensus could import Babe blocks. I will update it.

Cargo.toml Outdated
```toml
pallet-subtensor-swap-rpc = { default-features = false, path = "pallets/swap/rpc" }
node-subtensor-runtime = { path = "runtime", default-features = false }
pallet-admin-utils = { path = "pallets/admin-utils", default-features = false }
pallet-collective-otf = { path = "pallets/collective", default-features = false }
```
Collaborator


We usually use "subtensor" as a middle word for pallets. Also, why does it cause issues? Do you mind describing the conflicts?

Collaborator Author


The conflict is that these crate names are identical with crates in polkadot-sdk.

This means when we patch the Parity polkadot-sdk crates with the OTF polkadot-sdk, there is a risk that we will accidentally overwrite our crates with the polkadot-sdk versions.

By making the names of our crates unique, it is impossible for us to accidentally patch them.

I can change the name to put -subtensor- in the middle instead of -otf prefix if you prefer.

Collaborator


@gztensor What do you suggest?

```rust
    &self,
    block: BlockCheckParams<Block>,
) -> Result<ImportResult, Self::Error> {
    self.inner_aura.check_block(block).await.map_err(Into::into)
```
Collaborator


Why do we check only aura blocks?

Collaborator Author


Aura and Babe check_block implementations are identical. I will add a comment.

```rust
        break;
    }
};
tokio::time::sleep(slot_duration.as_duration()).await;
```
Collaborator


What happens if the actual block duration exceeds the slot duration? And in general, what if the trigger activates in an unsynchronized way after block production?

Collaborator Author


What happens if the actual block duration exceeds the slot duration?

It will wait one more "block_duration" period of time before it checks again.

And in general, if the trigger activates in unsynchronized way after the block production?

It is actually unlikely that all nodes will restart "perfectly" synchronised with one another. In practice with baedeker state, their restart times vary by ~1 second. This is not a problem though; even in the case of some catastrophic disruption where they restart minutes apart, as soon as 2/3 of them are online they will begin finalizing again.
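The polling behaviour being discussed can be sketched as a simple loop. This is illustrative only; the real trigger is async and sleeps one slot duration between checks:

```rust
/// Illustrative sketch of the switch-trigger polling loop: each iteration
/// checks the switch condition, and on a miss sleeps one slot duration
/// before checking again, so a late trigger simply costs one extra period.
pub fn wait_for_switch<F: FnMut() -> bool>(mut should_switch: F, slot: std::time::Duration) -> u32 {
    let mut checks = 0;
    loop {
        checks += 1;
        if should_switch() {
            break;
        }
        std::thread::sleep(slot);
    }
    checks
}
```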


liamaharon commented Aug 28, 2025

Please, consider adding more details for changes - the high level description helps but it lacks the technical reasoning. Maybe even a list of the added algorithms with descriptions. Also, consider adding a description for major commit updates - for example, the same PR was changed significantly compared to the previously approved version that required full retesting and rereviewing from the beginning.

Apologies that my description was not clear enough. I will write up the changes in more detail. I am also happy to jump on a call with you to walk through anything, if you might find that helpful.

Do you mind expanding this?

"Warp sync only works from a fresh sync. This means, if the service switches from Aura -> Babe mid-warp-sync, the warp sync will not continue correctly."

Warp sync does not work if you try to start it on a partially synced DB; it logs an error and returns to full sync. Please let me know if there is something in particular about that you find unclear.

Did you consider rare situations for consensus switch (end of the epoch, validator failure, etc)? I wonder are there situations when aura produces at least one block in presence of the babe configuration?

There are no circumstances where an Aura block could be produced after the Babe runtime upgrade. The runtime after that point is configured only for Babe.

@liamaharon

Hi @shamil-gadelshin, I have addressed your comments and also re-written the "Update Aug 26" section of the PR to be more in-depth and include technical reasoning.

Please let me know if that is more clear, and you have any further questions.

As always, appreciate your time and help reviewing this PR


@shamil-gadelshin shamil-gadelshin left a comment


Hi @shamil-gadelshin, I have addressed your comments and also re-written the "Update Aug 26" section of the PR to be more in-depth and include technical reasoning.

Please let me know if that is more clear, and you have any further questions.

As always, appreciate your time and help reviewing this PR

Thank you! I was finally able to understand the warp sync issues and some of your decisions.

I have a couple of minor issues left (crate names and "triggered" variable). Otherwise, looks good.


```rust
// sync is still in progress prior to switching, the warp sync will not
// complete successfully.
let syncing = sync_service
    .status()
    .await
    .is_ok_and(|status| status.warp_sync.is_some() || status.state_sync.is_some());
if !c.authorities.is_empty() && !syncing {
```
Contributor


Are there any caveats for not switching to Babe immediately?

Collaborator Author


Not anything I can think of while syncing.


shamil-gadelshin commented Sep 3, 2025

The conflict is that these crate names are identical with crates in polkadot-sdk.
This means when we patch the Parity polkadot-sdk crates with the OTF polkadot-sdk, there is a risk that we will accidentally overwrite our crates with the polkadot-sdk versions.
By making the names of our crates unique, it is impossible for us to accidentally patch them.
I can change the name to put -subtensor- in the middle instead of -otf prefix if you prefer.

After the conversation with @gztensor, we confirmed that we use "-subtensor-" for such cases. @liamaharon, please rename the crates when you have time.

@liamaharon

Thanks @gregzaitsev and @shamil-gadelshin, I've replaced the -otf with -subtensor- as suggested.
