Conversation

@liamaharon liamaharon commented Aug 7, 2025

Phase 2 in #1887

Phase 2 additions:

  • Applied the OTF-patched Polkadot SDK
  • Added code to automatically sync the Aura keystore with the Babe keystore
  • Added an -otf suffix to subtensor crates that share a name with a polkadot-sdk crate, to prevent name collisions when applying patches and to ensure we don't accidentally use the wrong crate
  • Fixed this vulnerability raised by @shamil-gadelshin
  • Fixed an edge case where the node would not automatically switch from the Babe service to the Aura service

Update Aug 26

I discovered an issue with warp syncing a chain that contains an Aura to Babe migration: the warp sync would proceed until the first Babe block was encountered, then revert to a regular sync.

Root Cause

Polkadot SDK does not allow starting a service with warp sync if the database is in a partially synced state: https://github.com/opentensor/polkadot-sdk/blob/d13f915d8a1f55af53fd51fdb4544c47badddc7e/substrate/client/network/sync/src/strategy/warp.rs#L235-L252

In the initial implementation of this PR, a warp-syncing node would restart part-way through the sync when it detected the first Babe block.

At the time of the service restart, the node has a partially synced database, which causes the previously referenced code to terminate the warp sync and fall back to a regular sync.

Solution

To resolve this issue, we must warp sync the entire chain (Aura AND Babe blocks) without restarting the service.

This is achieved in two parts:

First, I prevented the Aura service from switching to a Babe service while the node is syncing. This was achieved by adding new code here:

```rust
let syncing = sync_service
    .status()
    .await
    .is_ok_and(|status| status.warp_sync.is_some() || status.state_sync.is_some());
```
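To illustrate the gating decision, here is a minimal, self-contained sketch with stand-in types. `SyncStatus` below is a simplified stand-in for the status struct returned by the real sync service, not the PR's actual type:

```rust
/// Simplified stand-in for the sync service's status; the field names mirror
/// the real struct, but this is an illustrative sketch only.
pub struct SyncStatus {
    pub warp_sync: Option<()>,  // Some(_) while a warp sync is in progress
    pub state_sync: Option<()>, // Some(_) while a state sync is in progress
}

/// The Aura service may only switch to the Babe service once neither a warp
/// sync nor a state sync is in progress; restarting mid-warp-sync would leave
/// a partially synced database and force a fall back to regular sync.
pub fn may_switch_to_babe(status: Result<SyncStatus, ()>, has_babe_authorities: bool) -> bool {
    let syncing = status
        .map(|s| s.warp_sync.is_some() || s.state_sync.is_some())
        .unwrap_or(false); // mirrors `is_ok_and`: an errored status counts as not syncing
    has_babe_authorities && !syncing
}
```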

Second, I added support for the node to import Babe blocks while running an Aura service. This was achieved by replacing the AuraWrappedImportQueue with a HybridImportQueue that contains:

  • A HybridBlockImport that contains inner full implementations of AuraBlockImport and BabeBlockImport
  • A HybridVerifier that contains inner full implementations of the AuraVerifier and BabeVerifier
  • An import_queue function that builds an ImportQueue implementation capable of completely importing both Aura and Babe blocks.
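As a rough sketch of how such a hybrid component can route blocks to the right inner implementation, the following dispatches on the block's consensus engine id. The enum and functions are illustrative stand-ins, not the PR's actual types; `"aura"` and `"BABE"` are the engine ids Substrate uses in pre-runtime digests:

```rust
// Illustrative stand-ins: the real HybridVerifier wraps the full Substrate
// AuraVerifier and BabeVerifier and inspects the block's consensus digest.
#[derive(Debug, PartialEq)]
pub enum Engine {
    Aura,
    Babe,
}

/// A block header carries a pre-runtime digest tagged with a 4-byte
/// consensus engine id; "aura" and "BABE" are the real Substrate ids.
fn engine_for(pre_digest_engine_id: &[u8; 4]) -> Option<Engine> {
    match pre_digest_engine_id {
        b"aura" => Some(Engine::Aura),
        b"BABE" => Some(Engine::Babe),
        _ => None,
    }
}

/// The hybrid verifier routes each block to the matching inner verifier,
/// so Babe blocks can be imported while the Aura service is still running.
pub fn verify(pre_digest_engine_id: &[u8; 4]) -> Result<Engine, String> {
    engine_for(pre_digest_engine_id).ok_or_else(|| "unknown consensus engine".to_string())
}
```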

The Aura service is required to construct a BabeConfiguration to pass to the hybrid import_queue, so it can import the first Babe block it encounters. This required me to pull in some Babe runtime configuration from #1708 into this PR, specifically the BABE_GENESIS_EPOCH_CONFIG and EPOCH_DURATION_IN_BLOCKS.

With these runtime constants, we are able to construct what our initial Babe configuration will be while running an Aura service:

```rust
/// Returns what the Babe configuration is expected to be at the first Babe block.
///
/// This is required for the hybrid import queue, so it is ready to validate the first encountered
/// babe block(s) before switching to Babe consensus.
fn get_expected_babe_configuration<B: BlockT, C>(
    client: &C,
) -> sp_blockchain::Result<BabeConfiguration>
where
    C: AuxStore + ProvideRuntimeApi<B> + UsageProvider<B>,
    C::Api: AuraApi<B, AuraAuthorityId>,
{
    let at_hash = if client.usage_info().chain.finalized_state.is_some() {
        client.usage_info().chain.best_hash
    } else {
        client.usage_info().chain.genesis_hash
    };
    let runtime_api = client.runtime_api();
    let authorities = runtime_api
        .authorities(at_hash)?
        .into_iter()
        .map(|a| (BabeAuthorityId::from(a.into_inner()), 1))
        .collect();
    let slot_duration = runtime_api.slot_duration(at_hash)?.as_millis();
    let epoch_config = node_subtensor_runtime::BABE_GENESIS_EPOCH_CONFIG;
    let config = sp_consensus_babe::BabeConfiguration {
        slot_duration,
        epoch_length: node_subtensor_runtime::EPOCH_DURATION_IN_SLOTS,
        c: epoch_config.c,
        authorities,
        randomness: Default::default(),
        allowed_slots: epoch_config.allowed_slots,
    };
    Ok(config)
}
```
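One detail worth noting in the function above is the authority conversion: every current Aura authority is carried over as a Babe authority with an equal weight of 1. With simplified stand-in id types (the real code converts Aura public keys into Babe authority ids), that step looks like:

```rust
// Simplified stand-ins for the authority id types; illustrative only.
type AuraId = [u8; 32];
type BabeId = [u8; 32];
type BabeAuthorityWeight = u64;

/// Every Aura authority becomes a Babe authority with equal weight 1,
/// mirroring the `.map(|a| (BabeAuthorityId::from(...), 1))` step.
pub fn to_babe_authorities(aura: Vec<AuraId>) -> Vec<(BabeId, BabeAuthorityWeight)> {
    aura.into_iter().map(|a| (a, 1)).collect()
}
```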

Summary

With the two changes described in "Solution", a node that is warp syncing will warp sync the entire chain (all Aura and Babe blocks) and import the state completely before it switches to running a Babe service. This resolves the root cause of the issue, which is that the node would restart mid-warp-sync.

It is important this phase is merged prior to phase 3, so node operators have time to upgrade in advance of the runtime upgrade.

Steps to simulate Aura -> Babe migration with finney state

  1. Set up https://github.com/opentensor/baedeker-for-subtensor. Ask Greg if you have questions.
  2. Build the Babe NPoS runtime from Permissioned Babe NPoS Runtime (#1708)
$ git checkout node-decentralization
$ cargo b -r -p node-subtensor && cp ./target/release/wbuild/node-subtensor-runtime/node_subtensor_runtime.compact.compressed.wasm ./babe-npos.wasm
  3. Build the node from this branch
$ git checkout hybrid-node
$ rm ./target/release/node-subtensor && cargo b -r -p node-subtensor
  4. Run Baedeker using the node from this branch
$ cd ../baedeker-for-subtensor
$ ./localnet-baedeker.sh
  5. Upgrade to the Babe NPoS runtime. Ask Liam if you have questions.

QA Checklist

  • --initial-consensus aura gracefully switches to Babe post-upgrade
  • --initial-consensus babe gracefully switches to Aura pre-upgrade
    • devnet-ready state/runtime
    • hybrid-node state/runtime
    • baedeker-finney state/runtime
  • Babe upgrade works when enacted in early blocks
    • devnet-ready state/runtime
    • hybrid-node state/runtime
  • Babe upgrade works when enacted in later blocks (where era would be >1)
    • devnet-ready state/runtime
    • hybrid-node state/runtime
    • baedeker-finney state/runtime
  • Warp sync Aura & Babe chain works when switch happened in early blocks
    • devnet-ready
    • hybrid-node
  • Warp sync Aura & Babe chain works when switch happened in later blocks (where era would be >1)
    • devnet-ready state/runtime
    • hybrid-node state/runtime
    • baedeker-finney state/runtime

@liamaharon liamaharon mentioned this pull request Aug 7, 2025
@liamaharon liamaharon force-pushed the hybrid-node branch 2 times, most recently from 7d8246c to 483729e on August 7, 2025 23:27
@liamaharon liamaharon changed the title from "Hybrid Consensus Node + Full Support for Aura -> Babe Runtime Upgrades" to "Support for Aura -> Babe Runtime Upgrades" Aug 7, 2025
@liamaharon liamaharon marked this pull request as ready for review August 7, 2025 23:43
@liamaharon liamaharon marked this pull request as draft August 7, 2025 23:45
@liamaharon liamaharon changed the title from "Support for Aura -> Babe Runtime Upgrades" to "Node Support for Aura -> Babe Runtime Upgrades" Aug 8, 2025
@liamaharon liamaharon marked this pull request as ready for review August 8, 2025 04:38
@liamaharon liamaharon requested a review from sam0x17 August 8, 2025 04:40
@liamaharon liamaharon force-pushed the hybrid-node branch 3 times, most recently from eaba58d to 5d2d0e2 on August 8, 2025 05:17

@shamil-gadelshin shamil-gadelshin left a comment


Do you mind expanding this?

"Warp sync only works from a fresh sync. This means, if the service switches from Aura -> Babe mid-warp-sync, the warp sync will not continue correctly."

Please consider adding more details for the changes - the high-level description helps, but it lacks the technical reasoning. Maybe even a list of the added algorithms, with descriptions. Also, consider adding a description for major commit updates - for example, this PR was changed significantly compared to the previously approved version, which required full retesting and re-reviewing from the beginning.

New features (example):

  • babe-switch signal that passes a message from the aura consensus to the babe consensus. It shuts down the aura service and starts the babe service, which in turn ....
  • hybrid import queue, it checks only....
  • babe configuration parameters (EPOCH_DURATION) .....

Did you consider rare situations for the consensus switch (end of the epoch, validator failure, etc.)? I wonder whether there are situations where Aura produces at least one block in the presence of the Babe configuration?

```rust
let import_queue = super::aura_wrapped_import_queue::import_queue(
    crate::consensus::aura_wrapped_import_queue::ImportQueueParams {
        // Aura needs the hybrid import queue, because it needs to
        // 1. Validate the first Babe block it encounters before switching into Babe
```
Collaborator


Why only the "first Babe block"? Shouldn't it import all blocks from the target warp sync block (state import block) up to the head before it starts importing the block header history?


@liamaharon liamaharon Aug 28, 2025


Good catch, that's an old comment from before AuraConsensus could import Babe blocks. I will update it.

Cargo.toml Outdated
```toml
pallet-subtensor-swap-rpc = { default-features = false, path = "pallets/swap/rpc" }
node-subtensor-runtime = { path = "runtime", default-features = false }
pallet-admin-utils = { path = "pallets/admin-utils", default-features = false }
pallet-collective-otf = { path = "pallets/collective", default-features = false }
```
Collaborator


We usually use "subtensor" as a middle word for pallets. Also, why does it cause issues? Do you mind describing the conflicts?

Collaborator Author


The conflict is that these crate names are identical with crates in polkadot-sdk.

This means when we patch the Parity polkadot-sdk crates with the OTF polkadot-sdk, there is a risk that we will accidentally overwrite our crates with the polkadot-sdk versions.

By making the names of our crates unique, it is impossible for us to accidentally patch them.

I can change the name to put -subtensor- in the middle instead of -otf prefix if you prefer.

Collaborator


@gztensor What do you suggest?

```rust
    &self,
    block: BlockCheckParams<Block>,
) -> Result<ImportResult, Self::Error> {
    self.inner_aura.check_block(block).await.map_err(Into::into)
```
Collaborator


Why do we check only aura blocks?

Collaborator Author


Aura and Babe check_block implementations are identical. I will add a comment.

```rust
        break;
    }
};
tokio::time::sleep(slot_duration.as_duration()).await;
```
Collaborator


What happens if the actual block duration exceeds the slot duration? And in general, what if the trigger activates in an unsynchronized way after block production?

Collaborator Author


What happens if the actual block duration exceeds the slot duration?

It will wait one more "block_duration" period of time before it checks again.

And in general, if the trigger activates in unsynchronized way after the block production?

It is actually unlikely that all nodes will restart "perfectly" synchronised with one another. In practice with baedeker state, their restart times vary by ~1 second. This is not a problem though; even in the case of some catastrophic disruption where they restart minutes apart, as soon as 2/3 of them are online they will begin finalizing again.
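The polling behaviour being discussed can be sketched as a simple loop. This is illustrative only; the real trigger is async and sleeps one slot duration between checks:

```rust
/// Illustrative sketch of the switch-trigger polling loop: each iteration
/// checks the switch condition, and on a miss sleeps one slot duration
/// before checking again, so a late trigger simply costs one extra period.
pub fn wait_for_switch<F: FnMut() -> bool>(mut should_switch: F, slot: std::time::Duration) -> u32 {
    let mut checks = 0;
    loop {
        checks += 1;
        if should_switch() {
            break;
        }
        std::thread::sleep(slot);
    }
    checks
}
```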


liamaharon commented Aug 28, 2025

Please, consider adding more details for changes - the high level description helps but it lacks the technical reasoning. Maybe even a list of the added algorithms with descriptions. Also, consider adding a description for major commit updates - for example, the same PR was changed significantly compared to the previously approved version that required full retesting and rereviewing from the beginning.

Apologies that my description was not clear enough. I will write up the changes in more detail. I am also happy to jump on a call with you to walk through anything, if you might find that helpful.

Do you mind expanding this?

"Warp sync only works from a fresh sync. This means, if the service switches from Aura -> Babe mid-warp-sync, the warp sync will not continue correctly."

Warp sync does not work if you try to start it on a partially synced DB; it logs an error and returns to full sync. Please let me know if there is something in particular about that you find unclear.

Did you consider rare situations for consensus switch (end of the epoch, validator failure, etc)? I wonder are there situations when aura produces at least one block in presence of the babe configuration?

There are no circumstances where an Aura block could be produced after the Babe runtime upgrade. The runtime after that point is configured only for Babe.

@liamaharon

Hi @shamil-gadelshin, I have addressed your comments and also re-written the "Update Aug 26" section of the PR to be more in-depth and include technical reasoning.

Please let me know if that is more clear, and you have any further questions.

As always, appreciate your time and help reviewing this PR


@shamil-gadelshin shamil-gadelshin left a comment


Hi @shamil-gadelshin, I have addressed your comments and also re-written the "Update Aug 26" section of the PR to be more in-depth and include technical reasoning.

Please let me know if that is more clear, and you have any further questions.

As always, appreciate your time and help reviewing this PR

Thank you! I was finally able to understand the warp sync issues and some of your decisions.

I have a couple of minor issues left (crate names and "triggered" variable). Otherwise, looks good.


```rust
// sync is still in progress prior to switching, the warp sync will not
// complete successfully.
let syncing = sync_service
    .status()
    .await
    .is_ok_and(|status| status.warp_sync.is_some() || status.state_sync.is_some());
if !c.authorities.is_empty() && !syncing {
```
Contributor


Are there any caveats for not switching to Babe immediately?

Collaborator Author


Not anything I can think of while syncing.


shamil-gadelshin commented Sep 3, 2025

The conflict is that these crate names are identical with crates in polkadot-sdk.
This means when we patch the Parity polkadot-sdk crates with the OTF polkadot-sdk, there is a risk that we will accidentally overwrite our crates with the polkadot-sdk versions.
By making the names of our crates unique, it is impossible for us to accidentally patch them.
I can change the name to put -subtensor- in the middle instead of -otf prefix if you prefer.

After the conversation with @gztensor, we confirmed that we use "-subtensor-" for such cases. @liamaharon, please rename the crates when you have time.

@liamaharon

Thanks @gregzaitsev and @shamil-gadelshin, I've replaced the -otf with -subtensor- as suggested.
