Description
I had a disk fill up, which caused neptune-core to crash/exit due to a write error.
After freeing up space, the node starts again OK and reports correct balance. But once the node tries to sync from a peer I consistently get this error:
2025-12-10T02:10:23.053244099Z INFO ThreadId(01) neptune_cash::application::loops::main_loop: Creating new sync request
thread 'main' (952697) panicked at /home/user/neptune-core/neptune-core/src/application/database/storage/storage_schema/dbtvec_private.rs:50:9:
Out-of-bounds. Got 37455 but length was 37455. persisted vector name: archival_block_mmr
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
So:
- Clearly the database(s) are in an inconsistent state. This indicates that the disk writes are either not atomic or performed in the wrong order. Raw-truth data should always be written first, and only then the summary data that references it, not vice versa.
- In particular, "Got 37455 but length was 37455" indicates that some code believes there is a DbtVec entry at index 37455, but we only have 37455 actual entries (valid indices 0 through 37454), so it is off by one. If the length is correct, then where is the invalid index 37455 coming from, and why was it written before the write that appends to the DbtVec? My guess is that it is coming (somehow) from a different DB. Let's check...
- In MainLoopHandler::block_sync() we call ArchivalBlockMmr::get_leaf_async(), passing it a height which comes from Self::batch_request_uca_candidate_heights(own_tip_height). That fn only subtracts from the tip height, so we must look at where the tip height came from, and we find it is global_state.chain.light_state().kernel.header.height, which ultimately comes from ArchivalState::block_index_db.
- Thus it appears that a record for height 37455 was written to block_index_db before the corresponding record was written to archival_block_mmr. But when reading we also go to block_index_db first, which means we expose the inconsistent state. If we instead wrote to archival_block_mmr first, the read pattern would have been fine in this case, since it is no problem if the height from block_index_db is lower than the vector length in archival_block_mmr. (A minimal sketch of this ordering argument follows the list.)
- In general, this exposes how brittle our many-DB-files solution is. In my view, it would be preferable to use a DB engine such as redb that cleanly supports atomic writes across multiple logical tables (see the redb sketch below). That would likely be an easier change to make than fixing all possible ordering problems, and it is good practice besides.
- This issue is a direct "in practice" result of the analysis in Atomicity of database updates across block updates #79, which remains open. It is worth reading Pieter Wuille's answer to my question about Bitcoin's approach to atomicity. tl;dr: they use fewer DBs and are careful about ordering, i.e. "In this list, the later entries contain references to the previous ones".
https://bitcoin.stackexchange.com/a/121561/49396
- How can a user recover from this? I tried "--scan-blocks 19000.." but that didn't help. I have not yet tried rebuilding all databases from scratch, but perhaps that is what is necessary.
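Here is the ordering argument from the bullets above as a minimal, self-contained sketch. The types and helpers (ArchivalBlockMmr, BlockIndexDb, append, get_leaf, store_summary_first, store_raw_first) are stand-ins I made up for illustration, not the real neptune-core storage APIs; the point is only that a crash in the window between the two writes is harmless in one order and reproduces the observed out-of-bounds in the other.

    // Illustrative stand-ins only; these do not mirror the real storage layer.
    struct ArchivalBlockMmr {
        leafs: Vec<[u8; 32]>,
    }

    struct BlockIndexDb {
        tip_height: u64,
    }

    impl ArchivalBlockMmr {
        fn append(&mut self, leaf: [u8; 32]) {
            self.leafs.push(leaf);
        }
        fn get_leaf(&self, height: u64) -> Option<&[u8; 32]> {
            self.leafs.get(height as usize)
        }
    }

    // Crash-unsafe order: the summary (tip height) is written before the raw
    // data (the MMR leaf). A crash between the two writes leaves a height that
    // no leaf backs up.
    fn store_summary_first(idx: &mut BlockIndexDb, mmr: &mut ArchivalBlockMmr, h: u64, leaf: [u8; 32]) {
        idx.tip_height = h; // <- a crash after this line produces the observed state
        mmr.append(leaf);
    }

    // Crash-tolerant order for this read path: raw data first, summary second.
    // A crash between the writes leaves only an extra, unreferenced leaf.
    fn store_raw_first(idx: &mut BlockIndexDb, mmr: &mut ArchivalBlockMmr, h: u64, leaf: [u8; 32]) {
        mmr.append(leaf);
        idx.tip_height = h;
    }

    fn main() {
        // Start just below the failing height and simulate the crash-unsafe
        // order dying between its two writes.
        let mut idx = BlockIndexDb { tip_height: 37454 };
        let mut mmr = ArchivalBlockMmr { leafs: vec![[0u8; 32]; 37455] };

        idx.tip_height = 37455; // summary written...
        // ...process dies before the matching mmr.append(...) runs.

        // On restart the sync path reads tip height 37455 and asks the MMR for
        // that leaf. With 37455 leafs the valid indices are 0..=37454, so this
        // is exactly the off-by-one out-of-bounds reported in the panic above.
        assert!(mmr.get_leaf(idx.tip_height).is_none());

        // With the raw-data-first order, the worst case after a crash is an
        // extra leaf, which this read path never asks for.
        let _ = (store_summary_first, store_raw_first);
    }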
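And a minimal sketch of what the redb suggestion buys us: one write transaction can open several logical tables and the commit is atomic across all of them. The table names and key/value types below are placeholders, not the actual neptune-core schema.

    use redb::{Database, Error, TableDefinition};

    // Placeholder tables standing in for block_index_db and archival_block_mmr;
    // the real schema would differ.
    const BLOCK_INDEX: TableDefinition<u64, &[u8]> = TableDefinition::new("block_index");
    const BLOCK_MMR: TableDefinition<u64, &[u8]> = TableDefinition::new("archival_block_mmr");

    fn main() -> Result<(), Error> {
        let db = Database::create("example.redb")?;

        let write_txn = db.begin_write()?;
        {
            let mut mmr = write_txn.open_table(BLOCK_MMR)?;
            let mut idx = write_txn.open_table(BLOCK_INDEX)?;

            // Both writes belong to the same transaction ...
            mmr.insert(37455u64, b"mmr leaf bytes".as_slice())?;
            idx.insert(37455u64, b"block index record".as_slice())?;
        } // tables drop here so the transaction can be committed

        // ... so they become durable together or not at all; a crash before
        // this point leaves neither table updated.
        write_txn.commit()?;
        Ok(())
    }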
Full stack trace follows.
2025-12-10T02:35:48.621408432Z DEBUG ThreadId(63) neptune_cash::protocol::consensus::block::validity::block_program: ** Call to triton_vm::verify to verify block proof completed; verdict: true.
2025-12-10T02:35:48.636932858Z DEBUG ThreadId(01) neptune_cash::application::loops::main_loop: Running sync
2025-12-10T02:35:48.636961693Z INFO ThreadId(01) neptune_cash::application::loops::main_loop: Creating new sync request
thread 'main' (962750) panicked at /home/danda/neptune-core/neptune-core/src/application/database/storage/storage_schema/dbtvec_private.rs:50:9:
Out-of-bounds. Got 37455 but length was 37455. persisted vector name: archival_block_mmr
stack backtrace:
0: __rustc::rust_begin_unwind
1: core::panicking::panic_fmt
2: <neptune_cash::application::database::storage::storage_schema::dbtvec::DbtVec<V> as neptune_cash::application::database::storage::storage_vec::traits::StorageVecBase<V>>::get::{{closure}}
3: neptune_cash::util_types::archival_mmr::ArchivalMmr<Storage>::get_leaf_async::{{closure}}
4: neptune_cash::application::loops::main_loop::MainLoopHandler::run::{{closure}}
5: <core::pin::Pin<P> as core::future::future::Future>::poll
6: <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
7: tokio::runtime::context::runtime::enter_runtime
8: tokio::runtime::runtime::Runtime::block_on
9: neptune_core::main
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
2025-12-10T02:35:48.642519225Z DEBUG ThreadId(01) neptune_cash: executed run()::select::block_sync_interval in 0.005594934 secs. exceeds slow fn threshold of 0.001 secs. location: /home/danda/neptune-core/neptune-core/src/application/loops/main_loop.rs:1858:21
2025-12-10T02:35:48.642602375Z WARN ThreadId(62) neptune_cash::application::loops::peer_loop: Main loop broadcast channel closed. Exiting peer loop.
2025-12-10T02:35:48.642638824Z DEBUG ThreadId(03) neptune_cash::application::job_queue::job_handle: JobHandle dropping for job: 4ed3797f6c8311c613102fb4
2025-12-10T02:35:48.642647494Z DEBUG ThreadId(62) neptune_cash::application::loops::peer_loop: Exited peer loop for 139.162.193.206:9798
2025-12-10T02:35:48.64267519Z DEBUG ThreadId(62) neptune_cash::application::loops::connect_to_peers: Fetched peer info standing 0 for peer 139.162.193.206:9798
2025-12-10T02:35:48.708337487Z DEBUG ThreadId(63) neptune_cash::protocol::consensus::block::validity::block_program: ** Calling triton_vm::verify to verify block proof ...
thread 'tokio-runtime-worker' (962815) panicked at /home/danda/neptune-core/neptune-core/src/protocol/proof_abstractions/verifier.rs:63:14:
should be able to verify proof in new tokio task: JoinError::Cancelled(Id(37931))
stack backtrace:
0: __rustc::rust_begin_unwind
1: core::panicking::panic_fmt
2: core::result::unwrap_failed
3: neptune_cash::protocol::proof_abstractions::verifier::verify::{{closure}}
4: neptune_cash::protocol::consensus::block::Block::is_valid::{{closure}}
5: neptune_cash::protocol::peer::SyncChallengeResponse::is_valid::{{closure}}
6: neptune_cash::application::loops::peer_loop::PeerLoopHandler::handle_peer_message::{{closure}}
7: neptune_cash::application::loops::peer_loop::PeerLoopHandler::run::{{closure}}
8: neptune_cash::application::loops::peer_loop::PeerLoopHandler::run_wrapper::{{closure}}
9: neptune_cash::application::loops::connect_to_peers::call_peer::{{closure}}::{{closure}}
10: neptune_cash::application::loops::connect_to_peers::call_peer::{{closure}}
11: <core::pin::Pin<P> as core::future::future::Future>::poll
12: <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
13: tokio::runtime::task::core::Core<T,S>::poll
14: tokio::runtime::task::harness::Harness<T,S>::poll
15: tokio::runtime::scheduler::multi_thread::worker::Context::run_task
16: tokio::runtime::scheduler::multi_thread::worker::Context::run
17: tokio::runtime::context::scoped::Scoped<T>::set
18: tokio::runtime::context::runtime::enter_runtime
19: tokio::runtime::scheduler::multi_thread::worker::run
20: <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
21: tokio::runtime::task::core::Core<T,S>::poll
22: tokio::runtime::task::harness::Harness<T,S>::poll
23: tokio::runtime::blocking::pool::Inner::run
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
2025-12-10T02:35:48.709210641Z DEBUG ThreadId(63) neptune_cash: executed handle_peer_message()::PeerMessage::SyncChallengeResponse in 0.847406868 secs. exceeds slow fn threshold of 0.001 secs. location: /home/danda/neptune-core/neptune-core/src/application/loops/peer_loop.rs:773:17
2025-12-10T02:35:48.709434918Z ERROR ThreadId(63) neptune_cash::application::loops::connect_to_peers: Peer task (outgoing) for 162.219.87.217:9798 panicked. Invoking close connection callback
2025-12-10T02:35:48.709453596Z DEBUG ThreadId(63) neptune_cash::application::loops::connect_to_peers: Fetched peer info standing 0 for peer 162.219.87.217:9798
thread 'tokio-runtime-worker' (962815) panicked at /home/danda/neptune-core/neptune-core/src/application/database/neptune_leveldb.rs:260:60:
called `Result::unwrap()` on an `Err` value: JoinError::Cancelled(Id(37932))
stack backtrace:
0: __rustc::rust_begin_unwind
1: core::panicking::panic_fmt
2: core::result::unwrap_failed
3: neptune_cash::application::database::neptune_leveldb::NeptuneLevelDb<Key,Value>::get::{{closure}}
4: neptune_cash::application::loops::connect_to_peers::close_peer_connected_callback::{{closure}}
5: neptune_cash::application::loops::connect_to_peers::call_peer::{{closure}}
6: <core::pin::Pin<P> as core::future::future::Future>::poll
7: <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
8: tokio::runtime::task::core::Core<T,S>::poll
9: tokio::runtime::task::harness::Harness<T,S>::poll
10: tokio::runtime::scheduler::multi_thread::worker::Context::run_task
11: tokio::runtime::scheduler::multi_thread::worker::Context::run
12: tokio::runtime::context::scoped::Scoped<T>::set
13: tokio::runtime::context::runtime::enter_runtime
14: tokio::runtime::scheduler::multi_thread::worker::run
15: <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
16: tokio::runtime::task::core::Core<T,S>::poll
17: tokio::runtime::task::harness::Harness<T,S>::poll
18: tokio::runtime::blocking::pool::Inner::run
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
2025-12-10T02:35:48.710066993Z DEBUG ThreadId(63) neptune_cash::state: spawning flush db thread
2025-12-10T02:35:48.713175621Z INFO ThreadId(171) neptune_cash::state: GlobalState is dropping. flushing database
2025-12-10T02:35:48.735421404Z DEBUG ThreadId(171) neptune_cash::state: Flushed all databases