Description
I had a disk fill up, which caused neptune-core to crash/exit due to a write error.
After freeing up space, the node starts again OK and reports correct balance. But once the node tries to sync from a peer I consistently get this error:
2025-12-10T02:10:23.053244099Z INFO ThreadId(01) neptune_cash::application::loops::main_loop: Creating new sync request
thread 'main' (952697) panicked at /home/user/neptune-core/neptune-core/src/application/database/storage/storage_schema/dbtvec_private.rs:50:9:
Out-of-bounds. Got 37455 but length was 37455. persisted vector name: archival_block_mmr
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
So:
- Clearly the database(s) are in an inconsistent state. This indicates that the disk writes are either not atomic or performed in the wrong order. Raw-truth data should always be written first, and only then the summary data that references it, not vice versa.
- In particular, "Got 37455 but length was 37455" indicates that some code believes there is a DbtVec entry at index 37455, but we only have 37455 actual entries (valid indices 0 through 37454), so it is off by one. If the length is correct, then where is the invalid index 37455 coming from, and why was it written before the write that appends to the DbtVec? My guess is that it is coming (somehow) from a different DB. Let's check...
- In MainLoopHandler::block_sync() we call ArchivalBlockMmr::get_leaf_async(), passing it a height which comes from Self::batch_request_uca_candidate_heights(own_tip_height). That fn only subtracts from the tip height, so we must look at where the tip height came from, and we find it is global_state.chain.light_state().kernel.header.height, which ultimately comes from ArchivalState::block_index_db.
- Thus it appears that a record for height 37455 was written to block_index_db before the corresponding record was written to archival_block_mmr. But when reading we also go to block_index_db first, which means we expose the inconsistent state. If we instead wrote to archival_block_mmr first, the read pattern would have been fine in this case, since it is no problem if the height from block_index_db is lower than the vector length in archival_block_mmr. (A minimal sketch of this ordering argument follows the list.)
- In general, this exposes how brittle our many-DB-files solution is. In my view, it would be preferable to use a DB engine such as redb that cleanly supports atomic writes across multiple logical tables (see the redb sketch below). That would likely be an easier change to make than fixing all possible ordering problems, and it is good practice besides.
- This issue is a direct "in practice" result of the analysis in Atomicity of database updates across block updates #79, which remains open. It is worth reading Pieter Wuille's answer to my question about Bitcoin's approach to atomicity. tl;dr: they use fewer DBs and are careful about ordering, i.e. "In this list, the later entries contain references to the previous ones".
https://bitcoin.stackexchange.com/a/121561/49396
- How can a user recover from this? I tried "--scan-blocks 19000.." but that didn't help. I have not yet tried rebuilding all databases from scratch, but perhaps that is what is necessary.
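Here is the ordering argument from the bullets above as a minimal, self-contained sketch. The types and helpers (ArchivalBlockMmr, BlockIndexDb, append, get_leaf, store_summary_first, store_raw_first) are stand-ins I made up for illustration, not the real neptune-core storage APIs; the point is only that a crash in the window between the two writes is harmless in one order and reproduces the observed out-of-bounds in the other.

    // Illustrative stand-ins only; these do not mirror the real storage layer.
    struct ArchivalBlockMmr {
        leafs: Vec<[u8; 32]>,
    }

    struct BlockIndexDb {
        tip_height: u64,
    }

    impl ArchivalBlockMmr {
        fn append(&mut self, leaf: [u8; 32]) {
            self.leafs.push(leaf);
        }
        fn get_leaf(&self, height: u64) -> Option<&[u8; 32]> {
            self.leafs.get(height as usize)
        }
    }

    // Crash-unsafe order: the summary (tip height) is written before the raw
    // data (the MMR leaf). A crash between the two writes leaves a height that
    // no leaf backs up.
    fn store_summary_first(idx: &mut BlockIndexDb, mmr: &mut ArchivalBlockMmr, h: u64, leaf: [u8; 32]) {
        idx.tip_height = h; // <- a crash after this line produces the observed state
        mmr.append(leaf);
    }

    // Crash-tolerant order for this read path: raw data first, summary second.
    // A crash between the writes leaves only an extra, unreferenced leaf.
    fn store_raw_first(idx: &mut BlockIndexDb, mmr: &mut ArchivalBlockMmr, h: u64, leaf: [u8; 32]) {
        mmr.append(leaf);
        idx.tip_height = h;
    }

    fn main() {
        // Start just below the failing height and simulate the crash-unsafe
        // order dying between its two writes.
        let mut idx = BlockIndexDb { tip_height: 37454 };
        let mut mmr = ArchivalBlockMmr { leafs: vec![[0u8; 32]; 37455] };

        idx.tip_height = 37455; // summary written...
        // ...process dies before the matching mmr.append(...) runs.

        // On restart the sync path reads tip height 37455 and asks the MMR for
        // that leaf. With 37455 leafs the valid indices are 0..=37454, so this
        // is exactly the off-by-one out-of-bounds reported in the panic above.
        assert!(mmr.get_leaf(idx.tip_height).is_none());

        // With the raw-data-first order, the worst case after a crash is an
        // extra leaf, which this read path never asks for.
        let _ = (store_summary_first, store_raw_first);
    }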
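And a minimal sketch of what the redb suggestion buys us: one write transaction can open several logical tables and the commit is atomic across all of them. The table names and key/value types below are placeholders, not the actual neptune-core schema.

    use redb::{Database, Error, TableDefinition};

    // Placeholder tables standing in for block_index_db and archival_block_mmr;
    // the real schema would differ.
    const BLOCK_INDEX: TableDefinition<u64, &[u8]> = TableDefinition::new("block_index");
    const BLOCK_MMR: TableDefinition<u64, &[u8]> = TableDefinition::new("archival_block_mmr");

    fn main() -> Result<(), Error> {
        let db = Database::create("example.redb")?;

        let write_txn = db.begin_write()?;
        {
            let mut mmr = write_txn.open_table(BLOCK_MMR)?;
            let mut idx = write_txn.open_table(BLOCK_INDEX)?;

            // Both writes belong to the same transaction ...
            mmr.insert(37455u64, b"mmr leaf bytes".as_slice())?;
            idx.insert(37455u64, b"block index record".as_slice())?;
        } // tables drop here so the transaction can be committed

        // ... so they become durable together or not at all; a crash before
        // this point leaves neither table updated.
        write_txn.commit()?;
        Ok(())
    }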
Full stack trace follows.
2025-12-10T02:35:48.621408432Z DEBUG ThreadId(63) neptune_cash::protocol::consensus::block::validity::block_program: ** Call to triton_vm::verify to verify block proof completed; verdict: true.
2025-12-10T02:35:48.636932858Z DEBUG ThreadId(01) neptune_cash::application::loops::main_loop: Running sync
2025-12-10T02:35:48.636961693Z INFO ThreadId(01) neptune_cash::application::loops::main_loop: Creating new sync request
thread 'main' (962750) panicked at /home/danda/neptune-core/neptune-core/src/application/database/storage/storage_schema/dbtvec_private.rs:50:9:
Out-of-bounds. Got 37455 but length was 37455. persisted vector name: archival_block_mmr
stack backtrace:
0: __rustc::rust_begin_unwind
1: core::panicking::panic_fmt
2: <neptune_cash::application::database::storage::storage_schema::dbtvec::DbtVec<V> as neptune_cash::application::database::storage::storage_vec::traits::StorageVecBase<V>>::get::{{closure}}
3: neptune_cash::util_types::archival_mmr::ArchivalMmr<Storage>::get_leaf_async::{{closure}}
4: neptune_cash::application::loops::main_loop::MainLoopHandler::run::{{closure}}
5: <core::pin::Pin<P> as core::future::future::Future>::poll
6: <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
7: tokio::runtime::context::runtime::enter_runtime
8: tokio::runtime::runtime::Runtime::block_on
9: neptune_core::main
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
2025-12-10T02:35:48.642519225Z DEBUG ThreadId(01) neptune_cash: executed run()::select::block_sync_interval in 0.005594934 secs. exceeds slow fn threshold of 0.001 secs. location: /home/danda/neptune-core/neptune-core/src/application/loops/main_loop.rs:1858:21
2025-12-10T02:35:48.642602375Z WARN ThreadId(62) neptune_cash::application::loops::peer_loop: Main loop broadcast channel closed. Exiting peer loop.
2025-12-10T02:35:48.642638824Z DEBUG ThreadId(03) neptune_cash::application::job_queue::job_handle: JobHandle dropping for job: 4ed3797f6c8311c613102fb4
2025-12-10T02:35:48.642647494Z DEBUG ThreadId(62) neptune_cash::application::loops::peer_loop: Exited peer loop for 139.162.193.206:9798
2025-12-10T02:35:48.64267519Z DEBUG ThreadId(62) neptune_cash::application::loops::connect_to_peers: Fetched peer info standing 0 for peer 139.162.193.206:9798
2025-12-10T02:35:48.708337487Z DEBUG ThreadId(63) neptune_cash::protocol::consensus::block::validity::block_program: ** Calling triton_vm::verify to verify block proof ...
thread 'tokio-runtime-worker' (962815) panicked at /home/danda/neptune-core/neptune-core/src/protocol/proof_abstractions/verifier.rs:63:14:
should be able to verify proof in new tokio task: JoinError::Cancelled(Id(37931))
stack backtrace:
0: __rustc::rust_begin_unwind
1: core::panicking::panic_fmt
2: core::result::unwrap_failed
3: neptune_cash::protocol::proof_abstractions::verifier::verify::{{closure}}
4: neptune_cash::protocol::consensus::block::Block::is_valid::{{closure}}
5: neptune_cash::protocol::peer::SyncChallengeResponse::is_valid::{{closure}}
6: neptune_cash::application::loops::peer_loop::PeerLoopHandler::handle_peer_message::{{closure}}
7: neptune_cash::application::loops::peer_loop::PeerLoopHandler::run::{{closure}}
8: neptune_cash::application::loops::peer_loop::PeerLoopHandler::run_wrapper::{{closure}}
9: neptune_cash::application::loops::connect_to_peers::call_peer::{{closure}}::{{closure}}
10: neptune_cash::application::loops::connect_to_peers::call_peer::{{closure}}
11: <core::pin::Pin<P> as core::future::future::Future>::poll
12: <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
13: tokio::runtime::task::core::Core<T,S>::poll
14: tokio::runtime::task::harness::Harness<T,S>::poll
15: tokio::runtime::scheduler::multi_thread::worker::Context::run_task
16: tokio::runtime::scheduler::multi_thread::worker::Context::run
17: tokio::runtime::context::scoped::Scoped<T>::set
18: tokio::runtime::context::runtime::enter_runtime
19: tokio::runtime::scheduler::multi_thread::worker::run
20: <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
21: tokio::runtime::task::core::Core<T,S>::poll
22: tokio::runtime::task::harness::Harness<T,S>::poll
23: tokio::runtime::blocking::pool::Inner::run
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
2025-12-10T02:35:48.709210641Z DEBUG ThreadId(63) neptune_cash: executed handle_peer_message()::PeerMessage::SyncChallengeResponse in 0.847406868 secs. exceeds slow fn threshold of 0.001 secs. location: /home/danda/neptune-core/neptune-core/src/application/loops/peer_loop.rs:773:17
2025-12-10T02:35:48.709434918Z ERROR ThreadId(63) neptune_cash::application::loops::connect_to_peers: Peer task (outgoing) for 162.219.87.217:9798 panicked. Invoking close connection callback
2025-12-10T02:35:48.709453596Z DEBUG ThreadId(63) neptune_cash::application::loops::connect_to_peers: Fetched peer info standing 0 for peer 162.219.87.217:9798
thread 'tokio-runtime-worker' (962815) panicked at /home/danda/neptune-core/neptune-core/src/application/database/neptune_leveldb.rs:260:60:
called `Result::unwrap()` on an `Err` value: JoinError::Cancelled(Id(37932))
stack backtrace:
0: __rustc::rust_begin_unwind
1: core::panicking::panic_fmt
2: core::result::unwrap_failed
3: neptune_cash::application::database::neptune_leveldb::NeptuneLevelDb<Key,Value>::get::{{closure}}
4: neptune_cash::application::loops::connect_to_peers::close_peer_connected_callback::{{closure}}
5: neptune_cash::application::loops::connect_to_peers::call_peer::{{closure}}
6: <core::pin::Pin<P> as core::future::future::Future>::poll
7: <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
8: tokio::runtime::task::core::Core<T,S>::poll
9: tokio::runtime::task::harness::Harness<T,S>::poll
10: tokio::runtime::scheduler::multi_thread::worker::Context::run_task
11: tokio::runtime::scheduler::multi_thread::worker::Context::run
12: tokio::runtime::context::scoped::Scoped<T>::set
13: tokio::runtime::context::runtime::enter_runtime
14: tokio::runtime::scheduler::multi_thread::worker::run
15: <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
16: tokio::runtime::task::core::Core<T,S>::poll
17: tokio::runtime::task::harness::Harness<T,S>::poll
18: tokio::runtime::blocking::pool::Inner::run
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
2025-12-10T02:35:48.710066993Z DEBUG ThreadId(63) neptune_cash::state: spawning flush db thread
2025-12-10T02:35:48.713175621Z INFO ThreadId(171) neptune_cash::state: GlobalState is dropping. flushing database
2025-12-10T02:35:48.735421404Z DEBUG ThreadId(171) neptune_cash::state: Flushed all databases