Investigate possible crash during epoch ledger snapshot syncing #17899

@cjjdespres

Description

(I'm still trying to reproduce this locally. I got the crash with --hardfork-mode auto enabled - which we still have to rename - but the code looks like it could be susceptible to the same bug without that flag. Unfortunately, I did not save the exact crash message.)

I was periodically syncing a daemon to devnet, with and without --hardfork-mode auto, and the daemon crashed in this bit of the code while --hardfork-mode auto was enabled:

let root_ledger_of_snapshot snapshot snapshot_config =
  O1trace.sync_thread "root_ledger_of_snapshot" (fun () ->
      match snapshot.ledger with
      | Ledger_snapshot.Ledger_root ledger ->
          Ok ledger
      | Ledger_snapshot.Genesis_epoch_ledger packed ->
          Genesis_ledger.Packed.create_root packed
            ~config:snapshot_config
            ~depth:Context.constraint_constants.ledger_depth () )
in

The create_root function threw an exception when trying to sync one of the epoch snapshots because the rocksdb checkpoint failed - the target directory of the checkpoint already existed. In other words, there was an epoch ledger snapshot already at the snapshot_config location while the daemon was still at the genesis epoch snapshot.

This was not failing in my local testing before - it may have started because of #17874. Before that PR, we'd do this in this situation:

| Ledger_snapshot.Genesis_epoch_ledger packed ->
    let fresh_root_ledger =
      Mina_ledger.Ledger.Root.create ~logger
        ~config:snapshot_config
        ~depth:Context.constraint_constants.ledger_depth
        ()
    in
    Genesis_ledger.Packed.populate_root packed
      fresh_root_ledger )

That Ledger.Root.create would open whatever database is present at that config location. (The code before we made all these root ledger handling changes did the same thing.) It would then overwrite the contents of the database with the genesis ledger, and then sync the ledger to the network. Thus, the daemon did not have to care about cleaning up an old epoch ledger database that was lying around.
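
To make the difference concrete, here is a toy model of the two behaviours. It is only a sketch: plain directories stand in for RocksDB databases, and `open_or_create`, `checkpoint_to` and `fresh_path` are hypothetical names, not functions from the Mina codebase. The one assumption carried over from the crash is that a checkpoint refuses to write into a target directory that already exists.

```ocaml
(* Toy model only: plain directories stand in for RocksDB databases. *)

(* Models the old path: open whatever is at [dir], creating it if missing. *)
let open_or_create dir =
  if not (Sys.file_exists dir) then Sys.mkdir dir 0o755

(* Models a checkpoint: the target directory must not exist yet. *)
let checkpoint_to dir =
  if Sys.file_exists dir then Error (`Target_exists dir)
  else (
    Sys.mkdir dir 0o755 ;
    Ok () )

(* Hypothetical helper: a unique path that does not exist on disk yet. *)
let fresh_path () =
  let path = Filename.temp_file "epoch_ledger" "" in
  Sys.remove path ;
  path

let () =
  let dir = fresh_path () in
  (* A previous run left a snapshot database behind. *)
  Sys.mkdir dir 0o755 ;
  (* Old behaviour: reopening the stale directory is fine. *)
  open_or_create dir ;
  (* New behaviour: the checkpoint refuses to reuse the directory. *)
  ( match checkpoint_to dir with
  | Error (`Target_exists _) ->
      print_endline "checkpoint failed: target already exists"
  | Ok () ->
      print_endline "checkpoint created" ) ;
  Sys.rmdir dir
```

In this model the stale directory is harmless on the old path and fatal on the new one, which matches what I saw.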

I'm unsure of a few things:

  • Whether this can be reproduced with --hardfork-mode auto, and whether I can get it to show up without that flag. (I'm still looking at it.)
  • Whether the daemon was correctly at the genesis epoch ledger snapshots at the moment it crashed.

We might want to add some code to delete any snapshot backing that might be present at the snapshot_config location before creating a new root from genesis. Though, if this only shows up with --hardfork-mode auto, then this kind of failure might be the result of a bug elsewhere.
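
A minimal sketch of that cleanup, using only the OCaml standard library (`Sys.mkdir` and `Sys.rmdir` need OCaml 4.12 or later). The names `remove_tree` and `create_root_from_genesis` are hypothetical, and `~create` stands in for whatever actually builds the root. I'm assuming the backing is a single directory with no symlinks inside it; the real code would have to derive the path(s) from snapshot_config.

```ocaml
(* Recursively delete [path] if it exists; do nothing otherwise.
   Assumes the tree contains no symlinks. *)
let rec remove_tree path =
  if Sys.file_exists path then
    if Sys.is_directory path then (
      Array.iter
        (fun entry -> remove_tree (Filename.concat path entry))
        (Sys.readdir path) ;
      Sys.rmdir path )
    else Sys.remove path

(* Hypothetical wrapper: clear any stale backing at [dir], then call
   [create], which stands in for the real root-creation function. *)
let create_root_from_genesis ~create dir =
  remove_tree dir ;
  create dir

let () =
  let dir = Filename.temp_file "epoch_ledger" "" in
  Sys.remove dir ;
  (* Simulate a stale snapshot database left by an earlier run. *)
  Sys.mkdir dir 0o755 ;
  close_out (open_out (Filename.concat dir "CURRENT")) ;
  (* [Sys.mkdir] fails if [dir] exists, like the checkpoint in this issue. *)
  create_root_from_genesis ~create:(fun d -> Sys.mkdir d 0o755) dir ;
  assert (Sys.readdir dir = [||]) ;
  remove_tree dir
```

Deleting unconditionally is only safe if nothing else still holds the old database open, so the real change should probably check that the daemon is in fact at the genesis epoch snapshot first.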
