Recoverability test: ensure realistic SP outages are recoverable "quick enough" #398

@BigLep

Description

Done Criteria

After the 202601-2 wave of Curio hardening/performance improvements, take an SP out of service for ~5 minutes (e.g., simulating a power outage) and measure how long it takes to recover from the backlog.
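
One way to turn "how long it takes to recover" into a number is sketched below. It is only an illustration: the sample format (timestamp, pending task count), the `baseline` threshold, and the function name `recovery_seconds` are all assumptions, not anything Curio exposes. It treats the SP as recovered at the first sample after which the backlog stays at or below the baseline.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical sample shape: (unix_seconds, pending_task_count).
Sample = Tuple[float, int]


def recovery_seconds(
    samples: Sequence[Sample], outage_end: float, baseline: int
) -> Optional[float]:
    """Seconds from outage_end until the backlog returns to baseline for good.

    Returns None if there are no samples after the outage, or if the backlog
    is still above baseline at the last sample (not yet recovered).
    """
    after = sorted(s for s in samples if s[0] >= outage_end)
    if not after or after[-1][1] > baseline:
        return None

    recovered_at = after[-1][0]
    # Walk backwards while the backlog stays at or below baseline; the
    # earliest such sample is the point where recovery became stable.
    for ts, pending in reversed(after):
        if pending > baseline:
            break
        recovered_at = ts
    return recovered_at - outage_end


# Made-up numbers for illustration: the outage ends at t=300 and the backlog
# drops back under the baseline of 5 at t=900.
samples = [(0, 2), (300, 50), (600, 40), (900, 3), (1200, 2)]
print(recovery_seconds(samples, outage_end=300, baseline=5))
```

Requiring the backlog to stay under the baseline, rather than taking the first dip below it, avoids declaring recovery early if the backlog briefly drops and then grows again.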

Why Important

Power outages will occur and we want to make sure that these events don't tank an SP.

Notes

  1. A real power outage on 2026-01-21 affected calib.ezpdpz.net and calib2.ezpdpz.ne. The outage lasted 5 minutes, but the nodes took ~12 hours to catch up (slack thread). Note that this happened on calibration, which has a 12x more frequent proving period, but we believe it approximates what would happen if a mainnet node had 12x as many datasets as these nodes had.
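
The arithmetic behind that note can be spelled out as a rough sketch. It assumes, as the note does, that proving work scales linearly with the number of datasets times the proving frequency; the function names and the dataset count of 100 are hypothetical and used only for illustration.

```python
OUTAGE_MINUTES = 5            # outage duration from the note
CATCH_UP_HOURS = 12           # approximate catch-up time from the note
PROVING_FREQUENCY_RATIO = 12  # calibration proving period is 12x more frequent


def catch_up_ratio(outage_minutes: float, catch_up_hours: float) -> float:
    """How many minutes of catch-up each minute of outage cost."""
    return catch_up_hours * 60 / outage_minutes


def equivalent_mainnet_datasets(
    calibration_datasets: int, frequency_ratio: int = PROVING_FREQUENCY_RATIO
) -> int:
    """Mainnet dataset count with the same proving load per unit time,
    assuming load is proportional to datasets x proving frequency."""
    return calibration_datasets * frequency_ratio


print(catch_up_ratio(OUTAGE_MINUTES, CATCH_UP_HOURS))  # 144.0
print(equivalent_mainnet_datasets(100))                # 1200 (100 is a made-up count)
```

In other words, the observed incident cost roughly 144 minutes of catch-up per minute of downtime, which is the figure a recoverability test would aim to bring down.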

Metadata

Assignees

No one assigned

    Labels

    No labels

    Type

    No type

    Projects

    Status

    🐱 Todo

    Relationships

    None yet

    Development

    No branches or pull requests
