[SEV] F3 Instance: 183760 not reaching a decision #13208

@rjan90

Current Status

2025-07-22T05:59:14Z: 🔄 In Progress

  • Overall chain advancing as expected. All metrics, including block production and chain quality, indicate healthy chain progression.
  • F3 is finalizing, but progress is choppier and further behind than before the event.
  • We're still working to identify the cause of the participation drop behind F3's choppy progress, as outlined in the "Stop The Bleeding" steps.
  • The current hypothesis is that some SPs haven't set the F3 initial power table CID and no longer possess the full state for epoch 4918680, so they drop out of participation when they restart with clean state. This at least explains the slow decline we have seen over the last 30 days, but it doesn't explain the large ~5% drop that triggered this event. We made a push to get the initial power table CID set (blog entry), but there was no forcing function for SPs to do so. We expect a network upgrade to help here. It will also help get nodes updated to the latest SP software, which has F3 resiliency improvements based on the two mainnet participation events (plus other security improvements).

What Happened / What Was Observed

The chain is advancing with Expected Consensus (EC) as expected. All metrics, including block production and chain quality, indicate healthy chain progression.

F3 stopped reaching decisions, in what we initially thought was a failure mode similar to the one previously encountered in #13094 (comment).

Image

Root Cause Analysis

The failure mode causing F3 to lack progression this time around appears to be:

  • Aider process failure: The Aider process that we had set up to help propagate messages failed to kick in because of disk space / disk rotation problems on the observer node that gathers messages from the network.
    • Message propagation disruption: The Aider process therefore could not help propagate/broadcast messages to the network, because it was receiving only zero-byte files from the observer, which had run out of disk space.
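
The zero-byte symptom described above could be watched for with a small check like the one below. This is a minimal sketch, not the actual Observer or Aider tooling: the function names are hypothetical, and it assumes the gathered messages are stored as regular files in a single directory, which this report does not confirm.

```python
import os
import shutil


def find_zero_byte_files(directory):
    """Return sorted paths of empty regular files directly under `directory`.

    Assumption: gathered messages land as files in one flat directory.
    """
    return sorted(
        entry.path
        for entry in os.scandir(directory)
        if entry.is_file() and entry.stat().st_size == 0
    )


def free_space_fraction(directory):
    """Return the fraction (0.0 to 1.0) of free space on the disk holding `directory`."""
    usage = shutil.disk_usage(directory)
    return usage.free / usage.total
```

Alerting when `find_zero_byte_files` returns anything, or when `free_space_fraction` falls below a chosen threshold, would surface this kind of failure before the files are needed.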

Upon further investigation, it looks like the lack of progress was actually triggered by a drop in participation:

Image

Because the Aider was not running effectively with enough saved state in the Observer, it could not help offset this decline in participation.

Stop The Bleeding

  • The Aider and the Observer node are now back up and running, and are helping propagate messages around the network.
    • Round timing: Because of the exponential backoff function in F3, the timeout doubles with each round, and the CONVERGE step is timeout-based so that all nodes get the full timeout to receive messages.
      Round   CONVERGE timeout   CONVERGE ending datetime
      10      7h                 ~2025-07-11 23:30 UTC
      11      14h                ~2025-07-12 13:30 UTC
      12      28h                ~2025-07-13 17:30 UTC
  • (in progress) Ask in #fil-fast-finality whether anyone knows the cause of the ~5% participation drop that was observed around 2025-07-11 02:20 UTC. (slack thread)

    • As of 2025-07-21, no reason has been determined
  • (in progress) Give visibility into the minerIds not participating in F3 at #global-storage-provider-community, to help identify any fixes or improvements needed in the F3 software.
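
The CONVERGE timings listed under "Round timing" follow from plain doubling, which can be reproduced with a short calculation. This is a sketch under stated assumptions, not go-f3's actual backoff code: the function names are hypothetical, the round-10 values are taken from the table, and each round's CONVERGE step is assumed to end one full timeout after the previous round's CONVERGE ended (time spent in other steps is ignored).

```python
from datetime import datetime, timedelta, timezone

# Anchor values read from the table: round 10 has a 7h CONVERGE timeout
# that ends at approximately 2025-07-11 23:30 UTC.
BASE_ROUND = 10
BASE_TIMEOUT = timedelta(hours=7)
BASE_END = datetime(2025, 7, 11, 23, 30, tzinfo=timezone.utc)


def converge_timeout(round_number):
    """CONVERGE timeout for a round >= BASE_ROUND, assuming it doubles every round."""
    return BASE_TIMEOUT * 2 ** (round_number - BASE_ROUND)


def converge_end(round_number):
    """Approximate end of CONVERGE for a round >= BASE_ROUND.

    Assumption: each CONVERGE ends one full timeout after the previous
    round's CONVERGE ended.
    """
    end = BASE_END
    for r in range(BASE_ROUND + 1, round_number + 1):
        end += converge_timeout(r)
    return end


for r in (10, 11, 12):
    hours = converge_timeout(r) // timedelta(hours=1)
    print(r, f"{hours}h", converge_end(r).strftime("%Y-%m-%d %H:%M UTC"))
```

Under these assumptions the loop reproduces the three rows of the table, which is why each additional round without a decision is so costly.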

Action Items after Stopping the Bleeding

Note: The longer-term fixes for the propagation issue in #13094 have landed (e.g., filecoin-project/go-f3#1024) and would have been helpful in this case. However, they haven't yet been bubbled up into certain Lotus/Venus/Forest versions, and they are not yet adopted by enough participants for the network to manage without the help of the Aider process. We can count on this more resilient software being deployed after the nv27 network upgrade.

Note: These exponentially increasing CONVERGE timeouts are painful, and we'd like to put an upper limit on them, but this isn't a quick or easy change. We are living with this for now and are not currently committing to a change.

Additional resources

The internal operational docs for F3 are in https://filoznotebook.notion.site/F3-Operational-Excellence-5cdce4f1aa6e4c398b475f6e690c47fe . This also references a public F3 dashboard: https://grafana.f3.eng.filoz.org/public-dashboards/e9d8fe95ae9a4341ba2e730f1a4c86be
