[SEV] F3 Instance: 183760 not reaching a decision

## Current Status
2025-07-22T05:59:14Z: 🔄 In Progress

- Overall chain advancing as expected. All metrics, including block production and chain quality, indicate healthy chain progression.
- F3 is finalizing but is more choppy and further behind than pre-event levels.
- We're still working to identify the cause of the drop in participation that is making F3 progress choppy, as outlined in the "Stop the Bleeding" steps.
- Current hypothesis: some SPs haven't set the F3 initial power table CID and no longer possess the full state for epoch 4918680, so they drop out of participation when they restart with clean state. This at least explains the slow decline we have seen over the last 30 days, but it doesn't explain the large ~5% drop that triggered this event. We made a push to get this initial power table CID set ([blog entry](https://medium.com/@filoz/the-f3-journey-now-that-f3-is-active-on-mainnet-there-are-follow-up-steps-to-take-a13a93268c29)), but there was no forcing function for SPs to do this. We expect a network upgrade will help here. A network upgrade will also help get nodes updated to the latest SP software, which has F3 resiliency improvements based on the two mainnet participation events (plus other security improvements).

## What Happened / What Was Observed
> The chain is advancing with Expected Consensus (EC) as expected. All metrics, including block production and chain quality, indicate healthy chain progression.

F3 stopped reaching decisions due to what we initially thought was a failure mode similar to the one previously encountered in https://github.com/filecoin-project/lotus/issues/13094#issuecomment-2851568698.

<img width="853" height="591" alt="Image" src="https://github.com/user-attachments/assets/80db0b73-c512-4c93-83e9-b5e59b55f344" />

## Root Cause Analysis

The failure mode causing F3 to lack progression this time around appears to be:

- **Aider process failure**: The Aider process that we had set up to help propagate messages failed to kick in due to disk space issues / disk rotation problems on the observer node that gathers messages in the network.
   - **Message propagation disruption**: The Aider process could therefore not help propagate/broadcast the messages to the network, because it was only receiving zero-byte files due to the lack of disk space on the observer.

Upon further investigation, it looks like the lack of progress was actually triggered by a drop in participation:

<img width="773" height="875" alt="Image" src="https://github.com/user-attachments/assets/0e9cd086-3f88-41f9-a2f9-8b1cfc49e944" />

Because the Aider was not running effectively with enough saved state in the Observer, it wasn't able to help offset this decline in participation.

## Stop The Bleeding
- ✅ **The Aider and the Observer node are now back up and running again, and are helping propagate messages around the network**
   - ⏰ **Round timing**: Due to the exponential backoff function in F3, the round timeout doubles with each round, and the CONVERGE step is timeout-based so that all nodes get the full timeout to receive messages.

Round | CONVERGE timeout | CONVERGE ending datetime
-- | -- | --
10 | 7h | ~2025-07-11 23:30 UTC
11 | 14h | ~2025-07-12 13:30 UTC
12 | 28h | ~2025-07-13 17:30 UTC
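
The doubling in the table above can be reproduced with a short sketch. This is only an illustration of the arithmetic, not go-f3's implementation: the round-10 values are copied from the table, the doubling rule is inferred from it, and the function names `converge_timeout` and `converge_end` are hypothetical. It also assumes each round's CONVERGE timeout starts when the previous one ends, ignoring the other phases.

```python
from datetime import datetime, timedelta, timezone

# Assumptions: values for round 10 are copied from the table above; the
# "double every round" rule is inferred from the table, not from go-f3 source.
BASE_ROUND = 10
BASE_TIMEOUT = timedelta(hours=7)
BASE_END = datetime(2025, 7, 11, 23, 30, tzinfo=timezone.utc)


def converge_timeout(round_number: int) -> timedelta:
    """CONVERGE timeout for a round, doubling once per round after BASE_ROUND."""
    if round_number < BASE_ROUND:
        raise ValueError("this sketch only models rounds >= BASE_ROUND")
    return BASE_TIMEOUT * 2 ** (round_number - BASE_ROUND)


def converge_end(round_number: int) -> datetime:
    """Approximate CONVERGE end time, assuming back-to-back CONVERGE timeouts."""
    end = BASE_END
    for r in range(BASE_ROUND + 1, round_number + 1):
        end += converge_timeout(r)
    return end


if __name__ == "__main__":
    for r in range(10, 13):
        hours = converge_timeout(r) // timedelta(hours=1)
        print(f"round {r}: {hours}h, ends ~{converge_end(r):%Y-%m-%d %H:%M} UTC")
```

Running it prints the three rows of the table, which is a quick way to sanity-check the ending datetimes.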

- (in progress) See if anyone on #fil-fast-finality is aware of a cause for the ~5% participation drop that was observed around 2025-07-11 02:20 UTC.  ([slack thread](https://filecoinproject.slack.com/archives/C0556MSR945/p1752261614803929))
   - As of 2025-07-21, no reason has been determined

- (in progress) Give visibility to minerIds not participating in F3 at [#global-storage-provider-community](https://filecoinproject.slack.com/archives/C02GQUMFQVA), in the hope of identifying any fixes or improvements that need to be made to F3 software.
   - Post: https://filecoinproject.slack.com/archives/C02GQUMFQVA/p1752704795614989
   - 2025-07-21: this hasn't led to any determinations of SP behavior or code/config bugs that could have triggered this event.

## Action Items after Stopping the Bleeding

> Note: The longer-term fixes for the propagation issue in https://github.com/filecoin-project/lotus/issues/13094 have landed (e.g., https://github.com/filecoin-project/go-f3/pull/1024).  They would have been helpful in this case.  That said, they haven't been bubbled up into certain Lotus/Venus/Forest versions and have not yet been adopted by enough participants for the network to no longer need the help of the Aider process.  We can count on this more resilient software being deployed after the [nv27 network upgrade](https://github.com/filecoin-project/core-devs/discussions/196).

- [ ] Identify the node versions that the community should upgrade to, ideally before the [nv27 network upgrade](https://github.com/filecoin-project/core-devs/discussions/196)
   - Lotus: https://github.com/filecoin-project/lotus/issues/13132
   - Forest: next release after 0.27.0 since https://github.com/ChainSafe/forest/pull/5785 has been merged
   - Venus: https://github.com/filecoin-project/venus/issues/6480
- [x] Get better monitoring in place to catch if Aider isn't actually "aiding" (e.g., disk space monitoring/alarming, monitoring/alarming on the amount of broadcasting the process is doing)
    - https://github.com/filecoin-project/go-f3/issues/1041
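
As a rough sketch of the disk-space side of that monitoring, the check below flags low free space and zero-byte files in the directory where the observer stores gathered messages. Everything here is an assumption for illustration: the function name `check_observer_storage`, the directory layout, and the 10% default threshold are hypothetical and do not come from go-f3 or the linked issue.

```python
import shutil
from pathlib import Path


def check_observer_storage(data_dir, min_free_fraction=0.10):
    """Return a list of alert strings for low disk space and zero-byte files.

    data_dir and min_free_fraction are hypothetical parameters for this sketch.
    """
    alerts = []

    # Free space on the filesystem that holds data_dir.
    usage = shutil.disk_usage(data_dir)
    free_fraction = usage.free / usage.total
    if free_fraction < min_free_fraction:
        alerts.append(f"low disk space: {free_fraction:.1%} free on {data_dir}")

    # Zero-byte files were the visible symptom of the full disk in this event.
    empty = [p for p in Path(data_dir).rglob("*")
             if p.is_file() and p.stat().st_size == 0]
    if empty:
        alerts.append(f"{len(empty)} zero-byte file(s) under {data_dir}, e.g. {empty[0]}")

    return alerts
```

A scheduler or alerting agent could call this periodically and page when the returned list is non-empty. Monitoring how much broadcasting the Aider is actually doing would need metrics from the process itself and is not covered here.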

Note: these exponentially increasing CONVERGE timeouts are painful and we'd like to put an upper limit, but this isn't a quick/easy change.  We are living with this for now and are not committing to a change currently. 
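
To make the idea of an upper limit concrete, here is a minimal sketch of a capped exponential backoff. It is purely illustrative and not a proposed go-f3 change: the function name `capped_timeout` and the 24h cap are hypothetical, and the 7h base at round 10 is taken from the table in "Stop The Bleeding".

```python
from datetime import timedelta


def capped_timeout(round_number,
                   base=timedelta(hours=7),    # round-10 value from the table
                   base_round=10,
                   cap=timedelta(hours=24)):   # hypothetical upper limit
    """Double the timeout each round after base_round, but never exceed cap."""
    uncapped = base * 2 ** max(round_number - base_round, 0)
    return min(uncapped, cap)
```

With these assumed values, round 11 would still get 14h, while round 12 and later would stay at 24h instead of growing to 28h and beyond.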

## Additional resources
The internal operational docs for F3 are in https://filoznotebook.notion.site/F3-Operational-Excellence-5cdce4f1aa6e4c398b475f6e690c47fe .  This also references a public F3 dashboard: https://grafana.f3.eng.filoz.org/public-dashboards/e9d8fe95ae9a4341ba2e730f1a4c86be  
