Current Status
2025-07-22T05:59:14Z: 🔄 In Progress
- Overall, the chain is advancing as expected. All metrics, including block production and chain quality, indicate healthy chain progression.
- F3 is finalizing, but progress is choppier and further behind than pre-event levels.
- We're still working to identify the cause of the participation drop behind F3's choppy progress, as outlined in the "Stop the Bleeding" steps.
- Current hypothesis: some SPs haven't set the F3 initial power table CID, no longer possess the full state for epoch 4918680, and so drop out of participation when they restart with a clean state. This at least explains the slow decline we have seen over the last 30 days, but doesn't explain the large ~5% drop that triggered this event. We made a push to get this initial power table CID set (blog entry), but there was no forcing function for SPs to do so. We expect a network upgrade will help here. A network upgrade will also help get nodes updated to the latest SP software, which has F3 resiliency improvements based on the two mainnet participation events (plus other security improvements).
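For operators who want to sanity-check this hypothesis against their own node, here is a minimal sketch. It only shells out to the `forest-cli f3 pt get 0` command referenced in the Action Items below and reports whether the power table at instance 0 is still retrievable; the wrapper itself is an illustrative assumption, not official tooling, and Lotus/Venus equivalents are not shown.

```go
// Minimal sketch: checks whether a Forest node can still serve the F3 power
// table at instance 0, which the hypothesis above suspects some SPs can no
// longer do after restarting with a clean state.
// Assumes `forest-cli` is on PATH; the subcommand is the one referenced in
// the Action Items section of this issue.
package main

import (
	"fmt"
	"os"
	"os/exec"
)

func main() {
	out, err := exec.Command("forest-cli", "f3", "pt", "get", "0").CombinedOutput()
	if err != nil {
		fmt.Fprintf(os.Stderr, "initial power table lookup failed: %v\n%s", err, out)
		os.Exit(1)
	}
	fmt.Printf("initial power table retrievable:\n%s", out)
}
```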
What Happened / What Was Observed
The chain is advancing with Expected Consensus (EC) as expected. All metrics, including block production and chain quality, indicate healthy chain progression.
F3 started experiencing a lack of decisions due to what we initially thought was a failure mode similar to the one previously encountered in #13094 (comment).

Root Cause Analysis
The failure mode causing F3 to lack progression this time around appears to be:
- Aider process failure: The Aider process that we had set up to help propagate messages failed to kick in due to disk space issues / disk rotation problems on the observer node that gathers messages in the network.
- Message propagation disruption: The Aider process could therefore not help propagate/broadcast messages to the network because it was receiving only zero-byte files due to the lack of disk space on the observer.
Upon further investigation, it looks like the lack of progress was actually triggered by a drop in participation:

Without the Aider running effectively with enough saved state on the Observer, it wasn't able to help offset this decline in participation.
Stop The Bleeding
- ✅ The Aider and the Observer node are now back up and running again, and are helping propagate messages around the network
- ⏰ Round timing: Due to the exponential backoff function in F3, round durations keep doubling, and the CONVERGE step is timeout-based so that all nodes have the full timeout window to receive messages.
Round | CONVERGE timeout | CONVERGE ending datetime |
---|---|---|
10 | 7h | ~2025-07-11 23:30 UTC |
11 | 14h | ~2025-07-12 13:30 UTC |
12 | 28h | ~2025-07-13 17:30 UTC |
- (in progress) See if anyone on #fil-fast-finality is aware of the ~5% participation drop that was observed around ~2025-07-11 02:20 UTC. (slack thread)
  - As of 2025-07-21, no reason has been determined.
- (in progress) Give visibility to minerIds not participating in F3 at #global-storage-provider-community, so we can hopefully figure out any fixes or improvements that need to be made to F3 software.
  - Post: https://filecoinproject.slack.com/archives/C02GQUMFQVA/p1752704795614989
  - 2025-07-21: this hasn't led to any determinations of SP behavior or code/config bugs that could have triggered this event.
Action Items after Stopping the Bleeding
Note: The longer-term fixes for the propagation issue mentioned in #13094 have landed (e.g., filecoin-project/go-f3#1024) and would have been helpful in this case. That said, they haven't been bubbled up into certain Lotus/Venus/Forest versions and aren't yet widely adopted by enough participants in the network for it to do without the help of the Aider process. We can count on this more resilient software being deployed after the nv27 network upgrade.
- Identify node versions that the community should upgrade to, ideally before the nv27 network upgrade
- Lotus: Lotus Node v1.33.1 Release #13132
- Forest: next release after 0.27.0, since the fix for `forest-cli f3 pt get 0` not working (ChainSafe/forest#5785) has been merged
- Venus: Update f3 to v0.8.7 venus#6480
- Get better monitoring in place to catch if Aider isn't actually "aiding" (e.g., disk space monitoring/alarming, monitoring/alarming on the amount of broadcasting the process is doing)
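As a concrete (though hypothetical) illustration of what that monitoring could look like, here is a minimal sketch that checks the two symptoms described in the Root Cause Analysis above: low free disk space on the observer and zero-byte message files being handed to the Aider. The directory path, threshold, and alerting mechanism are all assumptions for illustration, not the actual Aider/Observer tooling, and the disk-space check is Linux-specific.

```go
// Minimal monitoring sketch (assumed paths/thresholds, Linux-only Statfs);
// a real deployment would feed these alarms into whatever alerting stack the
// observer host already uses.
package main

import (
	"fmt"
	"io/fs"
	"path/filepath"
	"syscall"
)

const (
	observerDir  = "/var/lib/f3-observer/messages" // hypothetical message directory
	minFreeBytes = 10 << 30                        // alarm below ~10 GiB free (arbitrary)
)

func main() {
	// 1. Free disk space on the filesystem holding the observer's data.
	var st syscall.Statfs_t
	if err := syscall.Statfs(observerDir, &st); err == nil {
		free := st.Bavail * uint64(st.Bsize)
		if free < minFreeBytes {
			fmt.Printf("ALARM: only %d bytes free under %s\n", free, observerDir)
		}
	}

	// 2. Zero-byte message files, the symptom seen during this event.
	zeroByteFiles := 0
	filepath.WalkDir(observerDir, func(path string, d fs.DirEntry, err error) error {
		if err != nil || d.IsDir() {
			return nil
		}
		if info, infoErr := d.Info(); infoErr == nil && info.Size() == 0 {
			zeroByteFiles++
		}
		return nil
	})
	if zeroByteFiles > 0 {
		fmt.Printf("ALARM: %d zero-byte message files under %s\n", zeroByteFiles, observerDir)
	}
}
```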
Note: these exponentially increasing CONVERGE timeouts are painful and we'd like to put an upper limit on them, but this isn't a quick/easy change. We are living with it for now and are not committing to a change currently.
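For context on how quickly those timeouts grow, a minimal sketch follows. It assumes a per-round doubling with the base back-solved from the round-10 value in the table above; the 8h cap in the example output is purely illustrative of the upper limit discussed here, not a committed change or go-f3's actual constants.

```go
// Minimal sketch of CONVERGE timeout growth under a per-round doubling
// (assumption); go-f3's real backoff constants are not reproduced here.
package main

import (
	"fmt"
	"time"
)

func convergeTimeout(round int, limit time.Duration) time.Duration {
	base := 7 * time.Hour / (1 << 10) // back-solved so round 10 lands at ~7h, per the table above
	d := base * time.Duration(1<<round)
	if limit > 0 && d > limit {
		return limit // a hypothetical upper bound, as discussed in the note above
	}
	return d
}

func main() {
	for round := 10; round <= 13; round++ {
		fmt.Printf("round %d: uncapped %v, with an illustrative 8h cap %v\n",
			round, convergeTimeout(round, 0), convergeTimeout(round, 8*time.Hour))
	}
}
```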
Additional resources
The internal operational docs for F3 are in https://filoznotebook.notion.site/F3-Operational-Excellence-5cdce4f1aa6e4c398b475f6e690c47fe . This also references a public F3 dashboard: https://grafana.f3.eng.filoz.org/public-dashboards/e9d8fe95ae9a4341ba2e730f1a4c86be