Summary
A devnet validator node (ewr-suival-a4bdd) completely froze for ~7 hours while appearing healthy (process running, systemd showing active). The node stopped processing and logging at 10:10:07 UTC on 2026-02-04, with checkpoint stuck at 1,114,821 while other validators progressed to ~1,264,100.
Environment
- Node: ewr-suival-a4bdd (devnet validator)
- Version: v1.65.0-080ab2d4f113-dirty
- OS: Ubuntu 24.04, kernel 5.15.0-136-generic
- Hardware: Intel Xeon E-2388G, 125GB RAM
- Uptime at freeze: ~25 minutes after restart
Timeline
- 09:44:59 UTC - Node restarted (PID 593313)
- 10:10:06 UTC - Last consensus activity logged
- 10:10:07 UTC - Final log: "Thread stalled for 522ms"
- 10:10:07 - 16:57 UTC - Complete silence, no logs despite the process still running
- 16:57:41 UTC - Manual restart resolved the issue
Symptoms
- Process alive with 5% CPU, 34% memory - low activity suggesting blocked, not spinning
- systemctl status showed "active (running)"
- No journal entries for 7 hours
- Metrics endpoint responded but checkpoint was stuck
- No OOM, no kernel errors, no core dump (not a crash)
Last logs before freeze
{"timestamp":"2026-02-04T10:10:06.761917Z","level":"WARN","fields":{"message":"Failed to send certified blocks: SendError { .. }"},"target":"consensus_core::transaction_certifier"}
{"timestamp":"2026-02-04T10:10:07.241509Z","level":"WARN","fields":{"message":"Thread stalled for 522ms"},"target":"mysten_metrics::thread_stall_monitor"}
The SendError warning appeared multiple times before the freeze but is likely unrelated - looking at the code, _block_receiver in commit_consumer.rs:39 is immediately dropped, making this warning expected.
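For context, the warning text matches what sending into a channel whose receiver has already been dropped looks like. A minimal sketch of that pattern, assuming a tokio mpsc-style channel (the names and types here are illustrative, not the actual consensus_core code):

```rust
use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    // Create a channel and immediately drop the receiving half, mirroring a
    // receiver bound to a throwaway name like `_block_receiver` that is never
    // polled.
    let (tx, rx) = mpsc::channel::<u64>(16);
    drop(rx);

    // Every subsequent send now fails with SendError, which a caller could
    // surface as "Failed to send certified blocks: SendError { .. }".
    if let Err(err) = tx.send(42).await {
        eprintln!("send failed: {err:?}");
    }
}
```

If the receiver really is meant to be unused, not creating the channel (or explicitly closing it and skipping the send path) would avoid emitting this warning on every attempt.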
Observations
- Only 1 of 4 devnet validators was affected
- All validators had low peer counts (2-3 peers each)
- Node had been restarted ~25 minutes before the freeze, suggesting a possible startup-related race condition
- The freeze appears to be a deadlock (process alive but completely unresponsive) rather than a crash or resource exhaustion
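To make the deadlock hypothesis concrete, here is a generic, minimal sketch of a lock-order deadlock in an async runtime. It is not taken from the Sui codebase; it only illustrates why a deadlocked process can look healthy (alive, low CPU, metrics endpoint responsive, no new logs):

```rust
use std::sync::Arc;
use std::time::Duration;
use tokio::sync::Mutex;

#[tokio::main]
async fn main() {
    let a = Arc::new(Mutex::new(0u64));
    let b = Arc::new(Mutex::new(0u64));

    // Task 1 locks A then B; task 2 locks B then A. Once each task holds its
    // first lock, both wait on the other forever: the process stays alive,
    // CPU stays near zero, and nothing downstream makes progress.
    let (a1, b1) = (a.clone(), b.clone());
    let t1 = tokio::spawn(async move {
        let _ga = a1.lock().await;
        tokio::time::sleep(Duration::from_millis(10)).await;
        let _gb = b1.lock().await;
    });
    let (a2, b2) = (a.clone(), b.clone());
    let t2 = tokio::spawn(async move {
        let _gb = b2.lock().await;
        tokio::time::sleep(Duration::from_millis(10)).await;
        let _ga = a2.lock().await;
    });

    let _ = tokio::join!(t1, t2); // hangs here once the deadlock hits
}
```

This is only a shape-of-failure illustration; the actual blocking site (locks, channels, or a blocked runtime thread) is unknown without a thread dump.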
What we couldn't capture
Since the node was restarted to restore service, we lost the opportunity to:
- Capture a thread dump via gdb
- Inspect tokio runtime state
- Identify which threads were blocked and on what
Suggested investigation
- Review consensus code paths that could lead to mutual blocking
- Consider adding a watchdog that captures thread dumps when checkpoint progress stalls (a rough sketch follows this list)
- Review the transaction_certifier.rs SendError handling - while likely not the cause, the dropped receiver pattern seems unintentional
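On the watchdog suggestion, one possible shape, assuming the node already tracks its highest checkpoint in something like an atomic gauge (`highest_checkpoint` and the 30s interval are made-up placeholders, not existing Sui APIs). Running it on a dedicated OS thread rather than a tokio task means it still fires even if every async worker is wedged:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

/// Hypothetical watchdog sketch: complain (and, in a real implementation,
/// trigger a thread dump / tokio state capture) when the checkpoint counter
/// stops advancing.
fn spawn_checkpoint_watchdog(highest_checkpoint: Arc<AtomicU64>) {
    thread::spawn(move || {
        let interval = Duration::from_secs(30);
        let mut last_seen = highest_checkpoint.load(Ordering::Relaxed);
        let mut stalled_for = Duration::ZERO;
        loop {
            thread::sleep(interval);
            let current = highest_checkpoint.load(Ordering::Relaxed);
            if current > last_seen {
                last_seen = current;
                stalled_for = Duration::ZERO;
            } else {
                stalled_for += interval;
                eprintln!(
                    "watchdog: checkpoint stuck at {current} for {stalled_for:?}; \
                     capture thread dumps before anyone restarts the node"
                );
            }
        }
    });
}
```

An out-of-process check (e.g. scraping the metrics endpoint, which stayed responsive during this incident) would be even more robust, since an in-process watchdog shares the fate of the process it watches.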
Related
- Failed workflow run: https://github.com/MystenLabs/sui-operations/actions/runs/21676882921