
Consensus deadlock: validator node froze for 7 hours on devnet #25273

Summary

A devnet validator node (ewr-suival-a4bdd) froze completely for ~7 hours while appearing healthy (process running, systemd reporting active). The node stopped processing and logging at 10:10:07 UTC on 2026-02-04, with its checkpoint stuck at 1,114,821 while the other validators progressed to ~1,264,100.

Environment

  • Node: ewr-suival-a4bdd (devnet validator)
  • Version: v1.65.0-080ab2d4f113-dirty
  • OS: Ubuntu 24.04, kernel 5.15.0-136-generic
  • Hardware: Intel Xeon E-2388G, 125GB RAM
  • Uptime at freeze: ~25 minutes after restart

Timeline

  1. 09:44:59 UTC - Node restarted (PID 593313)
  2. 10:10:06 UTC - Last consensus activity logged
  3. 10:10:07 UTC - Final log: "Thread stalled for 522ms"
  4. 10:10:07 - 16:57 UTC - Complete silence, no logs despite process running
  5. 16:57:41 UTC - Manual restart resolved the issue

Symptoms

  • Process alive at ~5% CPU and 34% memory; the low activity suggests threads were blocked, not spinning
  • systemctl status showed "active (running)"
  • No journal entries for 7 hours
  • Metrics endpoint responded but checkpoint was stuck
  • No OOM, no kernel errors, no core dump (not a crash)

Last logs before freeze

{"timestamp":"2026-02-04T10:10:06.761917Z","level":"WARN","fields":{"message":"Failed to send certified blocks: SendError { .. }"},"target":"consensus_core::transaction_certifier"}
{"timestamp":"2026-02-04T10:10:07.241509Z","level":"WARN","fields":{"message":"Thread stalled for 522ms"},"target":"mysten_metrics::thread_stall_monitor"}

The SendError warning appeared multiple times before the freeze but is likely unrelated: looking at the code, _block_receiver in commit_consumer.rs:39 is dropped immediately, so this warning is expected.
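
For context, this is the behavior that produces the warning: sending on a channel whose receiver has been dropped fails with a SendError. A minimal sketch, assuming a tokio-style mpsc channel (the names and exact Debug output are illustrative, not the actual consensus_core types):

  use tokio::sync::mpsc;

  fn main() {
      let (tx, rx) = mpsc::unbounded_channel::<&str>();
      drop(rx); // mirrors the immediately dropped _block_receiver

      // With no receiver alive, every send fails; the certifier logs this
      // as "Failed to send certified blocks: SendError { .. }".
      if let Err(e) = tx.send("certified block") {
          eprintln!("Failed to send certified blocks: {e:?}");
      }
  }

Either keeping the receiver alive or treating the warning as expected would remove the noise, but as noted above it does not explain the freeze.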

Observations

  1. Only 1 of 4 devnet validators was affected
  2. All validators had low peer counts (2-3 peers each)
  3. Node had been restarted ~25 minutes before the freeze, suggesting a possible startup-related race condition
  4. The freeze appears to be a deadlock (process alive but completely unresponsive) rather than a crash or resource exhaustion; an illustrative sketch of this kind of mutual blocking follows this list
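
Purely as an illustration of the kind of mutual blocking suspected (not code taken from Sui), two tasks can wedge each other when one awaits a bounded-channel send while holding a lock the other needs before it will drain the channel. A minimal sketch with hypothetical names:

  use std::sync::Arc;
  use tokio::sync::{mpsc, Mutex};

  #[tokio::main]
  async fn main() {
      let (tx, mut rx) = mpsc::channel::<u64>(1);
      let state = Arc::new(Mutex::new(0u64));

      let producer_state = state.clone();
      let producer = tokio::spawn(async move {
          let _guard = producer_state.lock().await;
          tx.send(1).await.unwrap(); // fills the channel (capacity 1)
          tx.send(2).await.unwrap(); // waits for the consumer while still holding the lock
      });

      let consumer_state = state.clone();
      let consumer = tokio::spawn(async move {
          let mut guard = consumer_state.lock().await; // needs the lock the producer holds
          while let Some(v) = rx.recv().await {
              *guard = v;
          }
      });

      // Neither task can make progress: the process stays alive with near-zero
      // CPU and stops emitting logs, matching the observed symptoms.
      let _ = tokio::join!(producer, consumer);
  }

Whether anything in the consensus path actually follows this shape is what item 1 under "Suggested investigation" would need to establish.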

What we couldn't capture

Since the node was restarted to restore service, we lost the opportunity to:

  • Capture a thread dump via gdb
  • Inspect tokio runtime state (see the sketch after this list for one way to make that possible in future incidents)
  • Identify which threads were blocked and on what
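
For future incidents, one option (an assumption, not something the node currently does) is to build with tokio's unstable cfg and register console-subscriber, so that the tokio-console CLI can attach to a wedged node and show which tasks are blocked and where:

  // Requires building with RUSTFLAGS="--cfg tokio_unstable" and adding the
  // console-subscriber crate; both are assumptions, not current node setup.
  fn main() {
      // Starts the instrumentation server that the `tokio-console` CLI connects to.
      console_subscriber::init();
      // ... normal node startup continues here ...
  }

In a binary that already installs its own tracing subscriber, the console layer would be composed into that subscriber rather than installed via init().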

Suggested investigation

  1. Review consensus code paths that could lead to mutual blocking (of the kind sketched under Observations above)
  2. Consider adding a watchdog that captures thread dumps when checkpoint progress stalls; a sketch follows this list
  3. Review the SendError handling in transaction_certifier.rs - while likely not the cause, the dropped-receiver pattern seems unintentional
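
A minimal sketch of item 2, with all names hypothetical (the real node would feed it the highest executed checkpoint from its existing metrics). It runs on a plain OS thread rather than a tokio task so it keeps working even if the async runtime itself is wedged, which is exactly the failure mode seen here:

  use std::sync::atomic::{AtomicU64, Ordering};
  use std::sync::Arc;
  use std::time::Duration;

  // `checkpoint` is a placeholder for whatever progress counter the node
  // already exports; `stall_timeout` is how long zero progress is tolerated.
  pub fn spawn_checkpoint_watchdog(
      checkpoint: Arc<AtomicU64>,
      stall_timeout: Duration,
  ) -> std::thread::JoinHandle<()> {
      std::thread::spawn(move || {
          let mut last_seen = checkpoint.load(Ordering::Relaxed);
          loop {
              std::thread::sleep(stall_timeout);
              let current = checkpoint.load(Ordering::Relaxed);
              if current == last_seen {
                  // No progress for a full interval: complain loudly while the
                  // process is still wedged. A real implementation would also
                  // trigger a thread dump here, e.g. by raising a signal that a
                  // handler turns into per-thread backtraces.
                  eprintln!("watchdog: checkpoint stuck at {current} for {stall_timeout:?}");
              }
              last_seen = current;
          }
      })
  }

Keeping the watchdog off the async runtime matters: a watchdog scheduled as a task on the same runtime that deadlocks may never get to fire.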
