
Consensus deadlock: validator node froze for 7 hours on devnet #25273

Summary

A devnet validator node (ewr-suival-a4bdd) froze completely for ~7 hours while appearing healthy (process running, systemd reporting active). The node stopped processing and logging at 10:10:07 UTC on 2026-02-04, with its checkpoint stuck at 1,114,821 while the other validators progressed to ~1,264,100.

Environment

  • Node: ewr-suival-a4bdd (devnet validator)
  • Version: v1.65.0-080ab2d4f113-dirty
  • OS: Ubuntu 24.04, kernel 5.15.0-136-generic
  • Hardware: Intel Xeon E-2388G, 125GB RAM
  • Uptime at freeze: ~25 minutes after restart

Timeline

  1. 09:44:59 UTC - Node restarted (PID 593313)
  2. 10:10:06 UTC - Last consensus activity logged
  3. 10:10:07 UTC - Final log: "Thread stalled for 522ms"
  4. 10:10:07 - 16:57 UTC - Complete silence, no logs despite process running
  5. 16:57:41 UTC - Manual restart resolved the issue

Symptoms

  • Process alive at ~5% CPU and 34% memory; the low activity suggests threads were blocked, not spinning
  • systemctl status showed "active (running)"
  • No journal entries for 7 hours
  • Metrics endpoint responded but checkpoint was stuck
  • No OOM, no kernel errors, no core dump (not a crash)

Last logs before freeze

{"timestamp":"2026-02-04T10:10:06.761917Z","level":"WARN","fields":{"message":"Failed to send certified blocks: SendError { .. }"},"target":"consensus_core::transaction_certifier"}
{"timestamp":"2026-02-04T10:10:07.241509Z","level":"WARN","fields":{"message":"Thread stalled for 522ms"},"target":"mysten_metrics::thread_stall_monitor"}

The SendError warning appeared multiple times before the freeze but is likely unrelated: looking at the code, _block_receiver in commit_consumer.rs:39 is dropped immediately, so this warning is expected.
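
For context, this is the behavior that produces the warning: sending on a channel whose receiver has been dropped fails with a SendError. A minimal sketch, assuming a tokio-style mpsc channel (the names and exact Debug output are illustrative, not the actual consensus_core types):

  use tokio::sync::mpsc;

  fn main() {
      let (tx, rx) = mpsc::unbounded_channel::<&str>();
      drop(rx); // mirrors the immediately dropped _block_receiver

      // With no receiver alive, every send fails; the certifier logs this
      // as "Failed to send certified blocks: SendError { .. }".
      if let Err(e) = tx.send("certified block") {
          eprintln!("Failed to send certified blocks: {e:?}");
      }
  }

Either keeping the receiver alive or treating the warning as expected would remove the noise, but as noted above it does not explain the freeze.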

Observations

  1. Only 1 of 4 devnet validators was affected
  2. All validators had low peer counts (2-3 peers each)
  3. Node had been restarted ~25 minutes before the freeze, suggesting a possible startup-related race condition
  4. The freeze appears to be a deadlock (process alive but completely unresponsive) rather than a crash or resource exhaustion; an illustrative sketch of this kind of mutual blocking follows this list
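
Purely as an illustration of the kind of mutual blocking suspected (not code taken from Sui), two tasks can wedge each other when one awaits a bounded-channel send while holding a lock the other needs before it will drain the channel. A minimal sketch with hypothetical names:

  use std::sync::Arc;
  use tokio::sync::{mpsc, Mutex};

  #[tokio::main]
  async fn main() {
      let (tx, mut rx) = mpsc::channel::<u64>(1);
      let state = Arc::new(Mutex::new(0u64));

      let producer_state = state.clone();
      let producer = tokio::spawn(async move {
          let _guard = producer_state.lock().await;
          tx.send(1).await.unwrap(); // fills the channel (capacity 1)
          tx.send(2).await.unwrap(); // waits for the consumer while still holding the lock
      });

      let consumer_state = state.clone();
      let consumer = tokio::spawn(async move {
          let mut guard = consumer_state.lock().await; // needs the lock the producer holds
          while let Some(v) = rx.recv().await {
              *guard = v;
          }
      });

      // Neither task can make progress: the process stays alive with near-zero
      // CPU and stops emitting logs, matching the observed symptoms.
      let _ = tokio::join!(producer, consumer);
  }

Whether anything in the consensus path actually follows this shape is what item 1 under "Suggested investigation" would need to establish.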

What we couldn't capture

Since the node was restarted to restore service, we lost the opportunity to:

  • Capture a thread dump via gdb
  • Inspect tokio runtime state (see the sketch after this list for one way to make that possible in future incidents)
  • Identify which threads were blocked and on what
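
For future incidents, one option (an assumption, not something the node currently does) is to build with tokio's unstable cfg and register console-subscriber, so that the tokio-console CLI can attach to a wedged node and show which tasks are blocked and where:

  // Requires building with RUSTFLAGS="--cfg tokio_unstable" and adding the
  // console-subscriber crate; both are assumptions, not current node setup.
  fn main() {
      // Starts the instrumentation server that the `tokio-console` CLI connects to.
      console_subscriber::init();
      // ... normal node startup continues here ...
  }

In a binary that already installs its own tracing subscriber, the console layer would be composed into that subscriber rather than installed via init().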

Suggested investigation

  1. Review consensus code paths that could lead to mutual blocking (of the kind sketched under Observations above)
  2. Consider adding a watchdog that captures thread dumps when checkpoint progress stalls; a sketch follows this list
  3. Review the SendError handling in transaction_certifier.rs - while likely not the cause, the dropped-receiver pattern seems unintentional
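
A minimal sketch of item 2, with all names hypothetical (the real node would feed it the highest executed checkpoint from its existing metrics). It runs on a plain OS thread rather than a tokio task so it keeps working even if the async runtime itself is wedged, which is exactly the failure mode seen here:

  use std::sync::atomic::{AtomicU64, Ordering};
  use std::sync::Arc;
  use std::time::Duration;

  // `checkpoint` is a placeholder for whatever progress counter the node
  // already exports; `stall_timeout` is how long zero progress is tolerated.
  pub fn spawn_checkpoint_watchdog(
      checkpoint: Arc<AtomicU64>,
      stall_timeout: Duration,
  ) -> std::thread::JoinHandle<()> {
      std::thread::spawn(move || {
          let mut last_seen = checkpoint.load(Ordering::Relaxed);
          loop {
              std::thread::sleep(stall_timeout);
              let current = checkpoint.load(Ordering::Relaxed);
              if current == last_seen {
                  // No progress for a full interval: complain loudly while the
                  // process is still wedged. A real implementation would also
                  // trigger a thread dump here, e.g. by raising a signal that a
                  // handler turns into per-thread backtraces.
                  eprintln!("watchdog: checkpoint stuck at {current} for {stall_timeout:?}");
              }
              last_seen = current;
          }
      })
  }

Keeping the watchdog off the async runtime matters: a watchdog scheduled as a task on the same runtime that deadlocks may never get to fire.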
