ADR: HA failover #2598
base: main
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,133 @@ | ||
|
||
# ADR 023: Sequencer Recovery & Liveness — Rafted Conductor vs 1‑Active/1‑Failover | ||
|
||
## Changelog | ||
|
||
- 2025-08-21: Initial ADR authored; compared approaches and captured failover and escape‑hatch semantics. | ||
Review comment: The term "escape-hatch" is mentioned here in the changelog but isn't defined or used elsewhere in the ADR. To improve clarity, consider replacing it with a term that is described in the document, such as "break-glass overrides" (mentioned in the Security Considerations section), or adding a definition for what an "escape-hatch" entails in this context. |
||
|
||
## Context | ||
|
||
We need a robust, deterministic way to keep L2 block production live when the primary sequencer becomes unhealthy or unreachable, and to **recover leadership** without split‑brain or unsafe reorgs. The solution must integrate cleanly with `ev-node`, be observable, and support zero‑downtime upgrades. This ADR evaluates two designs for the **control plane** that governs which node is allowed to run the sequencer process. | ||
|
||
## Alternative Approaches | ||
|
||
Considered but not chosen for this iteration: | ||
|
||
- **Many replicas, no coordination**: high risk of **simultaneous leaders** (split‑brain) and soft‑confirmation reversals. | ||
- **Full BFT consensus among sequencers**: heavier operational/engineering cost than needed; our fault model is crash‑fault tolerance with honest operators. | ||
- **Outsource ordering to a shared sequencer network**: viable but introduces an external dependency and different SLOs; out of scope for the immediate milestone. | ||
- **Manual failover only**: too slow and error‑prone for production SLOs. | ||
|
||
## Decision | ||
|
||
> We will operate **1 active + 1 failover** sequencer at all times, regardless of control plane. Two implementation options are approved: | ||
|
||
- **Design A — Rafted Conductor (CFT)**: A sidecar *conductor* runs next to each `ev-node`. Conductors form a **Raft** cluster to elect a single leader and **gate** sequencing so only the leader may produce blocks. For quorum while preserving 1‑active/1‑failover semantics, we will run **2 sequencer nodes + 1 conductor‑only witness** (no sequencer) as the third Raft voter. | ||
|
||
*Note:* OP Stack uses a very similar pattern for its sequencer; see `op-conductor` in References. | ||
|
||
- **Design B — 1‑Active / 1‑Failover (Lease/Lock)**: One hot standby promotes itself when the active fails by acquiring a **lease/lock** (e.g., Kubernetes Lease or external KV). Strong **fencing** ensures the old leader cannot keep producing after lease loss. | ||
|
||
**Why both assume 1A/1F:** Even with Raft, we intentionally keep only **one** hot standby capable of immediate promotion; additional nodes may exist as **read‑only** or **witness** roles to strengthen quorum without enabling extra leaders. | ||
|
||
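The witness requirement in Design A follows from Raft's majority rule. A minimal sketch of that arithmetic (the function names are illustrative, not part of `ev-node` or any Raft library):

```go
package main

// quorum returns how many votes a Raft cluster with the given number of
// voters needs to elect a leader or commit an entry (a strict majority).
func quorum(voters int) int {
	return voters/2 + 1
}

// tolerated returns how many voters may fail while the cluster can still
// reach quorum.
func tolerated(voters int) int {
	return voters - quorum(voters)
}
```

With only the two sequencer nodes as voters, `tolerated(2)` is 0, so losing either node would block leader election. Adding the conductor-only witness gives `tolerated(3) == 1`, which is the single-failure tolerance that 1-active/1-failover needs.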
Status of this decision: **Proposed** for implementation and test hardening. | ||
|
||
## Detailed Design | ||
|
||
### User requirements | ||
- **No split‑brain**: at most one sequencer is active. | ||
- **Deterministic recovery**: new leader starts from a known **unsafe head**. | ||
- **Fast failover**: p50 ≤ 15s, p95 ≤ 45s. | ||
- **Operational clarity**: health metrics, leader identity, and explicit admin controls. | ||
- **Zero‑downtime upgrades**: blue/green leadership transfer. | ||
|
||
### Systems affected | ||
- `ev-node` (sequencer control hooks, health surface). | ||
- New sidecar(s): **conductor** (Design A) or **lease‑manager** (Design B). | ||
- RPC ingress (optional **leader‑aware proxy** to route sequencing endpoints only to the leader). | ||
- CI/CD & SRE runbooks, dashboards, alerts. | ||
|
||
### New/changed data structures | ||
- **UnsafeHead** record persisted by control plane: `(l2_number, l2_hash, l1_origin, timestamp)`. | ||
- **Design A (Raft)**: replicated **Raft log** entries for `UnsafeHead`, `LeadershipTerm`, and optional `CommitMeta` (batch/DA pointers); periodic snapshots. | ||
- **Design B (Lease)**: a single **Lease** record (Kubernetes Lease or external KV entry) plus a monotonic **lease token** for fencing. | ||
|
||
### New/changed APIs | ||
Introduce an **Admin RPC** (gRPC/HTTP) on `ev-node` (or a thin shim) used by either control plane: | ||
|
||
- `StartSequencer(from_unsafe_head: bool)` — start sequencing, optionally pinning to the last persisted UnsafeHead. | ||
- `StopSequencer()` — hard stop; no more block production. | ||
- `SequencerHealthy()` → `{ healthy, l2_number, l2_hash, l1_origin, peer_count, da_height, last_err }` | ||
- `Status()` → `{ sequencer_active, build_height, leader_hint?, last_err }` | ||
|
||
These are additive and should not break existing RPCs. | ||
|
||
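One possible Go rendering of the Admin RPC surface is sketched below. The method names and response fields come from the list above; the Go types, the `SequencerAdmin` interface, and the `fakeAdmin` test double (including its choice to reject a second start) are assumptions for illustration, not the `ev-node` API.

```go
package main

import "errors"

// Health carries the fields listed for SequencerHealthy().
type Health struct {
	Healthy   bool
	L2Number  uint64
	L2Hash    string
	L1Origin  uint64
	PeerCount int
	DAHeight  uint64
	LastErr   string
}

// Status carries the fields listed for Status().
type Status struct {
	SequencerActive bool
	BuildHeight     uint64
	LeaderHint      string // optional; empty when unknown
	LastErr         string
}

// SequencerAdmin is a hypothetical interface over the four Admin RPCs.
type SequencerAdmin interface {
	StartSequencer(fromUnsafeHead bool) error
	StopSequencer() error
	SequencerHealthy() (Health, error)
	Status() (Status, error)
}

var errAlreadyActive = errors.New("sequencer already active")

// fakeAdmin is an in-memory double that a control plane can be tested against.
type fakeAdmin struct {
	active bool
	height uint64
}

func (f *fakeAdmin) StartSequencer(fromUnsafeHead bool) error {
	if f.active {
		return errAlreadyActive // assumption: a double start is an error
	}
	f.active = true
	return nil
}

func (f *fakeAdmin) StopSequencer() error {
	f.active = false
	return nil
}

func (f *fakeAdmin) SequencerHealthy() (Health, error) {
	return Health{Healthy: true, L2Number: f.height}, nil
}

func (f *fakeAdmin) Status() (Status, error) {
	return Status{SequencerActive: f.active, BuildHeight: f.height}, nil
}

// Compile-time check that the double satisfies the interface.
var _ SequencerAdmin = (*fakeAdmin)(nil)
```

Keeping the surface behind an interface lets the conductor (Design A) and the lease-manager (Design B) share the same start/stop and health logic.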
### Efficiency considerations | ||
- **Design A:** Raft heartbeats and snapshotting add small steady‑state overhead; no impact on throughput when healthy. | ||
- **Design B:** Lease renewals are lightweight; performance dominated by `ev-node` itself. | ||
|
||
### Expected access patterns | ||
- Reads (RPC, state) should work on all nodes; **writes/sequence endpoints** only on the active leader. If a leader‑aware proxy is deployed, it enforces this automatically. | ||
|
||
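The routing rule a leader-aware proxy would apply can be sketched as a single pure function. `pickBackend` and its parameters are hypothetical. How a request is classified as a sequencing call, and how the proxy learns the leader, are left to the caller.

```go
package main

// pickBackend chooses the node that should serve a request.
// Sequencing (write) requests go only to the current leader; reads are spread
// round-robin across all nodes using the caller-supplied counter.
// The boolean result is false when no suitable backend exists.
func pickBackend(sequencing bool, leader string, nodes []string, counter uint64) (string, bool) {
	if sequencing {
		// With no known leader, reject instead of guessing: forwarding a
		// write to a non-leader would bypass the leadership gate.
		return leader, leader != ""
	}
	if len(nodes) == 0 {
		return "", false
	}
	return nodes[counter%uint64(len(nodes))], true
}
```

During a failover window the leader is briefly unknown, so sequencing requests fail fast while reads continue to be served by any node.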
### Logging/Monitoring/Observability | ||
- Metrics: `leader_id`, `raft_term` (A), `lease_owner` (B), `unsafe_head_advance`, `peer_count`, `rpc_error_rate`, `da_publish_latency`, `backlog`. | ||
|
||
- Alerts: unsafe head has not advanced for more than 3× block time; unexpected leader churn; lease lost but sequencer still active (fencing breach); witness down (A). | ||
- Logs: audit all **Start/Stop** decisions and override operations. | ||
|
||
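Two of the alert conditions above reduce to simple predicates. The sketch below is an assumption about how they might be evaluated inside a sidecar; in practice they would more likely be expressed as rules in the monitoring system, and the function names are illustrative.

```go
package main

import "time"

// unsafeHeadStalled reports whether the unsafe head has gone without
// advancing for more than three block times.
func unsafeHeadStalled(sinceLastAdvance, blockTime time.Duration) bool {
	return sinceLastAdvance > 3*blockTime
}

// fencingBreach reports the dangerous state in which a node no longer holds
// the lease (or Raft leadership) but its sequencer is still producing blocks.
func fencingBreach(holdsLeadership, sequencerActive bool) bool {
	return !holdsLeadership && sequencerActive
}
```

For a 2 s block time, a gap of 7 s since the last advance trips the stall alert, while a gap of exactly 6 s does not.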
### Security considerations | ||
|
||
- Lock down **Admin RPC** with mTLS + RBAC; only the sidecar/process account may call Start/Stop. | ||
- Implement **fencing**: leader periodically validates it still holds leadership/lease; otherwise self‑stops. | ||
- Break‑glass overrides must be gated behind separate credentials and produce auditable events. | ||
|
||
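The self-stop rule in the fencing bullet can be sketched as one check that the leader runs periodically. `fenceOnce` and its parameters are hypothetical; `stop` stands in for a call to `StopSequencer()`, and the current holder and token would come from the Raft state or the lease record.

```go
package main

// fenceOnce compares the leadership this node believes it holds with the
// current record. If either the holder or the fencing token differs, the node
// has lost leadership and must stop sequencing immediately.
// It returns whether a stop was triggered, plus any error from stop.
func fenceOnce(self string, myToken uint64, holder string, token uint64, stop func() error) (bool, error) {
	if holder == self && token == myToken {
		return false, nil // still the leader; keep sequencing
	}
	return true, stop()
}
```

A self-check like this only helps if it runs more often than the lease TTL, and it cannot catch a process that is paused between checks. That is why the fencing token should also be validated wherever blocks are accepted.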
### Privacy considerations | ||
- None beyond existing node telemetry; no user data added. | ||
|
||
### Testing plan | ||
- Kill active sequencer → verify failover within SLO; assert **no double leadership**. | ||
Review comment: For Design A, we should also kill the conductor on the active sequencer so that the other conductors experience a timeout from the conductor leader. |
||
- Partition tests: only Raft majority (A) or lease holder (B) may produce. | ||
- Blue/green: explicit leadership handoff; confirm unsafe head continuity. | ||
- Misconfigured standby → the control plane should **refuse** to promote it, and alerts should fire. | ||
- Long‑duration outage drills; confirm user‑facing status and catch‑up behavior. | ||
|
||
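To check the failover drills against the SLO in the user requirements (p50 ≤ 15s, p95 ≤ 45s), a test harness could collect failover durations and evaluate them as below. This is a minimal sketch using the nearest-rank percentile method, which is an assumption; the ADR does not say how percentiles are computed, and the function names are illustrative.

```go
package main

import (
	"math"
	"sort"
)

// percentile returns the nearest-rank percentile p (0 < p <= 100) of the
// samples, or 0 for an empty input. The input slice is not modified.
func percentile(samples []float64, p float64) float64 {
	if len(samples) == 0 {
		return 0
	}
	sorted := append([]float64(nil), samples...)
	sort.Float64s(sorted)
	rank := int(math.Ceil(p / 100 * float64(len(sorted))))
	if rank < 1 {
		rank = 1
	}
	if rank > len(sorted) {
		rank = len(sorted)
	}
	return sorted[rank-1]
}

// meetsFailoverSLO checks measured failover times (in seconds) against the
// targets p50 <= 15s and p95 <= 45s.
func meetsFailoverSLO(failoverSeconds []float64) bool {
	if len(failoverSeconds) == 0 {
		return false // no drills run, so nothing has been verified
	}
	return percentile(failoverSeconds, 50) <= 15 && percentile(failoverSeconds, 95) <= 45
}
```

A handful of drills gives only a coarse estimate of p95, so the game days in Phase 4 would need enough repetitions for the tail figure to be meaningful.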
### Change breakdown | ||
- Phase 1: Implement Admin RPC + health surface in `ev-node`; add sidecar skeletons. | ||
- Phase 2: Integrate Design A (Raft) in a 2 sequencer + 1 witness topology; build dashboards/runbooks. | ||
|
||
- Phase 3: Add Design B (Lease) profile for small/test clusters; share common health logic. | ||
- Phase 4: Game days and SLO validation; finalize SRE playbooks. | ||
|
||
### Release/compatibility | ||
- **Breaking release?** No — Admin RPCs are additive. | ||
- **Coordination with LazyLedger fork / lazyledger-app?** Not required; DA posting interfaces are unchanged. | ||
|
||
|
||
## Status | ||
|
||
Proposed | ||
|
||
## Consequences | ||
|
||
### Positive | ||
- Clear, deterministic leadership with fencing; supports zero‑downtime upgrades. | ||
- Works with `ev-node` via a small, well‑defined Admin RPC. | ||
- Choice of control plane allows right‑sizing ops: Raft for prod; Lease for small/test. | ||
|
||
### Negative | ||
- Design A adds Raft operational overhead (quorum management, snapshots, witness requirement). | ||
- Design B has a smaller blast radius but does not generalize to N replicas; stricter reliance on correct fencing. | ||
Review comment: Depending on the chosen implementation, the sequencer stack may still have a single point of failure (e.g., the KV store).
Reply: Is the failure that the other node is not up to date with the latest state?
Reply: I was thinking about the availability of the external KV store. For testing, a local file is fine, but for production (devnet, testnet, mainnet) a chain in HA mode with an external KV store can be just as fault-vulnerable as one running in standard mode. Assuming the external KV store is exposed via TCP, high availability must cover:
If the operator fails to provide proper HA for any of these components, the sequencer stack still has a single point of failure and is not truly HA, even if the ev-node is running in HA mode.
Reply: The KV store will always be local to the node; we don't support adding remote KV stores (DBs). |
||
- Additional components (sidecars, proxies) increase deployment surface. | ||
|
||
Review comment: It would be helpful to have some sequence diagrams that show how the handover works: the happy path first, then the unhappy paths for the edge cases. |
||
### Neutral | ||
- Small steady‑state CPU/network overhead for heartbeats/leases; negligible compared to sequencing and DA posting. | ||
|
||
## References | ||
|
||
- **OP conductor** (industry prior art; similar to Design A): | ||
- Docs: https://docs.optimism.io/operators/chain-operators/tools/op-conductor | ||
- README: https://github.com/ethereum-optimism/optimism/blob/develop/op-conductor/README.md | ||
|
||
- **`ev-node`** (architecture, sequencing): | ||
- Repo: https://github.com/evstack/ev-node | ||
- Quick start: https://ev.xyz/guides/quick-start | ||
- Discussions/issues on sequencing API & multi-sequencer behavior. | ||
|
||
- **Lease-based leader election**: | ||
- Kubernetes Lease API: https://kubernetes.io/docs/concepts/architecture/leases/ | ||
- client-go leader election helpers: https://pkg.go.dev/k8s.io/client-go/tools/leaderelection |
Review comment: The current title "Rafted Conductor vs 1‑Active/1‑Failover" could be slightly confusing because the document explains that both proposed designs (Rafted Conductor and Lease/Lock) implement a "1-Active/1-Failover" strategy. To improve clarity, consider retitling to focus on the two mechanisms being compared, for example: `Sequencer Recovery & Liveness: Rafted Conductor vs. Lease/Lock`.