proposal for ha failover design #723
Conversation
CodeRabbit walkthrough: adds a new proposal document at docs/proposals/ha-failover.md (documentation-only change, no runtime or control-flow code). Estimated code review effort: 2 (Simple), ~10 minutes. Pre-merge checks: 3 passed.
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In `@docs/proposals/ha-failover.md`:
- Around line 190-258: Add a security section describing how principals
authenticate and authorize each other for the Replication service (service
Replication, rpc Subscribe, rpc GetSnapshot, messages ReplicatedEvent and
ReplicationSnapshot) by requiring mutual TLS for all replication connections,
documenting certificate issuance, rotation/expiry handling and trust roots, and
specifying an authorization model (e.g., agent identity in cert CN/SAN mapped to
allowed replicas/agents and per-peer ACLs or RBAC) plus recommendations for
logging and auditing of failed auth attempts; include guidance on secret
storage, automated rotation, and fallback/renewal behavior to ensure continuous
authenticated replication (a minimal mTLS sketch is shown after this list).
- Around line 666-667: The "Replication Lag Monitoring" section is empty; please
add guidance that defines the key metrics to expose (replication lag time,
pending event count, last replicated sequence/LSN), recommends alerting
thresholds (e.g., warning at X seconds/minutes and critical at Y, or pending
events > N), and provides troubleshooting steps (check network/CPU/IO, review
producer/consumer backpressure, inspect WAL/streaming errors, and steps to
resync or promote replicas). Reference the section header "Replication Lag
Monitoring" so the new content sits directly beneath it and use concrete
examples/thresholds and commands/log locations relevant to your HA
implementation.
- Line 3: Fix the typo in the AI disclosure line: locate the string "_ai
disclosure: diagrams and tables were formatted/generatd with ai_" and change
"generatd" to "generated" so the line reads "_ai disclosure: diagrams and tables
were formatted/generated with ai_" (or keep the original spacing/punctuation
style) to correct the misspelling.
- Around line 703-706: Replace the four broken relative links ([argocd-agent
Architecture], [Event System], [Queue Implementation], [Resync Protocol]) in the
proposal with TODO placeholders noting the implementation/link will be added
later; specifically remove or replace the paths "../README.md",
"../internal/event/event.go", "../internal/queue/queue.go",
"../internal/resync/resync.go" with explanatory TODO text (e.g., "TODO: link to
implementation or API docs when available") so the proposal no longer references
non-documentation files and the pipeline validation passes.
🧹 Nitpick comments (1)
docs/proposals/ha-failover.md (1)
19-67: Optional: Add a language specifier to code blocks. Consider adding `text` or `ascii` as the language specifier for ASCII diagram blocks to satisfy markdown linters. This applies to several diagram blocks throughout the document (lines 19-67, 81-123, 268-314, 344-370, 391-419).
📋 Example fix: add the `text` tag to the opening fence of the Global DNS / GSLB diagram block.
docs/proposals/ha-failover.md (outdated excerpt):

- [argocd-agent Architecture](../README.md)
- [Event System](../internal/event/event.go)
- [Queue Implementation](../internal/queue/queue.go)
- [Resync Protocol](../internal/resync/resync.go)
Fix broken documentation references.
The pipeline validation fails because these relative paths point to code files, not documentation. Since this is a proposal document, these references should either:
- Use repository URLs to link to code (if already merged)
- Reference API documentation (if generated docs exist)
- Be removed or marked as TODO (if the code doesn't exist yet)
For a design proposal, option 3 is most appropriate. These can be added when the implementation is complete.
🔧 Proposed fix (mark as TODO for future implementation)
## References
-- [argocd-agent Architecture](../README.md)
-- [Event System](../internal/event/event.go)
-- [Queue Implementation](../internal/queue/queue.go)
-- [Resync Protocol](../internal/resync/resync.go)
+- argocd-agent Architecture (see main README.md)
+- Event System implementation (TODO: link when implemented)
+- Queue Implementation (TODO: link when implemented)
+- Resync Protocol (TODO: link when implemented)
docs/proposals/ha-failover.md (outdated excerpt):

1. **Replica knows it's a replica** - Won't accept agents while replication stream is healthy
2. **Failover requires replication stream break** - Only then does replica become active
For the split-brain problem, I can imagine the following scenario between Principal 1 (PR1) and Principal 2 (PR2):
- PR1 is primary, PR2 is replica
- PR1 isn't able to communicate with PR2 due to a network failure/issue/hiccup
- Replication stream breaks
- PR2, not receiving any replication data, assumes the primary role
- PR2, still able to send to (but not receive from) PR1, sends replication data to PR1
- PR1 still assumes it is primary - split brain

I do not currently have any idea how to solve this without a quorum (i.e. a third principal or external observer), but that would inherently increase complexity.
Yeah, this is something I have been trying to grapple with. I think we can prevent split-brain (at the expense of availability) using self-fencing & requiring peer ACKs on replication events.
The primary must continuously prove the replica is listening; if there are no ACKs, the principal self-fences.
PR1 (primary):
- Sends heartbeats every 5s on replication stream
- Expects ACKs from PR2
- No ACK for 15s → FENCED state
PR2 (replica):
- No heartbeat/events for 30s → promotes to STANDBY
- Trusts that PR1 will self-fence
I also think there could be a pluggable interface to allow vendor solutions like Route53 ARC to act as an external witness and help with fencing and coordination; a rough interface sketch is below.
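A rough sketch of what such a pluggable witness interface might look like in Go. The method set and names are illustrative only, not a settled API.

```go
package ha

import "context"

// TrafficAuthority abstracts an external witness/coordinator such as
// Route53 Application Recovery Controller. Instead of principals deciding
// promotion on their own, they consult the authority.
type TrafficAuthority interface {
	// ActivePrincipal returns the identifier of the principal that the
	// external authority currently designates as active.
	ActivePrincipal(ctx context.Context) (string, error)

	// ReportState lets a principal publish its own state (ACTIVE, FENCED,
	// REPLICATING, ...) so the authority and operators can observe it.
	ReportState(ctx context.Context, principalID, state string) error
}
```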
A new state, FENCED, would allow for something like:
ACTIVE ──(no ACKs for ackTimeout)──► FENCED
│
├─ /healthz: UNHEALTHY
├─ reject new agent connections
├─ terminate existing agent sessions
│
▼
(peer reconnects as primary)
│
▼
SYNCING
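A minimal Go sketch of the primary-side watchdog described above (heartbeat every 5s, fence after 15s without an ACK). The callback hooks are assumptions, not existing argocd-agent APIs.

```go
package ha

import (
	"context"
	"sync"
	"time"
)

// FencingWatchdog fences the primary when the replica stops acknowledging
// replication heartbeats/events. Intervals mirror the numbers above.
type FencingWatchdog struct {
	HeartbeatInterval time.Duration // e.g. 5 * time.Second
	AckTimeout        time.Duration // e.g. 15 * time.Second

	SendHeartbeat func(ctx context.Context) error // push a heartbeat on the replication stream
	Fence         func()                          // ACTIVE → FENCED: /healthz unhealthy, drop agent sessions

	mu      sync.Mutex
	lastAck time.Time
}

// OnAck is called by the replication stream handler whenever the replica ACKs.
func (w *FencingWatchdog) OnAck(t time.Time) {
	w.mu.Lock()
	w.lastAck = t
	w.mu.Unlock()
}

func (w *FencingWatchdog) sinceLastAck() time.Duration {
	w.mu.Lock()
	defer w.mu.Unlock()
	return time.Since(w.lastAck)
}

// Run drives the heartbeat loop and transitions to FENCED on ACK timeout.
func (w *FencingWatchdog) Run(ctx context.Context) {
	w.OnAck(time.Now()) // start the clock on entering ACTIVE
	ticker := time.NewTicker(w.HeartbeatInterval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			_ = w.SendHeartbeat(ctx) // stream errors surface as missing ACKs
			if w.sinceLastAck() > w.AckTimeout {
				w.Fence()
				return // leaving FENCED goes through the recovery path, not this loop
			}
		}
	}
}
```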
Failure Scenarios
Scenario 1: PR2 Crashes (Clean Failure)
Time PR1 (primary) PR2 (replica) Agents
─────────────────────────────────────────────────────────────────
T+0 Serving agents Dies Connected to PR1
T+5s Heartbeat fails - Still on PR1
T+10s No ACKs... - Still on PR1
T+15s FENCED - Disconnected
/healthz UNHEALTHY
outage - no healthy principal
I believe this is unavoidable with two nodes and no external witness: PR1 can't distinguish “PR2 is dead” from “PR2 is partitioned”. To prevent split-brain, PR1 must always fence on lost ACKs, even if that causes a brief outage. An external witness / something like ARC can help in this failure mode.
Scenario 2: Asymmetric Partition (PR1 → PR2 Broken)
Time PR1 (primary) PR2 (replica) DNS/GSLB
─────────────────────────────────────────────────────────────────
T+0 Can't reach PR2 Replication breaks PR1 healthy
Heartbeats fail Starts 30s timer
T+15s No ACKs → FENCED Waiting... PR1 UNHEALTHY
Terminates agents PR2 healthy
T+30s - Promotes → STANDBY Routes to PR2
T+45s - Agent connects
- → ACTIVE Stable on PR2
Result: Clean failover, no split-brain. PR1 fenced before PR2 promoted.
Scenario 3: Full Partition (Neither Can Reach Other)
Time PR1 (primary) PR2 (replica) DNS/GSLB
─────────────────────────────────────────────────────────────────
T+0 Can't reach PR2 Can't receive Depends on
No ACKs from PR1 GSLB's view
T+15s Probes PR2: unreachable Waiting...
Stays ACTIVE
T+30s Still ACTIVE Probes PR1:
unreachable
Promotes → STANDBY
Both think peer is dead - potential dual-active
Mitigation relies on DNS/GSLB. Even if both are ACTIVE-capable:
- GSLB health checks determine who gets traffic
- Term/epoch rejects stale updates
- When partition heals, one demotes based on term comparison
Scenario 4: Primary Network Blip (Transient)
Time PR1 (primary) PR2 (replica)
─────────────────────────────────────────────────────────
T+0 Network hiccup Replication pauses
T+5s Still no ACKs Waiting...
T+10s Network recovers Receives heartbeat
Receives ACK Sends ACK
Stays ACTIVE Stays REPLICATING
Result: No state change. ackTimeout (15s) > typical network blip.
Proposes a self-fencing mechanism: the primary must receive ACKs from the replica, or fence itself to prevent dual-active principals during partitions.
- Add FENCED state to HA state machine with recovery logic
- Add term/epoch for rejecting stale updates from old leaders
- Add bidirectional heartbeat on replication stream
- Add failure scenario analysis (5 scenarios documented)
- Add agent behavior section: term in handshake, reconnect on stale
- Add optional TrafficAuthority interface for external witnesses (ARC)
- Update proto with heartbeat message and term fields
- Update configuration with new timeouts (ackTimeout, recoveryTimeout)
Actionable comments posted: 7
🤖 Fix all issues with AI agents
In `@docs/proposals/ha-failover.md`:
- Around line 629-650: The document currently treats GSLB PRIMARY/SECONDARY
routing as optional but the review shows it's a hard correctness requirement;
update the Design Decisions table by adding a new row titled "GSLB tie-breaking"
with the decision "MUST use PRIMARY/SECONDARY or equivalent" and rationale
"Prevents dual-active during full partition", update the Requirements section to
include a MUST requirement for deterministic GSLB tie-breaking (use
PRIMARY/SECONDARY, weighted 100:0, or strict geoproximity) using MUST language,
and add a prominent warning in the GSLB setup section stating that deterministic
tie-breaking is critical (no round-robin/least-connections) and listing
acceptable options so operators cannot deploy without a tie-breaker.
- Around line 461-470: Clarify the term increment semantics: state that
automatic promotion (failover) increments term when a STANDBY node transitions
to ACTIVE upon the first agent connect, whereas manual failback increments term
when the operator issues the command (referenced as the "argocd-agent ha
failback" / "become primary" step); update the STANDBY row in the table (the
"STANDBY: Accept → Transition to ACTIVE (term++)" text) to explicitly state this
distinction or modify the manual failback sequence example to show that the
operator command performs the term++ and then transitions state to ACTIVE so
both descriptions are consistent.
- Around line 326-335: Update the Subscribe RPC to accept a client-side framed
message that includes an explicit initial handshake before ACKs: change the rpc
signature to stream ReplicationClientMessage so the replica can send a
ReplicationHandshake (with term, last_ack_sequence, state enum e.g.
SYNCING/REPLICATING) as the first message followed by ReplicationAck messages;
add the ReplicationClientMessage wrapper (oneof { ReplicationHandshake,
ReplicationAck }) and ensure server-side Subscribe handler validates the
handshake fields (term and last_ack_sequence) and uses them to choose resume
point or reject/migrate the replica before processing subsequent ACKs.
- Around line 539-563: The unfencing logic in shouldUnfence currently returns
true based only on gslbResolvesToMe() || gslbHasNoHealthyTarget(), which risks
split-brain and references an undefined gslbHasNoHealthyTarget; update
shouldUnfence to (1) implement or remove gslbHasNoHealthyTarget() and (2) add an
explicit peer liveness/state probe (e.g., implement a new method like
probePeerIsActive() or reuse existing peer-check utilities) and require that
probePeerIsActive() == false before allowing unfence; ensure you call the probe
from shouldUnfence (alongside gslbResolvesToMe()) so the fenced node only
unfences if DNS conditions are met AND the peer is confirmed not ACTIVE.
- Around line 798-834: Add preflight checks and make the "ha failback" command
atomic by validating prerequisites (cluster connectivity, replication synced,
quorum, and agent connectivity) before executing the two-step signal sequence
shown in the failback example; implement a two-phase coordination where both
regions agree to the state change (prepare/commit), include configurable
timeouts, explicit error returns on any network/partition failures, and a
rollback path that reverts any partial changes if the commit to Region A
(promote) fails after demoting Region B, and ensure the command waits for and
verifies ACKs from Region B (replica) to Region A (new primary) before reporting
success.
- Around line 680-699: Make term persistence mandatory (resolve Open Question
`#4`) and update the handshake: when receiving AgentHello include a strict
validation in the principal's handshake logic to compare agent_term vs
principal_term and return FAILED_PRECONDITION with a clear message like "agent
term too new, principal restarted" if agent_term > principal_term; ensure the
principal refuses the connection in that case. On the agent side, update the
gRPC handshake retry handling so that on this specific "term too new" error the
agent clears its cached term and reconnects to fetch the new principal term
(same reconnection/backoff flow applies). Also document that term MUST be
persisted by the principal so it cannot regress after restart.
- Around line 82-92: The self-fencing section is ambiguous about which endpoint
probePeerHealth() checks; update the text to explicitly state that
probePeerHealth() probes the peer's HTTP health endpoint (port 8080, /healthz)
rather than the replication port (8404). Change the three decision bullets (No
ACK for ackTimeout, peer unreachable, peer reachable but not ACKing) to
reference probing :8080/healthz so readers implement the probe correctly and
avoid confusing replication-port failures with a dead peer (a probe/unfence sketch follows this list).
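A minimal sketch of the peer probe and unfence gating described in the last two items above. The endpoint (:8080/healthz) and the probePeerIsActive name follow the comment's wording; the rest is illustrative.

```go
package ha

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

// probePeerIsActive checks the peer's HTTP health endpoint (:8080/healthz),
// not the replication port (:8404). A 200 means the peer considers itself
// ACTIVE, because /healthz only returns 200 in the ACTIVE state.
func probePeerIsActive(ctx context.Context, peerHost string) (bool, error) {
	url := fmt.Sprintf("http://%s:8080/healthz", peerHost)
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return false, err
	}
	resp, err := (&http.Client{Timeout: 2 * time.Second}).Do(req)
	if err != nil {
		return false, err // unreachable: caller decides how to treat this
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK, nil
}

// shouldUnfence gates leaving FENCED: DNS must resolve to this principal
// AND the peer must be confirmed not ACTIVE. If the peer cannot be probed,
// stay fenced rather than risk dual-active.
func shouldUnfence(ctx context.Context, gslbResolvesToMe func(context.Context) bool, peerHost string) bool {
	if !gslbResolvesToMe(ctx) {
		return false
	}
	active, err := probePeerIsActive(ctx, peerHost)
	if err != nil {
		return false
	}
	return !active
}
```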
🧹 Nitpick comments (1)
docs/proposals/ha-failover.md (1)
198-208: Document operational considerations for replication bandwidth and backpressure. The replication design sends ALL events across ALL agents (line 151) with full CloudEvents payloads. Missing considerations:
Bandwidth & Scalability:
- How much bandwidth is required for N agents? (estimates/benchmarks)
- Is cross-region bandwidth cost acceptable?
- Are there compression or batching strategies?
Backpressure:
- What happens if the replica falls behind (network slowdown, resource constraints)?
- Does the primary buffer events? For how long?
- Is there a circuit breaker to fence the primary if replication lag exceeds a threshold?
Memory:
- In-memory buffering of pending events could grow unbounded
- Need limits and overflow handling
These operational concerns directly impact the "1-5 minute RTO" design goal and could cause split-brain if not handled correctly (e.g., primary continues serving while replica has stale state).
Consider documenting:
- Expected replication bandwidth per agent
- Maximum tolerable replication lag before fencing
- Backpressure strategy (block writes? shed load? fence?)
- Memory limits for replication buffers (a bounded-buffer sketch is shown below)
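As a starting point for the backpressure and memory discussion, one possible shape for a bounded pending-events buffer is sketched below. The limits, names, and the fence-on-overflow policy are assumptions for illustration, not a recommendation from the proposal.

```go
package ha

import (
	"errors"
	"time"
)

// ErrReplicationLag signals that the replica is too far behind and the
// primary should fence (or otherwise shed load) instead of buffering forever.
var ErrReplicationLag = errors.New("replication lag exceeds threshold")

// ReplicationBuffer holds events pending ACK from the replica, with hard
// limits on count and age so memory cannot grow unbounded.
// A real implementation would also need synchronization (mutex/atomics).
type ReplicationBuffer struct {
	MaxPending int           // e.g. 10_000 events
	MaxLag     time.Duration // e.g. 30 * time.Second

	pending []pendingEvent
}

type pendingEvent struct {
	seq      uint64
	enqueued time.Time
	payload  []byte // serialized CloudEvent
}

// Enqueue buffers an event for replication, rejecting when limits are hit.
func (b *ReplicationBuffer) Enqueue(seq uint64, payload []byte) error {
	if len(b.pending) >= b.MaxPending {
		return ErrReplicationLag
	}
	if len(b.pending) > 0 && time.Since(b.pending[0].enqueued) > b.MaxLag {
		return ErrReplicationLag
	}
	b.pending = append(b.pending, pendingEvent{seq: seq, enqueued: time.Now(), payload: payload})
	return nil
}

// Ack drops everything up to and including seq once the replica confirms it.
func (b *ReplicationBuffer) Ack(seq uint64) {
	i := 0
	for i < len(b.pending) && b.pending[i].seq <= seq {
		i++
	}
	b.pending = b.pending[i:]
}
```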
docs/proposals/ha-failover.md (outdated excerpt):

    service Replication {
      // Subscribe to replication stream (replica calls this on primary)
      // Bidirectional: primary sends events/heartbeats, replica sends ACKs
      rpc Subscribe(stream ReplicationAck) returns (stream ReplicationStreamMessage);

      // Get initial snapshot (replica calls on connect)
      rpc GetSnapshot(SnapshotRequest) returns (ReplicationSnapshot);

      // Get replication status
      rpc Status(StatusRequest) returns (ReplicationStatus);
🛠️ Refactor suggestion | 🟠 Major
Add initial handshake message to Subscribe RPC for replica state synchronization.
The Subscribe RPC is bidirectional but lacks an initial handshake from the replica. Currently:
rpc Subscribe(stream ReplicationAck) returns (stream ReplicationStreamMessage);

Problem: The primary doesn't know the replica's:
- Current term (to detect term mismatch immediately)
- Last acknowledged sequence number (to resume from correct point)
- State (SYNCING, REPLICATING, etc.)
Risk: After network reconnection, the primary may send events the replica already has or skip events it missed, causing state divergence.
Recommended proto enhancement
+message ReplicationHandshake {
+ uint64 replica_term = 1;
+ uint64 last_acked_sequence_num = 2;
+ string replica_state = 3;
+}
+
+message ReplicaMessage {
+ oneof payload {
+ ReplicationHandshake handshake = 1;
+ ReplicationAck ack = 2;
+ }
+}
+
-rpc Subscribe(stream ReplicationAck) returns (stream ReplicationStreamMessage);
+rpc Subscribe(stream ReplicaMessage) returns (stream ReplicationStreamMessage);

The replica sends the handshake as the first message, then sends ACKs. The primary can validate the term and resume from the correct sequence number.
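To make the handshake-first flow concrete, here is a rough server-side sketch in Go. The message and stream types are hand-written stand-ins for whatever protoc would generate from the suggested proto, so treat all names as illustrative.

```go
package ha

import (
	"errors"
	"fmt"
)

// Stand-ins for generated types from the suggested proto (handshake/ack oneof).
type ReplicationHandshake struct {
	ReplicaTerm          uint64
	LastAckedSequenceNum uint64
	ReplicaState         string // e.g. SYNCING, REPLICATING
}

type ReplicationAck struct{ SequenceNum uint64 }

type ReplicaMessage struct {
	Handshake *ReplicationHandshake // set on the first message only
	Ack       *ReplicationAck
}

// replicaStream abstracts the receive side of the bidirectional stream.
type replicaStream interface {
	Recv() (*ReplicaMessage, error)
}

// acceptReplica validates the mandatory first handshake message and returns
// the sequence number to resume replication from.
func acceptReplica(stream replicaStream, currentTerm uint64) (resumeFrom uint64, err error) {
	first, err := stream.Recv()
	if err != nil {
		return 0, err
	}
	hs := first.Handshake
	if hs == nil {
		return 0, errors.New("protocol error: first message must be a handshake")
	}
	if hs.ReplicaTerm > currentTerm {
		return 0, fmt.Errorf("replica term %d is newer than ours (%d): refusing", hs.ReplicaTerm, currentTerm)
	}
	// Resume streaming from the event after the last one the replica ACKed.
	return hs.LastAckedSequenceNum + 1, nil
}
```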
docs/proposals/ha-failover.md (outdated excerpt):

```
Time    PR1 (primary)            PR2 (replica)        GSLB
─────────────────────────────────────────────────────────────────
T+0     Can't reach PR2          Can't reach PR1      PR1 HEALTHY
T+15s   No ACKs, probes fail
        → Stays ACTIVE           -                    PR1 HEALTHY
        (thinks PR2 dead)
T+30s   -                        failoverTimeout
                                 Probes PR1: fail
                                 → STANDBY            PR2 HEALTHY

Both HEALTHY! GSLB arbitrates.

Route53 PRIMARY/SECONDARY:
  → Always returns PR1 (PRIMARY wins)
  → Agents go to PR1
  → PR2 sits idle in STANDBY

When partition heals:
  → PR2 sees PR1 is ACTIVE
  → PR2 reconnects as replica
```
🛠️ Refactor suggestion | 🟠 Major
Elevate GSLB PRIMARY/SECONDARY routing to a hard requirement.
Scenario 4 demonstrates that during full partition, both principals become HEALTHY (line 640). The resolution relies on:
Route53 PRIMARY/SECONDARY:
→ Always returns PR1 (PRIMARY wins)
This GSLB behavior is critical for correctness, not just a deployment option. Without PRIMARY/SECONDARY routing (or equivalent tie-breaking), agents could be load-balanced across both principals during a partition, causing split-brain at the application level.
Move this GSLB configuration requirement to:
- Design Decisions table (lines 11-17) as a new row
- Requirements section (lines 910-917) with MUST language
Example addition to requirements:
| GSLB tie-breaking | MUST use PRIMARY/SECONDARY or equivalent | Prevents dual-active during full partition |

In the GSLB setup section (lines 906-974), add a warning:
⚠️ **Critical**: GSLB MUST implement deterministic tie-breaking (PRIMARY/SECONDARY, weighted routing with weights 100:0, or geoproximity with strict priority). Simple round-robin or least-connections will cause split-brain during full network partition.
🪛 markdownlint-cli2 (0.18.1)
629-629: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
```
Agent                                Principal
─────────────────────────────────────────────
1. Connect to GSLB endpoint
   (DNS resolves to healthy principal)

2. gRPC handshake
   AgentHello{term: <last known or 0>}
                                 ──► If term matches: OK
                                 ──► If term stale: FailedPrecondition

3. On FailedPrecondition:
   - Clear cached term
   - Reconnect (get new term from principal)

4. On disconnect (principal fenced/failed):
   - Backoff + retry
   - DNS may resolve to different principal
   - New handshake gets new term
```
🛠️ Refactor suggestion | 🟠 Major
Term persistence is required, not optional, to prevent stale term acceptance.
The handshake flow shows term validation (lines 688-689), but doesn't handle the case where the agent's term is newer than the principal's term.
Scenario:
- Principal is ACTIVE with term=5
- Agent connects, caches term=5
- Principal crashes and restarts
- If term is not persisted, principal starts with term=1 (or 0)
- Agent reconnects with cached term=5
- Principal with term=1 accepts the connection (agent's term doesn't match, but principal has no way to know if 5 is stale or valid)
Risk: The principal might accept connections from agents carrying state from a future term it hasn't reached yet, causing state inconsistency.
- Make term persistence mandatory, not optional (resolve Open Question #4 at line 1108)
- Add validation: the principal should reject connections where `agent_term > principal_term` with a specific error like `FAILED_PRECONDITION: agent term too new, principal restarted`
- Agent behavior: on a "term too new" error, the agent should clear its cached term and reconnect
Update the handshake flow:
2. gRPC handshake
AgentHello{term: <last known or 0>}
- ──► If term matches: OK
- ──► If term stale: FailedPrecondition
+ ──► If term matches: OK
+ ──► If agent_term < principal_term: FailedPrecondition (stale)
+ ──► If agent_term > principal_term: FailedPrecondition (principal restarted)

As per the design, addressing Open Question #4 should state that term MUST be persisted.
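A small Go sketch of what the stricter principal-side check could look like, assuming the principal's term is loaded from persistent storage at startup so it cannot regress. The helper name is illustrative.

```go
package ha

import (
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// validateAgentTerm implements the stricter handshake check suggested above.
// principalTerm is assumed to be restored from persistent storage on startup.
func validateAgentTerm(agentTerm, principalTerm uint64) error {
	switch {
	case agentTerm == 0 || agentTerm == principalTerm:
		return nil // first connect, or terms match
	case agentTerm < principalTerm:
		return status.Errorf(codes.FailedPrecondition,
			"agent term %d is stale (principal term %d): reconnect to refresh", agentTerm, principalTerm)
	default: // agentTerm > principalTerm
		return status.Errorf(codes.FailedPrecondition,
			"agent term %d too new (principal term %d): principal restarted without persisted term", agentTerm, principalTerm)
	}
}
```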
docs/proposals/ha-failover.md
Outdated
| ### Failback Command | ||
|
|
||
| ```bash | ||
| # Check current HA status | ||
| $ argocd-agent ha status | ||
| Region A (principal.region-a.internal): | ||
| State: STANDBY | ||
| Role: preferred-primary | ||
| Term: 2 | ||
| Replication: synced (lag: 0s) | ||
| Last event: 2024-01-15T10:30:00Z | ||
|
|
||
| Region B (principal.region-b.internal): | ||
| State: ACTIVE | ||
| Role: preferred-replica | ||
| Term: 2 | ||
| Connected agents: 47 | ||
| Uptime: 2h 15m | ||
|
|
||
| # Trigger failback (requires confirmation) | ||
| $ argocd-agent ha failback --to region-a | ||
|
|
||
| WARNING: This will: | ||
| 1. Promote Region A to primary (term → 3) | ||
| 2. Demote Region B to replica | ||
| 3. Cause all 47 agents to reconnect | ||
|
|
||
| Region A replication status: SYNCED (lag: 0s) | ||
|
|
||
| Proceed? [y/N]: y | ||
|
|
||
| [1/3] Signaling Region B to become replica... done | ||
| [2/3] Signaling Region A to become primary (term=3)... done | ||
| [3/3] Waiting for agents to reconnect... 47/47 connected | ||
|
|
||
| Failback complete. Region A is now primary (term=3). | ||
| ``` |
🛠️ Refactor suggestion | 🟠 Major
Ensure failback command is atomic and validates prerequisites.
The failback command (lines 817-834) sends signals to both regions:
[1/3] Signaling Region B to become replica... done
[2/3] Signaling Region A to become primary (term=3)... done
Risk: If network partition between regions is still present or intermittent:
- Step [1/3] might fail silently or timeout
- Step [2/3] succeeds
- Both regions think they're primary → split-brain
Add prerequisites validation before execution:
$ argocd-agent ha failback --to region-a
+Validating prerequisites...
+✓ Region A is reachable
+✓ Region B is reachable
+✓ Region A ↔ Region B connectivity OK
+✓ Region A replication lag: 0s (SYNCED)
+
WARNING: This will:
1. Promote Region A to primary (term → 3)
2. Demote Region B to replica
3. Cause all 47 agents to reconnect
Region A replication status: SYNCED (lag: 0s)
Proceed? [y/N]: y

If any prerequisite fails, abort with a clear error. This prevents the operator from executing failback during an ongoing partition or connectivity issues.
Additionally, the command should:
- Execute as a transaction (both regions agree before state changes)
- Have a timeout and rollback mechanism if step [2/3] fails after [1/3] succeeds
- Verify the new primary (Region A) receives ACKs from the new replica (Region B) before completing (a preflight-check sketch follows below)
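A rough sketch of what such preflight validation could look like in Go. The check functions, field names, and thresholds are placeholders, not a prescribed CLI design.

```go
package ha

import (
	"context"
	"fmt"
	"time"
)

// PreflightChecks are illustrative prerequisites the failback command could
// verify before signaling either region.
type PreflightChecks struct {
	RegionAReachable func(context.Context) error
	RegionBReachable func(context.Context) error
	ReplicationLag   func(context.Context) (time.Duration, error)
	MaxAllowedLag    time.Duration // e.g. 0 for "must be fully SYNCED"
}

// Run aborts the failback with a descriptive error if any prerequisite fails.
func (p PreflightChecks) Run(ctx context.Context) error {
	if err := p.RegionAReachable(ctx); err != nil {
		return fmt.Errorf("region A unreachable: %w", err)
	}
	if err := p.RegionBReachable(ctx); err != nil {
		return fmt.Errorf("region B unreachable: %w", err)
	}
	lag, err := p.ReplicationLag(ctx)
	if err != nil {
		return fmt.Errorf("cannot determine replication lag: %w", err)
	}
	if lag > p.MaxAllowedLag {
		return fmt.Errorf("replication lag %s exceeds %s: refusing failback", lag, p.MaxAllowedLag)
	}
	return nil
}
```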
Adds an alternative to the self-fencing design (Option A) that eliminates split-brain complexity by making all promotion decisions external. Two modes:
- Manual: operator runs promote/demote + updates DNS. Works with any DNS provider. No automatic failover, no fencing, no terms.
- Coordinator: pluggable interface (Route53 ARC, etc.) polls an external authority to determine the active principal.

Reduces the state machine from 7 states to 4; removes the heartbeat/ACK protocol, self-fencing, and the term/epoch system entirely.
docs/proposals/ha-failover.md (outdated excerpt):

    ---

    ## Option B: Simplified Operations Model (Recommended)
@jannfis I think self-fencing, terms/epochs, etc. got too complex. If we rely on external promotion decisions (manual or something like Route53 ARC), the principal logic is simpler.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@docs/proposals/ha-failover.md`:
- Around line 780-783: The document has an inconsistency: the failback timeline
shows a recovered node in State: STANDBY with Health: UNHEALTHY, but the
health-check table declares STANDBY as HEALTHY; update the docs to make these
consistent by choosing one behavior and applying it everywhere (e.g., if
failback requires operator promotion, change the health-check table entry for
STANDBY to UNHEALTHY or introduce a distinct state like STANDBY_SYNCED_NOT_READY
and mark that UNHEALTHY), and add a short note in the failback sequence (the
"Caught up!" step) referencing the chosen rule (preferredRole/manual-ready flag)
so readers know when GSLB may start routing to STANDBY (a health-endpoint sketch follows this list).
- Around line 1130-1151: The Promote() logic must treat an unreachable peer as a
blocking condition to avoid split-brain: update Controller.Promote to consider
err != nil from checkPeerStatus() as a fatal precondition (reject promotion with
a clear error referencing c.options.PeerAddress) unless the caller passed force;
keep the existing force behavior (log a warning and proceed), and still perform
c.transitionTo(StateActive) only after the reachable-peer check (use
checkPeerStatus(), Promote(), transitionTo(), and options.PeerAddress to locate
the change).
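Related to the STANDBY health question above, a minimal sketch of a state-keyed health handler. It treats only ACTIVE as healthy, which is one of the two consistent options the comment asks to pick; the wiring (atomic state value, mounting on :8080/healthz) is an assumption for illustration.

```go
package ha

import (
	"net/http"
	"sync/atomic"
)

// healthHandler returns 200 only while the principal is ACTIVE, which is
// what drives GSLB routing in this design. STANDBY stays unhealthy here
// until an operator promotes it.
func healthHandler(state *atomic.Value) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if s, _ := state.Load().(string); s == "ACTIVE" {
			w.WriteHeader(http.StatusOK)
			_, _ = w.Write([]byte("ok"))
			return
		}
		w.WriteHeader(http.StatusServiceUnavailable)
	})
}

// Example wiring (illustrative): http.Handle("/healthz", healthHandler(&state))
// served on :8080 alongside the principal.
```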
🧹 Nitpick comments (2)
docs/proposals/ha-failover.md (2)
11-17: Design Decisions table only reflects Option A. The table at the top of the document describes "Self-fencing on lost peer ACK" as the split-brain prevention mechanism, but Option B (lines 1083–1321) proposes removing self-fencing entirely in favor of external authority-based promotion. This creates confusion about which design is being proposed.
Recommendation: Either:
- Add a note above this table: "Note: This table describes Option A. See Option B starting at line 1083 for an alternative simplified model."
- Create two separate tables (one for Option A, one for Option B)
- Move Option B to a separate document if it's a competing proposal
This will help readers understand they're evaluating two distinct architectures.
338-340: Consider adding term to SnapshotRequest for consistency. The `SnapshotRequest` message only includes `since_sequence_num`, but no `term` field. This means when a principal calls `GetSnapshot`, the serving principal cannot validate whether the requester is operating on a stale term.

Scenario:
- PR1 is ACTIVE with term=5
- PR1 crashes and comes back with term=1 (if term not persisted)
- PR2 is now ACTIVE with term=6
- PR1 calls GetSnapshot with since_sequence_num from its old state
- PR2 has no way to reject this stale request
While this might auto-correct during subsequent replication, explicitly validating term at snapshot time would catch stale principals earlier.
Recommendation:
message SnapshotRequest {
  uint64 since_sequence_num = 1;
+ uint64 term = 2;
}

And in the implementation, reject requests with term < current term.
Note: This is closely related to the term persistence issue (Open Question #4, line 1332, already flagged in past comments). If the term is persisted, this becomes less critical.
docs/proposals/ha-failover.md (outdated excerpt):

### Safety: Promote Refuses if Peer is ACTIVE

The `promote` command checks peer state before allowing promotion. This prevents the most common operator error (promoting without demoting the other side):

```go
func (c *Controller) Promote(ctx context.Context, force bool) error {
    if c.state == StateActive {
        return fmt.Errorf("already active")
    }

    peer, err := c.checkPeerStatus(ctx)
    if err == nil && peer.State == StateActive {
        if !force {
            return fmt.Errorf("peer is ACTIVE at %s — demote peer first, or use --force", c.options.PeerAddress)
        }
        log().Warn("Force-promoting while peer is ACTIVE")
    }

    c.transitionTo(StateActive)
    return nil
}
```
Option B: Safety check cannot prevent stale principal during partition.
The Promote() safety check (lines 1140-1146) prevents promoting when the peer is reachable and ACTIVE, but doesn't handle the partition case where the old primary is unreachable:
Scenario:
- PR1 is ACTIVE, serving agents
- Network partition: PR1 and PR2 cannot reach each other
- Operator runs `promote` on PR2
- Line 1140: `checkPeerStatus()` fails with an error (peer unreachable)
- Line 1141: `err == nil` is false, so the `peer.State == StateActive` check is skipped
- Both PR1 and PR2 are now ACTIVE → split-brain
Since Option B removes terms/epochs (line 1096), there's no mechanism to reject stale connections. If agents reconnect to PR1 after the partition heals, PR1 has no way to know it's been demoted.
Possible solutions:
- Require reachable peer for promotion: Change line 1141 to reject if
err != nil(peer must be reachable to confirm it's not ACTIVE) - Add back minimal term validation even in Option B: Increment a version on promotion, include in agent handshakes
- Document as known limitation: State that Option B requires manual verification that old primary is truly dead before promoting
Recommend addressing this before Option B can be considered a complete alternative.
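A sketch of how Promote could be tightened along the first suggestion (treat an unreachable peer as a blocking condition unless --force is given). The surrounding types are minimal stand-ins for the Controller in the excerpt above, so this is illustrative, not a drop-in change.

```go
package ha

import (
	"context"
	"fmt"
)

// Minimal stand-ins for the types used in the quoted excerpt.
type State int

const (
	StateStandby State = iota
	StateActive
)

type peerStatus struct{ State State }

type Controller struct {
	state   State
	options struct{ PeerAddress string }

	checkPeerStatus func(context.Context) (peerStatus, error)
	transitionTo    func(State)
}

// Promote refuses to promote unless the peer is reachable and not ACTIVE,
// or the operator explicitly passes force.
func (c *Controller) Promote(ctx context.Context, force bool) error {
	if c.state == StateActive {
		return fmt.Errorf("already active")
	}

	peer, err := c.checkPeerStatus(ctx)
	switch {
	case err != nil && !force:
		// Unreachable peer now blocks promotion: we cannot prove it is not
		// ACTIVE, so promoting could create a second active principal.
		return fmt.Errorf("cannot verify peer at %s (%v): confirm it is down, then use --force", c.options.PeerAddress, err)
	case err == nil && peer.State == StateActive && !force:
		return fmt.Errorf("peer is ACTIVE at %s — demote peer first, or use --force", c.options.PeerAddress)
	case force && (err != nil || peer.State == StateActive):
		// Same force behavior as before: proceed with a loud warning.
		fmt.Println("WARNING: force-promoting without confirmation that peer is inactive")
	}

	c.transitionTo(StateActive)
	return nil
}
```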
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@docs/proposals/ha-failover.md`:
- Around line 21-52: The markdown file docs/proposals/ha-failover.md contains
multiple fenced code blocks (ASCII diagrams and console transcripts) without a
language tag, triggering markdownlint; update each triple-backtick fence that
encloses the diagrams/transcripts (e.g., the Global DNS/GSLB ASCII diagram
block, the RECOVERING/ACTIVE state diagram, timeline blocks starting "T+0s", and
the argocd-agentctl/console snippets) to use a language tag such as "text"
(change ``` to ```text) so linting and readability are improved while preserving
content.
```
┌─────────────────────────────────────────────────────────────────┐
│                        Global DNS / GSLB                        │
│                  principal.argocd.example.com                   │
│         Health checks: /healthz (200 only when ACTIVE)          │
└──────────────┬──────────────────────────────┬───────────────────┘
               │                              │
               ▼                              ▼
    ┌──────────────────────┐       ┌──────────────────────┐
    │  REGION A (Primary)  │       │  REGION B (Replica)  │
    │                      │       │                      │
    │  Principal Server    │◄──────│  Replication Client  │
    │  - gRPC :8403        │       │  - gRPC :8403        │
    │  - Replication :8404 │──────►│  - Mirrors state     │
    │  - /healthz :8080    │       │  - /healthz :8080    │
    │                      │       │                      │
    │  ArgoCD Instance     │       │  ArgoCD Instance     │
    │  (source of truth)   │       │  (standby, receives  │
    │                      │       │   replicated state)  │
    └──────────────────────┘       └──────────────────────┘
               ▲                              ▲
               │     Replication Stream:      │
               │     All events forwarded     │
               │     Primary → Replica        │
               │                              │
       ┌───────┴──────────────────────────────┴───────┐
       │                Remote Clusters               │
       │   ┌─────────┐  ┌─────────┐  ┌─────────┐      │
       │   │ Agent 1 │  │ Agent 2 │  │ Agent N │      │
       │   └─────────┘  └─────────┘  └─────────┘      │
       └──────────────────────────────────────────────┘
```
Add languages to fenced code blocks.
markdownlint warns on these blocks; adding languages improves readability and linting compliance.
Suggested fixes:
🛠️ Proposed diff
The proposed diff adds a `text` language tag to the opening fence of each of these blocks (closing fences unchanged):
- the Global DNS / GSLB architecture diagram
- the state diagram starting "RECOVERING ──┬── config=primary & peer not ACTIVE ──→ ACTIVE"
- the failover timeline starting "T+0s Primary dies. Replica detects stream break → DISCONNECTED."
- the console transcript starting "$ argocd-agentctl ha demote # on Region A — drops agents, stops serving"
- the failback timeline starting "T+0 Old primary restarts → RECOVERING → SYNCING (peer is ACTIVE)"

Also applies to: 62-72, 135-142, 148-152, 156-164
PR form of #720 for easier commenting / collaboration