Conversation

Contributor

@drewbailey drewbailey commented Jan 27, 2026

PR form of #720 for easier commenting / collaboration

Summary by CodeRabbit

  • Documentation
    • Added a comprehensive HA failover guide covering Active/Passive cross-region disaster recovery, a multi-state HA controller model, replication and recovery flows, failover/failback scenarios and timing, configuration examples, GSLB/DNS routing and health checks, observability and alerts, operational limitations (lag, split‑brain risks) and proposals for future enhancements.

Contributor

coderabbitai bot commented Jan 27, 2026

📝 Walkthrough


Adds a new proposal at docs/proposals/ha-failover.md describing an Active/Passive HA failover design for the ArgoCD agent: cross-region replication, principal-to-principal streaming proto (Subscribe/GetSnapshot/Status), a multi-state HA controller, failover/failback flows, GSLB/DNS guidance, observability, limitations, and future work. (50 words)

Changes

| Cohort / File(s) | Summary |
|---|---|
| **HA Failover Design Documentation**<br>`docs/proposals/ha-failover.md` | New comprehensive design doc (+354 lines) detailing cross-region disaster recovery, principal-to-principal replication (Subscribe, GetSnapshot, Status RPCs), replication message formats (ReplicatedEvent, ReplicationSnapshot, AgentState, Resource, ReplicationAck, ReplicationStatus), a 5‑state HA controller, failover/failback sequences, agent reconnection, GSLB/DNS and healthcheck guidance, observability/alerts, limitations, and future work. |

Sequence Diagram(s)

(omitted — documentation-only addition; no runtime/control-flow code changes requiring diagrams)

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

🐰 I hopped through pages, stitched each line,
Regions paired and heartbeats all align,
States that tumble, then stand tall and true,
Replicas whisper, "I’ve got you,"
A carrot crown for safe rollout too. 🥕

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'proposal for ha failover design' directly and clearly describes the main change - adding a comprehensive HA failover design proposal documentation. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🤖 Fix all issues with AI agents
In `@docs/proposals/ha-failover.md`:
- Around line 190-258: Add a security section describing how principals
authenticate and authorize each other for the Replication service (service
Replication, rpc Subscribe, rpc GetSnapshot, messages ReplicatedEvent and
ReplicationSnapshot) by requiring mutual TLS for all replication connections,
documenting certificate issuance, rotation/expiry handling and trust roots, and
specifying an authorization model (e.g., agent identity in cert CN/SAN mapped to
allowed replicas/agents and per-peer ACLs or RBAC) plus recommendations for
logging and auditing of failed auth attempts; include guidance on secret
storage, automated rotation, and fallback/renewal behavior to ensure continuous
authenticated replication.
- Around line 666-667: The "Replication Lag Monitoring" section is empty; please
add guidance that defines the key metrics to expose (replication lag time,
pending event count, last replicated sequence/LSN), recommends alerting
thresholds (e.g., warning at X seconds/minutes and critical at Y, or pending
events > N), and provides troubleshooting steps (check network/CPU/IO, review
producer/consumer backpressure, inspect WAL/streaming errors, and steps to
resync or promote replicas). Reference the section header "Replication Lag
Monitoring" so the new content sits directly beneath it and use concrete
examples/thresholds and commands/log locations relevant to your HA
implementation.
- Line 3: Fix the typo in the AI disclosure line: locate the string "_ai
disclosure: diagrams and tables were formatted/generatd with ai_" and change
"generatd" to "generated" so the line reads "_ai disclosure: diagrams and tables
were formatted/generated with ai_" (or keep the original spacing/punctuation
style) to correct the misspelling.
- Around line 703-706: Replace the four broken relative links ([argocd-agent
Architecture], [Event System], [Queue Implementation], [Resync Protocol]) in the
proposal with TODO placeholders noting the implementation/link will be added
later; specifically remove or replace the paths "../README.md",
"../internal/event/event.go", "../internal/queue/queue.go",
"../internal/resync/resync.go" with explanatory TODO text (e.g., "TODO: link to
implementation or API docs when available") so the proposal no longer references
non-documentation files and the pipeline validation passes.
🧹 Nitpick comments (1)
docs/proposals/ha-failover.md (1)

19-67: Optional: Add language specifier to code blocks.

Consider adding `text` or `ascii` as the language specifier for ASCII diagram blocks to satisfy markdown linters. This applies to several diagram blocks throughout the document (lines 19-67, 81-123, 268-314, 344-370, 391-419).

📋 Example fix
-```
+```text
 ┌─────────────────────────────────────────────────────────────────────────────┐
 │                           Global DNS / GSLB                                 │

Comment on lines 703 to 706
- [argocd-agent Architecture](../README.md)
- [Event System](../internal/event/event.go)
- [Queue Implementation](../internal/queue/queue.go)
- [Resync Protocol](../internal/resync/resync.go)
Contributor


⚠️ Potential issue | 🔴 Critical

Fix broken documentation references.

The pipeline validation fails because these relative paths point to code files, not documentation. Since this is a proposal document, these references should either:

  1. Use repository URLs to link to code (if already merged)
  2. Reference API documentation (if generated docs exist)
  3. Be removed or marked as TODO (if the code doesn't exist yet)

For a design proposal, option 3 is most appropriate. These can be added when the implementation is complete.

🔧 Proposed fix (mark as TODO for future implementation)
 ## References
 
-- [argocd-agent Architecture](../README.md)
-- [Event System](../internal/event/event.go)
-- [Queue Implementation](../internal/queue/queue.go)
-- [Resync Protocol](../internal/resync/resync.go)
+- argocd-agent Architecture (see main README.md)
+- Event System implementation (TODO: link when implemented)
+- Queue Implementation (TODO: link when implemented)
+- Resync Protocol (TODO: link when implemented)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
- [argocd-agent Architecture](../README.md)
- [Event System](../internal/event/event.go)
- [Queue Implementation](../internal/queue/queue.go)
- [Resync Protocol](../internal/resync/resync.go)
- argocd-agent Architecture (see main README.md)
- Event System implementation (TODO: link when implemented)
- Queue Implementation (TODO: link when implemented)
- Resync Protocol (TODO: link when implemented)
🤖 Prompt for AI Agents
In `@docs/proposals/ha-failover.md` around lines 703 - 706, Replace the four
broken relative links ([argocd-agent Architecture], [Event System], [Queue
Implementation], [Resync Protocol]) in the proposal with TODO placeholders
noting the implementation/link will be added later; specifically remove or
replace the paths "../README.md", "../internal/event/event.go",
"../internal/queue/queue.go", "../internal/resync/resync.go" with explanatory
TODO text (e.g., "TODO: link to implementation or API docs when available") so
the proposal no longer references non-documentation files and the pipeline
validation passes.

Comment on lines 672 to 673
1. **Replica knows it's a replica** - Won't accept agents while replication stream is healthy
2. **Failover requires replication stream break** - Only then does replica become active
Collaborator


For the split-brain problem, I can imagine the following scenario between Principal 1 (PR1) and Principal 2 (PR2):

  • PR1 is primary, PR2 is replica
  • PR1 isn't able to communicate to PR2 due to network failure/issue/hiccup
  • Replication stream breaks
  • PR2, not receiving any replication data, assumes primary role
  • PR2, still able to send to (but not receive from) PR1 sends replication data to PR1
  • PR1 still assumes it is primary - split brain

I do not have any idea currently on how to solve this without a quorum (i.e. a third principal or external observer), but that would inherently increase complexity.

Contributor Author

@drewbailey drewbailey Jan 27, 2026


Yeah, this is something I have been trying to grapple with. I think we can prevent split-brain (at the expense of availability) using self-fencing & requiring peer ACKs on replication events.

The primary must continuously prove the replica is listening; if there are no ACKs, the principal self-fences.

PR1 (primary):
- Sends heartbeats every 5s on replication stream
- Expects ACKs from PR2
- No ACK for 15s → FENCED state

PR2 (replica):
- No heartbeat/events for 30s → promotes to STANDBY
- Trusts that PR1 will self-fence
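
In Go terms, the primary-side loop might look roughly like the sketch below; the fencer type and its hooks are hypothetical, and the 5s/15s values simply mirror the numbers above.

```go
package ha

import (
    "context"
    "time"
)

// fencer is a minimal stand-in for the primary's HA controller. lastAck is
// refreshed whenever the replica ACKs a replicated event or heartbeat.
// All three hooks are hypothetical.
type fencer struct {
    lastAck       func() time.Time
    sendHeartbeat func()
    fence         func() // ACTIVE → FENCED: /healthz unhealthy, drop agent sessions
}

// run sends a heartbeat every 5s and self-fences if no ACK has arrived
// within the 15s ackTimeout.
func (f *fencer) run(ctx context.Context) {
    tick := time.NewTicker(5 * time.Second)
    defer tick.Stop()
    for {
        select {
        case <-ctx.Done():
            return
        case <-tick.C:
            f.sendHeartbeat()
            if time.Since(f.lastAck()) > 15*time.Second {
                f.fence()
                return
            }
        }
    }
}
```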

I also think there could be a pluggable interface to allow the use of vendor solutions like route53 ARC to be an external witness & help with fencing and coordination.
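
As a rough illustration, such a witness hook could be as small as the interface below (names and methods are illustrative, not an existing API).

```go
package ha

import "context"

// TrafficAuthority sketches the pluggable external-witness interface; an
// implementation might wrap Route53 ARC routing controls or another arbiter.
// The method set here is illustrative only.
type TrafficAuthority interface {
    // ActivePrincipal reports which principal the external authority
    // currently designates as primary (e.g. "region-a").
    ActivePrincipal(ctx context.Context) (string, error)

    // ReportFenced tells the authority that this principal has self-fenced,
    // so traffic can be shifted away from it.
    ReportFenced(ctx context.Context, principal string) error
}
```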

A new state, FENCED, would allow for something like:

  ACTIVE ──(no ACKs for ackTimeout)──► FENCED
                                          │
                                          ├─ /healthz: UNHEALTHY
                                          ├─ reject new agent connections
                                          ├─ terminate existing agent sessions
                                          │
                                          ▼
                                (peer reconnects as primary)
                                          │
                                          ▼
                                      SYNCING

Failure Scenarios

Scenario 1: PR2 Crashes (Clean Failure)

Time    PR1 (primary)              PR2 (replica)         Agents
─────────────────────────────────────────────────────────────────
T+0     Serving agents             Dies                 Connected to PR1
T+5s    Heartbeat fails            -                    Still on PR1
T+10s   No ACKs...                 -                    Still on PR1
T+15s   FENCED                     -                    Disconnected
        /healthz UNHEALTHY

Result: outage, no healthy principal.

I believe this is unavoidable with two nodes and no external witness: PR1 can't distinguish “PR2 is dead” from “PR2 is partitioned”. To prevent split-brain, PR1 must always fence on lost ACKs, even if that causes a brief outage. An external witness (something like ARC) can help in this failure mode.

Scenario 2: Asymmetric Partition (PR1 → PR2 Broken)

  Time    PR1 (primary)              PR2 (replica)         DNS/GSLB
  ─────────────────────────────────────────────────────────────────
  T+0     Can't reach PR2            Replication breaks    PR1 healthy
          Heartbeats fail            Starts 30s timer
  T+15s   No ACKs → FENCED           Waiting...            PR1 UNHEALTHY
          Terminates agents                                PR2 healthy
  T+30s   -                          Promotes → STANDBY    Routes to PR2
  T+45s   -                          Agent connects
          -                          → ACTIVE              Stable on PR2

  Result: Clean failover, no split-brain. PR1 fenced before PR2 promoted.

Scenario 3: Full Partition (Neither Can Reach Other)

  Time    PR1 (primary)              PR2 (replica)         DNS/GSLB
  ─────────────────────────────────────────────────────────────────
  T+0     Can't reach PR2            Can't receive         Depends on
          No ACKs                    from PR1              GSLB's view
  T+15s   Probes PR2: unreachable    Waiting...
          Stays ACTIVE
  T+30s   Still ACTIVE               Probes PR1:
                                     unreachable
                                     Promotes → STANDBY

            Both think peer is dead - potential dual-active

Mitigation relies on DNS/GSLB. Even if both are ACTIVE-capable:

  • GSLB health checks determine who gets traffic
  • Term/epoch rejects stale updates
  • When partition heals, one demotes based on term comparison

Scenario 4: Primary Network Blip (Transient)

  Time    PR1 (primary)              PR2 (replica)
  ─────────────────────────────────────────────────────────
  T+0     Network hiccup             Replication pauses
  T+5s    Still no ACKs              Waiting...
  T+10s   Network recovers           Receives heartbeat
          Receives ACK               Sends ACK
          Stays ACTIVE               Stays REPLICATING

  Result: No state change. ackTimeout (15s) > typical network blip.

Propose a self-fencing mechanism: the primary must receive ACKs from the
replica, or fence itself, to prevent dual-active principals during
partitions.

- Add FENCED state to HA state machine with recovery logic
- Add term/epoch for rejecting stale updates from old leaders
- Add bidirectional heartbeat on replication stream
- Add failure scenario analysis (5 scenarios documented)
- Add agent behavior section: term in handshake, reconnect on stale
- Add optional TrafficAuthority interface for external witnesses (ARC)
- Update proto with heartbeat message and term fields
- Update configuration with new timeouts (ackTimeout, recoveryTimeout)
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

🤖 Fix all issues with AI agents
In `@docs/proposals/ha-failover.md`:
- Around line 629-650: The document currently treats GSLB PRIMARY/SECONDARY
routing as optional but the review shows it's a hard correctness requirement;
update the Design Decisions table by adding a new row titled "GSLB tie-breaking"
with the decision "MUST use PRIMARY/SECONDARY or equivalent" and rationale
"Prevents dual-active during full partition", update the Requirements section to
include a MUST requirement for deterministic GSLB tie-breaking (use
PRIMARY/SECONDARY, weighted 100:0, or strict geoproximity) using MUST language,
and add a prominent warning in the GSLB setup section stating that deterministic
tie-breaking is critical (no round-robin/least-connections) and listing
acceptable options so operators cannot deploy without a tie-breaker.
- Around line 461-470: Clarify the term increment semantics: state that
automatic promotion (failover) increments term when a STANDBY node transitions
to ACTIVE upon the first agent connect, whereas manual failback increments term
when the operator issues the command (referenced as the "argocd-agent ha
failback" / "become primary" step); update the STANDBY row in the table (the
"STANDBY: Accept → Transition to ACTIVE (term++)" text) to explicitly state this
distinction or modify the manual failback sequence example to show that the
operator command performs the term++ and then transitions state to ACTIVE so
both descriptions are consistent.
- Around line 326-335: Update the Subscribe RPC to accept a client-side framed
message that includes an explicit initial handshake before ACKs: change the rpc
signature to stream ReplicationClientMessage so the replica can send a
ReplicationHandshake (with term, last_ack_sequence, state enum e.g.
SYNCING/REPLICATING) as the first message followed by ReplicationAck messages;
add the ReplicationClientMessage wrapper (oneof { ReplicationHandshake,
ReplicationAck }) and ensure server-side Subscribe handler validates the
handshake fields (term and last_ack_sequence) and uses them to choose resume
point or reject/migrate the replica before processing subsequent ACKs.
- Around line 539-563: The unfencing logic in shouldUnfence currently returns
true based only on gslbResolvesToMe() || gslbHasNoHealthyTarget(), which risks
split-brain and references an undefined gslbHasNoHealthyTarget; update
shouldUnfence to (1) implement or remove gslbHasNoHealthyTarget() and (2) add an
explicit peer liveness/state probe (e.g., implement a new method like
probePeerIsActive() or reuse existing peer-check utilities) and require that
probePeerIsActive() == false before allowing unfence; ensure you call the probe
from shouldUnfence (alongside gslbResolvesToMe()) so the fenced node only
unfences if DNS conditions are met AND the peer is confirmed not ACTIVE.
- Around line 798-834: Add preflight checks and make the "ha failback" command
atomic by validating prerequisites (cluster connectivity, replication synced,
quorum, and agent connectivity) before executing the two-step signal sequence
shown in the failback example; implement a two-phase coordination where both
regions agree to the state change (prepare/commit), include configurable
timeouts, explicit error returns on any network/partition failures, and a
rollback path that reverts any partial changes if the commit to Region A
(promote) fails after demoting Region B, and ensure the command waits for and
verifies ACKs from Region B (replica) to Region A (new primary) before reporting
success.
- Around line 680-699: Make term persistence mandatory (resolve Open Question
`#4`) and update the handshake: when receiving AgentHello include a strict
validation in the principal's handshake logic to compare agent_term vs
principal_term and return FAILED_PRECONDITION with a clear message like "agent
term too new, principal restarted" if agent_term > principal_term; ensure the
principal refuses the connection in that case. On the agent side, update the
gRPC handshake retry handling so that on this specific "term too new" error the
agent clears its cached term and reconnects to fetch the new principal term
(same reconnection/backoff flow applies). Also document that term MUST be
persisted by the principal so it cannot regress after restart.
- Around line 82-92: The self-fencing section is ambiguous about which endpoint
probePeerHealth() checks; update the text to explicitly state that
probePeerHealth() probes the peer's HTTP health endpoint (port 8080, /healthz)
rather than the replication port (8404). Change the three decision bullets (No
ACK for ackTimeout, peer unreachable, peer reachable but not ACKing) to
reference probing :8080/healthz so readers implement the probe correctly and
avoid confusing replication-port failures with a dead peer.
🧹 Nitpick comments (1)
docs/proposals/ha-failover.md (1)

198-208: Document operational considerations for replication bandwidth and backpressure.

The replication design sends ALL events across ALL agents (line 151) with full CloudEvents payloads. Missing considerations:

Bandwidth & Scalability:

  • How much bandwidth is required for N agents? (estimates/benchmarks)
  • Is cross-region bandwidth cost acceptable?
  • Are there compression or batching strategies?

Backpressure:

  • What happens if the replica falls behind (network slowdown, resource constraints)?
  • Does the primary buffer events? For how long?
  • Is there a circuit breaker to fence the primary if replication lag exceeds a threshold?

Memory:

  • In-memory buffering of pending events could grow unbounded
  • Need limits and overflow handling

These operational concerns directly impact the "1-5 minute RTO" design goal and could cause split-brain if not handled correctly (e.g., primary continues serving while replica has stale state).

Consider documenting:

  1. Expected replication bandwidth per agent
  2. Maximum tolerable replication lag before fencing
  3. Backpressure strategy (block writes? shed load? fence?)
  4. Memory limits for replication buffers
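
For illustration only, the bound-plus-circuit-breaker idea could take a shape like the sketch below; the type, limits, and error handling are placeholders, not recommendations.

```go
package ha

import (
    "errors"
    "time"
)

// replicationBuffer bounds in-memory pending events; all limits here are
// placeholders for whatever the proposal eventually specifies.
type replicationBuffer struct {
    events      chan []byte   // serialized ReplicatedEvent payloads
    maxLag      time.Duration // e.g. fence the primary beyond this lag
    lastAckTime func() time.Time
}

var (
    errReplicaBehind = errors.New("replication lag exceeded threshold")
    errBufferFull    = errors.New("replication buffer full")
)

// enqueue fails fast instead of growing without bound; callers could use
// either error as the signal to fence the primary or shed load.
func (b *replicationBuffer) enqueue(ev []byte) error {
    if time.Since(b.lastAckTime()) > b.maxLag {
        return errReplicaBehind
    }
    select {
    case b.events <- ev:
        return nil
    default:
        return errBufferFull
    }
}
```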

Comment on lines 326 to 335
service Replication {
// Subscribe to replication stream (replica calls this on primary)
// Bidirectional: primary sends events/heartbeats, replica sends ACKs
rpc Subscribe(stream ReplicationAck) returns (stream ReplicationStreamMessage);

// Get initial snapshot (replica calls on connect)
rpc GetSnapshot(SnapshotRequest) returns (ReplicationSnapshot);

// Get replication status
rpc Status(StatusRequest) returns (ReplicationStatus);
Contributor


🛠️ Refactor suggestion | 🟠 Major

Add initial handshake message to Subscribe RPC for replica state synchronization.

The Subscribe RPC is bidirectional but lacks an initial handshake from the replica. Currently:

rpc Subscribe(stream ReplicationAck) returns (stream ReplicationStreamMessage);

Problem: The primary doesn't know the replica's:

  • Current term (to detect term mismatch immediately)
  • Last acknowledged sequence number (to resume from correct point)
  • State (SYNCING, REPLICATING, etc.)

Risk: After network reconnection, the primary may send events the replica already has or skip events it missed, causing state divergence.

Recommended proto enhancement
+message ReplicationHandshake {
+    uint64 replica_term = 1;
+    uint64 last_acked_sequence_num = 2;
+    string replica_state = 3;
+}
+
+message ReplicaMessage {
+    oneof payload {
+        ReplicationHandshake handshake = 1;
+        ReplicationAck ack = 2;
+    }
+}
+
-rpc Subscribe(stream ReplicationAck) returns (stream ReplicationStreamMessage);
+rpc Subscribe(stream ReplicaMessage) returns (stream ReplicationStreamMessage);

The replica sends handshake as the first message, then sends ACKs. The primary can validate term and resume from the correct sequence number.
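
For illustration, the replica side of that contract might look roughly like the Go sketch below, assuming protoc-generated types for the proposed ReplicaMessage/ReplicationHandshake exist (they do not yet; the import path and accessor names are assumptions).

```go
package ha

import (
    "context"

    pb "example.invalid/argocd-agent/replication" // hypothetical generated package
)

// subscribe opens the replication stream, sends the handshake first, then
// ACKs each event it receives. Field and accessor names mirror the proto
// sketch above and are assumptions.
func subscribe(ctx context.Context, client pb.ReplicationClient, term, lastAcked uint64) error {
    stream, err := client.Subscribe(ctx)
    if err != nil {
        return err
    }
    handshake := &pb.ReplicaMessage{
        Payload: &pb.ReplicaMessage_Handshake{
            Handshake: &pb.ReplicationHandshake{
                ReplicaTerm:          term,
                LastAckedSequenceNum: lastAcked,
                ReplicaState:         "REPLICATING",
            },
        },
    }
    if err := stream.Send(handshake); err != nil {
        return err
    }
    for {
        msg, err := stream.Recv()
        if err != nil {
            return err // stream break: the replica starts its failover timer
        }
        ack := &pb.ReplicaMessage{
            Payload: &pb.ReplicaMessage_Ack{
                Ack: &pb.ReplicationAck{SequenceNum: msg.GetSequenceNum()},
            },
        }
        if err := stream.Send(ack); err != nil {
            return err
        }
    }
}
```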

🤖 Prompt for AI Agents
In `@docs/proposals/ha-failover.md` around lines 326 - 335, Update the Subscribe
RPC to accept a client-side framed message that includes an explicit initial
handshake before ACKs: change the rpc signature to stream
ReplicationClientMessage so the replica can send a ReplicationHandshake (with
term, last_ack_sequence, state enum e.g. SYNCING/REPLICATING) as the first
message followed by ReplicationAck messages; add the ReplicationClientMessage
wrapper (oneof { ReplicationHandshake, ReplicationAck }) and ensure server-side
Subscribe handler validates the handshake fields (term and last_ack_sequence)
and uses them to choose resume point or reject/migrate the replica before
processing subsequent ACKs.

Comment on lines 629 to 650
```
Time    PR1 (primary)              PR2 (replica)         GSLB
─────────────────────────────────────────────────────────────────
T+0     Can't reach PR2            Can't reach PR1       PR1 HEALTHY
T+15s   No ACKs, probes fail
        → Stays ACTIVE             -                     PR1 HEALTHY
        (thinks PR2 dead)
T+30s   -                          failoverTimeout
                                   Probes PR1: fail
                                   → STANDBY             PR2 HEALTHY

Both HEALTHY! GSLB arbitrates.

Route53 PRIMARY/SECONDARY:
→ Always returns PR1 (PRIMARY wins)
→ Agents go to PR1
→ PR2 sits idle in STANDBY

When partition heals:
→ PR2 sees PR1 is ACTIVE
→ PR2 reconnects as replica
```
Contributor


🛠️ Refactor suggestion | 🟠 Major

Elevate GSLB PRIMARY/SECONDARY routing to a hard requirement.

Scenario 4 demonstrates that during full partition, both principals become HEALTHY (line 640). The resolution relies on:

Route53 PRIMARY/SECONDARY:
→ Always returns PR1 (PRIMARY wins)

This GSLB behavior is critical for correctness, not just a deployment option. Without PRIMARY/SECONDARY routing (or equivalent tie-breaking), agents could be load-balanced across both principals during a partition, causing split-brain at the application level.

Move this GSLB configuration requirement to:

  1. Design Decisions table (lines 11-17) as a new row
  2. Requirements section (lines 910-917) with MUST language

Example addition to requirements:

| GSLB tie-breaking | MUST use PRIMARY/SECONDARY or equivalent | Prevents dual-active during full partition |

In the GSLB setup section (lines 906-974), add a warning:

⚠️ **Critical**: GSLB MUST implement deterministic tie-breaking (PRIMARY/SECONDARY, weighted routing with weights 100:0, or geoproximity with strict priority). Simple round-robin or least-connections will cause split-brain during full network partition.
🧰 Tools
🪛 markdownlint-cli2 (0.18.1)

629-629: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
In `@docs/proposals/ha-failover.md` around lines 629 - 650, The document currently
treats GSLB PRIMARY/SECONDARY routing as optional but the review shows it's a
hard correctness requirement; update the Design Decisions table by adding a new
row titled "GSLB tie-breaking" with the decision "MUST use PRIMARY/SECONDARY or
equivalent" and rationale "Prevents dual-active during full partition", update
the Requirements section to include a MUST requirement for deterministic GSLB
tie-breaking (use PRIMARY/SECONDARY, weighted 100:0, or strict geoproximity)
using MUST language, and add a prominent warning in the GSLB setup section
stating that deterministic tie-breaking is critical (no
round-robin/least-connections) and listing acceptable options so operators
cannot deploy without a tie-breaker.

Comment on lines 680 to 699
```
Agent                          Principal
─────────────────────────────────────────────
1. Connect to GSLB endpoint
   (DNS resolves to healthy principal)

2. gRPC handshake
   AgentHello{term: <last known or 0>}
                              ──► If term matches: OK
                              ──► If term stale: FailedPrecondition

3. On FailedPrecondition:
   - Clear cached term
   - Reconnect (get new term from principal)

4. On disconnect (principal fenced/failed):
   - Backoff + retry
   - DNS may resolve to different principal
   - New handshake gets new term
```
Contributor


🛠️ Refactor suggestion | 🟠 Major

Term persistence is required, not optional, to prevent stale term acceptance.

The handshake flow shows term validation (lines 688-689), but doesn't handle the case where the agent's term is newer than the principal's term.

Scenario:

  1. Principal is ACTIVE with term=5
  2. Agent connects, caches term=5
  3. Principal crashes and restarts
  4. If term is not persisted, principal starts with term=1 (or 0)
  5. Agent reconnects with cached term=5
  6. Principal with term=1 accepts the connection (agent's term doesn't match, but principal has no way to know if 5 is stale or valid)

Risk: The principal might accept connections from agents carrying state from a future term it hasn't reached yet, causing state inconsistency.

  1. Make term persistence mandatory, not optional (resolve Open Question #4 at line 1108)
  2. Add validation: Principal should reject connections where agent_term > principal_term with a specific error like FAILED_PRECONDITION: agent term too new, principal restarted
  3. Agent behavior: On "term too new" error, agent should clear its cached term and reconnect

Update the handshake flow:

 2. gRPC handshake
    AgentHello{term: <last known or 0>}
-                              ──► If term matches: OK
-                              ──► If term stale: FailedPrecondition
+                              ──► If term matches: OK
+                              ──► If agent_term < principal_term: FailedPrecondition (stale)
+                              ──► If agent_term > principal_term: FailedPrecondition (principal restarted)

As per the design, addressing Open Question #4 should state that term MUST be persisted.
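
A minimal sketch of that check on the principal side could look like the snippet below (validateAgentTerm is a hypothetical helper; only the grpc-go status/codes packages are real).

```go
package ha

import (
    "google.golang.org/grpc/codes"
    "google.golang.org/grpc/status"
)

// validateAgentTerm compares the term from AgentHello against the
// principal's persisted term (persistence is assumed mandatory here).
func validateAgentTerm(agentTerm, principalTerm uint64) error {
    switch {
    case agentTerm > principalTerm:
        // Principal restarted with a regressed term: refuse so the agent
        // clears its cached term and re-handshakes.
        return status.Error(codes.FailedPrecondition, "agent term too new, principal restarted")
    case agentTerm != 0 && agentTerm < principalTerm:
        // Agent carries a term from an old leader.
        return status.Error(codes.FailedPrecondition, "agent term is stale")
    default:
        return nil
    }
}
```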

🧰 Tools
🪛 markdownlint-cli2 (0.18.1)

680-680: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
In `@docs/proposals/ha-failover.md` around lines 680 - 699, Make term persistence
mandatory (resolve Open Question `#4`) and update the handshake: when receiving
AgentHello include a strict validation in the principal's handshake logic to
compare agent_term vs principal_term and return FAILED_PRECONDITION with a clear
message like "agent term too new, principal restarted" if agent_term >
principal_term; ensure the principal refuses the connection in that case. On the
agent side, update the gRPC handshake retry handling so that on this specific
"term too new" error the agent clears its cached term and reconnects to fetch
the new principal term (same reconnection/backoff flow applies). Also document
that term MUST be persisted by the principal so it cannot regress after restart.

Comment on lines 798 to 834
### Failback Command

```bash
# Check current HA status
$ argocd-agent ha status
Region A (principal.region-a.internal):
  State: STANDBY
  Role: preferred-primary
  Term: 2
  Replication: synced (lag: 0s)
  Last event: 2024-01-15T10:30:00Z

Region B (principal.region-b.internal):
  State: ACTIVE
  Role: preferred-replica
  Term: 2
  Connected agents: 47
  Uptime: 2h 15m

# Trigger failback (requires confirmation)
$ argocd-agent ha failback --to region-a

WARNING: This will:
  1. Promote Region A to primary (term → 3)
  2. Demote Region B to replica
  3. Cause all 47 agents to reconnect

Region A replication status: SYNCED (lag: 0s)

Proceed? [y/N]: y

[1/3] Signaling Region B to become replica... done
[2/3] Signaling Region A to become primary (term=3)... done
[3/3] Waiting for agents to reconnect... 47/47 connected

Failback complete. Region A is now primary (term=3).
```
Contributor


🛠️ Refactor suggestion | 🟠 Major

Ensure failback command is atomic and validates prerequisites.

The failback command (lines 817-834) sends signals to both regions:

[1/3] Signaling Region B to become replica... done
[2/3] Signaling Region A to become primary (term=3)... done

Risk: If network partition between regions is still present or intermittent:

  • Step [1/3] might fail silently or timeout
  • Step [2/3] succeeds
  • Both regions think they're primary → split-brain

Add prerequisites validation before execution:

 $ argocd-agent ha failback --to region-a
 
+Validating prerequisites...
+✓ Region A is reachable
+✓ Region B is reachable
+✓ Region A ↔ Region B connectivity OK
+✓ Region A replication lag: 0s (SYNCED)
+
 WARNING: This will:
   1. Promote Region A to primary (term → 3)
   2. Demote Region B to replica
   3. Cause all 47 agents to reconnect
 
 Region A replication status: SYNCED (lag: 0s)
 
 Proceed? [y/N]: y

If any prerequisite fails, abort with clear error. This prevents operator from executing failback during ongoing partition or connectivity issues.

Additionally, the command should:

  • Execute as a transaction (both regions agree before state changes)
  • Have a timeout and rollback mechanism if step [2/3] fails after [1/3] succeeds
  • Verify the new primary (Region A) receives ACKs from new replica (Region B) before completing
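
One possible shape for that flow, simplified to a single demote/promote step with rollback rather than a full prepare/commit protocol, is sketched below; every interface and method name is hypothetical, since no such CLI or RPCs exist yet.

```go
package ha

import (
    "context"
    "fmt"
    "time"
)

// RegionControl abstracts the per-region control API assumed by this sketch.
type RegionControl interface {
    Name() string
    Ping(ctx context.Context) error
    ReplicationLag(ctx context.Context) (time.Duration, error)
    Demote(ctx context.Context) error  // become replica
    Promote(ctx context.Context) error // become primary
    WaitForReplicaAcks(ctx context.Context, replica string) error
}

// failback demotes b and promotes a, with preflight checks and a rollback
// path so a failed promotion does not leave both regions passive.
func failback(ctx context.Context, a, b RegionControl) error {
    // Preflight: abort unless both regions are reachable and A is synced.
    for _, r := range []RegionControl{a, b} {
        if err := r.Ping(ctx); err != nil {
            return fmt.Errorf("preflight: %s unreachable: %w", r.Name(), err)
        }
    }
    if lag, err := a.ReplicationLag(ctx); err != nil || lag > 0 {
        return fmt.Errorf("preflight: %s not synced (lag=%v, err=%v)", a.Name(), lag, err)
    }
    if err := b.Demote(ctx); err != nil {
        return fmt.Errorf("demote %s: %w", b.Name(), err)
    }
    if err := a.Promote(ctx); err != nil {
        if rbErr := b.Promote(ctx); rbErr != nil {
            return fmt.Errorf("promote %s failed (%v); rollback of %s also failed: %w", a.Name(), err, b.Name(), rbErr)
        }
        return fmt.Errorf("promote %s failed, %s restored as primary: %w", a.Name(), err, b.Name())
    }
    // Only report success once the new primary is receiving ACKs from b.
    return a.WaitForReplicaAcks(ctx, b.Name())
}
```
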
🤖 Prompt for AI Agents
In `@docs/proposals/ha-failover.md` around lines 798 - 834, Add preflight checks
and make the "ha failback" command atomic by validating prerequisites (cluster
connectivity, replication synced, quorum, and agent connectivity) before
executing the two-step signal sequence shown in the failback example; implement
a two-phase coordination where both regions agree to the state change
(prepare/commit), include configurable timeouts, explicit error returns on any
network/partition failures, and a rollback path that reverts any partial changes
if the commit to Region A (promote) fails after demoting Region B, and ensure
the command waits for and verifies ACKs from Region B (replica) to Region A (new
primary) before reporting success.

Adds alternative to the self-fencing design (Option A) that eliminates
split-brain complexity by making all promotion decisions external.

Two modes:
- Manual: operator runs promote/demote + updates DNS. Works with any
 DNS provider. No automatic failover, no fencing, no terms.
- Coordinator: pluggable interface (Route53 ARC, etc.) polls external
 authority to determine active principal.

Reduces state machine from 7 states to 4, removes heartbeat/ACK
protocol, self-fencing, and term/epoch system entirely.

---

## Option B: Simplified Operations Model (Recommended)
Contributor Author


@jannfis I think self-fencing, terms/epochs, etc. got too complex. If we rely on external promotion decisions (manual or something like route53 ARC), the principal logic is simpler.
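
To make that concrete, the per-principal logic under a coordinator mode could shrink to something like the sketch below (the Coordinator interface and the polling interval are assumptions, not an existing API).

```go
package ha

import (
    "context"
    "time"
)

// Coordinator abstracts the external authority (e.g. a Route53 ARC routing
// control) that decides which principal should be active.
type Coordinator interface {
    ActivePrincipal(ctx context.Context) (string, error)
}

// followCoordinator polls the authority and converges the local principal
// to its decision: no terms, fencing, or heartbeat protocol required.
func followCoordinator(ctx context.Context, c Coordinator, self string, promote, demote func()) {
    tick := time.NewTicker(10 * time.Second)
    defer tick.Stop()
    for {
        select {
        case <-ctx.Done():
            return
        case <-tick.C:
            active, err := c.ActivePrincipal(ctx)
            if err != nil {
                continue // keep current state while the authority is unreachable
            }
            if active == self {
                promote()
            } else {
                demote()
            }
        }
    }
}
```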

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@docs/proposals/ha-failover.md`:
- Around line 780-783: The document has an inconsistency: the failback timeline
shows a recovered node in State: STANDBY with Health: UNHEALTHY, but the
health-check table declares STANDBY as HEALTHY; update the docs to make these
consistent by choosing one behavior and applying it everywhere (e.g., if
failback requires operator promotion, change the health-check table entry for
STANDBY to UNHEALTHY or introduce a distinct state like STANDBY_SYNCED_NOT_READY
and mark that UNHEALTHY), and add a short note in the failback sequence (the
"Caught up!" step) referencing the chosen rule (preferredRole/manual-ready flag)
so readers know when GSLB may start routing to STANDBY.
- Around line 1130-1151: The Promote() logic must treat an unreachable peer as a
blocking condition to avoid split-brain: update Controller.Promote to consider
err != nil from checkPeerStatus() as a fatal precondition (reject promotion with
a clear error referencing c.options.PeerAddress) unless the caller passed force;
keep the existing force behavior (log a warning and proceed), and still perform
c.transitionTo(StateActive) only after the reachable-peer check (use
checkPeerStatus(), Promote(), transitionTo(), and options.PeerAddress to locate
the change).
🧹 Nitpick comments (2)
docs/proposals/ha-failover.md (2)

11-17: Design Decisions table only reflects Option A.

The table at the top of the document describes "Self-fencing on lost peer ACK" as the split-brain prevention mechanism, but Option B (lines 1083–1321) proposes removing self-fencing entirely in favor of external authority-based promotion. This creates confusion about which design is being proposed.

Recommendation: Either:

  1. Add a note above this table: "Note: This table describes Option A. See Option B starting at line 1083 for an alternative simplified model."
  2. Create two separate tables (one for Option A, one for Option B)
  3. Move Option B to a separate document if it's a competing proposal

This will help readers understand they're evaluating two distinct architectures.


338-340: Consider adding term to SnapshotRequest for consistency.

The SnapshotRequest message only includes since_sequence_num, but no term field. This means when a principal calls GetSnapshot, the serving principal cannot validate whether the requester is operating on a stale term.

Scenario:

  1. PR1 is ACTIVE with term=5
  2. PR1 crashes and comes back with term=1 (if term not persisted)
  3. PR2 is now ACTIVE with term=6
  4. PR1 calls GetSnapshot with since_sequence_num from its old state
  5. PR2 has no way to reject this stale request

While this might auto-correct during subsequent replication, explicitly validating term at snapshot time would catch stale principals earlier.

Recommendation:

 message SnapshotRequest {
     uint64 since_sequence_num = 1;
+    uint64 term = 2;
 }

And in the implementation, reject requests with term < current term.

Note: This is closely related to the term persistence issue (Open Question #4, line 1332, already flagged in past comments). If term is persisted, this becomes less critical.

Comment on lines 1130 to 1151
### Safety: Promote Refuses if Peer is ACTIVE

The `promote` command checks peer state before allowing promotion. This prevents the most common operator error (promoting without demoting the other side):

```go
func (c *Controller) Promote(ctx context.Context, force bool) error {
    if c.state == StateActive {
        return fmt.Errorf("already active")
    }

    peer, err := c.checkPeerStatus(ctx)
    if err == nil && peer.State == StateActive {
        if !force {
            return fmt.Errorf("peer is ACTIVE at %s — demote peer first, or use --force", c.options.PeerAddress)
        }
        log().Warn("Force-promoting while peer is ACTIVE")
    }

    c.transitionTo(StateActive)
    return nil
}
```
Contributor


⚠️ Potential issue | 🔴 Critical

Option B: Safety check cannot prevent stale principal during partition.

The Promote() safety check (lines 1140-1146) prevents promoting when the peer is reachable and ACTIVE, but doesn't handle the partition case where the old primary is unreachable:

Scenario:

  1. PR1 is ACTIVE, serving agents
  2. Network partition: PR1 and PR2 cannot reach each other
  3. Operator runs promote on PR2
  4. Line 1140: checkPeerStatus() fails with error (peer unreachable)
  5. Line 1141: err == nil is false, so the peer.State == StateActive check is skipped
  6. Line 1148: PR2 transitions to ACTIVE
  7. Both PR1 and PR2 are now ACTIVE → split-brain

Since Option B removes terms/epochs (line 1096), there's no mechanism to reject stale connections. If agents reconnect to PR1 after the partition heals, PR1 has no way to know it's been demoted.

Possible solutions:

  1. Require reachable peer for promotion: Change line 1141 to reject if err != nil (peer must be reachable to confirm it's not ACTIVE)
  2. Add back minimal term validation even in Option B: Increment a version on promotion, include in agent handshakes
  3. Document as known limitation: State that Option B requires manual verification that old primary is truly dead before promoting

Recommend addressing this before Option B can be considered a complete alternative.
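
A sketch of that stricter check, reusing the Controller snippet quoted above (imports and type definitions as in the proposal; still illustrative, not a final implementation):

```go
func (c *Controller) Promote(ctx context.Context, force bool) error {
    if c.state == StateActive {
        return fmt.Errorf("already active")
    }

    peer, err := c.checkPeerStatus(ctx)
    switch {
    case err != nil && !force:
        // Cannot confirm the peer is not ACTIVE; refuse rather than risk split-brain.
        return fmt.Errorf("peer %s unreachable (%v): verify it is down or demoted, or use --force", c.options.PeerAddress, err)
    case err != nil && force:
        log().Warn("Force-promoting with unreachable peer")
    case peer.State == StateActive && !force:
        return fmt.Errorf("peer is ACTIVE at %s: demote peer first, or use --force", c.options.PeerAddress)
    case peer.State == StateActive && force:
        log().Warn("Force-promoting while peer is ACTIVE")
    }

    c.transitionTo(StateActive)
    return nil
}
```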

🤖 Prompt for AI Agents
In `@docs/proposals/ha-failover.md` around lines 1130 - 1151, The Promote() logic
must treat an unreachable peer as a blocking condition to avoid split-brain:
update Controller.Promote to consider err != nil from checkPeerStatus() as a
fatal precondition (reject promotion with a clear error referencing
c.options.PeerAddress) unless the caller passed force; keep the existing force
behavior (log a warning and proceed), and still perform
c.transitionTo(StateActive) only after the reachable-peer check (use
checkPeerStatus(), Promote(), transitionTo(), and options.PeerAddress to locate
the change).

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@docs/proposals/ha-failover.md`:
- Around line 21-52: The markdown file docs/proposals/ha-failover.md contains
multiple fenced code blocks (ASCII diagrams and console transcripts) without a
language tag, triggering markdownlint; update each triple-backtick fence that
encloses the diagrams/transcripts (e.g., the Global DNS/GSLB ASCII diagram
block, the RECOVERING/ACTIVE state diagram, timeline blocks starting "T+0s", and
the argocd-agentctl/console snippets) to use a language tag such as "text"
(change ``` to ```text) so linting and readability are improved while preserving
content.

Comment on lines +21 to +52
```
┌─────────────────────────────────────────────────────────────────┐
│                        Global DNS / GSLB                         │
│                  principal.argocd.example.com                    │
│         Health checks: /healthz (200 only when ACTIVE)           │
└──────────────┬──────────────────────────────┬───────────────────┘
               │                              │
               ▼                              ▼
    ┌──────────────────────┐       ┌──────────────────────┐
    │ REGION A (Primary)   │       │ REGION B (Replica)   │
    │                      │       │                      │
    │ Principal Server     │◄──────│ Replication Client   │
    │ - gRPC :8403         │       │ - gRPC :8403         │
    │ - Replication :8404  │──────►│ - Mirrors state      │
    │ - /healthz :8080     │       │ - /healthz :8080     │
    │                      │       │                      │
    │ ArgoCD Instance      │       │ ArgoCD Instance      │
    │ (source of truth)    │       │ (standby, receives   │
    │                      │       │  replicated state)   │
    └──────────────────────┘       └──────────────────────┘
               ▲                              ▲
               │    Replication Stream:       │
               │    All events forwarded      │
               │    Primary → Replica         │
               │                              │
        ┌──────┴──────────────────────────────┴─────────┐
        │                Remote Clusters                │
        │   ┌─────────┐   ┌─────────┐   ┌─────────┐     │
        │   │ Agent 1 │   │ Agent 2 │   │ Agent N │     │
        │   └─────────┘   └─────────┘   └─────────┘     │
        └───────────────────────────────────────────────┘
```
Contributor


⚠️ Potential issue | 🟡 Minor

Add languages to fenced code blocks.

markdownlint warns on these blocks; adding languages improves readability and linting compliance.
Suggested fixes:

🛠️ Proposed diff
-```
+```text
┌─────────────────────────────────────────────────────────────────┐
...
-```
+```

-```
+```text
RECOVERING ──┬── config=primary & peer not ACTIVE ──→ ACTIVE
...
-```
+```

-```
+```text
T+0s   Primary dies. Replica detects stream break → DISCONNECTED.
...
-```
+```

-```
+```text
$ argocd-agentctl ha demote    # on Region A — drops agents, stops serving
...
-```
+```

-```
+```text
T+0    Old primary restarts → RECOVERING → SYNCING (peer is ACTIVE)
...
-```
+```

Also applies to: 62-72, 135-142, 148-152, 156-164

🧰 Tools
🪛 markdownlint-cli2 (0.20.0)

[warning] 21-21: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
In `@docs/proposals/ha-failover.md` around lines 21 - 52, The markdown file
docs/proposals/ha-failover.md contains multiple fenced code blocks (ASCII
diagrams and console transcripts) without a language tag, triggering
markdownlint; update each triple-backtick fence that encloses the
diagrams/transcripts (e.g., the Global DNS/GSLB ASCII diagram block, the
RECOVERING/ACTIVE state diagram, timeline blocks starting "T+0s", and the
argocd-agentctl/console snippets) to use a language tag such as "text" (change
``` to ```text) so linting and readability are improved while preserving
content.
