proposal for ha failover design #723
Conversation
CodeRabbit walkthrough: adds a new proposal document at docs/proposals/ha-failover.md (documentation-only change, no runtime or control-flow code). Estimated code review effort: 2 (Simple), ~10 minutes. Pre-merge checks: 3 passed.
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In `@docs/proposals/ha-failover.md`:
- Around line 190-258: Add a security section describing how principals
authenticate and authorize each other for the Replication service (service
Replication, rpc Subscribe, rpc GetSnapshot, messages ReplicatedEvent and
ReplicationSnapshot) by requiring mutual TLS for all replication connections,
documenting certificate issuance, rotation/expiry handling and trust roots, and
specifying an authorization model (e.g., agent identity in cert CN/SAN mapped to
allowed replicas/agents and per-peer ACLs or RBAC) plus recommendations for
logging and auditing of failed auth attempts; include guidance on secret
storage, automated rotation, and fallback/renewal behavior to ensure continuous
authenticated replication (a minimal mTLS sketch is shown after this list).
- Around line 666-667: The "Replication Lag Monitoring" section is empty; please
add guidance that defines the key metrics to expose (replication lag time,
pending event count, last replicated sequence/LSN), recommends alerting
thresholds (e.g., warning at X seconds/minutes and critical at Y, or pending
events > N), and provides troubleshooting steps (check network/CPU/IO, review
producer/consumer backpressure, inspect WAL/streaming errors, and steps to
resync or promote replicas). Reference the section header "Replication Lag
Monitoring" so the new content sits directly beneath it and use concrete
examples/thresholds and commands/log locations relevant to your HA
implementation.
- Line 3: Fix the typo in the AI disclosure line: locate the string "_ai
disclosure: diagrams and tables were formatted/generatd with ai_" and change
"generatd" to "generated" so the line reads "_ai disclosure: diagrams and tables
were formatted/generated with ai_" (or keep the original spacing/punctuation
style) to correct the misspelling.
- Around line 703-706: Replace the four broken relative links ([argocd-agent
Architecture], [Event System], [Queue Implementation], [Resync Protocol]) in the
proposal with TODO placeholders noting the implementation/link will be added
later; specifically remove or replace the paths "../README.md",
"../internal/event/event.go", "../internal/queue/queue.go",
"../internal/resync/resync.go" with explanatory TODO text (e.g., "TODO: link to
implementation or API docs when available") so the proposal no longer references
non-documentation files and the pipeline validation passes.
🧹 Nitpick comments (1)
docs/proposals/ha-failover.md (1)
19-67: Optional: Add a language specifier to code blocks. Consider adding `text` or `ascii` as the language specifier for ASCII diagram blocks to satisfy markdown linters. This applies to several diagram blocks throughout the document (lines 19-67, 81-123, 268-314, 344-370, 391-419).
📋 Example fix: add the `text` tag to the opening fence of the Global DNS / GSLB diagram block.
docs/proposals/ha-failover.md (outdated excerpt):

- [argocd-agent Architecture](../README.md)
- [Event System](../internal/event/event.go)
- [Queue Implementation](../internal/queue/queue.go)
- [Resync Protocol](../internal/resync/resync.go)
Fix broken documentation references.
The pipeline validation fails because these relative paths point to code files, not documentation. Since this is a proposal document, these references should either:
- Use repository URLs to link to code (if already merged)
- Reference API documentation (if generated docs exist)
- Be removed or marked as TODO (if the code doesn't exist yet)
For a design proposal, option 3 is most appropriate. These can be added when the implementation is complete.
🔧 Proposed fix (mark as TODO for future implementation)
## References
-- [argocd-agent Architecture](../README.md)
-- [Event System](../internal/event/event.go)
-- [Queue Implementation](../internal/queue/queue.go)
-- [Resync Protocol](../internal/resync/resync.go)
+- argocd-agent Architecture (see main README.md)
+- Event System implementation (TODO: link when implemented)
+- Queue Implementation (TODO: link when implemented)
+- Resync Protocol (TODO: link when implemented)
docs/proposals/ha-failover.md (outdated excerpt):

1. **Replica knows it's a replica** - Won't accept agents while replication stream is healthy
2. **Failover requires replication stream break** - Only then does replica become active
For the split-brain problem, I can imagine the following scenario between Principal 1 (PR1) and Principal 2 (PR2):
- PR1 is primary, PR2 is replica
- PR1 isn't able to communicate with PR2 due to a network failure/issue/hiccup
- Replication stream breaks
- PR2, not receiving any replication data, assumes the primary role
- PR2, still able to send to (but not receive from) PR1, sends replication data to PR1
- PR1 still assumes it is primary - split brain

I do not currently have any idea how to solve this without a quorum (i.e. a third principal or external observer), but that would inherently increase complexity.
Yeah, this is something I have been trying to grapple with. I think we can prevent split-brain (at the expense of availability) using self-fencing & requiring peer ACKs on replication events.
The primary must continuously prove the replica is listening; if there are no ACKs, the principal self-fences.
PR1 (primary):
- Sends heartbeats every 5s on replication stream
- Expects ACKs from PR2
- No ACK for 15s → FENCED state
PR2 (replica):
- No heartbeat/events for 30s → promotes to STANDBY
- Trusts that PR1 will self-fence
I also think there could be a pluggable interface to allow vendor solutions like Route53 ARC to act as an external witness and help with fencing and coordination; a rough interface sketch is below.
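A rough sketch of what such a pluggable witness interface might look like in Go. The method set and names are illustrative only, not a settled API.

```go
package ha

import "context"

// TrafficAuthority abstracts an external witness/coordinator such as
// Route53 Application Recovery Controller. Instead of principals deciding
// promotion on their own, they consult the authority.
type TrafficAuthority interface {
	// ActivePrincipal returns the identifier of the principal that the
	// external authority currently designates as active.
	ActivePrincipal(ctx context.Context) (string, error)

	// ReportState lets a principal publish its own state (ACTIVE, FENCED,
	// REPLICATING, ...) so the authority and operators can observe it.
	ReportState(ctx context.Context, principalID, state string) error
}
```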
A new state, FENCED, would allow for something like:
ACTIVE ──(no ACKs for ackTimeout)──► FENCED
│
├─ /healthz: UNHEALTHY
├─ reject new agent connections
├─ terminate existing agent sessions
│
▼
(peer reconnects as primary)
│
▼
SYNCING
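A minimal Go sketch of the primary-side watchdog described above (heartbeat every 5s, fence after 15s without an ACK). The callback hooks are assumptions, not existing argocd-agent APIs.

```go
package ha

import (
	"context"
	"sync"
	"time"
)

// FencingWatchdog fences the primary when the replica stops acknowledging
// replication heartbeats/events. Intervals mirror the numbers above.
type FencingWatchdog struct {
	HeartbeatInterval time.Duration // e.g. 5 * time.Second
	AckTimeout        time.Duration // e.g. 15 * time.Second

	SendHeartbeat func(ctx context.Context) error // push a heartbeat on the replication stream
	Fence         func()                          // ACTIVE → FENCED: /healthz unhealthy, drop agent sessions

	mu      sync.Mutex
	lastAck time.Time
}

// OnAck is called by the replication stream handler whenever the replica ACKs.
func (w *FencingWatchdog) OnAck(t time.Time) {
	w.mu.Lock()
	w.lastAck = t
	w.mu.Unlock()
}

func (w *FencingWatchdog) sinceLastAck() time.Duration {
	w.mu.Lock()
	defer w.mu.Unlock()
	return time.Since(w.lastAck)
}

// Run drives the heartbeat loop and transitions to FENCED on ACK timeout.
func (w *FencingWatchdog) Run(ctx context.Context) {
	w.OnAck(time.Now()) // start the clock on entering ACTIVE
	ticker := time.NewTicker(w.HeartbeatInterval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			_ = w.SendHeartbeat(ctx) // stream errors surface as missing ACKs
			if w.sinceLastAck() > w.AckTimeout {
				w.Fence()
				return // leaving FENCED goes through the recovery path, not this loop
			}
		}
	}
}
```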
Failure Scenarios
Scenario 1: PR2 Crashes (Clean Failure)
Time PR1 (primary) PR2 (replica) Agents
─────────────────────────────────────────────────────────────────
T+0 Serving agents Dies Connected to PR1
T+5s Heartbeat fails - Still on PR1
T+10s No ACKs... - Still on PR1
T+15s FENCED - Disconnected
/healthz UNHEALTHY
outage - no healthy principal
I believe this is unavoidable with two nodes and no external witness: PR1 can't distinguish “PR2 is dead” from “PR2 is partitioned”. To prevent split-brain, PR1 must always fence on lost ACKs, even if that causes a brief outage. An external witness / something like ARC can help in this failure mode.
Scenario 2: Asymmetric Partition (PR1 → PR2 Broken)
Time PR1 (primary) PR2 (replica) DNS/GSLB
─────────────────────────────────────────────────────────────────
T+0 Can't reach PR2 Replication breaks PR1 healthy
Heartbeats fail Starts 30s timer
T+15s No ACKs → FENCED Waiting... PR1 UNHEALTHY
Terminates agents PR2 healthy
T+30s - Promotes → STANDBY Routes to PR2
T+45s - Agent connects
- → ACTIVE Stable on PR2
Result: Clean failover, no split-brain. PR1 fenced before PR2 promoted.
Scenario 3: Full Partition (Neither Can Reach Other)
Time PR1 (primary) PR2 (replica) DNS/GSLB
─────────────────────────────────────────────────────────────────
T+0 Can't reach PR2 Can't receive Depends on
No ACKs from PR1 GSLB's view
T+15s Probes PR2: unreachable Waiting...
Stays ACTIVE
T+30s Still ACTIVE Probes PR1:
unreachable
Promotes → STANDBY
Both think peer is dead - potential dual-active
Mitigation relies on DNS/GSLB. Even if both are ACTIVE-capable:
- GSLB health checks determine who gets traffic
- Term/epoch rejects stale updates
- When partition heals, one demotes based on term comparison
Scenario 4: Primary Network Blip (Transient)
Time PR1 (primary) PR2 (replica)
─────────────────────────────────────────────────────────
T+0 Network hiccup Replication pauses
T+5s Still no ACKs Waiting...
T+10s Network recovers Receives heartbeat
Receives ACK Sends ACK
Stays ACTIVE Stays REPLICATING
Result: No state change. ackTimeout (15s) > typical network blip.
Proposes a self-fencing mechanism: the primary must receive ACKs from the replica, or fence itself to prevent dual-active principals during partitions.
- Add FENCED state to HA state machine with recovery logic
- Add term/epoch for rejecting stale updates from old leaders
- Add bidirectional heartbeat on replication stream
- Add failure scenario analysis (5 scenarios documented)
- Add agent behavior section: term in handshake, reconnect on stale
- Add optional TrafficAuthority interface for external witnesses (ARC)
- Update proto with heartbeat message and term fields
- Update configuration with new timeouts (ackTimeout, recoveryTimeout)
Actionable comments posted: 7
🤖 Fix all issues with AI agents
In `@docs/proposals/ha-failover.md`:
- Around line 629-650: The document currently treats GSLB PRIMARY/SECONDARY
routing as optional but the review shows it's a hard correctness requirement;
update the Design Decisions table by adding a new row titled "GSLB tie-breaking"
with the decision "MUST use PRIMARY/SECONDARY or equivalent" and rationale
"Prevents dual-active during full partition", update the Requirements section to
include a MUST requirement for deterministic GSLB tie-breaking (use
PRIMARY/SECONDARY, weighted 100:0, or strict geoproximity) using MUST language,
and add a prominent warning in the GSLB setup section stating that deterministic
tie-breaking is critical (no round-robin/least-connections) and listing
acceptable options so operators cannot deploy without a tie-breaker.
- Around line 461-470: Clarify the term increment semantics: state that
automatic promotion (failover) increments term when a STANDBY node transitions
to ACTIVE upon the first agent connect, whereas manual failback increments term
when the operator issues the command (referenced as the "argocd-agent ha
failback" / "become primary" step); update the STANDBY row in the table (the
"STANDBY: Accept → Transition to ACTIVE (term++)" text) to explicitly state this
distinction or modify the manual failback sequence example to show that the
operator command performs the term++ and then transitions state to ACTIVE so
both descriptions are consistent.
- Around line 326-335: Update the Subscribe RPC to accept a client-side framed
message that includes an explicit initial handshake before ACKs: change the rpc
signature to stream ReplicationClientMessage so the replica can send a
ReplicationHandshake (with term, last_ack_sequence, state enum e.g.
SYNCING/REPLICATING) as the first message followed by ReplicationAck messages;
add the ReplicationClientMessage wrapper (oneof { ReplicationHandshake,
ReplicationAck }) and ensure server-side Subscribe handler validates the
handshake fields (term and last_ack_sequence) and uses them to choose resume
point or reject/migrate the replica before processing subsequent ACKs.
- Around line 539-563: The unfencing logic in shouldUnfence currently returns
true based only on gslbResolvesToMe() || gslbHasNoHealthyTarget(), which risks
split-brain and references an undefined gslbHasNoHealthyTarget; update
shouldUnfence to (1) implement or remove gslbHasNoHealthyTarget() and (2) add an
explicit peer liveness/state probe (e.g., implement a new method like
probePeerIsActive() or reuse existing peer-check utilities) and require that
probePeerIsActive() == false before allowing unfence; ensure you call the probe
from shouldUnfence (alongside gslbResolvesToMe()) so the fenced node only
unfences if DNS conditions are met AND the peer is confirmed not ACTIVE.
- Around line 798-834: Add preflight checks and make the "ha failback" command
atomic by validating prerequisites (cluster connectivity, replication synced,
quorum, and agent connectivity) before executing the two-step signal sequence
shown in the failback example; implement a two-phase coordination where both
regions agree to the state change (prepare/commit), include configurable
timeouts, explicit error returns on any network/partition failures, and a
rollback path that reverts any partial changes if the commit to Region A
(promote) fails after demoting Region B, and ensure the command waits for and
verifies ACKs from Region B (replica) to Region A (new primary) before reporting
success.
- Around line 680-699: Make term persistence mandatory (resolve Open Question
`#4`) and update the handshake: when receiving AgentHello include a strict
validation in the principal's handshake logic to compare agent_term vs
principal_term and return FAILED_PRECONDITION with a clear message like "agent
term too new, principal restarted" if agent_term > principal_term; ensure the
principal refuses the connection in that case. On the agent side, update the
gRPC handshake retry handling so that on this specific "term too new" error the
agent clears its cached term and reconnects to fetch the new principal term
(same reconnection/backoff flow applies). Also document that term MUST be
persisted by the principal so it cannot regress after restart.
- Around line 82-92: The self-fencing section is ambiguous about which endpoint
probePeerHealth() checks; update the text to explicitly state that
probePeerHealth() probes the peer's HTTP health endpoint (port 8080, /healthz)
rather than the replication port (8404). Change the three decision bullets (No
ACK for ackTimeout, peer unreachable, peer reachable but not ACKing) to
reference probing :8080/healthz so readers implement the probe correctly and
avoid confusing replication-port failures with a dead peer (a probe/unfence sketch follows this list).
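A minimal sketch of the peer probe and unfence gating described in the last two items above. The endpoint (:8080/healthz) and the probePeerIsActive name follow the comment's wording; the rest is illustrative.

```go
package ha

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

// probePeerIsActive checks the peer's HTTP health endpoint (:8080/healthz),
// not the replication port (:8404). A 200 means the peer considers itself
// ACTIVE, because /healthz only returns 200 in the ACTIVE state.
func probePeerIsActive(ctx context.Context, peerHost string) (bool, error) {
	url := fmt.Sprintf("http://%s:8080/healthz", peerHost)
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return false, err
	}
	resp, err := (&http.Client{Timeout: 2 * time.Second}).Do(req)
	if err != nil {
		return false, err // unreachable: caller decides how to treat this
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK, nil
}

// shouldUnfence gates leaving FENCED: DNS must resolve to this principal
// AND the peer must be confirmed not ACTIVE. If the peer cannot be probed,
// stay fenced rather than risk dual-active.
func shouldUnfence(ctx context.Context, gslbResolvesToMe func(context.Context) bool, peerHost string) bool {
	if !gslbResolvesToMe(ctx) {
		return false
	}
	active, err := probePeerIsActive(ctx, peerHost)
	if err != nil {
		return false
	}
	return !active
}
```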
🧹 Nitpick comments (1)
docs/proposals/ha-failover.md (1)
198-208: Document operational considerations for replication bandwidth and backpressure. The replication design sends ALL events across ALL agents (line 151) with full CloudEvents payloads. Missing considerations:
Bandwidth & Scalability:
- How much bandwidth is required for N agents? (estimates/benchmarks)
- Is cross-region bandwidth cost acceptable?
- Are there compression or batching strategies?
Backpressure:
- What happens if the replica falls behind (network slowdown, resource constraints)?
- Does the primary buffer events? For how long?
- Is there a circuit breaker to fence the primary if replication lag exceeds a threshold?
Memory:
- In-memory buffering of pending events could grow unbounded
- Need limits and overflow handling
These operational concerns directly impact the "1-5 minute RTO" design goal and could cause split-brain if not handled correctly (e.g., primary continues serving while replica has stale state).
Consider documenting:
- Expected replication bandwidth per agent
- Maximum tolerable replication lag before fencing
- Backpressure strategy (block writes? shed load? fence?)
- Memory limits for replication buffers (a bounded-buffer sketch is shown below)
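As a starting point for the backpressure and memory discussion, one possible shape for a bounded pending-events buffer is sketched below. The limits, names, and the fence-on-overflow policy are assumptions for illustration, not a recommendation from the proposal.

```go
package ha

import (
	"errors"
	"time"
)

// ErrReplicationLag signals that the replica is too far behind and the
// primary should fence (or otherwise shed load) instead of buffering forever.
var ErrReplicationLag = errors.New("replication lag exceeds threshold")

// ReplicationBuffer holds events pending ACK from the replica, with hard
// limits on count and age so memory cannot grow unbounded.
// A real implementation would also need synchronization (mutex/atomics).
type ReplicationBuffer struct {
	MaxPending int           // e.g. 10_000 events
	MaxLag     time.Duration // e.g. 30 * time.Second

	pending []pendingEvent
}

type pendingEvent struct {
	seq      uint64
	enqueued time.Time
	payload  []byte // serialized CloudEvent
}

// Enqueue buffers an event for replication, rejecting when limits are hit.
func (b *ReplicationBuffer) Enqueue(seq uint64, payload []byte) error {
	if len(b.pending) >= b.MaxPending {
		return ErrReplicationLag
	}
	if len(b.pending) > 0 && time.Since(b.pending[0].enqueued) > b.MaxLag {
		return ErrReplicationLag
	}
	b.pending = append(b.pending, pendingEvent{seq: seq, enqueued: time.Now(), payload: payload})
	return nil
}

// Ack drops everything up to and including seq once the replica confirms it.
func (b *ReplicationBuffer) Ack(seq uint64) {
	i := 0
	for i < len(b.pending) && b.pending[i].seq <= seq {
		i++
	}
	b.pending = b.pending[i:]
}
```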
docs/proposals/ha-failover.md (outdated excerpt):

    service Replication {
      // Subscribe to replication stream (replica calls this on primary)
      // Bidirectional: primary sends events/heartbeats, replica sends ACKs
      rpc Subscribe(stream ReplicationAck) returns (stream ReplicationStreamMessage);

      // Get initial snapshot (replica calls on connect)
      rpc GetSnapshot(SnapshotRequest) returns (ReplicationSnapshot);

      // Get replication status
      rpc Status(StatusRequest) returns (ReplicationStatus);
🛠️ Refactor suggestion | 🟠 Major
Add initial handshake message to Subscribe RPC for replica state synchronization.
The Subscribe RPC is bidirectional but lacks an initial handshake from the replica. Currently:
rpc Subscribe(stream ReplicationAck) returns (stream ReplicationStreamMessage);

Problem: The primary doesn't know the replica's:
- Current term (to detect term mismatch immediately)
- Last acknowledged sequence number (to resume from correct point)
- State (SYNCING, REPLICATING, etc.)
Risk: After network reconnection, the primary may send events the replica already has or skip events it missed, causing state divergence.
Recommended proto enhancement
+message ReplicationHandshake {
+ uint64 replica_term = 1;
+ uint64 last_acked_sequence_num = 2;
+ string replica_state = 3;
+}
+
+message ReplicaMessage {
+ oneof payload {
+ ReplicationHandshake handshake = 1;
+ ReplicationAck ack = 2;
+ }
+}
+
-rpc Subscribe(stream ReplicationAck) returns (stream ReplicationStreamMessage);
+rpc Subscribe(stream ReplicaMessage) returns (stream ReplicationStreamMessage);

The replica sends the handshake as the first message, then sends ACKs. The primary can validate the term and resume from the correct sequence number.
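To make the handshake-first flow concrete, here is a rough server-side sketch in Go. The message and stream types are hand-written stand-ins for whatever protoc would generate from the suggested proto, so treat all names as illustrative.

```go
package ha

import (
	"errors"
	"fmt"
)

// Stand-ins for generated types from the suggested proto (handshake/ack oneof).
type ReplicationHandshake struct {
	ReplicaTerm          uint64
	LastAckedSequenceNum uint64
	ReplicaState         string // e.g. SYNCING, REPLICATING
}

type ReplicationAck struct{ SequenceNum uint64 }

type ReplicaMessage struct {
	Handshake *ReplicationHandshake // set on the first message only
	Ack       *ReplicationAck
}

// replicaStream abstracts the receive side of the bidirectional stream.
type replicaStream interface {
	Recv() (*ReplicaMessage, error)
}

// acceptReplica validates the mandatory first handshake message and returns
// the sequence number to resume replication from.
func acceptReplica(stream replicaStream, currentTerm uint64) (resumeFrom uint64, err error) {
	first, err := stream.Recv()
	if err != nil {
		return 0, err
	}
	hs := first.Handshake
	if hs == nil {
		return 0, errors.New("protocol error: first message must be a handshake")
	}
	if hs.ReplicaTerm > currentTerm {
		return 0, fmt.Errorf("replica term %d is newer than ours (%d): refusing", hs.ReplicaTerm, currentTerm)
	}
	// Resume streaming from the event after the last one the replica ACKed.
	return hs.LastAckedSequenceNum + 1, nil
}
```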
docs/proposals/ha-failover.md (outdated excerpt):

```
Time    PR1 (primary)            PR2 (replica)        GSLB
─────────────────────────────────────────────────────────────────
T+0     Can't reach PR2          Can't reach PR1      PR1 HEALTHY
T+15s   No ACKs, probes fail
        → Stays ACTIVE           -                    PR1 HEALTHY
        (thinks PR2 dead)
T+30s   -                        failoverTimeout
                                 Probes PR1: fail
                                 → STANDBY            PR2 HEALTHY

Both HEALTHY! GSLB arbitrates.

Route53 PRIMARY/SECONDARY:
  → Always returns PR1 (PRIMARY wins)
  → Agents go to PR1
  → PR2 sits idle in STANDBY

When partition heals:
  → PR2 sees PR1 is ACTIVE
  → PR2 reconnects as replica
```
🛠️ Refactor suggestion | 🟠 Major
Elevate GSLB PRIMARY/SECONDARY routing to a hard requirement.
Scenario 4 demonstrates that during full partition, both principals become HEALTHY (line 640). The resolution relies on:
Route53 PRIMARY/SECONDARY:
→ Always returns PR1 (PRIMARY wins)
This GSLB behavior is critical for correctness, not just a deployment option. Without PRIMARY/SECONDARY routing (or equivalent tie-breaking), agents could be load-balanced across both principals during a partition, causing split-brain at the application level.
Move this GSLB configuration requirement to:
- Design Decisions table (lines 11-17) as a new row
- Requirements section (lines 910-917) with MUST language
Example addition to requirements:
| GSLB tie-breaking | MUST use PRIMARY/SECONDARY or equivalent | Prevents dual-active during full partition |

In the GSLB setup section (lines 906-974), add a warning:
⚠️ **Critical**: GSLB MUST implement deterministic tie-breaking (PRIMARY/SECONDARY, weighted routing with weights 100:0, or geoproximity with strict priority). Simple round-robin or least-connections will cause split-brain during full network partition.
🪛 markdownlint-cli2 (0.18.1)
629-629: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
```
Agent                                Principal
─────────────────────────────────────────────
1. Connect to GSLB endpoint
   (DNS resolves to healthy principal)

2. gRPC handshake
   AgentHello{term: <last known or 0>}
                                 ──► If term matches: OK
                                 ──► If term stale: FailedPrecondition

3. On FailedPrecondition:
   - Clear cached term
   - Reconnect (get new term from principal)

4. On disconnect (principal fenced/failed):
   - Backoff + retry
   - DNS may resolve to different principal
   - New handshake gets new term
```
🛠️ Refactor suggestion | 🟠 Major
Term persistence is required, not optional, to prevent stale term acceptance.
The handshake flow shows term validation (lines 688-689), but doesn't handle the case where the agent's term is newer than the principal's term.
Scenario:
- Principal is ACTIVE with term=5
- Agent connects, caches term=5
- Principal crashes and restarts
- If term is not persisted, principal starts with term=1 (or 0)
- Agent reconnects with cached term=5
- Principal with term=1 accepts the connection (agent's term doesn't match, but principal has no way to know if 5 is stale or valid)
Risk: The principal might accept connections from agents carrying state from a future term it hasn't reached yet, causing state inconsistency.
- Make term persistence mandatory, not optional (resolve Open Question #4 at line 1108)
- Add validation: the principal should reject connections where `agent_term > principal_term` with a specific error like `FAILED_PRECONDITION: agent term too new, principal restarted`
- Agent behavior: on a "term too new" error, the agent should clear its cached term and reconnect
Update the handshake flow:
2. gRPC handshake
AgentHello{term: <last known or 0>}
- ──► If term matches: OK
- ──► If term stale: FailedPrecondition
+ ──► If term matches: OK
+ ──► If agent_term < principal_term: FailedPrecondition (stale)
+ ──► If agent_term > principal_term: FailedPrecondition (principal restarted)

As per the design, addressing Open Question #4 should state that term MUST be persisted.
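A small Go sketch of what the stricter principal-side check could look like, assuming the principal's term is loaded from persistent storage at startup so it cannot regress. The helper name is illustrative.

```go
package ha

import (
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// validateAgentTerm implements the stricter handshake check suggested above.
// principalTerm is assumed to be restored from persistent storage on startup.
func validateAgentTerm(agentTerm, principalTerm uint64) error {
	switch {
	case agentTerm == 0 || agentTerm == principalTerm:
		return nil // first connect, or terms match
	case agentTerm < principalTerm:
		return status.Errorf(codes.FailedPrecondition,
			"agent term %d is stale (principal term %d): reconnect to refresh", agentTerm, principalTerm)
	default: // agentTerm > principalTerm
		return status.Errorf(codes.FailedPrecondition,
			"agent term %d too new (principal term %d): principal restarted without persisted term", agentTerm, principalTerm)
	}
}
```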
docs/proposals/ha-failover.md
Outdated
| ### Failback Command | ||
|
|
||
| ```bash | ||
| # Check current HA status | ||
| $ argocd-agent ha status | ||
| Region A (principal.region-a.internal): | ||
| State: STANDBY | ||
| Role: preferred-primary | ||
| Term: 2 | ||
| Replication: synced (lag: 0s) | ||
| Last event: 2024-01-15T10:30:00Z | ||
|
|
||
| Region B (principal.region-b.internal): | ||
| State: ACTIVE | ||
| Role: preferred-replica | ||
| Term: 2 | ||
| Connected agents: 47 | ||
| Uptime: 2h 15m | ||
|
|
||
| # Trigger failback (requires confirmation) | ||
| $ argocd-agent ha failback --to region-a | ||
|
|
||
| WARNING: This will: | ||
| 1. Promote Region A to primary (term → 3) | ||
| 2. Demote Region B to replica | ||
| 3. Cause all 47 agents to reconnect | ||
|
|
||
| Region A replication status: SYNCED (lag: 0s) | ||
|
|
||
| Proceed? [y/N]: y | ||
|
|
||
| [1/3] Signaling Region B to become replica... done | ||
| [2/3] Signaling Region A to become primary (term=3)... done | ||
| [3/3] Waiting for agents to reconnect... 47/47 connected | ||
|
|
||
| Failback complete. Region A is now primary (term=3). | ||
| ``` |
🛠️ Refactor suggestion | 🟠 Major
Ensure failback command is atomic and validates prerequisites.
The failback command (lines 817-834) sends signals to both regions:
[1/3] Signaling Region B to become replica... done
[2/3] Signaling Region A to become primary (term=3)... done
Risk: If network partition between regions is still present or intermittent:
- Step [1/3] might fail silently or timeout
- Step [2/3] succeeds
- Both regions think they're primary → split-brain
Add prerequisites validation before execution:
$ argocd-agent ha failback --to region-a
+Validating prerequisites...
+✓ Region A is reachable
+✓ Region B is reachable
+✓ Region A ↔ Region B connectivity OK
+✓ Region A replication lag: 0s (SYNCED)
+
WARNING: This will:
1. Promote Region A to primary (term → 3)
2. Demote Region B to replica
3. Cause all 47 agents to reconnect
Region A replication status: SYNCED (lag: 0s)
Proceed? [y/N]: y

If any prerequisite fails, abort with a clear error. This prevents the operator from executing failback during an ongoing partition or connectivity issues.
Additionally, the command should:
- Execute as a transaction (both regions agree before state changes)
- Have a timeout and rollback mechanism if step [2/3] fails after [1/3] succeeds
- Verify the new primary (Region A) receives ACKs from the new replica (Region B) before completing (a preflight-check sketch follows below)
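A rough sketch of what such preflight validation could look like in Go. The check functions, field names, and thresholds are placeholders, not a prescribed CLI design.

```go
package ha

import (
	"context"
	"fmt"
	"time"
)

// PreflightChecks are illustrative prerequisites the failback command could
// verify before signaling either region.
type PreflightChecks struct {
	RegionAReachable func(context.Context) error
	RegionBReachable func(context.Context) error
	ReplicationLag   func(context.Context) (time.Duration, error)
	MaxAllowedLag    time.Duration // e.g. 0 for "must be fully SYNCED"
}

// Run aborts the failback with a descriptive error if any prerequisite fails.
func (p PreflightChecks) Run(ctx context.Context) error {
	if err := p.RegionAReachable(ctx); err != nil {
		return fmt.Errorf("region A unreachable: %w", err)
	}
	if err := p.RegionBReachable(ctx); err != nil {
		return fmt.Errorf("region B unreachable: %w", err)
	}
	lag, err := p.ReplicationLag(ctx)
	if err != nil {
		return fmt.Errorf("cannot determine replication lag: %w", err)
	}
	if lag > p.MaxAllowedLag {
		return fmt.Errorf("replication lag %s exceeds %s: refusing failback", lag, p.MaxAllowedLag)
	}
	return nil
}
```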
Adds an alternative to the self-fencing design (Option A) that eliminates split-brain complexity by making all promotion decisions external. Two modes:
- Manual: operator runs promote/demote + updates DNS. Works with any DNS provider. No automatic failover, no fencing, no terms.
- Coordinator: pluggable interface (Route53 ARC, etc.) polls an external authority to determine the active principal.

Reduces the state machine from 7 states to 4; removes the heartbeat/ACK protocol, self-fencing, and the term/epoch system entirely.
docs/proposals/ha-failover.md (outdated excerpt):

    ---

    ## Option B: Simplified Operations Model (Recommended)
@jannfis I think self-fencing, terms/epochs, etc. got too complex. If we rely on external promotion decisions (manual or something like Route53 ARC), the principal logic is simpler.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@docs/proposals/ha-failover.md`:
- Around line 780-783: The document has an inconsistency: the failback timeline
shows a recovered node in State: STANDBY with Health: UNHEALTHY, but the
health-check table declares STANDBY as HEALTHY; update the docs to make these
consistent by choosing one behavior and applying it everywhere (e.g., if
failback requires operator promotion, change the health-check table entry for
STANDBY to UNHEALTHY or introduce a distinct state like STANDBY_SYNCED_NOT_READY
and mark that UNHEALTHY), and add a short note in the failback sequence (the
"Caught up!" step) referencing the chosen rule (preferredRole/manual-ready flag)
so readers know when GSLB may start routing to STANDBY (a health-endpoint sketch follows this list).
- Around line 1130-1151: The Promote() logic must treat an unreachable peer as a
blocking condition to avoid split-brain: update Controller.Promote to consider
err != nil from checkPeerStatus() as a fatal precondition (reject promotion with
a clear error referencing c.options.PeerAddress) unless the caller passed force;
keep the existing force behavior (log a warning and proceed), and still perform
c.transitionTo(StateActive) only after the reachable-peer check (use
checkPeerStatus(), Promote(), transitionTo(), and options.PeerAddress to locate
the change).
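Related to the STANDBY health question above, a minimal sketch of a state-keyed health handler. It treats only ACTIVE as healthy, which is one of the two consistent options the comment asks to pick; the wiring (atomic state value, mounting on :8080/healthz) is an assumption for illustration.

```go
package ha

import (
	"net/http"
	"sync/atomic"
)

// healthHandler returns 200 only while the principal is ACTIVE, which is
// what drives GSLB routing in this design. STANDBY stays unhealthy here
// until an operator promotes it.
func healthHandler(state *atomic.Value) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if s, _ := state.Load().(string); s == "ACTIVE" {
			w.WriteHeader(http.StatusOK)
			_, _ = w.Write([]byte("ok"))
			return
		}
		w.WriteHeader(http.StatusServiceUnavailable)
	})
}

// Example wiring (illustrative): http.Handle("/healthz", healthHandler(&state))
// served on :8080 alongside the principal.
```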
🧹 Nitpick comments (2)
docs/proposals/ha-failover.md (2)
11-17: Design Decisions table only reflects Option A. The table at the top of the document describes "Self-fencing on lost peer ACK" as the split-brain prevention mechanism, but Option B (lines 1083–1321) proposes removing self-fencing entirely in favor of external authority-based promotion. This creates confusion about which design is being proposed.
Recommendation: Either:
- Add a note above this table: "Note: This table describes Option A. See Option B starting at line 1083 for an alternative simplified model."
- Create two separate tables (one for Option A, one for Option B)
- Move Option B to a separate document if it's a competing proposal
This will help readers understand they're evaluating two distinct architectures.
338-340: Consider adding term to SnapshotRequest for consistency. The `SnapshotRequest` message only includes `since_sequence_num`, but no `term` field. This means when a principal calls `GetSnapshot`, the serving principal cannot validate whether the requester is operating on a stale term.

Scenario:
- PR1 is ACTIVE with term=5
- PR1 crashes and comes back with term=1 (if term not persisted)
- PR2 is now ACTIVE with term=6
- PR1 calls GetSnapshot with since_sequence_num from its old state
- PR2 has no way to reject this stale request
While this might auto-correct during subsequent replication, explicitly validating term at snapshot time would catch stale principals earlier.
Recommendation:
message SnapshotRequest {
  uint64 since_sequence_num = 1;
+ uint64 term = 2;
}

And in the implementation, reject requests with term < current term.
Note: This is closely related to the term persistence issue (Open Question #4, line 1332, already flagged in past comments). If the term is persisted, this becomes less critical.
docs/proposals/ha-failover.md (outdated excerpt):

### Safety: Promote Refuses if Peer is ACTIVE

The `promote` command checks peer state before allowing promotion. This prevents the most common operator error (promoting without demoting the other side):

```go
func (c *Controller) Promote(ctx context.Context, force bool) error {
    if c.state == StateActive {
        return fmt.Errorf("already active")
    }

    peer, err := c.checkPeerStatus(ctx)
    if err == nil && peer.State == StateActive {
        if !force {
            return fmt.Errorf("peer is ACTIVE at %s — demote peer first, or use --force", c.options.PeerAddress)
        }
        log().Warn("Force-promoting while peer is ACTIVE")
    }

    c.transitionTo(StateActive)
    return nil
}
```
Option B: Safety check cannot prevent stale principal during partition.
The Promote() safety check (lines 1140-1146) prevents promoting when the peer is reachable and ACTIVE, but doesn't handle the partition case where the old primary is unreachable:
Scenario:
- PR1 is ACTIVE, serving agents
- Network partition: PR1 and PR2 cannot reach each other
- Operator runs `promote` on PR2
- Line 1140: `checkPeerStatus()` fails with an error (peer unreachable)
- Line 1141: `err == nil` is false, so the `peer.State == StateActive` check is skipped
- Both PR1 and PR2 are now ACTIVE → split-brain
Since Option B removes terms/epochs (line 1096), there's no mechanism to reject stale connections. If agents reconnect to PR1 after the partition heals, PR1 has no way to know it's been demoted.
Possible solutions:
- Require reachable peer for promotion: Change line 1141 to reject if
err != nil(peer must be reachable to confirm it's not ACTIVE) - Add back minimal term validation even in Option B: Increment a version on promotion, include in agent handshakes
- Document as known limitation: State that Option B requires manual verification that old primary is truly dead before promoting
Recommend addressing this before Option B can be considered a complete alternative.
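A sketch of how Promote could be tightened along the first suggestion (treat an unreachable peer as a blocking condition unless --force is given). The surrounding types are minimal stand-ins for the Controller in the excerpt above, so this is illustrative, not a drop-in change.

```go
package ha

import (
	"context"
	"fmt"
)

// Minimal stand-ins for the types used in the quoted excerpt.
type State int

const (
	StateStandby State = iota
	StateActive
)

type peerStatus struct{ State State }

type Controller struct {
	state   State
	options struct{ PeerAddress string }

	checkPeerStatus func(context.Context) (peerStatus, error)
	transitionTo    func(State)
}

// Promote refuses to promote unless the peer is reachable and not ACTIVE,
// or the operator explicitly passes force.
func (c *Controller) Promote(ctx context.Context, force bool) error {
	if c.state == StateActive {
		return fmt.Errorf("already active")
	}

	peer, err := c.checkPeerStatus(ctx)
	switch {
	case err != nil && !force:
		// Unreachable peer now blocks promotion: we cannot prove it is not
		// ACTIVE, so promoting could create a second active principal.
		return fmt.Errorf("cannot verify peer at %s (%v): confirm it is down, then use --force", c.options.PeerAddress, err)
	case err == nil && peer.State == StateActive && !force:
		return fmt.Errorf("peer is ACTIVE at %s — demote peer first, or use --force", c.options.PeerAddress)
	case force && (err != nil || peer.State == StateActive):
		// Same force behavior as before: proceed with a loud warning.
		fmt.Println("WARNING: force-promoting without confirmation that peer is inactive")
	}

	c.transitionTo(StateActive)
	return nil
}
```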
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@docs/proposals/ha-failover.md`:
- Around line 21-52: The markdown file docs/proposals/ha-failover.md contains
multiple fenced code blocks (ASCII diagrams and console transcripts) without a
language tag, triggering markdownlint; update each triple-backtick fence that
encloses the diagrams/transcripts (e.g., the Global DNS/GSLB ASCII diagram
block, the RECOVERING/ACTIVE state diagram, timeline blocks starting "T+0s", and
the argocd-agentctl/console snippets) to use a language tag such as "text"
(change ``` to ```text) so linting and readability are improved while preserving
content.
```
┌─────────────────────────────────────────────────────────────────┐
│                        Global DNS / GSLB                        │
│                  principal.argocd.example.com                   │
│         Health checks: /healthz (200 only when ACTIVE)          │
└──────────────┬──────────────────────────────┬───────────────────┘
               │                              │
               ▼                              ▼
    ┌──────────────────────┐       ┌──────────────────────┐
    │  REGION A (Primary)  │       │  REGION B (Replica)  │
    │                      │       │                      │
    │  Principal Server    │◄──────│  Replication Client  │
    │  - gRPC :8403        │       │  - gRPC :8403        │
    │  - Replication :8404 │──────►│  - Mirrors state     │
    │  - /healthz :8080    │       │  - /healthz :8080    │
    │                      │       │                      │
    │  ArgoCD Instance     │       │  ArgoCD Instance     │
    │  (source of truth)   │       │  (standby, receives  │
    │                      │       │   replicated state)  │
    └──────────────────────┘       └──────────────────────┘
               ▲                              ▲
               │     Replication Stream:      │
               │     All events forwarded     │
               │     Primary → Replica        │
               │                              │
       ┌───────┴──────────────────────────────┴───────┐
       │                Remote Clusters               │
       │   ┌─────────┐  ┌─────────┐  ┌─────────┐      │
       │   │ Agent 1 │  │ Agent 2 │  │ Agent N │      │
       │   └─────────┘  └─────────┘  └─────────┘      │
       └──────────────────────────────────────────────┘
```
Add languages to fenced code blocks.
markdownlint warns on these blocks; adding languages improves readability and linting compliance.
Suggested fixes:
🛠️ Proposed diff
The proposed diff adds a `text` language tag to the opening fence of each of these blocks (closing fences unchanged):
- the Global DNS / GSLB architecture diagram
- the state diagram starting "RECOVERING ──┬── config=primary & peer not ACTIVE ──→ ACTIVE"
- the failover timeline starting "T+0s Primary dies. Replica detects stream break → DISCONNECTED."
- the console transcript starting "$ argocd-agentctl ha demote # on Region A — drops agents, stops serving"
- the failback timeline starting "T+0 Old primary restarts → RECOVERING → SYNCING (peer is ACTIVE)"

Also applies to: 62-72, 135-142, 148-152, 156-164
PR form of #720 for easier commenting / collaboration