Commit adb02b7

Limit voting members to the those residing within the primary region (#193)
* Limit voting members to the those residing in the primary region
* Require in-region quorum
* Update docs to reflect changes
* Typo
1 parent 215ed8b commit adb02b7

4 files changed, +8 -11 lines changed

README.md

Lines changed: 1 addition & 3 deletions
@@ -15,9 +15,7 @@ fly pg create --name <app-name> --initial-cluster-size 3 --region ord --flex
 ```
 
 ## High Availability
-For HA, it's recommended that you run at least 3 members.
-
-Automatic failovers will only consider members residing within your primary region. The primary region is represented as an environment variable defined within the `fly.toml` file. That being said, if you're running a 3 member setup at least 2 of your members should reside within your primary region.
+For HA, it's recommended that you run at least 3 members within your primary region. Automatic failovers will only consider members residing within your primary region. The primary region is represented as an environment variable defined within the `fly.toml` file.
 
 ## Horizontal scaling
 Use the clone command to scale up your cluster.

docs/fencing.md

Lines changed: 5 additions & 5 deletions
@@ -1,7 +1,7 @@
 # Fencing
 
 ## How do we verify the real primary?
-We start out evaluating the cluster state by checking each registered standby for connectivity and asking who their primary is.
+We start out by evaluating the cluster state by checking each registered standby within the primary region for connectivity and asking who their primary is.
 
 The "clusters state" is represented across a few different dimensions:
 
@@ -24,7 +24,7 @@ map[string]int{
 }
 ```
 
-The real primary is resolvable so long as the majority of members can agree on who it is. Quorum being defined as `total_members / 2 + 1`.
+The real primary is resolvable so long as the majority of members can agree on who it is. Quorum being defined as `total_members_in_region / 2 + 1`.
 
 **Note: When the primary being evaluated meets quorum, it will still be fenced in the event a conflict is found. This is to protect against a possible race condition where an old primary comes back up in the middle of an active failover.**
 
@@ -45,11 +45,11 @@ The cluster will be made read-only and the `zombie.lock` file will be created wi
 
 ## Monitoring cluster state
 
-In order to mitigate possible split-brain scenarios, it's important that cluster state is evaluated regularly and when specific events/actions take place.
+In order to mitigate possible split-brain scenarios, it's important that cluster state is evaluated regularly and when specific events/actions take place.
 
 ### On boot
 This is to ensure the booting primary is not a primary coming back from the dead.
-
+
 ### During standby connect/reconnect/disconnect events
 There are a myriad of reasons why a standby might disconnect, but we have to assume the possibility of a network partition. In either case, if quorum is lost, the primary will be fenced.
 
@@ -60,7 +60,7 @@ Cluster state is monitored in the background at regular intervals. This acts as
 ## Split-brain detection window
 **This pertains to v0.0.36+**
 
-When a network partition is initiated, the following steps are performed:
+When a network partition is initiated, the following steps are performed:
 
 1. Repmgr will attempt to ping registered members with a 5s connect timeout.
 2. Repmgr will wait up to 30 seconds for the standby to reconnect before issuing a `child_node_disconnect` event.
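
The quorum formula changed in `docs/fencing.md`, `total_members_in_region / 2 + 1`, can be illustrated with a small Go sketch. This is not code from the commit: the function names `quorum` and `meetsQuorum` are hypothetical, and the sketch assumes the division is integer division, as it would be in Go.

```go
package main

// quorum returns the number of agreeing members needed for a majority,
// following the documented formula total_members_in_region / 2 + 1.
// Integer division is assumed, so 3 members need 2 and 4 members need 3.
func quorum(totalMembersInRegion int) int {
	return totalMembersInRegion/2 + 1
}

// meetsQuorum is a hypothetical helper: it reports whether the number of
// in-region members that agree on the primary reaches the quorum.
func meetsQuorum(agreeing, totalMembersInRegion int) bool {
	return agreeing >= quorum(totalMembersInRegion)
}
```

With 3 in-region members the quorum is 2, so a single agreeing member is not enough, while 2 of 3 is. Members outside the primary region do not enter the count at all under this change.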

internal/flypg/repmgr.go

Lines changed: 1 addition & 1 deletion
@@ -388,7 +388,7 @@ func (r *RepMgr) VotingMembers(ctx context.Context, conn *pgx.Conn) ([]Member, e
 
 	var voters []Member
 	for _, member := range members {
-		if member.Role == StandbyRoleName || member.Role == WitnessRoleName {
+		if (member.Role == StandbyRoleName || member.Role == WitnessRoleName) && member.Region == r.PrimaryRegion {
 			voters = append(voters, member)
 		}
 	}
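
To see the effect of the added region check in isolation, here is a self-contained sketch of the same filter. It is an illustration only: the `Member` fields beyond `Role` and `Region`, the role-name string values, and the free function `votingMembers` are assumptions made for the example, not the repository's actual definitions.

```go
package main

// Role names are assumed values for illustration only; the real
// constants are defined elsewhere in the repository.
const (
	StandbyRoleName = "standby"
	WitnessRoleName = "witness"
)

// Member is a simplified stand-in for the repository's Member type.
type Member struct {
	Hostname string
	Role     string
	Region   string
}

// votingMembers keeps only standbys and witnesses that reside in the
// primary region, mirroring the condition added in this commit.
func votingMembers(members []Member, primaryRegion string) []Member {
	var voters []Member
	for _, member := range members {
		isVoterRole := member.Role == StandbyRoleName || member.Role == WitnessRoleName
		if isVoterRole && member.Region == primaryRegion {
			voters = append(voters, member)
		}
	}
	return voters
}
```

Given a primary in `ord`, a standby in `ord`, a witness in `ord`, and a standby in `lax`, only the two non-primary `ord` members are returned. The `lax` standby would have been counted as a voter before this change.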

internal/flypg/zombie.go

Lines changed: 1 addition & 2 deletions
@@ -122,7 +122,6 @@ func TakeDNASample(ctx context.Context, node *Node, standbys []Member) (*DNASamp
 			sample.totalConflicts++
 			sample.conflictMap[primary.Hostname]++
 		}
-
 	}
 
 	return sample, nil
@@ -182,7 +181,7 @@ func Quarantine(ctx context.Context, n *Node, primary string) error {
 }
 
 func DNASampleString(s *DNASample) string {
-	return fmt.Sprintf("Registered members: %d, Active member(s): %d, Inactive member(s): %d, Conflicts detected: %d",
+	return fmt.Sprintf("Voting member(s): %d, Active: %d, Inactive: %d, Conflicts: %d",
 		s.totalMembers,
 		s.totalActive,
 		s.totalInactive,
