Commit 948ae9b

Merge pull request #97 from fly-apps/docs
Adding some basic docs
2 parents e42e5b0 + 29c797f commit 948ae9b

File tree

4 files changed: 206 additions & 0 deletions


docs/capacity_monitoring.md

Lines changed: 34 additions & 0 deletions
# Capacity monitoring
Disk capacity is monitored at regular intervals. When usage exceeds the pre-defined threshold of 90%, every user-defined table will become read-only. When disk usage falls back below the threshold, either through file cleanup or volume extension, read/write will be re-enabled automatically.
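To make the mechanism more concrete, here is a minimal sketch of what such a periodic check could look like. It is illustrative only: the `/data` mount path, the one-minute polling interval, and the `setReadOnly` callback are assumptions standing in for however the monitor actually toggles read-only access.

```
package main

import (
	"fmt"
	"log"
	"syscall"
	"time"
)

// diskUsage returns the used fraction (0.0-1.0) of the filesystem backing dir.
func diskUsage(dir string) (float64, error) {
	var fs syscall.Statfs_t
	if err := syscall.Statfs(dir, &fs); err != nil {
		return 0, err
	}
	total := float64(fs.Blocks) * float64(fs.Bsize)
	avail := float64(fs.Bavail) * float64(fs.Bsize)
	return (total - avail) / total, nil
}

// monitorCapacity polls disk usage and toggles read-only mode around the
// 90% threshold. setReadOnly is a stand-in for the real mechanism that
// makes user-defined tables read-only (or re-enables writes).
func monitorCapacity(dataDir string, threshold float64, setReadOnly func(bool) error) {
	for range time.Tick(time.Minute) { // polling interval is an assumption
		usage, err := diskUsage(dataDir)
		if err != nil {
			log.Printf("capacity check failed: %v", err)
			continue
		}
		if err := setReadOnly(usage > threshold); err != nil {
			log.Printf("failed to toggle read-only mode: %v", err)
		}
	}
}

func main() {
	monitorCapacity("/data", 0.90, func(readOnly bool) error {
		fmt.Printf("read-only=%v\n", readOnly) // placeholder for the real toggle
		return nil
	})
}
```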
## Resolving disk capacity issues
Disk usage must be brought back below 90% before read/write access is re-enabled. The simplest way to do this is to extend your volume.

**List your volumes**
```
fly volumes list --app flex-testing

ID                    STATE    NAME     SIZE  REGION  ZONE  ENCRYPTED  ATTACHED VM     CREATED AT
vol_okgj54527584y2wz  created  pg_data  10GB  lax     1581  true       9185340f4d3383  59 minutes ago
```

**Extend the volume**
```
fly volumes extend vol_okgj54527584y2wz --size 15 --app flex-testing

        ID: vol_okgj54527584y2wz
      Name: pg_data
       App: flex-testing
    Region: lax
      Zone: c6d5
   Size GB: 15
 Encrypted: true
Created at: 15 Feb 23 15:23 UTC

You will need to stop and start your machine to increase the size of the FS
```

**Restart the Machine tied to your Volume**
```
fly machines restart 9185340f4d3383 --app flex-testing
```

docs/fencing.md

Lines changed: 43 additions & 0 deletions
# Fencing
## How do we verify the real primary?
We start by evaluating the cluster state: each registered standby is checked for connectivity and asked who its primary is.

The cluster's state is represented across a few different dimensions:

**Total members**
Number of registered members, including the primary.

**Total active members**
Number of members that are responsive. This includes the primary we are evaluating, so this will never be less than one.

**Total inactive members**
Number of registered members that are non-responsive.

**Conflict map**
The conflict map is a `map[string]int` that tracks the conflicting primaries reported by our standbys, along with the number of times each one was referenced.

As an example, say we have a 3-member cluster and both standbys indicate that their registered primary does not match. This will be recorded as:
```
map[string]int{
    "fdaa:0:2e26:a7b:8c31:bf37:488c:2": 2
}
```

The real primary is resolvable so long as a majority of members agree on who it is, with quorum defined as `total_members / 2 + 1`.

**There is one exception to note here: even when the primary being evaluated meets quorum, it will still be fenced if a conflict is found. This protects against a possible race condition where the old primary comes back up during an active failover.**
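As a loose sketch of that rule (illustrative only, not the repo's actual types or resolution logic; `ClusterState` and `resolvePrimary` are names invented for this example), the resolution could look something like this:

```
package main

import "fmt"

// ClusterState mirrors the dimensions described above (illustrative shape only).
type ClusterState struct {
	TotalMembers  int            // registered members, including the primary
	TotalActive   int            // responsive members, never less than one
	TotalInactive int            // registered but non-responsive members
	ConflictMap   map[string]int // conflicting primary -> number of standbys reporting it
}

// resolvePrimary returns the address a majority agrees on, or false when
// no candidate reaches quorum (total_members / 2 + 1).
func resolvePrimary(bootingPrimary string, s ClusterState) (string, bool) {
	quorum := s.TotalMembers/2 + 1

	// Standbys that reported no conflict implicitly agree with the booting primary.
	agree := s.TotalActive
	for _, count := range s.ConflictMap {
		agree -= count
	}
	if len(s.ConflictMap) == 0 && agree >= quorum {
		return bootingPrimary, true
	}

	// Otherwise, a conflicting primary must itself reach quorum to be trusted.
	for addr, count := range s.ConflictMap {
		if count >= quorum {
			return addr, true
		}
	}
	return "", false
}

func main() {
	// The 3-member example above: quorum is 3/2 + 1 = 2, and both standbys
	// point at the same conflicting primary, so it wins.
	state := ClusterState{
		TotalMembers: 3,
		TotalActive:  3,
		ConflictMap:  map[string]int{"fdaa:0:2e26:a7b:8c31:bf37:488c:2": 2},
	}
	fmt.Println(resolvePrimary("fdaa:0:2e26:a7b:7d17:9b36:6e4b:2", state))
}
```

Keep the exception above in mind: even when the booting primary reaches quorum, the actual implementation still fences it if any conflict was observed.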
Tests can be found here: https://github.com/fly-apps/postgres-flex/pull/49/files#diff-3d71960ff7855f775cb257a74643d67d2636b354c9d485d10c2ded2426a7f362
## What if the real primary can't be resolved or doesn't match the booting primary?
In both of these cases the primary member will be fenced.

**If the real primary is resolvable**
The cluster will be made read-only, PGBouncer will be reconfigured to target the "real" primary, and that primary's IP address is written to a `zombie.lock` file. The PGBouncer reconfiguration ensures that any connections hitting this member are routed to the real primary, in order to minimize interruptions. Once this is complete, the process panics to force a full member restart. When the member restarts, we read the IP address from the `zombie.lock` file and use it to attempt to rejoin the cluster we diverged from. If the rejoin succeeds, the `zombie.lock` file is cleared and the member boots as a standby.

**Note: We will not attempt to rejoin a cluster if the resolved primary resides in a region that differs from the `PRIMARY_REGION` environment variable set on the local member. The `PRIMARY_REGION` will need to be updated before a rejoin will be attempted.**

**If the real primary is NOT resolvable**
The cluster will be made read-only, PGBouncer will remain disabled, and a `zombie.lock` file will be created without a value. When the member reboots, we read the `zombie.lock` file and see that it's empty, which indicates we've entered a failure mode that can't be recovered automatically. This could be an issue where previously deleted members were not properly unregistered, or the primary's state has diverged to a point where its registered members have been cycled out.
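A rough sketch of this boot-time `zombie.lock` handling might look like the following. It is illustrative only: the lock path, the `rejoinCluster` helper, and the error handling are assumptions rather than the repo's actual code.

```
package main

import (
	"fmt"
	"log"
	"os"
	"strings"
)

const zombieLockPath = "/data/zombie.lock" // path is an assumption for this sketch

// handleZombieLock checks for a fence marker at boot. rejoinCluster stands in
// for whatever actually re-registers this member against the real primary
// (and should also verify PRIMARY_REGION, per the note above).
func handleZombieLock(rejoinCluster func(primaryIP string) error) error {
	data, err := os.ReadFile(zombieLockPath)
	if os.IsNotExist(err) {
		return nil // not fenced; boot normally
	}
	if err != nil {
		return err
	}

	primaryIP := strings.TrimSpace(string(data))
	if primaryIP == "" {
		// Empty lock: the real primary was never resolved, so this failure
		// mode requires manual intervention.
		return fmt.Errorf("zombie.lock is empty; cluster must be recovered manually")
	}

	// A real primary was resolved before the restart: try to rejoin it as a
	// standby and clear the lock on success.
	if err := rejoinCluster(primaryIP); err != nil {
		return fmt.Errorf("failed to rejoin cluster at %s: %w", primaryIP, err)
	}
	return os.Remove(zombieLockPath)
}

func main() {
	err := handleZombieLock(func(primaryIP string) error {
		log.Printf("rejoining cluster via primary %s", primaryIP) // placeholder
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}
}
```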

docs/manual_failovers.md

Lines changed: 87 additions & 0 deletions
# Manual failover
While automatic failover is already baked in, there may be times when a manually issued failover is necessary. The steps to perform a manual failover are listed below.

**Note: The promotion candidate must reside within your PRIMARY_REGION.**
1. Connect to the Machine you wish to promote
```
fly ssh console -s --app <app-name>
```
2. Confirm the member is healthy.
```
# Switch to the postgres user and move to the home directory.
su postgres
cd ~

# Verify member is healthy
repmgr node check

Node "fdaa:0:2e26:a7b:7d17:9b36:6e4b:2":
	Server role: OK (node is standby)
	Replication lag: OK (0 seconds)
	WAL archiving: OK (0 pending archive ready files)
	Upstream connection: OK (node "fdaa:0:2e26:a7b:7d17:9b36:6e4b:2" (ID: 1375486377) is attached to expected upstream node "fdaa:0:2e26:a7b:c850:86cf:b175:2" (ID: 1177616922))
	Downstream servers: OK (this node has no downstream nodes)
	Replication slots: OK (node has no physical replication slots)
	Missing physical replication slots: OK (node has no missing physical replication slots)
	Configured data directory: OK (configured "data_directory" is "/data/postgresql")
```
3. Stop the Machine running as `primary`
Open up a separate terminal and stop the Machine that is currently running as `primary`.
```
# Identify the primary
fly status --app <app-name>

ID              STATE    ROLE     REGION  HEALTH CHECKS       IMAGE                                             CREATED               UPDATED
6e8226ec711087  started  replica  lax     3 total, 3 passing  davissp14/postgres-flex:recovery-fix-00 (custom)  2023-02-15T20:20:51Z  2023-02-15T20:21:10Z
6e82931b729087  started  primary  lax     3 total, 3 passing  davissp14/postgres-flex:recovery-fix-00 (custom)  2023-02-15T20:19:58Z  2023-02-15T20:20:18Z
9185957f411683  started  replica  lax     3 total, 3 passing  davissp14/postgres-flex:recovery-fix-00 (custom)  2023-02-15T20:20:24Z  2023-02-15T20:20:45Z

fly machines stop 6e82931b729087 --app <app-name>
```
4. Run the standby promotion command
Go back to the first terminal you opened that's connected to your promotion candidate.

**WARNING: It's important that you specify `--siblings-follow`; otherwise, any other standbys will not be reconfigured to follow the new primary.**
```
# Issue a dry-run to ensure our candidate is eligible for promotion.
repmgr standby promote --siblings-follow --dry-run

INFO: node is a standby
INFO: no active primary server found in this replication cluster
INFO: all sibling nodes are reachable via SSH
INFO: 1 walsenders required, 10 available
INFO: 1 replication slots required, 10 available
INFO: node will be promoted using the "pg_promote()" function
INFO: prerequisites for executing STANDBY PROMOTE are met
```
If everything looks good, go ahead and re-run the command without the `--dry-run` argument.
```
repmgr standby promote --siblings-follow

NOTICE: promoting standby to primary
DETAIL: promoting server "fdaa:0:2e26:a7b:7d17:9b36:6e4b:2" (ID: 1375486377) using pg_promote()
NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
NOTICE: STANDBY PROMOTE successful
DETAIL: server "fdaa:0:2e26:a7b:7d17:9b36:6e4b:2" (ID: 1375486377) was successfully promoted to primary
INFO: executing notification command for event "standby_promote"
DETAIL: command is:
  /usr/local/bin/event_handler -node-id 1375486377 -event standby_promote -success 1 -details "server \"fdaa:0:2e26:a7b:7d17:9b36:6e4b:2\" (ID: 1375486377) was successfully promoted to primary" -new-node-id ''
NOTICE: executing STANDBY FOLLOW on 1 of 1 siblings
INFO: STANDBY FOLLOW successfully executed on all reachable sibling nodes
```
5. Start the Machine that was previously operating as primary
```
fly machines start 6e82931b729087 --app <app-name>
```
The old primary will come back up, recognize that it's no longer the true primary, and rejoin the cluster as a standby.

docs/troubleshooting.md

Lines changed: 42 additions & 0 deletions
# Troubleshooting
## Member unregistration failed when removing machine
```
$ fly machines remove 9185340f4d3383 --app flex-testing
machine 9185340f4d3383 was found and is currently in stopped state, attempting to destroy...
unregistering postgres member 'fdaa:0:2e26:a7b:7d16:cff7:9849:2' from the cluster... <insert-random-error-here> (failed)

9185340f4d3383 has been destroyed
```
Unfortunately, this can happen for a variety of reasons. If no action is taken, the member and its associated replication slot will automatically be cleaned up after 24 hours. Depending on the current cluster size, problems can arise if the down member impacts the cluster's ability to meet quorum (for example, a 3-member cluster has a quorum of 2, so one down member is tolerable but two are not). If that's the case, it's important to take action right away to prevent your cluster from going read-only.
To address this, start by SSHing into one of your running Machines.

```
fly ssh console --app <app-name>
```
Switch to the postgres user and move into the home directory.
```
su postgres
cd ~
```
Use the `repmgr` CLI tool to view the current cluster state.
```
repmgr daemon status

 ID         | Name                             | Role    | Status        | Upstream                           | repmgrd | PID | Paused? | Upstream last seen
------------+----------------------------------+---------+---------------+------------------------------------+---------+-----+---------+--------------------
 376084936  | fdaa:0:2e26:a7b:7d18:1a68:804e:2 | primary | * running     |                                    | running | 630 | no      | n/a
 1349952263 | fdaa:0:2e26:a7b:7d17:4463:955d:2 | standby | ? unreachable | ? fdaa:0:2e26:a7b:7d18:1a68:804e:2 | n/a     | n/a | n/a     | n/a
 1412735685 | fdaa:0:2e26:a7b:c850:8f12:fb1d:2 | standby |   running     | fdaa:0:2e26:a7b:7d18:1a68:804e:2   | running | 617 | no      | 1 second(s) ago
```
Manually unregister the unreachable standby.
```
repmgr standby unregister --node-id 1349952263
```
