Commit 6b08d8a: Adding docs covering the manual failover process

docs/manual_failovers.md (+87 −0)
# Manual failover

While automatic failovers are already baked in, there may be times when a manually issued failover is necessary. The steps to perform a manual failover are listed below:

**Note: The promotion candidate must reside within your `PRIMARY_REGION`.**

1. Connect to the Machine you wish to promote.
```
fly ssh console -s --app <app-name>
```
2. Confirm the member is healthy.
```
# Switch to the postgres user and move to the home directory.
su postgres
cd ~

# Verify that the member is healthy.
repmgr node check

Node "fdaa:0:2e26:a7b:7d17:9b36:6e4b:2":
    Server role: OK (node is standby)
    Replication lag: OK (0 seconds)
    WAL archiving: OK (0 pending archive ready files)
    Upstream connection: OK (node "fdaa:0:2e26:a7b:7d17:9b36:6e4b:2" (ID: 1375486377) is attached to expected upstream node "fdaa:0:2e26:a7b:c850:86cf:b175:2" (ID: 1177616922))
    Downstream servers: OK (this node has no downstream nodes)
    Replication slots: OK (node has no physical replication slots)
    Missing physical replication slots: OK (node has no missing physical replication slots)
    Configured data directory: OK (configured "data_directory" is "/data/postgresql")
```
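The health check above can also be gated in a script rather than eyeballed. A minimal sketch, assuming failed checks are flagged with `WARNING` or `CRITICAL` in the output (the helper name `checks_ok` is hypothetical):

```shell
# Hypothetical helper: read `repmgr node check` output on stdin and
# succeed only when no check line reports WARNING, CRITICAL, or ERROR.
checks_ok() {
  ! grep -Eq 'WARNING|CRITICAL|ERROR'
}

# Example usage (an assumption, not part of the documented procedure):
#   repmgr node check | checks_ok || echo "member is not healthy"
```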
3. Stop the Machine running as primary.

Open up a separate terminal and stop the Machine running as `primary`.

```
# Identify the primary.
fly status --app <app-name>

ID             STATE   ROLE    REGION HEALTH CHECKS      IMAGE                                            CREATED              UPDATED
6e8226ec711087 started replica lax    3 total, 3 passing davissp14/postgres-flex:recovery-fix-00 (custom) 2023-02-15T20:20:51Z 2023-02-15T20:21:10Z
6e82931b729087 started primary lax    3 total, 3 passing davissp14/postgres-flex:recovery-fix-00 (custom) 2023-02-15T20:19:58Z 2023-02-15T20:20:18Z
9185957f411683 started replica lax    3 total, 3 passing davissp14/postgres-flex:recovery-fix-00 (custom) 2023-02-15T20:20:24Z 2023-02-15T20:20:45Z

fly machines stop 6e82931b729087 --app <app-name>
```
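Picking the primary's Machine ID out of that table by hand is error-prone. It can also be extracted from the ROLE column. A sketch, assuming the column layout shown above, where ROLE is the third field (the helper name `primary_id` is hypothetical):

```shell
# Hypothetical helper: read `fly status` output on stdin and print the
# Machine ID of the row whose ROLE column is "primary".
primary_id() {
  awk '$3 == "primary" { print $1 }'
}

# Example usage (an assumption, not a documented fly CLI feature):
#   fly status --app <app-name> | primary_id
```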
4. Run the standby promotion command.

Go back to the first terminal you opened, which is connected to your promotion candidate.

**WARNING: It's important that you specify `--siblings-follow`; otherwise, any other standbys will not be reconfigured to follow the new primary.**
```
# Issue a dry run to ensure our candidate is eligible for promotion.
repmgr standby promote --siblings-follow --dry-run

INFO: node is a standby
INFO: no active primary server found in this replication cluster
INFO: all sibling nodes are reachable via SSH
INFO: 1 walsenders required, 10 available
INFO: 1 replication slots required, 10 available
INFO: node will be promoted using the "pg_promote()" function
INFO: prerequisites for executing STANDBY PROMOTE are met
```
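The "everything looks good" check can be made mechanical by keying on the final confirmation line of the dry-run output above. A sketch (the guard name `dry_run_ok` is hypothetical):

```shell
# Hypothetical guard: read dry-run output on stdin and succeed only if
# repmgr reported that the promotion prerequisites are met.
dry_run_ok() {
  grep -q 'prerequisites for executing STANDBY PROMOTE are met'
}

# Example usage (an assumption, not part of the documented procedure):
#   repmgr standby promote --siblings-follow --dry-run | dry_run_ok || exit 1
```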
If everything looks good, go ahead and re-run the command without the `--dry-run` argument.
```
repmgr standby promote --siblings-follow

NOTICE: promoting standby to primary
DETAIL: promoting server "fdaa:0:2e26:a7b:7d17:9b36:6e4b:2" (ID: 1375486377) using pg_promote()
NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
NOTICE: STANDBY PROMOTE successful
DETAIL: server "fdaa:0:2e26:a7b:7d17:9b36:6e4b:2" (ID: 1375486377) was successfully promoted to primary
INFO: executing notification command for event "standby_promote"
DETAIL: command is:
  /usr/local/bin/event_handler -node-id 1375486377 -event standby_promote -success 1 -details "server \"fdaa:0:2e26:a7b:7d17:9b36:6e4b:2\" (ID: 1375486377) was successfully promoted to primary" -new-node-id ''
NOTICE: executing STANDBY FOLLOW on 1 of 1 siblings
INFO: STANDBY FOLLOW successfully executed on all reachable sibling nodes
```
5. Start the Machine that was previously operating as primary.
```
fly machines start 6e82931b729087 --app <app-name>
```

The old primary will come back up, recognize that it's no longer the true primary, and rejoin the cluster as a standby.
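To confirm the rejoin, the same `fly status` table can be consulted: the restarted Machine's row should now show the replica role. A sketch, assuming the column layout shown earlier (the helper name `role_of` is hypothetical):

```shell
# Hypothetical helper: read `fly status` output on stdin and print the
# ROLE column for the given Machine ID.
role_of() {
  awk -v id="$1" '$1 == id { print $3 }'
}

# Example usage (an assumption): after the restart, this should print "replica":
#   fly status --app <app-name> | role_of 6e82931b729087
```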
