# Manual failover

While automatic failover is already baked in, there may be times when a manually issued failover is necessary. The steps to perform a manual failover are listed below:

**Note: The promotion candidate must reside within your PRIMARY_REGION.**

1. Connect to the Machine you wish to promote.
```
fly ssh console -s --app <app-name>
```

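Before connecting, you may want to confirm which standbys are eligible candidates. A minimal sketch, assuming the `fly status` column layout shown later in this guide (ID first, ROLE third, REGION fourth) and a hypothetical primary region of `lax`; the sample output is hard-coded here, but in practice you would capture it with `status_output=$(fly status --app <app-name>)`:

```shell
# Hypothetical sketch: list standby Machines in the primary region so you
# can pick a promotion candidate. Sample rows are hard-coded stand-ins for
# real `fly status` output.
primary_region=lax
status_output='6e8226ec711087 started replica lax
6e82931b729087 started primary lax
9185957f411683 started replica lax'

candidates=$(echo "$status_output" | awk -v r="$primary_region" \
  '$3 == "replica" && $4 == r { print $1 }')
echo "$candidates"
```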
2. Confirm the member is healthy.
```
# Switch to the postgres user and move to the home directory.
su postgres
cd ~

# Verify the member is healthy.
repmgr node check

Node "fdaa:0:2e26:a7b:7d17:9b36:6e4b:2":
	Server role: OK (node is standby)
	Replication lag: OK (0 seconds)
	WAL archiving: OK (0 pending archive ready files)
	Upstream connection: OK (node "fdaa:0:2e26:a7b:7d17:9b36:6e4b:2" (ID: 1375486377) is attached to expected upstream node "fdaa:0:2e26:a7b:c850:86cf:b175:2" (ID: 1177616922))
	Downstream servers: OK (this node has no downstream nodes)
	Replication slots: OK (node has no physical replication slots)
	Missing physical replication slots: OK (node has no missing physical replication slots)
	Configured data directory: OK (configured "data_directory" is "/data/postgresql")
```

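If you want to gate the failover in a script, the check output can be inspected mechanically. A minimal sketch; the sample lines are hard-coded from the output above, but in practice you would capture them with `check_output=$(repmgr node check)`:

```shell
# Hypothetical sketch: treat the member as healthy only if every check
# line reports OK. Sample lines are hard-coded from the check above.
check_output='Server role: OK (node is standby)
Replication lag: OK (0 seconds)
WAL archiving: OK (0 pending archive ready files)'

if echo "$check_output" | grep -qv ': OK'; then
  healthy=no   # at least one check line is not OK; abort the failover
else
  healthy=yes  # safe to proceed
fi
echo "healthy=$healthy"
```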
3. Stop the Machine running as primary.
Open up a separate terminal and stop the Machine currently running as `primary`.

```
# Identify the primary
fly status --app <app-name>

ID STATE ROLE REGION HEALTH CHECKS IMAGE CREATED UPDATED
6e8226ec711087 started replica lax 3 total, 3 passing davissp14/postgres-flex:recovery-fix-00 (custom) 2023-02-15T20:20:51Z 2023-02-15T20:21:10Z
6e82931b729087 started primary lax 3 total, 3 passing davissp14/postgres-flex:recovery-fix-00 (custom) 2023-02-15T20:19:58Z 2023-02-15T20:20:18Z
9185957f411683 started replica lax 3 total, 3 passing davissp14/postgres-flex:recovery-fix-00 (custom) 2023-02-15T20:20:24Z 2023-02-15T20:20:45Z

fly machines stop 6e82931b729087 --app <app-name>
```

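Rather than copying the ID by hand, the primary's Machine ID can be parsed from the status table. A minimal sketch, assuming the column layout shown above (ID first, ROLE third); the sample rows are hard-coded, but in practice you would capture them with `status_output=$(fly status --app <app-name>)`:

```shell
# Hypothetical sketch: extract the primary's Machine ID from the status
# table. Sample rows are hard-coded from the output above.
status_output='6e8226ec711087 started replica lax
6e82931b729087 started primary lax
9185957f411683 started replica lax'

primary_id=$(echo "$status_output" | awk '$3 == "primary" { print $1 }')
echo "$primary_id"
```

You could then stop it with `fly machines stop "$primary_id" --app <app-name>`.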
4. Run the standby promotion command.
Go back to the first terminal you opened that's connected to your promotion candidate.

**WARNING: It's important that you specify `--siblings-follow`; otherwise, any other standbys will not be reconfigured to follow the new primary.**
```
# Issue a dry run to ensure the candidate is eligible for promotion.
repmgr standby promote --siblings-follow --dry-run

INFO: node is a standby
INFO: no active primary server found in this replication cluster
INFO: all sibling nodes are reachable via SSH
INFO: 1 walsenders required, 10 available
INFO: 1 replication slots required, 10 available
INFO: node will be promoted using the "pg_promote()" function
INFO: prerequisites for executing STANDBY PROMOTE are met
```

If everything looks good, go ahead and re-run the command without the `--dry-run` argument.
```
repmgr standby promote --siblings-follow

NOTICE: promoting standby to primary
DETAIL: promoting server "fdaa:0:2e26:a7b:7d17:9b36:6e4b:2" (ID: 1375486377) using pg_promote()
NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
NOTICE: STANDBY PROMOTE successful
DETAIL: server "fdaa:0:2e26:a7b:7d17:9b36:6e4b:2" (ID: 1375486377) was successfully promoted to primary
INFO: executing notification command for event "standby_promote"
DETAIL: command is:
  /usr/local/bin/event_handler -node-id 1375486377 -event standby_promote -success 1 -details "server \"fdaa:0:2e26:a7b:7d17:9b36:6e4b:2\" (ID: 1375486377) was successfully promoted to primary" -new-node-id ''
NOTICE: executing STANDBY FOLLOW on 1 of 1 siblings
INFO: STANDBY FOLLOW successfully executed on all reachable sibling nodes
```

5. Start the Machine that was previously operating as primary.
```
fly machines start 6e82931b729087 --app <app-name>
```

The old primary will come back up, recognize that it's no longer the true primary, and rejoin the cluster as a standby.
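
To confirm the cluster has converged, you can re-check that exactly one member reports the `primary` role. A minimal sketch with hypothetical post-failover roles hard-coded; in practice you would capture the table with `fly status --app <app-name>` (or, from inside a member, inspect `repmgr cluster show`):

```shell
# Hypothetical sketch: count members reporting the primary role. After a
# clean rejoin there should be exactly one. Sample rows are hard-coded.
status_output='6e8226ec711087 started replica lax
6e82931b729087 started replica lax
9185957f411683 started primary lax'

primary_count=$(echo "$status_output" | awk '$3 == "primary"' | grep -c .)
echo "primaries: $primary_count"
```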