Commit 948ae9b

Merge pull request #97 from fly-apps/docs
Adding some basic docs
2 parents e42e5b0 + 29c797f commit 948ae9b

File tree

4 files changed: 206 additions & 0 deletions


docs/capacity_monitoring.md

Lines changed: 34 additions & 0 deletions
# Capacity monitoring
Disk capacity is monitored at regular intervals. When usage exceeds the pre-defined threshold of 90%, every user-defined table will become read-only. When disk usage falls back below the threshold, either through file cleanup or volume extension, read/write will be re-enabled automatically.
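To make the mechanism more concrete, here is a minimal sketch of what such a periodic check could look like. It is illustrative only: the `/data` mount path, the one-minute polling interval, and the `setReadOnly` callback are assumptions standing in for however the monitor actually toggles read-only access.

```
package main

import (
	"fmt"
	"log"
	"syscall"
	"time"
)

// diskUsage returns the used fraction (0.0-1.0) of the filesystem backing dir.
func diskUsage(dir string) (float64, error) {
	var fs syscall.Statfs_t
	if err := syscall.Statfs(dir, &fs); err != nil {
		return 0, err
	}
	total := float64(fs.Blocks) * float64(fs.Bsize)
	avail := float64(fs.Bavail) * float64(fs.Bsize)
	return (total - avail) / total, nil
}

// monitorCapacity polls disk usage and toggles read-only mode around the
// 90% threshold. setReadOnly is a stand-in for the real mechanism that
// makes user-defined tables read-only (or re-enables writes).
func monitorCapacity(dataDir string, threshold float64, setReadOnly func(bool) error) {
	for range time.Tick(time.Minute) { // polling interval is an assumption
		usage, err := diskUsage(dataDir)
		if err != nil {
			log.Printf("capacity check failed: %v", err)
			continue
		}
		if err := setReadOnly(usage > threshold); err != nil {
			log.Printf("failed to toggle read-only mode: %v", err)
		}
	}
}

func main() {
	monitorCapacity("/data", 0.90, func(readOnly bool) error {
		fmt.Printf("read-only=%v\n", readOnly) // placeholder for the real toggle
		return nil
	})
}
```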
## Resolving disk capacity issues
Disk usage must be brought back below 90% before read/write access is re-enabled. The simplest way to do this is to extend your volume.

**List your volumes**
```
fly volumes list --app flex-testing

ID                    STATE    NAME     SIZE  REGION  ZONE  ENCRYPTED  ATTACHED VM     CREATED AT
vol_okgj54527584y2wz  created  pg_data  10GB  lax     1581  true       9185340f4d3383  59 minutes ago
```

**Extend the volume**
```
fly volumes extend vol_okgj54527584y2wz --size 15 --app flex-testing

        ID: vol_okgj54527584y2wz
      Name: pg_data
       App: flex-testing
    Region: lax
      Zone: c6d5
   Size GB: 15
 Encrypted: true
Created at: 15 Feb 23 15:23 UTC

You will need to stop and start your machine to increase the size of the FS
```

**Restart the Machine tied to your Volume**
```
fly machines restart 9185340f4d3383 --app flex-testing
```

docs/fencing.md

Lines changed: 43 additions & 0 deletions
# Fencing
## How do we verify the real primary?
We start by evaluating the cluster state: each registered standby is checked for connectivity and asked who its primary is.

The cluster's state is represented across a few different dimensions:

**Total members**
Number of registered members, including the primary.

**Total active members**
Number of members that are responsive. This includes the primary we are evaluating, so this will never be less than one.

**Total inactive members**
Number of registered members that are non-responsive.

**Conflict map**
The conflict map is a `map[string]int` that tracks the conflicting primaries reported by our standbys, along with the number of times each one was referenced.

As an example, say we have a 3-member cluster and both standbys indicate that their registered primary does not match. This will be recorded as:
```
map[string]int{
    "fdaa:0:2e26:a7b:8c31:bf37:488c:2": 2
}
```

The real primary is resolvable so long as a majority of members agree on who it is, with quorum defined as `total_members / 2 + 1`.

**There is one exception to note here: even when the primary being evaluated meets quorum, it will still be fenced if a conflict is found. This protects against a possible race condition where the old primary comes back up during an active failover.**
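As a loose sketch of that rule (illustrative only, not the repo's actual types or resolution logic; `ClusterState` and `resolvePrimary` are names invented for this example), the resolution could look something like this:

```
package main

import "fmt"

// ClusterState mirrors the dimensions described above (illustrative shape only).
type ClusterState struct {
	TotalMembers  int            // registered members, including the primary
	TotalActive   int            // responsive members, never less than one
	TotalInactive int            // registered but non-responsive members
	ConflictMap   map[string]int // conflicting primary -> number of standbys reporting it
}

// resolvePrimary returns the address a majority agrees on, or false when
// no candidate reaches quorum (total_members / 2 + 1).
func resolvePrimary(bootingPrimary string, s ClusterState) (string, bool) {
	quorum := s.TotalMembers/2 + 1

	// Standbys that reported no conflict implicitly agree with the booting primary.
	agree := s.TotalActive
	for _, count := range s.ConflictMap {
		agree -= count
	}
	if len(s.ConflictMap) == 0 && agree >= quorum {
		return bootingPrimary, true
	}

	// Otherwise, a conflicting primary must itself reach quorum to be trusted.
	for addr, count := range s.ConflictMap {
		if count >= quorum {
			return addr, true
		}
	}
	return "", false
}

func main() {
	// The 3-member example above: quorum is 3/2 + 1 = 2, and both standbys
	// point at the same conflicting primary, so it wins.
	state := ClusterState{
		TotalMembers: 3,
		TotalActive:  3,
		ConflictMap:  map[string]int{"fdaa:0:2e26:a7b:8c31:bf37:488c:2": 2},
	}
	fmt.Println(resolvePrimary("fdaa:0:2e26:a7b:7d17:9b36:6e4b:2", state))
}
```

Keep the exception above in mind: even when the booting primary reaches quorum, the actual implementation still fences it if any conflict was observed.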
Tests can be found here: https://github.com/fly-apps/postgres-flex/pull/49/files#diff-3d71960ff7855f775cb257a74643d67d2636b354c9d485d10c2ded2426a7f362
## What if the real primary can't be resolved or doesn't match the booting primary?
In both of these cases the primary member will be fenced.

**If the real primary is resolvable**
The cluster will be made read-only, PGBouncer will be reconfigured to target the "real" primary, and that primary's IP address is written to a `zombie.lock` file. The PGBouncer reconfiguration ensures that any connections hitting this member are routed to the real primary, in order to minimize interruptions. Once this is complete, the process panics to force a full member restart. When the member restarts, we read the IP address from the `zombie.lock` file and use it to attempt to rejoin the cluster we diverged from. If the rejoin succeeds, the `zombie.lock` file is cleared and the member boots as a standby.

**Note: We will not attempt to rejoin a cluster if the resolved primary resides in a region that differs from the `PRIMARY_REGION` environment variable set on the local member. The `PRIMARY_REGION` will need to be updated before a rejoin will be attempted.**

**If the real primary is NOT resolvable**
The cluster will be made read-only, PGBouncer will remain disabled, and a `zombie.lock` file will be created without a value. When the member reboots, we read the `zombie.lock` file and see that it's empty, which indicates we've entered a failure mode that can't be recovered automatically. This could be an issue where previously deleted members were not properly unregistered, or the primary's state has diverged to a point where its registered members have been cycled out.
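A rough sketch of this boot-time `zombie.lock` handling might look like the following. It is illustrative only: the lock path, the `rejoinCluster` helper, and the error handling are assumptions rather than the repo's actual code.

```
package main

import (
	"fmt"
	"log"
	"os"
	"strings"
)

const zombieLockPath = "/data/zombie.lock" // path is an assumption for this sketch

// handleZombieLock checks for a fence marker at boot. rejoinCluster stands in
// for whatever actually re-registers this member against the real primary
// (and should also verify PRIMARY_REGION, per the note above).
func handleZombieLock(rejoinCluster func(primaryIP string) error) error {
	data, err := os.ReadFile(zombieLockPath)
	if os.IsNotExist(err) {
		return nil // not fenced; boot normally
	}
	if err != nil {
		return err
	}

	primaryIP := strings.TrimSpace(string(data))
	if primaryIP == "" {
		// Empty lock: the real primary was never resolved, so this failure
		// mode requires manual intervention.
		return fmt.Errorf("zombie.lock is empty; cluster must be recovered manually")
	}

	// A real primary was resolved before the restart: try to rejoin it as a
	// standby and clear the lock on success.
	if err := rejoinCluster(primaryIP); err != nil {
		return fmt.Errorf("failed to rejoin cluster at %s: %w", primaryIP, err)
	}
	return os.Remove(zombieLockPath)
}

func main() {
	err := handleZombieLock(func(primaryIP string) error {
		log.Printf("rejoining cluster via primary %s", primaryIP) // placeholder
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}
}
```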

docs/manual_failovers.md

Lines changed: 87 additions & 0 deletions
# Manual failover
While automatic failover is already baked in, there may be times when a manually issued failover is necessary. The steps to perform a manual failover are listed below.

**Note: The promotion candidate must reside within your PRIMARY_REGION.**
1. Connect to the Machine you wish to promote
```
fly ssh console -s --app <app-name>
```
2. Confirm the member is healthy.
```
# Switch to the postgres user and move to the home directory.
su postgres
cd ~

# Verify member is healthy
repmgr node check

Node "fdaa:0:2e26:a7b:7d17:9b36:6e4b:2":
	Server role: OK (node is standby)
	Replication lag: OK (0 seconds)
	WAL archiving: OK (0 pending archive ready files)
	Upstream connection: OK (node "fdaa:0:2e26:a7b:7d17:9b36:6e4b:2" (ID: 1375486377) is attached to expected upstream node "fdaa:0:2e26:a7b:c850:86cf:b175:2" (ID: 1177616922))
	Downstream servers: OK (this node has no downstream nodes)
	Replication slots: OK (node has no physical replication slots)
	Missing physical replication slots: OK (node has no missing physical replication slots)
	Configured data directory: OK (configured "data_directory" is "/data/postgresql")
```
3. Stop the Machine running as `primary`
Open up a separate terminal and stop the Machine that is currently running as `primary`.
```
# Identify the primary
fly status --app <app-name>

ID              STATE    ROLE     REGION  HEALTH CHECKS       IMAGE                                             CREATED               UPDATED
6e8226ec711087  started  replica  lax     3 total, 3 passing  davissp14/postgres-flex:recovery-fix-00 (custom)  2023-02-15T20:20:51Z  2023-02-15T20:21:10Z
6e82931b729087  started  primary  lax     3 total, 3 passing  davissp14/postgres-flex:recovery-fix-00 (custom)  2023-02-15T20:19:58Z  2023-02-15T20:20:18Z
9185957f411683  started  replica  lax     3 total, 3 passing  davissp14/postgres-flex:recovery-fix-00 (custom)  2023-02-15T20:20:24Z  2023-02-15T20:20:45Z

fly machines stop 6e82931b729087 --app <app-name>
```
4. Run the standby promotion command
Go back to the first terminal you opened that's connected to your promotion candidate.

**WARNING: It's important that you specify `--siblings-follow`; otherwise, any other standbys will not be reconfigured to follow the new primary.**
```
# Issue a dry-run to ensure our candidate is eligible for promotion.
repmgr standby promote --siblings-follow --dry-run

INFO: node is a standby
INFO: no active primary server found in this replication cluster
INFO: all sibling nodes are reachable via SSH
INFO: 1 walsenders required, 10 available
INFO: 1 replication slots required, 10 available
INFO: node will be promoted using the "pg_promote()" function
INFO: prerequisites for executing STANDBY PROMOTE are met
```
If everything looks good, go ahead and re-run the command without the `--dry-run` argument.
```
repmgr standby promote --siblings-follow

NOTICE: promoting standby to primary
DETAIL: promoting server "fdaa:0:2e26:a7b:7d17:9b36:6e4b:2" (ID: 1375486377) using pg_promote()
NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
NOTICE: STANDBY PROMOTE successful
DETAIL: server "fdaa:0:2e26:a7b:7d17:9b36:6e4b:2" (ID: 1375486377) was successfully promoted to primary
INFO: executing notification command for event "standby_promote"
DETAIL: command is:
  /usr/local/bin/event_handler -node-id 1375486377 -event standby_promote -success 1 -details "server \"fdaa:0:2e26:a7b:7d17:9b36:6e4b:2\" (ID: 1375486377) was successfully promoted to primary" -new-node-id ''
NOTICE: executing STANDBY FOLLOW on 1 of 1 siblings
INFO: STANDBY FOLLOW successfully executed on all reachable sibling nodes
```
5. Start the Machine that was previously operating as primary
```
fly machines start 6e82931b729087 --app <app-name>
```
The old primary will come back up, recognize that it's no longer the true primary, and rejoin the cluster as a standby.

docs/troubleshooting.md

Lines changed: 42 additions & 0 deletions
# Troubleshooting
## Member unregistration failed when removing machine
```
$ fly machines remove 9185340f4d3383 --app flex-testing
machine 9185340f4d3383 was found and is currently in stopped state, attempting to destroy...
unregistering postgres member 'fdaa:0:2e26:a7b:7d16:cff7:9849:2' from the cluster... <insert-random-error-here> (failed)

9185340f4d3383 has been destroyed
```
Unfortunately, this can happen for a variety of reasons. If no action is taken, the member and its associated replication slot will automatically be cleaned up after 24 hours. Depending on the current cluster size, problems can arise if the down member impacts the cluster's ability to meet quorum (for example, a 3-member cluster has a quorum of 2, so one down member is tolerable but two are not). If that's the case, it's important to take action right away to prevent your cluster from going read-only.
To address this, start by SSHing into one of your running Machines.

```
fly ssh console --app <app-name>
```
Switch to the postgres user and move into the home directory.
```
su postgres
cd ~
```
Use the `repmgr` CLI tool to view the current cluster state.
```
repmgr daemon status

 ID         | Name                             | Role    | Status        | Upstream                           | repmgrd | PID | Paused? | Upstream last seen
------------+----------------------------------+---------+---------------+------------------------------------+---------+-----+---------+--------------------
 376084936  | fdaa:0:2e26:a7b:7d18:1a68:804e:2 | primary | * running     |                                    | running | 630 | no      | n/a
 1349952263 | fdaa:0:2e26:a7b:7d17:4463:955d:2 | standby | ? unreachable | ? fdaa:0:2e26:a7b:7d18:1a68:804e:2 | n/a     | n/a | n/a     | n/a
 1412735685 | fdaa:0:2e26:a7b:c850:8f12:fb1d:2 | standby |   running     | fdaa:0:2e26:a7b:7d18:1a68:804e:2   | running | 617 | no      | 1 second(s) ago
```
Manually unregister the unreachable standby.
```
repmgr standby unregister --node-id 1349952263
```
