# Fencing

## How do we verify the real primary?
We start by evaluating the cluster state: each registered standby is checked for connectivity and asked who its primary is.

The "cluster state" is represented across a few different dimensions:

**Total members**
Number of registered members, including the primary.

**Total active members**
Number of members that are responsive. This includes the primary being evaluated, so this will never be less than one.

**Total inactive members**
Number of registered members that are non-responsive.

**Conflict map**
The conflict map is a `map[string]int` that tracks conflicting primaries reported by our standbys and the number of times each conflicting primary was referenced.

As an example, say we have a three-member cluster and both standbys indicate that their registered primary does not match. This will be recorded as:
```
map[string]int{
  "fdaa:0:2e26:a7b:8c31:bf37:488c:2": 2
}
```
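
To make this concrete, here's a rough Go sketch of how the evaluation could be structured. The `ClusterState` and `Standby` types and the `queryRegisteredPrimary` helper are illustrative names only, not the actual types or functions used by postgres-flex:

```go
package main

import (
	"context"
	"errors"
)

type Standby struct {
	Hostname string
}

// ClusterState captures the dimensions described above.
type ClusterState struct {
	TotalMembers  int            // registered members, including the primary
	TotalActive   int            // responsive members, always at least one (self)
	TotalInactive int            // registered but non-responsive members
	ConflictMap   map[string]int // conflicting primary -> number of standbys reporting it
}

// evaluateClusterState polls each registered standby and records any primary
// that differs from the member currently being evaluated (self).
func evaluateClusterState(ctx context.Context, self string, standbys []Standby) ClusterState {
	state := ClusterState{
		TotalMembers: len(standbys) + 1, // standbys plus the primary under evaluation
		TotalActive:  1,                 // self always counts as active
		ConflictMap:  map[string]int{},
	}

	for _, s := range standbys {
		primary, err := queryRegisteredPrimary(ctx, s.Hostname)
		if err != nil {
			state.TotalInactive++
			continue
		}
		state.TotalActive++
		if primary != self {
			state.ConflictMap[primary]++
		}
	}

	return state
}

// queryRegisteredPrimary would ask a standby which primary it is configured to
// follow; the real project resolves this over the network.
func queryRegisteredPrimary(ctx context.Context, hostname string) (string, error) {
	return "", errors.New("placeholder: not implemented")
}
```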

The real primary is resolvable so long as a majority of members agree on who it is. Quorum is defined as `total_members / 2 + 1`.

**There is one exception to note here: when the primary being evaluated meets quorum, it will still be fenced if a conflict is found. This protects against a possible race condition where the old primary comes back up during an active failover.**
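
Building on the hypothetical `ClusterState` type above, the resolution and fencing decision might look roughly like this (again, `resolveRealPrimary` and `shouldFence` are illustrative sketches, not the project's real functions):

```go
// resolveRealPrimary returns the real primary and whether it could be resolved,
// using quorum = total_members / 2 + 1.
func resolveRealPrimary(state ClusterState, self string) (string, bool) {
	quorum := state.TotalMembers/2 + 1

	// If a conflicting primary reported by the standbys reaches quorum, that's the real primary.
	for candidate, votes := range state.ConflictMap {
		if votes >= quorum {
			return candidate, true
		}
	}

	// Otherwise self is the real primary only if enough members agree with it.
	agreeing := state.TotalActive - totalConflicts(state.ConflictMap)
	if agreeing >= quorum {
		return self, true
	}
	return "", false
}

// shouldFence applies the rule above: fence whenever self is not the resolved
// primary, the primary can't be resolved, or any conflict was reported at all
// (even if self still meets quorum).
func shouldFence(state ClusterState, self string) bool {
	realPrimary, ok := resolveRealPrimary(state, self)
	if !ok || realPrimary != self {
		return true
	}
	return len(state.ConflictMap) > 0
}

func totalConflicts(m map[string]int) int {
	total := 0
	for _, votes := range m {
		total += votes
	}
	return total
}
```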

Tests can be found here: https://github.com/fly-apps/postgres-flex/pull/49/files#diff-3d71960ff7855f775cb257a74643d67d2636b354c9d485d10c2ded2426a7f362

## What if the real primary can't be resolved or doesn't match the booting primary?

In both of these cases the primary member will be fenced.

**If the real primary is resolvable**
The cluster will be made read-only, PGBouncer will be reconfigured to target the "real" primary, and the real primary's IP address will be written to a `zombie.lock` file. The PGBouncer reconfiguration ensures that any connections hitting this member are routed to the real primary, minimizing interruptions. Once this is complete, the process panics to force a full member restart. When the member restarts, we read the IP address from the `zombie.lock` file and use it to attempt to rejoin the cluster we diverged from. If we are successful, the `zombie.lock` file is cleared and we boot as a standby.

**Note: We will not attempt to rejoin a cluster if the resolved primary resides in a region that differs from the `PRIMARY_REGION` environment variable set on self. The `PRIMARY_REGION` will need to be updated before a rejoin is attempted.**
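
As a rough sketch of the fence-and-restart sequence, assuming placeholder helpers (`setReadOnly`, `reconfigurePGBouncer`) and an illustrative lock path rather than the project's actual ones:

```go
package main

import "os"

const zombieLockPath = "/data/zombie.lock" // illustrative path

// fenceWithKnownPrimary sketches the fencing steps when the real primary is known.
func fenceWithKnownPrimary(realPrimaryIP string) error {
	// Make the local cluster read-only so no new writes can diverge further.
	if err := setReadOnly(); err != nil {
		return err
	}
	// Route connections hitting this member to the real primary to minimize interruptions.
	if err := reconfigurePGBouncer(realPrimaryIP); err != nil {
		return err
	}
	// Record who we believe the real primary is so the next boot can attempt a rejoin.
	if err := os.WriteFile(zombieLockPath, []byte(realPrimaryIP), 0600); err != nil {
		return err
	}
	// Force a full member restart; the zombie.lock is evaluated on boot.
	panic("member fenced: restarting to attempt rejoin")
}

// Placeholders standing in for the real implementations.
func setReadOnly() error                   { return nil }
func reconfigurePGBouncer(ip string) error { return nil }
```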

**If the real primary is NOT resolvable**
The cluster will be made read-only, PGBouncer will remain disabled, and a `zombie.lock` file will be created without a value. When the member reboots, we read the `zombie.lock` file and see that it's empty. This indicates that we've entered a failure mode that can't be recovered automatically. This could be an issue where previously deleted members were not properly unregistered, or where the primary's state has diverged to the point where its registered members have been cycled out.
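
A boot-time sketch covering both cases might look like the following, with `rejoinCluster`, `resolveRegion`, and the lock path again being hypothetical stand-ins for the real logic:

```go
package main

import (
	"errors"
	"os"
	"strings"
)

const zombieLockPath = "/data/zombie.lock" // illustrative path

// handleZombieLockOnBoot decides whether this member can recover automatically.
func handleZombieLockOnBoot() error {
	data, err := os.ReadFile(zombieLockPath)
	if errors.Is(err, os.ErrNotExist) {
		return nil // no fence in progress, boot normally
	}
	if err != nil {
		return err
	}

	realPrimaryIP := strings.TrimSpace(string(data))
	if realPrimaryIP == "" {
		// An empty lock means the real primary could not be resolved before the
		// restart; this failure mode requires manual intervention.
		return errors.New("zombie.lock is empty: cluster cannot be recovered automatically")
	}

	// Don't rejoin a primary that lives outside our configured primary region.
	if resolveRegion(realPrimaryIP) != os.Getenv("PRIMARY_REGION") {
		return errors.New("resolved primary is outside PRIMARY_REGION; update it before rejoining")
	}

	if err := rejoinCluster(realPrimaryIP); err != nil {
		return err
	}
	// Rejoin succeeded: clear the lock and continue booting as a standby.
	return os.Remove(zombieLockPath)
}

// Placeholders standing in for the real implementations.
func rejoinCluster(primaryIP string) error  { return nil }
func resolveRegion(primaryIP string) string { return "" }
```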