BZ-1882405: Adding etcd defrag docs

bergerhoffer · bergerhoffer · commit e4233ca0d648 · 2020-11-30T10:04:23.000-05:00
diff --git a/modules/etcd-defrag.adoc b/modules/etcd-defrag.adoc
@@ -0,0 +1,142 @@
+// Module included in the following assemblies:
+//
+// * post_installation_configuration/cluster-tasks.adoc
+
+[id="etcd-defrag_{context}"]
+= Defragmenting etcd data
+
+Manual defragmentation must be performed periodically to reclaim disk space after etcd history compaction and other events cause disk fragmentation.
+
+History compaction is performed automatically every five minutes and leaves gaps in the back-end database. This fragmented space is available for use by etcd, but is not available to the host file system. You must defragment etcd to make this space available to the host file system.
+
+Because etcd writes data to disk, its performance strongly depends on disk performance. Consider defragmenting etcd every month, twice a month, or as needed for your cluster. You can also monitor the `etcd_db_total_size_in_bytes` metric to determine whether defragmentation is necessary.
+
+[WARNING]
+====
+Defragmenting etcd is a blocking action. The etcd member will not respond until defragmentation is complete. For this reason, wait at least one minute between defragmentation actions on each of the pods to allow the cluster to recover.
+====
+
+Follow this procedure to defragment etcd data on each etcd member.
+
+.Prerequisites
+
+* You have access to the cluster as a user with the `cluster-admin` role.
+
+.Procedure
+
+. Determine which etcd member is the leader, because the leader should be defragmented last.
+
+.. Get the list of etcd pods:
++
+[source,terminal]
+----
+$ oc get pods -n openshift-etcd -o wide | grep -v quorum-guard | grep etcd
+----
++
+.Example output
+[source,terminal]
+----
+etcd-ip-10-0-159-225.example.redhat.com                3/3     Running     0          175m   10.0.159.225   ip-10-0-159-225.example.redhat.com   <none>           <none>
+etcd-ip-10-0-191-37.example.redhat.com                 3/3     Running     0          173m   10.0.191.37    ip-10-0-191-37.example.redhat.com    <none>           <none>
+etcd-ip-10-0-199-170.example.redhat.com                3/3     Running     0          176m   10.0.199.170   ip-10-0-199-170.example.redhat.com   <none>           <none>
+----
+
+.. Choose a pod and run the following command to determine which etcd member is the leader:
++
+[source,terminal]
+----
+$ oc rsh -n openshift-etcd etcd-ip-10-0-159-225.example.redhat.com etcdctl endpoint status --cluster -w table
+----
++
+.Example output
+[source,terminal]
+----
+Defaulting container name to etcdctl.
+Use 'oc describe pod/etcd-ip-10-0-159-225.example.redhat.com -n openshift-etcd' to see all of the containers in this pod.
++---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
+|         ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
++---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
+|  https://10.0.191.37:2379 | 251cd44483d811c3 |   3.4.9 |  104 MB |     false |      false |         7 |      91624 |              91624 |        |
+| https://10.0.159.225:2379 | 264c7c58ecbdabee |   3.4.9 |  104 MB |     false |      false |         7 |      91624 |              91624 |        |
+| https://10.0.199.170:2379 | 9ac311f93915cc79 |   3.4.9 |  104 MB |      true |      false |         7 |      91624 |              91624 |        |
++---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
+----
++
+Based on the `IS LEADER` column of this output, the [x-]`https://10.0.199.170:2379` endpoint is the leader. Matching this endpoint with the output of the previous step, the pod name of the leader is `etcd-ip-10-0-199-170.example.redhat.com`.
+
+. Defragment an etcd member.
+
+.. Connect to the running etcd container, passing in the name of a pod that is _not_ the leader:
++
+[source,terminal]
+----
+$ oc rsh -n openshift-etcd etcd-ip-10-0-159-225.example.redhat.com
+----
+
+.. Unset the `ETCDCTL_ENDPOINTS` environment variable:
++
+[source,terminal]
+----
+sh-4.4# unset ETCDCTL_ENDPOINTS
+----
+
+.. Defragment the etcd member:
++
+[source,terminal]
+----
+sh-4.4# etcdctl --command-timeout=30s --endpoints=https://localhost:2379 defrag
+----
++
+.Example output
+[source,terminal]
+----
+Finished defragmenting etcd member[https://localhost:2379]
+----
++
+If a timeout error occurs, increase the value for `--command-timeout` until the command succeeds.
+
+.. Verify that the database size was reduced:
++
+[source,terminal]
+----
+sh-4.4# etcdctl endpoint status -w table --cluster
+----
++
+.Example output
+[source,terminal]
+----
++---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
+|         ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
++---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
+|  https://10.0.191.37:2379 | 251cd44483d811c3 |   3.4.9 |  104 MB |     false |      false |         7 |      91624 |              91624 |        |
+| https://10.0.159.225:2379 | 264c7c58ecbdabee |   3.4.9 |   41 MB |     false |      false |         7 |      91624 |              91624 |        | <1>
+| https://10.0.199.170:2379 | 9ac311f93915cc79 |   3.4.9 |  104 MB |      true |      false |         7 |      91624 |              91624 |        |
++---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
+----
+<1> This example shows that the database size for this etcd member is now 41 MB, down from the starting size of 104 MB.
+
+.. Repeat these steps to connect to each of the other etcd members and defragment them. Always defragment the leader last.
++
+Wait at least one minute between defragmentation actions to allow the etcd pod to recover. Until the etcd pod recovers, the etcd member will not respond.
+
+. If any `NOSPACE` alarms were triggered due to the space quota being exceeded, clear them.
+
+.. Check whether there are any `NOSPACE` alarms:
++
+[source,terminal]
+----
+sh-4.4# etcdctl alarm list
+----
++
+.Example output
+[source,terminal]
+----
+memberID:12345678912345678912 alarm:NOSPACE
+----
+
+.. Clear the alarms:
++
+[source,terminal]
+----
+sh-4.4# etcdctl alarm disarm
+----
diff --git a/post_installation_configuration/cluster-tasks.adoc b/post_installation_configuration/cluster-tasks.adoc
@@ -81,13 +81,14 @@ include::modules/nodes-cluster-enabling-features-cluster.adoc[leveloffset=+1]
 
 [id="post-install-etcd-tasks"]
 == etcd tasks
-Enable, disable, or back up etcd.
+Back up etcd, enable or disable etcd encryption, or defragment etcd data.
 
 include::modules/recommended-etcd-practices.adoc[leveloffset=+2]
 include::modules/about-etcd-encryption.adoc[leveloffset=+2]
 include::modules/enabling-etcd-encryption.adoc[leveloffset=+2]
 include::modules/disabling-etcd-encryption.adoc[leveloffset=+2]
 include::modules/backup-etcd.adoc[leveloffset=+2]
+include::modules/etcd-defrag.adoc[leveloffset=+2]
 include::modules/dr-restoring-cluster-state.adoc[leveloffset=+2]
 
 [id="post-install-pod-disruption-budgets"]
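
The leader-last ordering from step 1 of `modules/etcd-defrag.adoc` can be sketched as a small shell filter over the status table. This is an illustration only: the helper names `leader_endpoint` and `defrag_order` are hypothetical, and the sketch assumes the table layout shown in the example output (endpoint in the first column, `IS LEADER` in the fifth).

```shell
#!/bin/sh
# Sketch only: leader_endpoint and defrag_order are hypothetical helpers.
# Both read `etcdctl endpoint status --cluster -w table` output on stdin.

# Print the endpoint whose IS LEADER column is "true".
leader_endpoint() {
  awk -F'|' '
    # Data rows start with "|" and hold an https endpoint in field 2;
    # border rows ("+---") and the header row are skipped.
    /^[|]/ && $2 ~ /https/ {
      gsub(/ /, "", $2); gsub(/ /, "", $6)
      if ($6 == "true") print $2
    }'
}

# Print all endpoints, one per line, with the leader last.
defrag_order() {
  awk -F'|' '
    /^[|]/ && $2 ~ /https/ {
      gsub(/ /, "", $2); gsub(/ /, "", $6)
      if ($6 == "true") leader = $2; else print $2
    }
    END { if (leader != "") print leader }'
}

# Possible use inside the pod shell from the procedure:
#   etcdctl endpoint status --cluster -w table | defrag_order
```

With the example status table from the module as input, `defrag_order` prints the two non-leader endpoints first and `https://10.0.199.170:2379` last. The one-minute wait between members still has to be observed separately.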