Commit e4233ca

BZ-1882405: Adding etcd defrag docs
1 parent 35b3f31 commit e4233ca

2 files changed

+144
-1
lines changed

modules/etcd-defrag.adoc

Lines changed: 142 additions & 0 deletions
@@ -0,0 +1,142 @@
1+
// Module included in the following assemblies:
2+
//
3+
// * post_installation_configuration/cluster-tasks.adoc
4+
5+
[id="etcd-defrag_{context}"]
6+
= Defragmenting etcd data
7+
8+
Manual defragmentation must be performed periodically to reclaim disk space after etcd history compaction and other events cause disk fragmentation.
9+
10+
History compaction is performed automatically every five minutes and leaves gaps in the back-end database. This fragmented space is available for use by etcd, but is not available to the host file system. You must defragment etcd to make this space available to the host file system.
11+
12+
Because etcd writes data to disk, its performance strongly depends on disk performance. Consider defragmenting etcd every month, twice a month, or as needed for your cluster. You can also monitor the `etcd_db_total_size_in_bytes` metric to determine whether defragmentation is necessary.
13+
14+
[WARNING]
15+
====
16+
Defragmenting etcd is a blocking action. The etcd member will not respond until defragmentation is complete. For this reason, wait at least one minute between defragmentation actions on each of the pods to allow the cluster to recover.
17+
====
18+
19+
Follow this procedure to defragment etcd data on each etcd member.
20+
21+
.Prerequisites
22+
23+
* You have access to the cluster as a user with the `cluster-admin` role.
24+
25+
.Procedure
26+
27+
. Determine which etcd member is the leader, because the leader should be defragmented last.
28+
29+
.. Get the list of etcd pods:
30+
+
31+
[source,terminal]
32+
----
33+
$ oc get pods -n openshift-etcd -o wide | grep -v quorum-guard | grep etcd
34+
----
35+
+
36+
.Example output
37+
[source,terminal]
38+
----
39+
etcd-ip-10-0-159-225.example.redhat.com 3/3 Running 0 175m 10.0.159.225 ip-10-0-159-225.example.redhat.com <none> <none>
40+
etcd-ip-10-0-191-37.example.redhat.com 3/3 Running 0 173m 10.0.191.37 ip-10-0-191-37.example.redhat.com <none> <none>
41+
etcd-ip-10-0-199-170.example.redhat.com 3/3 Running 0 176m 10.0.199.170 ip-10-0-199-170.example.redhat.com <none> <none>
42+
----
43+
44+
.. Choose a pod and run the following command to determine which etcd member is the leader:
45+
+
46+
[source,terminal]
47+
----
48+
$ oc rsh -n openshift-etcd etcd-ip-10-0-159-225.example.redhat.com etcdctl endpoint status --cluster -w table
49+
----
50+
+
51+
.Example output
52+
[source,terminal]
53+
----
54+
Defaulting container name to etcdctl.
55+
Use 'oc describe pod/etcd-ip-10-0-159-225.example.redhat.com -n openshift-etcd' to see all of the containers in this pod.
56+
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
57+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
58+
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
59+
| https://10.0.191.37:2379 | 251cd44483d811c3 | 3.4.9 | 104 MB | false | false | 7 | 91624 | 91624 | |
60+
| https://10.0.159.225:2379 | 264c7c58ecbdabee | 3.4.9 | 104 MB | false | false | 7 | 91624 | 91624 | |
61+
| https://10.0.199.170:2379 | 9ac311f93915cc79 | 3.4.9 | 104 MB | true | false | 7 | 91624 | 91624 | |
62+
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
63+
----
64+
+
65+
Based on the `IS LEADER` column of this output, the [x-]`https://10.0.199.170:2379` endpoint is the leader. Matching this endpoint with the output of the previous step, the pod name of the leader is `etcd-ip-10-0-199-170.example.redhat.com`.
66+
67+
. Defragment an etcd member.
68+
69+
.. Connect to the running etcd container, passing in the name of a pod that is _not_ the leader:
70+
+
71+
[source,terminal]
72+
----
73+
$ oc rsh -n openshift-etcd etcd-ip-10-0-159-225.example.redhat.com
74+
----
75+
76+
.. Unset the `ETCDCTL_ENDPOINTS` environment variable:
77+
+
78+
[source,terminal]
79+
----
80+
sh-4.4# unset ETCDCTL_ENDPOINTS
81+
----
82+
83+
.. Defragment the etcd member:
84+
+
85+
[source,terminal]
86+
----
87+
sh-4.4# etcdctl --command-timeout=30s --endpoints=https://localhost:2379 defrag
88+
----
89+
+
90+
.Example output
91+
[source,terminal]
92+
----
93+
Finished defragmenting etcd member[https://localhost:2379]
94+
----
95+
+
96+
If a timeout error occurs, increase the value for `--command-timeout` until the command succeeds.
97+
98+
.. Verify that the database size was reduced:
99+
+
100+
[source,terminal]
101+
----
102+
sh-4.4# etcdctl endpoint status -w table --cluster
103+
----
104+
+
105+
.Example output
106+
[source,terminal]
107+
----
108+
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
109+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
110+
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
111+
| https://10.0.191.37:2379 | 251cd44483d811c3 | 3.4.9 | 104 MB | false | false | 7 | 91624 | 91624 | |
112+
| https://10.0.159.225:2379 | 264c7c58ecbdabee | 3.4.9 | 41 MB | false | false | 7 | 91624 | 91624 | | <1>
113+
| https://10.0.199.170:2379 | 9ac311f93915cc79 | 3.4.9 | 104 MB | true | false | 7 | 91624 | 91624 | |
114+
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
115+
----
116+
<1> This example shows that the database size for this etcd member is now 41 MB, down from the starting size of 104 MB.
117+
118+
.. Repeat these steps to connect to each of the other etcd members and defragment them. Always defragment the leader last.
119+
+
120+
Wait at least one minute between defragmentation actions to allow the etcd pod to recover. Until the etcd pod recovers, the etcd member will not respond.
121+
122+
. If any `NOSPACE` alarms were triggered due to the space quota being exceeded, clear them.
123+
124+
.. Check whether there are any `NOSPACE` alarms:
125+
+
126+
[source,terminal]
127+
----
128+
sh-4.4# etcdctl alarm list
129+
----
130+
+
131+
.Example output
132+
[source,terminal]
133+
----
134+
memberID:12345678912345678912 alarm:NOSPACE
135+
----
136+
137+
.. Clear the alarms:
138+
+
139+
[source,terminal]
140+
----
141+
sh-4.4# etcdctl alarm disarm
142+
----
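The leader-last ordering used in this procedure can be derived from the `endpoint status` table rather than read by eye. The following is a minimal sketch, not part of the product tooling: `defrag_order` is a hypothetical helper name, and the sketch assumes the table layout shown in the example output above, where the endpoint is the first column and `IS LEADER` is the fifth.

```shell
#!/bin/sh
# Sketch only: defrag_order is a hypothetical helper, not an etcdctl or oc command.
# Reads the output of `etcdctl endpoint status --cluster -w table` on stdin and
# prints one endpoint per line in defragmentation order:
# non-leader members first, the leader last.
defrag_order() {
  awk -F'|' '
    # Data rows are the only lines whose first cell holds an endpoint URL.
    # Border rows, the header row, and oc informational lines are skipped.
    $2 ~ /https:\/\// {
      endpoint = $2
      is_leader = $6          # assumed position of the IS LEADER column
      gsub(/[ \t]/, "", endpoint)
      gsub(/[ \t]/, "", is_leader)
      if (is_leader == "true") {
        leader = endpoint     # hold the leader back until the end
      } else {
        print endpoint
      }
    }
    END { if (leader != "") print leader }
  '
}
```

Under these assumptions, piping the table from the leader-detection step into `defrag_order` lists the endpoints in the order in which their members should be defragmented. Matching each endpoint to its pod name, and waiting at least one minute between members, are still done as described in the procedure.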

post_installation_configuration/cluster-tasks.adoc

Lines changed: 2 additions & 1 deletion
@@ -81,13 +81,14 @@ include::modules/nodes-cluster-enabling-features-cluster.adoc[leveloffset=+1]
8181

8282
[id="post-install-etcd-tasks"]
8383
== etcd tasks
84-
Enable, disable, or back up etcd.
84+
Back up etcd, enable or disable etcd encryption, or defragment etcd data.
8585

8686
include::modules/recommended-etcd-practices.adoc[leveloffset=+2]
8787
include::modules/about-etcd-encryption.adoc[leveloffset=+2]
8888
include::modules/enabling-etcd-encryption.adoc[leveloffset=+2]
8989
include::modules/disabling-etcd-encryption.adoc[leveloffset=+2]
9090
include::modules/backup-etcd.adoc[leveloffset=+2]
91+
include::modules/etcd-defrag.adoc[leveloffset=+2]
9192
include::modules/dr-restoring-cluster-state.adoc[leveloffset=+2]
9293

9394
[id="post-install-pod-disruption-budgets"]
