@@ -27,14 +27,14 @@ content_type: task
* Ensure that no resource starvation occurs.

Performance and stability of the cluster is sensitive to network and disk
- IO. Any resource starvation can lead to heartbeat timeout, causing instability
+ I/O. Any resource starvation can lead to heartbeat timeout, causing instability
of the cluster. An unstable etcd indicates that no leader is elected. Under
such circumstances, a cluster cannot make any changes to its current state,
which implies no new pods can be scheduled.

- * Keeping stable etcd clusters is critical to the stability of Kubernetes
+ * Keeping etcd clusters stable is critical to the stability of Kubernetes
clusters. Therefore, run etcd clusters on dedicated machines or isolated
- environments for [guaranteed resource requirements](https://github.com/coreos/etcd/blob/master/Documentation/op-guide/hardware.md#hardware-recommendations).
+ environments for [guaranteed resource requirements](https://etcd.io/docs/current/op-guide/hardware/).

* The minimum recommended version of etcd to run in production is `3.2.10+`.
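As an illustrative aside on the version requirement above (a sketch, not part of the page being changed; the endpoint value is a placeholder), the running server version can be checked with:

```shell
# Print the local etcd binary version; it should report 3.2.10 or later.
etcd --version

# Or query a live member and read the VERSION column of the table output.
ETCDCTL_API=3 etcdctl --endpoints=http://10.0.0.1:2379 endpoint status -w table
```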
@@ -43,7 +43,7 @@ content_type: task
Operating etcd with limited resources is suitable only for testing purposes.
For deploying in production, advanced hardware configuration is required.
Before deploying etcd in production, see
- [resource requirement reference documentation](https://github.com/coreos/etcd/blob/master/Documentation/op-guide/hardware.md#example-hardware-configurations).
+ [resource requirement reference](https://etcd.io/docs/current/op-guide/hardware/#example-hardware-configurations).

## Starting etcd clusters
@@ -60,7 +60,7 @@ Use a single-node etcd cluster only for testing purpose.
--advertise-client-urls=http://$PRIVATE_IP:2379
```

- 2. Start Kubernetes API server with the flag
+ 2. Start the Kubernetes API server with the flag
`--etcd-servers=$PRIVATE_IP:2379`.

Make sure `PRIVATE_IP` is set to your etcd client IP.
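For context, a minimal single-node start corresponding to the step above might look like this (an illustrative sketch only; plain HTTP and the example IP are test-only placeholders):

```shell
# Test-only single-member etcd (no TLS).
PRIVATE_IP=10.0.0.1   # example value

etcd --listen-client-urls=http://$PRIVATE_IP:2379 \
  --advertise-client-urls=http://$PRIVATE_IP:2379
```

The API server is then pointed at this member with `--etcd-servers=$PRIVATE_IP:2379`, as described above.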
@@ -69,12 +69,12 @@ Use a single-node etcd cluster only for testing purpose.

For durability and high availability, run etcd as a multi-node cluster in
production and back it up periodically. A five-member cluster is recommended
- in production. For more information, see [FAQ
- Documentation](https://github.com/coreos/etcd/blob/master/Documentation/faq.md#what-is-failure-tolerance).
+ in production. For more information, see
+ [FAQ documentation](https://etcd.io/docs/current/faq/#what-is-failure-tolerance).

Configure an etcd cluster either by static member information or by dynamic
- discovery. For more information on clustering, see [etcd Clustering
- Documentation](https://github.com/coreos/etcd/blob/master/Documentation/op-guide/clustering.md).
+ discovery. For more information on clustering, see
+ [etcd clustering documentation](https://etcd.io/docs/current/op-guide/clustering/).
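As an illustrative sketch of the static-member option mentioned above (member names, IPs, and the cluster token are placeholders, not values from this page), one member of a statically bootstrapped cluster might be started like so:

```shell
# Static bootstrap for one member of a three-member cluster; repeat on each
# node with its own --name and URLs, keeping --initial-cluster identical.
etcd --name infra0 \
  --listen-peer-urls http://10.0.0.1:2380 \
  --initial-advertise-peer-urls http://10.0.0.1:2380 \
  --listen-client-urls http://10.0.0.1:2379 \
  --advertise-client-urls http://10.0.0.1:2379 \
  --initial-cluster infra0=http://10.0.0.1:2380,infra1=http://10.0.0.2:2380,infra2=http://10.0.0.3:2380 \
  --initial-cluster-state new \
  --initial-cluster-token etcd-cluster-1
```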
For an example, consider a five-member etcd cluster running with the following
client URLs: `http://$IP1:2379`, `http://$IP2:2379`, `http://$IP3:2379`,
@@ -86,10 +86,10 @@ client URLs: `http://$IP1:2379`, `http://$IP2:2379`, `http://$IP3:2379`,
etcd --listen-client-urls=http://$IP1:2379,http://$IP2:2379,http://$IP3:2379,http://$IP4:2379,http://$IP5:2379 --advertise-client-urls=http://$IP1:2379,http://$IP2:2379,http://$IP3:2379,http://$IP4:2379,http://$IP5:2379
```

- 2. Start Kubernetes API servers with the flag
+ 2. Start the Kubernetes API servers with the flag
`--etcd-servers=$IP1:2379,$IP2:2379,$IP3:2379,$IP4:2379,$IP5:2379`.

- Replace `IP<n>` with your client IP addresses.
+ Make sure the `IP<n>` variables are set to your client IP addresses.
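As an illustrative sketch of that step (IP values are placeholders, and every other flag the API server needs is omitted), the `--etcd-servers` list would be passed to each instance roughly as follows:

```shell
# Example endpoint values only; set these to your etcd members' client IPs.
export IP1=10.0.0.1 IP2=10.0.0.2 IP3=10.0.0.3 IP4=10.0.0.4 IP5=10.0.0.5

# Every kube-apiserver instance lists all five members (remaining
# kube-apiserver flags are omitted for brevity).
kube-apiserver --etcd-servers=$IP1:2379,$IP2:2379,$IP3:2379,$IP4:2379,$IP5:2379
```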
### Multi-node etcd cluster with load balancer
@@ -121,16 +121,16 @@ authentication.

To configure etcd with secure peer communication, specify flags
`--peer-key-file=peer.key` and `--peer-cert-file=peer.cert`, and use HTTPS as
- URL schema.
+ the URL schema.

Similarly, to configure etcd with secure client communication, specify flags
`--key-file=k8sclient.key` and `--cert-file=k8sclient.cert`, and use HTTPS as
- URL schema.
+ the URL schema.
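Putting those flags together, a member serving both peer and client traffic over HTTPS might be started along these lines (an illustrative sketch; the file names follow the examples in the text and the IP is a placeholder):

```shell
# TLS for client traffic (--cert-file/--key-file) and peer traffic
# (--peer-cert-file/--peer-key-file); note the https:// URL schema.
etcd --cert-file=k8sclient.cert --key-file=k8sclient.key \
  --peer-cert-file=peer.cert --peer-key-file=peer.key \
  --listen-client-urls=https://10.0.0.1:2379 \
  --advertise-client-urls=https://10.0.0.1:2379 \
  --listen-peer-urls=https://10.0.0.1:2380 \
  --initial-advertise-peer-urls=https://10.0.0.1:2380
```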
### Limiting access of etcd clusters

After configuring secure communication, restrict the access of etcd cluster to
- only the Kubernetes API server. Use TLS authentication to do so.
+ only the Kubernetes API servers. Use TLS authentication to do so.

For example, consider key pairs `k8sclient.key` and `k8sclient.cert` that are
trusted by the CA `etcd.ca`. When etcd is configured with `--client-cert-auth`
@@ -140,7 +140,7 @@ or the CA passed in by `--trusted-ca-file` flag. Specifying flags
access to clients with the certificate `k8sclient.cert`.

Once etcd is configured correctly, only clients with valid certificates can
- access it. To give Kubernetes API server the access, configure it with the
+ access it. To give Kubernetes API servers the access, configure them with the
flags `--etcd-certfile=k8sclient.cert`, `--etcd-keyfile=k8sclient.key` and
`--etcd-cafile=ca.cert`.
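The two sides of that handshake can be sketched as follows (illustrative only; serving-certificate flags from the previous section are left out and the endpoint is a placeholder):

```shell
# etcd side: only accept clients whose certificates are signed by etcd.ca.
etcd --client-cert-auth --trusted-ca-file=etcd.ca \
  --listen-client-urls=https://10.0.0.1:2379 \
  --advertise-client-urls=https://10.0.0.1:2379

# Kubernetes API server side: present the certificate that etcd trusts
# (other required kube-apiserver flags omitted).
kube-apiserver --etcd-servers=https://10.0.0.1:2379 \
  --etcd-certfile=k8sclient.cert \
  --etcd-keyfile=k8sclient.key \
  --etcd-cafile=ca.cert
```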
@@ -160,11 +160,11 @@ member.

Though etcd keeps unique member IDs internally, it is recommended to use a
unique name for each member to avoid human errors. For example, consider a
- three-member etcd cluster. Let the URLs be, member1=http://10.0.0.1,
- member2=http://10.0.0.2, and member3=http://10.0.0.3. When member1 fails,
- replace it with member4=http://10.0.0.4.
+ three-member etcd cluster. Let the URLs be, `member1=http://10.0.0.1`,
+ `member2=http://10.0.0.2`, and `member3=http://10.0.0.3`. When `member1` fails,
+ replace it with `member4=http://10.0.0.4`.

- 1. Get the member ID of the failed member1:
+ 1. Get the member ID of the failed `member1`:

```shell
etcdctl --endpoints=http://10.0.0.2,http://10.0.0.3 member list
@@ -213,21 +213,22 @@ replace it with member4=http://10.0.0.4.

5. Do either of the following:

- 1. Update its `--etcd-servers` flag to make Kubernetes aware of the
- configuration changes, then restart the Kubernetes API server.
+ 1. Update the `--etcd-servers` flag for the Kubernetes API servers to make
+ Kubernetes aware of the configuration changes, then restart the
+ Kubernetes API servers.

2. Update the load balancer configuration if a load balancer is used in the
deployment.

- For more information on cluster reconfiguration, see [etcd Reconfiguration
- Documentation](https://github.com/coreos/etcd/blob/master/Documentation/op-guide/runtime-configuration.md#remove-a-member).
+ For more information on cluster reconfiguration, see
+ [etcd reconfiguration documentation](https://etcd.io/docs/current/op-guide/runtime-configuration/#remove-a-member).
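For illustration, the removal and replacement of the failed member discussed in these steps typically uses the etcdctl membership commands (a sketch; the member ID shown is a placeholder taken from a `member list` output):

```shell
# Remove the failed member1 by its ID, then register the replacement member4.
etcdctl --endpoints=http://10.0.0.2,http://10.0.0.3 member remove 8211f1d0f64f3269
etcdctl --endpoints=http://10.0.0.2,http://10.0.0.3 member add member4 --peer-urls=http://10.0.0.4:2380
```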
## Backing up an etcd cluster

All Kubernetes objects are stored on etcd. Periodically backing up the etcd
cluster data is important to recover Kubernetes clusters under disaster
- scenarios, such as losing all master nodes. The snapshot file contains all the
- Kubernetes states and critical information. In order to keep the sensitive
- Kubernetes data safe, encrypt the snapshot files.
+ scenarios, such as losing all control plane nodes. The snapshot file contains
+ all the Kubernetes states and critical information. In order to keep the
+ sensitive Kubernetes data safe, encrypt the snapshot files.

Backing up an etcd cluster can be accomplished in two ways: etcd built-in
snapshot and volume snapshot.
@@ -236,10 +237,10 @@ snapshot and volume snapshot.

etcd supports built-in snapshot. A snapshot may either be taken from a live
member with the `etcdctl snapshot save` command or by copying the
- `member/snap/db` file from an etcd [data
- directory](https://github.com/coreos/etcd/blob/master/Documentation/op-guide/configuration.md#--data-dir)
+ `member/snap/db` file from an etcd
+ [data directory](https://etcd.io/docs/current/op-guide/configuration/#--data-dir)
that is not currently used by an etcd process. Taking the snapshot will
- normally not affect the performance of the member.
+ not affect the performance of the member.

Below is an example for taking a snapshot of the keyspace served by
`$ENDPOINT` to the file `snapshotdb`:
@@ -278,8 +279,8 @@ five-member etcd cluster for production Kubernetes clusters at any officially
supported scale.

A reasonable scaling is to upgrade a three-member cluster to a five-member
- one, when more reliability is desired. See [etcd Reconfiguration
- Documentation](https://github.com/coreos/etcd/blob/master/Documentation/op-guide/runtime-configuration.md#remove-a-member)
+ one, when more reliability is desired. See
+ [etcd reconfiguration documentation](https://etcd.io/docs/current/op-guide/runtime-configuration/#remove-a-member)
for information on how to add members into an existing cluster.

## Restoring an etcd cluster
@@ -290,16 +291,14 @@ different patch version of etcd also is supported. A restore operation is
employed to recover the data of a failed cluster.

Before starting the restore operation, a snapshot file must be present. It can
- either be a snapshot file from a previous backup operation, or from a
- remaining [data
- directory](https://github.com/coreos/etcd/blob/master/Documentation/op-guide/configuration.md#--data-dir).
- For more information and examples on restoring a cluster from a snapshot file,
- see [etcd disaster recovery
- documentation](https://github.com/coreos/etcd/blob/master/Documentation/op-guide/recovery.md#restoring-a-cluster).
+ either be a snapshot file from a previous backup operation, or from a remaining
+ [data directory](https://etcd.io/docs/current/op-guide/configuration/#--data-dir).
+ For more information and examples on restoring a cluster from a snapshot file, see
+ [etcd disaster recovery documentation](https://etcd.io/docs/current/op-guide/recovery/#restoring-a-cluster).
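A minimal restore invocation, for illustration (the snapshot file name follows the backup example earlier on this page; the data directory path is a placeholder):

```shell
# Materialize the snapshot into a fresh data directory that the restored
# etcd member will be started from.
ETCDCTL_API=3 etcdctl snapshot restore snapshotdb \
  --data-dir /var/lib/etcd-from-backup
```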
If the access URLs of the restored cluster is changed from the previous
cluster, the Kubernetes API server must be reconfigured accordingly. In this
- case, restart Kubernetes API server with the flag
+ case, restart Kubernetes API servers with the flag
`--etcd-servers=$NEW_ETCD_CLUSTER` instead of the flag
`--etcd-servers=$OLD_ETCD_CLUSTER`. Replace `$NEW_ETCD_CLUSTER` and
`$OLD_ETCD_CLUSTER` with the respective IP addresses. If a load balancer is
@@ -310,20 +309,19 @@ If the majority of etcd members have permanently failed, the etcd cluster is
considered failed. In this scenario, Kubernetes cannot make any changes to its
current state. Although the scheduled pods might continue to run, no new pods
can be scheduled. In such cases, recover the etcd cluster and potentially
- reconfigure Kubernetes API server to fix the issue.
+ reconfigure Kubernetes API servers to fix the issue.

{{< note >}}
If any API servers are running in your cluster, you should not attempt to
- restore instances of etcd.
- Instead, follow these steps to restore etcd:
+ restore instances of etcd. Instead, follow these steps to restore etcd:

- - stop *all* kube-apiserver instances
+ - stop *all* API server instances
  - restore state in all etcd instances
- - restart all kube-apiserver instances
+ - restart all API server instances

- We also recommend restarting any components (e.g. kube-scheduler,
- kube-controller-manager, kubelet) to ensure that they don't rely on some stale
- data. Note that in practice, the restore takes a bit of time. During the
+ We also recommend restarting any components (e.g. `kube-scheduler`,
+ `kube-controller-manager`, `kubelet`) to ensure that they don't rely on some
+ stale data. Note that in practice, the restore takes a bit of time. During the
restoration, critical components will lose leader lock and restart themselves.
{{< /note >}}
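For a concrete feel of that sequence, here is a heavily simplified sketch assuming systemd-managed control plane components (unit names and paths are placeholders; static-pod deployments instead move manifests out of the kubelet's manifest directory):

```shell
# 1. Stop every API server instance, on all control plane nodes.
systemctl stop kube-apiserver

# 2. Restore state on every etcd member, then start etcd again from the
#    restored data directory.
systemctl stop etcd
ETCDCTL_API=3 etcdctl snapshot restore snapshotdb --data-dir /var/lib/etcd-from-backup
systemctl start etcd

# 3. Restart the API servers and the other components so they do not
#    act on stale data.
systemctl start kube-apiserver
systemctl restart kube-scheduler kube-controller-manager kubelet
```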
@@ -364,9 +362,9 @@ properly work against secure endpoints. As a result, etcd servers may fail or
disconnect briefly from the kube-apiserver. This affects kube-apiserver HA
deployments.

- The fix was made in [etcd v3.4](https://github.com/etcd-io/etcd/pull/10911)
- (and backported to v3.3.14 or later): the new client now creates its own
- credential bundle to correctly set authority target in dial function.
+ The fix was made in etcd v3.4 (and backported to v3.3.14 or later): the new
+ client now creates its own credential bundle to correctly set authority target
+ in dial function.

Because the fix requires gRPC dependency upgrade (to v1.23.0), downstream
Kubernetes [did not backport etcd