
Commit 8b303d7

Add disaster recovery documentation. (#584)
* Add disaster recovery documentation.
* Add asciicast.
* Reference new --recovery-dir flag.
* Address aaronlevy feedback.
1 parent 5fe418b commit 8b303d7

File tree

2 files changed (+116, -25 lines)

Documentation/disaster-recovery.md

Lines changed: 115 additions & 0 deletions
@@ -0,0 +1,115 @@
# Disaster Recovery

Self-hosted Kubernetes clusters are vulnerable to the following catastrophic
failure scenarios:

- Loss of all api-servers
- Loss of all schedulers
- Loss of all controller-managers
- Loss of all self-hosted etcd nodes

To minimize the likelihood of any of these scenarios, production self-hosted
clusters should always run in a high-availability configuration
(**TODO:** [add documentation for running high-availability self-hosted
clusters](https://github.com/kubernetes-incubator/bootkube/issues/311)).

Nevertheless, in the event of a control plane loss, the bootkube project
provides limited disaster avoidance and recovery support through the
`pod-checkpointer` program and the `bootkube recover` subcommand.

## Pod Checkpointer

The Pod Checkpointer is a program that ensures that existing local pod state
can be recovered in the absence of an api-server.

This is accomplished by managing "checkpoints" of local pod state as static pod
manifests:

- When the checkpointer sees that a "parent pod" (a pod which should be
  checkpointed) is successfully running, the checkpointer will save a local
  copy of the manifest.
- If the parent pod is detected as no longer running, the checkpointer will
  "activate" the checkpoint manifest. It will allow the checkpoint to continue
  running until the parent pod is restarted on the local node, or until it is
  able to contact an api-server to determine that the parent pod is no longer
  scheduled to this node.

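To see what the checkpointer has stored on a particular node, you can inspect
the static pod manifest directories on the host. This is only a rough sketch;
the directory names below are assumptions based on common defaults, so consult
the Pod Checkpointer README linked below for the authoritative locations:

```
# Paths are assumed defaults and may differ in your cluster.
# Active static pod manifests read by the kubelet:
ls /etc/kubernetes/manifests
# Inactive checkpointed copies of parent pod manifests, if your version of the
# checkpointer keeps them in a separate directory:
ls /etc/kubernetes/inactive-manifests
```
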
A Pod Checkpointer DaemonSet is deployed by default when using `bootkube
render` to create cluster manifests. Using the Pod Checkpointer is highly
recommended for all self-hosted clusters to ensure node reboot resiliency.

For more information, see the [Pod Checkpointer
README](https://github.com/kubernetes-incubator/bootkube/blob/master/cmd/checkpoint/README.md).

## Bootkube Recover

In the event of partial or total self-hosted control plane loss, `bootkube
recover` may be able to assist in re-bootstrapping the self-hosted control
plane.

The `bootkube recover` subcommand does not recover a cluster directly. Instead,
it extracts the control plane configuration from an available source and
renders manifests in a format that `bootkube start` can use to reboot the
cluster.

For best results, always use the latest Bootkube release when using `recover`,
regardless of which release was used to create the cluster. To see available
options, run:

```
bootkube recover --help
```

To recover a cluster, first invoke `bootkube recover` with flags corresponding
to the current state of the cluster (supported states listed below). Then,
invoke `bootkube start` to reboot the cluster. For example:

```
scp bootkube user@master-node:
ssh user@master-node
./bootkube recover --recovery-dir=recovered [scenario-specific options]
sudo ./bootkube start --asset-dir=recovered
```

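Before running `bootkube start`, it can be worth sanity-checking what
`bootkube recover` rendered into the recovery directory. The exact layout
depends on the Bootkube version, so treat this as a quick inspection rather
than a required step:

```
# List the assets rendered by `bootkube recover` before rebooting the cluster.
ls -R recovered
```
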
74+
For complete recovery examples see the
75+
[hack/multi-node/bootkube-test-recovery](https://github.com/kubernetes-incubator/bootkube/blob/master/hack/multi-node/bootkube-test-recovery)
76+
and
77+
[hack/multi-node/bootkube-test-recovery-self-hosted-etcd](https://github.com/kubernetes-incubator/bootkube/blob/master/hack/multi-node/bootkube-test-recovery-self-hosted-etcd)
78+
scripts. The `bootkube-test-recovery` script is demoed below.
79+
80+
[![asciicast](https://asciinema.org/a/dsp43ziuuzwcztni94y8l25s5.png)](https://asciinema.org/a/dsp43ziuuzwcztni94y8l25s5)
81+
82+
### If an api-server is still running

If an api-server is still running but other control plane components are down,
preventing cluster functionality (i.e. the scheduler pods are all down), the
control plane can be extracted directly from the api-server:

```
bootkube recover --recovery-dir=recovered --kubeconfig=/etc/kubernetes/kubeconfig
```

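One way to confirm that this is the scenario you are in is to ask the
api-server which control plane pods are still running before recovering. This
sketch assumes `kubectl` is available on the master node and that the
kubeconfig path above is valid:

```
# If this call succeeds, the api-server is reachable; check whether the
# scheduler and controller-manager pods are missing or not Running.
kubectl --kubeconfig=/etc/kubernetes/kubeconfig get pods -n kube-system -o wide
```
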
### If an external etcd cluster is still running

If using an external (non-self-hosted) etcd cluster, the control plane can be
extracted directly from etcd:

```
bootkube recover --recovery-dir=recovered --etcd-servers=http://127.0.0.1:2379 --kubeconfig=/etc/kubernetes/kubeconfig
```

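If you are unsure whether the external etcd cluster is healthy, it can be
checked before running the recovery. This sketch assumes the etcd v3 `etcdctl`
binary is installed and that etcd listens on the same endpoint passed to
`--etcd-servers`:

```
# Report the health of the etcd endpoint that will be used for recovery.
ETCDCTL_API=3 etcdctl --endpoints=http://127.0.0.1:2379 endpoint health
```
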
100+
### If an etcd backup is available (non-self-hosted etcd)
101+
102+
First, recover the external etcd cluster from the backup. Then use the method
103+
described in the previous section to recover the control plane manifests.
104+
105+
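How the etcd cluster is restored depends on how the backup was taken. As a
minimal sketch, assuming a single-member etcd v3 cluster and a snapshot file
`backup.db` created with `etcdctl snapshot save`, the restore could look like:

```
# Restore the snapshot into a fresh data directory, then restart etcd pointing
# at that directory. The file name and path are illustrative only.
ETCDCTL_API=3 etcdctl snapshot restore backup.db --data-dir=/var/lib/etcd-restored
```
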
### If an etcd backup is available (self-hosted etcd)

If using self-hosted etcd, recovery is supported via reading from an etcd
backup file:

```
bootkube recover --recovery-dir=recovered --etcd-backup-file=backup --kubeconfig=/etc/kubernetes/kubeconfig
```

In addition to rebooting the control plane, this will also destroy and recreate
the self-hosted etcd cluster using the backup.
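
This assumes a backup file already exists. As one example of how such a file
could be produced while the self-hosted etcd cluster is still healthy, an etcd
v3 snapshot can be saved from the cluster's client endpoint (the endpoint
below is an assumption; substitute the actual self-hosted etcd service
address):

```
# Save a snapshot of the self-hosted etcd cluster to a local file named "backup".
ETCDCTL_API=3 etcdctl --endpoints=http://127.0.0.1:2379 snapshot save backup
```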

README.md

Lines changed: 1 addition & 25 deletions
@@ -60,31 +60,7 @@ bootkube start --asset-dir=my-cluster
 
 In the case of a partial or total control plane outage (i.e. due to lost master nodes) an experimental `recover` command can extract and write manifests from a backup location. These manifests can then be used by the `start` command to reboot the cluster. Recovery is currently supported from a running apiserver, an external running etcd cluster, or an etcd backup taken from the self-hosted etcd cluster.
 
-To see available options, run:
-
-```
-bootkube recover --help
-```
-
-Recover from an external running etcd cluster:
-
-```
-bootkube recover --recovery-dir=recovered --etcd-servers=http://127.0.0.1:2379 --kubeconfig=/etc/kubernetes/kubeconfig
-```
-
-Recover from a running apiserver (i.e. if the scheduler pods are all down):
-
-```
-bootkube recover --recovery-dir=recovered --kubeconfig=/etc/kubernetes/kubeconfig
-```
-
-Recover from an etcd backup when self hosted etcd is enabled:
-
-```
-bootkube recover --recovery-dir=recovered --etcd-backup-file=backup --kubeconfig=/etc/kubernetes/kubeconfig
-```
-
-For a complete recovery example please see the [hack/multi-node/bootkube-test-recovery](hack/multi-node/bootkube-test-recovery) and the [hack/multi-node/bootkube-test-recovery-self-hosted-etcd](hack/multi-node/bootkube-test-recovery-self-hosted-etcd) scripts.
+For more details and examples see the [disaster recovery documentation](Documentation/disaster-recovery.md).
 
 ## Building
