
Commit 82f3ffa

Author: Lisa Pettyjohn

BZ1876855 doc stor exception to etcd restore

1 parent d863edb

File tree

1 file changed (+26 −0)

modules/dr-restoring-cluster-state.adoc

Lines changed: 26 additions & 0 deletions
@@ -314,3 +314,29 @@ etcd-ip-10-0-173-171.ec2.internal 2/2 Running 0
----

Note that it might take several minutes after completing this procedure for all services to be restored. For example, authentication by using `oc login` might not immediately work until the OAuth server pods are restarted.
[id="dr-scenario-cluster-state-issues_{context}"]
= Issues and workarounds for restoring a persistent storage state
If your {product-title} cluster uses persistent storage of any form, some state of the cluster is typically stored outside of etcd. It might be an Elasticsearch cluster running in a pod or a database running in a `StatefulSet` object. When you restore from an etcd backup, the status of the workloads in {product-title} is also restored. However, if the etcd snapshot is old, the status might be invalid or outdated.
[IMPORTANT]
====
The contents of persistent volumes (PVs) are never part of the etcd snapshot. When you restore an {product-title} cluster from an etcd snapshot, non-critical workloads might gain access to critical data, or vice versa.
====
The following are some example scenarios that produce an out-of-date status:
* A MySQL database is running in a pod backed by a PV object. Restoring {product-title} from an etcd snapshot does not bring back the volume on the storage provider, and does not produce a running MySQL pod, even though the pod repeatedly attempts to start. You must manually restore this pod by restoring the volume on the storage provider, and then editing the PV to point to the new volume.
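As a sketch of that last step, assuming a hypothetical PV named `mysql-pv` and a volume that has already been recreated on the storage provider, the PV can be recreated to reference the new volume. Most fields of an existing PV are immutable, so deleting and recreating the object is usually required:

[source,terminal]
----
$ oc get pv mysql-pv -o yaml > mysql-pv.yaml
# Edit mysql-pv.yaml so that the volume source references the new volume ID.
$ oc delete pv mysql-pv
$ oc create -f mysql-pv.yaml
----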
* Pod P1 is using volume A, which is attached to node X. If the etcd snapshot is taken while another pod uses the same volume on node Y, then when the etcd restore is performed, pod P1 might not be able to start correctly because the volume is still attached to node Y. {product-title} is not aware of the attachment and does not automatically detach it. When this occurs, the volume must be manually detached from node Y so that it can attach to node X, and then pod P1 can start.
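The detach procedure depends on the storage provider. As an illustrative example on AWS, you might first inspect the stale attachment objects in the cluster and then detach the volume with the AWS CLI. The volume ID below is a placeholder:

[source,terminal]
----
$ oc get volumeattachments
$ aws ec2 detach-volume --volume-id vol-0123456789abcdef0
----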
* Cloud provider or storage provider credentials were updated after the etcd snapshot was taken. This causes any CSI drivers or Operators that depend on those credentials to stop working. You might have to manually update the credentials required by those drivers or Operators.
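For example, one common pattern for replacing a credentials `Secret` that a CSI driver consumes is shown below. The secret name, namespace, and key here are assumptions; check the documentation for your specific driver or Operator:

[source,terminal]
----
$ oc create secret generic <driver-credentials-secret> \
    -n <driver-namespace> \
    --from-file=credentials=./new-credentials \
    --dry-run=client -o yaml | oc replace -f -
----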
* A device is removed or renamed on {product-title} nodes after the etcd snapshot is taken. The Local Storage Operator creates symlinks for each PV that it manages in the `/dev/disk/by-id` or `/dev` directories. This situation might cause the local PVs to refer to devices that no longer exist.
To fix this problem, an administrator must:
. Manually remove the PVs with invalid devices.
. Remove symlinks from respective nodes.
. Delete `LocalVolume` or `LocalVolumeSet` objects (see _Storage_ -> _Configuring persistent storage_ -> _Persistent storage using local volumes_ -> _Deleting the Local Storage Operator Resources_).
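The steps above can be sketched as follows. The PV, node, and `LocalVolume` names are placeholders, and the symlink path assumes the Local Storage Operator default of `/mnt/local-storage/<storage-class>/`:

[source,terminal]
----
$ oc delete pv <pv-name>
$ oc debug node/<node-name> -- chroot /host rm /mnt/local-storage/<storage-class>/<device>
$ oc delete localvolume <local-volume-name> -n openshift-local-storage
----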
