|
| 1 | +// Module included in the following assemblies: |
| 2 | +// |
| 3 | +// * logging/cluster-logging-loki.adoc |
| 4 | + |
| 5 | +:_mod-docs-content-type: PROCEDURE |
| 6 | +[id="logging-loki-zone-fail-recovery_{context}"] |
| 7 | += Recovering Loki pods from failed zones |
| 8 | + |
| 9 | +In {product-title}, a zone failure happens when the resources of a specific availability zone become inaccessible. Availability zones are isolated areas within a cloud provider's data center that are intended to improve redundancy and fault tolerance. If your {product-title} cluster is not configured to handle zone failures, a zone failure can lead to service or data loss. |
| 10 | + |
| 11 | +Loki pods are part of a link:https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/[StatefulSet], and they come with Persistent Volume Claims (PVCs) provisioned by a `StorageClass` object. Each Loki pod and its PVCs reside in the same zone. When a zone failure occurs in a cluster, the StatefulSet controller automatically attempts to recover the affected pods in the failed zone. |
| 12 | + |
| 13 | +[WARNING] |
| 14 | +==== |
| 15 | +The following procedure deletes the PVCs in the failed zone and all data that they contain. To avoid complete data loss, always set the replication factor field of the `LokiStack` CR to a value greater than 1 so that Loki replicates its data. |
| 16 | +==== |
| 17 | + |
| 18 | +.Prerequisites |
| 19 | +* You are running logging version 5.8 or later. |
| 20 | +* You have verified that your `LokiStack` CR has a replication factor greater than 1. |
| 21 | +* The control plane has detected the zone failure, and the cloud provider integration has marked the nodes in the failed zone. |
| 22 | + |
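One way to check the replication factor prerequisite is to read the value directly from the CR. The following is a minimal sketch, not a definitive method: it assumes that the factor is stored at `spec.replication.factor`, which might differ in your `LokiStack` version, and `show_replication_factor` is a hypothetical helper name introduced only for illustration.

```shell
# Hypothetical helper: prints the replication factor of a LokiStack CR.
# Assumption: the value is stored at .spec.replication.factor. Adjust the
# JSONPath expression if your LokiStack version stores it elsewhere.
show_replication_factor() {
  oc get lokistack "$1" -n openshift-logging \
    -o jsonpath='{.spec.replication.factor}'
}

# Usage: show_replication_factor <lokistack_name>
```

If the command prints nothing or prints `1`, review the `LokiStack` CR before you continue with the procedure.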
| 23 | +The StatefulSet controller automatically attempts to reschedule pods in a failed zone. Because the associated PVCs are also in the failed zone, automatic rescheduling to a different zone does not work. You must manually delete the PVCs in the failed zone to allow successful re-creation of each stateful Loki pod and its provisioned PVCs in the new zone. |
| 24 | + |
| 25 | + |
| 26 | +.Procedure |
| 27 | +. List the pods in `Pending` status by running the following command: |
| 28 | ++ |
| 29 | +[source,terminal] |
| 30 | +---- |
| 31 | +oc get pods --field-selector status.phase==Pending -n openshift-logging |
| 32 | +---- |
| 33 | ++ |
| 34 | +.Example `oc get pods` output |
| 35 | +[source,terminal] |
| 36 | +---- |
| 37 | +NAME READY STATUS RESTARTS AGE # <1> |
| 38 | +logging-loki-index-gateway-1 0/1 Pending 0 17m |
| 39 | +logging-loki-ingester-1 0/1 Pending 0 16m |
| 40 | +logging-loki-ruler-1 0/1 Pending 0 16m |
| 41 | +---- |
| 42 | +<1> These pods are in `Pending` status because their corresponding PVCs are in the failed zone. |
| 43 | + |
| 44 | +. List the PVCs in `Pending` status by running the following command: |
| 45 | ++ |
| 46 | +[source,terminal] |
| 47 | +---- |
| 48 | +oc get pvc -o=json -n openshift-logging | jq '.items[] | select(.status.phase == "Pending") | .metadata.name' -r |
| 49 | +---- |
| 50 | ++ |
| 51 | +.Example `oc get pvc` output |
| 52 | +[source,terminal] |
| 53 | +---- |
| 54 | +storage-logging-loki-index-gateway-1 |
| 55 | +storage-logging-loki-ingester-1 |
| 56 | +wal-logging-loki-ingester-1 |
| 57 | +storage-logging-loki-ruler-1 |
| 58 | +wal-logging-loki-ruler-1 |
| 59 | +---- |
| 60 | + |
| 61 | +. Delete the PVCs for a pod by running the following command: |
| 62 | ++ |
| 63 | +[source,terminal] |
| 64 | +---- |
| 65 | +oc delete pvc __<pvc_name>__ -n openshift-logging |
| 66 | +---- |
| 67 | +. Delete the pods by running the following command: |
| 68 | ++ |
| 69 | +[source,terminal] |
| 70 | +---- |
| 71 | +oc delete pod __<pod_name>__ -n openshift-logging |
| 72 | +---- |
| 73 | + |
| 74 | +After you delete these objects, the pods and their PVCs should automatically be re-created in an available zone. |
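When several PVCs are pending, the delete steps above can be scripted. The following is a minimal sketch that only prints the commands for review instead of running them. It assumes that PVC names follow the `storage-<pod_name>` and `wal-<pod_name>` pattern shown in the example output, and `print_recovery_commands` is a hypothetical helper name, not part of `oc`.

```shell
#!/usr/bin/env bash
# Sketch only: prints the delete commands for review; it does not run them.
# Assumption: each PVC is named storage-<pod_name> or wal-<pod_name>,
# as in the example output of this procedure.

NAMESPACE="openshift-logging"

# Reads PVC names on stdin, one per line, and prints the matching
# "oc delete pvc" commands followed by one "oc delete pod" per pod.
print_recovery_commands() {
  local pvc pod pods=""
  while IFS= read -r pvc; do
    [ -n "$pvc" ] || continue
    echo "oc delete pvc ${pvc} -n ${NAMESPACE}"
    # Derive the pod name by stripping the assumed PVC prefix.
    pod="${pvc#storage-}"
    pod="${pod#wal-}"
    # Record each pod name once, preserving order.
    case " ${pods} " in
      *" ${pod} "*) ;;
      *) pods="${pods} ${pod}" ;;
    esac
  done
  for pod in ${pods}; do
    echo "oc delete pod ${pod} -n ${NAMESPACE}"
  done
}

# Usage: pipe the PVC names from the jq command in this procedure into
# print_recovery_commands, review the printed commands, then run them.
```

Printing the commands first keeps the destructive step under manual control, which matters because deleting a PVC also deletes its data.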
| 75 | + |
| 76 | +[id="logging-loki-zone-fail-term-state_{context}"] |
| 77 | +== Troubleshooting PVCs in a terminating state |
| 78 | + |
| 79 | +The PVCs might hang in the terminating state without being deleted if the PVC metadata finalizers are set to `kubernetes.io/pvc-protection`. Removing the finalizers should allow the PVCs to delete successfully. |
| 80 | + |
| 81 | +. Remove the finalizer for each PVC by running the following command, then retry deletion: |
| 82 | ++ |
| 83 | +[source,terminal] |
| 84 | +---- |
| 85 | +oc patch pvc __<pvc_name>__ -p '{"metadata":{"finalizers":null}}' -n openshift-logging |
| 86 | +---- |
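To confirm which finalizers a PVC carries, before or after patching, you can print the `metadata.finalizers` field. The following is a minimal sketch; `show_pvc_finalizers` is a hypothetical helper name used only for illustration.

```shell
# Hypothetical helper: prints the finalizers set on a PVC.
# An empty result means that no finalizers block the deletion.
show_pvc_finalizers() {
  oc get pvc "$1" -n openshift-logging -o jsonpath='{.metadata.finalizers}'
}

# Usage: show_pvc_finalizers <pvc_name>
```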