
Commit f308402

OBSDOCS-174 - Loki Zone Failure Recovery - w peer rev
1 parent 39722fa commit f308402

3 files changed: +133 -2 lines changed

logging/cluster-logging-loki.adoc

Lines changed: 15 additions & 2 deletions
@@ -12,7 +12,7 @@ Loki is a horizontally scalable, highly available, multi-tenant log aggregation

include::modules/loki-deployment-sizing.adoc[leveloffset=+1]

-include::modules/cluster-logging-loki-deploy.adoc[leveloffset=+1]
+//include::modules/cluster-logging-loki-deploy.adoc[leveloffset=+1]

include::modules/logging-creating-new-group-cluster-admin-user-role.adoc[leveloffset=+1]

@@ -33,8 +33,21 @@ include::modules/logging-loki-reliability-hardening.adoc[leveloffset=+1]
* link:https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.24/#podantiaffinity-v1-core[`PodAntiAffinity` v1 core Kubernetes documentation]

* link:https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#inter-pod-affinity-and-anti-affinity[Assigning Pods to Nodes Kubernetes documentation]

-ifdef::openshift-enterprise[]
* xref:../nodes/scheduling/nodes-scheduler-pod-affinity.adoc#nodes-scheduler-pod-affinity[Placing pods relative to other pods using affinity and anti-affinity rules]
+
+
+include::modules/logging-loki-zone-aware-rep.adoc[leveloffset=+1]
+
+include::modules/logging-loki-zone-fail-recovery.adoc[leveloffset=+2]
+
+[role="_additional-resources"]
+.Additional resources
+* link:https://kubernetes.io/docs/concepts/scheduling-eviction/topology-spread-constraints/#spread-constraint-definition[Topology spread constraints Kubernetes documentation]
+
+* link:https://kubernetes.io/docs/setup/best-practices/multiple-zones/#storage-access-for-zones[Kubernetes storage documentation]
+
+ifdef::openshift-enterprise[]
+* xref:../nodes/scheduling/nodes-scheduler-pod-topology-spread-constraints.adoc#nodes-scheduler-pod-topology-spread-constraints-configuring[Controlling pod placement by using pod topology spread constraints]
endif::[]

include::modules/logging-loki-retention.adoc[leveloffset=+1]

modules/logging-loki-zone-aware-rep.adoc

Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
// Module included in the following assemblies:
//
// * logging/cluster-logging-loki.adoc

:_mod-docs-content-type: CONCEPT
[id="logging-loki-zone-aware-rep_{context}"]
= Zone-aware data replication

In {logging} 5.8 and later versions, the Loki Operator offers support for zone-aware data replication through pod topology spread constraints. Enabling this feature enhances reliability and safeguards against log loss in the event of a single zone failure. When configuring the deployment size as `1x.extra-small`, `1x.small`, or `1x.medium`, the `replication.factor` field is automatically set to 2.

To ensure proper replication, you need at least as many availability zones as the replication factor specifies. While it is possible to have more availability zones than the replication factor, having fewer zones can lead to write failures. Each zone should host an equal number of instances for optimal operation.

.Example LokiStack CR with zone replication enabled
[source,yaml]
----
apiVersion: loki.grafana.com/v1
kind: LokiStack
metadata:
  name: logging-loki
  namespace: openshift-logging
spec:
  replicationFactor: 2 # <1>
  replication:
    factor: 2 # <2>
    zones:
    - maxSkew: 1 # <3>
      topologyKey: topology.kubernetes.io/zone # <4>
----
<1> Deprecated field, values entered are overwritten by values in `replication.factor`.
<2> This value is automatically set when a deployment size is selected at setup.
<3> The maximum difference in number of pods between any two topology domains. The default is 1, and you cannot specify a value of 0.
<4> Defines zones in the form of a topology key that corresponds to a node label.
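
To check whether the cluster satisfies this zone requirement, you can list the zone label on each node. This is a minimal verification sketch, not part of the documented configuration, and it assumes that your nodes carry the standard `topology.kubernetes.io/zone` label referenced by the `topologyKey` field:

[source,terminal]
----
oc get nodes -L topology.kubernetes.io/zone
----

Each distinct value in the `ZONE` column represents one availability zone. With `replication.factor: 2`, at least two distinct zones must be present, ideally with an equal number of nodes in each.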

modules/logging-loki-zone-fail-recovery.adoc

Lines changed: 86 additions & 0 deletions
@@ -0,0 +1,86 @@
// Module included in the following assemblies:
//
// * logging/cluster-logging-loki.adoc

:_mod-docs-content-type: PROCEDURE
[id="logging-loki-zone-fail-recovery_{context}"]
= Recovering Loki pods from failed zones

In {product-title}, a zone failure happens when resources in a specific availability zone become inaccessible. Availability zones are isolated areas within a cloud provider's data center that are designed to enhance redundancy and fault tolerance. If your {product-title} cluster is not configured to handle zone failures, they can lead to service interruptions or data loss.

Loki pods are part of a link:https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/[StatefulSet], and they come with Persistent Volume Claims (PVCs) provisioned by a `StorageClass` object. Each Loki pod and its PVCs reside in the same zone. When a zone failure occurs in a cluster, the StatefulSet controller automatically attempts to recover the affected pods in the failed zone.
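
For example, assuming a topology-aware CSI provisioner that records the zone in the persistent volume's node affinity, you can check which zone a Loki volume is bound to. The PVC name below is illustrative, taken from the example output later in this module:

[source,terminal]
----
oc get pv $(oc get pvc storage-logging-loki-ingester-1 -n openshift-logging -o jsonpath='{.spec.volumeName}') \
  -o jsonpath='{.spec.nodeAffinity.required.nodeSelectorTerms}'
----

The node selector terms in the output include the zone label, which is why a volume created in the failed zone cannot simply be reattached from another zone.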

[WARNING]
====
The following procedure deletes the PVCs in the failed zone, and all data contained therein. To avoid complete data loss, the replication factor field of the `LokiStack` CR should always be set to a value greater than 1 to ensure that Loki is replicating.
====

.Prerequisites
* Logging version 5.8 or later.
* Verify your `LokiStack` CR has a replication factor greater than 1.
* Zone failure is detected by the control plane, and nodes in the failed zone are marked by the cloud provider integration.

The StatefulSet controller automatically attempts to reschedule pods in a failed zone. Because the associated PVCs are also in the failed zone, automatic rescheduling to a different zone does not work. You must manually delete the PVCs in the failed zone to allow successful re-creation of the stateful Loki pod and its provisioned PVC in the new zone.

.Procedure
. List the pods in `Pending` status by running the following command:
+
[source,terminal]
----
oc get pods --field-selector status.phase==Pending -n openshift-logging
----
+
.Example `oc get pods` output
[source,terminal]
----
NAME                           READY   STATUS    RESTARTS   AGE # <1>
logging-loki-index-gateway-1   0/1     Pending   0          17m
logging-loki-ingester-1        0/1     Pending   0          16m
logging-loki-ruler-1           0/1     Pending   0          16m
----
<1> These pods are in `Pending` status because their corresponding PVCs are in the failed zone.

. List the PVCs in `Pending` status by running the following command:
+
[source,terminal]
----
oc get pvc -o=json -n openshift-logging | jq '.items[] | select(.status.phase == "Pending") | .metadata.name' -r
----
+
.Example `oc get pvc` output
[source,terminal]
----
storage-logging-loki-index-gateway-1
storage-logging-loki-ingester-1
wal-logging-loki-ingester-1
storage-logging-loki-ruler-1
wal-logging-loki-ruler-1
----

. Delete the PVC(s) for a pod by running the following command:
+
[source,terminal]
----
oc delete pvc <pvc_name> -n openshift-logging
----

. Delete the pod(s) by running the following command:
+
[source,terminal]
----
oc delete pod <pod_name> -n openshift-logging
----

After these objects have been successfully deleted, they should automatically be rescheduled in an available zone.
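
To confirm the recovery, you can rerun the pending-pod query from the first step and check where the pods landed. This is a minimal verification sketch, assuming the default `openshift-logging` namespace used throughout this procedure:

[source,terminal]
----
oc get pods --field-selector status.phase==Pending -n openshift-logging
oc get pods -n openshift-logging -o wide
----

The first command should return no Loki pods, and the `-o wide` output shows the node that each pod was rescheduled to. You can check that node's zone with `oc get node <node_name> -L topology.kubernetes.io/zone`.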

[id="logging-loki-zone-fail-term-state_{context}"]
== Troubleshooting PVC in a terminating state

The PVCs might hang in the terminating state without being deleted if the PVC metadata finalizers are set to `kubernetes.io/pv-protection`. Removing the finalizers should allow the PVCs to delete successfully.
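
To confirm that a finalizer is what is blocking deletion, you can inspect the PVC metadata first. This is a minimal check, assuming one of the PVC names from the earlier example output:

[source,terminal]
----
oc get pvc <pvc_name> -n openshift-logging -o jsonpath='{.metadata.finalizers}'
----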

. Remove the finalizer for each PVC by running the following command, then retry deletion:
+
[source,terminal]
----
oc patch pvc <pvc_name> -p '{"metadata":{"finalizers":null}}' -n openshift-logging
----
