Commit 6578420

Add documentation about PDBs
1 parent b112848 commit 6578420

1 file changed (+33 -0 lines changed)

modules/concepts/pages/operations/pod_disruptions.adoc

Lines changed: 33 additions & 0 deletions
@@ -1,5 +1,6 @@
= Allowed Pod disruptions
:k8s-pdb: https://kubernetes.io/docs/tasks/run-application/configure-pdb/
:commons-operator: xref:commons-operator:index.adoc
:description: Configure PodDisruptionBudgets (PDBs) to minimize planned downtime for Stackable products. Default values are based on fault tolerance and can be customized.

Any downtime of our products is generally considered to be bad.
@@ -117,3 +118,35 @@ This PDB allows only one Pod out of all the Namenodes and Journalnodes to be down

== Details
Have a look at the xref:contributor:adr/ADR030-allowed-pod-disruptions.adoc[ADR on Allowed Pod disruptions] for the implementation details.

== Known issues with PDBs and certificate rotations
PDBs combined with certificate rotations can be problematic if, for example, the {commons-operator}[commons-operator] was unavailable until the certificates expired. Assume a product like xref:zookeeper:index.adoc[Apache ZooKeeper], which needs a quorum, with a PDB that allows only `1` Pod to be unavailable. When the commons-operator comes back, it rotates all expired certificates and therefore evicts the Pods. PDBs only apply to `eviction`, and because the certificates are already expired, the evicted Pod gets stuck in `CrashLoopBackOff`, unable to connect to the quorum, which blocks the PDB for the remaining Pods. The remaining Pods are then unable to restart and rotate their certificates.

NOTE: We have encountered this problem only in the specific case outlined above and only under these circumstances.
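
If you suspect a cluster is stuck in this situation, you can check the PDBs and the affected Pods: a PDB that currently allows `0` disruptions together with Pods in `CrashLoopBackOff` matches the behaviour described above. The namespace and PDB name below are placeholders.
[source, bash]
----
# List all PDBs and how many disruptions each of them currently allows
kubectl get poddisruptionbudgets -A

# Show the details of a single PDB, including the currently allowed disruptions
kubectl describe poddisruptionbudget -n <namespace> <pdb-name>

# Look for Pods that are stuck in CrashLoopBackOff
kubectl get pods -n <namespace>
----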

=== Workaround
If you run into this situation, only manually deleting the stuck Pods helps, because a deletion is not an `eviction` and is therefore not blocked by the PDB.

==== k9s
If you are using `k9s`, start it in your terminal:
[source, bash]
----
k9s
----
Type `0` to view all namespaces, then type e.g. `/zookeeper` and press Enter to filter. Use the arrow keys to navigate to the affected Pod, press `Ctrl + D` and confirm to delete the Pod. Repeat this for all other instances of the stuck product.

==== kubectl
List your Pods with
[source, bash]
----
kubectl get pods -A
----
and copy the name of the Pod you want to delete. Then type
[source, bash]
----
kubectl delete pod zookeeper-server-default-0
----
to delete the instance with the name `zookeeper-server-default-0`, adding `-n <namespace>` if the Pod is not in your current namespace. Repeat this for all instances of your product.
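
If several Pods of the product are stuck, you can also delete them in one go with a label selector instead of deleting them one by one. The label used below is an assumption based on the recommended Kubernetes labels; check the labels that are actually set on your Pods and adjust the selector accordingly.
[source, bash]
----
# Show the labels that are set on the Pods
kubectl get pods -n <namespace> --show-labels

# Delete all matching Pods in one go (label selector is an assumption, adjust as needed)
kubectl delete pods -n <namespace> -l app.kubernetes.io/name=zookeeper
----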

=== Preventing this situation entirely
Only disabling PDBs can guarantee that you do not run into this scenario.
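
As a sketch only: assuming a ZooKeeper cluster named `zookeeper` with a `servers` role whose PDB is controlled by a role-level `podDisruptionBudget.enabled` setting (see the PDB configuration described earlier on this page), disabling it could look like the following. Verify the exact field names against the documentation of your product's operator.
[source, bash]
----
# Sketch only: disable PDB creation for the "servers" role of a ZookeeperCluster
# named "zookeeper". The field names are assumptions, verify them for your product.
kubectl patch zookeepercluster zookeeper --type merge \
  -p '{"spec": {"servers": {"roleConfig": {"podDisruptionBudget": {"enabled": false}}}}}'
----
Keep in mind that without PDBs, Kubernetes is free to disrupt any number of Pods of a role at the same time during planned maintenance such as node drains.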
