
Commit 5912569

Malewares and sbernauer authored
Requested Changes
Co-authored-by: Sebastian Bernauer <[email protected]>
1 parent 56646d6 commit 5912569


modules/concepts/pages/operations/pod_disruptions.adoc

Lines changed: 17 additions & 5 deletions
@@ -119,13 +119,22 @@ This PDB allows only one Pod out of all the Namenodes and Journalnodes to be dow
== Details
Have a look at the xref:contributor:adr/ADR030-allowed-pod-disruptions.adoc[ADR on Allowed Pod disruptions] for the implementation details.

-== Known issues with PDBs and e.g. certificate rotations
-PDBs together with certificate rotations can be problematic in case e.g. {commons-operator}[commons-operator] was unavailable until certificates expire. Assume a product like xref:zookeeper:index.adoc[Apache ZooKeeper] which needs a quorum and a PDB of `1`. If commons-operator comes back, it will rotate all expired certificates and thus `evict` pods. PDBs count exclusively for `eviction` and as the certificates are expired this pod will be stuck in `CrashLoopBackOff` forever unable to connect to the quorum, blocking PDB for the other pods. Other pods are unable to restart and rotate certificates.
+== Known issue with PDBs and certificate rotations
+PDBs together with certificate rotations can be problematic if e.g. {commons-operator}[commons-operator] was unavailable to restart the Pods before the certificates expire.
+commons-operator uses the `evict` API in Kubernetes, which respects the PDB.
+If evicting a Pod would violate the PDB, the Pod is *not* evicted and therefore not restarted.
+Assume a product like xref:zookeeper:index.adoc[Apache ZooKeeper], which needs to form a quorum to function, and a PDB that only allows a single Pod to be unavailable.
+As soon as enough certificates of the ZookeeperCluster have expired, all Pods crash-loop because they encounter expired certificates.
+Because only the container crash-loops (not the entire Pod), no new certificate is issued.
+As soon as commons-operator comes online again, it tries to `evict` a ZooKeeper Pod.
+However, this is prohibited, as the PDB would be violated.
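For illustration, the stuck state is visible on the PDB itself: the crash-looping Pods already count as unavailable, so the budget allows zero further disruptions and every eviction attempt is rejected (the Eviction API answers with HTTP 429). The PDB name and values below are only illustrative and depend on your stacklet:

[source,console]
----
$ kubectl get poddisruptionbudgets
NAME               MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
simple-zk-server   N/A             1                 0                     7d
----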

NOTE: We encountered this problem only in the specific case outlined above and only under these circumstances.

=== Workaround
-If you encounter this only manually deleting those pods can help out of this situation as it is no `eviction`.
+If you encounter this situation, only manually deleting those Pods can help you out of it.
+A Pod deletion (unlike an eviction) does *not* respect PDBs, so the Pods can be restarted anyway.
+All restarted Pods get a new certificate and the stacklet should turn healthy again.

==== k9s
If you are using `k9s`, you can start it in your terminal
@@ -148,5 +157,8 @@ kubectl delete pod zookeeper-server-default-0
----
to delete the instance with the name `zookeeper-server-default-0`. Repeat it for all instances of your product.
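If the stacklet has many Pods, a label selector saves some typing. This is only a sketch: the label values are illustrative and assume the common `app.kubernetes.io` labels, so check the labels on your Pods first:

[source,bash]
----
# Inspect the labels first and adapt the selector below (values are illustrative)
kubectl get pods --show-labels

# Deleting Pods (unlike evicting them) is not blocked by the PDB
kubectl delete pods -l app.kubernetes.io/name=zookeeper,app.kubernetes.io/instance=simple-zk
----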

-=== Preventing this Situation entirely
-Only disabling PDBs can guarantee to not run into this scenario.
+=== Preventing this situation
+The best measure is to make sure that commons-operator is always running, so that it can restart the Pods before the certificates expire.
+
+A hacky way to prevent this situation is to disable PDBs for the specific stacklet.
+But this has the downside that you lose the benefits of the PDB.
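A sketch of that hacky variant, assuming a ZookeeperCluster and the `roleConfig.podDisruptionBudget` setting covered earlier on this page (check your operator's CRD reference for the exact field and role name):

[source,yaml]
----
# Excerpt of a ZookeeperCluster, only the PDB-related part is shown
spec:
  servers:
    roleConfig:
      podDisruptionBudget:
        enabled: false  # no PDB is created, so evictions are never blocked
----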
