modules/concepts/pages/operations/pod_disruptions.adoc
Lines changed: 17 additions & 5 deletions
@@ -119,13 +119,22 @@ This PDB allows only one Pod out of all the Namenodes and Journalnodes to be dow
 == Details
 Have a look at the xref:contributor:adr/ADR030-allowed-pod-disruptions.adoc[ADR on Allowed Pod disruptions] for the implementation details.

-== Known issues with PDBs and e.g. certificate rotations
-PDBs together with certificate rotations can be problematic in case e.g. {commons-operator}[commons-operator] was unavailable until certificates expire. Assume a product like xref:zookeeper:index.adoc[Apache ZooKeeper] which needs a quorum and a PDB of `1`. If commons-operator comes back, it will rotate all expired certificates and thus `evict` pods. PDBs count exclusively for `eviction` and as the certificates are expired this pod will be stuck in `CrashLoopBackOff` forever unable to connect to the quorum, blocking PDB for the other pods. Other pods are unable to restart and rotate certificates.
+== Known issue with PDBs and certificate rotations
+PDBs together with certificate rotations can be problematic if, for example, {commons-operator}[commons-operator] was unavailable to restart the Pods before the certificates expired.
+commons-operator uses the `evict` API in Kubernetes, which respects the PDB.
+If evicting a Pod would violate the PDB, the eviction is denied and the Pod is *not* restarted.
+Assume a product like xref:zookeeper:index.adoc[Apache ZooKeeper], which needs to form a quorum to function, and a PDB that only allows a single Pod to be unavailable.
+As soon as enough certificates of the ZookeeperCluster have expired, all Pods will crash-loop, as they encounter expired certificates.
+As only the container crash-loops (not the entire Pod), no new certificate is issued.
+As soon as commons-operator comes online again, it tries to `evict` a ZooKeeper Pod.
+However, this is prohibited, as the PDB would be violated.

 NOTE: We encountered this problem only in the specific case outlined above and only under these circumstances.

 === Workaround
-If you encounter this only manually deleting those pods can help out of this situation as it is no `eviction`.
+If you encounter this situation, only manually deleting those Pods can resolve it.
+Deleting a Pod (unlike evicting it) does *not* respect PDBs, so the Pods can be restarted anyway.
+All restarted Pods will get a new certificate, and the stacklet should turn healthy again.

 ==== k9s
 If you are using `k9s` you can start it in your terminal
@@ -148,5 +157,8 @@ kubectl delete pod zookeeper-server-default-0
 ----
 to delete the instance with the name `zookeeper-server-default-0`. Repeat it for all instances of your product.

-=== Preventing this Situation entirely
-Only disabling PDBs can guarantee to not run into this scenario.
+=== Preventing this situation
+The best measure is to make sure that commons-operator is always running, so that it can restart the Pods before the certificates expire.
+
+A hacky way to prevent this situation could be to disable PDBs for the specific stacklet.
+However, this has the downside that you lose the benefits of the PDB.
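To make the deadlock described in this change easier to follow, here is a minimal Python sketch of the interaction. It is a simplified model, not how Kubernetes or commons-operator are implemented. The names `Pod`, `disruptions_allowed`, `evict`, `delete` and `restart` are hypothetical, and the model assumes that a recreated Pod always receives a fresh certificate and becomes ready.

```python
from dataclasses import dataclass


@dataclass
class Pod:
    name: str
    ready: bool  # False while the container crash-loops on an expired certificate


def restart(pods, name):
    # Assumption: a recreated Pod gets a new certificate and becomes ready.
    for pod in pods:
        if pod.name == name:
            pod.ready = True


def disruptions_allowed(pods, max_unavailable):
    # Simplified PDB bookkeeping: how many more Pods may become unavailable.
    desired_healthy = len(pods) - max_unavailable
    current_healthy = sum(1 for pod in pods if pod.ready)
    return max(0, current_healthy - desired_healthy)


def evict(pods, name, max_unavailable):
    # An eviction respects the PDB and is refused if no disruption is allowed.
    if disruptions_allowed(pods, max_unavailable) <= 0:
        return False
    restart(pods, name)
    return True


def delete(pods, name):
    # A plain deletion does not consult the PDB at all.
    restart(pods, name)
    return True


if __name__ == "__main__":
    # Three ZooKeeper Pods, all crash-looping, with a PDB of maxUnavailable = 1.
    pods = [Pod(f"zookeeper-server-default-{i}", ready=False) for i in range(3)]
    print("evict allowed:", evict(pods, "zookeeper-server-default-0", 1))
    for pod in pods:
        delete(pods, pod.name)
    print("all ready after deletion:", all(pod.ready for pod in pods))
```

In this model the eviction is refused because no Pod is healthy, so the budget is already exhausted. The manual deletions go through regardless, which mirrors the workaround.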