Add known issue on PDBs and certificate rotation #769

Merged
merged 5 commits into from
Aug 5, 2025
33 changes: 33 additions & 0 deletions modules/concepts/pages/operations/pod_disruptions.adoc
@@ -1,5 +1,6 @@
= Allowed Pod disruptions
:k8s-pdb: https://kubernetes.io/docs/tasks/run-application/configure-pdb/
:commons-operator: xref:commons-operator:index.adoc
:description: Configure PodDisruptionBudgets (PDBs) to minimize planned downtime for Stackable products. Default values are based on fault tolerance and can be customized.

Any downtime of our products is generally considered to be bad.
@@ -117,3 +118,35 @@ This PDB allows only one Pod out of all the Namenodes and Journalnodes to be dow

== Details
Have a look at the xref:contributor:adr/ADR030-allowed-pod-disruptions.adoc[ADR on Allowed Pod disruptions] for the implementation details.

== Known issues with PDBs and certificate rotation
PDBs combined with certificate rotation can cause problems if, for example, the {commons-operator}[commons-operator] was unavailable long enough for certificates to expire. Consider a product such as xref:zookeeper:index.adoc[Apache ZooKeeper], which needs a quorum and has a PDB of `1`. Once the commons-operator is available again, it rotates all expired certificates by evicting the affected Pods. PDBs apply only to evictions, so only one Pod is evicted and restarted. Because the certificates are expired, this Pod cannot connect to the quorum and stays in `CrashLoopBackOff` indefinitely. As it never becomes available, it uses up the PDB, so the other Pods cannot be evicted and never get to restart and rotate their certificates.

NOTE: We have encountered this problem only in the specific case outlined above and only under these circumstances.
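
To check whether a PDB is currently blocking evictions, you can list the PDBs that allow no further disruptions. The following is a minimal sketch, not an official tool: the function name `list_blocked_pdbs` and the `KUBECTL` override variable are hypothetical and only used for illustration, and the sketch assumes that `kubectl` and `awk` are available.
[source, bash]
----
#!/usr/bin/env bash
# Hypothetical helper: KUBECTL can be overridden, e.g. to point to a wrapper.
KUBECTL="${KUBECTL:-kubectl}"

# Print "namespace<TAB>name" for every PDB whose status currently
# allows 0 disruptions (status.disruptionsAllowed == 0).
list_blocked_pdbs() {
  "$KUBECTL" get poddisruptionbudgets --all-namespaces \
    -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.status.disruptionsAllowed}{"\n"}{end}' \
    | awk -F'\t' '$3 == 0 { print $1 "\t" $2 }'
}

list_blocked_pdbs
----
A PDB that shows up here together with a Pod of the same product in `CrashLoopBackOff` is a hint that you might be affected. A PDB can also allow no disruptions for unrelated reasons, for example during a regular rolling restart.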

=== Workaround
If you encounter this problem, the only way out is to delete the affected Pods manually, because a manual deletion is not an `eviction` and is therefore not blocked by the PDB.

==== k9s
If you are using `k9s`, you can start it in your terminal
[source, bash]
----
k9s
----
and type `0` to view all namespaces, then type e.g. `/zookeeper` and press Enter to filter the list. Use the arrow keys to select your Pod, press `CTRL + D` and confirm to delete it. Repeat this for all other instances of the stuck product.

==== kubectl
List your pods with
[source, bash]
----
kubectl get pods -A
----
and copy the name of the Pod you want to delete. Run
[source, bash]
----
kubectl delete pod zookeeper-server-default-0
----
to delete the instance named `zookeeper-server-default-0`. If the Pod is not in your current namespace, add `-n <namespace>` to the command. Repeat this for all instances of your product.
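
Instead of deleting the Pods one by one, you can also delete them all at once with a label selector. The following is a minimal sketch under assumptions: the function name `delete_product_pods` and the `KUBECTL` override variable are hypothetical, and the namespace `default` and the label `app.kubernetes.io/name=zookeeper` are example values. Check the actual labels first with `kubectl get pods -A --show-labels`.
[source, bash]
----
#!/usr/bin/env bash
# Hypothetical helper: set KUBECTL=echo to print the command instead of running it.
KUBECTL="${KUBECTL:-kubectl}"

# Delete all Pods that match a label selector in one namespace.
# A delete is not an eviction, so the PDB does not block it.
delete_product_pods() {
  local namespace="$1" selector="$2"
  "$KUBECTL" delete pods --namespace "$namespace" --selector "$selector"
}

# Example call (namespace and label are assumptions, adjust them to your cluster):
# delete_product_pods default app.kubernetes.io/name=zookeeper
----
Running the helper with `KUBECTL=echo` first shows the command that would be executed without deleting anything.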

=== Preventing this situation entirely
Only disabling PDBs guarantees that you do not run into this scenario.