From 66d63de4408d4b35c8a7af972b9a6fbac065d7c8 Mon Sep 17 00:00:00 2001 From: ashishsarkar Date: Thu, 19 Feb 2026 22:51:39 +0530 Subject: [PATCH 1/3] Fix: [Good First Issue] Add playbook for Kubernetes Pod Disruption Budget violations #2 --- ...PodDisruptionBudget-Violations-workload.md | 69 +++++++++++++++++++ 1 file changed, 69 insertions(+) create mode 100644 K8s Playbooks/04-Workloads/PodDisruptionBudget-Violations-workload.md diff --git a/K8s Playbooks/04-Workloads/PodDisruptionBudget-Violations-workload.md b/K8s Playbooks/04-Workloads/PodDisruptionBudget-Violations-workload.md new file mode 100644 index 0000000..dce9e49 --- /dev/null +++ b/K8s Playbooks/04-Workloads/PodDisruptionBudget-Violations-workload.md @@ -0,0 +1,69 @@ +--- +title: Pod Disruption Budget Violations - Workload +weight: 250 +categories: + - kubernetes + - workload + - pod-disruption-budget +--- + +# PodDisruptionBudget-Violations-workload + +## Meaning + +Pod Disruption Budget (PDB) violations occur when voluntary disruptions (node drains, pod evictions, rolling updates) cannot proceed because they would violate the minimum availability constraints defined in a PodDisruptionBudget resource. This manifests as eviction failures, node drain operations hanging or failing, deployment rollouts stuck, or cluster autoscaler unable to remove nodes. PDBs protect application availability by ensuring a minimum number of pods remain available during voluntary disruptions, but misconfigured PDBs (too restrictive minAvailable/maxUnavailable values, incorrect label selectors, or overlapping PDBs) can prevent necessary maintenance operations, scaling actions, or deployments from completing. This affects the workload plane and indicates availability protection constraints blocking voluntary pod disruptions, typically caused by overly restrictive PDB configurations, PDB label selector mismatches, or PDB conflicts with scaling/deployment operations. 
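For reference, a minimal PDB manifest showing the fields discussed above — the name, namespace, and label values are placeholders, not taken from this repository:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-pdb          # hypothetical name
  namespace: example-ns      # hypothetical namespace
spec:
  minAvailable: 2            # at most (healthy pods - 2) voluntary disruptions allowed
  # maxUnavailable: 1        # alternative to minAvailable; only one may be set
  selector:
    matchLabels:
      app: example-app       # must match the pods the PDB is meant to protect
```

A PDB with `minAvailable: 2` protecting a 2-replica workload allows zero voluntary disruptions, which is the failure mode this playbook diagnoses.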
+ +## Impact + +PDB violation alerts fire; node drain operations fail or hang indefinitely; deployment rollouts cannot proceed; pods cannot be evicted for node maintenance; cluster autoscaler cannot scale down nodes; rolling updates are blocked; voluntary disruptions are prevented; maintenance windows cannot be executed; pods remain on unhealthy nodes; system updates and node replacements are delayed. Node drain operations show "cannot drain node" errors due to PDB violations; deployment rollouts show pods stuck in Terminating state; eviction API calls return 429 Too Many Requests errors; cluster autoscaler logs show scale-down blocked by PDB constraints; voluntary pod disruptions are prevented even when nodes need maintenance. + +## Playbook + +1. Describe PodDisruptionBudget resources in namespace `` to see: + - PDB configuration (minAvailable, maxUnavailable) + - Label selectors matching pods + - Current status (disruptionsAllowed, desiredHealthy, currentHealthy) + - Conditions showing PDB health + +2. List pods in namespace `` matching PDB label selectors to verify which pods are protected by the PDB and their current status. + +3. Retrieve events in namespace `` filtered by involved object name `` and sorted by last timestamp to see PDB violation events and eviction failures. + +4. Describe deployment `` or statefulset `` in namespace `` to check if rollout is stuck due to PDB constraints preventing pod termination. + +5. Check node drain status for node `` if node maintenance is blocked, verifying if drain operation is hanging due to PDB violations. + +6. List all PodDisruptionBudget resources across all namespaces to identify overlapping PDBs that may be protecting the same pods with conflicting constraints. + +7. Retrieve cluster autoscaler logs or events if scale-down operations are blocked, checking for PDB-related scale-down prevention messages. + +8. 
Describe pods in namespace `` that are stuck in Terminating state to verify if PDB constraints are preventing their eviction.

9. Check eviction API response for pod `` in namespace `` to see if eviction is blocked due to PDB violations (expect 429 status code).

10. Verify PDB label selector matches pod labels correctly by comparing PDB spec.selector.matchLabels with actual pod labels.

## Diagnosis

1. Analyze PDB status from Playbook step 1 to identify violation patterns. If `disruptionsAllowed` is 0 and `currentHealthy` equals `minAvailable`, the PDB is preventing any voluntary disruptions. Note that `disruptionsAllowed` is never reported as negative: when `currentHealthy` falls below `desiredHealthy` (unhealthy pods, or a PDB requirement that cannot be satisfied), the value is clamped to 0.

2. If events from Playbook step 3 show "eviction blocked by PodDisruptionBudget" or similar messages, correlate the event timestamp with when node drain, deployment rollout, or eviction operations started. This establishes that PDB constraints are actively blocking disruptions.

3. If deployment rollout is stuck (from Playbook step 4) and pods remain in Terminating state, check if the PDB's `minAvailable` or `maxUnavailable` values prevent terminating old pods. For example, if `minAvailable: 2` and only 2 pods exist, no pods can be terminated during rollout.

4. If node drain operations are hanging (from Playbook step 5), verify that pods on the node match PDB selectors and that evicting them would violate PDB constraints. Check if `disruptionsAllowed` is 0 for the relevant PDBs protecting pods on that node.

5. If multiple PDBs are found (from Playbook step 6) with overlapping label selectors protecting the same pods, verify if their combined constraints create impossible requirements. For example, two PDBs each requiring `minAvailable: 2` for the same 3 pods would allow only 1 disruption total, potentially blocking necessary operations.

6. 
If cluster autoscaler scale-down is blocked (from Playbook step 7), check if pods on nodes targeted for removal are protected by PDBs that prevent their eviction. Autoscaler cannot remove nodes if doing so would violate PDB constraints. + +7. If pods are stuck in Terminating state (from Playbook step 8), verify if PDB constraints are preventing final pod deletion. Check pod finalizers and PDB status to determine if PDB is the blocker versus other finalizers. + +8. If eviction API returns 429 Too Many Requests (from Playbook step 9), this confirms PDB is actively blocking eviction. Check the response body for details about which PDB constraint is being violated. + +9. If PDB label selector mismatch is suspected (from Playbook step 10), verify that pods intended to be protected actually match the PDB's selector. Mismatched selectors may cause PDB to protect wrong pods or fail to protect intended pods. + +10. If PDB configuration appears correct but violations persist, check for gradual pod health degradation. If `currentHealthy` pods drop below `minAvailable` due to pod failures, `disruptionsAllowed` becomes 0, preventing any voluntary disruptions until pod health is restored. + +**If no correlation is found**: Extend analysis to check for PDB configuration changes around the incident time, verify if PDB was recently created or modified, check for namespace-level resource quotas affecting pod creation, examine if pod readiness probe failures are reducing healthy pod count below PDB minimums, review historical PDB status to identify gradual constraint tightening, and verify if multiple workloads sharing nodes are creating cumulative PDB constraints that block node-level operations. 
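The inspection steps above map onto kubectl roughly as follows — a sketch with placeholder values (`example-ns`, `example-pdb`, `app=example-app`) standing in for the empty backticks in the playbook:

```shell
NS=example-ns            # placeholder namespace
PDB=example-pdb          # placeholder PDB name

# Step 1: PDB spec and status (disruptionsAllowed, currentHealthy, desiredHealthy)
kubectl describe pdb "$PDB" -n "$NS"

# Step 2: pods covered by the PDB selector (substitute the PDB's matchLabels)
kubectl get pods -n "$NS" -l app=example-app -o wide

# Step 3: recent events, newest last, to spot eviction failures
kubectl get events -n "$NS" --sort-by=.lastTimestamp

# Step 6: all PDBs in every namespace, to look for overlapping selectors
kubectl get pdb --all-namespaces

# Step 10: compare the PDB selector against actual pod labels
kubectl get pdb "$PDB" -n "$NS" -o jsonpath='{.spec.selector.matchLabels}'
kubectl get pods -n "$NS" --show-labels
```

These commands require access to the target cluster; the exact names must come from the incident context.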
+ From 71bc24e45850f2d1a5927ea8ad4edbe7c0246cfe Mon Sep 17 00:00:00 2001 From: ashishsarkar Date: Thu, 19 Feb 2026 23:45:54 +0530 Subject: [PATCH 2/3] Fix: [Good First Issue] Add playbook for Kubernetes Pod Disruption Budget violations Scoutflo#2 --- ...PodDisruptionBudget-Violations-workload.md | 69 ------------------- .../PodDisruptionBudgetViolations-workload.md | 47 +++++++++++++ 2 files changed, 47 insertions(+), 69 deletions(-) delete mode 100644 K8s Playbooks/04-Workloads/PodDisruptionBudget-Violations-workload.md create mode 100644 K8s Playbooks/04-Workloads/PodDisruptionBudgetViolations-workload.md diff --git a/K8s Playbooks/04-Workloads/PodDisruptionBudget-Violations-workload.md b/K8s Playbooks/04-Workloads/PodDisruptionBudget-Violations-workload.md deleted file mode 100644 index dce9e49..0000000 --- a/K8s Playbooks/04-Workloads/PodDisruptionBudget-Violations-workload.md +++ /dev/null @@ -1,69 +0,0 @@ ---- -title: Pod Disruption Budget Violations - Workload -weight: 250 -categories: - - kubernetes - - workload - - pod-disruption-budget ---- - -# PodDisruptionBudget-Violations-workload - -## Meaning - -Pod Disruption Budget (PDB) violations occur when voluntary disruptions (node drains, pod evictions, rolling updates) cannot proceed because they would violate the minimum availability constraints defined in a PodDisruptionBudget resource. This manifests as eviction failures, node drain operations hanging or failing, deployment rollouts stuck, or cluster autoscaler unable to remove nodes. PDBs protect application availability by ensuring a minimum number of pods remain available during voluntary disruptions, but misconfigured PDBs (too restrictive minAvailable/maxUnavailable values, incorrect label selectors, or overlapping PDBs) can prevent necessary maintenance operations, scaling actions, or deployments from completing. 
This affects the workload plane and indicates availability protection constraints blocking voluntary pod disruptions, typically caused by overly restrictive PDB configurations, PDB label selector mismatches, or PDB conflicts with scaling/deployment operations. - -## Impact - -PDB violation alerts fire; node drain operations fail or hang indefinitely; deployment rollouts cannot proceed; pods cannot be evicted for node maintenance; cluster autoscaler cannot scale down nodes; rolling updates are blocked; voluntary disruptions are prevented; maintenance windows cannot be executed; pods remain on unhealthy nodes; system updates and node replacements are delayed. Node drain operations show "cannot drain node" errors due to PDB violations; deployment rollouts show pods stuck in Terminating state; eviction API calls return 429 Too Many Requests errors; cluster autoscaler logs show scale-down blocked by PDB constraints; voluntary pod disruptions are prevented even when nodes need maintenance. - -## Playbook - -1. Describe PodDisruptionBudget resources in namespace `` to see: - - PDB configuration (minAvailable, maxUnavailable) - - Label selectors matching pods - - Current status (disruptionsAllowed, desiredHealthy, currentHealthy) - - Conditions showing PDB health - -2. List pods in namespace `` matching PDB label selectors to verify which pods are protected by the PDB and their current status. - -3. Retrieve events in namespace `` filtered by involved object name `` and sorted by last timestamp to see PDB violation events and eviction failures. - -4. Describe deployment `` or statefulset `` in namespace `` to check if rollout is stuck due to PDB constraints preventing pod termination. - -5. Check node drain status for node `` if node maintenance is blocked, verifying if drain operation is hanging due to PDB violations. - -6. 
List all PodDisruptionBudget resources across all namespaces to identify overlapping PDBs that may be protecting the same pods with conflicting constraints. - -7. Retrieve cluster autoscaler logs or events if scale-down operations are blocked, checking for PDB-related scale-down prevention messages. - -8. Describe pods in namespace `` that are stuck in Terminating state to verify if PDB constraints are preventing their eviction. - -9. Check eviction API response for pod `` in namespace `` to see if eviction is blocked due to PDB violations (expect 429 status code). - -10. Verify PDB label selector matches pod labels correctly by comparing PDB spec.selector.matchLabels with actual pod labels. - -## Diagnosis - -1. Analyze PDB status from Playbook step 1 to identify violation patterns. If `disruptionsAllowed` is 0 and `currentHealthy` equals `minAvailable`, the PDB is preventing any voluntary disruptions. If `disruptionsAllowed` is negative, the PDB configuration is invalid or pods are unhealthy. - -2. If events from Playbook step 3 show "eviction blocked by PodDisruptionBudget" or similar messages, correlate the event timestamp with when node drain, deployment rollout, or eviction operations started. This establishes that PDB constraints are actively blocking disruptions. - -3. If deployment rollout is stuck (from Playbook step 4) and pods remain in Terminating state, check if the PDB's `minAvailable` or `maxUnavailable` values prevent terminating old pods. For example, if `minAvailable: 2` and only 2 pods exist, no pods can be terminated during rollout. - -4. If node drain operations are hanging (from Playbook step 5), verify that pods on the node match PDB selectors and that evicting them would violate PDB constraints. Check if `disruptionsAllowed` is 0 for the relevant PDBs protecting pods on that node. - -5. 
If multiple PDBs are found (from Playbook step 6) with overlapping label selectors protecting the same pods, verify if their combined constraints create impossible requirements. For example, two PDBs each requiring `minAvailable: 2` for the same 3 pods would allow only 1 disruption total, potentially blocking necessary operations. - -6. If cluster autoscaler scale-down is blocked (from Playbook step 7), check if pods on nodes targeted for removal are protected by PDBs that prevent their eviction. Autoscaler cannot remove nodes if doing so would violate PDB constraints. - -7. If pods are stuck in Terminating state (from Playbook step 8), verify if PDB constraints are preventing final pod deletion. Check pod finalizers and PDB status to determine if PDB is the blocker versus other finalizers. - -8. If eviction API returns 429 Too Many Requests (from Playbook step 9), this confirms PDB is actively blocking eviction. Check the response body for details about which PDB constraint is being violated. - -9. If PDB label selector mismatch is suspected (from Playbook step 10), verify that pods intended to be protected actually match the PDB's selector. Mismatched selectors may cause PDB to protect wrong pods or fail to protect intended pods. - -10. If PDB configuration appears correct but violations persist, check for gradual pod health degradation. If `currentHealthy` pods drop below `minAvailable` due to pod failures, `disruptionsAllowed` becomes 0, preventing any voluntary disruptions until pod health is restored. 
- -**If no correlation is found**: Extend analysis to check for PDB configuration changes around the incident time, verify if PDB was recently created or modified, check for namespace-level resource quotas affecting pod creation, examine if pod readiness probe failures are reducing healthy pod count below PDB minimums, review historical PDB status to identify gradual constraint tightening, and verify if multiple workloads sharing nodes are creating cumulative PDB constraints that block node-level operations. - diff --git a/K8s Playbooks/04-Workloads/PodDisruptionBudgetViolations-workload.md b/K8s Playbooks/04-Workloads/PodDisruptionBudgetViolations-workload.md new file mode 100644 index 0000000..8da9ed6 --- /dev/null +++ b/K8s Playbooks/04-Workloads/PodDisruptionBudgetViolations-workload.md @@ -0,0 +1,47 @@ +--- +title: Pod Disruption Budget Violations - Workload +weight: 250 +categories: + - kubernetes + - workload + - pod-disruption-budget +--- + +# Pod Disruption Budget Violations - Workload + +## Meaning + +Pod Disruption Budget (PDB) violations happen when Kubernetes blocks voluntary disruptions because evicting a pod would break the availability guarantees defined in a PodDisruptionBudget. This commonly shows up during node drain operations, cluster autoscaler scale-down, or rolling updates when pods cannot be evicted safely. A typical symptom is that `disruptionsAllowed` is `0` for a PDB protecting the affected workload, so voluntary evictions stall. This is usually caused by an overly strict `minAvailable` / `maxUnavailable`, unhealthy pods reducing `currentHealthy`, or selector issues (PDB selecting the wrong pods or not matching the intended pods). For PDB concepts and behavior, refer to Kubernetes documentation: https://kubernetes.io/docs/tasks/run-application/configure-pdb/ + +## Impact + +Node maintenance can be blocked because draining `` hangs or fails when protected pods cannot be evicted without violating the PDB. 
Rolling updates for `` / `` can stall because the cluster cannot safely remove old pods while preserving the minimum healthy count. Cluster autoscaler may be unable to scale down nodes, which increases cost and can keep workloads on suboptimal or degraded nodes. During incidents, this delays remediation steps like node replacement, OS patching, or rescheduling away from unhealthy hardware. + +## Playbook + +1. Describe the PodDisruptionBudget `` in namespace `` and note `minAvailable` / `maxUnavailable`, selector, and status fields (`currentHealthy`, `desiredHealthy`, `disruptionsAllowed`). +2. List pods in namespace `` that match the PDB selector and identify which pods are `Ready` vs not `Ready`, including restart spikes or readiness probe failures. +3. Retrieve recent events in namespace `` and look for eviction/drain messages that mention PDB blocking disruptions, noting timestamps for correlation. +4. If the issue is related to a rollout, check rollout status for `` or `` in namespace `` and identify whether it is waiting for pods to become `Ready` or unable to terminate old pods safely. +5. If the issue is related to node maintenance, inspect which pods on `` are protected by the PDB and confirm whether draining `` would require evicting pods that would reduce healthy replicas below the PDB threshold. +6. Check replica counts for the workload (`replicas`, `readyReplicas`, `availableReplicas`) and confirm whether there are enough replicas to allow at least one voluntary disruption. +7. List all PDBs in namespace `` and verify there are no overlapping PDB selectors protecting the same pods with incompatible constraints. +8. If cluster autoscaler scale-down is impacted, review autoscaler events/log messages around the same time window to confirm that PDB constraints are preventing node removal. +9. 
Validate selector correctness by comparing `spec.selector.matchLabels` of `` with actual pod labels for the intended workload, and ensure the PDB is protecting the right pods. +10. If disruptions must proceed urgently, coordinate with the service owner to temporarily relax the PDB or increase replicas for the workload so that `disruptionsAllowed` becomes greater than `0`, then re-run the drain/rollout safely. + +## Diagnosis + +1. If `disruptionsAllowed` is `0` and `currentHealthy` is at the minimum required by `minAvailable`, then the PDB is actively preventing any voluntary evictions until healthy capacity increases. +2. If events show eviction failures at time `T` and a node drain or rollout started at approximately time `T`, then the PDB constraint is the likely blocker for that operation. +3. If the workload has `replicas = N` but only `readyReplicas < N` (or readiness is flapping), then `currentHealthy` drops and the PDB will block disruptions until pod health stabilizes. +4. If draining `` is blocked and the protected pods on `` match the PDB selector, then the drain is waiting on PDB safety checks and will not proceed without additional healthy replicas elsewhere. +5. If multiple PDBs in `` match the same pods and each enforces strict availability, then combined constraints can reduce or eliminate allowed disruptions even if each PDB looks reasonable alone. +6. If selector validation shows the PDB matches unexpected pods (or misses intended pods), then update the selector to target only the correct workload and re-check `disruptionsAllowed`. +7. If autoscaler scale-down messages align with PDB blocking and the pods on candidate nodes are protected, then autoscaler cannot remove the node without violating the PDB, so scale-down will remain blocked. 
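The availability arithmetic behind checks 1 and 3 can be sketched in a few lines of shell — an illustrative model of how the disruption controller derives `disruptionsAllowed` for an integer `minAvailable`, not the actual Kubernetes implementation:

```shell
# disruptions_allowed CURRENT_HEALTHY MIN_AVAILABLE
# Voluntary evictions are permitted only while healthy pods exceed the required
# minimum; Kubernetes clamps the value at zero, so it is never negative.
disruptions_allowed() {
  current_healthy=$1
  min_available=$2
  allowed=$(( current_healthy - min_available ))
  if [ "$allowed" -lt 0 ]; then
    allowed=0
  fi
  echo "$allowed"
}

disruptions_allowed 2 2   # minAvailable: 2 with exactly 2 healthy pods: drains and rollouts stall
disruptions_allowed 3 2   # one spare healthy replica restores headroom for a single eviction
disruptions_allowed 1 2   # readiness failures push currentHealthy below the minimum: still blocked
```

This is why either relaxing the PDB or adding healthy replicas (step 10 above) is what unblocks a stalled drain: both increase `currentHealthy - minAvailable` above zero.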
+ +Inline references: +- Kubernetes PDB documentation: https://kubernetes.io/docs/tasks/run-application/configure-pdb/ +- PDB API reference: https://kubernetes.io/docs/reference/kubernetes-api/workload-resources/pod-disruption-budget-v1/ +- kubectl drain behavior: https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node/ +- Cluster Autoscaler scale-down concepts: https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler From 9000291671f6fa6efe62dadb8365c14751d82380 Mon Sep 17 00:00:00 2001 From: ashishsarkar Date: Fri, 20 Feb 2026 00:15:53 +0530 Subject: [PATCH 3/3] added changes to documents - Rectified as per CodeRabbit comments --- .../PodDisruptionBudgetViolations-workload.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/K8s Playbooks/04-Workloads/PodDisruptionBudgetViolations-workload.md b/K8s Playbooks/04-Workloads/PodDisruptionBudgetViolations-workload.md index 8da9ed6..805b075 100644 --- a/K8s Playbooks/04-Workloads/PodDisruptionBudgetViolations-workload.md +++ b/K8s Playbooks/04-Workloads/PodDisruptionBudgetViolations-workload.md @@ -11,7 +11,7 @@ categories: ## Meaning -Pod Disruption Budget (PDB) violations happen when Kubernetes blocks voluntary disruptions because evicting a pod would break the availability guarantees defined in a PodDisruptionBudget. This commonly shows up during node drain operations, cluster autoscaler scale-down, or rolling updates when pods cannot be evicted safely. A typical symptom is that `disruptionsAllowed` is `0` for a PDB protecting the affected workload, so voluntary evictions stall. This is usually caused by an overly strict `minAvailable` / `maxUnavailable`, unhealthy pods reducing `currentHealthy`, or selector issues (PDB selecting the wrong pods or not matching the intended pods). 
For PDB concepts and behavior, refer to Kubernetes documentation: https://kubernetes.io/docs/tasks/run-application/configure-pdb/ +Pod Disruption Budget (PDB) violations happen when Kubernetes blocks voluntary disruptions because evicting a pod would break the availability guarantees defined in a PodDisruptionBudget. This commonly shows up during node drain operations, cluster autoscaler scale-down, or rolling updates when pods cannot be evicted safely. A typical symptom is that `disruptionsAllowed` is `0` for a PDB protecting the affected workload, so voluntary evictions stall. This is usually caused by an overly strict `minAvailable` / `maxUnavailable`, unhealthy pods reducing `currentHealthy`, or selector issues (PDB selecting the wrong pods or not matching the intended pods). For PDB concepts and behavior, refer to [Kubernetes PDB documentation](https://kubernetes.io/docs/tasks/run-application/configure-pdb/). ## Impact @@ -41,7 +41,7 @@ Node maintenance can be blocked because draining `` hangs or fails wh 7. If autoscaler scale-down messages align with PDB blocking and the pods on candidate nodes are protected, then autoscaler cannot remove the node without violating the PDB, so scale-down will remain blocked. 
Inline references: -- Kubernetes PDB documentation: https://kubernetes.io/docs/tasks/run-application/configure-pdb/ -- PDB API reference: https://kubernetes.io/docs/reference/kubernetes-api/workload-resources/pod-disruption-budget-v1/ -- kubectl drain behavior: https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node/ -- Cluster Autoscaler scale-down concepts: https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler +- [Kubernetes PDB documentation](https://kubernetes.io/docs/tasks/run-application/configure-pdb/) +- [PDB API reference](https://kubernetes.io/docs/reference/kubernetes-api/workload-resources/pod-disruption-budget-v1/) +- [kubectl drain behavior](https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node/) +- [Cluster Autoscaler FAQ – PDB interaction](https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md)
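As a closing note, the "is eviction actually blocked?" check can be performed against the Eviction subresource directly; when a PDB would be violated, the API server rejects the request with HTTP 429 rather than evicting the pod. A sketch with placeholder pod and namespace names:

```shell
NS=example-ns    # placeholder namespace
POD=example-pod  # placeholder pod name

# POST an Eviction for the pod via kubectl's raw API access; a PDB that would
# be violated causes a 429 Too Many Requests response instead of an eviction.
cat <<EOF | kubectl create --raw "/api/v1/namespaces/${NS}/pods/${POD}/eviction" -f -
{
  "apiVersion": "policy/v1",
  "kind": "Eviction",
  "metadata": {"name": "${POD}", "namespace": "${NS}"}
}
EOF
```

Note this is a real eviction attempt, not a dry run: if the PDB allows a disruption, the pod will actually be evicted, so use it only when that outcome is acceptable.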