@@ -0,0 +1,47 @@
---
title: Pod Disruption Budget Violations - Workload
weight: 250
categories:
- kubernetes
- workload
- pod-disruption-budget
---

# Pod Disruption Budget Violations - Workload

## Meaning

Pod Disruption Budget (PDB) violations happen when Kubernetes blocks voluntary disruptions because evicting a pod would break the availability guarantees defined in a PodDisruptionBudget. This commonly shows up during node drain operations, cluster autoscaler scale-down, or rolling updates when pods cannot be evicted safely. A typical symptom is that `disruptionsAllowed` is `0` for a PDB protecting the affected workload, so voluntary evictions stall. This is usually caused by an overly strict `minAvailable` / `maxUnavailable`, unhealthy pods reducing `currentHealthy`, or selector issues (PDB selecting the wrong pods or not matching the intended pods). For PDB concepts and behavior, refer to [Kubernetes PDB documentation](https://kubernetes.io/docs/tasks/run-application/configure-pdb/).
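
To make the relationship between `minAvailable` / `maxUnavailable`, `currentHealthy`, and `disruptionsAllowed` concrete, here is a minimal Python sketch of the arithmetic. It is a simplified model, not the controller's implementation: the function names are hypothetical, percentages are assumed to round up, and pods with an eviction already in flight are ignored.

```python
from typing import Optional, Union

IntOrPercent = Union[int, str]  # e.g. 2 or "50%"


def scaled_value(value: IntOrPercent, total: int) -> int:
    """Resolve an integer or an 'N%' string against total (percentages round up)."""
    if isinstance(value, str):
        if not value.endswith("%"):
            raise ValueError(f"expected a percentage like '50%', got {value!r}")
        percent = int(value[:-1])
        # Integer ceiling division avoids floating-point rounding surprises.
        return (percent * total + 99) // 100
    return value


def disruptions_allowed(
    expected_pods: int,
    current_healthy: int,
    min_available: Optional[IntOrPercent] = None,
    max_unavailable: Optional[IntOrPercent] = None,
) -> int:
    """Simplified model: how many voluntary evictions the budget would permit."""
    if (min_available is None) == (max_unavailable is None):
        raise ValueError("set exactly one of min_available / max_unavailable")
    if min_available is not None:
        desired_healthy = scaled_value(min_available, expected_pods)
    else:
        desired_healthy = max(0, expected_pods - scaled_value(max_unavailable, expected_pods))
    return max(0, current_healthy - desired_healthy)


# Three replicas with minAvailable equal to the replica count: no eviction is ever allowed.
print(disruptions_allowed(expected_pods=3, current_healthy=3, min_available=3))
# One unhealthy pod already consumes the single unit of headroom.
print(disruptions_allowed(expected_pods=3, current_healthy=2, max_unavailable=1))
```

Both calls illustrate the two most common ways the budget reaches zero: a threshold that leaves no headroom, and unhealthy pods using up the headroom that exists.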

## Impact

Node maintenance can be blocked because draining `<node-name>` hangs or fails when protected pods cannot be evicted without violating the PDB. Rolling updates for `<deployment-name>` / `<statefulset-name>` are not themselves limited by PDBs, but pods that are unavailable during a rollout still count against the budget, so an in-progress or stuck rollout can drive `disruptionsAllowed` to `0` and block a concurrent drain or scale-down. Cluster autoscaler may be unable to scale down nodes, which increases cost and can keep workloads on suboptimal or degraded nodes. During incidents, this delays remediation steps like node replacement, OS patching, or rescheduling away from unhealthy hardware.

## Playbook

1. Describe the PodDisruptionBudget `<pdb-name>` in namespace `<namespace>` and note `minAvailable` / `maxUnavailable`, selector, and status fields (`currentHealthy`, `desiredHealthy`, `disruptionsAllowed`).
2. List pods in namespace `<namespace>` that match the PDB selector and identify which pods are `Ready` vs not `Ready`, including restart spikes or readiness probe failures.
3. Retrieve recent events in namespace `<namespace>` and look for eviction/drain messages that mention PDB blocking disruptions, noting timestamps for correlation.
4. If the issue is related to a rollout, check rollout status for `<deployment-name>` or `<statefulset-name>` in namespace `<namespace>` and identify whether it is waiting for pods to become `Ready`. The rollout itself is not limited by the PDB, but pods that are unavailable while it runs count against the budget and can block concurrent evictions.
5. If the issue is related to node maintenance, inspect which pods on `<node-name>` are protected by the PDB and confirm whether draining `<node-name>` would require evicting pods that would reduce healthy replicas below the PDB threshold.
6. Check replica counts for the workload (`replicas`, `readyReplicas`, `availableReplicas`) and confirm whether there are enough replicas to allow at least one voluntary disruption.
7. List all PDBs in namespace `<namespace>` and verify there are no overlapping PDB selectors protecting the same pods with incompatible constraints.
8. If cluster autoscaler scale-down is impacted, review autoscaler events/log messages around the same time window to confirm that PDB constraints are preventing node removal.
9. Validate selector correctness by comparing `spec.selector` of `<pdb-name>` (both `matchLabels` and `matchExpressions`) with actual pod labels for the intended workload, and ensure the PDB is protecting the right pods.
10. If disruptions must proceed urgently, coordinate with the service owner to temporarily relax the PDB or increase replicas for the workload so that `disruptionsAllowed` becomes greater than `0`, then re-run the drain/rollout safely.
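
The selector checks in steps 2, 7, and 9 can be reasoned about offline once the PDB selectors and pod labels have been exported. The following Python sketch is illustrative only: the function names and the dictionary shapes are assumptions chosen to mirror the `spec.selector` structure, and it evaluates `matchLabels` together with the four `matchExpressions` operators.

```python
from typing import Dict, List


def selector_matches(selector: dict, labels: Dict[str, str]) -> bool:
    """Return True if pod labels satisfy every matchLabels and matchExpressions term."""
    for key, value in selector.get("matchLabels", {}).items():
        if labels.get(key) != value:
            return False
    for expr in selector.get("matchExpressions", []):
        key, operator = expr["key"], expr["operator"]
        values = expr.get("values", [])
        if operator == "In":
            if key not in labels or labels[key] not in values:
                return False
        elif operator == "NotIn":
            # A pod without the key satisfies NotIn.
            if key in labels and labels[key] in values:
                return False
        elif operator == "Exists":
            if key not in labels:
                return False
        elif operator == "DoesNotExist":
            if key in labels:
                return False
        else:
            raise ValueError(f"unsupported operator: {operator!r}")
    return True


def pdbs_per_pod(
    pdb_selectors: Dict[str, dict], pod_labels: Dict[str, Dict[str, str]]
) -> Dict[str, List[str]]:
    """Map each pod name to the sorted list of PDB names whose selector matches it."""
    return {
        pod: sorted(
            name for name, selector in pdb_selectors.items()
            if selector_matches(selector, labels)
        )
        for pod, labels in pod_labels.items()
    }


# Hypothetical data: two PDBs and two pods in the same namespace.
selectors = {
    "web-pdb": {"matchLabels": {"app": "web"}},
    "frontend-pdb": {
        "matchExpressions": [{"key": "tier", "operator": "In", "values": ["frontend"]}]
    },
}
pods = {
    "web-0": {"app": "web", "tier": "frontend"},
    "web-1": {"app": "web"},
}
for pod, matched in pdbs_per_pod(selectors, pods).items():
    flag = "OVERLAP" if len(matched) > 1 else "ok"
    print(pod, matched, flag)
```

A pod listed under more than one PDB is a candidate for the overlap check in step 7, and a pod listed under none is not protected at all.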
Comment on lines +22 to +31

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# Find all markdown playbooks and check which ones contain kubectl commands
find "K8s Playbooks" -type f -name "*.md" | head -20

Repository: Scoutflo/Scoutflo-SRE-Playbooks

Length of output: 1400


🏁 Script executed:

# Check which playbooks contain kubectl commands
rg "kubectl" --glob "*.md" "K8s Playbooks/" --files-with-matches | head -20

Repository: Scoutflo/Scoutflo-SRE-Playbooks

Length of output: 1266


🏁 Script executed:

# Read the actual file to see current structure (lines 1-50)
cat -n "K8s Playbooks/04-Workloads/PodDisruptionBudgetViolations-workload.md" | head -60

Repository: Scoutflo/Scoutflo-SRE-Playbooks

Length of output: 5786


🏁 Script executed:

# Read a playbook that contains kubectl to see the standard format
cat -n "K8s Playbooks/10-Monitoring-Autoscaling/MetricsServerShowsNoData-monitoring.md" | head -80

Repository: Scoutflo/Scoutflo-SRE-Playbooks

Length of output: 5390


🏁 Script executed:

# Check another playbook with kubectl commands
cat -n "K8s Playbooks/11-Installation-Setup/HelmReleaseStuckInPending-install.md" | head -80

Repository: Scoutflo/Scoutflo-SRE-Playbooks

Length of output: 4073


Add concrete kubectl commands to each playbook step.

Every step describes what to inspect but omits the actual command needed to do so. During a live incident an operator must be able to copy-paste commands directly from the runbook without mentally translating prose into CLI syntax. Other playbooks in the repository include kubectl commands in their diagnostic guidance; this one is missing them entirely. Below is a representative mapping for each step:

| Step | Command |
|------|---------|
| 1 – describe PDB | `kubectl describe pdb <pdb-name> -n <namespace>` |
| 2 – list matching pods | `kubectl get pods -n <namespace> -l <label-selector> -o wide` |
| 3 – retrieve events | `kubectl get events -n <namespace> --sort-by='.lastTimestamp' \| grep -i evict` |
| 4 – rollout status | `kubectl rollout status deployment/<deployment-name> -n <namespace>` |
| 5 – pods on node | `kubectl get pods -n <namespace> -o wide --field-selector spec.nodeName=<node-name>` |
| 6 – replica counts | `kubectl get deployment <deployment-name> -n <namespace> -o jsonpath='{.status}'` |
| 7 – list all PDBs | `kubectl get pdb -n <namespace>` |
| 8 – autoscaler events | `kubectl get events -n kube-system --sort-by='.lastTimestamp' \| grep -i scale` |
| 9 – selector check | `kubectl get pdb <pdb-name> -n <namespace> -o jsonpath='{.spec.selector}'` |
| 10 – relax PDB | `kubectl patch pdb <pdb-name> -n <namespace> -p '{"spec":{"minAvailable":0}}'` |

Note for Step 9: PDB selectors can also use matchExpressions in addition to matchLabels; both should be compared against pod labels when validating the selector.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@K8s Playbooks/04-Workloads/PodDisruptionBudgetViolations-workload.md` around
lines 22 - 31, The playbook steps lack copy-paste kubectl commands; for each
numbered step (1..10) add a concrete kubectl command that operators can run
directly using the placeholders shown (e.g. <pdb-name>, <namespace>,
<label-selector>, <deployment-name>, <statefulset-name>, <node-name>), mapping
step 1 to a describe pdb command for <pdb-name>, step 2 to a get pods with -l
<label-selector> and -o wide, step 3 to get events sorted and filtered for
eviction/drain messages, step 4 to kubectl rollout status for the
deployment/statefulset, step 5 to get pods on <node-name> via --field-selector
spec.nodeName=, step 6 to get deployment/statefulset status (jsonpath for
replicas fields), step 7 to list pdbs in the namespace, step 8 to inspect
autoscaler events in kube-system, step 9 to output the PDB selector (including
matchLabels and matchExpressions) and compare against pod labels, and step 10 to
include the kubectl patch/scale actions (e.g. patch minAvailable or scale
replicas) as explicit commands so the runbook entries are directly runnable.


## Diagnosis

1. If `disruptionsAllowed` is `0` and `currentHealthy` is at or below `desiredHealthy` (the minimum derived from `minAvailable` or `maxUnavailable`), then the PDB is actively preventing any voluntary evictions until healthy capacity increases.
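2. If events show eviction failures at time `T` and a node drain started at approximately time `T`, then the PDB constraint is the likely blocker for that operation. If a rollout started at approximately time `T` instead, the rollout is not limited by the PDB, but the pods it made unavailable have likely reduced `currentHealthy`.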
3. If the workload has `replicas = N` but only `readyReplicas < N` (or readiness is flapping), then `currentHealthy` drops and the PDB will block disruptions until pod health stabilizes.
4. If draining `<node-name>` is blocked and the protected pods on `<node-name>` match the PDB selector, then the drain is waiting on PDB safety checks and will not proceed without additional healthy replicas elsewhere.
5. If multiple PDBs in `<namespace>` match the same pods and each enforces strict availability, then combined constraints can reduce or eliminate allowed disruptions even if each PDB looks reasonable alone.
6. If selector validation shows the PDB matches unexpected pods (or misses intended pods), then update the selector to target only the correct workload and re-check `disruptionsAllowed`.
7. If autoscaler scale-down messages align with PDB blocking and the pods on candidate nodes are protected, then autoscaler cannot remove the node without violating the PDB, so scale-down will remain blocked.
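
The first and third rules above reduce to a comparison of three PDB status fields (`expectedPods`, `currentHealthy`, `desiredHealthy`). As a rough illustration, the Python sketch below separates a budget with no headroom from one blocked by unhealthy pods. The function name and the returned labels are hypothetical, and the model ignores evictions already in flight.

```python
def triage_pdb(expected_pods: int, current_healthy: int, desired_healthy: int) -> str:
    """Classify why a PDB is (or is not) blocking voluntary disruptions."""
    if current_healthy > desired_healthy:
        # Healthy pods exceed the required minimum, so evictions can proceed.
        return "disruptions-allowed"
    if desired_healthy >= expected_pods:
        # Even with every pod healthy there would be no headroom:
        # relax the budget or add replicas.
        return "budget-too-strict"
    # The budget has headroom on paper, but unhealthy pods have used it up:
    # fix pod health first.
    return "unhealthy-pods"


# Hypothetical status values for three different PDBs.
for status in [(3, 3, 2), (3, 3, 3), (3, 2, 2)]:
    print(status, triage_pdb(*status))
```

The distinction matters for remediation: a "budget-too-strict" result points at step 10 of the playbook, while "unhealthy-pods" points back at readiness failures found in step 2.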

Inline references:
- [Kubernetes PDB documentation](https://kubernetes.io/docs/tasks/run-application/configure-pdb/)
- [PDB API reference](https://kubernetes.io/docs/reference/kubernetes-api/workload-resources/pod-disruption-budget-v1/)
- [kubectl drain behavior](https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node/)
- [Cluster Autoscaler FAQ – PDB interaction](https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md)
Comment on lines +1 to +47

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify whether a README entry for this playbook already exists.
rg -n "PodDisruptionBudget\|PDB" README.md

Repository: Scoutflo/Scoutflo-SRE-Playbooks

Length of output: 57


🏁 Script executed:

#!/bin/bash
# Check the structure of the repository and locate README files
echo "=== Repository structure ==="
git ls-files | head -30

echo -e "\n=== K8s Playbooks structure ==="
git ls-files "K8s Playbooks/" | head -20

echo -e "\n=== Check for README files in K8s Playbooks ==="
find . -path "*K8s Playbooks*" -name "README*" -o -path "*K8s Playbooks*" -name "*.md" | head -20

Repository: Scoutflo/Scoutflo-SRE-Playbooks

Length of output: 4293


🏁 Script executed:

#!/bin/bash
# Examine README.md structure and content length
echo "=== README.md content (first 100 lines) ==="
head -100 README.md

echo -e "\n=== Check for playbook references in README.md ==="
rg -n "Playbook\|workload\|K8s Playbooks" README.md | head -30

Repository: Scoutflo/Scoutflo-SRE-Playbooks

Length of output: 5307


🏁 Script executed:

#!/bin/bash
# Get the Contents section of README.md
echo "=== README.md Contents section ==="
sed -n '/^## Contents/,/^## [A-Z]/p' README.md | head -200

echo -e "\n=== List files in K8s Playbooks/04-Workloads/ ==="
git ls-files "K8s Playbooks/04-Workloads/" | sort

echo -e "\n=== Check K8s Playbooks/04-Workloads/README.md ==="
cat -n "K8s Playbooks/04-Workloads/README.md" 2>/dev/null || echo "File not found"

Repository: Scoutflo/Scoutflo-SRE-Playbooks

Length of output: 14851


🏁 Script executed:

#!/bin/bash
# Search README.md for "04-Workloads" to see if it's referenced
rg -n "04-Workloads|Workloads" README.md

Repository: Scoutflo/Scoutflo-SRE-Playbooks

Length of output: 218


Add README.md index entries for the PodDisruptionBudget playbook.

The playbook file exists but is missing from the documentation indexes:

  • Not listed in K8s Playbooks/04-Workloads/README.md (should be playbook #24)
  • Not referenced in the main README.md

Add entries to both files to maintain consistency with other playbooks in the repository.

🧰 Tools
🪛 LanguageTool

[style] ~37-~37: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...e likely blocker for that operation. 3. If the workload has replicas = N but onl...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~38-~38: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...uptions until pod health stabilizes. 4. If draining <node-name> is blocked and t...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~39-~39: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...ditional healthy replicas elsewhere. 5. If multiple PDBs in <namespace> match th...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~40-~40: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ... if each PDB looks reasonable alone. 6. If selector validation shows the PDB match...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~41-~41: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...d and re-check disruptionsAllowed. 7. If autoscaler scale-down messages align wi...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@K8s Playbooks/04-Workloads/PodDisruptionBudgetViolations-workload.md` around
lines 1 - 47, The PodDisruptionBudget playbook file is not registered in the
documentation indexes; update the index files by adding a README entry linking
"Pod Disruption Budget Violations - Workload" into "K8s
Playbooks/04-Workloads/README.md" as playbook `#24` (matching the numbering
pattern used by other entries) and add a corresponding link/reference in the
top-level "README.md" so the new playbook appears in the main list; ensure the
link text and path match the playbook title "Pod Disruption Budget Violations -
Workload" and follow the same formatting/style used by the existing playbook
entries.