Add: Pod Disruption Budget violations workload playbook #4
base: master
Changes from all commits
66d63de
71bc24e
9000291
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,47 @@ | ||
| --- | ||
| title: Pod Disruption Budget Violations - Workload | ||
| weight: 250 | ||
| categories: | ||
| - kubernetes | ||
| - workload | ||
| - pod-disruption-budget | ||
| --- | ||
|
|
||
| # Pod Disruption Budget Violations - Workload | ||
|
|
||
| ## Meaning | ||
|
|
||
| Pod Disruption Budget (PDB) violations happen when Kubernetes rejects a voluntary eviction because removing the pod would break the availability guarantees defined in a PodDisruptionBudget. This commonly shows up during node drain operations and cluster autoscaler scale-down. Rolling updates are not themselves limited by PDBs, but pods made unavailable by a rollout still count against the budget and can leave no room for evictions. A typical symptom is that `disruptionsAllowed` is `0` for a PDB protecting the affected workload, so voluntary evictions stall. This is usually caused by an overly strict `minAvailable` / `maxUnavailable`, unhealthy pods reducing `currentHealthy`, or selector issues (the PDB selecting the wrong pods or not matching the intended pods). For PDB concepts and behavior, refer to [Kubernetes PDB documentation](https://kubernetes.io/docs/tasks/run-application/configure-pdb/). | ||
|
|
||
| ## Impact | ||
|
|
||
| Node maintenance can be blocked because draining `<node-name>` hangs or fails when protected pods cannot be evicted without violating the PDB. A stalled rolling update for `<deployment-name>` / `<statefulset-name>` can compound the problem: pods that are not `Ready` lower the healthy count and leave no budget for evictions. Cluster autoscaler may be unable to scale down nodes, which increases cost and can keep workloads on suboptimal or degraded nodes. During incidents, this delays remediation steps like node replacement, OS patching, or rescheduling away from unhealthy hardware. | ||
|
|
||
| ## Playbook | ||
|
|
||
| 1. Describe the PodDisruptionBudget `<pdb-name>` in namespace `<namespace>` and note `minAvailable` / `maxUnavailable`, selector, and status fields (`currentHealthy`, `desiredHealthy`, `disruptionsAllowed`). | ||
| 2. List pods in namespace `<namespace>` that match the PDB selector and identify which pods are `Ready` vs not `Ready`, including restart spikes or readiness probe failures. | ||
| 3. Retrieve recent events in namespace `<namespace>` and look for eviction/drain messages that mention PDB blocking disruptions, noting timestamps for correlation. | ||
| 4. If the issue is related to a rollout, check rollout status for `<deployment-name>` or `<statefulset-name>` in namespace `<namespace>` and identify whether it is waiting for pods to become `Ready` or unable to terminate old pods safely. | ||
| 5. If the issue is related to node maintenance, inspect which pods on `<node-name>` are protected by the PDB and confirm whether draining `<node-name>` would require evicting pods that would reduce healthy replicas below the PDB threshold. | ||
| 6. Check replica counts for the workload (`replicas`, `readyReplicas`, `availableReplicas`) and confirm whether there are enough replicas to allow at least one voluntary disruption. | ||
| 7. List all PDBs in namespace `<namespace>` and verify there are no overlapping PDB selectors protecting the same pods with incompatible constraints. | ||
| 8. If cluster autoscaler scale-down is impacted, review autoscaler events/log messages around the same time window to confirm that PDB constraints are preventing node removal. | ||
| 9. Validate selector correctness by comparing `spec.selector.matchLabels` of `<pdb-name>` with actual pod labels for the intended workload, and ensure the PDB is protecting the right pods. | ||
| 10. If disruptions must proceed urgently, coordinate with the service owner to temporarily relax the PDB or increase replicas for the workload so that `disruptionsAllowed` becomes greater than `0`, then re-run the drain/rollout safely. | ||
|
|
||
| ## Diagnosis | ||
|
|
||
| 1. If `disruptionsAllowed` is `0` and `currentHealthy` is at or below `desiredHealthy` (the minimum derived from `minAvailable` / `maxUnavailable`), then the PDB is actively preventing any voluntary evictions until healthy capacity increases. | ||
| 2. If events show eviction failures at time `T` and a node drain or rollout started at approximately time `T`, then the PDB constraint is the likely blocker for that operation. | ||
| 3. If the workload has `replicas = N` but `readyReplicas < N` (or readiness is flapping), then `currentHealthy` drops and the PDB will block disruptions until pod health stabilizes. | ||
| 4. If draining `<node-name>` is blocked and the protected pods on `<node-name>` match the PDB selector, then the drain is waiting on PDB safety checks and will not proceed without additional healthy replicas elsewhere. | ||
| 5. If multiple PDBs in `<namespace>` match the same pods, then the eviction API cannot resolve which budget applies and rejects evictions of those pods, so drains fail even if each PDB looks reasonable alone. | ||
| 6. If selector validation shows the PDB matches unexpected pods (or misses intended pods), then update the selector to target only the correct workload and re-check `disruptionsAllowed`. | ||
| 7. If autoscaler scale-down messages align with PDB blocking and the pods on candidate nodes are protected, then autoscaler cannot remove the node without violating the PDB, so scale-down will remain blocked. | ||
|
|
||
| Inline references: | ||
| - [Kubernetes PDB documentation](https://kubernetes.io/docs/tasks/run-application/configure-pdb/) | ||
| - [PDB API reference](https://kubernetes.io/docs/reference/kubernetes-api/workload-resources/pod-disruption-budget-v1/) | ||
| - [kubectl drain behavior](https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node/) | ||
| - [Cluster Autoscaler FAQ: PDB interaction](https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md) | ||
|
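The relationship between the status fields named in the Diagnosis section can be sketched as simple arithmetic. This is a minimal illustration, assuming `disruptionsAllowed` is `currentHealthy` minus `desiredHealthy`, floored at zero; the real controller also accounts for pods that are already being disrupted. The function name `disruptions_allowed` is hypothetical and not part of `kubectl`.

```shell
#!/bin/sh
# Hypothetical helper: estimate disruptionsAllowed from two PDB status values.
# Usage: disruptions_allowed CURRENT_HEALTHY DESIRED_HEALTHY
disruptions_allowed() {
  current=$1
  desired=$2
  allowed=$((current - desired))
  # The budget never goes negative; no spare healthy pods means zero evictions.
  if [ "$allowed" -lt 0 ]; then
    allowed=0
  fi
  echo "$allowed"
}

# Three healthy pods with minAvailable 3: no eviction can be admitted.
disruptions_allowed 3 3
# Five healthy pods with minAvailable 3: two evictions can be admitted.
disruptions_allowed 5 3
```

Read this way, the two remediation options in step 10 correspond to lowering the second argument (relaxing the PDB) or raising the first (adding healthy replicas).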
Comment on lines +1 to +47
Analysis chain
Script executed:
#!/bin/bash
# Verify whether a README entry for this playbook already exists.
rg -n "PodDisruptionBudget\|PDB" README.md
Repository: Scoutflo/Scoutflo-SRE-Playbooks
Length of output: 57
Script executed:
#!/bin/bash
# Check the structure of the repository and locate README files
echo "=== Repository structure ==="
git ls-files | head -30
echo -e "\n=== K8s Playbooks structure ==="
git ls-files "K8s Playbooks/" | head -20
echo -e "\n=== Check for README files in K8s Playbooks ==="
find . -path "*K8s Playbooks*" -name "README*" -o -path "*K8s Playbooks*" -name "*.md" | head -20
Repository: Scoutflo/Scoutflo-SRE-Playbooks
Length of output: 4293
Script executed:
#!/bin/bash
# Examine README.md structure and content length
echo "=== README.md content (first 100 lines) ==="
head -100 README.md
echo -e "\n=== Check for playbook references in README.md ==="
rg -n "Playbook\|workload\|K8s Playbooks" README.md | head -30
Repository: Scoutflo/Scoutflo-SRE-Playbooks
Length of output: 5307
Script executed:
#!/bin/bash
# Get the Contents section of README.md
echo "=== README.md Contents section ==="
sed -n '/^## Contents/,/^## [A-Z]/p' README.md | head -200
echo -e "\n=== List files in K8s Playbooks/04-Workloads/ ==="
git ls-files "K8s Playbooks/04-Workloads/" | sort
echo -e "\n=== Check K8s Playbooks/04-Workloads/README.md ==="
cat -n "K8s Playbooks/04-Workloads/README.md" 2>/dev/null || echo "File not found"
Repository: Scoutflo/Scoutflo-SRE-Playbooks
Length of output: 14851
Script executed:
#!/bin/bash
# Search README.md for "04-Workloads" to see if it's referenced
rg -n "04-Workloads|Workloads" README.md
Repository: Scoutflo/Scoutflo-SRE-Playbooks
Length of output: 218
Add README.md index entries for the PodDisruptionBudget playbook.
The playbook file exists but is missing from the documentation indexes:
Add entries to both files to maintain consistency with other playbooks in the repository.
Tools
LanguageTool
[style] ~37-~37: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym. (ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~38-~38: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym. (ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~39-~39: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym. (ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~40-~40: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym. (ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~41-~41: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym. (ENGLISH_WORD_REPEAT_BEGINNING_RULE)
Analysis chain
Script executed:
Repository: Scoutflo/Scoutflo-SRE-Playbooks
Length of output: 1400
Script executed:
Repository: Scoutflo/Scoutflo-SRE-Playbooks
Length of output: 1266
Script executed:
Repository: Scoutflo/Scoutflo-SRE-Playbooks
Length of output: 5786
Script executed:
Repository: Scoutflo/Scoutflo-SRE-Playbooks
Length of output: 5390
Script executed:
Repository: Scoutflo/Scoutflo-SRE-Playbooks
Length of output: 4073
Add concrete `kubectl` commands to each playbook step.

Every step describes what to inspect but omits the actual command needed to do so. During a live incident an operator must be able to copy-paste commands directly from the runbook without mentally translating prose into CLI syntax. Other playbooks in the repository include `kubectl` commands in their diagnostic guidance; this one is missing them entirely. Below is a representative mapping for each step:

1. `kubectl describe pdb <pdb-name> -n <namespace>`
2. `kubectl get pods -n <namespace> -l <label-selector> -o wide`
3. `kubectl get events -n <namespace> --sort-by='.lastTimestamp' | grep -i evict`
4. `kubectl rollout status deployment/<deployment-name> -n <namespace>`
5. `kubectl get pods -n <namespace> -o wide --field-selector spec.nodeName=<node-name>`
6. `kubectl get deployment <deployment-name> -n <namespace> -o jsonpath='{.status}'`
7. `kubectl get pdb -n <namespace>`
8. `kubectl get events -n kube-system --sort-by='.lastTimestamp' | grep -i scale`
9. `kubectl get pdb <pdb-name> -n <namespace> -o jsonpath='{.spec.selector}'`
10. `kubectl patch pdb <pdb-name> -n <namespace> -p '{"spec":{"minAvailable":0}}'`

Note for Step 9: PDB selectors can also use `matchExpressions` in addition to `matchLabels`; both should be compared against pod labels when validating the selector.
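
One way to make the read-only commands above easier to paste during an incident is a small dry-run helper that fills in the placeholders and prints the commands without executing them. This is a sketch only: `pdb_triage_commands` is a hypothetical function name, the namespace and PDB name in the usage line are made-up example values, and the `kubectl` invocations are the ones listed in the review comment.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) read-only PDB triage commands.
# Usage: pdb_triage_commands NAMESPACE PDB_NAME
pdb_triage_commands() {
  ns=$1
  pdb=$2
  # printf repeats the format for each argument, giving one command per line.
  printf '%s\n' \
    "kubectl describe pdb $pdb -n $ns" \
    "kubectl get pdb $pdb -n $ns -o jsonpath='{.spec.selector}'" \
    "kubectl get pdb -n $ns" \
    "kubectl get events -n $ns --sort-by='.lastTimestamp' | grep -i evict"
}

# Example values (assumptions for illustration only).
pdb_triage_commands payments api-pdb
```

The mutating `kubectl patch` command from step 10 is deliberately left out of the helper, since relaxing a PDB should be coordinated with the service owner rather than scripted.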