kubectl-explain-failure is a deterministic diagnostic engine that explains why a Kubernetes Pod is failing by constructing structured causal explanations from Kubernetes object state and event timelines.
Kubernetes exposes signals (Pod status, Events, PVC state, Node conditions), but it does not synthesize them into root causes.
This project builds an explicit reasoning layer on top of those signals.
It is a read-only explanation engine, not a controller, not a fixer, and not an ML system.
Kubernetes gives you:
- Pod.status
- Container states
- Events
- PVC / PV / StorageClass
- Node conditions
- Owner references (ReplicaSet / Deployment / StatefulSet)
You still have to manually answer:
"What is the most likely reason this Pod is failing?"
This tool answers that question using:
- Explicit rule contracts
- Structured object-graph reasoning
- Timeline normalization
- Causal chains
- Conflict resolution
- Compositional confidence scoring
All behavior is deterministic and fully test-covered.
The engine operates on a normalized object graph:
```python
context = {
    "pod": pod,
    "events": events,
    "objects": {
        "pvc": {...},
        "pv": {...},
        "node": {...},
        "storageclass": {...},
        "owner": {...},
    },
}
```
Supported first-class objects include:
- Pod
- PersistentVolumeClaim
- PersistentVolume
- StorageClass
- Node
- ReplicaSet
- Deployment
- StatefulSet
- ServiceAccount
- Secrets
- NodeConditions (structured)
"requires = { "objects": ["pvc", "pv"], "optional": ["storageclass"] }"
The engine normalizes legacy flat context into this object-graph model automatically. Object state always has precedence over raw Events.
Object state > Conditions > Timeline > Raw events
This significantly improves determinism and confidence accuracy.
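To illustrate the object-graph model, a rule's `requires` contract can be checked against the normalized context roughly like this. The helper below is a simplified stand-in for illustration, not the engine's actual implementation; only the `context` and `requires` shapes follow the README.

```python
# Simplified stand-in for requires-contract gating. The context/requires
# shapes mirror the README's examples; the helper itself is illustrative.
def rule_is_eligible(requires: dict, context: dict) -> bool:
    """A rule only runs when every required object exists in the graph."""
    objects = context.get("objects", {})
    return all(key in objects for key in requires.get("objects", []))

context = {
    "pod": {"status": {"phase": "Pending"}},
    "events": [],
    "objects": {"pvc": {"status": {"phase": "Pending"}}, "pv": {}},
}
requires = {"objects": ["pvc", "pv"], "optional": ["storageclass"]}

print(rule_is_eligible(requires, context))  # True: pvc and pv are present
```

Optional objects do not gate eligibility; they only enrich the explanation when present.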
Raw Kubernetes events are normalized into structured semantic signals:
- NormalizedEvent:
- kind (Scheduling / Image / Volume / Generic)
- phase (Failure / Info)
- reason
- source
- Semantic matching (timeline.has(kind="Scheduling", phase="Failure"))
- Repeated reason detection
- Pattern matching
- Duration measurement between related events
- Repeated-event escalation detection
This moves diagnosis from snapshot inspection to incident reasoning.
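The normalized timeline can be sketched with minimal stand-in types. Field names mirror the list above; the real abstraction lives in `timeline.py`, so treat this as an assumption-laden sketch, not the project's API.

```python
from dataclasses import dataclass

# Illustrative stand-ins for the normalized timeline; field names follow
# the README, the implementation is a sketch.
@dataclass
class NormalizedEvent:
    kind: str    # Scheduling / Image / Volume / Generic
    phase: str   # Failure / Info
    reason: str
    source: str

class Timeline:
    def __init__(self, events):
        self.events = events

    def has(self, kind=None, phase=None):
        """Semantic matching: any event with the given kind/phase."""
        return any(
            (kind is None or e.kind == kind) and (phase is None or e.phase == phase)
            for e in self.events
        )

timeline = Timeline([
    NormalizedEvent("Scheduling", "Failure", "FailedScheduling", "default-scheduler"),
    NormalizedEvent("Image", "Info", "Pulling", "kubelet"),
])
print(timeline.has(kind="Scheduling", phase="Failure"))  # True
```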
Rules do not return flat explanations; they return structured causal chains:

```python
CausalChain(
    causes=[...],
    symptoms=[...],
    contributing=[...],
)
```
The engine then:
- Aggregates matches
- Selects the highest-confidence root cause
- Preserves supporting causes
- Applies suppression semantics
- Emits a structured result
This enables explainability and deterministic reasoning.
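The aggregation step above can be sketched with simplified stand-in types (the real structures live in `causality.py`; the rule names and confidences here are invented for demonstration):

```python
from dataclasses import dataclass, field

# Simplified stand-ins for causal-chain aggregation; not the engine's types.
@dataclass
class CausalChain:
    causes: list
    symptoms: list
    contributing: list = field(default_factory=list)

@dataclass
class Match:
    rule: str
    confidence: float
    chain: CausalChain

def select_root_cause(matches):
    """Highest-confidence match wins; the rest are kept as supporting causes."""
    winner = max(matches, key=lambda m: m.confidence)
    supporting = [m for m in matches if m is not winner]
    return winner, supporting

matches = [
    Match("PVCNotBound", 0.9, CausalChain(["PVC pending"], ["Pod Pending"])),
    Match("FailedScheduling", 0.6, CausalChain(["no feasible nodes"], ["Pod Pending"])),
]
winner, supporting = select_root_cause(matches)
print(winner.rule)  # PVCNotBound
```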
Rules can explicitly block other rules:
```python
blocks = ["FailedScheduling", "UnschedulableTaint"]
```
Compound rules automatically subsume lower-level crash signals.
- Rules are evaluated in priority order
- Compound rules suppress container-level signals
- Suppression map is preserved in output
- Only unsuppressed winners are returned
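A minimal sketch of these suppression semantics, assuming `blocks` lists like the example above (the helper and data are illustrative, not the engine's resolution code):

```python
# Hedged sketch of suppression resolution; blocks_map mirrors the README's
# `blocks` example, and the matched rule names are invented for illustration.
def resolve(matched, blocks_map):
    """Drop any matched rule that another matched rule explicitly blocks."""
    suppressed = {}
    for rule in matched:
        for blocked in blocks_map.get(rule, []):
            if blocked in matched:
                suppressed[blocked] = rule  # blocked rule -> its suppressor
    winners = [r for r in matched if r not in suppressed]
    return winners, suppressed

matched = ["PVCNotBound", "FailedScheduling", "UnschedulableTaint"]
blocks_map = {"PVCNotBound": ["FailedScheduling", "UnschedulableTaint"]}

winners, suppressed = resolve(matched, blocks_map)
print(winners)     # ['PVCNotBound']
print(suppressed)  # {'FailedScheduling': 'PVCNotBound', 'UnschedulableTaint': 'PVCNotBound'}
```

Preserving the suppression map in the output is what lets the final explanation say not just what won, but what it overrode.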
Golden tests cover:
- PVC dominance over scheduling errors
- Compound rule precedence
- YAML rule safety
- Engine invariants
Confidence is not static.
Final confidence is computed as:
confidence = rule_confidence * evidence_quality * data_completeness * conflict_penalty
This makes confidence:
- Deterministic
- Explainable
- Predictable under partial input
- Stable under rule reordering
Confidence is always bounded to [0,1].
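The composition formula above translates directly into a small helper; the factor values in the example are illustrative:

```python
# Direct transcription of the confidence formula; factor values are invented.
def compose_confidence(rule_confidence, evidence_quality,
                       data_completeness, conflict_penalty):
    value = rule_confidence * evidence_quality * data_completeness * conflict_penalty
    return max(0.0, min(1.0, value))  # always bounded to [0, 1]

print(round(compose_confidence(0.9, 0.8, 1.0, 0.95), 3))  # 0.684
```

Because each factor is deterministic, the same inputs always yield the same confidence, and missing data lowers the score through `data_completeness` rather than through guesswork.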
Supported rules include:
- AdmissionWebhookDenied
- ImagePolicyWebhookRejected
- PrivilegedNotAllowed
- SecurityContextViolation
- LimitRangeViolation
- ResourceQuotaExceeded
- RBACForbidden
- ServiceAccountMissing
- ServiceAccountRBAC
- ExpiredServiceAccountToken
- TokenProjectionFailure
- FailedScheduling
- AffinityUnsatisfiable
- TopologySpreadUnsatisfiable
- PodTopologySpreadSkewTooHigh
- PodOverheadExceededNodeCapacity
- NodeSelectorMismatch
- NodeUnschedulableCordoned
- ExtendedResourceUnavailable
- InsufficientResources
- UnschedulableTaint
- HostPortConflict
- PreemptedByHigherPriority
- RuntimeClassNotFound
- Compound:
- SchedulingFlapping
- PendingUnschedulable
- PriorityPreemptionChain
- SchedulingTimeoutExceeded
- NodeMemoryPressure
- NodePIDPressure
- NodeDiskPressure
- EvictedRule
- EphemeralStorageExceeded
- Compound:
- NodeNotReadyEvicted
- PVCBoundThenNodePressure
- PVCBoundNodeDiskPressureMount
- AccessModeMismatch
- PVCNotBound
- PVReleasedOrFailed
- PVCMountFailed
- FailedMount
- FilesystemResizePending
- PVCZoneMismatch
- StorageClassProvisionerMissing
- ReadWriteOnceMultiNodeConflict
- ConfigMapNotFound
- VolumeAttachmentTimeout
- CSIPluginNotRegistered
- Compound:
- PVCMountFailure
- PVCPendingTooLong
- DynamicProvisioningTimeout
- PVCPendingThenCrashloop
- PVCThenCrashloopRule
- PVCBoundThenCrashLoop
- PVCThenImagePullFailRule
- PVCRecoveredButAppStillFailing
- ImagePullError
- ImagePullBackOff
- ImagePullSecretMissing
- ImageArchitectureMismatch
- InvalidEntrypoint
- ContainerCreateConfigError
- ContainerRuntimeStartFailure
- ContainerRuntimePermissionDenied
- CrashLoopBackoff
- RegistryRateLimited
- OOMKilled
- ReadOnlyRootFilesystemWriteAttempt
- PreStopHookFailure
- TerminationGracePeriodExceeded
- Compound:
- CrashLoopOOMKilled
- CrashLoopLivenessProbe
- CrashLoopAfterConfigChange
- ImagePullSecretMissingCompound
- ImageUpdatedThenCrashLoop
- ImageTagMutableDrift
- RapidRestartEscalation
- ReadinessProbeFailure
- StartupProbeFailure
- Compound:
- RepeatedProbeFailureEscalation
- DNSResolutionFailure
- CNIPluginFailure
- CNIIPExhaustion
- ServiceEndpointsEmpty
- ServiceNotFound
- Compound:
- NetworkPolicyBlocked
- HostNetworkPortConflict
- ReplicaSetCreateFailure
- ReplicaSetUnavailable
- DeploymentProgressDeadlineExceeded
- StatefulSetUpdateBlocked
- HeadlessServiceMissingForStatefulSet
- DeploymentReplicaMismatch
- PodDisruptionBudgetBlocking
- DaemonSetNodeSelectorMismatch
- Compound:
- HPAUnableToScale
- OwnerBlockedPod
- InitContainerFailureRule
- Compound:
- InitContainerBlocksMain
- MultiContainerPartialFailure
- Compound:
- ConflictingSignalsResolution

Resolution semantics:
- Rules are evaluated in priority order.
- High-priority rules can suppress lower-priority rules, preventing misleading explanations.
- The first matching, unsuppressed rule produces the explanation.
- Golden tests assert that suppression works correctly.
To install the package locally in development mode, allowing editable imports:
- git clone https://github.com/MoeSalah1999/kubectl_explain_failure
- cd kubectl_explain_failure
- python -m pip install -e .
This ensures you can import the package as:
from kubectl_explain_failure.engine import explain_failure
and run tests or scripts directly.
Install the packaged CLI/plugin entrypoint:
python -m pip install kubectl-explain-failure
This installs the console script:
kubectl-explain-failure
For local packaged testing from source:
python -m pip install .
- Versioning uses Semantic Versioning.
- Current release version is 0.1.0.
- Release history is tracked in CHANGELOG.md.

Release steps:
- Bump version in pyproject.toml
- Add release notes under a new version in CHANGELOG.md
- Tag the release in git
- Publish the package build
Snapshot mode with fixture files:

```shell
python -m kubectl_explain_failure \
  --pod /kubectl_explain_failure/tests/fixtures/pod.json \
  --events /kubectl_explain_failure/tests/fixtures/events.json
```

Snapshot mode with additional objects:

```shell
python -m kubectl_explain_failure \
  --pod /kubectl_explain_failure/tests/fixtures/pending_pod.json \
  --events /kubectl_explain_failure/tests/fixtures/empty_events.json \
  --pvc /kubectl_explain_failure/tests/fixtures/pvc_pending.json \
  --node /kubectl_explain_failure/tests/fixtures/node_disk_pressure.json
```

Live mode:

```shell
python -m kubectl_explain_failure pod my-pod \
  --live \
  --namespace default \
  --format json
```

Live-mode options include:
- --namespace, --context, --kubeconfig
- --timeout
- --event-limit, --event-chunk-size
- --retries, --retry-backoff
- --trace-id (correlate structured live-fetch logs across components)
The kubectl-explain-failure plugin forwards directly to the live CLI path.
Example:
kubectl explain-failure my-pod -n default --format json
The output includes:
- root_cause
- confidence
- causal_chain
- suppressed_rules
- evidence
- suggested_next_checks
- metadata
Output is fully deterministic for identical inputs.
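For orientation, the result might look like the following. Only the keys come from the list above; every value here is invented for demonstration.

```python
import json

# Illustrative shape of the explanation result. Keys follow the README's
# output contract; the values are invented for demonstration only.
result = {
    "root_cause": "PVCNotBound",
    "confidence": 0.87,
    "causal_chain": {
        "causes": ["PVC is Pending"],
        "symptoms": ["Pod is Pending"],
        "contributing": [],
    },
    "suppressed_rules": {"FailedScheduling": "PVCNotBound"},
    "evidence": ["PVC phase=Pending"],
    "suggested_next_checks": ["Inspect the PVC and its StorageClass"],
    "metadata": {"engine_version": "0.1.0"},
}

# Identical inputs always serialize identically.
print(json.dumps(result, sort_keys=True))
```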
GitHub Actions workflow is defined at:
.github/workflows/ci.yml
It provides:
- Matrix quality/test job across OS and Python versions
- Gated live integration job (manual dispatch only)
- Trigger the workflow with run_live=true
- Configure required repository secrets for live-cluster pod targets and kube access
- pytest
- hypothesis (property-based testing)
- tox
- mypy
- Golden snapshot testing
- Regression invariants
Tests are not included in the installed package and must be run from the source tree. To run tests in the development environment:
- tox
- tox -e format # code formatting
- tox -e lint # static linting
- tox -e typing # mypy type checks
- tox -e test # pytest suite, including golden tests
Tox automatically installs required dependencies, including:
- pytest
- hypothesis
- mypy
- PyYAML
This ensures tests run in a clean environment and type checks are enforced.
The test suite asserts:
- Exact explanation structure
- Confidence stability
- Suppression correctness
- Temporal reasoning behavior
- Engine invariants
- YAML rule safety
- Object-graph compatibility
- Rule contract enforcement
- PVC dominance semantics
- Live CLI provenance metadata
- Live provider abstraction behavior
- Live provider retry behavior (transient vs non-retryable errors)
The project includes a reusable Hypothesis snapshot generator for Kubernetes-style inputs:
- file: kubectl_explain_failure/tests/property/strategies.py
- primary APIs: snapshot_strategy(), crashloop_snapshot_strategy(), pvc_scheduler_snapshot_strategy(), malformed_snapshot_strategy(), crashloop_oom_snapshot_strategy(), unrelated_noise()
The generator produces coherent engine inputs (pod, events, context) and supports snapshot cloning/injection for monotonicity and idempotence properties.
Property tests are configured through:
- file: kubectl_explain_failure/tests/property/conftest.py
- profiles: fast (default, local dev), deep (higher example count for CI/fuzzing)
Run with the default profile:

```shell
venv\Scripts\python.exe -m pytest kubectl_explain_failure/tests/property -q
```

Run with the deep profile (PowerShell):

```shell
$env:HYPOTHESIS_PROFILE="deep"
venv\Scripts\python.exe -m pytest kubectl_explain_failure/tests/property -q
```
Property suite validates engine-level invariants such as:
- idempotence / determinism for identical snapshots
- monotonicity under unrelated object noise
- causal-chain structural integrity
- confidence bounds and output contract stability
- suppression/resolution integrity
- category gating and rule dependency/phase/state gating
The live path is covered at three levels:
- Regression/unit tests with mocked providers and kubectl responses
- Property tests for live adapter normalization and metadata invariants
- Optional real-cluster integration smoke tests (env-gated):
  kubectl_explain_failure/tests/integration/test_live_adapter_integration.py
  Set KUBECTL_EXPLAIN_FAILURE_RUN_LIVE_INTEGRATION=1 to enable.
- engine.py - rule evaluation, resolution, confidence composition
- causality.py - CausalChain and Resolution structures
- context.py - context normalization
- timeline.py - normalized timeline abstraction
- relations.py - object dependency graph logic
- loader.py - rule and plugin discovery
- live_adapter.py - live data adapter, provider abstraction, retries, partial-fetch handling
- cli.py - snapshot/live orchestration, provenance, live completeness confidence penalty
- rules/ - rule corpus (Python + YAML)
- tests/ - golden + regression + contract tests
- tests/property/strategies.py - reusable Hypothesis Kubernetes snapshot generator
- tests/property/conftest.py - property-testing profile configuration
```python
class FailureRule:
    name: str
    category: str
    priority: int
    requires: dict

    def matches(...)
    def explain(...)
```
All rules must be deterministic and side-effect free.
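A minimal illustrative rule following this contract might look like the sketch below. The class, its thresholds, and its return payload are invented for demonstration; it is not a rule shipped with the project.

```python
# Hypothetical rule sketch following the FailureRule contract above.
# Deterministic and side-effect free: it only reads the context.
class PVCNotBoundExample:
    name = "PVCNotBoundExample"
    category = "storage"
    priority = 50
    requires = {"objects": ["pvc"]}

    def matches(self, context):
        pvc = context["objects"]["pvc"]
        return pvc.get("status", {}).get("phase") == "Pending"

    def explain(self, context):
        return {
            "root_cause": "PVC is not bound",
            "evidence": ["PVC phase=Pending"],
        }

rule = PVCNotBoundExample()
ctx = {"objects": {"pvc": {"status": {"phase": "Pending"}}}}
print(rule.matches(ctx))  # True
```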
- No cluster mutation
- No remediation
- No automatic fixes
- No ML-based inference or prediction
- No probabilistic ranking beyond deterministic confidence composition
This is a diagnostic explainer, not a fixer.
- Deterministic over heuristic guessing
- Explicit over implicit
- Structured causality over flat strings
- Object state over event heuristics
- Suppression over ambiguity
- Fully testable behavior
- Additional failure heuristics and rule coverage
- Expanded real-cluster integration scenarios for CI/nightly
- Additional provider implementations behind the live adapter interface
License: MIT