[Feature Request] Support detailed disruption reason detection via new metric kube_pod_status_disruption_reason #2774

@kincoy

What would you like to be added:
Introduce a new metric kube_pod_status_disruption_reason that captures Pod disruption causes by reading the reason from pod.status.conditions (specifically where type == "DisruptionTarget"). This metric should support these possible reasons:

podDisruptionConditionReasons = [
  "PreemptionByScheduler",
  "DeletionByTaintManager",
  "EvictionByEvictionAPI",
  "DeletionByPodGC",
  "TerminationByKubelet"
]

This is the complete set of disruption reasons defined in the official Kubernetes documentation on pod disruption conditions.

Why is this needed:
The existing kube_pod_status_reason metric is limited for this purpose and often reports zero for every reason label, even when pods are clearly being evicted. The root cause lies in its implementation: for evictions it only matches pod.status.reason == "Evicted", which is set only when the kubelet itself initiates the eviction (e.g., due to resource pressure or shutdown). This logic overlooks evictions triggered via the Eviction API (e.g., by kubectl drain), which instead set reason = "EvictionByEvictionAPI" on the DisruptionTarget condition in pod.status.conditions and leave pod.status.reason unchanged, so the metric stays at zero.
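To make the gap concrete, here is a minimal Go sketch. It is not the actual kube-state-metrics code: `PodStatus` and `PodCondition` are simplified, hypothetical stand-ins for the `k8s.io/api/core/v1` types, `legacyEvicted` and `disruptionReason` are illustrative names, and requiring the condition status to be "True" is an assumption on my part.

```go
package main

import "fmt"

// PodCondition and PodStatus are simplified, hypothetical stand-ins for the
// corresponding k8s.io/api/core/v1 types; only the fields used here are kept.
type PodCondition struct {
	Type   string
	Status string
	Reason string
}

type PodStatus struct {
	Reason     string
	Conditions []PodCondition
}

// legacyEvicted mirrors the check described above: it looks only at
// pod.status.reason and ignores pod.status.conditions entirely.
func legacyEvicted(s PodStatus) bool {
	return s.Reason == "Evicted"
}

// disruptionReason returns the reason of the first DisruptionTarget condition.
// Assumption: only conditions with status "True" are considered.
func disruptionReason(s PodStatus) (string, bool) {
	for _, c := range s.Conditions {
		if c.Type == "DisruptionTarget" && c.Status == "True" {
			return c.Reason, true
		}
	}
	return "", false
}

func main() {
	// A pod evicted through the Eviction API (for example by kubectl drain):
	// status.reason is empty, but the DisruptionTarget condition is set.
	drained := PodStatus{
		Conditions: []PodCondition{
			{Type: "DisruptionTarget", Status: "True", Reason: "EvictionByEvictionAPI"},
		},
	}

	fmt.Println(legacyEvicted(drained)) // false: the legacy check misses it
	reason, ok := disruptionReason(drained)
	fmt.Println(reason, ok) // EvictionByEvictionAPI true
}
```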

By covering all known disruption reasons from Kubernetes (as listed above), the proposed metric lets users accurately identify and monitor diverse Pod disruption events, whether they are triggered by the kubelet, scheduler preemption, the taint manager, Pod garbage collection, or the Eviction API.

Describe the solution you'd like:

  • Metric definition: kube_pod_status_disruption_reason (Gauge) that captures the disruption cause via pod.status.conditions where type equals "DisruptionTarget".
  • Supported reason values:
    • PreemptionByScheduler
    • DeletionByTaintManager
    • EvictionByEvictionAPI
    • DeletionByPodGC
    • TerminationByKubelet
  • Backward compatibility: Retain the legacy kube_pod_status_reason for existing dashboards and alerts, while clarifying in docs when to prefer the new metric.
  • Implementation guidance:
    • In the KSM implementation, inspect pod.status.conditions for DisruptionTarget conditions and emit the corresponding reason label.
    • Add unit/integration tests to simulate each disruption scenario.
    • Update metrics.md documentation with clear guidance, usage examples, and migration notes.
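The implementation guidance above could look roughly like the following Go sketch. It assumes the new metric would emit one series per known reason with value 1 for the active reason and 0 otherwise, mirroring how kube_pod_status_reason is described earlier in this issue; that encoding is an assumption, not something the proposal fixes. `PodCondition`, `Sample`, and `disruptionReasonSamples` are hypothetical names, and the namespace, pod, and uid labels a real series would carry are omitted for brevity.

```go
package main

import "fmt"

// PodCondition is a simplified, hypothetical stand-in for v1.PodCondition.
type PodCondition struct {
	Type, Status, Reason string
}

// Sample is one illustrative metric sample: a reason label and a gauge value.
type Sample struct {
	Reason string
	Value  float64
}

// The reasons listed in this proposal.
var podDisruptionConditionReasons = []string{
	"PreemptionByScheduler",
	"DeletionByTaintManager",
	"EvictionByEvictionAPI",
	"DeletionByPodGC",
	"TerminationByKubelet",
}

// disruptionReasonSamples returns one sample per known reason. The sample
// whose reason matches the pod's DisruptionTarget condition gets value 1,
// all others 0. Assumption: only conditions with status "True" count.
func disruptionReasonSamples(conditions []PodCondition) []Sample {
	active := ""
	for _, c := range conditions {
		if c.Type == "DisruptionTarget" && c.Status == "True" {
			active = c.Reason
			break
		}
	}

	samples := make([]Sample, 0, len(podDisruptionConditionReasons))
	for _, r := range podDisruptionConditionReasons {
		v := 0.0
		if r == active {
			v = 1
		}
		samples = append(samples, Sample{Reason: r, Value: v})
	}
	return samples
}

func main() {
	conds := []PodCondition{
		{Type: "Ready", Status: "False"},
		{Type: "DisruptionTarget", Status: "True", Reason: "DeletionByTaintManager"},
	}
	for _, s := range disruptionReasonSamples(conds) {
		fmt.Printf("kube_pod_status_disruption_reason{reason=%q} %g\n", s.Reason, s.Value)
	}
}
```

A table-driven unit test could feed one condition per reason through `disruptionReasonSamples` and check that exactly one sample is 1, which would cover the "simulate each disruption scenario" item without needing a cluster.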

Additional context:
related issues:

    Labels

    kind/feature, needs-triage
