Replace pod filtering with app-level readiness probes #455

@clubanderson

Problem

The current pod filtering logic in saturation metrics collection uses Prometheus kube_pod_info queries to filter out stale/terminated pods. This approach has several limitations:

  1. Staleness window: kube-state-metrics is typically scraped every ~15-30s, so the pod state seen in Prometheus can lag the cluster by up to one scrape interval
  2. Race conditions: New pods may not appear in kube_pod_info immediately after creation
  3. Complexity: The regex-based deployment name filtering adds complexity and potential edge cases
  4. Workarounds: The code includes fallback logic that skips filtering when kube_pod_info returns empty

Current Implementation

From internal/collector/prometheus/saturation_metrics.go:

```go
// Prometheus retains metrics from terminated pods for a time period, causing stale metrics to be pulled.
// Verify pod existence using Prometheus kube-state-metrics to filter out stale pods.
existingPods := cmc.getExistingPods(ctx, namespace, deployments, podSet)

// If getExistingPods returns empty but we have candidate pods with metrics,
// skip the filtering - this handles the case where kube_pod_info hasn't been
// scraped yet for new pods.
```
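To make the fallback concrete, here is a minimal sketch of the behavior those comments describe. It is not the repository's code: `applyPodFilter` and its map-based signature are hypothetical names chosen for illustration. It also shows one reason the logic may be fragile: an empty `kube_pod_info` result is treated the same whether the pods are brand new or already gone.

```go
package main

import "fmt"

// applyPodFilter is a hypothetical helper (not taken from saturation_metrics.go).
// It keeps only candidate pods that also appear in the set reported by
// kube_pod_info, except when that set is empty, in which case it skips
// filtering and returns every candidate.
func applyPodFilter(candidates, existing map[string]struct{}) map[string]struct{} {
	// Fallback: nothing reported by kube_pod_info, so trust the candidates.
	// This cannot tell "not scraped yet" apart from "all pods terminated".
	if len(existing) == 0 && len(candidates) > 0 {
		return candidates
	}
	filtered := make(map[string]struct{}, len(candidates))
	for name := range candidates {
		if _, ok := existing[name]; ok {
			filtered[name] = struct{}{}
		}
	}
	return filtered
}

func main() {
	candidates := map[string]struct{}{"vllm-0": {}, "vllm-old": {}}

	// Normal case: the stale pod is dropped.
	kept := applyPodFilter(candidates, map[string]struct{}{"vllm-0": {}})
	fmt.Println("with kube_pod_info data:", len(kept))

	// Fallback case: an empty result disables filtering entirely.
	kept = applyPodFilter(candidates, map[string]struct{}{})
	fmt.Println("with empty kube_pod_info:", len(kept))
}
```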

Proposed Enhancement

Replace the Prometheus-based pod filtering with app-level pod readiness probes:

  1. Use the Kubernetes API directly: Query pod status via the Kubernetes API, which reflects pod state without a scrape delay
  2. Check pod readiness conditions: Use the condition of type Ready in the pod's .status.conditions to determine whether a pod is actually serving traffic
  3. Consider container readiness: Leverage vLLM's built-in readiness probes (e.g., /health/ready) which already exist
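
The readiness check in the list above could be sketched roughly as follows. To stay self-contained, this example declares small stand-in types instead of importing `k8s.io/api/core/v1`; `Pod`, `PodCondition`, `isPodReady`, and `filterReadyPods` are assumptions made for illustration, and a real implementation would read the same fields from pods fetched through a Kubernetes client.

```go
package main

import "fmt"

// PodCondition is a stand-in for the entries of a pod's .status.conditions.
type PodCondition struct {
	Type   string // e.g. "Ready"
	Status string // "True", "False", or "Unknown"
}

// Pod is a stand-in carrying only the fields this sketch needs.
type Pod struct {
	Name        string
	Terminating bool // stand-in for metadata.deletionTimestamp being set
	Conditions  []PodCondition
}

// isPodReady reports whether the pod is not being deleted and has a
// condition of type Ready with status True.
func isPodReady(p Pod) bool {
	if p.Terminating {
		return false
	}
	for _, c := range p.Conditions {
		if c.Type == "Ready" {
			return c.Status == "True"
		}
	}
	return false // no Ready condition reported yet
}

// filterReadyPods returns the candidate pod names (those that have metrics)
// whose pod is currently Ready, preserving the candidates' order.
func filterReadyPods(candidates []string, pods []Pod) []string {
	ready := make(map[string]bool, len(pods))
	for _, p := range pods {
		if isPodReady(p) {
			ready[p.Name] = true
		}
	}
	var out []string
	for _, name := range candidates {
		if ready[name] {
			out = append(out, name)
		}
	}
	return out
}

func main() {
	pods := []Pod{
		{Name: "vllm-0", Conditions: []PodCondition{{Type: "Ready", Status: "True"}}},
		{Name: "vllm-1", Conditions: []PodCondition{{Type: "Ready", Status: "False"}}},
		{Name: "vllm-2", Terminating: true, Conditions: []PodCondition{{Type: "Ready", Status: "True"}}},
	}
	// "vllm-old" has metrics in Prometheus but no longer exists in the cluster.
	candidates := []string{"vllm-0", "vllm-1", "vllm-2", "vllm-old"}
	fmt.Println(filterReadyPods(candidates, pods))
}
```

Because the filter starts from live pod objects, it needs no regex over deployment names and no special case for an empty result: a pod that no longer exists simply never appears in the ready set.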

Benefits

  • Real-time pod state instead of scraped metrics with staleness
  • Simpler implementation without regex pattern matching
  • More reliable during scale-up/scale-down transitions
  • Aligns with Kubernetes native patterns for determining pod availability

Files Affected

  • internal/collector/prometheus/saturation_metrics.go - getExistingPods() and related filtering logic

Additional Context

This issue follows from a review comment by @asm582 on PR #451, which noted that the pod filtering logic appears fragile.
