Description
Problem
The current pod filtering logic in saturation metrics collection uses Prometheus kube_pod_info queries to filter out stale/terminated pods. This approach has several limitations:
- Staleness window: kube-state-metrics has a scrape interval of ~15-30s, meaning pod state can be stale
- Race conditions: New pods may not appear in kube_pod_info immediately after creation
- Complexity: The regex-based deployment name filtering adds complexity and potential edge cases
- Workarounds: The code includes fallback logic that skips filtering when kube_pod_info returns empty
Current Implementation
From internal/collector/prometheus/saturation_metrics.go:
```go
// Prometheus retains metrics from terminated pods for a time period, causing stale metrics to be pulled.
// Verify pod existence using Prometheus kube-state-metrics to filter out stale pods.
existingPods := cmc.getExistingPods(ctx, namespace, deployments, podSet)
// If getExistingPods returns empty but we have candidate pods with metrics,
// skip the filtering - this handles the case where kube_pod_info hasn't been
// scraped yet for new pods.
```
Proposed Enhancement
Replace the Prometheus-based pod filtering with app-level pod readiness probes:
- Use the Kubernetes API directly: Query pod status via the Kubernetes API, which has real-time pod state
- Check pod readiness conditions: Use the pod's .status.conditions entry with type Ready to determine whether pods are actually serving traffic (see the sketch after this list)
- Consider container readiness: Leverage vLLM's built-in readiness probe (e.g., /health/ready), which already exists
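A minimal sketch of the readiness-based check, assuming a client-go clientset and a label selector that identifies the deployment's pods; readyPodNames is an illustrative helper, not an existing function in this repository:

```go
package collector

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// readyPodNames lists pods matching labelSelector in the given namespace and
// returns only those whose Ready condition is True, i.e. pods that have passed
// their readiness probes and can serve traffic.
func readyPodNames(ctx context.Context, client kubernetes.Interface, namespace, labelSelector string) (map[string]struct{}, error) {
	pods, err := client.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{LabelSelector: labelSelector})
	if err != nil {
		return nil, err
	}
	ready := make(map[string]struct{}, len(pods.Items))
	for _, pod := range pods.Items {
		// Skip pods that are terminating or not running.
		if pod.DeletionTimestamp != nil || pod.Status.Phase != corev1.PodRunning {
			continue
		}
		for _, cond := range pod.Status.Conditions {
			if cond.Type == corev1.PodReady && cond.Status == corev1.ConditionTrue {
				ready[pod.Name] = struct{}{}
				break
			}
		}
	}
	return ready, nil
}
```

Because the Ready condition only becomes True once all container readiness probes succeed (such as vLLM's /health/ready endpoint), this single check covers both the pod-level and container-level signals mentioned above.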
Benefits
- Real-time pod state instead of scraped metrics with staleness
- Simpler implementation without regex pattern matching
- More reliable during scale-up/scale-down transitions
- Aligns with Kubernetes native patterns for determining pod availability
Files Affected
internal/collector/prometheus/saturation_metrics.go - getExistingPods() and related filtering logic
Additional Context
From a PR #451 review comment by @asm582 noting that the pod filtering logic appears fragile.