Unified metrics collection and querying for vLLM and GPU workloads across multiple OpenShift clusters
Data Flow:
```
CLUSTER A → Prometheus → Receiver ┐
CLUSTER B → Prometheus → Receiver ├─→ S3 (flat structure, cluster label in meta.json)
CLUSTER C → Prometheus → Receiver ┘                 ↓
                                          Central Store Gateway
                                                    ↓
                                          Thanos Querier → Grafana
                                        (filter by cluster label)
```
- ✅ Multi-cluster support: Deploy ingestion plane on any number of clusters
- ✅ Centralized querying: Single Grafana dashboard for all clusters
- ✅ Long-term retention: S3-backed object storage (currently 90 days, configurable)
- ✅ Flat S3 structure: All clusters write to shared bucket, identified by cluster labels in meta.json
- ✅ Flexible filtering: Filter by `cluster`, `deployment_uuid`, or `pod_name`
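To make the cross-cluster filtering concrete, here is a minimal sketch. It assumes the Thanos Querier in the `psap-obs` namespace is exposed through a Service named `thanos-querier` (check the central-query-plane manifests for the actual name) and that DCGM publishes GPU utilization under its default metric name:

```bash
# Minimal sketch: ask the central Thanos Querier for one cluster's GPU utilization.
# Assumes a Service named "thanos-querier" in psap-obs; adjust to your manifests.
oc port-forward -n psap-obs svc/thanos-querier 9090:9090 &
sleep 2

# The dashboards use cluster=~".*" to span all clusters; here we narrow to one.
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=avg by (cluster) (DCGM_FI_DEV_GPU_UTIL{cluster="psap-8xh200-2"})'
```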
Comprehensive Observability for vLLM Inference & GPU Hardware:
vLLM Inference Metrics:
- Request latency breakdown (E2E, TTFT, inter-token, prefill, decode)
- Throughput & token generation rates
- KV cache utilization & prefix cache hit rates
- Request queue depth & concurrency levels
- Success/failure rates by model and deployment
GPU Hardware Metrics (DCGM):
- GPU utilization, temperature, power consumption
- Memory usage & clock frequencies
- NVLink bandwidth (critical for multi-GPU inference)
- SM activity and pipe utilization (Tensor Core, FP16)
- PCIe throughput
72+ panels covering comprehensive observability across inference and hardware layers.
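The panels are built from ordinary PromQL over these exported series. As a rough sketch of the kind of queries involved (metric names vary between vLLM and DCGM exporter releases, so treat the names below as assumptions to verify against your deployment):

```bash
# Sketch only: representative panel-style queries, assuming the Thanos Querier is
# reachable at localhost:9090 (e.g. via the port-forward shown earlier).

# vLLM: p95 time-to-first-token per model over 5-minute windows
TTFT_P95='histogram_quantile(0.95, sum by (le, model_name) (rate(vllm:time_to_first_token_seconds_bucket[5m])))'

# DCGM: framebuffer memory used (MiB) per GPU on one cluster
GPU_MEM='DCGM_FI_DEV_FB_USED{cluster="psap-8xh200-2"}'

for Q in "$TTFT_P95" "$GPU_MEM"; do
  curl -sG 'http://localhost:9090/api/v1/query' --data-urlencode "query=${Q}"
  echo
done
```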
This platform is specifically designed for historical analysis and post-incident investigation:
Data Retention & Resolution:
- 15-second resolution metrics in S3 object storage
- Query any point in time with precision down to 15 seconds
- No data loss during pod restarts - WAL replay ensures continuity
- Metrics are scraped every 15 seconds by Prometheus
- Data is uploaded to S3 every 2 hours
- Expect a ~2-hour lag before newly scraped metrics are visible in Grafana dashboards
- This is due to TSDB compaction cycles and S3 sync intervals
- For real-time monitoring (< 2 hours), query Prometheus directly on each cluster
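For that real-time path, a hedged sketch is shown below. It assumes the default OpenShift monitoring stack (the `thanos-querier` route in `openshift-monitoring`), that your token can read the workload namespace's metrics, and that the vLLM counter name matches your vLLM version:

```bash
# Sketch: query the on-cluster monitoring stack directly for data newer than the
# ~2-hour S3 upload lag. Assumes OpenShift's built-in thanos-querier route.
HOST=$(oc get route thanos-querier -n openshift-monitoring -o jsonpath='{.spec.host}')
TOKEN=$(oc whoami -t)

# Recent request success rate for the vLLM workload (metric name may differ per vLLM release).
curl -skG "https://${HOST}/api/v1/query" \
  -H "Authorization: Bearer ${TOKEN}" \
  --data-urlencode 'query=sum(rate(vllm:request_success_total[5m]))'
```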
Investigation Capabilities:
- Incident Root Cause Analysis
  - Navigate to the exact time of a performance degradation or failure (accounting for the ~2-hour lag)
  - Correlate vLLM request failures with GPU metrics (memory exhaustion, thermal throttling)
  - Example: "Why did requests time out at 2:15 PM yesterday?" → Check GPU memory saturation, KV cache exhaustion, and request queue buildup at that exact timestamp
- Performance Regression Detection
  - Compare current performance against a baseline from days or weeks ago
  - Identify when latency started degrading or throughput dropped
  - Example: "TTFT was 200ms last week, now it's 800ms" → Time-travel to when the degradation started and investigate GPU utilization and cache hit rate changes
- Cross-Cluster Performance Comparison
  - Compare how the same model performed on H200 vs A100 at specific times
  - Analyze workload distribution patterns across clusters
  - Example: "Did cluster A handle peak load better than cluster B during last Tuesday's traffic spike?"
- Benchmark Validation & Reproducibility
  - Preserve exact metrics from benchmark runs with precise timestamps
  - Reproduce and validate performance claims weeks later
  - Example: "During our 3:00 PM benchmark on Oct 15, H200 achieved X tokens/sec with Y latency" - verifiable with historical data
- Capacity Planning Evidence
  - Review historical utilization patterns to justify scaling decisions
  - Identify peak usage times and resource constraints
  - Example: "GPU memory consistently hit 95% between 2-4 PM daily for the past 2 weeks" → Need more GPU memory or better KV cache management
- Anomaly Investigation
  - Time-travel to unexpected behavior (latency spikes, OOM errors, cache misses)
  - Correlate application-level metrics (vLLM) with infrastructure metrics (GPU/DCGM)
  - Example: "Why did the prefix cache hit rate drop to 0% at 11:23 AM?" → Check whether the pod was restarted, clearing the cache
How to Use for Investigations:
- Set the Grafana time range to the incident window (e.g., "2025-10-28 14:00 to 15:00")
- Filter by a specific deployment using the `deployment_uuid` or `deployment_pod_name` variables
- Zoom into the exact moment of the issue (down to 15-second precision)
- Correlate across panels:
  - Request latency spiking? → Check GPU utilization, KV cache usage
  - GPU temperature high? → Check power consumption, clock throttling
  - Requests failing? → Check queue depth, memory exhaustion
- Export data or share a timestamped dashboard snapshot with the team
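Since Grafana dashboard URLs accept `from`/`to` and `var-<name>` query parameters, an investigation window can be pinned and shared as a plain link. A sketch using the H200 dashboard path listed under Access Dashboards below; the timestamps and deployment UUID are placeholders, and the `date` invocation assumes GNU date:

```bash
# Sketch: build a shareable dashboard link pinned to an incident window.
# from/to take epoch milliseconds; var-<name> pre-sets dashboard variables.
GRAFANA="https://grafana-psap-obs.apps.ocp4.intlab.redhat.com"
DASH="/d/6475e6106c33fe/vllm-2b-dcgm-metrics-psap-8xh200-2"
FROM=$(date -d '2025-10-28 14:00' +%s%3N)   # placeholder incident start (GNU date)
TO=$(date -d '2025-10-28 15:00' +%s%3N)     # placeholder incident end

echo "${GRAFANA}${DASH}?from=${FROM}&to=${TO}&var-deployment_uuid=<your-deployment-uuid>"
```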
Real-World Investigation Example:
A model served on an H200 had intermittent 5-second latency spikes. Using this platform, we navigated back to that exact period, two days after the fact, and discovered:
- GPU memory usage hit 98% capacity at the same timestamps as the latency spikes
- KV cache evictions occurred, causing decode slowdowns
- No hardware throttling or temperature issues
- Conclusion: Model needed larger KV cache allocation, not more GPUs
This eliminated guesswork and prevented unnecessary hardware scaling.
Clusters Currently Configured:
- psap-8xh200-2 (8x NVIDIA H200 GPUs, IBM Cloud)
  - Scraping every 15 seconds
  - DCGM + vLLM metrics
  - Compaction every 30 minutes
- mehulvalidation-wp4vb (A100 GPUs, Azure)
  - Scraping every 15 seconds
  - DCGM + vLLM metrics
  - Compaction every 30 minutes
Access Dashboards:
- Grafana URL: https://grafana-psap-obs.apps.ocp4.intlab.redhat.com
- H200 Dashboard: /d/6475e6106c33fe/vllm-2b-dcgm-metrics-psap-8xh200-2
- A100 Dashboard: /d/07574733404e7f/vllm-2b-dcgm-metrics-mehulvalidation-wp4vb
Designed for seamless horizontal scaling:
- ✅ Adding new clusters: Deploy Thanos Receiver + Prometheus config (~15 minutes per cluster)
- ✅ Flat S3 architecture: All clusters write to shared S3 bucket with cluster labels
- ✅ Cross-cluster queries: Single dashboard queries all clusters using the `cluster=~".*"` wildcard
- ✅ No central bottleneck: Each cluster has an independent ingestion pipeline
- ✅ Tested capacity: Supports 5+ clusters; can scale to 10+ with Store Gateway sharding
When adding new clusters (e.g., more H200s, H100s, other GPU platforms):
- Deploy Thanos Receiver to new cluster (15 min)
- Configure Prometheus remote_write (5 min)
- Metrics automatically appear in existing Grafana dashboards (after the ~2-hour lag)
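Condensed into commands, using the same scripts as in the per-cluster deployment steps later in this README (the `CLUSTER_NAME` value here is a placeholder):

```bash
# Sketch: onboard a new cluster while logged in to that cluster.
cd cluster-ingestion-plane
export CLUSTER_NAME=new-h100-cluster   # placeholder; must be unique per cluster
export NAMESPACE=kserve-e2e-perf
./scripts/deploy.sh                    # Receiver + remote_write config + ServiceMonitor
./scripts/validate.sh                  # confirm metrics flow before waiting on the S3 lag
```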
This platform enables:
- Point-in-time incident investigation - Root cause analysis with 15-second precision
- Performance regression detection - Compare current vs historical baselines across weeks
- Cross-cluster benchmarking - H200 vs A100 performance comparison at specific timestamps
- Capacity planning - Evidence-based scaling decisions from historical usage patterns
- Model optimization - Identify bottlenecks (KV cache, GPU memory, compute) over time
- Cost optimization - Eliminate underutilized resources based on actual usage data
- SLA validation - Prove performance commitments with timestamped historical data
- Post-mortem analysis - Detailed investigation without "I wish we had metrics for that moment"
On your central/management cluster:
```bash
cd central-query-plane
export NAMESPACE=psap-obs
./scripts/deploy.sh
```

Components deployed:
- Thanos Store Gateway (reads S3)
- Thanos Querier (unified query interface)
- Grafana (visualization)
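A quick sanity check after the script finishes might look like this (a sketch; resource names come from the central-query-plane manifests):

```bash
# Sketch: confirm the central query plane came up and grab the Grafana URL.
oc get pods -n psap-obs
oc get route grafana -n psap-obs -o jsonpath='{.spec.host}'; echo
```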
On each cluster where you run vLLM workloads:
```bash
cd cluster-ingestion-plane

# Set cluster-specific name
export CLUSTER_NAME=ocp4-intlab   # Change per cluster!
export NAMESPACE=kserve-e2e-perf

# Deploy
./scripts/deploy.sh

# Validate
./scripts/validate.sh
```

Components deployed:
- Thanos Receiver (accepts remote_write, uploads to S3)
- Prometheus remote_write configuration (with cluster label)
- ServiceMonitor (scrapes vLLM + DCGM metrics)
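Since the remote_write settings are applied through the user-workload monitoring ConfigMap (per 01-user-workload-monitoring-config.yaml), one hedged way to confirm they landed, assuming the standard OpenShift user-workload monitoring object names:

```bash
# Sketch: verify remote_write (and the cluster external label) reached the
# user-workload monitoring configuration after deploy.sh ran.
oc get configmap user-workload-monitoring-config \
  -n openshift-user-workload-monitoring -o yaml | grep -A5 -i remotewrite
```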
Access Grafana:

```bash
# Get URL
oc get route grafana -n psap-obs -o jsonpath='{.spec.host}'

# Login: admin / admin
```

Use the cluster filter dropdown to query specific clusters or all clusters.
```
metrics-platform/
├── README.md                              # This file
├── cluster-ingestion-plane/               # Deploy per cluster
│   ├── README.md
│   ├── s3-config/
│   │   └── 01-s3-secret.yaml
│   ├── receiver/
│   │   └── 01-deployment.yaml             # With ${CLUSTER_NAME}
│   ├── prometheus-config/
│   │   └── 01-user-workload-monitoring-config.yaml
│   ├── servicemonitor/
│   │   └── 01-vllm-servicemonitor.yaml
│   └── scripts/
│       ├── deploy.sh                      # Automated deployment
│       └── validate.sh                    # Validation
└── central-query-plane/                   # Deploy once
    ├── README.md
    ├── store-gateway/
    │   ├── 01-s3-secret.yaml
    │   └── 02-deployment.yaml
    ├── querier/
    │   └── 01-deployment.yaml
    ├── grafana/
    │   ├── 01-deployment.yaml
    │   └── 02-add-cluster-variable.py     # Dashboard tool
    └── scripts/
        └── deploy.sh
```
Each cluster must have a unique cluster name:
```bash
export CLUSTER_NAME=ocp4-intlab          # Cluster A
export CLUSTER_NAME=mehul-validation     # Cluster B
export CLUSTER_NAME=production-us-east   # Cluster C
```

This name becomes:
- A label on all metrics: `cluster="ocp4-intlab"`
- A label in each block's `meta.json` file in S3
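Once the first blocks land in S3, every series carries that label, so the central Querier can be asked about a single cluster directly. A minimal sketch, assuming the Querier is reachable at localhost:9090 as in the earlier examples:

```bash
# Sketch: count which scrape targets report under a given CLUSTER_NAME.
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=count by (cluster) (up{cluster="ocp4-intlab"})'
```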
Flat Structure - All clusters write to the same S3 bucket root:
```
s3://model-furnace-metrics/
├── 01K8KR0YWR3DTHM9239J3AXQ2Y/
│   ├── meta.json          # Contains: {"cluster": "psap-8xh200-2", ...}
│   ├── chunks/
│   └── index
├── 01K8M3SABCFSTXXFQX0SWF1J9Y/
│   ├── meta.json          # Contains: {"cluster": "mehulvalidation-wp4vb", ...}
│   ├── chunks/
│   └── index
└── ...
```
Cluster identification:
- No separate directories per cluster
- Cluster name stored as a label in each block's `meta.json` file
- Thanos Querier filters blocks by the `cluster` label for queries
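One way to confirm this directly, assuming the aws CLI and jq are available with credentials for the bucket (the block ULID below is the example shown above):

```bash
# Sketch: inspect a block's external labels straight from S3.
aws s3 cp s3://model-furnace-metrics/01K8KR0YWR3DTHM9239J3AXQ2Y/meta.json - | jq '.thanos.labels'
# Expected to include: {"cluster": "psap-8xh200-2", ...}
```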
Grafana dashboard supports:
- `cluster_name`: Multi-select (All / specific clusters)
- `deployment_uuid`: Filter by deployment
- `pod_name`: Filter by pod

To add a new cluster to the dashboards:
- Deploy cluster-ingestion-plane with a new `CLUSTER_NAME`
- Wait 2-3 hours for the first S3 upload
- The new cluster appears in the Grafana dropdown automatically
```bash
# Check ingestion on cluster
cd cluster-ingestion-plane
export CLUSTER_NAME=your-cluster
./scripts/validate.sh

# Check central query plane
QUERIER_POD=$(oc get pods -n psap-obs -l app=thanos-querier -o jsonpath='{.items[0].metadata.name}')
oc exec -n psap-obs $QUERIER_POD -- wget -q -O- 'http://localhost:9090/api/v1/label/cluster/values'
```

See detailed troubleshooting in:
- cluster-ingestion-plane/README.md
- central-query-plane/README.md
Per-cluster ingestion plane:

| Component | CPU | Memory | Storage |
|---|---|---|---|
| Receiver | 1-2 cores | 4-8Gi | 50Gi PVC |

Central query plane:

| Component | CPU | Memory | Storage |
|---|---|---|---|
| Store Gateway | 2-4 cores | 16-32Gi | 50Gi PVC |
| Querier | 2-4 cores | 8-16Gi | N/A |
| Grafana | 250m-500m | 512Mi-1Gi | N/A |
Example for 3 clusters (moderate workload):
- Per cluster: ~20 GB/month compressed
- Total: ~60 GB/month
- S3 cost: ~$1.40/month
- After 1 year: ~$10/month (with lifecycle policies)
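The arithmetic behind the monthly figure, treating the S3 Standard rate (~$0.023/GB-month) as an assumption that varies by region and storage class:

```bash
# Back-of-the-envelope check; the per-GB rate is an assumption, not a quote.
CLUSTERS=3; GB_PER_CLUSTER=20; RATE=0.023
echo "$CLUSTERS * $GB_PER_CLUSTER * $RATE" | bc -l   # ≈ 1.38 USD/month for newly written blocks
```

The longer-run figure depends on the retention and lifecycle policies applied to the bucket.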
See main README for detailed cost breakdown.
- Cluster Ingestion Plane README - Detailed per-cluster deployment
- Central Query Plane README - Detailed centralized query setup
- Original Single-Cluster Guide - Legacy documentation
For issues:
- Run the validation scripts (`validate.sh`)
- Check component logs
- Review the detailed READMEs for each plane
- Check the Thanos documentation: https://thanos.io
Version: 2.0.0-multi-cluster · Last Updated: 2025-10-28
