Multi-Cluster Metrics Platform

Unified metrics collection and querying for vLLM and GPU workloads across multiple OpenShift clusters


Architecture

Architecture Diagram

Data Flow:

CLUSTER A → Prometheus → Receiver ┐
CLUSTER B → Prometheus → Receiver ├─→ S3 (flat structure, cluster label in meta.json)
CLUSTER C → Prometheus → Receiver ┘         ↓
                                     Central Store Gateway
                                             ↓
                                       Thanos Querier → Grafana
                                                     (filter by cluster label)

Key Features

  • Multi-cluster support: Deploy ingestion plane on any number of clusters
  • Centralized querying: Single Grafana dashboard for all clusters
  • Long-term retention: S3-backed storage (90-day default, configurable)
  • Flat S3 structure: All clusters write to shared bucket, identified by cluster labels in meta.json
  • Flexible filtering: Filter by cluster, deployment_uuid, pod_name

Platform Overview & Use Cases

What This Platform Provides

Comprehensive Observability for vLLM Inference & GPU Hardware:

vLLM Inference Metrics:

  • Request latency breakdown (E2E, TTFT, inter-token, prefill, decode)
  • Throughput & token generation rates
  • KV cache utilization & prefix cache hit rates
  • Request queue depth & concurrency levels
  • Success/failure rates by model and deployment

GPU Hardware Metrics (DCGM):

  • GPU utilization, temperature, power consumption
  • Memory usage & clock frequencies
  • NVLink bandwidth (critical for multi-GPU inference)
  • SM activity and pipe utilization (FP16, tensor cores)
  • PCIe throughput

72+ dashboard panels cover the inference and hardware layers.

Point-in-Time Investigation & Historical Analysis

This platform is specifically designed for historical analysis and post-incident investigation:

Data Retention & Resolution:

  • 15-second resolution metrics in S3 object storage
  • Query any point in time with precision down to 15 seconds
  • No data loss during pod restarts - WAL replay ensures continuity

⚠️ Important: Data Availability Lag

  • Metrics are scraped every 15 seconds by Prometheus
  • Data is uploaded to S3 every 2 hours
  • Expect a ~2-hour lag before newly scraped metrics are visible in Grafana dashboards
  • This is due to TSDB compaction cycles and S3 sync intervals
  • For real-time monitoring (< 2 hours), query Prometheus directly on each cluster (see the sketch below)
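
A minimal sketch of that direct query, assuming the standard OpenShift user-workload-monitoring stack (the service name and the example metric name may differ per cluster):

# Port-forward the cluster-local Prometheus
oc port-forward -n openshift-user-workload-monitoring svc/prometheus-user-workload 9090:9090 &

# Query a vLLM gauge at full 15-second freshness, before it reaches S3
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=vllm:num_requests_running' | jq .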

Investigation Capabilities:

  1. Incident Root Cause Analysis

    • Navigate to exact time of performance degradation or failure (accounting for 2-hour lag)
    • Correlate vLLM request failures with GPU metrics (memory exhaustion, thermal throttling)
    • Example: "Why did requests timeout at 2:15 PM yesterday?" → Check GPU memory saturation, KV cache exhaustion, request queue buildup at that exact timestamp
  2. Performance Regression Detection

    • Compare current performance against baseline from days/weeks ago
    • Identify when latency started degrading or throughput dropped
    • Example: "TTFT was 200ms last week, now it's 800ms" → Time-travel to when degradation started, investigate GPU utilization, cache hit rate changes
  3. Cross-Cluster Performance Comparison

    • Compare how the same model performed on H200 vs A100 at specific times
    • Analyze workload distribution patterns across clusters
    • Example: "Did cluster A handle peak load better than cluster B during last Tuesday's traffic spike?"
  4. Benchmark Validation & Reproducibility

    • Preserve exact metrics from benchmark runs with precise timestamps
    • Reproduce and validate performance claims weeks later
    • Example: "During our 3:00 PM benchmark on Oct 15, H200 achieved X tokens/sec with Y latency" - verifiable with historical data
  5. Capacity Planning Evidence

    • Review historical utilization patterns to justify scaling decisions
    • Identify peak usage times and resource constraints
    • Example: "GPU memory consistently hit 95% between 2-4 PM daily for past 2 weeks" → Need more GPU memory or better KV cache management
  6. Anomaly Investigation

    • Time-travel to unexpected behavior (latency spikes, OOM errors, cache misses)
    • Correlate application-level metrics (vLLM) with infrastructure metrics (GPU/DCGM)
    • Example: "Why did prefix cache hit rate drop to 0% at 11:23 AM?" → Check if pod was restarted, clearing cache

How to Use for Investigations (an equivalent direct API query is sketched after the steps):

  1. Set Grafana time range to incident window (e.g., "2025-10-28 14:00 to 15:00")
  2. Filter by specific deployment using deployment_uuid or deployment_pod_name variables
  3. Zoom into exact moment of issue (down to 15-second precision)
  4. Correlate across panels:
    • Request latency spiking? → Check GPU utilization, KV cache usage
    • GPU temperature high? → Check power consumption, clock throttling
    • Requests failing? → Check queue depth, memory exhaustion
  5. Export data or share timestamped dashboard snapshot with team
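
Steps 1-3 map to a single range query against the Thanos Querier; a sketch using the HTTP API directly (the service name and the vLLM histogram name are assumptions, and <uuid> is a placeholder for a real deployment_uuid):

# Port-forward the Thanos Querier on the central cluster
oc port-forward -n psap-obs svc/thanos-querier 9090:9090 &

# p99 end-to-end latency for one deployment during the incident window, at 15s steps
curl -sG 'http://localhost:9090/api/v1/query_range' \
  --data-urlencode 'query=histogram_quantile(0.99, sum by (le) (rate(vllm:e2e_request_latency_seconds_bucket{cluster="psap-8xh200-2", deployment_uuid="<uuid>"}[5m])))' \
  --data-urlencode 'start=2025-10-28T14:00:00Z' \
  --data-urlencode 'end=2025-10-28T15:00:00Z' \
  --data-urlencode 'step=15s' | jq .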

Real-World Investigation Example:

A model served on H200 had intermittent 5-second latency spikes. Using this platform, we navigated to that exact window two days after the fact and discovered:

  • GPU memory usage hit 98% capacity at the same timestamps as the latency spikes
  • KV cache evictions occurred, causing decode slowdowns
  • No hardware throttling or temperature issues
  • Conclusion: Model needed larger KV cache allocation, not more GPUs

This eliminated guesswork and prevented unnecessary hardware scaling.

Current Deployment Status

Clusters Currently Configured:

  1. psap-8xh200-2 (8x NVIDIA H200 GPUs, IBM Cloud)

    • Scraping every 15 seconds
    • DCGM + vLLM metrics
    • Compaction every 30 minutes
  2. mehulvalidation-wp4vb (A100 GPUs, Azure)

    • Scraping every 15 seconds
    • DCGM + vLLM metrics
    • Compaction every 30 minutes

Access Dashboards:

  • Grafana URL: https://grafana-psap-obs.apps.ocp4.intlab.redhat.com
  • H200 Dashboard: /d/6475e6106c33fe/vllm-2b-dcgm-metrics-psap-8xh200-2
  • A100 Dashboard: /d/07574733404e7f/vllm-2b-dcgm-metrics-mehulvalidation-wp4vb

Scalability

Designed for seamless horizontal scaling:

  • Adding new clusters: Deploy Thanos Receiver + Prometheus config (~15 minutes per cluster)
  • Flat S3 architecture: All clusters write to shared S3 bucket with cluster labels
  • Cross-cluster queries: Single dashboard queries all clusters using the cluster=~".*" wildcard (example after this list)
  • No central bottleneck: Each cluster has independent ingestion pipeline
  • Tested capacity: Supports 5+ clusters; can scale to 10+ with Store Gateway sharding
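
For example, one query can compare GPU utilization across every cluster at once (DCGM_FI_DEV_GPU_UTIL is the standard dcgm-exporter metric; a Querier port-forward on localhost:9090, as shown earlier, is assumed):

# Average GPU utilization per cluster, across all clusters
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=avg by (cluster) (DCGM_FI_DEV_GPU_UTIL{cluster=~".*"})' | jq .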

When adding new clusters (e.g., more H200s, H100s, other GPU platforms):

  1. Deploy Thanos Receiver to new cluster (15 min)
  2. Configure Prometheus remote_write (5 min; sketched after these steps)
  3. Metrics automatically appear in existing Grafana dashboards (after ~2 hour lag)
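
A minimal sketch of step 2, assuming the standard OpenShift user-workload-monitoring ConfigMap schema; the Receiver service URL is illustrative, and the repo's prometheus-config/01-user-workload-monitoring-config.yaml is the source of truth:

cat <<'EOF' | oc apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
    prometheus:
      remoteWrite:
        - url: http://thanos-receiver.kserve-e2e-perf.svc:19291/api/v1/receive
EOF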

Key Use Cases

This platform enables:

  • Point-in-time incident investigation - Root cause analysis with 15-second precision
  • Performance regression detection - Compare current vs historical baselines across weeks
  • Cross-cluster benchmarking - H200 vs A100 performance comparison at specific timestamps
  • Capacity planning - Evidence-based scaling decisions from historical usage patterns
  • Model optimization - Identify bottlenecks (KV cache, GPU memory, compute) over time
  • Cost optimization - Eliminate underutilized resources based on actual usage data
  • SLA validation - Prove performance commitments with timestamped historical data
  • Post-mortem analysis - Detailed investigation without "I wish we had metrics for that moment"

Quick Start

1. Deploy Central Query Plane (Once)

On your central/management cluster:

cd central-query-plane
export NAMESPACE=psap-obs
./scripts/deploy.sh

Components deployed:

  • Thanos Store Gateway (reads S3)
  • Thanos Querier (unified query interface)
  • Grafana (visualization)
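
To confirm the rollout (exact pod names depend on the manifests, but all three components should reach Running):

oc get pods -n psap-obs
# Expect store gateway, querier, and grafana pods in Running state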

2. Deploy Cluster Ingestion Plane (Per Cluster)

On each cluster where you run vLLM workloads:

cd cluster-ingestion-plane

# Set cluster-specific name
export CLUSTER_NAME=ocp4-intlab  # Change per cluster!
export NAMESPACE=kserve-e2e-perf

# Deploy
./scripts/deploy.sh

# Validate
./scripts/validate.sh

Components deployed:

  • Thanos Receiver (accepts remote_write, uploads to S3)
  • Prometheus remote_write configuration (with cluster label)
  • ServiceMonitor (scrapes vLLM + DCGM metrics)
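
Beyond validate.sh, you can watch the Receiver confirm S3 uploads in its logs (the app=thanos-receiver label is an assumption; adjust to your manifests):

# Tail the Receiver for block upload activity
RECEIVER_POD=$(oc get pods -n $NAMESPACE -l app=thanos-receiver -o jsonpath='{.items[0].metadata.name}')
oc logs -n $NAMESPACE $RECEIVER_POD | grep -i upload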

3. Access Grafana

# Get URL
oc get route grafana -n psap-obs -o jsonpath='{.spec.host}'

# Login: admin / admin

Use cluster filter dropdown to query specific clusters or all clusters.


Directory Structure

metrics-platform/
├── README.md                          # This file
├── cluster-ingestion-plane/           # Deploy per cluster
│   ├── README.md
│   ├── s3-config/
│   │   └── 01-s3-secret.yaml
│   ├── receiver/
│   │   └── 01-deployment.yaml         # With ${CLUSTER_NAME}
│   ├── prometheus-config/
│   │   └── 01-user-workload-monitoring-config.yaml
│   ├── servicemonitor/
│   │   └── 01-vllm-servicemonitor.yaml
│   └── scripts/
│       ├── deploy.sh                  # Automated deployment
│       └── validate.sh                # Validation
└── central-query-plane/               # Deploy once
    ├── README.md
    ├── store-gateway/
    │   ├── 01-s3-secret.yaml
    │   └── 02-deployment.yaml
    ├── querier/
    │   └── 01-deployment.yaml
    ├── grafana/
    │   ├── 01-deployment.yaml
    │   └── 02-add-cluster-variable.py  # Dashboard tool
    └── scripts/
        └── deploy.sh

Configuration

Cluster Name

Each cluster must have a unique cluster name:

export CLUSTER_NAME=ocp4-intlab          # Cluster A
export CLUSTER_NAME=mehul-validation     # Cluster B
export CLUSTER_NAME=production-us-east   # Cluster C

This name becomes:

  1. A label on all metrics: cluster="ocp4-intlab"
  2. A label in each block's meta.json file in S3
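
Inside the Receiver this is typically wired through Thanos Receive's --label flag; a sketch of the relevant arguments (receiver/01-deployment.yaml templates ${CLUSTER_NAME} and is the source of truth):

# External label attached to every block this Receiver uploads
thanos receive \
  --label 'cluster="ocp4-intlab"' \
  --objstore.config-file=/etc/thanos/s3.yaml \
  --remote-write.address=0.0.0.0:19291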

S3 Organization

Flat Structure - All clusters write to the same S3 bucket root:

s3://model-furnace-metrics/
├── 01K8KR0YWR3DTHM9239J3AXQ2Y/
│   ├── meta.json              # Contains: {"cluster": "psap-8xh200-2", ...}
│   ├── chunks/
│   └── index
├── 01K8M3SABCFSTXXFQX0SWF1J9Y/
│   ├── meta.json              # Contains: {"cluster": "mehulvalidation-wp4vb", ...}
│   ├── chunks/
│   └── index
└── ...

Cluster identification:

  • No separate directories per cluster
  • Cluster name stored as label in each block's meta.json file
  • Thanos Querier filters blocks by cluster label for queries
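
You can verify the label on any block directly; a sketch assuming aws CLI access to the bucket (block ID taken from the listing above; Thanos stores external labels under .thanos.labels in meta.json):

# Inspect one block's external labels
aws s3 cp s3://model-furnace-metrics/01K8KR0YWR3DTHM9239J3AXQ2Y/meta.json - | jq '.thanos.labels'
# e.g. {"cluster": "psap-8xh200-2"}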

Dashboard Variables

Grafana dashboard supports:

  • cluster_name: Multi-select (All / specific clusters)
  • deployment_uuid: Filter by deployment
  • pod_name: Filter by pod
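
These dropdowns are populated from label values; the same data Grafana's variable queries draw on is visible via the Querier API (a Querier port-forward on localhost:9090 is assumed):

# Values backing the deployment_uuid dropdown
curl -s 'http://localhost:9090/api/v1/label/deployment_uuid/values' | jq .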

Operational Guide

Adding a New Cluster

  1. Deploy cluster-ingestion-plane with new CLUSTER_NAME
  2. Wait 2-3 hours for first S3 upload
  3. New cluster appears in Grafana dropdown automatically

Monitoring Health

# Check ingestion on cluster
cd cluster-ingestion-plane
export CLUSTER_NAME=your-cluster
./scripts/validate.sh

# Check central query plane
QUERIER_POD=$(oc get pods -n psap-obs -l app=thanos-querier -o jsonpath='{.items[0].metadata.name}')
oc exec -n psap-obs $QUERIER_POD -- wget -q -O- 'http://localhost:9090/api/v1/label/cluster/values'
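
As a follow-up, the same pod can confirm that every cluster is still reporting series (count of up series per cluster; %20 encodes the spaces in the PromQL):

# One entry per cluster in the response means ingestion is healthy
oc exec -n psap-obs $QUERIER_POD -- wget -q -O- \
  'http://localhost:9090/api/v1/query?query=count%20by%20(cluster)%20(up)'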

Troubleshooting

See detailed troubleshooting in:

  • cluster-ingestion-plane/README.md
  • central-query-plane/README.md

Resource Requirements

Per Cluster (Ingestion)

Component   CPU         Memory   Storage
Receiver    1-2 cores   4-8Gi    50Gi PVC

Central (Query)

Component       CPU         Memory      Storage
Store Gateway   2-4 cores   16-32Gi     50Gi PVC
Querier         2-4 cores   8-16Gi      N/A
Grafana         250m-500m   512Mi-1Gi   N/A

Cost Analysis

Example for 3 clusters (moderate workload):

  • Per cluster: ~20 GB/month compressed
  • Total: ~60 GB/month
  • S3 cost: ~$1.40/month
  • After 1 year: ~$10/month (with lifecycle policies)
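
The monthly figure is straightforward arithmetic, assuming S3 Standard pricing of roughly $0.023/GB-month (rates vary by region and storage class):

# 3 clusters x ~20 GB/month = ~60 GB of new data per month
echo '60 * 0.023' | bc   # 1.380 → the ~$1.40/month above

Lifecycle policies that move older blocks to cheaper tiers are what keep the year-one steady state near the ~$10/month quoted above.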

See main README for detailed cost breakdown.


Documentation

  • cluster-ingestion-plane/README.md - ingestion plane deployment and troubleshooting
  • central-query-plane/README.md - query plane deployment and troubleshooting

Support

For issues:

  1. Run validation scripts (validate.sh)
  2. Check component logs
  3. Review detailed READMEs for each plane
  4. Check Thanos documentation: https://thanos.io

Version: 2.0.0-multi-cluster
Last Updated: 2025-10-28
