Multi-Cluster Metrics Platform

Unified metrics collection and querying for vLLM and GPU workloads across multiple OpenShift clusters


Architecture

Architecture Diagram

Data Flow:

CLUSTER A → Prometheus → Receiver ┐
CLUSTER B → Prometheus → Receiver ├─→ S3 (flat structure, cluster label in meta.json)
CLUSTER C → Prometheus → Receiver ┘         ↓
                                     Central Store Gateway
                                             ↓
                                       Thanos Querier → Grafana
                                                     (filter by cluster label)

Key Features

  • Multi-cluster support: Deploy ingestion plane on any number of clusters
  • Centralized querying: Single Grafana dashboard for all clusters
  • Long-term retention: S3-backed storage (90-day default, configurable)
  • Flat S3 structure: All clusters write to shared bucket, identified by cluster labels in meta.json
  • Flexible filtering: Filter by cluster, deployment_uuid, pod_name

Platform Overview & Use Cases

What This Platform Provides

Comprehensive Observability for vLLM Inference & GPU Hardware:

vLLM Inference Metrics:

  • Request latency breakdown (E2E, TTFT, inter-token, prefill, decode)
  • Throughput & token generation rates
  • KV cache utilization & prefix cache hit rates
  • Request queue depth & concurrency levels
  • Success/failure rates by model and deployment

GPU Hardware Metrics (DCGM):

  • GPU utilization, temperature, power consumption
  • Memory usage & clock frequencies
  • NVLink bandwidth (critical for multi-GPU inference)
  • SM activity and pipe utilization (FP16, tensor cores)
  • PCIe throughput

72+ dashboard panels cover the inference and hardware layers.

Point-in-Time Investigation & Historical Analysis

This platform is specifically designed for historical analysis and post-incident investigation:

Data Retention & Resolution:

  • 15-second resolution metrics in S3 object storage
  • Query any point in time with precision down to 15 seconds
  • No data loss during pod restarts - WAL replay ensures continuity

⚠️ Important: Data Availability Lag

  • Metrics are scraped every 15 seconds by Prometheus
  • Data is uploaded to S3 every 2 hours
  • Expect a ~2-hour lag before newly scraped metrics are visible in Grafana dashboards
  • This is due to TSDB compaction cycles and S3 sync intervals
  • For real-time monitoring (< 2 hours), query Prometheus directly on each cluster (see the sketch below)
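
A minimal sketch of that direct query, assuming the standard OpenShift user-workload-monitoring stack (the service name and the example metric name may differ per cluster):

# Port-forward the cluster-local Prometheus
oc port-forward -n openshift-user-workload-monitoring svc/prometheus-user-workload 9090:9090 &

# Query a vLLM gauge at full 15-second freshness, before it reaches S3
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=vllm:num_requests_running' | jq .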

Investigation Capabilities:

  1. Incident Root Cause Analysis

    • Navigate to exact time of performance degradation or failure (accounting for 2-hour lag)
    • Correlate vLLM request failures with GPU metrics (memory exhaustion, thermal throttling)
    • Example: "Why did requests timeout at 2:15 PM yesterday?" → Check GPU memory saturation, KV cache exhaustion, request queue buildup at that exact timestamp
  2. Performance Regression Detection

    • Compare current performance against baseline from days/weeks ago
    • Identify when latency started degrading or throughput dropped
    • Example: "TTFT was 200ms last week, now it's 800ms" → Time-travel to when degradation started, investigate GPU utilization, cache hit rate changes
  3. Cross-Cluster Performance Comparison

    • Compare how the same model performed on H200 vs A100 at specific times
    • Analyze workload distribution patterns across clusters
    • Example: "Did cluster A handle peak load better than cluster B during last Tuesday's traffic spike?"
  4. Benchmark Validation & Reproducibility

    • Preserve exact metrics from benchmark runs with precise timestamps
    • Reproduce and validate performance claims weeks later
    • Example: "During our 3:00 PM benchmark on Oct 15, H200 achieved X tokens/sec with Y latency" - verifiable with historical data
  5. Capacity Planning Evidence

    • Review historical utilization patterns to justify scaling decisions
    • Identify peak usage times and resource constraints
    • Example: "GPU memory consistently hit 95% between 2-4 PM daily for past 2 weeks" → Need more GPU memory or better KV cache management
  6. Anomaly Investigation

    • Time-travel to unexpected behavior (latency spikes, OOM errors, cache misses)
    • Correlate application-level metrics (vLLM) with infrastructure metrics (GPU/DCGM)
    • Example: "Why did prefix cache hit rate drop to 0% at 11:23 AM?" → Check if pod was restarted, clearing cache

How to Use for Investigations (an equivalent direct API query is sketched after the steps):

  1. Set Grafana time range to incident window (e.g., "2025-10-28 14:00 to 15:00")
  2. Filter by specific deployment using deployment_uuid or deployment_pod_name variables
  3. Zoom into exact moment of issue (down to 15-second precision)
  4. Correlate across panels:
    • Request latency spiking? → Check GPU utilization, KV cache usage
    • GPU temperature high? → Check power consumption, clock throttling
    • Requests failing? → Check queue depth, memory exhaustion
  5. Export data or share timestamped dashboard snapshot with team
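
Steps 1-3 map to a single range query against the Thanos Querier; a sketch using the HTTP API directly (the service name and the vLLM histogram name are assumptions, and <uuid> is a placeholder for a real deployment_uuid):

# Port-forward the Thanos Querier on the central cluster
oc port-forward -n psap-obs svc/thanos-querier 9090:9090 &

# p99 end-to-end latency for one deployment during the incident window, at 15s steps
curl -sG 'http://localhost:9090/api/v1/query_range' \
  --data-urlencode 'query=histogram_quantile(0.99, sum by (le) (rate(vllm:e2e_request_latency_seconds_bucket{cluster="psap-8xh200-2", deployment_uuid="<uuid>"}[5m])))' \
  --data-urlencode 'start=2025-10-28T14:00:00Z' \
  --data-urlencode 'end=2025-10-28T15:00:00Z' \
  --data-urlencode 'step=15s' | jq .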

Real-World Investigation Example:

A model served on H200 had intermittent 5-second latency spikes. Using this platform, we navigated to that exact window two days after the fact and discovered:

  • GPU memory usage hit 98% capacity at the same timestamps as the latency spikes
  • KV cache evictions occurred, causing decode slowdowns
  • No hardware throttling or temperature issues
  • Conclusion: Model needed larger KV cache allocation, not more GPUs

This eliminated guesswork and prevented unnecessary hardware scaling.

Current Deployment Status

Clusters Currently Configured:

  1. psap-8xh200-2 (8x NVIDIA H200 GPUs, IBM Cloud)

    • Scraping every 15 seconds
    • DCGM + vLLM metrics
    • Compaction every 30 minutes
  2. mehulvalidation-wp4vb (A100 GPUs, Azure)

    • Scraping every 15 seconds
    • DCGM + vLLM metrics
    • Compaction every 30 minutes

Access Dashboards:

  • Grafana URL: https://grafana-psap-obs.apps.ocp4.intlab.redhat.com
  • H200 Dashboard: /d/6475e6106c33fe/vllm-2b-dcgm-metrics-psap-8xh200-2
  • A100 Dashboard: /d/07574733404e7f/vllm-2b-dcgm-metrics-mehulvalidation-wp4vb

Scalability

Designed for seamless horizontal scaling:

  • Adding new clusters: Deploy Thanos Receiver + Prometheus config (~15 minutes per cluster)
  • Flat S3 architecture: All clusters write to shared S3 bucket with cluster labels
  • Cross-cluster queries: Single dashboard queries all clusters using the cluster=~".*" wildcard (example after this list)
  • No central bottleneck: Each cluster has independent ingestion pipeline
  • Tested capacity: Supports 5+ clusters; can scale to 10+ with Store Gateway sharding
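
For example, one query can compare GPU utilization across every cluster at once (DCGM_FI_DEV_GPU_UTIL is the standard dcgm-exporter metric; a Querier port-forward on localhost:9090, as shown earlier, is assumed):

# Average GPU utilization per cluster, across all clusters
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=avg by (cluster) (DCGM_FI_DEV_GPU_UTIL{cluster=~".*"})' | jq .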

When adding new clusters (e.g., more H200s, H100s, other GPU platforms):

  1. Deploy Thanos Receiver to new cluster (15 min)
  2. Configure Prometheus remote_write (5 min; sketched after these steps)
  3. Metrics automatically appear in existing Grafana dashboards (after ~2 hour lag)
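
A minimal sketch of step 2, assuming the standard OpenShift user-workload-monitoring ConfigMap schema; the Receiver service URL is illustrative, and the repo's prometheus-config/01-user-workload-monitoring-config.yaml is the source of truth:

cat <<'EOF' | oc apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
    prometheus:
      remoteWrite:
        - url: http://thanos-receiver.kserve-e2e-perf.svc:19291/api/v1/receive
EOF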

Key Use Cases

This platform enables:

  • Point-in-time incident investigation - Root cause analysis with 15-second precision
  • Performance regression detection - Compare current vs historical baselines across weeks
  • Cross-cluster benchmarking - H200 vs A100 performance comparison at specific timestamps
  • Capacity planning - Evidence-based scaling decisions from historical usage patterns
  • Model optimization - Identify bottlenecks (KV cache, GPU memory, compute) over time
  • Cost optimization - Eliminate underutilized resources based on actual usage data
  • SLA validation - Prove performance commitments with timestamped historical data
  • Post-mortem analysis - Detailed investigation without "I wish we had metrics for that moment"

Quick Start

1. Deploy Central Query Plane (Once)

On your central/management cluster:

cd central-query-plane
export NAMESPACE=psap-obs
./scripts/deploy.sh

Components deployed:

  • Thanos Store Gateway (reads S3)
  • Thanos Querier (unified query interface)
  • Grafana (visualization)
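
To confirm the rollout (exact pod names depend on the manifests, but all three components should reach Running):

oc get pods -n psap-obs
# Expect store gateway, querier, and grafana pods in Running state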

2. Deploy Cluster Ingestion Plane (Per Cluster)

On each cluster where you run vLLM workloads:

cd cluster-ingestion-plane

# Set cluster-specific name
export CLUSTER_NAME=ocp4-intlab  # Change per cluster!
export NAMESPACE=kserve-e2e-perf

# Deploy
./scripts/deploy.sh

# Validate
./scripts/validate.sh

Components deployed:

  • Thanos Receiver (accepts remote_write, uploads to S3)
  • Prometheus remote_write configuration (with cluster label)
  • ServiceMonitor (scrapes vLLM + DCGM metrics)
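
Beyond validate.sh, you can watch the Receiver confirm S3 uploads in its logs (the app=thanos-receiver label is an assumption; adjust to your manifests):

# Tail the Receiver for block upload activity
RECEIVER_POD=$(oc get pods -n $NAMESPACE -l app=thanos-receiver -o jsonpath='{.items[0].metadata.name}')
oc logs -n $NAMESPACE $RECEIVER_POD | grep -i upload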

3. Access Grafana

# Get URL
oc get route grafana -n psap-obs -o jsonpath='{.spec.host}'

# Login: admin / admin

Use cluster filter dropdown to query specific clusters or all clusters.


Directory Structure

metrics-platform/
├── README.md                          # This file
├── cluster-ingestion-plane/           # Deploy per cluster
│   ├── README.md
│   ├── s3-config/
│   │   └── 01-s3-secret.yaml
│   ├── receiver/
│   │   └── 01-deployment.yaml         # With ${CLUSTER_NAME}
│   ├── prometheus-config/
│   │   └── 01-user-workload-monitoring-config.yaml
│   ├── servicemonitor/
│   │   └── 01-vllm-servicemonitor.yaml
│   └── scripts/
│       ├── deploy.sh                  # Automated deployment
│       └── validate.sh                # Validation
└── central-query-plane/               # Deploy once
    ├── README.md
    ├── store-gateway/
    │   ├── 01-s3-secret.yaml
    │   └── 02-deployment.yaml
    ├── querier/
    │   └── 01-deployment.yaml
    ├── grafana/
    │   ├── 01-deployment.yaml
    │   └── 02-add-cluster-variable.py  # Dashboard tool
    └── scripts/
        └── deploy.sh

Configuration

Cluster Name

Each cluster must have a unique cluster name:

export CLUSTER_NAME=ocp4-intlab          # Cluster A
export CLUSTER_NAME=mehul-validation     # Cluster B
export CLUSTER_NAME=production-us-east   # Cluster C

This name becomes:

  1. A label on all metrics: cluster="ocp4-intlab"
  2. A label in each block's meta.json file in S3
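
Inside the Receiver this is typically wired through Thanos Receive's --label flag; a sketch of the relevant arguments (receiver/01-deployment.yaml templates ${CLUSTER_NAME} and is the source of truth):

# External label attached to every block this Receiver uploads
thanos receive \
  --label 'cluster="ocp4-intlab"' \
  --objstore.config-file=/etc/thanos/s3.yaml \
  --remote-write.address=0.0.0.0:19291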

S3 Organization

Flat Structure - All clusters write to the same S3 bucket root:

s3://model-furnace-metrics/
├── 01K8KR0YWR3DTHM9239J3AXQ2Y/
│   ├── meta.json              # Contains: {"cluster": "psap-8xh200-2", ...}
│   ├── chunks/
│   └── index
├── 01K8M3SABCFSTXXFQX0SWF1J9Y/
│   ├── meta.json              # Contains: {"cluster": "mehulvalidation-wp4vb", ...}
│   ├── chunks/
│   └── index
└── ...

Cluster identification:

  • No separate directories per cluster
  • Cluster name stored as label in each block's meta.json file
  • Thanos Querier filters blocks by cluster label for queries
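
You can verify the label on any block directly; a sketch assuming aws CLI access to the bucket (block ID taken from the listing above; Thanos stores external labels under .thanos.labels in meta.json):

# Inspect one block's external labels
aws s3 cp s3://model-furnace-metrics/01K8KR0YWR3DTHM9239J3AXQ2Y/meta.json - | jq '.thanos.labels'
# e.g. {"cluster": "psap-8xh200-2"}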

Dashboard Variables

Grafana dashboard supports:

  • cluster_name: Multi-select (All / specific clusters)
  • deployment_uuid: Filter by deployment
  • pod_name: Filter by pod
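
These dropdowns are populated from label values; the same data Grafana's variable queries draw on is visible via the Querier API (a Querier port-forward on localhost:9090 is assumed):

# Values backing the deployment_uuid dropdown
curl -s 'http://localhost:9090/api/v1/label/deployment_uuid/values' | jq .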

Operational Guide

Adding a New Cluster

  1. Deploy cluster-ingestion-plane with new CLUSTER_NAME
  2. Wait 2-3 hours for first S3 upload
  3. New cluster appears in Grafana dropdown automatically

Monitoring Health

# Check ingestion on cluster
cd cluster-ingestion-plane
export CLUSTER_NAME=your-cluster
./scripts/validate.sh

# Check central query plane
QUERIER_POD=$(oc get pods -n psap-obs -l app=thanos-querier -o jsonpath='{.items[0].metadata.name}')
oc exec -n psap-obs $QUERIER_POD -- wget -q -O- 'http://localhost:9090/api/v1/label/cluster/values'
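
As a follow-up, the same pod can confirm that every cluster is still reporting series (count of up series per cluster; %20 encodes the spaces in the PromQL):

# One entry per cluster in the response means ingestion is healthy
oc exec -n psap-obs $QUERIER_POD -- wget -q -O- \
  'http://localhost:9090/api/v1/query?query=count%20by%20(cluster)%20(up)'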

Troubleshooting

See detailed troubleshooting in:

  • cluster-ingestion-plane/README.md
  • central-query-plane/README.md

Resource Requirements

Per Cluster (Ingestion)

Component   CPU         Memory   Storage
Receiver    1-2 cores   4-8Gi    50Gi PVC

Central (Query)

Component       CPU         Memory      Storage
Store Gateway   2-4 cores   16-32Gi     50Gi PVC
Querier         2-4 cores   8-16Gi      N/A
Grafana         250m-500m   512Mi-1Gi   N/A

Cost Analysis

Example for 3 clusters (moderate workload):

  • Per cluster: ~20 GB/month compressed
  • Total: ~60 GB/month
  • S3 cost: ~$1.40/month
  • After 1 year: ~$10/month (with lifecycle policies)
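
The monthly figure is straightforward arithmetic, assuming S3 Standard pricing of roughly $0.023/GB-month (rates vary by region and storage class):

# 3 clusters x ~20 GB/month = ~60 GB of new data per month
echo '60 * 0.023' | bc   # 1.380 → the ~$1.40/month above

Lifecycle policies that move older blocks to cheaper tiers are what keep the year-one steady state near the ~$10/month quoted above.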

See main README for detailed cost breakdown.


Documentation

  • cluster-ingestion-plane/README.md - ingestion plane deployment and troubleshooting
  • central-query-plane/README.md - query plane deployment and troubleshooting

Support

For issues:

  1. Run validation scripts (validate.sh)
  2. Check component logs
  3. Review detailed READMEs for each plane
  4. Check Thanos documentation: https://thanos.io

Version: 2.0.0-multi-cluster
Last Updated: 2025-10-28
