# Kubernetes AI/ML Model Introspector for vLLM Deployments
Automatically discover, document, and analyze your AI inference infrastructure
Features • Quick Start • Commands • Output Formats • Installation
PIQC (Production Inference Quality Control) is a powerful Kubernetes-native introspection tool designed for AI/ML platform teams. It automatically discovers vLLM inference deployments across your cluster and generates comprehensive, standardized ModelSpec documentation.
```
🔍 PIQC Scan Flow

┌─────────┐     ┌──────────────┐     ┌─────────────┐     ┌─────────────┐
│   K8s   │────▶│ Discovery &  │────▶│   Collect   │────▶│  Generate   │
│ Cluster │     │  Detection   │     │   Metrics   │     │  ModelSpec  │
└─────────┘     └──────────────┘     └─────────────┘     └─────────────┘

• Scans all namespaces           • GPU metrics via nvidia-smi
• Detects vLLM workloads         • Runtime metrics via vLLM API
• Weighted confidence scoring    • KV cache, latency, throughput
```
## Features

- Auto-Detection: Automatically discovers vLLM inference deployments across all namespaces
- Weighted Confidence Scoring: Uses multiple signals (images, env vars, CLI args, labels) with weighted scoring
- Framework Detection: Identifies vLLM with high accuracy using pattern matching and heuristics
- GPU Metrics: Real-time GPU utilization, memory, temperature, and power via `nvidia-smi`
- Runtime Metrics: Collects vLLM API metrics including:
  - Request latency (P50, P95, P99)
  - Token throughput (prefill & decode)
  - KV cache utilization
  - Queue depth and active requests
  - Health status
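The weighted confidence scoring above can be sketched in a few lines. The signal names mirror the feature list, but the weights and the 0.5 threshold are invented for illustration and are not PIQC's actual values:

```python
# Sketch of weighted confidence scoring across detection signals.
# Weights and threshold are illustrative, not PIQC's actual values.
SIGNAL_WEIGHTS = {
    "image": 0.4,      # container image name matches a vLLM pattern
    "cli_args": 0.3,   # command line contains vLLM-specific flags
    "env_vars": 0.2,   # vLLM-related environment variables are set
    "labels": 0.1,     # pod labels hint at an inference workload
}

def confidence(signals: dict[str, bool]) -> float:
    """Return a 0..1 confidence that a pod runs vLLM."""
    return sum(w for name, w in SIGNAL_WEIGHTS.items() if signals.get(name))

pod = {"image": True, "cli_args": True, "env_vars": False, "labels": True}
score = confidence(pod)          # 0.4 + 0.3 + 0.1 = 0.8
is_vllm = score >= 0.5           # illustrative detection threshold
```

Each signal contributes its weight when present, so a pod matching the image pattern and CLI args alone would already clear the illustrative threshold.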
| Format | Description |
|---|---|
| YAML | Kubernetes-style ModelSpec files (default) |
| JSON | Machine-readable JSON output |
| Table | Rich console table for quick viewing |
| PIQC Facts | Standardized facts bundle for quality assessment |
- Parallel Processing: Multi-threaded scanning with configurable workers
- RBAC Support: Pre-configured ClusterRole and ServiceAccount manifests
- Flexible Modes: Auto-detect, remote (kubeconfig), or in-cluster execution
- Timeout Controls: Configurable operation timeouts
- 🔴 AMD GPU Support: Support for AMD Instinct and Radeon GPUs via
- 🌐 LLM-D (LLM-Distributed): Discovery and documentation for distributed LLM inference:
## Quick Start

```shell
# Verify cluster connectivity and permissions
piqc test-connection

# Scan entire cluster with console table output
piqc scan --format table

# Scan and generate YAML ModelSpec files
piqc scan --format yaml -o ./output

# Scan with runtime metrics from vLLM API
piqc scan --collect-runtime --format json
```

Example output:

```
ModelSpec Introspector v1.0.0
========================================
[INFO] Connecting to cluster...
  Context: my-k8s-context
  Cluster: my-cluster
[INFO] Scanning namespaces...
  Discovered: 12 namespace(s)
[INFO] Detecting inference workloads...
  Pods analyzed: 47
  Inference deployments found: 3

Framework Distribution:
┃ Framework ┃ Count ┃
├───────────┼───────┤
│ vllm      │     3 │

[INFO] Scan completed in 8.2s
```
## Commands

### `piqc scan`

Scan Kubernetes cluster for vLLM model deployments and generate ModelSpec documentation.

```shell
piqc scan [OPTIONS]
```

| Option | Default | Description |
|---|---|---|
| `--kubeconfig PATH` | `~/.kube/config` | Path to kubeconfig file |
| `--context TEXT` | current | Kubernetes context to use |
| `-n, --namespace TEXT` | all | Specific namespace to scan |
| `--format [yaml\|json\|table]` | `yaml` | Output format |
| `-o, --output PATH` | `./output` | Output directory for generated files |
| Option | Default | Description |
|---|---|---|
| `--collect-runtime` | false | Collect runtime metrics via vLLM API |
| `--no-exec` | false | Disable pod exec (skip GPU metrics) |
| `--no-logs` | false | Disable log reading |
| `--aggregate/--no-aggregate` | aggregate | Aggregate metrics across pod replicas |
| Option | Default | Description |
|---|---|---|
| `--combined` | false | Generate single combined output file |
| `--output-piqc` | false | Generate piqc-facts.json (PIQC v0.1 schema) |
| Option | Default | Description |
|---|---|---|
| `--timeout INT` | 30 | Operation timeout in seconds |
| `--workers INT` | 5 | Number of parallel workers |
| `--mode [auto\|remote\|incluster\|dry-run]` | auto | Execution mode |
| `-v, --verbose` | false | Enable verbose output |
| `--debug` | false | Enable debug mode with detailed trace |
```shell
# Basic scan - discover all vLLM deployments
piqc scan

# Scan specific namespace with JSON output
piqc scan -n production --format json

# Quick scan without GPU metrics (faster)
piqc scan --no-exec

# Collect runtime metrics from vLLM API
piqc scan --collect-runtime

# Generate PIQC facts bundle for quality assessment
piqc scan --output-piqc -o ./facts

# Combined output file instead of per-deployment files
piqc scan --combined -o ./output

# Table output to console (human-readable)
piqc scan --format table

# Custom kubeconfig and context
piqc scan --kubeconfig /path/to/config --context my-cluster

# Disable metric aggregation across replicas
piqc scan --no-aggregate

# Full verbose debug mode
piqc scan -v --debug
```

### `piqc test-connection`

Test connection to Kubernetes cluster and verify required permissions.
```shell
piqc test-connection [OPTIONS]
```

| Option | Default | Description |
|---|---|---|
| `--kubeconfig PATH` | `~/.kube/config` | Path to kubeconfig file |
| `--context TEXT` | current | Kubernetes context to use |
Example output:

```
ModelSpec Introspector v1.0.0
========================================
[INFO] Testing cluster connection...
  Connection successful
  Context: my-context
  Cluster: my-cluster
[INFO] Testing namespace access...
  Accessible namespaces: 15

All checks passed
```
### `piqc version`

Display version information.

```shell
piqc version
# Output: ModelSpec Introspector v1.0.0
```

## Output Formats

### YAML (default)

Generates individual Kubernetes-style YAML files for each deployment:
```yaml
apiVersion: modelspec/v1
kind: ModelSpec
metadata:
  name: vllm-llama-7b
  namespace: inference
  collectionTimestamp: "2024-01-07T12:00:00Z"
  collectorVersion: "1.0.0"
model:
  name: meta-llama/Llama-2-7b-hf
  architecture: llama
  parameters: "7B"
  identificationConfidence: 0.95
engine:
  name: vllm
  version: "0.4.0"
  detectionConfidence: 0.95
inference:
  precision: float16
  tensorParallelSize: 4
  maxModelLen: 4096
  gpuMemoryUtilization: 0.90
resources:
  replicas: 2
  gpuCount: 4
  gpus:
  - type: A100-SXM4-80GB
    memoryTotal: "80GB"
    utilization: 87
    memoryUsed: 72000
runtimeState:
  vllm:
    healthStatus: healthy
    kvCacheUsagePercent: 45.2
    avgPromptThroughput: 1250.5
    avgGenerationThroughput: 85.3
dataCompleteness:
  staticConfig: true
  gpuMetrics: true
  runtimeMetrics: true
```

### JSON

Same structure as YAML but in JSON format, ideal for programmatic processing.
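As a sketch of such processing, the fragment below pulls a few fields out of a ModelSpec-shaped JSON document using only the standard library; the field names follow the ModelSpec example above, and the embedded document is illustrative:

```python
import json

# A fragment of a generated ModelSpec in JSON form (field names as in
# the ModelSpec example above; values are the same illustrative ones)
doc = """
{
  "kind": "ModelSpec",
  "engine": {"name": "vllm", "version": "0.4.0"},
  "resources": {
    "replicas": 2,
    "gpus": [{"type": "A100-SXM4-80GB", "utilization": 87}]
  }
}
"""

spec = json.loads(doc)
assert spec["kind"] == "ModelSpec"
print(spec["engine"]["name"], spec["resources"]["gpus"][0]["utilization"])
# prints: vllm 87
```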
### Table

Rich console table for quick human-readable viewing:

```
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ Model Name                ┃ Engine ┃ GPU Type           ┃ Replicas ┃ GPU Util ┃ Namespace   ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ meta-llama/Llama-2-7b-hf  │ vllm   │ 4×A100-SXM4-80GB   │ 2        │ 87%      │ inference   │
│ mistralai/Mistral-7B      │ vllm   │ 2×A100-40GB        │ 1        │ 72%      │ production  │
│ Qwen/Qwen2-72B            │ vllm   │ 8×H100-SXM5-80GB   │ 3        │ 91%      │ ml-serving  │
└───────────────────────────┴────────┴────────────────────┴──────────┴──────────┴─────────────┘
```
### PIQC Facts

With `--output-piqc`, generates a standardized facts bundle for quality assessment systems:
```json
{
  "schemaVersion": "piqc-scan.v0.1",
  "generatedAt": "2024-01-07T12:00:00Z",
  "tool": {
    "name": "piqc",
    "version": "1.0.0"
  },
  "cluster": {
    "context": "my-context",
    "name": "my-cluster"
  },
  "objects": [
    {
      "workloadId": "ns/inference/deployment/vllm-llama-7b",
      "facts": {
        "runtime.engineType": {"value": "vllm", "dataConfidence": "high"},
        "runtime.engineVersion": {"value": "0.4.0", "dataConfidence": "medium"},
        "hardware.gpuType": {"value": "A100-SXM4-80GB", "dataConfidence": "high"},
        "hardware.gpuCount": {"value": 4, "dataConfidence": "high"},
        "hardware.gpuMemoryTotal": {"value": 80, "unit": "GB", "dataConfidence": "high"},
        "observed.gpuUtilization": {"value": 87, "unit": "%", "dataConfidence": "high"},
        "vllm.tensorParallelSize": {"value": 4, "dataConfidence": "high"},
        "vllm.maxModelLen": {"value": 4096, "dataConfidence": "high"},
        "observed.kvCacheUsage": {"value": 45.2, "unit": "%", "dataConfidence": "high"}
      }
    }
  ]
}
```

## Installation

- Python: 3.11 or higher
- Kubernetes Access: Valid kubeconfig with cluster access
- Poetry: For development installation
```shell
# Clone the repository
git clone https://github.com/paralleliq/piqc.git
cd piqc

# Install with Poetry
poetry install

# Verify installation
poetry run piqc --version
```

For development:

```shell
# Clone and install with dev dependencies
git clone https://github.com/paralleliq/piqc.git
cd piqc
poetry install --with dev

# Run tests
poetry run pytest tests/unit -v

# Run with coverage
poetry run pytest tests/unit --cov=src/piqc
```

PIQC requires specific Kubernetes permissions. Apply the provided RBAC manifests:
```shell
kubectl apply -f rbac/
```

| Resource | Verbs | Purpose |
|---|---|---|
| `pods` | get, list | Discover inference workloads |
| `pods/exec` | create | Run nvidia-smi for GPU metrics |
| `pods/log` | get | Enhanced framework detection |
| `namespaces` | get, list | Scan multiple namespaces |
| `deployments` | get, list | Identify deployment metadata |
| `statefulsets` | get, list | Identify StatefulSet workloads |
| `services` | get, list | Endpoint detection |
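For reference, a ClusterRole covering the permissions above might look like the following sketch; the authoritative manifests ship in the `rbac/` directory, and the metadata name here is illustrative:

```yaml
# Illustrative ClusterRole matching the permissions table above;
# use the provided rbac/clusterrole.yaml as the source of truth.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: piqc-introspector   # illustrative name
rules:
- apiGroups: [""]
  resources: ["pods", "namespaces", "services"]
  verbs: ["get", "list"]
- apiGroups: [""]
  resources: ["pods/exec"]
  verbs: ["create"]
- apiGroups: [""]
  resources: ["pods/log"]
  verbs: ["get"]
- apiGroups: ["apps"]
  resources: ["deployments", "statefulsets"]
  verbs: ["get", "list"]
```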
```
rbac/
├── serviceaccount.yaml       # ServiceAccount for PIQC
├── clusterrole.yaml          # ClusterRole with required permissions
└── clusterrolebinding.yaml   # Binds role to service account
```
| Mode | Description |
|---|---|
| `auto` | Automatically detect if running in-cluster or remotely |
| `remote` | Force remote mode (uses kubeconfig) |
| `incluster` | Force in-cluster mode (uses ServiceAccount) |
| `dry-run` | Simulate scan without cluster access |
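For example, using the modes documented above:

```shell
# Simulate a scan without touching any cluster
piqc scan --mode dry-run

# Force kubeconfig-based access even when running inside a pod
piqc scan --mode remote --kubeconfig ~/.kube/config
```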
```shell
# Verify kubeconfig is valid
kubectl cluster-info

# Test with specific context
piqc test-connection --context my-context

# Enable debug mode for detailed errors
piqc scan --debug
```

```shell
# Check current permissions
kubectl auth can-i list pods --all-namespaces
kubectl auth can-i create pods/exec -n <namespace>

# Apply RBAC manifests
kubectl apply -f rbac/
```

If `nvidia-smi` is not available in containers, use `--no-exec`:

```shell
piqc scan --no-exec
```

Ensure the vLLM service is accessible. Use `--collect-runtime` and check:

```shell
# Verify vLLM health endpoint
kubectl port-forward svc/<vllm-service> 8000:8000
curl http://localhost:8000/health
```
```shell
# Run all unit tests
poetry run pytest tests/unit -v

# Run with coverage
poetry run pytest tests/unit --cov=src/piqc

# Run integration tests (requires cluster)
poetry run pytest tests/integration -v
```

```shell
# Format code with Black
poetry run black src/ tests/

# Lint code with Ruff
poetry run ruff check src/ tests/

# Type checking with MyPy
poetry run mypy src/piqc/
```
```
├── src/piqc/
│   ├── cli/             # CLI commands (scan, test-connection, version)
│   ├── collectors/      # Data collectors (vLLM config, GPU metrics)
│   ├── core/            # Core logic (orchestrator, discovery, k8s client)
│   ├── generators/      # Output generators (YAML, JSON, Table, PIQC)
│   ├── models/          # Pydantic data models (ModelSpec, PIQC schema)
│   ├── parsers/         # Configuration parsers (vLLM)
│   └── utils/           # Utilities (logging, exceptions)
├── tests/
│   ├── unit/            # Unit tests
│   └── integration/     # Integration tests (with mock containers)
├── rbac/                # Kubernetes RBAC manifests
├── docs/                # Documentation (LaTeX guides)
└── examples/            # Example ModelSpec files
```
Apache License 2.0 - see LICENSE for details.
Built with ❤️ by ParallelIQ
🚀 Model-aware GPU Control Plane