The Soperator metrics pipeline provides observability for SLURM clusters running on Kubernetes. It collects metrics from various sources, stores them in VictoriaMetrics, and provides visualization through Grafana.
┌────────────────────────────────────────────────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│                        System Nodes                        │ │    GPU Nodes    │ │  All K8s Nodes  │
│ ┌─────────────┐  ┌──────────────┐ ┌──────────────────────┐ │ │ ┌─────────────┐ │ │ ┌─────────────┐ │
│ │    SLURM    │  │  Soperator   │ │      Kube-state      │ │ │ │    DCGM     │ │ │ │    Node     │ │
│ │  Exporter   │  │  Controller  │ │       metrics        │ │ │ │  Exporter   │ │ │ │  Exporter   │ │
│ │    :8080    │  │    :8443     │ │        :8081         │ │ │ │    :9400    │ │ │ │    :9100    │ │
│ └─────────────┘  └──────────────┘ └──────────────────────┘ │ │ └─────────────┘ │ │ └─────────────┘ │
└────────────────────────────┬───────────────────────────────┘ └────────┬────────┘ └────────┬────────┘
                             │                                          │                   │
                             └──────────────────────────────┬───────────┴───────────────────┘
                                                            │
                                               ┌────────────▼────────────┐
                                               │     VictoriaMetrics     │
                                               │     Agent (VMAgent)     │
                                               │   Scrapes & Forwards    │
                                               └────────────┬────────────┘
                                                            │
                                           ┌────────────────┴───────────────┐
                                           │                                │
                                ┌──────────▼──────────┐          ┌──────────▼──────────┐
                                │   VictoriaMetrics   │          │    Nebius Cloud     │
                                │  Single (VMSingle)  │          │     Monitoring      │
                                │    Local Storage    │          │   (Remote Write)    │
                                └──────────┬──────────┘          └─────────────────────┘
                                           │
                                ┌──────────▼──────────┐
                                │       Grafana       │
                                │    Visualization    │
                                └─────────────────────┘
- Purpose: Exports SLURM-specific metrics
- Port: 8080
- Metrics: SLURM nodes state, jobs, controller RPC diagnostics
- Deployment: Runs on system nodes (slurm.nebius.ai/nodeset=system)
- Namespace: soperator (in the SLURM cluster namespace)
- Scrape Interval: 30s (default)
- Label Processing: Automatic removal of Kubernetes metadata labels (pod, instance, container)
- Documentation: slurm-exporter.md
Connection Example:
kubectl port-forward -n soperator deployment/slurm-exporter 8080:8080
curl http://localhost:8080/metrics

The exporter applies metric relabeling to drop volatile Kubernetes labels (pod, instance, container) so that counters stay continuous across restarts.
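Why dropping these labels helps can be shown with a small Python sketch. It is not the pipeline's actual relabeling configuration; the metric name `slurm_example_total`, the `state` label, and all label values are hypothetical and chosen only for illustration.

```python
# Labels that change whenever the exporter pod is rescheduled or restarted.
VOLATILE_LABELS = ("pod", "instance", "container")


def drop_volatile_labels(labels, volatile=VOLATILE_LABELS):
    """Return a copy of `labels` without the volatile Kubernetes labels."""
    return {k: v for k, v in labels.items() if k not in volatile}


def series_identity(name, labels):
    """A time series is identified by its metric name plus its sorted label set."""
    return (name, tuple(sorted(labels.items())))


# The same logical counter as scraped before and after a pod restart
# (hypothetical label values).
before = {"pod": "slurm-exporter-abc12", "instance": "10.0.0.5:8080",
          "container": "exporter", "state": "idle"}
after = {"pod": "slurm-exporter-xyz34", "instance": "10.0.0.9:8080",
         "container": "exporter", "state": "idle"}

name = "slurm_example_total"  # hypothetical metric name
raw_same = series_identity(name, before) == series_identity(name, after)
relabeled_same = (series_identity(name, drop_volatile_labels(before))
                  == series_identity(name, drop_volatile_labels(after)))

print(raw_same, relabeled_same)  # False True
```

With the volatile labels present, a restart starts a new series; once they are dropped, both scrapes land in the same series.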
- Purpose: Exports NVIDIA GPU metrics
- Port: 9400
- Metrics: GPU temperature, power, utilization, memory, errors
- Scrape Interval: 15s
- DaemonSet: Runs on nodes with nvidia.com/gpu.deploy.dcgm-exporter=true
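The exporter serves its metrics in the Prometheus text exposition format. The sketch below parses a small hand-written sample and flags hot GPUs. The sample values, the hostname, and the 75-degree threshold are assumptions for illustration, and the parser covers only the simple `name{labels} value` form rather than the full format.

```python
import re

# Hand-written sample in Prometheus text format (values are made up).
SAMPLE = """\
# HELP DCGM_FI_DEV_GPU_TEMP GPU temperature (in C).
# TYPE DCGM_FI_DEV_GPU_TEMP gauge
DCGM_FI_DEV_GPU_TEMP{gpu="0",Hostname="worker-0"} 41
DCGM_FI_DEV_GPU_TEMP{gpu="1",Hostname="worker-0"} 78
"""

LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse sample lines into (name, labels, value) tuples, skipping comments."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


# GPUs hotter than an arbitrary 75-degree threshold.
hot = [labels["gpu"]
       for name, labels, value in parse_metrics(SAMPLE)
       if name == "DCGM_FI_DEV_GPU_TEMP" and value > 75]
print(hot)  # ['1']
```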
Connection Example:
# Port-forward to a DCGM exporter pod
kubectl port-forward -n soperator deployment/nvidia-dcgm-exporter 9400:9400
curl http://localhost:9400/metrics

- Purpose: Exports node/system metrics
- Port: 9100
- Metrics: CPU, memory, disk, network statistics
- Part of: Prometheus Operator stack
- Purpose: Collects node and container metrics from kubelet
- Endpoints:
  - /metrics - Core kubelet metrics
  - /metrics/cadvisor - Container and cgroup metrics
  - /metrics/probes - Liveness and readiness probe metrics
  - /metrics/resource - Pod resource metrics
- Scrape Method: VMScrapes targeting node endpoints directly
- Scrape Interval: 30s
Key Metrics:
- container_memory_usage_bytes - Container memory usage
- container_cpu_usage_seconds_total - Container CPU usage
- kubelet_pod_start_duration_seconds - Pod startup latency
- kubelet_running_pods - Number of running pods per node
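Because container_cpu_usage_seconds_total is a cumulative counter, CPU usage is read as its rate of change between scrapes. The sketch below shows that arithmetic for two consecutive samples taken 30s apart. It is a simplified approximation of what a PromQL `rate()` does (no extrapolation or multi-sample windows), and the sample values are made up.

```python
def counter_rate(prev_value, curr_value, interval_seconds):
    """Per-second increase of a counter between two scrapes."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    delta = curr_value - prev_value
    if delta < 0:
        # Counter reset: the container restarted and the counter began again
        # from zero, so the current value is the whole increase.
        delta = curr_value
    return delta / interval_seconds


# Two hypothetical samples of container_cpu_usage_seconds_total, 30s apart.
cores = counter_rate(1200.0, 1215.0, 30)
print(cores)  # 0.5
```

A result of 0.5 means the container used half a CPU core on average over that interval.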
- Purpose: Exports Kubernetes object metrics
- Ports:
  - 8080 - Main metrics endpoint (Kubernetes object state)
  - 8081 - Telemetry endpoint (self-monitoring)
- Metrics: Pod state metrics (filtered subset)
- Deployment: Single-replica deployment in the monitoring-system namespace
- Configuration: --resources=pods with metric allowlist filtering
Connection Example:
# Port-forward to main metrics endpoint
kubectl port-forward -n monitoring-system svc/metrics-kube-state-metrics 8080:8080
curl http://localhost:8080/metrics

Note: Port 8080 provides Kubernetes object metrics, while port 8081 provides self-monitoring metrics. VMServiceScrape targets port 8080 for cluster monitoring.
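The effect of allowlist filtering can be sketched as follows. The patterns are hypothetical (the actual allowlist is not given here), and the matching is a simplification of how kube-state-metrics evaluates its allowlist.

```python
import re


def compile_allowlist(patterns):
    """Compile allowlist entries; each entry is treated as a full-match regex."""
    return [re.compile(p) for p in patterns]


def is_allowed(metric_name, allowlist):
    """A metric is kept only if some allowlist entry matches its whole name."""
    return any(p.fullmatch(metric_name) for p in allowlist)


# Hypothetical allowlist; not the pipeline's real configuration.
ALLOWLIST = compile_allowlist([
    "kube_pod_status_phase",
    "kube_pod_container_status_.*",
])

names = [
    "kube_pod_status_phase",
    "kube_pod_container_status_restarts_total",
    "kube_pod_labels",
    "kube_deployment_status_replicas",
]
kept = [n for n in names if is_allowed(n, ALLOWLIST)]
print(kept)
```

Only the first two names survive; everything else is never exported, which keeps the scraped series count small.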
- Purpose: Exports controller runtime metrics
- Port: 8443 (through kube-rbac-proxy)
- Metrics: Reconciliation metrics, controller health
- Deployment: Runs on system nodes with the controller manager
- Namespace: soperator-system
- Access: Protected by RBAC proxy, requires proper authentication
Connection Example:
# Port-forward directly to the controller (bypasses the RBAC proxy)
kubectl port-forward -n soperator-system deployment/soperator-controller-manager 8080:8080
curl http://localhost:8080/metrics

Note: Production scraping requires a ServiceMonitor with proper RBAC authentication.
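When the metrics are read through the RBAC proxy instead, the client has to present a bearer token. The sketch below only constructs such a request and does not send it. The base URL assumes port 8443 has been port-forwarded to localhost, and the token value is a placeholder; TLS verification and obtaining a token with sufficient permissions are out of scope here.

```python
import urllib.request


def build_metrics_request(base_url, token):
    """Build (but do not send) an authenticated GET request for /metrics."""
    request = urllib.request.Request(f"{base_url.rstrip('/')}/metrics")
    request.add_header("Authorization", f"Bearer {token}")
    return request


# Placeholder values: assumes 8443 is forwarded locally and a token is at hand.
request = build_metrics_request("https://localhost:8443", "example-token")
print(request.full_url)
print(request.get_header("Authorization"))
```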
- Purpose: Scrapes metrics from exporters and forwards to storage
- Features:
  - Service discovery via Kubernetes API
  - Label filtering and relabeling
  - Remote write to multiple destinations
  - Stream parsing for efficiency
VMAgent exposes operational metrics on port 8429 for monitoring and debugging.
- Purpose: Time-series database for metrics storage
- Port: 8429
- Retention: 30 days
- Storage: 30Gi persistent volume
- API: Prometheus-compatible query API
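Since the API is Prometheus-compatible, an instant query is a GET on `/api/v1/query` and the response follows the usual Prometheus JSON envelope. The sketch below builds the URL and parses a hand-written sample response. The `job` label value and the sample payload are assumptions, and the base URL presumes a local port-forward on 8429.

```python
import json
from urllib.parse import urlencode


def instant_query_url(base_url, promql):
    """Build the URL for an instant query against a Prometheus-compatible API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def extract_values(response_body):
    """Return (labels, value) pairs from an instant-query vector response."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error', 'unknown error')}")
    return [(item["metric"], float(item["value"][1]))
            for item in payload["data"]["result"]]


# Hand-written sample response; not captured from a real cluster.
SAMPLE_RESPONSE = (
    '{"status":"success","data":{"resultType":"vector","result":'
    '[{"metric":{"__name__":"up","job":"node-exporter"},'
    '"value":[1700000000,"1"]}]}}'
)

url = instant_query_url("http://localhost:8429", 'up{job="node-exporter"}')
print(url)
print(extract_values(SAMPLE_RESPONSE))
```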
Connection Example:
# Port-forward to VMSingle
kubectl port-forward -n monitoring-system svc/vmsingle-metrics-victoria-metrics-k8s-stack 8429:8429
# Query metrics
curl "http://localhost:8429/api/v1/query?query=up"

- Endpoint: https://write.monitoring.{region}.nebius.cloud/projects/{projectId}/buckets/soperator/prometheus
- Authentication: Bearer token from /mnt/cloud-metadata/tsa-token
- When: Enabled with publicEndpointEnabled: true
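How the endpoint template and token file fit together can be sketched like this. The region and project ID below are placeholders, and the helper names are hypothetical; in the pipeline itself VMAgent performs the remote write, so this only illustrates the URL shape and where the token comes from.

```python
from pathlib import Path

# Template and token path as documented above.
ENDPOINT_TEMPLATE = (
    "https://write.monitoring.{region}.nebius.cloud"
    "/projects/{projectId}/buckets/soperator/prometheus"
)
TOKEN_PATH = "/mnt/cloud-metadata/tsa-token"


def remote_write_url(region, project_id):
    """Fill the endpoint template with a region and project ID."""
    return ENDPOINT_TEMPLATE.format(region=region, projectId=project_id)


def read_bearer_token(path=TOKEN_PATH):
    """Read the bearer token from the mounted metadata file (not called here)."""
    return Path(path).read_text().strip()


# Placeholder region and project ID, for illustration only.
print(remote_write_url("eu-north1", "project-example"))
```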
- Purpose: Metrics visualization and dashboards
- Port: 80
- Authentication: Anonymous access enabled (Editor role)
- Features:
  - Pre-configured dashboards
  - VictoriaMetrics datasource
  - Dashboard discovery from ConfigMaps
  - Loki/VictoriaLogs integration
Connection Example:
# Port-forward to Grafana
kubectl port-forward -n monitoring-system svc/metrics-grafana 3000:80
# Access in browser: http://localhost:3000

Pre-configured Dashboards:
- Victoria Metrics K8s Stack: Grafana, Kubelet, Kubernetes system, Node Exporter, VictoriaMetrics health
- Soperator Custom: Cluster health, Jobs overview, Workers stats and overview
Dashboards are auto-discovered from ConfigMaps with label grafana_dashboard: "1" in monitored namespaces.
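A ConfigMap that this discovery would pick up can be generated as in the sketch below. The ConfigMap name, the target namespace, and the dashboard body are placeholders; only the `grafana_dashboard: "1"` label comes from the description above.

```python
import json


def dashboard_configmap(name, namespace, dashboard):
    """Wrap a dashboard definition in a ConfigMap carrying the discovery label."""
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {
            "name": name,
            "namespace": namespace,
            # The label value must be the string "1", not the integer 1.
            "labels": {"grafana_dashboard": "1"},
        },
        "data": {f"{name}.json": json.dumps(dashboard)},
    }


# Placeholder dashboard; a real one would be exported from Grafana.
manifest = dashboard_configmap(
    "example-dashboard", "monitoring-system", {"title": "Example", "panels": []}
)
print(json.dumps(manifest, indent=2))
```

Since kubectl accepts JSON manifests, the printed output can be piped to `kubectl apply -f -`.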