The SLURM Exporter is a component of Soperator that collects metrics from SLURM clusters and exports them in Prometheus format. It provides comprehensive monitoring capabilities for SLURM cluster health, job status, node states, and controller performance metrics.
The exporter integrates seamlessly with the Prometheus monitoring stack and enables observability for SLURM workloads running on Kubernetes through Soperator.
- Asynchronous metrics collection with configurable intervals (default: 30s)
- Real-time monitoring of SLURM nodes, jobs, and controller performance
- Prometheus-native metrics with standardized naming conventions
- Rich labeling for detailed filtering and aggregation
- Controller RPC diagnostics similar to SLURM's `sdiag` command
- Kubernetes-native deployment as part of Soperator
The SLURM Exporter can be configured using either command-line flags or environment variables. Environment variables take precedence over defaults but are overridden by explicitly provided command-line flags.
The configuration follows this priority order:
- Command-line flags (highest priority) - Explicitly provided flags override all other settings
- Environment variables - Used when flags are not provided
- Default values (lowest priority) - Used when neither flags nor environment variables are set
All configuration options can be set via command-line flags or environment variables:
| Environment Variable | Flag | Description | Default |
|---|---|---|---|
| `SLURM_EXPORTER_CLUSTER_NAME` | `--cluster-name` | The name of the SLURM cluster (required) | none |
| `SLURM_EXPORTER_CLUSTER_NAMESPACE` | `--cluster-namespace` | The namespace of the SLURM cluster | `soperator` |
| `SLURM_EXPORTER_SLURM_API_SERVER` | `--slurm-api-server` | The address of the SLURM REST API server | `http://localhost:6820` |
| `SLURM_EXPORTER_COLLECTION_INTERVAL` | `--collection-interval` | How often to collect metrics from SLURM APIs | `30s` |
| `SLURM_EXPORTER_METRICS_BIND_ADDRESS` | `--metrics-bind-address` | Address for the main metrics endpoint | `:8080` |
| `SLURM_EXPORTER_MONITORING_BIND_ADDRESS` | `--monitoring-bind-address` | Address for the self-monitoring metrics endpoint | `:8081` |
| `SLURM_EXPORTER_LOG_FORMAT` | `--log-format` | Log format: `plain` or `json` | `json` |
| `SLURM_EXPORTER_LOG_LEVEL` | `--log-level` | Log level: `debug`, `info`, `warn`, `error` | `debug` |
SLURM represents node state as a 32-bit integer where:
- The lowest 4 bits encode 6 mutually exclusive base states: `IDLE`, `DOWN`, `ALLOCATED`, `ERROR`, `MIXED`, `UNKNOWN`
- Additional bits are flag bits that can be combined with base states: `COMPLETING`, `DRAIN`, `MAINTENANCE`, `RESERVED`, `FAIL`, `PLANNED`
For example, a node can be IDLE+DRAIN (idle but marked for draining) or ALLOCATED+COMPLETING (running jobs but finishing up).
Reference: https://github.com/SchedMD/slurm/blob/master/slurm/slurm.h.in
The exporter represents this as:
- `state_base` label: the single base state (`IDLE`, `ALLOCATED`, etc.)
- `state_is_*` labels: boolean flags for additional state flags

Boolean state flag label convention:
- Legacy flags (`state_is_drain`, `state_is_maintenance`, `state_is_reserved`): use `"true"`/`"false"` for backward compatibility
- New flags (`state_is_completing`, `state_is_fail`, `state_is_planned`): use `"true"` or the empty string (`""`) to reduce label cardinality (in VictoriaMetrics, an empty label value is equivalent to no label, which helps avoid the 30-label limit)
| Metric Name & Type | Description & Labels |
|---|---|
| `slurm_node_info`<br>Gauge | Provides detailed information about SLURM nodes.<br>Labels:<br>• `node_name` - Name of the SLURM node<br>• `instance_id` - Kubernetes instance identifier<br>• `state_base` - Base node state (IDLE, ALLOCATED, DOWN, ERROR, MIXED, UNKNOWN)<br>• `state_is_drain` - Whether the node is in drain state (`"true"`/`"false"`)<br>• `state_is_maintenance` - Whether the node is in maintenance state (`"true"`/`"false"`)<br>• `state_is_reserved` - Whether the node is in reserved state (`"true"`/`"false"`)<br>• `state_is_completing` - Whether the node is in completing state (`"true"` or empty)<br>• `state_is_fail` - Whether the node is in fail state (`"true"` or empty)<br>• `state_is_planned` - Whether the node is in planned state (`"true"` or empty)<br>• `state_is_not_responding` - Whether the node is marked as not responding (`"true"` or empty)<br>• `state_is_invalid` - Whether the node state is considered invalid by SLURM (`"true"` or empty)<br>• `is_unavailable` - Computed by the exporter: `"true"` when the node is considered unavailable (DOWN+\* or IDLE+DRAIN+\*), empty string otherwise<br>• `reservation_name` - Reservation that currently includes the node (trimmed to 50 characters)<br>• `address` - IP address of the node<br>• `reason` - Reason for the current node state (empty string if the node has no reason set)<br>• `comment` - Comment set on the node (e.g., by active checks when a GPU health check fails) |
| `slurm_node_gpu_seconds_total`<br>Counter | Total GPU seconds accumulated on SLURM nodes.<br>Labels:<br>• `node_name` - Name of the SLURM node<br>• `state_base` - Base node state<br>• `state_is_drain` - Drain state flag<br>• `state_is_maintenance` - Maintenance state flag<br>• `state_is_reserved` - Reserved state flag |
| `slurm_node_fails_total`<br>Counter | Total number of node state transitions to failed states (DOWN/DRAIN).<br>Labels:<br>• `node_name` - Name of the SLURM node<br>• `state_base` - Base node state at time of failure<br>• `state_is_drain` - Drain state flag<br>• `state_is_maintenance` - Maintenance state flag<br>• `state_is_reserved` - Reserved state flag<br>• `reason` - Reason for the node failure |
| `slurm_node_unavailability_duration_seconds`<br>Histogram | Duration of completed node unavailability events (DOWN+\* or IDLE+DRAIN+\*).<br>Labels:<br>• `node_name` - Name of the SLURM node<br>Note: observations are recorded when unavailability events complete. Duration tracking is reset on exporter restarts, which may affect accuracy. |
| `slurm_node_draining_duration_seconds`<br>Histogram | Duration of completed node draining events (DRAIN+ALLOCATED or DRAIN+MIXED).<br>Labels:<br>• `node_name` - Name of the SLURM node<br>Note: observations are recorded when draining events complete. Duration tracking is reset on exporter restarts, which may affect accuracy. |
| `slurm_node_cpus_total`<br>Gauge | Total number of CPUs on the node.<br>Labels:<br>• `node_name` - Name of the SLURM node |
| `slurm_node_cpus_allocated`<br>Gauge | Number of CPUs currently allocated on the node.<br>Labels:<br>• `node_name` - Name of the SLURM node |
| `slurm_node_cpus_idle`<br>Gauge | Number of idle CPUs on the node.<br>Labels:<br>• `node_name` - Name of the SLURM node |
| `slurm_node_cpus_effective`<br>Gauge | Effective CPUs on the node (excluding specialized CPUs reserved for system daemons).<br>Labels:<br>• `node_name` - Name of the SLURM node |
| `slurm_node_memory_total_bytes`<br>Gauge | Total memory on the node in bytes.<br>Labels:<br>• `node_name` - Name of the SLURM node |
| `slurm_node_memory_allocated_bytes`<br>Gauge | Allocated memory on the node in bytes.<br>Labels:<br>• `node_name` - Name of the SLURM node |
| `slurm_node_memory_free_bytes`<br>Gauge | Free memory on the node in bytes.<br>Labels:<br>• `node_name` - Name of the SLURM node |
| `slurm_node_memory_effective_bytes`<br>Gauge | Effective memory on the node in bytes (total minus specialized memory reserved for system daemons).<br>Labels:<br>• `node_name` - Name of the SLURM node |
| `slurm_node_partition`<br>Gauge | Maps nodes to their partitions, enabling partition-level aggregation via PromQL joins.<br>Labels:<br>• `node_name` - Name of the SLURM node<br>• `partition` - Name of the SLURM partition |
| `slurm_job_info`<br>Gauge | Detailed information about SLURM jobs.<br>Labels:<br>• `job_id` - SLURM job identifier<br>• `job_state` - Current job state (PENDING, RUNNING, COMPLETED, FAILED, etc.)<br>• `job_state_reason` - Reason for the current job state<br>• `slurm_partition` - SLURM partition name<br>• `job_name` - User-defined job name<br>• `user_name` - Username of the user who submitted the job<br>• `user_id` - Numeric user ID of the user who submitted the job<br>• `standard_error` - Path to the stderr file<br>• `standard_output` - Path to the stdout file<br>• `array_job_id` - Array job ID (if applicable)<br>• `array_task_id` - Array task ID (if applicable)<br>• `submit_time` - When the job was submitted (Unix timestamp in seconds; empty if not available or zero)<br>• `start_time` - When the job started execution (Unix timestamp in seconds; empty if not available or zero)<br>• `end_time` - When the job completed (Unix timestamp in seconds; empty if not available or zero). Warning: for non-terminal states like RUNNING, this may contain a future timestamp representing the forecasted end time based on the job's time limit<br>• `finished_time` - When the job actually finished, for terminal states only (Unix timestamp in seconds; empty for non-terminal states or if `end_time` is zero). Unlike `end_time`, this field only contains actual completion times, never forecasted values |
| `slurm_node_job`<br>Gauge | Mapping between jobs and the nodes they're running on.<br>Labels:<br>• `job_id` - SLURM job identifier<br>• `node_name` - Name of the node running the job |
| `slurm_job_duration_seconds`<br>Gauge | Job duration in seconds. For running jobs, this is the time elapsed since the job started. For completed jobs, this is the total execution time.<br>Labels:<br>• `job_id` - SLURM job identifier<br>Notes:<br>• Only exported for jobs with a valid start time<br>• For non-terminal states (RUNNING, etc.): duration = current_time - start_time<br>• For terminal states (COMPLETED, FAILED, etc.): duration = end_time - start_time (only if end_time is valid) |
| `slurm_job_cpus`<br>Gauge | Number of CPUs allocated to the job.<br>Labels:<br>• `job_id` - SLURM job identifier |
| `slurm_job_memory_bytes`<br>Gauge | Memory allocated to the job in bytes.<br>Labels:<br>• `job_id` - SLURM job identifier |
These metrics provide insights into SLURM controller performance, similar to the output of the `sdiag` command, and were implemented to address issue #1027.
| Metric Name & Type | Description & Labels |
|---|---|
| `slurm_controller_rpc_calls_total`<br>Counter | Total count of RPC calls by message type.<br>Labels:<br>• `message_type` - Type of RPC message (e.g., REQUEST_NODE_INFO, REQUEST_JOB_INFO, REQUEST_PING) |
| `slurm_controller_rpc_duration_seconds_total`<br>Counter | Total time spent processing RPCs by message type (converted from microseconds).<br>Labels:<br>• `message_type` - Type of RPC message |
| `slurm_controller_rpc_user_calls_total`<br>Counter | Total count of RPC calls by user.<br>Labels:<br>• `user` - Username making the RPC calls<br>• `user_id` - Numeric user ID |
| `slurm_controller_rpc_user_duration_seconds_total`<br>Counter | Total time spent on user RPCs (converted from microseconds).<br>Labels:<br>• `user` - Username making the RPC calls<br>• `user_id` - Numeric user ID |
| `slurm_controller_server_thread_count`<br>Gauge | Number of server threads in the SLURM controller.<br>Labels: none |
The exporter provides self-monitoring metrics to track its own health and performance. These metrics are available on a separate endpoint (default port 8081) to avoid mixing operational metrics with business metrics.
| Metric Name & Type | Description & Labels |
|---|---|
| `slurm_exporter_collection_duration_seconds`<br>Gauge | Duration of the most recent metrics collection from SLURM APIs.<br>Labels: none |
| `slurm_exporter_collection_attempts_total`<br>Counter | Total number of metrics collection attempts.<br>Labels: none |
| `slurm_exporter_collection_failures_total`<br>Counter | Total number of failed metrics collection attempts.<br>Labels: none |
| `slurm_exporter_metrics_requests_total`<br>Counter | Total number of requests to the `/metrics` endpoint.<br>Labels: none |
| `slurm_exporter_metrics_exported`<br>Gauge | Number of metrics exported in the last scrape.<br>Labels: none |
To access self-monitoring metrics:
```shell
# Default monitoring port
curl http://localhost:8081/metrics

# Or with a custom monitoring address
./soperator-exporter --monitoring-bind-address=:9090
curl http://localhost:9090/metrics
```

To run the exporter locally against a cluster for debugging:
- Port-forward the SLURM REST API service:

  ```shell
  kubectl port-forward -n soperator svc/soperator-rest-svc 6820:6820
  ```

- Run the exporter (it finds the JWT secret in the cluster automatically):

  ```shell
  go run ./cmd/exporter/main.go --cluster-name=soperator --kubeconfig-path=$HOME/.kube/config
  ```

- View the metrics:

  ```shell
  curl localhost:8080/metrics
  ```

The SLURM Exporter integrates with existing Grafana dashboards. Here's an example based on the production dashboard from nebius-solutions-library.