
SLURM Exporter

Overview

The SLURM Exporter is a component of Soperator that collects metrics from SLURM clusters and exports them in Prometheus format. It provides comprehensive monitoring capabilities for SLURM cluster health, job status, node states, and controller performance metrics.

The exporter integrates seamlessly with the Prometheus monitoring stack and enables observability for SLURM workloads running on Kubernetes through Soperator.

Key Features

  • Asynchronous metrics collection with configurable intervals (default: 30s)
  • Real-time monitoring of SLURM nodes, jobs, and controller performance
  • Prometheus-native metrics with standardized naming conventions
  • Rich labeling for detailed filtering and aggregation
  • Controller RPC diagnostics similar to SLURM's sdiag command
  • Kubernetes-native deployment as part of Soperator

Configuration

The SLURM Exporter can be configured using either command-line flags or environment variables. Environment variables take precedence over defaults but are overridden by explicitly provided command-line flags.

Configuration Priority

The configuration follows this priority order:

  1. Command-line flags (highest priority) - Explicitly provided flags override all other settings
  2. Environment variables - Used when flags are not provided
  3. Default values (lowest priority) - Used when neither flags nor environment variables are set
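For example, with a built soperator-exporter binary, the precedence plays out like this (values are illustrative):

# Default applies when neither flag nor environment variable is set (30s).
./soperator-exporter --cluster-name=my-cluster

# An environment variable overrides the default...
export SLURM_EXPORTER_COLLECTION_INTERVAL=60s
./soperator-exporter --cluster-name=my-cluster                             # collects every 60s

# ...but an explicitly provided flag overrides the environment variable.
./soperator-exporter --cluster-name=my-cluster --collection-interval=15s  # collects every 15s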

Configuration Options

All configuration options can be set via command-line flags or environment variables:

| Environment Variable | Flag | Description | Default |
| --- | --- | --- | --- |
| SLURM_EXPORTER_CLUSTER_NAME | --cluster-name | The name of the SLURM cluster (required) | none |
| SLURM_EXPORTER_CLUSTER_NAMESPACE | --cluster-namespace | The namespace of the SLURM cluster | soperator |
| SLURM_EXPORTER_SLURM_API_SERVER | --slurm-api-server | The address of the SLURM REST API server | http://localhost:6820 |
| SLURM_EXPORTER_COLLECTION_INTERVAL | --collection-interval | How often to collect metrics from SLURM APIs | 30s |
| SLURM_EXPORTER_METRICS_BIND_ADDRESS | --metrics-bind-address | Address for the main metrics endpoint | :8080 |
| SLURM_EXPORTER_MONITORING_BIND_ADDRESS | --monitoring-bind-address | Address for the self-monitoring metrics endpoint | :8081 |
| SLURM_EXPORTER_LOG_FORMAT | --log-format | Log format: plain or json | json |
| SLURM_EXPORTER_LOG_LEVEL | --log-level | Log level: debug, info, warn, error | debug |

Exported Metrics

Node State Model

SLURM represents node state as a 32-bit integer where:

  • The lowest 4 bits encode one of six mutually exclusive base states: IDLE, DOWN, ALLOCATED, ERROR, MIXED, UNKNOWN
  • The remaining bits are flags that can be combined with base states: COMPLETING, DRAIN, MAINTENANCE, RESERVED, FAIL, PLANNED

For example, a node can be IDLE+DRAIN (idle but marked for draining) or ALLOCATED+COMPLETING (running jobs but finishing up).
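As a rough sketch of how such a packed state word decodes (the base-state ordering follows slurm.h.in, but the flag bit position below is a placeholder for illustration, not the authoritative constant):

# Illustrative decoding of a packed node state word in shell arithmetic.
# Base ordering per slurm.h.in: UNKNOWN=0, DOWN=1, IDLE=2, ALLOCATED=3, ERROR=4, MIXED=5.
# Bit 9 stands in for the DRAIN flag here for the sake of the example.
state=$(( 2 | (1 << 9) ))                                     # a hypothetical IDLE+DRAIN word
echo "base=$(( state & 0xF )) drain=$(( (state >> 9) & 1 ))"  # prints: base=2 drain=1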

Reference: https://github.com/SchedMD/slurm/blob/master/slurm/slurm.h.in

The exporter represents this as:

  • state_base label: The single base state (IDLE, ALLOCATED, etc.)
  • state_is_* labels: Boolean flags for additional state flags

Boolean state flag label convention:

  • Legacy flags (state_is_drain, state_is_maintenance, state_is_reserved): Use "true"/"false" for backward compatibility
  • New flags (state_is_completing, state_is_fail, state_is_planned): Use "true" or the empty string ("") to reduce label cardinality (VictoriaMetrics treats an empty label value as no label at all, which helps stay under its 30-label-per-series limit)
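To see the convention in practice, a hypothetical scrape of the main endpoint might contain a series like the following (the sample output is invented for illustration and abridged):

curl -s localhost:8080/metrics | grep '^slurm_node_info'
# slurm_node_info{node_name="worker-0",state_base="IDLE",state_is_drain="true",state_is_maintenance="false",state_is_reserved="false"} 1
# New flags such as state_is_completing carry "true" when set; with an empty value the label is effectively absent.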

Core Metrics (Node and Job)

slurm_node_info (Gauge)
Provides detailed information about SLURM nodes.

Labels:
  • node_name - Name of the SLURM node
  • instance_id - Kubernetes instance identifier
  • state_base - Base node state (IDLE, ALLOCATED, DOWN, ERROR, MIXED, UNKNOWN)
  • state_is_drain - Whether node is in drain state ("true"/"false")
  • state_is_maintenance - Whether node is in maintenance state ("true"/"false")
  • state_is_reserved - Whether node is in reserved state ("true"/"false")
  • state_is_completing - Whether node is in completing state ("true" or empty)
  • state_is_fail - Whether node is in fail state ("true" or empty)
  • state_is_planned - Whether node is in planned state ("true" or empty)
  • state_is_not_responding - Whether the node is marked as not responding ("true" or empty)
  • state_is_invalid - Whether the node state is considered invalid by SLURM ("true" or empty)
  • is_unavailable - Computed by the exporter: "true" when the node is considered unavailable (DOWN+* or IDLE+DRAIN+*), empty string otherwise
  • reservation_name - Reservation that currently includes the node (trimmed to 50 characters)
  • address - IP address of the node
  • reason - Reason for current node state (empty string if node has no reason set)
  • comment - Comment set on the node (e.g., by active checks when a GPU health check fails)

slurm_node_gpu_seconds_total (Counter)
Total GPU seconds accumulated on SLURM nodes.

Labels:
  • node_name - Name of the SLURM node
  • state_base - Base node state
  • state_is_drain - Drain state flag
  • state_is_maintenance - Maintenance state flag
  • state_is_reserved - Reserved state flag

slurm_node_fails_total (Counter)
Total number of node state transitions to failed states (DOWN/DRAIN).

Labels:
  • node_name - Name of the SLURM node
  • state_base - Base node state at time of failure
  • state_is_drain - Drain state flag
  • state_is_maintenance - Maintenance state flag
  • state_is_reserved - Reserved state flag
  • reason - Reason for the node failure

slurm_node_unavailability_duration_seconds (Histogram)
Duration of completed node unavailability events (DOWN+* or IDLE+DRAIN+*).

Labels:
  • node_name - Name of the SLURM node

Note: Observations are recorded when unavailability events complete. Duration tracking is reset on exporter restarts, which may affect accuracy.

slurm_node_draining_duration_seconds (Histogram)
Duration of completed node draining events (DRAIN+ALLOCATED or DRAIN+MIXED).

Labels:
  • node_name - Name of the SLURM node

Note: Observations are recorded when draining events complete. Duration tracking is reset on exporter restarts, which may affect accuracy.

slurm_node_cpus_total (Gauge)
Total number of CPUs on the node.

Labels:
  • node_name - Name of the SLURM node

slurm_node_cpus_allocated (Gauge)
Number of CPUs currently allocated on the node.

Labels:
  • node_name - Name of the SLURM node

slurm_node_cpus_idle (Gauge)
Number of idle CPUs on the node.

Labels:
  • node_name - Name of the SLURM node

slurm_node_cpus_effective (Gauge)
Effective CPUs on the node (excluding specialized CPUs reserved for system daemons).

Labels:
  • node_name - Name of the SLURM node

slurm_node_memory_total_bytes (Gauge)
Total memory on the node in bytes.

Labels:
  • node_name - Name of the SLURM node

slurm_node_memory_allocated_bytes (Gauge)
Allocated memory on the node in bytes.

Labels:
  • node_name - Name of the SLURM node

slurm_node_memory_free_bytes (Gauge)
Free memory on the node in bytes.

Labels:
  • node_name - Name of the SLURM node

slurm_node_memory_effective_bytes (Gauge)
Effective memory on the node in bytes (total minus specialized memory reserved for system daemons).

Labels:
  • node_name - Name of the SLURM node

slurm_node_partition (Gauge)
Maps nodes to their partitions, enabling partition-level aggregation via PromQL joins (see the example below).

Labels:
  • node_name - Name of the SLURM node
  • partition - Name of the SLURM partition
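As a sketch of such a join (assuming a Prometheus-compatible server at localhost:9090 and that slurm_node_partition has a value of 1), allocated CPUs can be summed per partition like this:

# Hypothetical query: sum allocated CPUs per partition by joining on node_name.
# group_left() handles nodes that belong to more than one partition.
curl -G 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum by (partition) (slurm_node_partition * on(node_name) group_left() slurm_node_cpus_allocated)'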
slurm_job_info (Gauge)
Detailed information about SLURM jobs.

Labels:
  • job_id - SLURM job identifier
  • job_state - Current job state (PENDING, RUNNING, COMPLETED, FAILED, etc.)
  • job_state_reason - Reason for current job state
  • slurm_partition - SLURM partition name
  • job_name - User-defined job name
  • user_name - Username of the user who submitted the job
  • user_id - Numeric ID of the user who submitted the job
  • standard_error - Path to the stderr file
  • standard_output - Path to the stdout file
  • array_job_id - Array job ID (if applicable)
  • array_task_id - Array task ID (if applicable)
  • submit_time - When the job was submitted (Unix timestamp in seconds; empty if not available or zero)
  • start_time - When the job started execution (Unix timestamp in seconds; empty if not available or zero)
  • end_time - When the job completed (Unix timestamp in seconds; empty if not available or zero). Warning: for non-terminal states like RUNNING, this may contain a future timestamp representing the forecasted end time based on the job's time limit
  • finished_time - When the job actually finished, for terminal states only (Unix timestamp in seconds; empty for non-terminal states or if end_time is zero). Unlike end_time, this field only contains actual completion times, never forecasted values

slurm_node_job (Gauge)
Mapping between jobs and the nodes they're running on.

Labels:
  • job_id - SLURM job identifier
  • node_name - Name of the node running the job

slurm_job_duration_seconds (Gauge)
Job duration in seconds. For running jobs, this is the time elapsed since the job started; for completed jobs, it is the total execution time.

Labels:
  • job_id - SLURM job identifier

Notes:
  • Only exported for jobs with a valid start time
  • For non-terminal states (RUNNING, etc.): duration = current_time - start_time
  • For terminal states (COMPLETED, FAILED, etc.): duration = end_time - start_time (only if end_time is valid)

slurm_job_cpus (Gauge)
Number of CPUs allocated to the job.

Labels:
  • job_id - SLURM job identifier

slurm_job_memory_bytes (Gauge)
Memory allocated to the job in bytes.

Labels:
  • job_id - SLURM job identifier
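Job metrics can be queried the same way; for instance, a hypothetical count of currently running jobs (again assuming a query endpoint at localhost:9090):

curl -G 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=count(slurm_job_info{job_state="RUNNING"})'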

Controller RPC Metrics

These metrics provide insights into SLURM controller performance, similar to the output of the sdiag command, and were implemented to address issue #1027.

slurm_controller_rpc_calls_total (Counter)
Total count of RPC calls by message type.

Labels:
  • message_type - Type of RPC message (e.g., REQUEST_NODE_INFO, REQUEST_JOB_INFO, REQUEST_PING)

slurm_controller_rpc_duration_seconds_total (Counter)
Total time spent processing RPCs by message type (converted from microseconds).

Labels:
  • message_type - Type of RPC message

slurm_controller_rpc_user_calls_total (Counter)
Total count of RPC calls by user.

Labels:
  • user - Username making the RPC calls
  • user_id - Numeric user ID

slurm_controller_rpc_user_duration_seconds_total (Counter)
Total time spent on user RPCs (converted from microseconds).

Labels:
  • user - Username making the RPC calls
  • user_id - Numeric user ID

slurm_controller_server_thread_count (Gauge)
Number of server threads in the SLURM controller.

Labels: None
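Since these are counters, they are typically wrapped in rate() when queried. A hypothetical per-message-type RPC rate over the last five minutes (query endpoint assumed at localhost:9090):

curl -G 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum by (message_type) (rate(slurm_controller_rpc_calls_total[5m]))'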

Self-Monitoring Metrics

The exporter provides self-monitoring metrics to track its own health and performance. These metrics are available on a separate endpoint (default port 8081) to avoid mixing operational metrics with business metrics.

slurm_exporter_collection_duration_seconds (Gauge)
Duration of the most recent metrics collection from SLURM APIs.

Labels: None

slurm_exporter_collection_attempts_total (Counter)
Total number of metrics collection attempts.

Labels: None

slurm_exporter_collection_failures_total (Counter)
Total number of failed metrics collection attempts.

Labels: None

slurm_exporter_metrics_requests_total (Counter)
Total number of requests to the /metrics endpoint.

Labels: None

slurm_exporter_metrics_exported (Gauge)
Number of metrics exported in the last scrape.

Labels: None
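A hypothetical health check built on these counters, showing the fraction of failed collection attempts over the last hour (query endpoint assumed at localhost:9090):

curl -G 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=rate(slurm_exporter_collection_failures_total[1h]) / rate(slurm_exporter_collection_attempts_total[1h])'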

Accessing Self-Monitoring Metrics

To access self-monitoring metrics:

# Default monitoring port
curl http://localhost:8081/metrics

# Or with custom monitoring address
./soperator-exporter --monitoring-bind-address=:9090
curl http://localhost:9090/metrics

Local Development

To run the exporter locally against a cluster for debugging:

  1. Port-forward the SLURM REST API service:
     kubectl port-forward -n soperator svc/soperator-rest-svc 6820:6820
  2. Run the exporter (it finds the JWT secret in the cluster automatically):
     go run ./cmd/exporter/main.go --cluster-name=soperator --kubeconfig-path=$HOME/.kube/config
  3. View the metrics:
     curl localhost:8080/metrics

Grafana Dashboard Example

The SLURM Exporter integrates with existing Grafana dashboards. Here's an example based on the production dashboard from nebius-solutions-library.