Prometheus collector and exporter for metrics extracted from the Slurm resource scheduling system.
> ⚠️ **Warning**
>
> This repository will no longer be actively maintained starting with Slurm version 25.11, as Slurm natively integrates support for OpenMetrics metrics for Prometheus. Please consider migrating to the Slurm 25.11 metrics plugin.
I developed a new Slurm exporter to simplify and improve the usage of Slurm metrics: https://github.com/sckyzo/slurm_prometheus_exporter/
✨ Features:
- ✅ Export native OpenMetrics from Slurm (version 25.11+)
- ✅ Support for multiple endpoints (jobs, jobs-users-accts, nodes, partitions, scheduler)
- ✅ Basic Authentication and SSL/TLS support
- ✅ Customizable global labels for all metrics
- ✅ Easy configuration with YAML
- ✅ Built with Clean Architecture principles
- ✅ Comprehensive error handling and logging
Prometheus Slurm Exporter:
- ✅ Exports a wide range of metrics from Slurm, including nodes, partitions, jobs, CPUs, and GPUs.
- ✅ All metric collectors are optional and can be enabled/disabled via flags.
- ✅ Supports TLS and Basic Authentication for secure connections.
- ✅ Ready-to-use Grafana dashboard.
There are two recommended ways to install the Slurm Exporter.

This is the easiest method for most users.

1. Download the latest release for your OS and architecture from the GitHub Releases page.
2. Place the `slurm_exporter` binary in a suitable location on a node with Slurm CLI access, such as `/usr/local/bin/`.
3. Ensure the binary is executable:

   ```shell
   chmod +x /usr/local/bin/slurm_exporter
   ```
4. (Optional) To run the exporter as a service, adapt the example systemd unit file provided in this repository at `systemd/slurm_exporter.service`. Copy it to `/etc/systemd/system/slurm_exporter.service` and customize it for your environment (especially the `ExecStart` path). Then reload the systemd daemon, enable, and start the service:

   ```shell
   sudo systemctl daemon-reload
   sudo systemctl enable slurm_exporter
   sudo systemctl start slurm_exporter
   ```
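For reference, a minimal unit file might look like the sketch below; the repository's `systemd/slurm_exporter.service` is the authoritative template, and the service user and binary path here are assumptions to adapt to your environment.

```ini
[Unit]
Description=Prometheus Slurm Exporter
After=network-online.target

[Service]
# Run as an unprivileged user that can execute the Slurm CLI tools (assumed name)
User=slurm_exporter
ExecStart=/usr/local/bin/slurm_exporter --web.listen-address=":9341"
Restart=on-failure

[Install]
WantedBy=multi-user.target
```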
If you want to build the exporter yourself, you can do so using the provided Makefile.

1. Clone the repository:

   ```shell
   git clone https://github.com/sckyzo/slurm_exporter.git
   cd slurm_exporter
   ```

2. Build the binary:

   ```shell
   make build
   ```

3. The new binary will be available at `bin/slurm_exporter`. You can then copy it to a location like `/usr/local/bin/` and set up the systemd service as described in the section above.
The exporter can be configured using command-line flags.

Basic execution:

```shell
./slurm_exporter --web.listen-address=":9341"
```

Using a configuration file for web settings (TLS/Basic Auth):

```shell
./slurm_exporter --web.config.file=/path/to/web-config.yml
```

For details on the `web-config.yml` format, see the Exporter Toolkit documentation.

View help and all available options:

```shell
./slurm_exporter --help
```

| Flag | Description | Default |
|---|---|---|
| `--web.listen-address` | Address to listen on for web interface and telemetry | `:9341` |
| `--web.config.file` | Path to configuration file for TLS/Basic Auth | (none) |
| `--command.timeout` | Timeout for executing Slurm commands | `5s` |
| `--log.level` | Log level: `debug`, `info`, `warn`, `error` | `info` |
| `--log.format` | Log format: `json`, `text` | `text` |
| `--collector.<name>` | Enable the specified collector | `true` (all enabled by default) |
| `--no-collector.<name>` | Disable the specified collector | (none) |
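As a sketch, a `web-config.yml` enabling TLS and Basic Auth might look like this; the certificate paths, user name, and bcrypt hash are placeholders, and the Exporter Toolkit documentation describes the full format:

```yaml
tls_server_config:
  cert_file: /etc/slurm_exporter/server.crt
  key_file: /etc/slurm_exporter/server.key
basic_auth_users:
  # The value is a bcrypt hash of the password, not the password itself
  prometheus: $2y$10$REPLACE_WITH_A_REAL_BCRYPT_HASH
```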
Available collectors: accounts, cpus, fairshare, gpus, info, node, nodes, partitions, queue, reservations, scheduler, users, licenses
By default, all collectors are enabled.
You can control which collectors are active using the --collector.<name> and --no-collector.<name> flags.
Example: Disable the scheduler and partitions collectors

```shell
./slurm_exporter --no-collector.scheduler --no-collector.partitions
```

Example: Disable the gpus collector

```shell
./slurm_exporter --no-collector.gpus
```

Example: Run only the nodes and cpus collectors

This requires disabling all other collectors individually.

```shell
./slurm_exporter \
  --no-collector.accounts \
  --no-collector.fairshare \
  --no-collector.gpus \
  --no-collector.info \
  --no-collector.licenses \
  --no-collector.node \
  --no-collector.partitions \
  --no-collector.queue \
  --no-collector.reservations \
  --no-collector.scheduler \
  --no-collector.users
```

Example: Custom timeout and logging

```shell
./slurm_exporter \
  --command.timeout=10s \
  --log.level=debug \
  --log.format=json
```

This project requires access to a node with the Slurm CLI (`sinfo`, `squeue`, `sdiag`, etc.).
- Go (version 1.22 or higher recommended)
- Slurm CLI tools available in your `$PATH`
1. Clone this repository:

   ```shell
   git clone https://github.com/sckyzo/slurm_exporter.git
   cd slurm_exporter
   ```

2. Build the exporter binary:

   ```shell
   make build
   ```

   The binary will be available in `bin/slurm_exporter`.
To run all tests:

```shell
make test
```

Clean build artifacts:

```shell
make clean
```

Run the exporter locally:

```shell
bin/slurm_exporter --web.listen-address=:8080
```

Query metrics:

```shell
curl http://localhost:8080/metrics
```

Advanced build options: you can override the Go version and architecture via environment variables:

```shell
make build GO_VERSION=1.22.2 OS=linux ARCH=amd64
```

The exporter provides a wide range of metrics, each collected by a specific, toggleable collector.
Provides job statistics aggregated by Slurm account.

- Command: `squeue -a -r -h -o "%A|%a|%T|%C"`

| Metric | Description | Labels |
|---|---|---|
| `slurm_account_jobs_pending` | Pending jobs for account | `account` |
| `slurm_account_jobs_running` | Running jobs for account | `account` |
| `slurm_account_cpus_running` | Running CPUs for account | `account` |
| `slurm_account_jobs_suspended` | Suspended jobs for account | `account` |
Provides global statistics on CPU states for the entire cluster.

- Command: `sinfo -h -o "%C"`

| Metric | Description | Labels |
|---|---|---|
| `slurm_cpus_alloc` | Allocated CPUs | (none) |
| `slurm_cpus_idle` | Idle CPUs | (none) |
| `slurm_cpus_other` | CPUs in other states | (none) |
| `slurm_cpus_total` | Total CPUs | (none) |
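`sinfo -h -o "%C"` prints a single `allocated/idle/other/total` CPU count. As an illustration of how one such line maps onto the four metrics above (the sample numbers are made up):

```shell
# Sample output of `sinfo -h -o "%C"`: allocated/idle/other/total
line="40/88/0/128"

# Split the line on "/" into the four CPU counts
IFS=/ read -r alloc idle other total <<EOF
$line
EOF

printf 'slurm_cpus_alloc %s\n' "$alloc"
printf 'slurm_cpus_idle %s\n'  "$idle"
printf 'slurm_cpus_other %s\n' "$other"
printf 'slurm_cpus_total %s\n' "$total"
```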
Reports the calculated fairshare factor for each account.

- Command: `sshare -n -P -o "account,fairshare"`

| Metric | Description | Labels |
|---|---|---|
| `slurm_account_fairshare` | FairShare for account | `account` |
Provides global statistics on GPU states for the entire cluster.

> ⚠️ Note: This collector is enabled by default. Disable it with `--no-collector.gpus` if not needed.

- Command: `sinfo` (with various formats)

| Metric | Description | Labels |
|---|---|---|
| `slurm_gpus_alloc` | Allocated GPUs | (none) |
| `slurm_gpus_idle` | Idle GPUs | (none) |
| `slurm_gpus_other` | Other GPUs | (none) |
| `slurm_gpus_total` | Total GPUs | (none) |
| `slurm_gpus_utilization` | Total GPU utilization | (none) |
Exposes the version of Slurm and the availability of different Slurm binaries.

- Command: `<binary> --version`

| Metric | Description | Labels |
|---|---|---|
| `slurm_info` | Information on Slurm version and binaries | `type`, `binary`, `version` |
Provides metrics on license counts and usage.

- Command: `scontrol show licenses -o`

| Metric | Description | Labels |
|---|---|---|
| `slurm_license_total` | Total count for license | `license` |
| `slurm_license_used` | Used count for license | `license` |
| `slurm_license_free` | Free count for license | `license` |
Provides detailed, per-node metrics for CPU and memory usage.

- Command: `sinfo -h -N -O "NodeList,AllocMem,Memory,CPUsState,StateLong,Partition"`

| Metric | Description | Labels |
|---|---|---|
| `slurm_node_cpu_alloc` | Allocated CPUs per node | `node`, `status`, `partition` |
| `slurm_node_cpu_idle` | Idle CPUs per node | `node`, `status`, `partition` |
| `slurm_node_cpu_other` | Other CPUs per node | `node`, `status`, `partition` |
| `slurm_node_cpu_total` | Total CPUs per node | `node`, `status`, `partition` |
| `slurm_node_mem_alloc` | Allocated memory per node | `node`, `status`, `partition` |
| `slurm_node_mem_total` | Total memory per node | `node`, `status`, `partition` |
| `slurm_node_status` | Node status with partition (1 if up) | `node`, `status`, `partition` |
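As an illustration, a Prometheus recording rule deriving per-node memory utilization from the two memory gauges above; the rule name is a suggestion, not something shipped with the exporter:

```yaml
groups:
  - name: slurm_node_derived
    rules:
      # Fraction of memory allocated on each node (0.0 to 1.0)
      - record: slurm:node_mem_utilization:ratio
        expr: slurm_node_mem_alloc / slurm_node_mem_total
```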
Provides aggregated metrics on node states for the cluster.

- Commands: `sinfo -h -o "%D|%T|%b"`, `scontrol show nodes -o`

| Metric | Description | Labels |
|---|---|---|
| `slurm_nodes_alloc` | Allocated nodes | `partition`, `active_feature_set` |
| `slurm_nodes_comp` | Completing nodes | `partition`, `active_feature_set` |
| `slurm_nodes_down` | Down nodes | `partition`, `active_feature_set` |
| `slurm_nodes_drain` | Draining nodes | `partition`, `active_feature_set` |
| `slurm_nodes_err` | Error nodes | `partition`, `active_feature_set` |
| `slurm_nodes_fail` | Failed nodes | `partition`, `active_feature_set` |
| `slurm_nodes_idle` | Idle nodes | `partition`, `active_feature_set` |
| `slurm_nodes_inval` | Invalid nodes | `partition`, `active_feature_set` |
| `slurm_nodes_maint` | Maintenance nodes | `partition`, `active_feature_set` |
| `slurm_nodes_mix` | Mixed nodes | `partition`, `active_feature_set` |
| `slurm_nodes_resv` | Reserved nodes | `partition`, `active_feature_set` |
| `slurm_nodes_other` | Nodes reported with an unknown state | `partition`, `active_feature_set` |
| `slurm_nodes_planned` | Planned nodes | `partition`, `active_feature_set` |
| `slurm_nodes_total` | Total number of nodes | (none) |
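For example, an alerting rule built on these gauges; this is a sketch, and the threshold, duration, and labels are assumptions to tune for your cluster:

```yaml
groups:
  - name: slurm_node_alerts
    rules:
      - alert: SlurmNodesDown
        # Fires when any partition reports at least one down node for 10 minutes
        expr: sum by (partition) (slurm_nodes_down) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "{{ $value }} Slurm node(s) down in partition {{ $labels.partition }}"
```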
Provides metrics on CPU usage and pending jobs for each partition.

- Commands: `sinfo -h -o "%R,%C"`, `squeue -a -r -h -o "%P" --states=PENDING`

| Metric | Description | Labels |
|---|---|---|
| `slurm_partition_cpus_allocated` | Allocated CPUs for partition | `partition` |
| `slurm_partition_cpus_idle` | Idle CPUs for partition | `partition` |
| `slurm_partition_cpus_other` | Other CPUs for partition | `partition` |
| `slurm_partition_cpus_total` | Total CPUs for partition | `partition` |
| `slurm_partition_jobs_pending` | Pending jobs for partition | `partition` |
| `slurm_partition_jobs_running` | Running jobs for partition | `partition` |
| `slurm_partition_gpus_idle` | Idle GPUs for partition | `partition` |
| `slurm_partition_gpus_allocated` | Allocated GPUs for partition | `partition` |
Provides detailed metrics on job states and resource usage.

- Command: `squeue -h -o "%P,%T,%C,%r,%u"`

| Metric | Description | Labels |
|---|---|---|
| `slurm_queue_pending` | Pending jobs in queue | `user`, `partition`, `reason` |
| `slurm_queue_running` | Running jobs in the cluster | `user`, `partition` |
| `slurm_queue_suspended` | Suspended jobs in the cluster | `user`, `partition` |
| `slurm_cores_pending` | Pending cores in queue | `user`, `partition`, `reason` |
| `slurm_cores_running` | Running cores in the cluster | `user`, `partition` |
| ... | (and many other states: completed, failed, etc.) | `user`, `partition` |
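For dashboards, these per-user/per-partition series are typically rolled up with `sum by (...)`. A sketch of two such recording rules (the rule names are suggestions, not shipped with the exporter):

```yaml
groups:
  - name: slurm_queue_derived
    rules:
      # Total pending jobs per partition, across users and reasons
      - record: slurm:queue_pending:by_partition
        expr: sum by (partition) (slurm_queue_pending)
      # Total running cores per user
      - record: slurm:cores_running:by_user
        expr: sum by (user) (slurm_cores_running)
```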
Provides metrics about active Slurm reservations.

- Command: `scontrol show reservation`

| Metric | Description | Labels |
|---|---|---|
| `slurm_reservation_info` | A metric with a constant '1' value labeled by reservation details | `reservation_name`, `state`, `users`, `nodes`, `partition`, `flags` |
| `slurm_reservation_start_time_seconds` | Start time of the reservation in seconds since the Unix epoch | `reservation_name` |
| `slurm_reservation_end_time_seconds` | End time of the reservation in seconds since the Unix epoch | `reservation_name` |
| `slurm_reservation_node_count` | Number of nodes allocated to the reservation | `reservation_name` |
| `slurm_reservation_core_count` | Number of cores allocated to the reservation | `reservation_name` |
Provides internal performance metrics from the slurmctld daemon.

- Command: `sdiag`

| Metric | Description | Labels |
|---|---|---|
| `slurm_scheduler_threads` | Number of scheduler threads | (none) |
| `slurm_scheduler_queue_size` | Length of the scheduler queue | (none) |
| `slurm_scheduler_mean_cycle` | Scheduler mean cycle time (microseconds) | (none) |
| `slurm_rpc_stats` | RPC count statistic | `operation` |
| `slurm_user_rpc_stats` | RPC count statistic per user | `user` |
| ... | (and many other backfill and RPC time metrics) | `operation` or `user` |
Provides job statistics aggregated by user.

- Command: `squeue -a -r -h -o "%A|%u|%T|%C"`

| Metric | Description | Labels |
|---|---|---|
| `slurm_user_jobs_pending` | Pending jobs for user | `user` |
| `slurm_user_jobs_running` | Running jobs for user | `user` |
| `slurm_user_cpus_running` | Running CPUs for user | `user` |
| `slurm_user_jobs_suspended` | Suspended jobs for user | `user` |
```yaml
scrape_configs:
  - job_name: 'slurm_exporter'
    scrape_interval: 30s
    scrape_timeout: 30s
    static_configs:
      - targets: ['slurm_host.fqdn:9341']
```

- scrape_interval: a 30s interval is recommended to avoid overloading the Slurm master with frequent command executions.
- scrape_timeout: should be equal to or less than the `scrape_interval` to prevent `context_deadline_exceeded` errors.
Check the config:

```shell
promtool check config prometheus.yml
```

- Command Timeout: the default timeout is 5 seconds. Increase it if Slurm commands take longer in your environment:

  ```shell
  ./slurm_exporter --command.timeout=10s
  ```

- Scrape Interval: use at least 30 seconds to avoid overloading the Slurm controller with frequent command executions.

- Collector Selection: disable unused collectors to reduce load and improve performance:

  ```shell
  ./slurm_exporter --no-collector.fairshare --no-collector.reservations
  ```
A Grafana dashboard is available.
This project is licensed under the GNU General Public License, version 3 or later.
This project is a fork of cea-hpc/slurm_exporter, which itself is a fork of vpenso/prometheus-slurm-exporter (now apparently unmaintained).
Feel free to contribute or open issues!