Prometheus collector and exporter for metrics extracted from the Slurm resource scheduling system.
> ⚠️ **Warning**
>
> This repository will no longer be actively maintained starting with Slurm version 25.11, as Slurm natively integrates support for OpenMetrics metrics for Prometheus. Please consider migrating to the Slurm 25.11 metrics plugin.
I developed a new Slurm exporter to simplify and improve the usage of Slurm metrics: https://github.com/sckyzo/slurm_prometheus_exporter/
✨ Features:
- ✅ Export native OpenMetrics from Slurm (version 25.11+)
- ✅ Support for multiple endpoints (jobs, jobs-users-accts, nodes, partitions, scheduler)
- ✅ Basic Authentication and SSL/TLS support
- ✅ Customizable global labels for all metrics
- ✅ Easy configuration with YAML
- ✅ Built with Clean Architecture principles
- ✅ Comprehensive error handling and logging
Prometheus Slurm Exporter:
- ✅ Exports a wide range of metrics from Slurm, including nodes, partitions, jobs, CPUs, and GPUs.
- ✅ All metric collectors are optional and can be enabled/disabled via flags.
- ✅ Supports TLS and Basic Authentication for secure connections.
- ✅ Ready-to-use Grafana dashboard.
There are two recommended ways to install the Slurm Exporter.

This is the easiest method for most users.

1. Download the latest release for your OS and architecture from the GitHub Releases page.
2. Place the `slurm_exporter` binary in a suitable location on a node with Slurm CLI access, such as `/usr/local/bin/`.
3. Ensure the binary is executable:

   ```shell
   chmod +x /usr/local/bin/slurm_exporter
   ```
4. (Optional) To run the exporter as a service, adapt the example systemd unit file provided in this repository at `systemd/slurm_exporter.service`. Copy it to `/etc/systemd/system/slurm_exporter.service` and customize it for your environment (especially the `ExecStart` path). Then reload the systemd daemon, enable, and start the service:

   ```shell
   sudo systemctl daemon-reload
   sudo systemctl enable slurm_exporter
   sudo systemctl start slurm_exporter
   ```
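For reference, a minimal unit file might look like the sketch below; the repository's `systemd/slurm_exporter.service` is the authoritative template, and the service user and binary path here are assumptions to adapt to your environment.

```ini
[Unit]
Description=Prometheus Slurm Exporter
After=network-online.target

[Service]
# Run as an unprivileged user that can execute the Slurm CLI tools (assumed name)
User=slurm_exporter
ExecStart=/usr/local/bin/slurm_exporter --web.listen-address=":9341"
Restart=on-failure

[Install]
WantedBy=multi-user.target
```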
If you want to build the exporter yourself, you can do so using the provided Makefile.

1. Clone the repository:

   ```shell
   git clone https://github.com/sckyzo/slurm_exporter.git
   cd slurm_exporter
   ```

2. Build the binary:

   ```shell
   make build
   ```

3. The new binary will be available at `bin/slurm_exporter`. You can then copy it to a location like `/usr/local/bin/` and set up the systemd service as described in the section above.
The exporter can be configured using command-line flags.

Basic execution:

```shell
./slurm_exporter --web.listen-address=":9341"
```

Using a configuration file for web settings (TLS/Basic Auth):

```shell
./slurm_exporter --web.config.file=/path/to/web-config.yml
```

For details on the `web-config.yml` format, see the Exporter Toolkit documentation.

View help and all available options:

```shell
./slurm_exporter --help
```

| Flag | Description | Default |
|---|---|---|
| `--web.listen-address` | Address to listen on for web interface and telemetry | `:9341` |
| `--web.config.file` | Path to configuration file for TLS/Basic Auth | (none) |
| `--command.timeout` | Timeout for executing Slurm commands | `5s` |
| `--log.level` | Log level: `debug`, `info`, `warn`, `error` | `info` |
| `--log.format` | Log format: `json`, `text` | `text` |
| `--collector.<name>` | Enable the specified collector | `true` (all enabled by default) |
| `--no-collector.<name>` | Disable the specified collector | (none) |
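As a sketch, a `web-config.yml` enabling TLS and Basic Auth might look like this; the certificate paths, user name, and bcrypt hash are placeholders, and the Exporter Toolkit documentation describes the full format:

```yaml
tls_server_config:
  cert_file: /etc/slurm_exporter/server.crt
  key_file: /etc/slurm_exporter/server.key
basic_auth_users:
  # The value is a bcrypt hash of the password, not the password itself
  prometheus: $2y$10$REPLACE_WITH_A_REAL_BCRYPT_HASH
```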
Available collectors: accounts, cpus, fairshare, gpus, info, node, nodes, partitions, queue, reservations, scheduler, users, licenses
By default, all collectors are enabled.
You can control which collectors are active using the --collector.<name> and --no-collector.<name> flags.
Example: Disable the scheduler and partitions collectors

```shell
./slurm_exporter --no-collector.scheduler --no-collector.partitions
```

Example: Disable the gpus collector

```shell
./slurm_exporter --no-collector.gpus
```

Example: Run only the nodes and cpus collectors

This requires disabling all other collectors individually.

```shell
./slurm_exporter \
  --no-collector.accounts \
  --no-collector.fairshare \
  --no-collector.gpus \
  --no-collector.info \
  --no-collector.licenses \
  --no-collector.node \
  --no-collector.partitions \
  --no-collector.queue \
  --no-collector.reservations \
  --no-collector.scheduler \
  --no-collector.users
```

Example: Custom timeout and logging

```shell
./slurm_exporter \
  --command.timeout=10s \
  --log.level=debug \
  --log.format=json
```

This project requires access to a node with the Slurm CLI (`sinfo`, `squeue`, `sdiag`, etc.).
- Go (version 1.22 or higher recommended)
- Slurm CLI tools available in your `$PATH`
1. Clone this repository:

   ```shell
   git clone https://github.com/sckyzo/slurm_exporter.git
   cd slurm_exporter
   ```

2. Build the exporter binary:

   ```shell
   make build
   ```

   The binary will be available in `bin/slurm_exporter`.
To run all tests:

```shell
make test
```

Clean build artifacts:

```shell
make clean
```

Run the exporter locally:

```shell
bin/slurm_exporter --web.listen-address=:8080
```

Query metrics:

```shell
curl http://localhost:8080/metrics
```

Advanced build options: you can override the Go version and architecture via environment variables:

```shell
make build GO_VERSION=1.22.2 OS=linux ARCH=amd64
```

The exporter provides a wide range of metrics, each collected by a specific, toggleable collector.
Provides job statistics aggregated by Slurm account.

- Command: `squeue -a -r -h -o "%A|%a|%T|%C"`

| Metric | Description | Labels |
|---|---|---|
| `slurm_account_jobs_pending` | Pending jobs for account | `account` |
| `slurm_account_jobs_running` | Running jobs for account | `account` |
| `slurm_account_cpus_running` | Running CPUs for account | `account` |
| `slurm_account_jobs_suspended` | Suspended jobs for account | `account` |
Provides global statistics on CPU states for the entire cluster.

- Command: `sinfo -h -o "%C"`

| Metric | Description | Labels |
|---|---|---|
| `slurm_cpus_alloc` | Allocated CPUs | (none) |
| `slurm_cpus_idle` | Idle CPUs | (none) |
| `slurm_cpus_other` | CPUs in other states | (none) |
| `slurm_cpus_total` | Total CPUs | (none) |
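`sinfo -h -o "%C"` prints a single `allocated/idle/other/total` CPU count. As an illustration of how one such line maps onto the four metrics above (the sample numbers are made up):

```shell
# Sample output of `sinfo -h -o "%C"`: allocated/idle/other/total
line="40/88/0/128"

# Split the line on "/" into the four CPU counts
IFS=/ read -r alloc idle other total <<EOF
$line
EOF

printf 'slurm_cpus_alloc %s\n' "$alloc"
printf 'slurm_cpus_idle %s\n'  "$idle"
printf 'slurm_cpus_other %s\n' "$other"
printf 'slurm_cpus_total %s\n' "$total"
```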
Reports the calculated fairshare factor for each account.

- Command: `sshare -n -P -o "account,fairshare"`

| Metric | Description | Labels |
|---|---|---|
| `slurm_account_fairshare` | FairShare for account | `account` |
Provides global statistics on GPU states for the entire cluster.

> ⚠️ Note: This collector is enabled by default. Disable it with `--no-collector.gpus` if not needed.

- Command: `sinfo` (with various formats)

| Metric | Description | Labels |
|---|---|---|
| `slurm_gpus_alloc` | Allocated GPUs | (none) |
| `slurm_gpus_idle` | Idle GPUs | (none) |
| `slurm_gpus_other` | Other GPUs | (none) |
| `slurm_gpus_total` | Total GPUs | (none) |
| `slurm_gpus_utilization` | Total GPU utilization | (none) |
Exposes the version of Slurm and the availability of different Slurm binaries.

- Command: `<binary> --version`

| Metric | Description | Labels |
|---|---|---|
| `slurm_info` | Information on Slurm version and binaries | `type`, `binary`, `version` |
Provides metrics on license counts and usage.

- Command: `scontrol show licenses -o`

| Metric | Description | Labels |
|---|---|---|
| `slurm_license_total` | Total count for license | `license` |
| `slurm_license_used` | Used count for license | `license` |
| `slurm_license_free` | Free count for license | `license` |
Provides detailed, per-node metrics for CPU and memory usage.

- Command: `sinfo -h -N -O "NodeList,AllocMem,Memory,CPUsState,StateLong,Partition"`

| Metric | Description | Labels |
|---|---|---|
| `slurm_node_cpu_alloc` | Allocated CPUs per node | `node`, `status`, `partition` |
| `slurm_node_cpu_idle` | Idle CPUs per node | `node`, `status`, `partition` |
| `slurm_node_cpu_other` | Other CPUs per node | `node`, `status`, `partition` |
| `slurm_node_cpu_total` | Total CPUs per node | `node`, `status`, `partition` |
| `slurm_node_mem_alloc` | Allocated memory per node | `node`, `status`, `partition` |
| `slurm_node_mem_total` | Total memory per node | `node`, `status`, `partition` |
| `slurm_node_status` | Node status with partition (1 if up) | `node`, `status`, `partition` |
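As an illustration, a Prometheus recording rule deriving per-node memory utilization from the two memory gauges above; the rule name is a suggestion, not something shipped with the exporter:

```yaml
groups:
  - name: slurm_node_derived
    rules:
      # Fraction of memory allocated on each node (0.0 to 1.0)
      - record: slurm:node_mem_utilization:ratio
        expr: slurm_node_mem_alloc / slurm_node_mem_total
```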
Provides aggregated metrics on node states for the cluster.

- Commands: `sinfo -h -o "%D|%T|%b"`, `scontrol show nodes -o`

| Metric | Description | Labels |
|---|---|---|
| `slurm_nodes_alloc` | Allocated nodes | `partition`, `active_feature_set` |
| `slurm_nodes_comp` | Completing nodes | `partition`, `active_feature_set` |
| `slurm_nodes_down` | Down nodes | `partition`, `active_feature_set` |
| `slurm_nodes_drain` | Draining nodes | `partition`, `active_feature_set` |
| `slurm_nodes_err` | Error nodes | `partition`, `active_feature_set` |
| `slurm_nodes_fail` | Failed nodes | `partition`, `active_feature_set` |
| `slurm_nodes_idle` | Idle nodes | `partition`, `active_feature_set` |
| `slurm_nodes_inval` | Invalid nodes | `partition`, `active_feature_set` |
| `slurm_nodes_maint` | Maintenance nodes | `partition`, `active_feature_set` |
| `slurm_nodes_mix` | Mixed nodes | `partition`, `active_feature_set` |
| `slurm_nodes_resv` | Reserved nodes | `partition`, `active_feature_set` |
| `slurm_nodes_other` | Nodes reported with an unknown state | `partition`, `active_feature_set` |
| `slurm_nodes_planned` | Planned nodes | `partition`, `active_feature_set` |
| `slurm_nodes_total` | Total number of nodes | (none) |
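For example, an alerting rule built on these gauges; this is a sketch, and the threshold, duration, and labels are assumptions to tune for your cluster:

```yaml
groups:
  - name: slurm_node_alerts
    rules:
      - alert: SlurmNodesDown
        # Fires when any partition reports at least one down node for 10 minutes
        expr: sum by (partition) (slurm_nodes_down) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "{{ $value }} Slurm node(s) down in partition {{ $labels.partition }}"
```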
Provides metrics on CPU usage and pending jobs for each partition.

- Commands: `sinfo -h -o "%R,%C"`, `squeue -a -r -h -o "%P" --states=PENDING`

| Metric | Description | Labels |
|---|---|---|
| `slurm_partition_cpus_allocated` | Allocated CPUs for partition | `partition` |
| `slurm_partition_cpus_idle` | Idle CPUs for partition | `partition` |
| `slurm_partition_cpus_other` | Other CPUs for partition | `partition` |
| `slurm_partition_cpus_total` | Total CPUs for partition | `partition` |
| `slurm_partition_jobs_pending` | Pending jobs for partition | `partition` |
| `slurm_partition_jobs_running` | Running jobs for partition | `partition` |
| `slurm_partition_gpus_idle` | Idle GPUs for partition | `partition` |
| `slurm_partition_gpus_allocated` | Allocated GPUs for partition | `partition` |
Provides detailed metrics on job states and resource usage.

- Command: `squeue -h -o "%P,%T,%C,%r,%u"`

| Metric | Description | Labels |
|---|---|---|
| `slurm_queue_pending` | Pending jobs in queue | `user`, `partition`, `reason` |
| `slurm_queue_running` | Running jobs in the cluster | `user`, `partition` |
| `slurm_queue_suspended` | Suspended jobs in the cluster | `user`, `partition` |
| `slurm_cores_pending` | Pending cores in queue | `user`, `partition`, `reason` |
| `slurm_cores_running` | Running cores in the cluster | `user`, `partition` |
| ... | (and many other states: completed, failed, etc.) | `user`, `partition` |
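For dashboards, these per-user/per-partition series are typically rolled up with `sum by (...)`. A sketch of two such recording rules (the rule names are suggestions, not shipped with the exporter):

```yaml
groups:
  - name: slurm_queue_derived
    rules:
      # Total pending jobs per partition, across users and reasons
      - record: slurm:queue_pending:by_partition
        expr: sum by (partition) (slurm_queue_pending)
      # Total running cores per user
      - record: slurm:cores_running:by_user
        expr: sum by (user) (slurm_cores_running)
```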
Provides metrics about active Slurm reservations.

- Command: `scontrol show reservation`

| Metric | Description | Labels |
|---|---|---|
| `slurm_reservation_info` | A metric with a constant '1' value labeled by reservation details | `reservation_name`, `state`, `users`, `nodes`, `partition`, `flags` |
| `slurm_reservation_start_time_seconds` | Start time of the reservation in seconds since the Unix epoch | `reservation_name` |
| `slurm_reservation_end_time_seconds` | End time of the reservation in seconds since the Unix epoch | `reservation_name` |
| `slurm_reservation_node_count` | Number of nodes allocated to the reservation | `reservation_name` |
| `slurm_reservation_core_count` | Number of cores allocated to the reservation | `reservation_name` |
Provides internal performance metrics from the slurmctld daemon.

- Command: `sdiag`

| Metric | Description | Labels |
|---|---|---|
| `slurm_scheduler_threads` | Number of scheduler threads | (none) |
| `slurm_scheduler_queue_size` | Length of the scheduler queue | (none) |
| `slurm_scheduler_mean_cycle` | Scheduler mean cycle time (microseconds) | (none) |
| `slurm_rpc_stats` | RPC count statistic | `operation` |
| `slurm_user_rpc_stats` | RPC count statistic per user | `user` |
| ... | (and many other backfill and RPC time metrics) | `operation` or `user` |
Provides job statistics aggregated by user.

- Command: `squeue -a -r -h -o "%A|%u|%T|%C"`

| Metric | Description | Labels |
|---|---|---|
| `slurm_user_jobs_pending` | Pending jobs for user | `user` |
| `slurm_user_jobs_running` | Running jobs for user | `user` |
| `slurm_user_cpus_running` | Running CPUs for user | `user` |
| `slurm_user_jobs_suspended` | Suspended jobs for user | `user` |
```yaml
scrape_configs:
  - job_name: 'slurm_exporter'
    scrape_interval: 30s
    scrape_timeout: 30s
    static_configs:
      - targets: ['slurm_host.fqdn:9341']
```

- scrape_interval: a 30s interval is recommended to avoid overloading the Slurm master with frequent command executions.
- scrape_timeout: should be equal to or less than the `scrape_interval` to prevent `context_deadline_exceeded` errors.
Check the config:

```shell
promtool check config prometheus.yml
```

- Command Timeout: the default timeout is 5 seconds. Increase it if Slurm commands take longer in your environment:

  ```shell
  ./slurm_exporter --command.timeout=10s
  ```

- Scrape Interval: use at least 30 seconds to avoid overloading the Slurm controller with frequent command executions.

- Collector Selection: disable unused collectors to reduce load and improve performance:

  ```shell
  ./slurm_exporter --no-collector.fairshare --no-collector.reservations
  ```
A Grafana dashboard is available.
This project is licensed under the GNU General Public License, version 3 or later.
This project is a fork of cea-hpc/slurm_exporter, which itself is a fork of vpenso/prometheus-slurm-exporter (now apparently unmaintained).
Feel free to contribute or open issues!