slurm_feature
The main purpose of the Slurm Appliance set (Controller and Worker) is to provide a rapidly deployable, self-configuring cluster environment. The Slurm Controller acts as the central orchestrator, managing resources and job queues, while Slurm Workers are compute nodes that dynamically join the cluster to execute workloads.
A standout feature of this appliance is its Configless Slurm architecture.
- Centralized Config: The slurm.conf configuration file resides only on the Controller node.
- Auto-Propagation: When Worker nodes boot up, they pull the configuration directly from the Controller (see the sketch below).
- Zero-Copy: Users do not need to manually copy configuration files to worker nodes or rebuild images when the configuration changes.
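A minimal sketch of how configless mode is typically wired up, assuming a Controller reachable as slurm-controller (the actual configuration shipped with the appliance may differ): the Controller enables the feature in slurm.conf, and Workers start slurmd pointed at the Controller instead of reading a local file.

# Illustrative slurm.conf lines on the Controller enabling configless mode
SlurmctldHost=slurm-controller
SlurmctldParameters=enable_configless

# Illustrative Worker side: slurmd fetches its configuration from the Controller
# (6817 is the default slurmctld port)
$ slurmd --conf-server slurm-controller:6817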
OneGate Integration & Munge Key Exchange
To ensure secure communication, the cluster relies on a shared Munge Key.
- Exposure: The Controller generates this key and exposes it via OneGate.
- Manual Transfer: While the key is exposed automatically, it must be manually provided to the Workers during instantiation (see the retrieval sketch below).
- Encryption: Currently, Munge is the only supported method for authentication and encryption within the cluster.
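A hedged sketch of what that retrieval can look like from the OpenNebula frontend; the exact attribute name under which the key is published is an assumption here and may differ in the actual appliance:

# The key is pushed to the Controller VM's user template via OneGate
$ onevm show <SlurmController_ID> | grep -i munge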
Cluster Management Commands
The primary tool for monitoring the status of the cluster and verifying that workers have successfully joined is scontrol.
Access the management tool by SSHing into the Slurm Controller VM:
$ onevm ssh <SlurmController_ID>

Once connected to the Controller, use the following command to list all registered worker nodes and their states:

$ scontrol show nodes
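As a further sanity check (illustrative; assumes at least one Worker has registered in the default partition), standard Slurm commands can confirm the cluster is usable end to end:

$ sinfo                  # summary of partitions and node states
$ srun -N1 hostname      # trivial test job to confirm scheduling works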
Because the Munge key is critical for cluster security, it requires a manual transfer step during deployment. Please also be aware of the current operational constraints of this appliance:
- Stateless Mode (No SlurmDB): The appliance does not currently include the Slurm Database Daemon (slurmdbd). It operates in stateless mode, meaning job accounting, history, and usage statistics are not persisted.
- Encryption Support: The cluster supports Munge exclusively. Other authentication or encryption mechanisms are not currently configured.
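In practice this means accounting queries have nothing to draw on; for example (illustrative), sacct relies on slurmdbd-backed accounting storage and will not return persisted job history on this appliance:

$ sacct                  # no persisted job history without slurmdbd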
The Slurm cluster deployment follows a specific three-step flow:
1. Controller Initialization (Service Setup)
   - Deploy the Slurm Controller appliance.
   - Wait for the "All set and ready to serve" confirmation via SSH.
   - The Controller automatically generates the credentials and pushes them to OneGate.
2. Worker Instantiation (Compute Scaling)
   - Retrieve the Munge Key from the Controller using the OpenNebula CLI.
   - Instantiate Slurm Worker appliances.
   - Crucial Step: Manually input the Controller IP and paste the Munge Key into the context variables (see the sketch after this list).
   - There is no limit to the number of workers; you can instantiate as many as required by your workload.
3. Cluster Convergence (Auto-Join)
   - As Worker VMs boot, they automatically fetch the configuration from the Controller (Configless mode).
   - After a few minutes, the Workers automatically register themselves with the Controller.
   - The cluster is now ready to accept job submissions.
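The following end-to-end sketch ties these three steps together from the OpenNebula frontend. It is illustrative only: the worker template name and the context variable names (ONEAPP_SLURM_CONTROLLER_IP, ONEAPP_SLURM_MUNGE_KEY) are assumed placeholders, and the attribute under which the Controller publishes the Munge key may be named differently in the actual appliance.

# 1. Confirm the Controller is up and has pushed its credentials to OneGate
$ onevm ssh <SlurmController_ID>        # wait for "All set and ready to serve"

# 2. Retrieve the Munge key and pass it, with the Controller IP, as context
#    when instantiating a Worker (variable names below are placeholders)
$ onevm show <SlurmController_ID> | grep -i munge
$ onetemplate instantiate "Slurm Worker" \
    --context ONEAPP_SLURM_CONTROLLER_IP="<controller_ip>",ONEAPP_SLURM_MUNGE_KEY="<munge_key>"

# 3. After a few minutes, verify from the Controller that the Workers have joined
$ scontrol show nodes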