
Commit b4c0178

Merge branch 'master' into feat/simpler-templating

2 parents: 04f3bbb + 55d8af4


47 files changed, +545 -306 lines

.github/workflows/ci.yml

Lines changed: 0 additions & 1 deletion
@@ -59,7 +59,6 @@ jobs:
           - test11
           - test12
           - test13
-          - test14
         exclude:
           # mariadb package provides /usr/bin/mysql on RL8 which doesn't work with geerlingguy/mysql role
           - scenario: test4

README.md

Lines changed: 222 additions & 52 deletions
@@ -59,15 +59,20 @@ unique set of homogenous nodes:
   `free --mebi` total * `openhpc_ram_multiplier`.
 * `ram_multiplier`: Optional. An override for the top-level definition
   `openhpc_ram_multiplier`. Has no effect if `ram_mb` is set.
-* `gres`: Optional. List of dicts defining [generic resources](https://slurm.schedmd.com/gres.html). Each dict must define:
+* `gres_autodetect`: Optional. The [auto detection mechanism](https://slurm.schedmd.com/gres.conf.html#OPT_AutoDetect) to use for the generic resources. Note: you must still define the `gres` dictionary (see below) but you only need to define the `conf` key. See the [GRES autodetection](#gres-autodetection) section below.
+* `gres`: Optional. List of dicts defining [generic resources](https://slurm.schedmd.com/gres.html). Each dict should define:
   - `conf`: A string with the [resource specification](https://slurm.schedmd.com/slurm.conf.html#OPT_Gres_1) but requiring the format `<name>:<type>:<number>`, e.g. `gpu:A100:2`. Note the `type` is an arbitrary string.
-  - `file`: A string with the [File](https://slurm.schedmd.com/gres.conf.html#OPT_File) (path to device(s)) for this resource, e.g. `/dev/nvidia[0-1]` for the above example.
+  - `file`: Omit if `gres_autodetect` is set. A string with the [File](https://slurm.schedmd.com/gres.conf.html#OPT_File) (path to device(s)) for this resource, e.g. `/dev/nvidia[0-1]` for the above example.
+
   Note [GresTypes](https://slurm.schedmd.com/slurm.conf.html#OPT_GresTypes) must be set in `openhpc_config` if this is used.
-* `params`: Optional. Mapping of additional parameters and values for
+* `features`: Optional. List of [Features](https://slurm.schedmd.com/slurm.conf.html#OPT_Features) strings.
+* `node_params`: Optional. Mapping of additional parameters and values for
   [node configuration](https://slurm.schedmd.com/slurm.conf.html#lbAE).
+  **NB:** Parameters which can be set via the keys above must not be included here.
 
 Each nodegroup will contain hosts from an Ansible inventory group named
-`{{ openhpc_cluster_name }}_{{ group_name}}`. Note that:
+`{{ openhpc_cluster_name }}_{{ name }}`, where `name` is the nodegroup name.
+Note that:
 - Each host may only appear in one nodegroup.
 - Hosts in a nodegroup are assumed to be homogenous in terms of processor and memory.
 - Hosts may have arbitrary hostnames, but these should be lowercase to avoid a
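
To illustrate the nodegroup keys documented in the hunk above, together with the partition-level keys documented in the next hunk, a minimal sketch follows. It is not part of the diff; the cluster/nodegroup names, GRES, Features and parameter values are illustrative assumptions only:

```yaml
# Illustrative sketch only - names and values are assumptions, not from this commit.
openhpc_cluster_name: hpc
openhpc_nodegroups:
  - name: cpu                    # hosts come from inventory group hpc_cpu
    features: ['bigmem']         # arbitrary Feature strings
  - name: gpu                    # hosts come from inventory group hpc_gpu
    gres_autodetect: nvml        # with autodetection only `conf` is needed below
    gres:
      - conf: gpu:A100:2
    node_params:
      CoreSpecCount: 2           # any further slurm.conf node parameters
openhpc_partitions:
  - name: cpu                    # `nodegroups` omitted, so uses nodegroup `cpu`
  - name: everything
    nodegroups: [cpu, gpu]
    maxtime: '1-0'               # quoted; overrides openhpc_job_maxtime
    partition_params:
      PreemptMode: 'OFF'         # any further partition parameters
openhpc_config:
  GresTypes:
    - gpu                        # required when `gres` is used
```
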
@@ -78,18 +83,23 @@ unique set of homogenous nodes:
   This is used to set `Sockets`, `CoresPerSocket`, `ThreadsPerCore` and
   optionally `RealMemory` for the nodegroup.
 
-`openhpc_partitions`: Optional, default `[]`. List of mappings, each defining a
+`openhpc_partitions`: Optional. List of mappings, each defining a
 partition. Each partition mapping may contain:
 * `name`: Required. Name of partition.
-* `groups`: Optional. List of nodegroup names. If omitted, the partition name
-  is assumed to match a nodegroup name.
+* `nodegroups`: Optional. List of node group names. If omitted, the node group
+  with the same name as the partition is used.
 * `default`: Optional. A boolean flag for whether this partition is the default. Valid settings are `YES` and `NO`.
-* `maxtime`: Optional. A partition-specific time limit following the format of [slurm.conf](https://slurm.schedmd.com/slurm.conf.html) parameter `MaxTime`. The default value is
-  given by `openhpc_job_maxtime`. The value should be quoted to avoid Ansible conversions.
-* `params`: Optional. Mapping of additional parameters and values for
+* `maxtime`: Optional. A partition-specific time limit overriding `openhpc_job_maxtime`.
+* `partition_params`: Optional. Mapping of additional parameters and values for
   [partition configuration](https://slurm.schedmd.com/slurm.conf.html#SECTION_PARTITION-CONFIGURATION).
+  **NB:** Parameters which can be set via the keys above must not be included here.
+
+If this variable is not set, one partition per nodegroup is created, with default
+partition configuration for each.
 
-`openhpc_job_maxtime`: Maximum job time limit, default `'60-0'` (60 days). See [slurm.conf](https://slurm.schedmd.com/slurm.conf.html) parameter `MaxTime` for format. The default is 60 days. The value should be quoted to avoid Ansible conversions.
+`openhpc_job_maxtime`: Maximum job time limit, default `'60-0'` (60 days), see
+[slurm.conf:MaxTime](https://slurm.schedmd.com/slurm.conf.html#OPT_MaxTime).
+**NB:** This should be quoted to avoid Ansible conversions.
 
 `openhpc_cluster_name`: name of the cluster.
 
@@ -140,10 +150,12 @@ accounting data such as start and end times. By default no job accounting is con
 `openhpc_slurm_job_comp_loc`: Location to store the job accounting records. Depends on value of
 `openhpc_slurm_job_comp_type`, e.g. for `jobcomp/filetxt` represents a path on disk.
 
-### slurmdbd.conf
+### slurmdbd
 
-The following options affect `slurmdbd.conf`. Please see the slurm [documentation](https://slurm.schedmd.com/slurmdbd.conf.html) for more details.
-You will need to configure these variables if you have set `openhpc_enable.database` to `true`.
+When the slurm database daemon (`slurmdbd`) is enabled by setting
+`openhpc_enable.database` to `true` the following options must be configured.
+See documentation for [slurmdbd.conf](https://slurm.schedmd.com/slurmdbd.conf.html)
+for more details.
 
 `openhpc_slurmdbd_port`: Port for slurmdb to listen on, defaults to `6819`.
 
@@ -155,6 +167,30 @@ You will need to configure these variables if you have set `openhpc_enable.datab
 
 `openhpc_slurmdbd_mysql_username`: Username for authenticating with the database, defaults to `slurm`.
 
+Before starting `slurmdbd`, the role will check if a database upgrade is
+required due to a Slurm major version upgrade and carry it out if so.
+Slurm versions before 24.11 do not support this check and so no upgrade will
+occur. The following variables control behaviour during this upgrade:
+
+`openhpc_slurm_accounting_storage_client_package`: Optional. String giving the
+name of the database client package to install, e.g. `mariadb`. Default `mysql`.
+
+`openhpc_slurm_accounting_storage_backup_cmd`: Optional. String (possibly
+multi-line) giving a command for `ansible.builtin.shell` to run a backup of the
+Slurm database before performing the database upgrade. Default is the empty
+string, which performs no backup.
+
+`openhpc_slurm_accounting_storage_backup_host`: Optional. Inventory hostname
+defining the host to run the backup command on. Default is `openhpc_slurm_accounting_storage_host`.
+
+`openhpc_slurm_accounting_storage_backup_become`: Optional. Whether to run the
+backup command as root. Default `true`.
+
+`openhpc_slurm_accounting_storage_service`: Optional. Name of the systemd service
+for the accounting storage database, e.g. `mysql`. If this is defined, this
+service is stopped before the backup and restarted after, to allow for physical
+backups. Default is the empty string, which does not stop/restart any service.
+
 ## Facts
 
 This role creates local facts from the live Slurm configuration, which can be
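
Pulling together the accounting options above, a hedged sketch of enabling `slurmdbd` with a pre-upgrade backup follows. Only `openhpc_enable.database`, `openhpc_slurmdbd_port`, `openhpc_slurmdbd_mysql_username` and the `openhpc_slurm_accounting_storage_*` variables come from this README; the inventory group names, the `mysqldump` command and the remaining values are assumptions:

```yaml
# Sketch only - assumes a MariaDB/MySQL server reachable from the slurmdbd host
# and that a logical dump is an acceptable backup; adapt to your deployment.
openhpc_enable:
  control: "{{ inventory_hostname in groups['hpc_control'] }}"
  batch: "{{ inventory_hostname in groups['hpc_compute'] }}"
  database: true                            # runs slurmdbd, see options above
  runtime: true
openhpc_slurmdbd_port: 6819                 # default
openhpc_slurmdbd_mysql_username: slurm      # default

# Database upgrade behaviour (the upgrade check needs Slurm 24.11 or later):
openhpc_slurm_accounting_storage_client_package: mariadb
openhpc_slurm_accounting_storage_backup_become: true
openhpc_slurm_accounting_storage_backup_cmd: |
  mysqldump --all-databases > /root/slurm_acct_db_backup.sql
# openhpc_slurm_accounting_storage_service is left unset because mysqldump is a
# logical backup, so the database service does not need stopping.
```
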
@@ -163,50 +199,184 @@ accessed (with facts gathering enabled) using `ansible_local.slurm`. As per the
 in mixed case are from config files. Note the facts are only refreshed
 when this role is run.
 
-## Example Inventory
-
-And an Ansible inventory as this:
-
-    [openhpc_login]
-    openhpc-login-0 ansible_host=10.60.253.40 ansible_user=centos
-
-    [openhpc_compute]
-    openhpc-compute-0 ansible_host=10.60.253.31 ansible_user=centos
-    openhpc-compute-1 ansible_host=10.60.253.32 ansible_user=centos
+## Example
 
-    [cluster_login:children]
-    openhpc_login
+### Simple
 
-    [cluster_control:children]
-    openhpc_login
+The following creates a cluster with a single partition `compute`
+containing two nodes:
 
-    [cluster_batch:children]
-    openhpc_compute
+```ini
+# inventory/hosts:
+[hpc_login]
+cluster-login-0
 
-## Example Playbooks
+[hpc_compute]
+cluster-compute-0
+cluster-compute-1
 
-To deploy, create a playbook which looks like this:
-
-    ---
-    - hosts:
-      - cluster_login
-      - cluster_control
-      - cluster_batch
-      become: yes
-      roles:
-        - role: openhpc
-          openhpc_enable:
-            control: "{{ inventory_hostname in groups['cluster_control'] }}"
-            batch: "{{ inventory_hostname in groups['cluster_batch'] }}"
-            runtime: true
-          openhpc_slurm_service_enabled: true
-          openhpc_slurm_control_host: "{{ groups['cluster_control'] | first }}"
-          openhpc_slurm_partitions:
-            - name: "compute"
-          openhpc_cluster_name: openhpc
-          openhpc_packages: []
-    ...
+[hpc_control]
+cluster-control
+```
 
+```yaml
+#playbook.yml
+---
+- hosts: all
+  become: yes
+  tasks:
+    - import_role:
+        name: stackhpc.openhpc
+      vars:
+        openhpc_cluster_name: hpc
+        openhpc_enable:
+          control: "{{ inventory_hostname in groups['hpc_control'] }}"
+          batch: "{{ inventory_hostname in groups['hpc_compute'] }}"
+          runtime: true
+        openhpc_slurm_control_host: "{{ groups['hpc_control'] | first }}"
+        openhpc_nodegroups:
+          - name: compute
+        openhpc_partitions:
+          - name: compute
 ---
+```
+
+### Multiple nodegroups
+
+This example shows how partitions can span multiple types of compute node.
+
+This example inventory describes three types of compute node (login and
+control nodes are omitted for brevity):
+
+```ini
+# inventory/hosts:
+...
+[hpc_general]
+# standard compute nodes
+cluster-general-0
+cluster-general-1
+
+[hpc_large]
+# large memory nodes
+cluster-largemem-0
+cluster-largemem-1
+
+[hpc_gpu]
+# GPU nodes
+cluster-a100-0
+cluster-a100-1
+...
+```
+
+Firstly the `openhpc_nodegroups` is set to capture these inventory groups and
+apply any node-level parameters - in this case the `largemem` nodes have
+2x cores reserved for some reason, and GRES is configured for the GPU nodes:
+
+```yaml
+openhpc_cluster_name: hpc
+openhpc_nodegroups:
+  - name: general
+  - name: large
+    node_params:
+      CoreSpecCount: 2
+  - name: gpu
+    gres:
+      - conf: gpu:A100:2
+        file: /dev/nvidia[0-1]
+```
+or if using the NVML `gres_autodetect` mechanism (NOTE: this requires recompilation of the slurm binaries to link against the [NVIDIA Management library](#gres-autodetection)):
+
+```yaml
+openhpc_cluster_name: hpc
+openhpc_nodegroups:
+  - name: general
+  - name: large
+    node_params:
+      CoreSpecCount: 2
+  - name: gpu
+    gres_autodetect: nvml
+    gres:
+      - conf: gpu:A100:2
+```
+Now two partitions can be configured - a default one with a short timelimit and
+no large memory nodes for testing jobs, and another with all hardware and longer
+job runtime for "production" jobs:
+
+```yaml
+openhpc_partitions:
+  - name: test
+    nodegroups:
+      - general
+      - gpu
+    maxtime: '1:0:0' # 1 hour
+    default: 'YES'
+  - name: general
+    nodegroups:
+      - general
+      - large
+      - gpu
+    maxtime: '2-0' # 2 days
+    default: 'NO'
+```
+Users will select the partition using the `--partition` argument and request nodes
+with appropriate memory or GPUs using the `--mem` and `--gres` or `--gpus*`
+options for `sbatch` or `srun`.
+
+Finally, some additional configuration must be provided for GRES:
+```yaml
+openhpc_config:
+  GresTypes:
+    - gpu
+```
+
+## GRES autodetection
+
+Some autodetection mechanisms require recompilation of the slurm packages to
+link against external libraries. Examples are shown in the sections below.
+
+### Recompiling slurm binaries against the [NVIDIA Management library](https://developer.nvidia.com/management-library-nvml)
+
+This will allow you to use `gres_autodetect: nvml` in your `nodegroup`
+definitions.
+
+First, [install the complete cuda toolkit from NVIDIA](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/).
+You can then recompile the slurm packages from the source RPMS as follows:
+
+```sh
+dnf download --source slurm-slurmd-ohpc
+
+rpm -i slurm-ohpc-*.src.rpm
+
+cd /root/rpmbuild/SPECS
+
+dnf builddep slurm.spec
+
+rpmbuild -bb -D "_with_nvml --with-nvml=/usr/local/cuda-12.8/targets/x86_64-linux/" slurm.spec | tee /tmp/build.txt
+```
+
+NOTE: This will need to be adapted for the version of CUDA installed (12.8 is used in the example).
+
+The RPMs will be created in `/root/rpmbuild/RPMS/x86_64/`. The method to distribute these RPMs to
+each compute node is out of scope of this document. You can either use a custom package repository
+or simply install them manually on each node with Ansible.
+
+#### Configuration example
+
+A configuration snippet is shown below:
+
+```yaml
+openhpc_cluster_name: hpc
+openhpc_nodegroups:
+  - name: general
+  - name: large
+    node_params:
+      CoreSpecCount: 2
+  - name: gpu
+    gres_autodetect: nvml
+    gres:
+      - conf: gpu:A100:2
+```
+For additional context, refer to the GPU example in [Multiple nodegroups](#multiple-nodegroups).
+
 
 <b id="slurm_ver_footnote">1</b> Slurm 20.11 removed `accounting_storage/filetxt` as an option. This version of Slurm was introduced in OpenHPC v2.1 but the OpenHPC repos are common to all OpenHPC v2.x releases. [↩](#accounting_storage)

defaults/main.yml

Lines changed: 8 additions & 1 deletion
@@ -4,7 +4,7 @@ openhpc_slurm_service_started: "{{ openhpc_slurm_service_enabled }}"
 openhpc_slurm_service:
 openhpc_slurm_control_host: "{{ inventory_hostname }}"
 #openhpc_slurm_control_host_address:
-openhpc_partitions: []
+openhpc_partitions: "{{ openhpc_nodegroups }}"
 openhpc_nodegroups: []
 openhpc_cluster_name:
 openhpc_packages:
@@ -132,3 +132,10 @@ openhpc_module_system_install: true
 
 # Auto detection
 openhpc_ram_multiplier: 0.95
+
+# Database upgrade
+openhpc_slurm_accounting_storage_service: ''
+openhpc_slurm_accounting_storage_backup_cmd: ''
+openhpc_slurm_accounting_storage_backup_host: "{{ openhpc_slurm_accounting_storage_host }}"
+openhpc_slurm_accounting_storage_backup_become: true
+openhpc_slurm_accounting_storage_client_package: mysql
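
As a sketch of what the new `openhpc_partitions` default above means in practice (nodegroup names are illustrative): for nodegroups that only set `name`, leaving `openhpc_partitions` unset is equivalent to writing it out explicitly:

```yaml
# With the new default openhpc_partitions: "{{ openhpc_nodegroups }}", this...
openhpc_nodegroups:
  - name: general
  - name: gpu

# ...produces the same partitions as explicitly setting:
openhpc_partitions:
  - name: general
  - name: gpu
```
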

handlers/main.yml

Lines changed: 0 additions & 6 deletions
@@ -1,10 +1,4 @@
 ---
-# NOTE: We need this running before slurmdbd
-- name: Restart Munge service
-  service:
-    name: "munge"
-    state: restarted
-  when: openhpc_slurm_service_started | bool
 
 # NOTE: we need this running before slurmctld start
 - name: Issue slurmdbd restart command

molecule/README.md

Lines changed: 2 additions & 2 deletions
@@ -10,7 +10,7 @@ test1 | 1 | N | 2x compute node, sequential na
 test1b | 1 | N | 1x compute node
 test1c | 1 | N | 2x compute nodes, nonsequential names
 test2 | 2 | N | 4x compute node, sequential names
-test3 | 1 | Y | -
+test3 | 1 | Y | 4x compute nodes in 2x groups, single partition
 test4 | 1 | N | 2x compute node, accounting enabled
 test5 | 1 | N | As for #1 but configless
 test6 | 1 | N | 0x compute nodes, configless
@@ -21,7 +21,7 @@ test10 | 1 | N | As for #5 but then tries to ad
 test11 | 1 | N | As for #5 but then deletes a node (actually changes the partition due to molecule/ansible limitations)
 test12 | 1 | N | As for #5 but enabling job completion and testing `sacct -c`
 test13 | 1 | N | As for #5 but tests `openhpc_config` variable.
-test14 | 1 | N | As for #5 but also tests `extra_nodes` via State=DOWN nodes.
+test14 | 1 | N | [removed, extra_nodes removed]
 test15 | 1 | Y | As for #5 but also tests `partitions with different name but with the same NodeName`.
 
 
molecule/test1/converge.yml

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -7,7 +7,7 @@
77
batch: "{{ inventory_hostname in groups['testohpc_compute'] }}"
88
runtime: true
99
openhpc_slurm_control_host: "{{ groups['testohpc_login'] | first }}"
10-
openhpc_slurm_partitions:
10+
openhpc_nodegroups:
1111
- name: "compute"
1212
openhpc_cluster_name: testohpc
1313
tasks:
