4 changes: 2 additions & 2 deletions .github/workflows/stackhpc.yml
@@ -173,11 +173,11 @@ jobs:
ansible-playbook -v ansible/site.yml
ansible-playbook -v ansible/ci/check_slurm.yml

- name: Reimage compute nodes to image in current branch using slurm - tests compute-init
- name: Reimage compute nodes to image in current branch using slurm
run: |
. venv/bin/activate
. environments/.stackhpc/activate
ansible-playbook -v ansible/adhoc/reboot_via_slurm.yml
ansible-playbook -v ansible/adhoc/rebuild-via-slurm.yml
ansible-playbook -v ansible/ci/check_slurm.yml

- name: Check sacct state survived reimage to current branch
24 changes: 0 additions & 24 deletions ansible/adhoc/reboot_via_slurm.yml

This file was deleted.

17 changes: 17 additions & 0 deletions ansible/adhoc/rebuild-via-slurm.yml
@@ -0,0 +1,17 @@
# Rebuild compute nodes via slurm.
# Nodes will be rebuilt if `image_id` in inventory is different to the
# currently-provisioned image. Otherwise they are rebooted.

# Example:
# ansible-playbook -v ansible/adhoc/rebuild-via-slurm.yml

# See docs/slurm-controlled-rebuild.md.

- hosts: login
run_once: true
gather_facts: no
tasks:
- name: Run slurm-controlled rebuild
import_role:
name: rebuild
tasks_from: rebuild.yml
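The playbook header above notes that nodes are rebuilt only when `image_id` in the inventory differs from the currently-provisioned image. As an illustrative sketch only (the group_vars path and UUID below are placeholders, not taken from this PR), such an inventory override might look like:

```yaml
# environments/site/inventory/group_vars/compute/overrides.yml  (illustrative path)
# image_id selects the image that a Slurm-controlled rebuild reprovisions to;
# if it already matches the current image, nodes are only rebooted.
image_id: "0b5d0f6a-9c1e-4c55-8b0e-2f9d3a7c1e42"  # placeholder Glance image UUID
```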
15 changes: 3 additions & 12 deletions ansible/ci/check_slurm.yml
@@ -6,19 +6,10 @@
shell: 'sinfo --noheader --format="%N %P %a %l %D %t" | sort' # using --format ensures we control whitespace: node_name,partition,partition_state,max_jobtime,num_nodes,node_state
register: sinfo
changed_when: false
until: not ("boot" in sinfo.stdout or "idle*" in sinfo.stdout or "down" in sinfo.stdout)
retries: 10
until: sinfo.stdout_lines == expected_sinfo
retries: 200
delay: 5
- name: Check nodes have expected slurm state
assert:
that: sinfo.stdout_lines == expected_sinfo
fail_msg: |
sinfo output not as expected:
actual:
{{ sinfo.stdout_lines }}
expected:
{{ expected_sinfo }}
<end>
vars:
expected_sinfo:
- " extra up 60-00:00:00 0 n/a" # empty partition
- "{{ openhpc_cluster_name }}-compute-[0-1] standard* up 60-00:00:00 2 idle"
11 changes: 5 additions & 6 deletions ansible/roles/compute_init/README.md
@@ -1,12 +1,11 @@
# EXPERIMENTAL: compute_init

Experimental functionality to allow compute nodes to rejoin the cluster after
a reboot without running the `ansible/site.yml` playbook.
Allow compute nodes to rejoin the cluster after a reboot without running the
`ansible/site.yml` playbook.

**CAUTION:** The approach used here of exporting cluster secrets over NFS
is considered to be a security risk due to the potential for cluster users to
mount the share on a user-controlled machine by tunnelling through a login
node. This feature should not be enabled on production clusters at this time.
> [!NOTE]
> This functionality is marked as experimental because it may be incomplete and
> the required configuration may change with further development.

To enable this:
1. Add the `compute` group (or a subset) into the `compute_init` group.
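As a minimal sketch of step 1 only (the remaining steps are truncated above), using Ansible's YAML inventory format; the file path is an assumption and the appliance's own inventories may use INI group files instead:

```yaml
# environments/site/inventory/groups.yml  (illustrative path)
# Make every host in the compute group a member of compute_init.
compute_init:
  children:
    compute:
```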
13 changes: 7 additions & 6 deletions ansible/roles/compute_init/files/compute-init.yml
@@ -324,12 +324,6 @@
enabled: true
state: started

- name: Ensure slurmd service state
service:
name: slurmd
enabled: true
state: started

- name: Set locked memory limits on user-facing nodes
lineinfile:
path: /etc/security/limits.conf
@@ -351,6 +345,13 @@
+:adm:ALL
-:ALL:ALL

- name: Ensure slurmd service state
service:
name: slurmd
enabled: true
state: started


- name: Ensure node is resumed
# TODO: consider if this is always safe for all job states?
command: scontrol update state=resume nodename={{ ansible_hostname }}
51 changes: 38 additions & 13 deletions ansible/roles/rebuild/README.md
@@ -1,30 +1,55 @@
rebuild
=========

Enables reboot tool from https://github.com/stackhpc/slurm-openstack-tools.git to be run from control node.
Enables the reboot tool from https://github.com/stackhpc/slurm-openstack-tools.git
to be run from the control node.

Requirements
------------

clouds.yaml file
An OpenStack clouds.yaml file containing credentials for a cloud under the
"openstack" key.

Role Variables
--------------

- `openhpc_rebuild_clouds`: Directory. Path to clouds.yaml file.
The below is only used by this role's `main.yml` task file, i.e. when running
the `ansible/site.yml` or `ansible/slurm.yml` playbooks:

- `rebuild_clouds_path`: Optional. Path to `clouds.yaml` file on the deploy
host, default `~/.config/openstack/clouds.yaml`.

Example Playbook
----------------
The below are only used by this role's `rebuild.yml` task file, i.e. when
running the `ansible/adhoc/rebuild-via-slurm.yml` playbook:

- hosts: control
become: yes
tasks:
- import_role:
name: rebuild
- `rebuild_job_partitions`: Optional. Comma-separated list of names of rebuild
partitions defined in `openhpc_slurm_partitions`. Useful as an extra-var for
limiting rebuilds. Default `rebuild`.

License
-------
- `rebuild_job_name`: Optional. Name of rebuild jobs. Default is `rebuild-`
suffixed with the node name.

Apache-2.0
- `rebuild_job_command`: Optional. String giving command to run in job after
node has been rebuilt. Default is to sleep for 5 seconds. Note that job output
is sent to `/dev/null` by default, as the root user running this has no shared
directory for job output.

- `rebuild_job_reboot`: Optional. A bool controlling whether to add the
`--reboot` flag to the job to actually trigger a rebuild. Useful for e.g.
testing partition configurations. Default `true`.

- `rebuild_job_options`: Optional. A string giving any other options to pass to
[sbatch](https://slurm.schedmd.com/sbatch.html). Default is empty string.

- `rebuild_job_user`: Optional. The user to run the rebuild setup and job as.
Default `root`.

- `rebuild_job_template`: Optional. The string to use to submit the job. See
[defaults.yml](defaults/main.yml).

- `rebuild_job_hostlist`: Optional. A string with a Slurm hostlist expression to
restrict a rebuild to only those nodes (e.g. `tux[1-3]` or `tux1,tux2`). If set,
`rebuild_job_partitions` must only define a single partition and that partition
must contain those nodes. Not for routine use, but may be useful e.g. to
reattempt a rebuild which failed on specific nodes. Default is all nodes in the
relevant partition.
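As a sketch of how these variables might be overridden (the file path and values are illustrative, not part of this PR):

```yaml
# environments/site/inventory/group_vars/all/rebuild.yml  (illustrative path)
rebuild_job_partitions: rebuild     # comma-separated list of rebuild partitions
rebuild_job_reboot: false           # dry-run: submit jobs without --reboot
rebuild_job_options: '--time=60'    # extra sbatch options, e.g. a job time limit
```

Alternatively, variables such as `rebuild_job_partitions` can be passed as extra-vars on the `ansible-playbook` command line when running `ansible/adhoc/rebuild-via-slurm.yml`.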
23 changes: 22 additions & 1 deletion ansible/roles/rebuild/defaults/main.yml
@@ -1,2 +1,23 @@
---
openhpc_rebuild_clouds: ~/.config/openstack/clouds.yaml

rebuild_clouds_path: ~/.config/openstack/clouds.yaml

rebuild_job_partitions: rebuild
rebuild_job_name: "rebuild-{{ item }}" # item is nodename
rebuild_job_command: 'sleep 5'
rebuild_job_reboot: true
rebuild_job_options: ''
rebuild_job_user: root
rebuild_job_template: >-
sbatch
--nodelist={{ item }}
{{ '--reboot' if rebuild_job_reboot | bool else '' }}
--job-name={{ rebuild_job_name }}
--nodes=1
--exclusive
--partition={{ _rebuild_job_current_partition }}
--no-requeue
--output=/dev/null
--wrap="{{ rebuild_job_command }}"
{{ rebuild_job_options }}
#rebuild_job_hostlist:
2 changes: 1 addition & 1 deletion ansible/roles/rebuild/tasks/main.yml
@@ -10,7 +10,7 @@

- name: Copy out clouds.yaml
copy:
src: "{{ openhpc_rebuild_clouds }}"
src: "{{ rebuild_clouds_path }}"
dest: /etc/openstack/clouds.yaml
owner: slurm
group: root
11 changes: 11 additions & 0 deletions ansible/roles/rebuild/tasks/rebuild.yml
@@ -0,0 +1,11 @@
- name: Create rebuild jobs for partition
include_tasks:
file: rebuild_partition.yml
args:
apply:
become: yes
become_user: "{{ rebuild_job_user }}"
loop: "{{ rebuild_job_partitions | split(',') }}"
loop_control:
loop_var: _rebuild_job_current_partition

21 changes: 21 additions & 0 deletions ansible/roles/rebuild/tasks/rebuild_partition.yml
@@ -0,0 +1,21 @@
- name: Get list of nodes in partition
ansible.builtin.command:
cmd: >-
sinfo
--Node
--format=%N
--noheader
--partition={{ _rebuild_job_current_partition }}
register: _sinfo_partition
when: rebuild_job_hostlist is not defined

- name: Expand rebuild_job_hostlist to host names
ansible.builtin.command:
cmd: "scontrol show hostnames {{ rebuild_job_hostlist }}"
register: _scontrol_hostnames
when: rebuild_job_hostlist is defined

- name: Submit rebuild jobs
ansible.builtin.command:
cmd: "{{ rebuild_job_template }}"
loop: "{{ _scontrol_hostnames.stdout_lines | default(_sinfo_partition.stdout_lines) }}"
20 changes: 10 additions & 10 deletions ansible/slurm.yml
@@ -19,16 +19,6 @@
- import_role:
name: rebuild

- name: Setup slurm
hosts: openhpc
become: yes
tags:
- openhpc
tasks:
- include_role:
name: stackhpc.openhpc
tasks_from: "{{ 'runtime.yml' if appliances_mode == 'configure' else 'main.yml' }}"

- name: Set locked memory limits on user-facing nodes
hosts:
- compute
@@ -63,3 +53,13 @@
+:adm:ALL
-:ALL:ALL
# vagrant uses (deprecated) ansible_ssh_user

- name: Setup slurm
hosts: openhpc
become: yes
tags:
- openhpc
tasks:
- include_role:
name: stackhpc.openhpc
tasks_from: "{{ 'runtime.yml' if appliances_mode == 'configure' else 'main.yml' }}"