
Commit 4d6dee0

merge conflicts

2 parents bdd265a + ede561f

89 files changed: +1402, -553 lines


.github/workflows/stackhpc.yml

Lines changed: 2 additions & 2 deletions
@@ -178,11 +178,11 @@ jobs:
           ansible-playbook -v ansible/site.yml
           ansible-playbook -v ansible/ci/check_slurm.yml
 
-      - name: Test reimage of compute nodes and compute-init (via rebuild adhoc)
+      - name: Test compute node reboot and compute-init
         run: |
           . venv/bin/activate
           . environments/.stackhpc/activate
-          ansible-playbook -v --limit compute ansible/adhoc/rebuild.yml
+          ansible-playbook -v ansible/adhoc/reboot_via_slurm.yml
           ansible-playbook -v ansible/ci/check_slurm.yml
 
       - name: Check sacct state survived reimage
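For reference, the updated CI step can be reproduced by hand from a deploy host checkout; this is a sketch using only the commands already shown in the workflow:

    # Sketch: run the updated CI step manually from a deploy host
    . venv/bin/activate
    . environments/.stackhpc/activate
    ansible-playbook -v ansible/adhoc/reboot_via_slurm.yml   # reboot/rebuild compute nodes via Slurm
    ansible-playbook -v ansible/ci/check_slurm.yml           # verify node states afterwards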

README.md

Lines changed: 18 additions & 10 deletions
@@ -31,8 +31,7 @@ It requires an OpenStack cloud, and an Ansible "deploy host" with access to that
 
 Before starting ensure that:
 - You have root access on the deploy host.
-- You can create instances using a Rocky 9 GenericCloud image (or an image based on that).
-  - **NB**: In general it is recommended to use the [latest released image](https://github.com/stackhpc/ansible-slurm-appliance/releases) which already contains the required packages. This is built and tested in StackHPC's CI.
+- You can create instances from the [latest Slurm appliance image](https://github.com/stackhpc/ansible-slurm-appliance/releases), which already contains the required packages. This is built and tested in StackHPC's CI. Although you can use a Rocky Linux 9 GenericCloud instead, it is not recommended.
 - You have an SSH keypair defined in OpenStack, with the private key available on the deploy host.
 - Created instances have access to internet (note proxies can be setup through the appliance if necessary).
 - Created instances have accurate/synchronised time (for VM instances this is usually provided by the hypervisor; if not or for bare metal instances it may be necessary to configure a time service via the appliance).
@@ -82,30 +81,39 @@ And generate secrets for it:
 
 Create an OpenTofu variables file to define the required infrastructure, e.g.:
 
-    # environments/$ENV/terraform/terraform.tfvars:
+    # environments/$ENV/tofu/terraform.tfvars:
 
     cluster_name = "mycluster"
-    cluster_net = "some_network" # *
-    cluster_subnet = "some_subnet" # *
+    cluster_networks = [
+      {
+        network = "some_network" # *
+        subnet = "some_subnet" # *
+      }
+    ]
     key_pair = "my_key" # *
     control_node_flavor = "some_flavor_name"
-    login_nodes = {
-        login-0: "login_flavor_name"
+    login = {
+        # Arbitrary group name for these login nodes
+        interactive = {
+            nodes: ["login-0"]
+            flavor: "login_flavor_name" # *
+        }
     }
     cluster_image_id = "rocky_linux_9_image_uuid"
     compute = {
+        # Group name used for compute node partition definition
         general = {
             nodes: ["compute-0", "compute-1"]
-            flavor: "compute_flavor_name"
+            flavor: "compute_flavor_name" # *
        }
     }
 
-Variables marked `*` refer to OpenStack resources which must already exist. The above is a minimal configuration - for all variables and descriptions see `environments/$ENV/terraform/terraform.tfvars`.
+Variables marked `*` refer to OpenStack resources which must already exist. The above is a minimal configuration - for all variables and descriptions see `environments/$ENV/tofu/variables.tf`.
 
 To deploy this infrastructure, ensure the venv and the environment are [activated](#create-a-new-environment) and run:
 
     export OS_CLOUD=openstack
-    cd environments/$ENV/terraform/
+    cd environments/$ENV/tofu/
     tofu init
     tofu apply
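Variables marked `*` must name OpenStack resources that already exist, so a quick sanity check before `tofu apply` is to look each one up with the OpenStack CLI. A sketch, using the placeholder names from the example tfvars above:

    # Sketch: confirm the resources referenced in terraform.tfvars exist
    export OS_CLOUD=openstack
    openstack network show some_network
    openstack subnet show some_subnet
    openstack keypair show my_key
    openstack flavor show some_flavor_name        # control node flavor
    openstack flavor show login_flavor_name
    openstack flavor show compute_flavor_name
    openstack image show rocky_linux_9_image_uuid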

ansible/.gitignore

Lines changed: 4 additions & 0 deletions
@@ -32,6 +32,8 @@ roles/*
 !roles/mysql/**
 !roles/systemd/
 !roles/systemd/**
+!roles/cacerts/
+!roles/cacerts/**
 !roles/cuda/
 !roles/cuda/**
 !roles/freeipa/
@@ -82,3 +84,5 @@ roles/*
 !roles/slurm_stats/**
 !roles/pytools/
 !roles/pytools/**
+!roles/rebuild/
+!roles/rebuild/**

ansible/adhoc/reboot_via_slurm.yml

Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
+# Reboot compute nodes via slurm. Nodes will be rebuilt if `image_id` in inventory is different to the currently-provisioned image.
+# Example:
+# ansible-playbook -v ansible/adhoc/reboot_via_slurm.yml
+
+- hosts: login
+  run_once: true
+  become: yes
+  gather_facts: no
+  tasks:
+    - name: Submit a Slurm job to reboot compute nodes
+      ansible.builtin.shell: |
+        set -e
+        srun --reboot -N 2 uptime
+      become_user: root
+      register: slurm_result
+      failed_when: slurm_result.rc != 0
+
+    - name: Fetch Slurm controller logs if reboot fails
+      ansible.builtin.shell: |
+        journalctl -u slurmctld --since "10 minutes ago" | tail -n 50
+      become_user: root
+      register: slurm_logs
+      when: slurm_result.rc != 0
+      delegate_to: "{{ groups['control'] | first }}"
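The playbook leans on Slurm rather than calling OpenStack directly: `srun --reboot` asks the controller to reboot the allocated nodes before running the trivial `uptime` job, and with the new `rebuild` role in place a reboot becomes a reimage when the inventory `image_id` differs from the provisioned image. A minimal manual sketch of the same flow, run as root from a login node (the 2-node count matches the CI cluster; adjust as needed):

    # Sketch: manual equivalent of the playbook's reboot task
    srun --reboot -N 2 uptime
    # then watch node states settle; the CI check retries while the output
    # still contains "boot", "idle*" or "down"
    sinfo --noheader --format="%N %P %a %l %D %t" | sort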

ansible/bootstrap.yml

Lines changed: 40 additions & 8 deletions
@@ -52,6 +52,13 @@
     - import_role:
         name: proxy
 
+- hosts: chrony
+  tags: chrony
+  become: yes
+  tasks:
+    - import_role:
+        name: mrlesmithjr.chrony
+
 - hosts: cluster
   gather_facts: false
   become: yes
@@ -126,22 +133,46 @@
       ansible.builtin.assert:
         that: dnf_repos_password is undefined
        fail_msg: Passwords should not be templated into repofiles during configure, unset 'dnf_repos_password'
-      when: appliances_mode == 'configure'
-    - name: Replace system repos with pulp repos
-      ansible.builtin.include_role:
-        name: dnf_repos
-        tasks_from: set_repos.yml
+      when:
+        - appliances_mode == 'configure'
+        - not (dnf_repos_allow_insecure_creds | default(false)) # useful for development
+
+- hosts: cacerts:!builder
+  tags: cacerts
+  gather_facts: false
+  tasks:
+    - name: Install custom cacerts
+      import_role:
+        name: cacerts
 
-# --- tasks after here require access to package repos ---
 - hosts: squid
   tags: squid
   gather_facts: yes
   become: yes
   tasks:
+    # - Installing squid requires working dnf repos
+    # - Configuring dnf_repos itself requires working dnf repos to install epel
+    # - Hence do this on squid nodes first in case they are proxying others
+    - name: Replace system repos with pulp repos
+      ansible.builtin.include_role:
+        name: dnf_repos
+        tasks_from: set_repos.yml
+      when: "'dnf_repos' in group_names"
     - name: Configure squid proxy
       import_role:
         name: squid
 
+- hosts: dnf_repos
+  tags: dnf_repos
+  gather_facts: yes
+  become: yes
+  tasks:
+    - name: Replace system repos with pulp repos
+      ansible.builtin.include_role:
+        name: dnf_repos
+        tasks_from: set_repos.yml
+
+# --- tasks after here require general access to package repos ---
 - hosts: tuned
   tags: tuned
   gather_facts: yes
@@ -282,10 +313,11 @@
     - include_role:
         name: azimuth_cloud.image_utils.linux_ansible_init
 
-- hosts: k3s
+- hosts: k3s:&builder
   become: yes
   tags: k3s
   tasks:
-    - ansible.builtin.include_role:
+    - name: Install k3s
+      ansible.builtin.include_role:
         name: k3s
         tasks_from: install.yml
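Each of the reordered or newly added plays carries its own tag (`chrony`, `cacerts`, `squid`, `dnf_repos`, `k3s`), so the new repo-switching order can be exercised in isolation. A sketch, assuming the venv and target environment are already activated:

    # Sketch: run only the proxy/repo plays, in the order bootstrap.yml now uses
    ansible-playbook -v ansible/bootstrap.yml --tags squid,dnf_repos
    # or only the new chrony and CA-certificate plays
    ansible-playbook -v ansible/bootstrap.yml --tags chrony,cacerts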

ansible/ci/check_slurm.yml

Lines changed: 1 addition & 1 deletion
@@ -6,7 +6,7 @@
       shell: 'sinfo --noheader --format="%N %P %a %l %D %t" | sort' # using --format ensures we control whitespace: Partition,partition_state,max_jobtime,num_nodes,node_state,node_name
       register: sinfo
       changed_when: false
-      until: not ("boot" in sinfo.stdout or "idle*" in sinfo.stdout)
+      until: not ("boot" in sinfo.stdout or "idle*" in sinfo.stdout or "down" in sinfo.stdout)
       retries: 10
       delay: 5
     - name: Check nodes have expected slurm state
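With this change the task also keeps retrying while any node reports `down`, in addition to the existing `boot` and `idle*` conditions, so nodes that briefly drop out during a reboot no longer fail the check immediately. For reference, output of the `sinfo` command above for a settled 2-node cluster might look like the following (hypothetical node, partition and time-limit values):

    # Hypothetical output: nodelist, partition, availability, time limit, node count, state
    mycluster-compute-[0-1] standard* up 60-00:00:00 2 idle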

ansible/ci/retrieve_inventory.yml

Lines changed: 1 addition & 1 deletion
@@ -7,7 +7,7 @@
   gather_facts: no
   vars:
     cluster_prefix: "{{ undef(hint='cluster_prefix must be defined') }}" # e.g. ci4005969475
-    ci_vars_file: "{{ appliances_environment_root + '/terraform/' + lookup('env', 'CI_CLOUD') }}.tfvars"
+    ci_vars_file: "{{ appliances_environment_root + '/tofu/' + lookup('env', 'CI_CLOUD') }}.tfvars"
     cluster_network: "{{ lookup('ansible.builtin.ini', 'cluster_net', file=ci_vars_file, type='properties') | trim('\"') }}"
   tasks:
     - name: Get control host IP

ansible/extras.yml

Lines changed: 20 additions & 0 deletions
@@ -1,3 +1,23 @@
+- hosts: k3s_server:!builder
+  become: yes
+  tags: k3s
+  tasks:
+    - name: Start k3s server
+      ansible.builtin.include_role:
+        name: k3s
+        tasks_from: server-runtime.yml
+
+# technically should be part of bootstrap.yml but hangs waiting on failed mounts
+# if runs before filesystems.yml after the control node has been reimaged
+- hosts: k3s_agent:!builder
+  become: yes
+  tags: k3s
+  tasks:
+    - name: Start k3s agents
+      ansible.builtin.include_role:
+        name: k3s
+        tasks_from: agent-runtime.yml
+
 - hosts: basic_users:!builder
   become: yes
   tags:
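With k3s installation now limited to image build (see the `k3s:&builder` play in `bootstrap.yml`) and only the runtime started here, a quick post-deploy check is to confirm the service came up and the agents registered. A sketch, assuming the default k3s service name and run on a host in the `k3s_server` group:

    # Sketch: verify the k3s runtime after extras.yml has run
    systemctl status k3s --no-pager      # assumes the default k3s server service name
    k3s kubectl get nodes                # server and agent nodes should show Ready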

ansible/roles/basic_users/README.md

Lines changed: 12 additions & 3 deletions
@@ -2,16 +2,19 @@
 basic_users
 ===========
 
-Setup users on cluster nodes using `/etc/passwd` and manipulating `$HOME`, i.e. without requiring LDAP etc. Features:
+Setup users on cluster nodes using `/etc/passwd` and manipulating `$HOME`, i.e.
+without requiring LDAP etc. Features:
 - UID/GID is consistent across cluster (and explicitly defined).
 - SSH key generated and propagated to all nodes to allow login between cluster nodes.
 - An "external" SSH key can be added to allow login from elsewhere.
-- Login to the control node is prevented.
+- Login to the control node is prevented (by default)
 - When deleting users, systemd user sessions are terminated first.
 
 Requirements
 ------------
-- $HOME (for normal users, i.e. not `centos`) is assumed to be on a shared filesystem.
+- `$HOME` (for normal users, i.e. not `rocky`) is assumed to be on a shared
+  filesystem. Actions affecting that shared filesystem are run on a single host,
+  see `basic_users_manage_homedir` below.
 
 Role Variables
 --------------
@@ -22,9 +25,15 @@ Role Variables
 - `shell` if *not* set will be `/sbin/nologin` on the `control` node and the default shell on other users. Explicitly setting this defines the shell for all nodes.
 - An additional key `public_key` may optionally be specified to define a key to log into the cluster.
 - An additional key `sudo` may optionally be specified giving a string (possibly multiline) defining sudo rules to be templated.
+- `ssh_key_type` defaults to `ed25519` instead of the `ansible.builtin.user` default of `rsa`.
 - Any other keys may present for other purposes (i.e. not used by this role).
 - `basic_users_groups`: Optional, default empty list. A list of mappings defining information for each group. Mapping keys/values are passed through as parameters to [ansible.builtin.group](https://docs.ansible.com/ansible/latest/collections/ansible/builtin/group_module.html) and default values are as given there.
 - `basic_users_override_sssd`: Optional bool, default false. Whether to disable `sssd` when ensuring users/groups exist with this role. Permits creating local users/groups even if they clash with users provided via sssd (e.g. from LDAP). Ignored if host is not in group `sssd` as well. Note with this option active `sssd` will be stopped and restarted each time this role is run.
+- `basic_users_manage_homedir`: Optional bool, must be true on a single host to
+  determine which host runs tasks affecting the shared filesystem. The default
+  is to use the first play host which is not the control node, because the
+  default NFS configuration does not have the shared `/home` directory mounted
+  on the control node.
 
 Dependencies
 ------------

ansible/roles/basic_users/defaults/main.yml

Lines changed: 2 additions & 1 deletion
@@ -1,9 +1,10 @@
-basic_users_manage_homedir: "{{ (ansible_hostname == (ansible_play_hosts | first)) }}"
+basic_users_manage_homedir: "{{ ansible_hostname == (ansible_play_hosts | difference(groups['control']) | first) }}"
 basic_users_userdefaults:
   state: present
   create_home: "{{ basic_users_manage_homedir }}"
   generate_ssh_key: "{{ basic_users_manage_homedir }}"
   ssh_key_comment: "{{ item.name }}"
+  ssh_key_type: ed25519
   shell: "{{'/sbin/nologin' if 'control' in group_names else omit }}"
 basic_users_users: []
 basic_users_groups: []
