Commit f3aa9a3

Merge branch 'main' into feat/lustre-compute-init
2 parents 53b4e9b + 30d6ce4 commit f3aa9a3

26 files changed: 133 additions, 48 deletions

.github/workflows/stackhpc.yml

Lines changed: 1 addition & 2 deletions
@@ -182,9 +182,8 @@ jobs:
   run: |
     . venv/bin/activate
     . environments/.stackhpc/activate
-    ansible-playbook -v --limit compute ansible/adhoc/rebuild.yml
-    ansible-playbook -v ansible/ci/check_slurm.yml
     ansible-playbook -v ansible/adhoc/reboot_via_slurm.yml
+    ansible-playbook -v ansible/ci/check_slurm.yml

 - name: Check sacct state survived reimage
   run: |

ansible/bootstrap.yml

Lines changed: 7 additions & 0 deletions
@@ -52,6 +52,13 @@
     - import_role:
         name: proxy

+- hosts: chrony
+  tags: chrony
+  become: yes
+  tasks:
+    - import_role:
+        name: mrlesmithjr.chrony
+
 - hosts: cluster
   gather_facts: false
   become: yes

ansible/roles/basic_users/README.md

Lines changed: 12 additions & 3 deletions
@@ -2,16 +2,19 @@
 basic_users
 ===========

-Setup users on cluster nodes using `/etc/passwd` and manipulating `$HOME`, i.e. without requiring LDAP etc. Features:
+Setup users on cluster nodes using `/etc/passwd` and manipulating `$HOME`, i.e.
+without requiring LDAP etc. Features:
 - UID/GID is consistent across cluster (and explicitly defined).
 - SSH key generated and propagated to all nodes to allow login between cluster nodes.
 - An "external" SSH key can be added to allow login from elsewhere.
-- Login to the control node is prevented.
+- Login to the control node is prevented (by default)
 - When deleting users, systemd user sessions are terminated first.

 Requirements
 ------------
-- $HOME (for normal users, i.e. not `centos`) is assumed to be on a shared filesystem.
+- `$HOME` (for normal users, i.e. not `rocky`) is assumed to be on a shared
+filesystem. Actions affecting that shared filesystem are run on a single host,
+see `basic_users_manage_homedir` below.

 Role Variables
 --------------
@@ -22,9 +25,15 @@ Role Variables
 - `shell` if *not* set will be `/sbin/nologin` on the `control` node and the default shell on other users. Explicitly setting this defines the shell for all nodes.
 - An additional key `public_key` may optionally be specified to define a key to log into the cluster.
 - An additional key `sudo` may optionally be specified giving a string (possibly multiline) defining sudo rules to be templated.
+- `ssh_key_type` defaults to `ed25519` instead of the `ansible.builtin.user` default of `rsa`.
 - Any other keys may present for other purposes (i.e. not used by this role).
 - `basic_users_groups`: Optional, default empty list. A list of mappings defining information for each group. Mapping keys/values are passed through as parameters to [ansible.builtin.group](https://docs.ansible.com/ansible/latest/collections/ansible/builtin/group_module.html) and default values are as given there.
 - `basic_users_override_sssd`: Optional bool, default false. Whether to disable `sssd` when ensuring users/groups exist with this role. Permits creating local users/groups even if they clash with users provided via sssd (e.g. from LDAP). Ignored if host is not in group `sssd` as well. Note with this option active `sssd` will be stopped and restarted each time this role is run.
+- `basic_users_manage_homedir`: Optional bool, must be true on a single host to
+determine which host runs tasks affecting the shared filesystem. The default
+is to use the first play host which is not the control node, because the
+default NFS configuration does not have the shared `/home` directory mounted
+on the control node.

 Dependencies
 ------------
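
To illustrate the README changes above, here is a minimal sketch of a user definition. The user name, UID and key material are hypothetical. The sketch also assumes that keys such as `uid` are passed through to `ansible.builtin.user`; the hunk itself only names `shell`, `public_key`, `sudo` and `ssh_key_type`.

```yaml
# Hypothetical group_vars snippet; all names and values are illustrative only.
basic_users_users:
  - name: alice                # hypothetical user
    uid: 2001                  # assumed to be passed through to ansible.builtin.user
    ssh_key_type: rsa          # overrides the role default of ed25519
    public_key: "ssh-ed25519 AAAA... alice@laptop"  # placeholder "external" key
    sudo: "alice ALL=(ALL) NOPASSWD:ALL"            # string templated into sudo rules
```

Leaving out `ssh_key_type` would give the new `ed25519` default, and leaving out `shell` would keep the `/sbin/nologin` behaviour on the control node described in the README.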

ansible/roles/basic_users/defaults/main.yml

Lines changed: 2 additions & 1 deletion
@@ -1,9 +1,10 @@
-basic_users_manage_homedir: "{{ (ansible_hostname == (ansible_play_hosts | first)) }}"
+basic_users_manage_homedir: "{{ ansible_hostname == (ansible_play_hosts | difference(groups['control']) | first) }}"
 basic_users_userdefaults:
   state: present
   create_home: "{{ basic_users_manage_homedir }}"
   generate_ssh_key: "{{ basic_users_manage_homedir }}"
   ssh_key_comment: "{{ item.name }}"
+  ssh_key_type: ed25519
   shell: "{{'/sbin/nologin' if 'control' in group_names else omit }}"
 basic_users_users: []
 basic_users_groups: []
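
The new default selects the first play host that is not in the `control` group. Where a different host should own the shared-filesystem tasks, the variable could be overridden along these lines. This is only a sketch: `login-0` is an assumed inventory hostname, and the host chosen must be part of the play and have the shared `/home` mounted.

```yaml
# Hypothetical override, e.g. placed in group_vars for the whole cluster.
# 'login-0' is an assumed inventory hostname, not one taken from this commit.
# Exactly one host in the play should evaluate this expression to true.
basic_users_manage_homedir: "{{ inventory_hostname == 'login-0' }}"
```

Because `create_home` and `generate_ssh_key` in `basic_users_userdefaults` both follow this variable, home directories and key pairs would then be created only from that host.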

ansible/roles/basic_users/tasks/main.yml

Lines changed: 3 additions & 3 deletions
@@ -46,21 +46,21 @@
     - item.state | default('present') == 'present'
     - item.public_key is defined
     - basic_users_manage_homedir
-  run_once: true

 - name: Write generated public key as authorized for SSH access
+  # this only runs on the basic_users_manage_homedir so has registered var
+  # from that host too
   authorized_key:
     user: "{{ item.name }}"
     state: present
     manage_dir: no
     key: "{{ item.ssh_public_key }}"
-  loop: "{{ hostvars[ansible_play_hosts | first].basic_users_info.results }}"
+  loop: "{{ basic_users_info.results }}"
   loop_control:
     label: "{{ item.name }}"
   when:
     - item.ssh_public_key is defined
     - basic_users_manage_homedir
-  run_once: true

 - name: Write sudo rules
   blockinfile:

ansible/roles/compute_init/README.md

Lines changed: 7 additions & 3 deletions
@@ -3,10 +3,13 @@
 Experimental functionality to allow compute nodes to rejoin the cluster after
 a reboot without running the `ansible/site.yml` playbook.

+**CAUTION:** The approach used here of exporting cluster secrets over NFS
+is considered to be a security risk due to the potential for cluster users to
+mount the share on a user-controlled machine by tunnelling through a login
+node. This feature should not be enabled on production clusters at this time.
+
 To enable this:
-1. Add the `compute` group (or a subset) into the `compute_init` group. This is
-the default when using cookiecutter to create an environment, via the
-"everything" template.
+1. Add the `compute` group (or a subset) into the `compute_init` group.
 2. Build an image which includes the `compute_init` group. This is the case
 for StackHPC-built release images.
 3. Enable the required functionalities during boot, by setting the
@@ -40,6 +43,7 @@ it also requires an image build with the role name added to the
 | bootstrap.yml | (wait for ansible-init) | Not relevant during boot | n/a |
 | bootstrap.yml | resolv_conf | Fully supported | No |
 | bootstrap.yml | etc_hosts | Fully supported | No |
+| bootstrap.yml | chrony | Fully supported | No |
 | bootstrap.yml | proxy | None at present | No |
 | bootstrap.yml | (/etc permissions) | None required - use image build | No |
 | bootstrap.yml | (ssh /home fix) | None required - use image build | No |

ansible/roles/compute_init/files/compute-init.yml

Lines changed: 6 additions & 0 deletions
@@ -18,6 +18,7 @@
 enable_lustre: "{{ os_metadata.meta.lustre | default(false) | bool }}"
 enable_basic_users: "{{ os_metadata.meta.basic_users | default(false) | bool }}"
 enable_eessi: "{{ os_metadata.meta.eessi | default(false) | bool }}"
+enable_chrony: "{{ os_metadata.meta.chrony | default(false) | bool }}"

 # TODO: "= role defaults" - could be moved to a vars_file: on play with similar precedence effects
 resolv_conf_nameservers: []
@@ -101,6 +102,11 @@

 # TODO: should /mnt/cluster now be UNMOUNTED to avoid future hang-ups?

+- name: Run chrony role
+  ansible.builtin.include_role:
+    name: mrlesmithjr.chrony
+  when: enable_chrony | bool
+
 - name: Configure resolve.conf
   block:
     - name: Set nameservers in /etc/resolv.conf
ansible/roles/compute_init/tasks/install.yml

Lines changed: 2 additions & 0 deletions
@@ -45,6 +45,8 @@
       dest: tasks/nfs-clients.yml
     - src: ../../lustre
       dest: roles/
+    - src: ../../mrlesmithjr.chrony
+      dest: roles/

 - name: Add filter_plugins to ansible.cfg
   lineinfile:

ansible/roles/hpctests/README.md

Lines changed: 14 additions & 1 deletion
@@ -29,9 +29,22 @@ Role Variables
 - `hpctests_ucx_net_devices`: Optional. Control which network device/interface to use, e.g. `mlx5_1:0`. The default of `all` (as per UCX) may not be appropriate for multi-rail nodes with different bandwidths on each device. See [here](https://openucx.readthedocs.io/en/master/faq.html#what-is-the-default-behavior-in-a-multi-rail-environment) and [here](https://github.com/openucx/ucx/wiki/UCX-environment-parameters#setting-the-devices-to-use). Alternatively a mapping of partition name (as `hpctests_partition`) to device/interface can be used. For partitions not defined in the mapping the default of `all` is used.
 - `hpctests_outdir`: Optional. Directory to use for test output on local host. Defaults to `$HOME/hpctests` (for local user).
 - `hpctests_hpl_NB`: Optional, default 192. The HPL block size "NB" - for Intel CPUs see [here](https://software.intel.com/content/www/us/en/develop/documentation/onemkl-linux-developer-guide/top/intel-oneapi-math-kernel-library-benchmarks/intel-distribution-for-linpack-benchmark/configuring-parameters.html).
-- `hpctests_hpl_mem_frac`: Optional, default 0.8. The HPL problem size "N" will be selected to target using this fraction of each node's memory.
+- `hpctests_hpl_mem_frac`: Optional, default 0.3. The HPL problem size "N" will
+be selected to target using this fraction of each node's memory -
+**CAUTION: see note below**.
 - `hpctests_hpl_arch`: Optional, default 'linux64'. Arbitrary architecture name for HPL build. HPL is compiled on the first compute node of those selected (see `hpctests_nodes`), so this can be used to create different builds for different types of compute node.

+
+---
+**CAUTION**
+
+> The default of `hpctests_hpl_mem_frac=0.3` will not significantly load nodes.
+Values up to ~0.8 may be appropriate for a stress test but ensure cloud
+operators are aware in case this overloads e.g. power supplies or cooling.
+Values > 0.8 require longer runtimes and increase the risk of out-of-memory
+errors without normally significantly increasing the stress on the node.
+---
+
 The following variables should not generally be changed:
 - `hpctests_pre_cmd`: Optional. Command(s) to include in sbatch templates before module load commands.
 - `hpctests_pingmatrix_modules`: Optional. List of modules to load for pingmatrix test. Defaults are suitable for OpenHPC 2.x cluster using the required packages.
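
The relationship between `hpctests_hpl_mem_frac` and runtime in the caution above can be made concrete with the usual HPL sizing rule. This is the commonly used heuristic and is an assumption here; the role's own calculation may differ in detail. The symbols $f$ (memory fraction), $M$ (total memory in bytes across the nodes used) and $W$ (floating-point operations) are introduced only for this sketch.

```latex
% An N x N matrix of 8-byte doubles occupies 8 N^2 bytes, so targeting a
% fraction f of the total memory M and rounding down to a multiple of NB gives
N \approx \left\lfloor \frac{1}{NB}\sqrt{\frac{f\,M}{8}} \right\rfloor \cdot NB

% HPL's operation count for a problem of size N is
W(N) = \tfrac{2}{3}N^{3} + 2N^{2}

% Since N scales as sqrt(f), the dominant term of W scales as f^{3/2}:
\frac{W(f = 0.8)}{W(f = 0.3)} \approx \left(\frac{0.8}{0.3}\right)^{3/2} \approx 4.4
```

Under these assumptions, moving from the new default of 0.3 back to 0.8 increases the arithmetic work, and so roughly the runtime at a fixed achieved rate, by a factor of about four.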

ansible/roles/hpctests/defaults/main.yml

Lines changed: 1 addition & 1 deletion
@@ -9,7 +9,7 @@ hpctests_outdir: "{{ lookup('env', 'APPLIANCES_ENVIRONMENT_ROOT') }}/hpctests"
 hpctests_ucx_net_devices: all
 hpctests_hpl_version: "2.3"
 hpctests_hpl_NB: 192
-hpctests_hpl_mem_frac: 0.8
+hpctests_hpl_mem_frac: 0.3
 hpctests_hpl_arch: linux64
 #hpctests_nodes:
 #hpctests_partition:
