
Commit 7bb9920

Merge branch 'main' into feat/k3s-bootstrap
2 parents 178f853 + c688e3a commit 7bb9920


15 files changed

+99
-29
lines changed


.github/workflows/stackhpc.yml

Lines changed: 1 addition & 2 deletions
@@ -182,9 +182,8 @@ jobs:
         run: |
           . venv/bin/activate
           . environments/.stackhpc/activate
-          ansible-playbook -v --limit compute ansible/adhoc/rebuild.yml
-          ansible-playbook -v ansible/ci/check_slurm.yml
           ansible-playbook -v ansible/adhoc/reboot_via_slurm.yml
+          ansible-playbook -v ansible/ci/check_slurm.yml
 
       - name: Check sacct state survived reimage
         run: |

ansible/bootstrap.yml

Lines changed: 7 additions & 0 deletions
@@ -52,6 +52,13 @@
     - import_role:
         name: proxy
 
+- hosts: chrony
+  tags: chrony
+  become: yes
+  tasks:
+    - import_role:
+        name: mrlesmithjr.chrony
+
 - hosts: cluster
   gather_facts: false
   become: yes

ansible/roles/compute_init/README.md

Lines changed: 7 additions & 3 deletions
@@ -3,10 +3,13 @@
 Experimental functionality to allow compute nodes to rejoin the cluster after
 a reboot without running the `ansible/site.yml` playbook.
 
+**CAUTION:** The approach used here of exporting cluster secrets over NFS
+is considered to be a security risk due to the potential for cluster users to
+mount the share on a user-controlled machine by tunnelling through a login
+node. This feature should not be enabled on production clusters at this time.
+
 To enable this:
-1. Add the `compute` group (or a subset) into the `compute_init` group. This is
-   the default when using cookiecutter to create an environment, via the
-   "everything" template.
+1. Add the `compute` group (or a subset) into the `compute_init` group.
 2. Build an image which includes the `compute_init` group. This is the case
    for StackHPC-built release images.
 3. Enable the required functionalities during boot, by setting the
@@ -40,6 +43,7 @@ it also requires an image build with the role name added to the
 | bootstrap.yml | (wait for ansible-init) | Not relevant during boot | n/a |
 | bootstrap.yml | resolv_conf | Fully supported | No |
 | bootstrap.yml | etc_hosts | Fully supported | No |
+| bootstrap.yml | chrony | Fully supported | No |
 | bootstrap.yml | proxy | None at present | No |
 | bootstrap.yml | (/etc permissions) | None required - use image build | No |
 | bootstrap.yml | (ssh /home fix) | None required - use image build | No |

ansible/roles/compute_init/files/compute-init.yml

Lines changed: 6 additions & 0 deletions
@@ -17,6 +17,7 @@
     enable_manila: "{{ os_metadata.meta.manila | default(false) | bool }}"
     enable_basic_users: "{{ os_metadata.meta.basic_users | default(false) | bool }}"
     enable_eessi: "{{ os_metadata.meta.eessi | default(false) | bool }}"
+    enable_chrony: "{{ os_metadata.meta.chrony | default(false) | bool }}"
 
     # TODO: "= role defaults" - could be moved to a vars_file: on play with similar precedence effects
     resolv_conf_nameservers: []
@@ -100,6 +101,11 @@
 
     # TODO: should /mnt/cluster now be UNMOUNTED to avoid future hang-ups?
 
+    - name: Run chrony role
+      ansible.builtin.include_role:
+        name: mrlesmithjr.chrony
+      when: enable_chrony | bool
+
     - name: Configure resolve.conf
       block:
         - name: Set nameservers in /etc/resolv.conf

ansible/roles/compute_init/tasks/install.yml

Lines changed: 2 additions & 0 deletions
@@ -43,6 +43,8 @@
       dest: tasks/tuned.yml
     - src: ../../stackhpc.nfs/tasks/nfs-clients.yml
       dest: tasks/nfs-clients.yml
+    - src: ../../mrlesmithjr.chrony
+      dest: roles/
 
 - name: Add filter_plugins to ansible.cfg
   lineinfile:

ansible/roles/hpctests/README.md

Lines changed: 14 additions & 1 deletion
@@ -29,9 +29,22 @@ Role Variables
 - `hpctests_ucx_net_devices`: Optional. Control which network device/interface to use, e.g. `mlx5_1:0`. The default of `all` (as per UCX) may not be appropriate for multi-rail nodes with different bandwidths on each device. See [here](https://openucx.readthedocs.io/en/master/faq.html#what-is-the-default-behavior-in-a-multi-rail-environment) and [here](https://github.com/openucx/ucx/wiki/UCX-environment-parameters#setting-the-devices-to-use). Alternatively a mapping of partition name (as `hpctests_partition`) to device/interface can be used. For partitions not defined in the mapping the default of `all` is used.
 - `hpctests_outdir`: Optional. Directory to use for test output on local host. Defaults to `$HOME/hpctests` (for local user).
 - `hpctests_hpl_NB`: Optional, default 192. The HPL block size "NB" - for Intel CPUs see [here](https://software.intel.com/content/www/us/en/develop/documentation/onemkl-linux-developer-guide/top/intel-oneapi-math-kernel-library-benchmarks/intel-distribution-for-linpack-benchmark/configuring-parameters.html).
-- `hpctests_hpl_mem_frac`: Optional, default 0.8. The HPL problem size "N" will be selected to target using this fraction of each node's memory.
+- `hpctests_hpl_mem_frac`: Optional, default 0.3. The HPL problem size "N" will
+  be selected to target using this fraction of each node's memory -
+  **CAUTION: see note below**.
 - `hpctests_hpl_arch`: Optional, default 'linux64'. Arbitrary architecture name for HPL build. HPL is compiled on the first compute node of those selected (see `hpctests_nodes`), so this can be used to create different builds for different types of compute node.
 
+
+---
+**CAUTION**
+
+> The default of `hpctests_hpl_mem_frac=0.3` will not significantly load nodes.
+Values up to ~0.8 may be appropriate for a stress test but ensure cloud
+operators are aware in case this overloads e.g. power supplies or cooling.
+Values > 0.8 require longer runtimes and increase the risk of out-of-memory
+errors without normally significantly increasing the stress on the node.
+---
+
 The following variables should not generally be changed:
 - `hpctests_pre_cmd`: Optional. Command(s) to include in sbatch templates before module load commands.
 - `hpctests_pingmatrix_modules`: Optional. List of modules to load for pingmatrix test. Defaults are suitable for OpenHPC 2.x cluster using the required packages.
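
Review note: to give a feel for what changing `hpctests_hpl_mem_frac` from 0.8 to 0.3 means for the HPL problem size, the sketch below applies the common sizing rule of thumb: an N x N matrix of 8-byte doubles should fill the chosen fraction of node memory, with N rounded down to a multiple of `hpctests_hpl_NB`. The helper name `hpl_problem_size` and the 64 GiB node are hypothetical, and the role's own calculation may differ in detail.

```shell
# Hypothetical helper, not part of the hpctests role.
# Usage: hpl_problem_size <node memory in MiB> <memory fraction> <block size NB>
hpl_problem_size() {
    awk -v mem="$1" -v frac="$2" -v nb="$3" 'BEGIN {
        bytes = mem * 1024 * 1024 * frac   # bytes of memory to target
        n = int(sqrt(bytes / 8))           # N*N doubles at 8 bytes each
        print int(n / nb) * nb             # round down to a multiple of NB
    }'
}

# Compare the new and old defaults on an assumed 64 GiB node with NB=192.
hpl_problem_size 65536 0.3 192
hpl_problem_size 65536 0.8 192
```

Because memory use grows with N squared while runtime grows roughly with N cubed, the lower default shortens runs considerably, which is consistent with the caution added to the README.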

ansible/roles/hpctests/defaults/main.yml

Lines changed: 1 addition & 1 deletion
@@ -9,7 +9,7 @@ hpctests_outdir: "{{ lookup('env', 'APPLIANCES_ENVIRONMENT_ROOT') }}/hpctests"
 hpctests_ucx_net_devices: all
 hpctests_hpl_version: "2.3"
 hpctests_hpl_NB: 192
-hpctests_hpl_mem_frac: 0.8
+hpctests_hpl_mem_frac: 0.3
 hpctests_hpl_arch: linux64
 #hpctests_nodes:
 #hpctests_partition:

dev/setup-env.sh

Lines changed: 21 additions & 19 deletions
@@ -2,28 +2,30 @@
 
 set -euo pipefail
 
-if [[ -f /etc/os-release ]]; then
-    . /etc/os-release
-    OS=$ID
-    OS_VERSION=$VERSION_ID
-else
-    exit 1
-fi
+PYTHON_VERSION=${PYTHON_VERSION:-}
 
-MAJOR_VERSION=$(echo $OS_VERSION | cut -d. -f1)
+if [[ "$PYTHON_VERSION" == "" ]]; then
+    if [[ -f /etc/os-release ]]; then
+        . /etc/os-release
+        OS=$ID
+        OS_VERSION=$VERSION_ID
+    else
+        exit 1
+    fi
 
-PYTHON_VERSION=""
+    MAJOR_VERSION=$(echo $OS_VERSION | cut -d. -f1)
 
-if [[ "$OS" == "ubuntu" && "$MAJOR_VERSION" == "22" ]]; then
-    PYTHON_VERSION="/usr/bin/python3.10"
-elif [[ "$OS" == "rocky" && "$MAJOR_VERSION" == "8" ]]; then
-    # python3.9+ doesn't have selinux bindings
-    PYTHON_VERSION="/usr/bin/python3.8" # use `sudo yum install python38` on Rocky Linux 8 to install this
-elif [[ "$OS" == "rocky" && "$MAJOR_VERSION" == "9" ]]; then
-    PYTHON_VERSION="/usr/bin/python3.9"
-else
-    echo "Unsupported OS version: $OS $MAJOR_VERSION"
-    exit 1
+    if [[ "$OS" == "ubuntu" && "$MAJOR_VERSION" == "22" ]]; then
+        PYTHON_VERSION="/usr/bin/python3.10"
+    elif [[ "$OS" == "rocky" && "$MAJOR_VERSION" == "8" ]]; then
+        # python3.9+ doesn't have selinux bindings
+        PYTHON_VERSION="/usr/bin/python3.8" # use `sudo yum install python38` on Rocky Linux 8 to install this
+    elif [[ "$OS" == "rocky" && "$MAJOR_VERSION" == "9" ]]; then
+        PYTHON_VERSION="/usr/bin/python3.9"
+    else
+        echo "Unsupported OS version: $OS $MAJOR_VERSION"
+        exit 1
+    fi
 fi
 
 if [[ ! -d "venv" ]]; then
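
Review note: with this change a caller can pin the interpreter by setting `PYTHON_VERSION` before running the script, in which case OS detection is skipped. The sketch below isolates that pattern. `select_python` and the `auto-detected` placeholder are hypothetical stand-ins for the script's `/etc/os-release` branch, and `/usr/bin/python3.11` is only an example path.

```shell
# Hypothetical stand-in for the selection logic in dev/setup-env.sh.
select_python() {
    # ${PYTHON_VERSION:-} expands to an empty string when the variable is
    # unset, so the reference is safe even under `set -u`.
    chosen=${PYTHON_VERSION:-}
    if [ -z "$chosen" ]; then
        # Placeholder for the /etc/os-release based detection.
        chosen="auto-detected"
    fi
    echo "$chosen"
}

# Pin the interpreter for a single call; the caller's value wins.
PYTHON_VERSION=/usr/bin/python3.11 select_python
```

Invoking the real script as `PYTHON_VERSION=/usr/bin/python3.11 dev/setup-env.sh` would follow the same pattern, assuming that interpreter is installed.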

docs/chrony.md

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
+# Chrony configuration
+
+Use variables from the [mrlesmithjr.chrony](https://github.com/mrlesmithjr/ansible-chrony) role.
+
+For example in: `environments/<environment>/inventory/group_vars/all/chrony`:
+
+```
+---
+chrony_ntp_servers:
+  - server: ntp-0.example.org
+    options:
+      - option: iburst
+      - option: minpoll
+        val: 8
+  - server: ntp-1.example.org
+    options:
+      - option: iburst
+      - option: minpoll
+        val: 8
+
+```

docs/production.md

Lines changed: 3 additions & 0 deletions
@@ -127,3 +127,6 @@ and referenced from the `site` and `production` environments, e.g.:
 
 - Note [PR 473](https://github.com/stackhpc/ansible-slurm-appliance/pull/473)
   may help identify any site-specific configuration.
+
+- See the [hpctests docs](../ansible/roles/hpctests/README.md) for advice on
+  raising `hpctests_hpl_mem_frac` during tests.
