Commit 1469d5d

lower hpl memory fraction to reduce stress from defaults
1 parent 673ef13 commit 1469d5d

2 files changed: 15 additions, 2 deletions

ansible/roles/hpctests/README.md

Lines changed: 14 additions & 1 deletion
@@ -29,9 +29,22 @@ Role Variables
 - `hpctests_ucx_net_devices`: Optional. Control which network device/interface to use, e.g. `mlx5_1:0`. The default of `all` (as per UCX) may not be appropriate for multi-rail nodes with different bandwidths on each device. See [here](https://openucx.readthedocs.io/en/master/faq.html#what-is-the-default-behavior-in-a-multi-rail-environment) and [here](https://github.com/openucx/ucx/wiki/UCX-environment-parameters#setting-the-devices-to-use). Alternatively a mapping of partition name (as `hpctests_partition`) to device/interface can be used. For partitions not defined in the mapping the default of `all` is used.
 - `hpctests_outdir`: Optional. Directory to use for test output on local host. Defaults to `$HOME/hpctests` (for local user).
 - `hpctests_hpl_NB`: Optional, default 192. The HPL block size "NB" - for Intel CPUs see [here](https://software.intel.com/content/www/us/en/develop/documentation/onemkl-linux-developer-guide/top/intel-oneapi-math-kernel-library-benchmarks/intel-distribution-for-linpack-benchmark/configuring-parameters.html).
-- `hpctests_hpl_mem_frac`: Optional, default 0.8. The HPL problem size "N" will be selected to target using this fraction of each node's memory.
+- `hpctests_hpl_mem_frac`: Optional, default 0.3. The HPL problem size "N" will
+be selected to target using this fraction of each node's memory -
+**CAUTION: see note below**.
 - `hpctests_hpl_arch`: Optional, default 'linux64'. Arbitrary architecture name for HPL build. HPL is compiled on the first compute node of those selected (see `hpctests_nodes`), so this can be used to create different builds for different types of compute node.
 
+
+---
+**CAUTION**
+
+> The default of `hpctests_hpl_mem_frac=0.3` will not significantly load nodes.
+Values up to ~0.8 may be appropriate for a stress test but ensure cloud
+operators are aware in case this overloads e.g. power supplies or cooling.
+Values > 0.8 require longer runtimes and increase the risk of out-of-memory
+errors without normally significantly increasing the stress on the node.
+---
+
 The following variables should not generally be changed:
 - `hpctests_pre_cmd`: Optional. Command(s) to include in sbatch templates before module load commands.
 - `hpctests_pingmatrix_modules`: Optional. List of modules to load for pingmatrix test. Defaults are suitable for OpenHPC 2.x cluster using the required packages.

ansible/roles/hpctests/defaults/main.yml

Lines changed: 1 addition & 1 deletion
@@ -9,7 +9,7 @@ hpctests_outdir: "{{ lookup('env', 'APPLIANCES_ENVIRONMENT_ROOT') }}/hpctests"
 hpctests_ucx_net_devices: all
 hpctests_hpl_version: "2.3"
 hpctests_hpl_NB: 192
-hpctests_hpl_mem_frac: 0.8
+hpctests_hpl_mem_frac: 0.3
 hpctests_hpl_arch: linux64
 #hpctests_nodes:
 #hpctests_partition:
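
The README text in this commit says the HPL problem size "N" is chosen to target a fraction of each node's memory. The sketch below illustrates how such a fraction could map to N using the common HPL sizing rule of thumb: an N x N matrix of 8-byte doubles should fit in the targeted memory, with N rounded down to a multiple of the block size NB. This is an assumption for illustration, not the role's actual calculation. The function name `hpl_problem_size` and the 192 GiB node size are hypothetical.

```python
import math

BYTES_PER_DOUBLE = 8  # HPL factorises an N x N matrix of double-precision values


def hpl_problem_size(node_mem_bytes: int, num_nodes: int,
                     mem_frac: float, nb: int = 192) -> int:
    """Return the largest multiple of nb whose N x N matrix of doubles
    fits in mem_frac of the combined memory of num_nodes nodes.

    Hypothetical helper: the hpctests role may size N differently.
    """
    if not 0 < mem_frac <= 1:
        raise ValueError("mem_frac must be in (0, 1]")
    target_bytes = mem_frac * node_mem_bytes * num_nodes
    # Largest N with N * N * 8 bytes <= target_bytes
    n_max = math.isqrt(int(target_bytes // BYTES_PER_DOUBLE))
    # Round down to a whole number of NB-sized blocks
    return (n_max // nb) * nb


if __name__ == "__main__":
    node_mem = 192 * 1024**3  # assumed 192 GiB per node, for illustration only
    for frac in (0.3, 0.8):   # new and old defaults from this commit
        n = hpl_problem_size(node_mem, num_nodes=2, mem_frac=frac)
        print(f"mem_frac={frac}: N={n}")
```

Under this rule N grows with the square root of the memory fraction, while HPL's operation count grows roughly with N cubed. Moving from 0.3 back to 0.8 would therefore mean roughly four times as much work per run, which is consistent with the commit's aim of making the default less stressful.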
