Commit 1a4fb50

start to add hpl (#58)
* start to add hpl black magic? har har har
* clean up hpl to use spack build
* add prototype hpl and add back chatterbug

Signed-off-by: vsoch <[email protected]>
1 parent 8acda04 commit 1a4fb50

11 files changed, +916 -61 lines changed

docs/_static/data/metrics.json

Lines changed: 16 additions & 0 deletions
@@ -15,6 +15,14 @@
     "image": "ghcr.io/converged-computing/metric-bdas:latest",
     "url": "https://asc.llnl.gov/sites/asc/files/2020-09/BDAS_Summary_b4bcf27_0.pdf"
   },
+  {
+    "name": "app-hpl",
+    "description": "High-Performance Linpack (HPL)",
+    "family": "solver",
+    "type": "standalone",
+    "image": "ghcr.io/converged-computing/metric-hpl-spack:latest",
+    "url": "https://www.netlib.org/benchmark/hpl/"
+  },
   {
     "name": "app-kripke",
     "description": "parallel algebraic multigrid solver for linear systems arising from problems on unstructured grids",
@@ -95,6 +103,14 @@
     "image": "ghcr.io/converged-computing/metric-sysstat:latest",
     "url": "https://github.com/sysstat/sysstat"
   },
+  {
+    "name": "network-chatterbug",
+    "description": "A suite of communication proxies for HPC applications",
+    "family": "network",
+    "type": "standalone",
+    "image": "ghcr.io/converged-computing/metric-chatterbug:latest",
+    "url": "https://github.com/hpcgroup/chatterbug"
+  },
   {
     "name": "network-netmark",
     "description": "point to point networking tool",

docs/development/index.md

Lines changed: 1 addition & 0 deletions
@@ -8,6 +8,7 @@ any questions, please [let us know](https://github.com/converged-computing/metri
 :maxdepth: 3
 developer-guide
 designs
+metrics
 debugging
 creation
 ```

docs/development/metrics.md

Lines changed: 206 additions & 0 deletions
@@ -0,0 +1,206 @@
+# In Progress Metrics
+
+These metrics are considered under development and likely need more eyes to get fully working.
+
+## Network
+
+### network-chatterbug
+
+- [Standalone Metric Set](user-guide.md#application-metric-set)
+- *[network-chatterbug](https://github.com/converged-computing/metrics-operator/tree/main/examples/tests/network-chatterbug)*
+
+Chatterbug provides a [suite of communication proxy applications](https://github.com/hpcgroup/chatterbug) for HPC.
+We use a launcher/worker design.
+
+| Name | Description | Type | Default |
+|-----|--------------|------|---------|
+| mpirun | The options to give to mpirun (includes tasks) | string | `-N 8` |
+| command | The chatterbug command (subdirectory) to run, see options below | string | stencil3d |
+| args | Arguments for the command | string | `1 2 2 10 10 10 4 1` |
+| sole-tenancy | Require sole tenancy | string ("true" or "false") | "true" |
+
+By default, we require sole-tenancy, but you can disable this. Note that the best place to look for "documentation"
+on the commands seems to be [the source code](https://github.com/hpcgroup/chatterbug). The following command options
+are available for `command`:
+
+- pairs
+- ping-ping
+- spread
+- stencil3d
+- stencil4d
+- subcom2d-coll
+- subcom2d-a2a
+- unstr-mesh
+
+We have tested mostly stencil3d. Note that the mpirun command is assembled as follows:
+
+```bash
+$ mpirun --hostfile ./hostfile.txt --allow-run-as-root -N 4 /root/chatterbug/${command}/${executable} ${args}
+```
+
+Thus for the defaults, you'd get this command (on one pod):
+
+```bash
+$ mpirun --hostfile ./hostfile.txt --allow-run-as-root -N 4 /root/chatterbug/stencil3d/stencil3d.x 1 2 2 10 10 10 4 1
+```
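As a rough illustration of how the three parameters combine into the command above, here is a minimal Go sketch. The `ChatterbugOptions` type and `BuildCommand` helper are hypothetical names used only for this example (they are not taken from the operator source), and the sketch assumes the executable is always named `<command>.x`, which the documentation only shows for `stencil3d`.

```go
package main

import (
	"fmt"
	"strings"
)

// ChatterbugOptions mirrors the parameter table above.
// The struct is hypothetical and exists only for this sketch.
type ChatterbugOptions struct {
	Mpirun  string // options passed to mpirun, e.g. "-N 4"
	Command string // chatterbug subdirectory, e.g. "stencil3d"
	Args    string // arguments for the executable
}

// BuildCommand joins the fixed mpirun prefix with the user options.
// Assumption: the executable is named <command>.x, as in the stencil3d example.
func BuildCommand(o ChatterbugOptions) string {
	parts := []string{
		"mpirun", "--hostfile", "./hostfile.txt", "--allow-run-as-root",
		o.Mpirun,
		fmt.Sprintf("/root/chatterbug/%s/%s.x", o.Command, o.Command),
		o.Args,
	}
	return strings.Join(parts, " ")
}

func main() {
	fmt.Println(BuildCommand(ChatterbugOptions{
		Mpirun:  "-N 4",
		Command: "stencil3d",
		Args:    "1 2 2 10 10 10 4 1",
	}))
}
```

Running this prints the same command line as the stencil3d example shown above, without the leading `$` prompt.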
+
+See the example linked at the top of this section for a sample metrics.yaml.
+
+## Standalone
+
+### app-hpl
+
+- [Standalone Metric Set](user-guide.md#application-metric-set)
+- *[app-hpl](https://github.com/converged-computing/metrics-operator/tree/main/examples/tests/app-hpl)*
+
+The [Linpack](https://ulhpc-tutorials.readthedocs.io/en/production/parallel/mpi/HPL/) benchmark is used for the [Top500](https://www.top500.org/project/linpack/),
+and solves a dense system of linear equations. Arguments to customize include the following:
+
+| Name | Description | Type | Default |
+|-----|-------------|------|---------|
+| mpiargs | Arguments to give to mpi | string | empty string |
+| tasks | Number of tasks per node | int32 | detected using nproc |
+| ratio | target memory occupation | string (but as a float, e.g., "0.3") | "0.3" |
+| memory | memory in GiB | int32 | detected from proc |
+| blocksize | block size (the NBs value in hpl.dat) | int32 | |
+| pfact | | int32 | |
+| nbmin | | int32 | |
+| ndiv | | int32 | |
+| row_or_colmajor_pmapping | PMAP process mapping (0=Row-,1=Column-major) | int32 | 0 |
+| rfact | (0=left, 1=Crout, 2=Right) | int32 | 0 |
+| bcast | (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM) | int32 | 0 |
+| depth | lookahead depth | int32 | 0 |
+| swap | (0=bin-exch,1=long,2=mix) | int32 | 0 |
+| swappingThreshold | | int32 | 64 |
+| l1transposed | (0=transposed,1=no-transposed) | int32 | 0 |
+| utransposed | (0=transposed,1=no-transposed) | int32 | 0 |
+| memAlignment | memory alignment in double (> 0) (4,8,16) | int32 | |
+
+For the meaning of each of these, see [this documentation](https://ulhpc-tutorials.readthedocs.io/en/production/parallel/mpi/HPL/#hpl-main-parameters)
+and how they are used in [hpl.go](https://github.com/converged-computing/metrics-operator/tree/main/pkg/metrics/app/hpl.go).
+I made an effort to define them above, but you should consult that documentation, because I don't fully
+understand these yet.
+
+We provide a simple build here, as typically vendors spend a lot of time custom-compiling the code
+for their architectures (and we are compiling for general use). We will use a script `compute_N` from the ULHPC Tutorials to generate input data for a particular
+problem size, and you can vary the input to this script via the `computeArgs` parameters. We use a default, and you can inspect the
+script help below:
+
+<details>
+
+<summary>`compute_N --help`</summary>
+
+```console
+# compute_N -h
+Compute N for HPL runs.
+
+SYNOPSIS
+compute_N [-v] [--mem <SIZE_IN_GB>] [-N <NODES>] [-r <RATIO>] [-NB <NB>]
+compute_N [-v] [--mem <SIZE_IN_GB>] [-N <NODES>] [-p <PERCENTAGE_MEM>] [-NB <NB>]
+
+The following formulae is used (when using '-r <ratio>'):
+N = <ratio>*SQRT( Total Memory Size in bytes / sizeof(double) )
+= <ratio>*SQRT( <nnodes> * <ram_size> / 8)
+
+Alternatively you may wish to specify a memory usage ratio (with -p <percentage_mem>),
+in which case the following formulae is used:
+N = SQRT( <percentage_mem>/100 * Total Memory Size in bytes / sizeof(doubl)
+
+OPTIONS
+-m --mem --ramsize <SIZE>
+Specify the total memory size per node, in GiB.
+Default RAM size consider (yet in KiB): 16051112 KiB
+-N --nodes <N>
+Number of compute nodes
+-NB <NB>
+NB parameters to use. Default: 192 (384 for skylake)
+-p --memshare <PERCENTAGE_MEM>
+Percentage of the total memory size to use.
+Derived from the below global ratio (i.e. 0% since RATIO=0.8)
+-r --ratio <RATIO>
+Global ratio to apply. Default: 0.8
+
+EXAMPLE
+For 2 broadwell nodes on iris cluster, using 30% of the total memory per node:
+compute_N -N 2 -p 30 -m 128 -NB 192
+For 4 skylake nodes on iris cluster, using 85% of the total memory per node:
+compute_N -N 4 -p 85 -m 128 -NB 384
+
+AUTHORS
+Sebastien Varrette <[email protected]> and UL HPC Team
+
+COPYRIGHT
+This is free software; see the source for copying conditions. There is
+NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+```
+
+</details>
+
+The following examples are [provided](https://ulhpc-tutorials.readthedocs.io/en/production/parallel/mpi/HPL/) to generate the HPL.dat for the analysis:
+
+```bash
+/opt/tutorials/benchmarks/HPL/scripts/compute_N -h
+# 1 Broadwell node, alpha = 0.3
+/opt/tutorials/benchmarks/HPL/scripts/compute_N -m 128 -NB 192 -r 0.3 -N 1
+# 2 Skylake (regular) nodes, alpha = 0.3
+/opt/tutorials/benchmarks/HPL/scripts/compute_N -m 128 -NB 384 -r 0.3 -N 2
+# 4 bigmem (skylake) nodes, beta = 0.85
+/opt/tutorials/benchmarks/HPL/scripts/compute_N -m 3072 -NB 384 -p 85 -N 4
+```
+
+Here is a tiny setup I created for a testing case:
+
+```bash
+/opt/tutorials/benchmarks/HPL/scripts/compute_N -m 128 -NB 192 -r 0.3 -N 2
+```
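To make the two formulas in the `compute_N` help text concrete, here is a minimal Go sketch of the arithmetic. It is not the script itself: the function names are hypothetical, it assumes the per-node memory is given in GiB of 2^30 bytes, and rounding N down to a multiple of NB is an assumption (a common HPL convention) that the help text does not state. The real script may round or adjust the value differently.

```go
package main

import (
	"fmt"
	"math"
)

// bytesPerGiB assumes the -m value is in GiB of 2^30 bytes.
const bytesPerGiB = 1 << 30

// ProblemSizeRatio applies N = ratio * sqrt(nodes * memBytes / sizeof(double)).
func ProblemSizeRatio(nodes int, memGiB, ratio float64) float64 {
	totalBytes := float64(nodes) * memGiB * bytesPerGiB
	return ratio * math.Sqrt(totalBytes/8)
}

// ProblemSizePercent applies N = sqrt(percent/100 * totalBytes / sizeof(double)).
func ProblemSizePercent(nodes int, memGiB, percent float64) float64 {
	totalBytes := float64(nodes) * memGiB * bytesPerGiB
	return math.Sqrt(percent / 100 * totalBytes / 8)
}

// RoundDownToBlock truncates n to a multiple of the block size nb.
// Assumption: this rounding step is not described in the help text.
func RoundDownToBlock(n float64, nb int) int {
	if nb <= 0 {
		return int(n)
	}
	return int(n) / nb * nb
}

func main() {
	// Same inputs as the tiny test setup: -m 128 -NB 192 -r 0.3 -N 2
	n := ProblemSizeRatio(2, 128, 0.3)
	fmt.Println(RoundDownToBlock(n, 192))
}
```

For one node with 128 GiB, the square root term is exactly 131072, so a ratio of 0.3 gives roughly 39321 before any rounding to the block size.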
+
+Next, you might care about the input data, a file called `hpl.dat`. By default we use
+a template that is populated by the above variables, and here is another example that I found
+in the repository:
+
+<details>
+
+<summary>Default hpl.dat</summary>
+
+```console
+HPLinpack benchmark input file
+Innovative Computing Laboratory, University of Tennessee
+HPL.out output file name (if any)
+6 device out (6=stdout,7=stderr,file)
+1 # of problems sizes (N)
+24650 Ns
+1 # of NBs
+192 NBs
+0 PMAP process mapping (0=Row-,1=Column-major)
+2 # of process grids (P x Q)
+2 4 Ps
+14 7 Qs
+16.0 threshold
+1 # of panel fact
+2 PFACTs (0=left, 1=Crout, 2=Right)
+1 # of recursive stopping criterium
+4 NBMINs (>= 1)
+1 # of panels in recursion
+2 NDIVs
+1 # of recursive panel fact.
+1 RFACTs (0=left, 1=Crout, 2=Right)
+1 # of broadcast
+1 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
+1 # of lookahead depth
+1 DEPTHs (>=0)
+2 SWAP (0=bin-exch,1=long,2=mix)
+64 swapping threshold
+0 L1 in (0=transposed,1=no-transposed) form
+0 U in (0=transposed,1=no-transposed) form
+1 Equilibration (0=no,1=yes)
+8 memory alignment in double (> 0)
+##### This line (no. 32) is ignored (it serves as a separator). ######
+0 Number of additional problem sizes for PTRANS
+1200 10000 30000 values of N
+0 number of additional blocking sizes for PTRANS
+40 9 8 13 13 20 16 32 64 values of NB
+```
+
+</details>
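The idea of populating an `hpl.dat` template from the metric options can be sketched with Go's standard `text/template` package. This is only an illustration under stated assumptions: the `HPLParams` struct, its field names, and the shortened template are hypothetical and do not reproduce the template in `hpl.go`.

```go
package main

import (
	"os"
	"strings"
	"text/template"
)

// HPLParams is a hypothetical subset of the options in the table above.
type HPLParams struct {
	N         int // problem size (Ns)
	Blocksize int // block size (NBs)
	PMap      int // 0=row-major, 1=column-major process mapping
}

// hplTemplate covers only the first lines of hpl.dat for brevity.
const hplTemplate = `HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out output file name (if any)
6 device out (6=stdout,7=stderr,file)
1 # of problems sizes (N)
{{.N}} Ns
1 # of NBs
{{.Blocksize}} NBs
{{.PMap}} PMAP process mapping (0=Row-,1=Column-major)
`

// RenderHPL fills the template with the given parameters.
func RenderHPL(p HPLParams) (string, error) {
	tmpl, err := template.New("hpl.dat").Parse(hplTemplate)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	if err := tmpl.Execute(&sb, p); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	out, err := RenderHPL(HPLParams{N: 24650, Blocksize: 192, PMap: 0})
	if err != nil {
		panic(err)
	}
	os.Stdout.WriteString(out)
}
```

With the values shown, the rendered lines match the first nine lines of the example file above.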
+
+If something above is not properly exposed, please [let us know](https://github.com/converged-computing/metrics-operator/issues).

docs/getting_started/metrics.md

Lines changed: 0 additions & 1 deletion
@@ -369,7 +369,6 @@ More likely you want an actual problem size on a specific number of node and tas
 run a larger problem and the parser does not work as expected, please [send us the output](https://github.com/converged-computing/metrics-operator/issues) and we will provide an updated parser.
 See [this guide](https://asc.llnl.gov/sites/asc/files/2020-09/AMG_Summary_v1_7.pdf) for more detail.
 
-
 #### app-quicksilver
 
 - [Standalone Metric Set](user-guide.md#application-metric-set)
Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
+apiVersion: flux-framework.org/v1alpha1
+kind: MetricSet
+metadata:
+  labels:
+    app.kubernetes.io/name: metricset
+    app.kubernetes.io/instance: metricset-sample
+  name: metricset-sample
+spec:
+  pods: 2
+  logging:
+    interactive: true
+
+  # This is not currently fully working, hence why we do not have it documented yet, etc.
+  metrics:
+    - name: app-hpl

examples/tests/network-chatterbug/README.md

Lines changed: 1 addition & 0 deletions
@@ -1,6 +1,7 @@
 # Chatterbug Networking Example
 
 This will demonstrate running a [Chatterbug](https://github.com/hpcgroup/chatterbug) metric.
+This metric is experimental and not working in all contexts.
 
 ## Usage
 

pkg/jobs/launcher.go

Lines changed: 10 additions & 5 deletions
@@ -142,6 +142,16 @@ func (m LauncherWorker) GetCommonPrefix(
 	hosts string,
 ) string {
 
+	// Generate problem.sh with command only if we have one!
+	if command != "" {
+		command = fmt.Sprintf(`# Write the command file
+cat <<EOF > ./problem.sh
+#!/bin/bash
+%s
+EOF
+chmod +x ./problem.sh`, command)
+	}
+
 	prefixTemplate := `#!/bin/bash
 # Start ssh daemon
 /usr/sbin/sshd -D &
@@ -153,12 +163,7 @@ cat <<EOF > ./hostlist.txt
 %s
 EOF
 
-# Write the command file
-cat <<EOF > ./problem.sh
-#!/bin/bash
 %s
-EOF
-chmod +x ./problem.sh
 
 # Allow network to ready (this could be a variable)
 echo "Sleeping for 10 seconds waiting for network..."
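The effect of this change can be seen in isolation with a small Go sketch. The `ProblemScript` function name is hypothetical and the snippet is a standalone approximation of the new conditional, not the `GetCommonPrefix` method itself: with an empty command nothing is emitted, so the prefix template no longer writes an empty `problem.sh`.

```go
package main

import "fmt"

// ProblemScript returns the shell snippet that writes ./problem.sh,
// or an empty string when there is no command to wrap.
// Hypothetical helper that mirrors the conditional added in the diff.
func ProblemScript(command string) string {
	if command == "" {
		return ""
	}
	return fmt.Sprintf(`# Write the command file
cat <<EOF > ./problem.sh
#!/bin/bash
%s
EOF
chmod +x ./problem.sh`, command)
}

func main() {
	fmt.Printf("empty command -> %q\n", ProblemScript(""))
	fmt.Println(ProblemScript("hostname"))
}
```

One thing to keep in mind with this pattern: the heredoc delimiter `EOF` is unquoted, so the shell expands any `$VARIABLE` references in the command when `problem.sh` is written rather than when it runs.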
