Commit 31aed98 (1 parent: daa8e85)

GPU-Aware MPI on OLCF Frontier and Combined weak- & strong-scaling case (#448)
File tree: 30 files changed, +398 −169 lines

.gitignore

Lines changed: 2 additions & 0 deletions

```diff
@@ -53,6 +53,8 @@ examples/*/viz/
 examples/*.jpg
 examples/*.png
 examples/*/workloads/
+examples/*/run-*/
+examples/*/logs/
 workloads/
 
 benchmarks/*batch/*/
```

docs/documentation/case.md

Lines changed: 18 additions & 15 deletions

````diff
@@ -30,40 +30,39 @@ This is particularly useful when computations are done in Python to generate the
 
 ## (Optional) Accepting command line arguments
 
-Input files can accept **positional** command line arguments, forwarded by `mfc.sh run`.
-Consider this example from the 3D_weak_scaling case:
+Input files can accept command line arguments, forwarded by `mfc.sh run`.
+Consider this example from the `scaling` case:
 
 ```python
 import argparse
 
 parser = argparse.ArgumentParser(
-    prog="3D_weak_scaling",
-    description="This MFC case was created for the purposes of weak scaling.",
+    prog="scaling",
+    description="Weak- and strong-scaling benchmark case.",
     formatter_class=argparse.ArgumentDefaultsHelpFormatter)
 
-parser.add_argument("dict", type=str, metavar="DICT", help=argparse.SUPPRESS)
-parser.add_argument("gbpp", type=int, metavar="MEM", default=16, help="Adjusts the problem size per rank to fit into [MEM] GB of GPU memory")
+parser.add_argument("dict", type=str, metavar="DICT")
+parser.add_argument("-s", "--scaling", type=str, metavar="SCALING", choices=["weak", "strong"], help="Whether weak- or strong-scaling is being exercised.")
 
 # Your parsed arguments are here
-ARGS = vars(parser.parse_args())
+args = parser.parse_args()
 ```
 
 The first argument is always a JSON string representing `mfc.sh run`'s internal
 state.
 It contains all the runtime information you might want from the build/run system.
-We hide it from the help menu with `help=argparse.SUPPRESS` since it is not meant to be passed in by users.
-You can add as many additional positional arguments as you may need.
+You can add as many additional arguments as you may need.
 
 To run such a case, use the following format:
 
 ```shell
-./mfc.sh run <path/to/case.py> <positional arguments> <regular mfc.sh run arguments>
+./mfc.sh run <path/to/case.py> <mfc.sh run arguments> -- <case arguments>
 ```
 
-For example, to run the 3D_weak_scaling case with `gbpp=2`:
+For example, to run the `scaling` case in "weak-scaling" mode:
 
 ```shell
-./mfc.sh run examples/3D_weak_scaling/case.py 2 -t pre_process -j 8
+./mfc.sh run examples/scaling/case.py -t pre_process -j 8 -- --scaling weak
 ```
 
 ## Parameters
````

```diff
@@ -87,11 +86,15 @@ Definition of the parameters is described in the following subsections.
 
 ### 1. Runtime
 
-| Parameter       | Type    | Description                 |
-| ---:            | :----:  | :---                        |
-| `run_time_info` | Logical | Output run-time information |
+| Parameter       | Type    | Description                               |
+| ---:            | :----:  | :---                                      |
+| `run_time_info` | Logical | Output run-time information               |
+| `rdma_mpi`      | Logical | (GPUs) Enable RDMA for MPI communication. |
 
 - `run_time_info` generates a text file that includes run-time information including the CFL number(s) at each time-step.
+- `rdma_mpi` optimizes data transfers between GPUs using Remote Direct Memory Access (RDMA).
+The underlying MPI implementation and communication infrastructure must support this
+feature, detecting GPU pointers and performing RDMA accordingly.
 
 ### 2. Computational Domain
```

docs/documentation/running.md

Lines changed: 0 additions & 2 deletions

```diff
@@ -24,8 +24,6 @@ several supercomputer clusters, both interactively and through batch submission.
 >
 > If `-c <computer name>` is left unspecified, it defaults to `-c default`.
 
-Additional flags can be appended to the MPI executable call using the `-f` (i.e `--flags`) option.
-
 Please refer to `./mfc.sh run -h` for a complete list of arguments and options, along with their defaults.
 
 ## Interactive Execution
```

examples/2D_whale_bubble_annulus/case.py

Lines changed: 1 addition & 2 deletions

```diff
@@ -193,6 +193,5 @@
     'Mono(1)%pulse'  : 1,
     'Mono(1)%mag'    : 1.,
     'Mono(1)%length' : 0.2,
-    'cu_mpi'         : 'F',
-
+    'rdma_mpi'       : 'F',
 }))
```

examples/3D_weak_scaling/README.md

Lines changed: 0 additions & 24 deletions
This file was deleted.

examples/3D_weak_scaling/analyze.sh

Lines changed: 0 additions & 5 deletions
This file was deleted.

examples/scaling/README.md

Lines changed: 33 additions & 0 deletions

````diff
@@ -0,0 +1,33 @@
+# Strong- & Weak-scaling
+
+The [**Scaling**](case.py) case can exercise both weak- and strong-scaling. It
+adjusts itself depending on the number of requested ranks.
+
+This directory also contains a collection of scripts used to test strong-scaling
+on OLCF Frontier. They required modifying MFC to collect some metrics but are
+meant to serve as a reference to users wishing to run similar experiments.
+
+## Weak Scaling
+
+Pass `--scaling weak`. The `--memory` option controls (approximately) how much
+memory each rank should use, in Gigabytes. The number of cells in each dimension
+is then adjusted according to the number of requested ranks and an approximation
+for the relation between cell count and memory usage. The problem size increases
+linearly with the number of ranks.
+
+## Strong Scaling
+
+Pass `--scaling strong`. The `--memory` option controls (approximately) how much
+memory should be used in total during simulation, across all ranks, in Gigabytes.
+The problem size remains constant as the number of ranks increases.
+
+## Example
+
+For example, to run a weak-scaling test that uses ~4GB of GPU memory per rank
+on 8 2-rank nodes with case optimization, one could:
+
+```shell
+./mfc.sh run examples/scaling/case.py -t pre_process simulation \
+    -e batch -p mypartition -N 8 -n 2 -w "01:00:00" -# "MFC Weak Scaling" \
+    --case-optimization -j 32 -- --scaling weak --memory 4
+```
````
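The sizing logic the README describes can be sketched as follows. The cells-per-GB constant and the 2s×s×s grid shape come from this commit's `case.py`; the `grid` wrapper itself is illustrative:

```python
import math
from typing import Tuple

# Approximate cells per GB of GPU memory, as hard-coded in the commit's case.py.
CELLS_PER_GB = 8_000_000 / 16.0

def nxyz_from_ncells(ncells: float) -> Tuple[int, int, int]:
    # The case uses a 2s x s x s grid, so solve 2*s**3 ~= ncells for s.
    s = math.floor((ncells / 2.0) ** (1 / 3))
    return 2 * s, s, s

def grid(scaling: str, nranks: int, memory_gb: int) -> Tuple[int, int, int]:
    if scaling == "weak":
        # --memory is per rank: the problem grows linearly with rank count.
        return nxyz_from_ncells(CELLS_PER_GB * nranks * memory_gb)
    # Strong scaling: --memory is the fixed global budget, so the grid
    # is independent of how many ranks share it.
    return nxyz_from_ncells(CELLS_PER_GB * memory_gb)

print(grid("weak", 16, 4))    # grows with the number of ranks
print(grid("strong", 16, 4))  # constant regardless of rank count
```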

examples/scaling/build.sh

Lines changed: 4 additions & 0 deletions

```diff
@@ -0,0 +1,4 @@
+#!/bin/bash
+
+./mfc.sh build -t pre_process simulation --case-optimization -i examples/scaling/case.py \
+    -j 8 --gpu --mpi --no-debug -- -s strong -m 512
```

examples/3D_weak_scaling/case.py renamed to examples/scaling/case.py

Lines changed: 39 additions & 23 deletions

```diff
@@ -1,28 +1,45 @@
 #!/usr/bin/env python3
 
-# Case file contributed by Anand Radhakrishnan and modified by Henry Le Berre
-# for integration as a weak scaling benchmark for MFC.
-
-import json, math, argparse
+import sys, json, math, typing, argparse
 
 parser = argparse.ArgumentParser(
-    prog="3D_weak_scaling",
-    description="This MFC case was created for the purposes of weak scaling.",
+    prog="scaling",
+    description="Weak- and strong-scaling benchmark case.",
     formatter_class=argparse.ArgumentDefaultsHelpFormatter)
 
-parser.add_argument("dict", type=str, metavar="DICT", help=argparse.SUPPRESS)
-parser.add_argument("gbpp", type=int, metavar="MEM", default=16, help="Adjusts the problem size per rank to fit into [MEM] GB of GPU memory per GPU.")
+parser.add_argument("dict", type=str, metavar="DICT")
+parser.add_argument("-s", "--scaling", type=str, metavar="SCALING", choices=["weak", "strong"], help="Whether weak- or strong-scaling is being exercised.")
+parser.add_argument("-m", "--memory", type=int, metavar="MEMORY", help="Weak scaling: memory per rank in GB. Strong scaling: global memory in GB. Used to determine cell count.")
+parser.add_argument("-f", "--fidelity", type=str, metavar="FIDELITY", choices=["ideal", "exact"], default="ideal")
+parser.add_argument("--rdma_mpi", type=str, metavar="RDMA_MPI", choices=["T", "F"], default="F")
+parser.add_argument("--n-steps", type=int, metavar="N", default=None)
+
+args = parser.parse_args()
+
+if args.scaling is None:
+    parser.print_help()
+    sys.exit(1)
+
+DICT = json.loads(args.dict)
 
-ARGS = vars(parser.parse_args())
-DICT = json.loads(ARGS["dict"])
+# \approx the number of cells per GB of memory. The exact value is not important.
+cpg = 8000000 / 16.0
+# Number of ranks.
+nranks = DICT["nodes"] * DICT["tasks_per_node"]
 
-ppg = 8000000 / 16.0
-procs = DICT["nodes"] * DICT["tasks_per_node"]
-ncells = math.floor(ppg * procs * ARGS["gbpp"])
-s = math.floor((ncells / 2.0) ** (1/3))
-Nx, Ny, Nz = 2*s, s, s
+def nxyz_from_ncells(ncells: float) -> typing.Tuple[int, int, int]:
+    s = math.floor((ncells / 2.0) ** (1/3))
+    return 2*s, s, s
 
-# athmospheric pressure - Pa (used as reference value)
+if args.scaling == "weak":
+    if args.fidelity == "ideal":
+        raise RuntimeError("ask ben")
+    else:
+        Nx, Ny, Nz = nxyz_from_ncells(cpg * nranks * args.memory)
+else:
+    Nx, Ny, Nz = nxyz_from_ncells(cpg * args.memory)
+
+# Atmospheric pressure - Pa (used as reference value)
 patm = 101325
@@ -162,7 +179,8 @@
 AS = int( NtA // SF + 1 )
 
 # Nt = total number of steps. Note that Nt >= NtA (so at least tendA is completely simulated)
-Nt = AS * SF
+Nt = args.n_steps or (AS * SF)
+SF = min( SF, Nt )
 
 # total simulation time - s. Note that tend >= tendA
 tend = Nt * dt
@@ -171,6 +189,7 @@
 print(json.dumps({
     # Logistics ================================================
     'run_time_info'  : 'T',
+    'rdma_mpi'       : args.rdma_mpi,
     # ==========================================================
 
     # Computational Domain Parameters ==========================
@@ -186,8 +205,8 @@
     'cyl_coord'      : 'F',
     'dt'             : dt,
     't_step_start'   : 0,
-    't_step_stop'    : int(5000*16.0/ARGS["gbpp"]),
-    't_step_save'    : int(1000*16.0/ARGS["gbpp"]),
+    't_step_stop'    : Nt,
+    't_step_save'    : SF,
     # ==========================================================
 
     # Simulation Algorithm Parameters ==========================
@@ -201,7 +220,7 @@
     'time_stepper'   : 3,
     'weno_order'     : 3,
     'weno_eps'       : 1.0E-16,
-    'weno_Re_flux'   : 'F',
+    'weno_Re_flux'   : 'F',
     'weno_avg'       : 'F',
     'mapped_weno'    : 'T',
     'riemann_solver' : 2,
@@ -283,6 +302,3 @@
     'fluid_pp(2)%pi_inf' : gama*pia/(gama-1),
     # ==========================================================
 }))
-
-# ==============================================================================
-
```
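The `--n-steps` override above interacts with the save frequency `SF`; a small sketch of that logic, with the function name being illustrative:

```python
def step_counts(AS, SF, n_steps=None):
    # Mirrors the diff: --n-steps, when given, replaces the computed total
    # step count, and the save frequency is then clamped so a run that was
    # shortened below SF still saves at least once.
    Nt = n_steps or (AS * SF)
    SF = min(SF, Nt)
    return Nt, SF

print(step_counts(5, 1000))      # default: Nt = AS * SF
print(step_counts(5, 1000, 10))  # override: SF clamped down to Nt
```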

examples/scaling/export.py

Lines changed: 90 additions & 0 deletions

```diff
@@ -0,0 +1,90 @@
+import re, os, csv, glob, statistics
+
+from dataclasses import dataclass, fields
+
+CDIR = os.path.abspath(os.path.join("examples", "scaling"))
+LDIR = os.path.join(CDIR, "logs")
+
+def get_num(s: str) -> float:
+    try:
+        return float(re.findall(r"[0-9]+\.[0-9]+(?:E[-+][0-9]+)?", s, re.MULTILINE)[0])
+    except IndexError:
+        return None
+
+def get_nums(arr):
+    return {get_num(_) for _ in arr if get_num(_)}
+
+@dataclass(frozen=True, order=True)
+class Configuration:
+    nodes:    int
+    mem:      int
+    rdma_mpi: bool
+
+@dataclass
+class Result:
+    ts_avg:  float
+    mpi_avg: float
+    init_t:  float
+    sim_t:   float
+
+runs = {}
+
+for logpath in glob.glob(os.path.join(LDIR, "run-*-sim*")):
+    logdata = open(logpath, "r").read()
+
+    tss  = get_nums(re.findall(r'^ TS .+',  logdata, re.MULTILINE))
+    mpis = get_nums(re.findall(r'^ MPI .+', logdata, re.MULTILINE))
+    try:
+        perf = get_num(re.findall(r"^ Performance: .+", logdata, re.MULTILINE)[0])
+    except IndexError:
+        perf = 'N/A'
+
+    if len(tss)  == 0: tss  = [-1.0]
+    if len(mpis) == 0: mpis = [-1.0]
+
+    pathels = os.path.relpath(logpath, LDIR).split('-')
+
+    runs[Configuration(
+        nodes=int(pathels[1]),
+        mem=int(pathels[2]),
+        rdma_mpi=pathels[3] == 'T'
+    )] = Result(
+        ts_avg=statistics.mean(tss),
+        mpi_avg=statistics.mean(mpis),
+        init_t=get_num(re.findall(r"Init took .+",     logdata, re.MULTILINE)[0]),
+        sim_t=get_num(re.findall(r"sim_duration .+",   logdata, re.MULTILINE)[0]),
+    )
+
+with open(os.path.join(CDIR, "export.csv"), "w") as f:
+    writer = csv.writer(f, delimiter=',')
+    writer.writerow([
+        _.name for _ in fields(Configuration) + fields(Result)
+    ])
+
+    for cfg in sorted(runs.keys()):
+        writer.writerow(
+            [ getattr(cfg, _.name)       for _ in fields(Configuration) ] +
+            [ getattr(runs[cfg], _.name) for _ in fields(Result) ]
+        )
+
+for rdma_mpi in (False, True):
+    with open(
+        os.path.join(CDIR, f"strong_scaling{'-rdma_mpi' if rdma_mpi else ''}.csv"),
+        "w"
+    ) as f:
+        writer = csv.writer(f, delimiter=',')
+
+        for nodes in sorted({
+            _.nodes for _ in runs.keys() if _.rdma_mpi == rdma_mpi
+        }):
+            row = (nodes*8,)
+            for mem in sorted({
+                _.mem for _ in runs.keys() if _.nodes == nodes and _.rdma_mpi == rdma_mpi
+            }, reverse=True):
+                ref = runs[Configuration(nodes=sorted({
+                    _.nodes for _ in runs.keys() if _.rdma_mpi == rdma_mpi
+                })[0], mem=mem, rdma_mpi=rdma_mpi)]
+                run = runs[Configuration(nodes=nodes, mem=mem, rdma_mpi=rdma_mpi)]
+                row = (*row, run.sim_t, ref.sim_t/nodes)
+
+            writer.writerow(row)
```
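The log parsing in export.py hinges on a regex for decimal and scientific-notation floats; a standalone sketch of that extraction, with made-up sample log lines:

```python
import re

# Same pattern as export.py's get_num: digits, a decimal point, digits,
# optionally followed by a scientific-notation exponent such as E-03.
FLOAT_RE = r"[0-9]+\.[0-9]+(?:E[-+][0-9]+)?"

def first_float(s: str):
    # Return the first float-looking token in s, or None if there is none.
    matches = re.findall(FLOAT_RE, s)
    return float(matches[0]) if matches else None

print(first_float(" TS       1.234E-03"))  # scientific notation
print(first_float("Init took 12.5 s"))     # plain decimal
print(first_float("no numbers here"))      # no match
```

Note the pattern requires digits on both sides of the decimal point, so bare integers in a log line are deliberately ignored.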
