Commit fbdaecf

Create automated benchmark for OLCF Frontier (#998)

1 parent 1bf4e9a
12 files changed: +795 -248 lines changed

examples/scaling/FRONTIER_BENCH.md

Lines changed: 92 additions & 0 deletions

# Description

The scripts and case file in this directory are set up to benchmark strong-
and weak-scaling performance, as well as single-device absolute performance, on
OLCF Frontier. The case file is for a three-dimensional, two-fluid liquid-gas
problem without viscosity or surface tension. The scripts contained here have
been tested with the default node counts and problem sizes set in the scripts.
The reference data in `reference.dat` also uses the default node counts and
problem sizes and will need to be regenerated if either changes. The benchmarks
can be run with the following steps:

## Getting the code

The code is hosted on GitHub and can be cloned with the following command:

```bash
git clone git@github.com:MFlowCode/MFC.git; cd MFC; chmod u+x examples/scaling/*.sh;
```

The above command clones the repository, changes directory into the repository
root, and makes the benchmark scripts executable.

## Running the benchmarks

### Step 1: Building

The code for the benchmarks is built with the following command:

```
./examples/scaling/build.sh
```

### Step 2: Running

The benchmarks can be run in their default configuration with the following:

```
./examples/scaling/submit_all.sh --account <account_name>
```

By default this will submit the following jobs for benchmarking:

| Job               | Nodes | Description                                                         |
| ----------------- | ----- | ------------------------------------------------------------------- |
| `MFC-W-16-64`     | 16    | Weak scaling calculation with a ~64GB problem per GCD on 16 nodes   |
| `MFC-W-128-64`    | 128   | Weak scaling calculation with a ~64GB problem per GCD on 128 nodes  |
| `MFC-W-1024-64`   | 1024  | Weak scaling calculation with a ~64GB problem per GCD on 1024 nodes |
| `MFC-W-8192-64`   | 8192  | Weak scaling calculation with a ~64GB problem per GCD on 8192 nodes |
| `MFC-S-8-4096`    | 8     | Strong scaling calculation with a ~4096GB problem on 8 nodes        |
| `MFC-S-64-4096`   | 64    | Strong scaling calculation with a ~4096GB problem on 64 nodes       |
| `MFC-S-512-4096`  | 512   | Strong scaling calculation with a ~4096GB problem on 512 nodes      |
| `MFC-S-4096-4096` | 4096  | Strong scaling calculation with a ~4096GB problem on 4096 nodes     |
| `MFC-G-8`         | 1     | Single-device grind time calculation with ~8GB per GCD              |
| `MFC-G-16`        | 1     | Single-device grind time calculation with ~16GB per GCD             |
| `MFC-G-32`        | 1     | Single-device grind time calculation with ~32GB per GCD             |
| `MFC-G-64`        | 1     | Single-device grind time calculation with ~64GB per GCD             |

Strong- and weak-scaling cases run `pre_process` once and then run `simulation`
with and without GPU-aware MPI in a single job. Individual benchmarks can be run
by calling the `submit_[strong,weak,grind].sh` scripts directly, or by modifying
the `submit_all.sh` script to fit your needs.
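As a sanity check on the job sizes in the table above, the problem sizes can be reproduced with simple arithmetic. This is an illustrative sketch only, assuming Frontier's 8 GCDs per node (4 MI250X GPUs, each exposing 2 GCDs); the helper names are hypothetical and not part of the MFC scripts.

```python
# Illustrative sizing arithmetic for the benchmark jobs (not part of MFC).
GCDS_PER_NODE = 8  # Frontier: 4 MI250X GPUs per node, 2 GCDs per GPU

def weak_total_gb(nodes, gb_per_gcd=64):
    """Total problem size for a weak-scaling job: grows with node count."""
    return nodes * GCDS_PER_NODE * gb_per_gcd

def strong_gb_per_gcd(nodes, total_gb=4096):
    """Per-GCD share for a strong-scaling job: total size stays fixed."""
    return total_gb / (nodes * GCDS_PER_NODE)

print(weak_total_gb(16))       # MFC-W-16-64: total GB across 16 nodes
print(strong_gb_per_gcd(8))    # MFC-S-8-4096: GB per GCD on 8 nodes
```

So the smallest weak-scaling job already spans a multi-terabyte problem, while the largest strong-scaling job leaves only a fraction of a gigabyte per GCD.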
#### Modifying the benchmarks

The submitted jobs can be modified by appending options to the `submit_all.sh`
script. For example, appending

```
--nodes "1,2,4,8"
```

to the `submit_strong.sh` and `submit_weak.sh` scripts will run the strong- and
weak-scaling benchmarks on 1, 2, 4, and 8 nodes. Appending

```
--mem "x,y"
```

will modify the approximate problem size in terms of GB of memory
(see the `submit_[strong,weak,grind].sh` scripts for details on what this number
refers to for the different types of tests).
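The submit scripts themselves are bash, but the comma-separated option format above can be illustrated with a small, hypothetical parser sketch; `parse_counts` is not part of the MFC tooling.

```python
# Hypothetical sketch of parsing a comma-separated option value such as
# --nodes "1,2,4,8"; the actual bash scripts may handle this differently.
def parse_counts(arg):
    """Turn a value like "1,2,4,8" into a list of integers."""
    return [int(tok) for tok in arg.split(",") if tok.strip()]

print(parse_counts("1,2,4,8"))  # [1, 2, 4, 8]
```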
### Step 3: Post-processing

The log files can be post-processed into a more human-readable format with

```
python3 examples/scaling/analyze.py
```

This Python script prints a table of results to the command line, with a
comparison against the reference data in `reference.dat`. The `rel_perf` column
compares the raw run times of the current results to the reference data:
relative performance numbers smaller than 1.0 indicate a speedup, and numbers
larger than 1.0 indicate a slowdown relative to the reference data. The selected
problem sizes are intended to be comparable to the tiny, small, medium, and
large labels used by the SpecHPC benchmark.
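The `speedup`, `efficiency`, and `rel_perf` columns follow the formulas used in `analyze.py`. A minimal sketch of those formulas with made-up timings (not real Frontier measurements):

```python
# Metric formulas as used in examples/scaling/analyze.py; the timings below
# are fabricated for illustration only.
nodes = [8, 64, 512]            # node counts in a strong-scaling sweep
times = [80.0, 10.5, 1.5]       # hypothetical "Time Avg" values, smallest first
ref_times = [80.0, 10.0, 1.4]   # hypothetical reference timings (reference.dat)

base_time = times[0]  # timing on the smallest node count

speedup = [base_time / t for t in times]
efficiency = [base_time / ((n / nodes[0]) * t) for n, t in zip(nodes, times)]
rel_perf = [t / r for t, r in zip(times, ref_times)]  # < 1.0: faster than reference

print([round(e, 3) for e in efficiency])  # [1.0, 0.952, 0.833]
print([round(r, 3) for r in rel_perf])    # [1.0, 1.05, 1.071]
```

With perfect strong scaling, `efficiency` would stay at 1.0; the gradual drop reflects the communication overhead of spreading a fixed problem over more nodes.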
## Common errors

The only common failure mode identified during testing was the "text file busy"
error, which causes job failures. These errors are intermittent and are usually
resolved by resubmitting the test.
examples/scaling/README.md

Lines changed: 3 additions & 4 deletions

```diff
-# Strong- & Weak-scaling
+# Scaling and Performance test

 The scaling case can exercise both weak- and strong-scaling. It
 adjusts itself depending on the number of requested ranks.

-This directory also contains a collection of scripts used to test strong-scaling
-on OLCF Frontier. They required modifying MFC to collect some metrics but are
-meant to serve as a reference to users wishing to run similar experiments.
+This directory also contains a collection of scripts used to test strong and weak
+scaling on OLCF Frontier.

 ## Weak Scaling
```
examples/scaling/analyze.py

Lines changed: 177 additions & 0 deletions

```python
import os, re
import pandas as pd
from io import StringIO


def parse_time_avg(path):
    last_val = None
    pattern = re.compile(r"Time Avg =\s*([0-9.E+-]+)")
    with open(path) as f:
        for line in f:
            match = pattern.search(line)
            if match:
                last_val = float(match.group(1))
    return last_val


def parse_grind_time(path):
    last_val = None
    pattern = re.compile(r"Performance: \s*([0-9.E+-]+)")
    with open(path) as f:
        for line in f:
            match = pattern.search(line)
            if match:
                last_val = float(match.group(1))
    return last_val


def parse_reference_file(filename):
    with open(filename) as f:
        content = f.read()

    records = []
    blocks = re.split(r"\n(?=Weak|Strong|Grind)", content.strip())

    for block in blocks:
        lines = block.strip().splitlines()
        header = lines[0].strip()
        body = "\n".join(lines[1:])

        df = pd.read_csv(StringIO(body), delim_whitespace=True)

        if header.startswith("Weak Scaling"):
            # Parse metadata from header
            mem_match = re.search(r"Memory: ~(\d+)GB", header)
            rdma_match = re.search(r"RDMA: (\w)", header)
            memory = int(mem_match.group(1)) if mem_match else None
            rdma = rdma_match.group(1) if rdma_match else None

            for _, row in df.iterrows():
                records.append({"scaling": "weak", "nodes": int(row["nodes"]), "memory": memory, "rdma": rdma, "phase": "sim", "time_avg": row["time_avg"], "efficiency": row["efficiency"]})

        elif header.startswith("Strong Scaling"):
            mem_match = re.search(r"Memory: ~(\d+)GB", header)
            rdma_match = re.search(r"RDMA: (\w)", header)
            memory = int(mem_match.group(1)) if mem_match else None
            rdma = rdma_match.group(1) if rdma_match else None

            for _, row in df.iterrows():
                records.append(
                    {
                        "scaling": "strong",
                        "nodes": int(row["nodes"]),
                        "memory": memory,
                        "rdma": rdma,
                        "phase": "sim",
                        "time_avg": row["time_avg"],
                        "speedup": row["speedup"],
                        "efficiency": row["efficiency"],
                    }
                )

        elif header.startswith("Grind Time"):
            for _, row in df.iterrows():
                records.append({"scaling": "grind", "memory": int(row["memory"]), "grind_time": row["grind_time"]})

    return pd.DataFrame(records)


# Get log files and filter for simulation logs
files = os.listdir("examples/scaling/logs/")
files = [f for f in files if "sim" in f]

records = []
for fname in files:
    # Remove extension
    parts = fname.replace(".out", "").split("-")
    scaling, nodes, memory, rdma, phase = parts
    records.append({"scaling": scaling, "nodes": int(nodes), "memory": int(memory), "rdma": rdma, "phase": phase, "file": fname})

df = pd.DataFrame(records)

ref_data = parse_reference_file("examples/scaling/reference.dat")

print()

weak_df = df[df["scaling"] == "weak"]
strong_df = df[df["scaling"] == "strong"]
grind_df = df[df["scaling"] == "grind"]

weak_ref_df = ref_data[ref_data["scaling"] == "weak"]
strong_ref_df = ref_data[ref_data["scaling"] == "strong"]
grind_ref_df = ref_data[ref_data["scaling"] == "grind"]

weak_scaling_mem = weak_df["memory"].unique()
weak_scaling_rdma = weak_df["rdma"].unique()

for mem in weak_scaling_mem:
    for rdma in weak_scaling_rdma:
        subset = weak_df[(weak_df["memory"] == mem) & (weak_df["rdma"] == rdma)]
        subset = subset.sort_values(by="nodes")
        ref = weak_ref_df[(weak_ref_df["memory"] == mem) & (weak_ref_df["rdma"] == rdma) & (weak_ref_df["nodes"].isin(subset["nodes"]))]
        ref = ref.sort_values(by="nodes")

        times = []
        for _, row in subset.iterrows():
            time_avg = parse_time_avg(os.path.join("examples/scaling/logs", row["file"]))
            times.append(time_avg)

        subset = subset.copy()
        ref = ref.copy()
        subset["time_avg"] = times
        base_time = subset.iloc[0]["time_avg"]

        subset["efficiency"] = base_time / subset["time_avg"]
        subset["rel_perf"] = subset["time_avg"] / ref["time_avg"].values
        print(f"Weak Scaling - Memory: ~{mem}GB, RDMA: {rdma}")
        print(subset[["nodes", "time_avg", "efficiency", "rel_perf"]].to_string(index=False))
        print()

strong_scaling_mem = strong_df["memory"].unique()
strong_scaling_rdma = strong_df["rdma"].unique()

for mem in strong_scaling_mem:
    for rdma in strong_scaling_rdma:
        subset = strong_df[(strong_df["memory"] == mem) & (strong_df["rdma"] == rdma)]
        subset = subset.sort_values(by="nodes")

        ref = strong_ref_df[(strong_ref_df["memory"] == mem) & (strong_ref_df["rdma"] == rdma) & (strong_ref_df["nodes"].isin(subset["nodes"]))]
        ref = ref.sort_values(by="nodes")

        times = []
        for _, row in subset.iterrows():
            time_avg = parse_time_avg(os.path.join("examples/scaling/logs", row["file"]))
            times.append(time_avg)

        subset = subset.copy()
        ref = ref.copy()
        subset["time_avg"] = times
        base_time = subset.iloc[0]["time_avg"]

        subset["speedup"] = base_time / subset["time_avg"]
        subset["efficiency"] = base_time / ((subset["nodes"] / subset.iloc[0]["nodes"]) * subset["time_avg"])
        subset["rel_perf"] = subset["time_avg"] / ref["time_avg"].values
        print(f"Strong Scaling - Memory: ~{mem}GB, RDMA: {rdma}")
        print(subset[["nodes", "time_avg", "speedup", "efficiency", "rel_perf"]].to_string(index=False))
        print()

if not grind_df.empty:
    grind_mem = grind_df["memory"].unique()
    subset = grind_df.sort_values(by="memory")
    ref = grind_ref_df[(grind_ref_df["memory"].isin(subset["memory"]))]
    ref = ref.sort_values(by="memory")

    times = []
    for _, row in subset.iterrows():
        grind_time = parse_grind_time(os.path.join("examples/scaling/logs", row["file"]))
        times.append(grind_time)

    subset = subset.copy()
    ref = ref.copy()

    subset["grind_time"] = times
    subset["rel_perf"] = subset["grind_time"] / ref["grind_time"].values
    print(f"Grind Time - Single Device")
    print(subset[["memory", "grind_time", "rel_perf"]].to_string(index=False))

print()
```
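As a standalone check of the `Time Avg` parsing logic above, the same regular expression can be exercised on a fabricated log snippet (the exact MFC log format may differ):

```python
import re

# Same pattern analyze.py uses; the log text below is fabricated for the test.
pattern = re.compile(r"Time Avg =\s*([0-9.E+-]+)")

log = """\
step 100 ... Time Avg = 1.2E-01
step 200 ... Time Avg = 1.1E-01
"""

last_val = None
for line in log.splitlines():
    match = pattern.search(line)
    if match:
        # Keep the last occurrence, as parse_time_avg does
        last_val = float(match.group(1))

print(last_val)  # 0.11
```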

examples/scaling/build.sh

(mode change: 100644 -> 100755)

Lines changed: 3 additions & 1 deletion

```bash
#!/bin/bash

. ./mfc.sh load -c f -m g

./mfc.sh build -t pre_process simulation --case-optimization -i examples/scaling/case.py \
    -j 8 --gpu --mpi --no-debug -- -s strong -m 512
```
