
Commit 371c51a

Merge pull request #256 from wilfonba/io

File Per Process IO, performance summary in docs, new example case.

2 parents: 82af415 + 61688cd

27 files changed: +2306 −220 lines

CMakeLists.txt

Lines changed: 5 additions & 4 deletions

```diff
@@ -446,15 +446,16 @@ if (MFC_SYSCHECK)
 endif()
 
 if (MFC_DOCUMENTATION)
-    # Files in docs/examples are used to generate docs/documentation/examples.md
-    file(GLOB_RECURSE examples_DOCs CONFIGURE_DEPENDS "${CMAKE_CURRENT_SOURCE_DIR}/docs/examples/*")
+    # Files in examples/ are used to generate docs/documentation/examples.md
+    file(GLOB_RECURSE examples_DOCs CONFIGURE_DEPENDS "${CMAKE_CURRENT_SOURCE_DIR}/examples/*")
 
     add_custom_command(
         OUTPUT  "${CMAKE_CURRENT_SOURCE_DIR}/docs/documentation/examples.md"
-        DEPENDS "${examples_DOCs}"
+        DEPENDS "${CMAKE_CURRENT_SOURCE_DIR}/docs/examples.sh;${examples_DOCs}"
         COMMAND "bash" "${CMAKE_CURRENT_SOURCE_DIR}/docs/examples.sh"
                 "${CMAKE_CURRENT_SOURCE_DIR}"
         COMMENT "Generating examples.md"
+        VERBATIM
     )
 
     file(GLOB common_DOCs CONFIGURE_DEPENDS "${CMAKE_CURRENT_SOURCE_DIR}/docs/*")
@@ -486,7 +487,7 @@ if (MFC_DOCUMENTATION)
                    "${CMAKE_CURRENT_BINARY_DIR}/${target}-Doxyfile" @ONLY)
 
     set(opt_example_dependency "")
-    if (target STREQUAL "documentation")
+    if (${target} STREQUAL documentation)
         set(opt_example_dependency "${CMAKE_CURRENT_SOURCE_DIR}/docs/documentation/examples.md")
     endif()
```
docs/documentation/case.md

Lines changed: 5 additions & 1 deletion

```diff
@@ -344,6 +344,7 @@ Note that `time_stepper` $=$ 3 specifies the total variation diminishing (TVD),
 | `format` | Integer | Output format. [1]: Silo-HDF5; [2] Binary |
 | `precision` | Integer | [1] Single; [2] Double |
 | `parallel_io` | Logical | Parallel I/O |
+| `file_per_process` | Logical | Whether or not to write one I/O file per process |
 | `cons_vars_wrt` | Logical | Write conservative variables |
 | `prim_vars_wrt` | Logical | Write primitive variables |
 | `alpha_rho_wrt(i)` | Logical | Add the partial density of the fluid $i$ to the database \|
@@ -377,7 +378,10 @@ The table lists formatted database output parameters. The parameters define vari
 With parallel I/O, MFC inputs and outputs a single file throughout pre-process, simulation, and post-process, regardless of the number of processors used.
 Parallel I/O enables the use of different number of processors in each of the processes (i.e. simulation data generated using 1000 processors can be post-processed using a single processor).
 
-- `cons_vars_wrt` and `prim_vars_wrt} activate output of conservative and primitive state variables into the database, respectively.
+- `file_per_process` deactivates shared-file MPI-IO and activates file-per-process MPI-IO. The default behaviour is to use a shared file.
+  File per process is useful when running on tens of thousands of ranks.
+
+- `cons_vars_wrt` and `prim_vars_wrt` activate output of conservative and primitive state variables into the database, respectively.
 
 - `[variable's name]_wrt` activates output of the each specified variable into the database.
```
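The new flag slots in next to `parallel_io` in a case file. MFC case files are Python scripts that emit the parameter dictionary as JSON; the fragment below is a minimal sketch assuming the usual `'T'`/`'F'` logical convention, and the neighboring keys shown are illustrative placeholders rather than a complete case.

```python
import json

# Illustrative fragment of an MFC case dictionary.
# Only 'file_per_process' comes from this commit; the surrounding
# keys and values are hypothetical placeholders for context.
case = {
    "format": 1,              # Silo-HDF5 output
    "precision": 2,           # double precision
    "parallel_io": "T",       # parallel I/O enabled
    "file_per_process": "T",  # one I/O file per rank instead of a shared file
}

print(json.dumps(case))
```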

docs/documentation/expectedPerformance.md

Lines changed: 78 additions & 0 deletions

```diff
@@ -0,0 +1,78 @@
+# Performance Results
+
+MFC has been extensively benchmarked on CPUs and GPU devices.
+A summary of these results follows.
+
+## Expected time-steps/hour
+
+The following table outlines expected performance in terms of the number of time steps per hour
+(rounded to the nearest hundred) for various problem sizes (grid cells) and hardware for an inviscid, 6-equation (`'model_eqns' : 3`), 3D simulation.
+CPU results utilize an entire die.
+
+| Hardware | # Ranks | 1M Cells | 4M Cells | 8M Cells | Compiler | Computer |
+| ---: | :----: | :----: | :---: | :---: | :----: | :--- |
+| NVIDIA V100 | 1 | 88.5k | 18.7k | N/A | NVHPC 22.11 | PACE Phoenix |
+| NVIDIA V100 | 1 | 78.8k | 18.8k | N/A | NVHPC 22.11 | OLCF Summit |
+| NVIDIA A100 | 1 | 114.4k | 34.6k | 16.5k | NVHPC 23.5 | Wingtip |
+| AMD MI250X | 1 | 77.5k | 22.3k | 11.2k | CCE 16.0.1 | OLCF Frontier |
+| Intel Xeon Gold 6226 | 12 | 2.5k | 0.7k | 0.4k | GNU 10.3.0 | PACE Phoenix |
+| Apple Silicon M2 | 6 | 2.8k | 0.6k | 0.2k | GNU 13.2.0 | N/A |
+
+If `'model_eqns' : 3` is replaced by `'model_eqns' : 2`, an inviscid 5-equation model is used.
+The following table outlines expected performance in terms of the number of time-steps per hour (rounded to the nearest hundred) for various problem sizes and hardware for an inviscid, 5-equation,
+3D simulation.
+CPU results utilize an entire die.
+
+| Hardware | # Ranks | 1M Cells | 4M Cells | 8M Cells | Compiler | Computer |
+| ---: | :----: | :----: | :---: | :---: | :----: | :--- |
+| NVIDIA V100 | 1 | 113.4k | 26.2k | 13.0k | NVHPC 22.11 | PACE Phoenix |
+| NVIDIA V100 | 1 | 107.7k | 26.3k | 13.1k | NVHPC 22.11 | OLCF Summit |
+| NVIDIA A100 | 1 | 153.5k | 48.0k | 22.5k | NVHPC 23.5 | Wingtip |
+| AMD MI250X | 1 | 104.2k | 31.0k | 14.8k | CCE 16.0.1 | OLCF Frontier |
+| Intel Xeon Gold 6226 | 12 | 5.4k | 1.6k | 0.8k | GNU 10.3.0 | PACE Phoenix |
+| Apple Silicon M2 | 6 | 3.7k | 11.0k | 0.3k | GNU 13.2.0 | N/A |
+
+## Weak scaling
+
+Weak scaling results are obtained by increasing the problem size with the number of processes so that work per process remains constant.
+
+### AMD MI250X GPU
+
+MFC weak scales to (at least) 65,536 AMD MI250X GPUs on OLCF Frontier with 96% efficiency.
+This corresponds to 87% of the entire machine.
+
+<img src="../res/weakScaling/frontier.svg" style="height: 50%; width:50%; border-radius: 10pt"/>
+
+### NVIDIA V100 GPU
+
+MFC weak scales to (at least) 13,824 NVIDIA V100 GPUs on OLCF Summit with 97% efficiency.
+This corresponds to 50% of the entire machine.
+
+<img src="../res/weakScaling/summit.svg" style="height: 50%; width:50%; border-radius: 10pt"/>
+
+### IBM Power9 CPU
+MFC weak scales to 13,824 Power9 CPU cores on OLCF Summit to within 1% of ideal scaling.
+
+<img src="../res/weakScaling/cpuScaling.svg" style="height: 50%; width:50%; border-radius: 10pt"/>
+
+## Strong scaling
+
+Strong scaling results are obtained by keeping the problem size constant and increasing the number of processes so that work per process decreases.
+
+### NVIDIA V100 GPU
+
+For these tests, the base case utilizes 8 GPUs with one MPI process per GPU.
+The performance is analyzed at two different problem sizes of 16M and 64M grid points, with the base case using 2M and 8M grid points per process.
+
+#### 16M Grid Points
+
+<img src="../res/strongScaling/strongScaling16.svg" style="width: 50%; border-radius: 10pt"/>
+
+#### 64M Grid Points
+<img src="../res/strongScaling/strongScaling64.svg" style="width: 50%; border-radius: 10pt"/>
+
+### IBM Power9 CPU
+
+CPU strong scaling tests are done with problem sizes of 16, 32, and 64M grid points, with the base case using 2, 4, and 8M cells per process.
+
+<img src="../res/strongScaling/cpuStrongScaling.svg" style="width: 50%; border-radius: 10pt"/>
```
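The weak- and strong-scaling definitions in the new document reduce to simple ratios, which can be sketched numerically. The timings below are made-up placeholders, not MFC measurements.

```python
def weak_scaling_efficiency(t_base: float, t_scaled: float) -> float:
    """Work per rank is constant, so ideally t_scaled == t_base."""
    return t_base / t_scaled

def strong_scaling_efficiency(t_base: float, n_base: int,
                              t_scaled: float, n_scaled: int) -> float:
    """Total work is constant, so ideal speedup equals n_scaled / n_base."""
    speedup = t_base / t_scaled
    return speedup / (n_scaled / n_base)

# Hypothetical wall times per step (seconds), for illustration only.
print(weak_scaling_efficiency(1.00, 1.04))           # ~0.96 (96% efficient)
print(strong_scaling_efficiency(8.0, 8, 1.25, 64))   # 0.8 (80% efficient)
```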

docs/documentation/readme.md

Lines changed: 1 addition & 0 deletions

```diff
@@ -8,6 +8,7 @@
 - [Example Cases](examples.md)
 - [Running MFC](running.md)
 - [Flow Visualisation](visualisation.md)
+- [Performance Results](expectedPerformance.md)
 - [MFC's Authors](authors.md)
 - [References](references.md)
```
