|
1 | 1 | # Performance Results |
2 | 2 |
|
3 | | -MFC has been extensively benchmarked on both CPUs and GPUs. A summary of these results follow. |
| 3 | +MFC has been extensively benchmarked on CPUs and GPU devices. |
| 4 | +A summary of these results follows. |
4 | 5 |
|
5 | 6 | ## Expected time-steps/hour |
6 | 7 |
|
7 | | -The following table outlines expected performance in terms of number of time-steps per hour |
8 | | -(rounded to the nearest hundred) for various problem sizes and hardware for a inviscid, 6-equation, |
9 | | -3D simulation. CPU results utilize an entire die. |
| 8 | +The following table outlines expected performance in terms of the number of time steps per hour |
| 9 | +(rounded to the nearest hundred) for various problem sizes (grid cells) and hardware for an inviscid, 6-equation (`'model_eqns' : 3`), 3D simulation. |
| 10 | +CPU results utilize an entire die. |
10 | 11 |
|
11 | 12 | | Hardware | # Ranks | 1M Cells | 4M Cells | 8M Cells | Compiler | Computer | |
12 | 13 | | ---: | :----: | :----: | :---: | :---: | :----: | :--- | |
13 | | -| Nvidia V100 | 1 | 88.5k | 18.7k | N/A | NVHPC 22.11 | PACE Phoenix | |
14 | | -| Nvidia V100 | 1 | 78.8k | 18.8k | N/A | NVHPC 22.11 | OLCF Summit | |
15 | | -| Nvidia A100 | 1 | 114.4k | 34.6k | 16.5k | NVHPC 23.5 | Wingtip | |
16 | | -| AMD MI250x | 1 | 77.5k | 22.3k | 11.2k | CCE 16.0.1 | OLCF Frontier | |
| 14 | +| NVIDIA V100 | 1 | 88.5k | 18.7k | N/A | NVHPC 22.11 | PACE Phoenix | |
| 15 | +| NVIDIA V100 | 1 | 78.8k | 18.8k | N/A | NVHPC 22.11 | OLCF Summit | |
| 16 | +| NVIDIA A100 | 1 | 114.4k | 34.6k | 16.5k | NVHPC 23.5 | Wingtip | |
| 17 | +| AMD MI250X | 1 | 77.5k | 22.3k | 11.2k | CCE 16.0.1 | OLCF Frontier | |
17 | 18 | | Intel Xeon Gold 6226 | 12 | 2.5k | 0.7k | 0.4k | GNU 10.3.0 | PACE Phoenix | |
18 | 19 | | Apple Silicon M2 | 6 | 2.8k | 0.6k | 0.2k | GNU 13.2.0 | N/A | |
19 | 20 |
|
20 | 21 | If `'model_eqns' : 3` is replaced by `'model_eqns' : 2`, an inviscid 5-equation model is used. |
21 | | -The following table outlines expected performance in terms of number of time-steps per hour |
22 | | -(rounded to the nearest hundred) for various problem sizes and hardware for a inviscid, 5-equation, |
23 | | -3D simulation. CPU results utilize an entire die. |
| 22 | +The following table outlines expected performance in terms of the number of time steps per hour (rounded to the nearest hundred) for various problem sizes (grid cells) and hardware for an inviscid, 5-equation, |
| 23 | +3D simulation. |
| 24 | +CPU results utilize an entire die. |
24 | 25 |
|
25 | 26 | | Hardware | # Ranks | 1M Cells | 4M Cells | 8M Cells | Compiler | Computer | |
26 | 27 | | ---: | :----: | :----: | :---: | :---: | :----: | :--- | |
27 | | -| Nvidia V100 | 1 | 113.4k | 26.2k | 13.0k | NVHPC 22.11 | PACE Phoenix | |
28 | | -| Nvidia V100 | 1 | 107.7k | 26.3k | 13.1k | NVHPC 22.11 | OLCF Summit | |
29 | | -| Nvidia A100 | 1 | 153.5k | 48.0k | 22.5k | NVHPC 23.5 | Wingtip | |
30 | | -| AMD MI250x | 1 | 104.2k | 31.0k | 14.8k | CCE 16.0.1 | OLCF Frontier | |
| 28 | +| NVIDIA V100 | 1 | 113.4k | 26.2k | 13.0k | NVHPC 22.11 | PACE Phoenix | |
| 29 | +| NVIDIA V100 | 1 | 107.7k | 26.3k | 13.1k | NVHPC 22.11 | OLCF Summit | |
| 30 | +| NVIDIA A100 | 1 | 153.5k | 48.0k | 22.5k | NVHPC 23.5 | Wingtip | |
| 31 | +| AMD MI250X | 1 | 104.2k | 31.0k | 14.8k | CCE 16.0.1 | OLCF Frontier | |
31 | 32 | | Intel Xeon Gold 6226 | 12 | 5.4k | 1.6k | 0.8k | GNU 10.3.0 | PACE Phoenix | |
32 | 33 | | Apple Silicon M2 | 6 | 3.7k | 11.0k | 0.3k | GNU 13.2.0 | N/A | |
33 | 34 |
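The time-steps/hour figures in the tables above can be turned into a wall-clock estimate for a planned run. The following is a minimal sketch, not part of MFC; the function names are hypothetical, and the only table value used is the NVIDIA A100, 1M-cell, 6-equation entry (114.4k time steps per hour).

```python
# Hypothetical helpers (not part of MFC) for reading the tables above.

def seconds_per_step(steps_per_hour: float) -> float:
    """Average wall-clock seconds spent on one time step."""
    if steps_per_hour <= 0:
        raise ValueError("steps_per_hour must be positive")
    return 3600.0 / steps_per_hour


def hours_for_run(total_steps: int, steps_per_hour: float) -> float:
    """Estimated wall-clock hours to advance total_steps time steps."""
    if steps_per_hour <= 0:
        raise ValueError("steps_per_hour must be positive")
    return total_steps / steps_per_hour


def ns_per_cell_per_step(steps_per_hour: float, n_cells: int) -> float:
    """Wall-clock nanoseconds per grid cell per time step."""
    return seconds_per_step(steps_per_hour) / n_cells * 1.0e9


if __name__ == "__main__":
    a100_1m = 114_400.0  # NVIDIA A100, 1M cells, 6-equation model (from the table)
    print(f"{seconds_per_step(a100_1m):.4f} s per step")
    print(f"{hours_for_run(1_000_000, a100_1m):.2f} h for 1M steps")
    print(f"{ns_per_cell_per_step(a100_1m, 1_000_000):.2f} ns per cell per step")
```

Because the table values are rounded to the nearest hundred, estimates derived this way are approximate.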
|
34 | 35 | ## Weak scaling |
35 | 36 |
|
36 | | -Strong scaling results are obtained by increasing the problem size with the number of processes |
37 | | -so that work per process remains constant. |
| 37 | +Weak scaling results are obtained by increasing the problem size with the number of processes so that work per process remains constant. |
38 | 38 |
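Weak scaling efficiency is commonly reported as the ratio of the base-case run time to the run time at a larger process count, with work per process held constant. The sketch below assumes that definition; the page does not state which formula produced the percentages that follow, and the timings in the usage example are made-up placeholders, not MFC measurements.

```python
# Minimal sketch of a common weak scaling efficiency definition.
# Assumption: efficiency = t_base / t_scaled at constant work per process.

def weak_scaling_efficiency(t_base: float, t_scaled: float) -> float:
    """Return weak scaling efficiency as a fraction (1.0 is ideal).

    t_base   -- wall time of the base case (smallest process count)
    t_scaled -- wall time at the larger process count, same work per process
    """
    if t_base <= 0 or t_scaled <= 0:
        raise ValueError("timings must be positive")
    return t_base / t_scaled


if __name__ == "__main__":
    # Placeholder timings in seconds, for illustration only.
    eff = weak_scaling_efficiency(t_base=9.6, t_scaled=10.0)
    print(f"weak scaling efficiency: {eff:.0%}")
```

Under this definition, an efficiency of 96% means the run at the larger process count takes roughly 4% longer per time step than the base case.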
|
39 | 39 | ### AMD MI250X GPU |
40 | | -MFC weask scales to 65,536 AMD MI250X GPUs on OLCF Frontier with 96% efficiency. This corresponds to 87% of the entire machine. |
| 40 | + |
| 41 | +MFC weak scales to (at least) 65,536 AMD MI250X GPUs on OLCF Frontier with 96% efficiency. |
| 42 | +This corresponds to 87% of the entire machine. |
41 | 43 |
|
42 | 44 | <img src="../res/weakScaling/frontier.svg" style="height: 50%; width:50%; border-radius: 10pt"/> |
43 | 45 |
|
44 | | -### Nvidia V100 GPU |
45 | | -MFC weak scales to 13,824 V100 Nvidia V100 GPUs on OLCF Summit with 97% efficiency. This corresponds to 50% of the entire machine. |
| 46 | +### NVIDIA V100 GPU |
| 47 | + |
| 48 | +MFC weak scales to (at least) 13,824 NVIDIA V100 GPUs on OLCF Summit with 97% efficiency. |
| 49 | +This corresponds to 50% of the entire machine. |
46 | 50 |
|
47 | 51 | <img src="../res/weakScaling/summit.svg" style="height: 50%; width:50%; border-radius: 10pt"/> |
48 | 52 |
|
49 | | -### IMB Power9 CPU |
| 53 | +### IBM Power9 CPU |
50 | 54 | MFC weak scales to 13,824 Power9 CPU cores on OLCF Summit to within 1% of ideal scaling. |
51 | 55 |
|
52 | 56 | <img src="../res/weakScaling/cpuScaling.svg" style="height: 50%; width:50%; border-radius: 10pt"/> |
53 | 57 |
|
54 | 58 | ## Strong scaling |
55 | 59 |
|
56 | | -Strong scaling results are obtained by keeping the problem size constant and increasing |
57 | | -the number of process so that work per process decreases. |
| 60 | +Strong scaling results are obtained by keeping the problem size constant and increasing the number of processes so that work per process decreases. |
58 | 61 |
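For strong scaling, the usual quantities are the speedup relative to the base case and the efficiency relative to ideal (linear) speedup. The sketch below assumes those standard definitions and uses hypothetical function names; the timings in the usage example are placeholders, not MFC measurements. It also shows how grid points per process shrink as processes are added, for example 16M points on 8 processes gives 2M points per process.

```python
# Minimal sketch of standard strong scaling metrics (assumed definitions).

def strong_scaling_speedup(t_base: float, t_scaled: float) -> float:
    """Speedup of the scaled run relative to the base case."""
    if t_base <= 0 or t_scaled <= 0:
        raise ValueError("timings must be positive")
    return t_base / t_scaled


def strong_scaling_efficiency(
    t_base: float, n_base: int, t_scaled: float, n_scaled: int
) -> float:
    """Measured speedup divided by the ideal speedup n_scaled / n_base."""
    if n_base <= 0 or n_scaled <= 0:
        raise ValueError("process counts must be positive")
    ideal = n_scaled / n_base
    return strong_scaling_speedup(t_base, t_scaled) / ideal


def points_per_process(total_points: int, n_procs: int) -> float:
    """Grid points handled by each process for a fixed total problem size."""
    if n_procs <= 0:
        raise ValueError("n_procs must be positive")
    return total_points / n_procs


if __name__ == "__main__":
    # Base case: 8 processes. Timings are placeholders for illustration.
    print(points_per_process(16_000_000, 8))   # work per process in the base case
    print(points_per_process(16_000_000, 64))  # work per process after scaling out
    eff = strong_scaling_efficiency(t_base=80.0, n_base=8, t_scaled=12.5, n_scaled=64)
    print(f"strong scaling efficiency: {eff:.0%}")
```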
|
59 | | -### Nvidia V100 GPU |
| 62 | +### NVIDIA V100 GPU |
60 | 63 |
|
61 | | -For these tests, the base case utilizes 8 GPUs with one MPI process per GPU. The performance |
62 | | -is analyzed at two different problem sizes of 16 and 64M grid points, with the base case using |
63 | | -2 and 8M grid points per process. |
| 64 | +For these tests, the base case utilizes 8 GPUs with one MPI process per GPU. |
| 65 | +The performance is analyzed at two different problem sizes of 16M and 64M grid points, with the base case using 2M and 8M grid points per process. |
64 | 66 |
|
65 | 67 | #### 16M Grid Points |
| 68 | + |
66 | 69 | <img src="../res/strongScaling/strongScaling16.svg" style="width: 50%; border-radius: 10pt"/> |
67 | 70 |
|
68 | 71 | #### 64M Grid Points |
69 | 72 | <img src="../res/strongScaling/strongScaling64.svg" style="width: 50%; border-radius: 10pt"/> |
70 | 73 |
|
71 | | -### IBM Power 9 CPU |
| 74 | +### IBM Power9 CPU |
72 | 75 |
|
73 | | -CPU strong scaling tests are done with problem sizes of 16, 32, and 64M grid points, with the |
74 | | -base case using 2, 4, and 8M cells per process. |
| 76 | +CPU strong scaling tests are done with problem sizes of 16M, 32M, and 64M grid points, with the base case using 2M, 4M, and 8M grid points per process. |
75 | 77 |
|
76 | | -<img src="../res/strongScaling/cpuStrongScaling.svg" style="width: 50%; border-radius: 10pt"/> |
| 78 | +<img src="../res/strongScaling/cpuStrongScaling.svg" style="width: 50%; border-radius: 10pt"/> |