Commit d5e9a06
Alberto Scolari
335 wrong fma perf test (#336)
The fma (fused multiply-add) performance test aims to replicate a simple triad benchmark, computing `z=alpha.*x+y` out-of-place.
The matching ALP/GraphBLAS primitive for that, once upon a time, was `grb::eWiseMulAdd`. Since then, however, that primitive has undergone two changes: 1) it became in-place rather than out-of-place (i.e., it now computes `z+=alpha.*x+y`), and 2) since the release of the nonblocking backend it has been deprecated, as the same can be computed using `grb::eWiseMul` followed by `grb::foldl` or `grb::foldr`, at negligible performance overhead compared to the one-shot `grb::eWiseMulAdd`.
Neither change was ever reflected in the `fma` test itself, however; and, at present, the original out-of-place `fma` computation `z=alpha.*x+y` similarly has no single matching (deprecated or not) ALP/GraphBLAS primitive. The nonblocking backend, however, could implement an fma by first calling `grb::set(z,y)` followed by the in-place `grb::eWiseMul`.
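The difference between the out-of-place and in-place semantics described above can be sketched in plain C++. This is an illustration only and does not use the ALP/GraphBLAS API: `triad_out_of_place`, `mul_add_in_place`, `copy_into`, and `mul_in_place` are hypothetical stand-ins for, respectively, the intended fma, the in-place `z+=alpha.*x+y`, the effect of `grb::set(z,y)`, and an in-place multiply assumed to compute `z+=alpha.*x`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Out-of-place triad: z = alpha .* x + y. Prior contents of z are ignored.
void triad_out_of_place(std::vector<double> &z, double alpha,
                        const std::vector<double> &x,
                        const std::vector<double> &y) {
    for (std::size_t i = 0; i < z.size(); ++i) {
        z[i] = alpha * x[i] + y[i];
    }
}

// Hypothetical stand-in for the in-place multiply-add: z += alpha .* x + y.
// Prior contents of z contribute to the result.
void mul_add_in_place(std::vector<double> &z, double alpha,
                      const std::vector<double> &x,
                      const std::vector<double> &y) {
    for (std::size_t i = 0; i < z.size(); ++i) {
        z[i] += alpha * x[i] + y[i];
    }
}

// Hypothetical stand-in for a vector copy: z = y.
void copy_into(std::vector<double> &z, const std::vector<double> &y) {
    for (std::size_t i = 0; i < z.size(); ++i) {
        z[i] = y[i];
    }
}

// Hypothetical stand-in for an in-place multiply (assumed): z += alpha .* x.
void mul_in_place(std::vector<double> &z, double alpha,
                  const std::vector<double> &x) {
    for (std::size_t i = 0; i < z.size(); ++i) {
        z[i] += alpha * x[i];
    }
}
```

Under these assumptions, `copy_into(z, y)` followed by `mul_in_place(z, alpha, x)` yields the same `z` as `triad_out_of_place(z, alpha, x, y)`, whereas `mul_add_in_place` gives a different result whenever `z` holds nonzero values beforehand.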
Additionally, since the incorporation of the CMake build infra, the compiler-optimised performance tests were erroneously not compiled with performance flags, causing a further apples-to-oranges comparison in this performance test.
The issues were detected by Alberto Scolari, who additionally indicated that adding performance tests to the CI and regularly reviewing those results would have caught this issue earlier (crossref #343, visualisation only; the performance tests now actually trigger regularly for `develop` on our internal CI). He also produced an initial set of fixes and we thank Alberto for both.
This MR fixes these issues regarding the fma performance test by:
1. introducing the fma-blocking and fma-nonblocking shared-memory parallel performance tests instead of a single fma-omp one. Only for fma-nonblocking is performance expected to match that of a compiler-optimised or lambda-generated fma;
2. printing this distinction to screen as part of the test run; and
3. improving test initialisation checks and error detection.
This MR also introduced some changes to the `dot` and `reduce` performance tests, in tandem with changes to the `fma` performance test. Those "joint" changes are:
4. the compiler-optimised shared-memory parallel fma (as well as the reduce and dot) performance tests now avoid critical sections (which is in line with standard HPC practices), and e.g. for reductions (reduce and dot) prefer the OpenMP `reduction`-clause;
5. the compiler-optimised shared-memory parallel fma, reduce, and dot now use the same distribution of work across threads (essentially a static block-wise schedule);
6. both sequential and shared-memory parallel compiler-optimised benchmarks are now given the proper performance flags during compilation;
7. the fma, dot, and reduce performance tests no longer mix `fprintf` with `std::cerr`/`std::cout`, and received other, more minor code style fixes;
8. test output and test summary now distinguish between sequential and shared-memory parallel tests;
9. use the `dense` descriptor where appropriate (potentially enhancing performance of ALP code);
10. if any test auto-tunes the number of inner repetitions, the number is chosen so that a single experiment is expected to take no less than 100 milliseconds (down from 1 second);
11. these tests now employ vectors of 100 million elements (up from 10 million); and, finally,
12. the numbers the new tests produce have been double-checked to be on par with, or to exceed, those published by Yzelman et al. (2020).
1 parent 68715e3, commit d5e9a06
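Items 4 and 5 above can be illustrated with a minimal sketch, assuming OpenMP and plain `std::vector` inputs; this is not the code of the actual performance tests, and the function names `dot_reduction` and `triad_static` are hypothetical. The dot product uses the OpenMP `reduction` clause rather than a critical section, and both kernels use `schedule(static)` without a chunk size, which assigns each thread one contiguous block of iterations.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Dot product using the OpenMP reduction clause: each thread accumulates
// into a private copy of sum, and the copies are combined at the end of
// the loop, so no critical section is needed.
double dot_reduction(const std::vector<double> &x,
                     const std::vector<double> &y) {
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(x.size());
    double sum = 0.0;
    #pragma omp parallel for schedule(static) reduction(+ : sum)
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        sum += x[i] * y[i];
    }
    return sum;
}

// Triad z = alpha .* x + y with the same block-wise static distribution
// of iterations across threads as the dot product above.
void triad_static(std::vector<double> &z, double alpha,
                  const std::vector<double> &x,
                  const std::vector<double> &y) {
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(z.size());
    #pragma omp parallel for schedule(static)
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        z[i] = alpha * x[i] + y[i];
    }
}
```

Compiled without OpenMP support, the pragmas are ignored and both functions run sequentially with the same results; with OpenMP enabled (for example `-fopenmp` on GCC or Clang), the loops are split block-wise across threads.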
7 files changed, +355 -182 lines changed (tests/performance)