Commit d5e9a06

Alberto Scolari authored
335 wrong fma perf test (#336)
The fma (fused multiply-add) performance test aims to replicate a simple triad benchmark, computing `z=alpha.*x+y` out-of-place. The matching ALP/GraphBLAS primitive for that was, once upon a time, `grb::eWiseMulAdd`. Since then, however, that primitive has undergone two changes: 1) it became in-place rather than out-of-place (i.e., it computes `z+=alpha.*x+y`), and 2) since the release of the nonblocking backend it has been deprecated, as the same can be computed using `grb::eWiseMul` followed by `grb::foldl` or `grb::foldr`, at negligible performance overhead compared to the one-shot `grb::eWiseMulAdd`.

Neither change was ever reflected in the `fma` test itself. At present, the original out-of-place `fma` computation `z=alpha.*x+y` similarly has no single matching ALP/GraphBLAS primitive, deprecated or not. The nonblocking backend, however, could implement an fma by first calling `grb::set(z,y)` followed by the in-place `grb::eWiseMul`.

Additionally, since the incorporation of the CMake build infrastructure, the compiler-optimised performance tests were erroneously not compiled with performance flags, causing a further apples-to-oranges comparison in this performance test.

The issues were detected by Alberto Scolari, who additionally indicated that adding performance tests to the CI and regularly reviewing those results would have caught this issue earlier (cross-reference #343, visualisation only; the performance tests now actually trigger regularly for `develop` on our internal CI). He also produced an initial set of fixes, and we thank Alberto for both.

This MR fixes these issues regarding the fma performance test by:

1. introducing the fma-blocking and fma-nonblocking shared-memory parallel performance tests instead of a single fma-omp one. Only for fma-nonblocking is performance expected to match that of a compiler-optimised or lambda-generated fma;
2. printing this distinction to screen as part of the test run; and
3. improving test initialisation checks and error detection.

This MR also introduces some changes to the `dot` and `reduce` performance tests, in tandem with the changes to the `fma` performance test. Those "joint" changes are:

4. the compiler-optimised shared-memory parallel fma, reduce, and dot performance tests now avoid critical sections, in line with standard HPC practice; the reductions (reduce and dot), for example, now prefer the OpenMP `reduction` clause;
5. the compiler-optimised shared-memory parallel fma, reduce, and dot now use the same distribution of work across threads (essentially a static block-wise schedule);
6. both sequential and shared-memory parallel compiler-optimised benchmarks are now given the proper performance flags during compilation;
7. the fma, dot, and reduce performance tests no longer mix `fprintf` with `std::cerr`/`std::cout`, and received other, more minor code style fixes;
8. test output and test summary now distinguish between sequential and shared-memory parallel tests;
9. the `dense` descriptor is used where appropriate (potentially enhancing performance of ALP code);
10. if any test auto-tunes the number of inner repetitions, the number is chosen so that a single experiment is expected to take no less than 100 milliseconds (down from 1 second);
11. these tests now employ vectors of 100 million elements (up from 10 million); and, finally,
12. it was double-checked that the numbers the new tests produce are on par with, or exceed, the performance numbers published by Yzelman et al. (2020).
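
Item 10 describes choosing the number of inner repetitions so that one experiment lasts at least 100 milliseconds. The following is a minimal sketch of one way such a choice could be made. The function name `choose_inner_reps` and its parameters are hypothetical and are not taken from the ALP sources, whose actual auto-tuning logic is not part of this commit's diff.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, for illustration only: given the measured time of a
 * single kernel call (in milliseconds) and a target minimum experiment
 * duration (in milliseconds), returns the smallest repetition count, and at
 * least one, whose expected total time reaches the target.
 */
size_t choose_inner_reps( const double single_run_ms, const double target_ms ) {
	assert( single_run_ms > 0.0 );
	assert( target_ms > 0.0 );
	/* Round down first, then add one repetition if the target is not met. */
	size_t reps = (size_t) ( target_ms / single_run_ms );
	if( (double) reps * single_run_ms < target_ms ) {
		++reps;
	}
	return reps == 0 ? 1 : reps;
}
```

Under these assumptions, a kernel call measured at 30 ms with a 100 ms target would be repeated four times per experiment, while a call that already takes 250 ms would run once.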
1 parent 68715e3 commit d5e9a06

7 files changed: +355 -182 lines changed

tests/performance/CMakeLists.txt

Lines changed: 8 additions & 1 deletion
@@ -26,17 +26,24 @@ set( TEST_CATEGORY "performance" )
 add_library( bench_kernels OBJECT bench_kernels.c bench_kernels.h )
 add_library( bench_kernels_omp OBJECT bench_kernels.c bench_kernels.h )
 target_compile_definitions( bench_kernels_omp PRIVATE BENCH_KERNELS_OPENMP )
+target_link_libraries( bench_kernels PRIVATE test_performance_flags )
+target_link_libraries( bench_kernels_omp PRIVATE test_performance_flags OpenMP::OpenMP_C )
 
 add_grb_executables( fma fma.cpp $<TARGET_OBJECTS:bench_kernels>
 	BACKENDS reference NO_BACKEND_NAME
 	ADDITIONAL_LINK_LIBRARIES "rt"
 )
 
-add_grb_executables( fma-openmp fma.cpp $<TARGET_OBJECTS:bench_kernels_omp>
+add_grb_executables( fma-blocking fma.cpp $<TARGET_OBJECTS:bench_kernels_omp>
 	BACKENDS reference_omp NO_BACKEND_NAME
 	ADDITIONAL_LINK_LIBRARIES OpenMP::OpenMP_CXX "rt"
 )
 
+add_grb_executables( fma-nonblocking fma.cpp $<TARGET_OBJECTS:bench_kernels_omp>
+	BACKENDS nonblocking NO_BACKEND_NAME
+	ADDITIONAL_LINK_LIBRARIES OpenMP::OpenMP_CXX "rt"
+)
+
 add_grb_executables( reduce reduce.cpp $<TARGET_OBJECTS:bench_kernels>
 	BACKENDS reference NO_BACKEND_NAME
 )

tests/performance/bench_kernels.c

Lines changed: 31 additions & 13 deletions
@@ -21,6 +21,8 @@
 
 #ifdef BENCH_KERNELS_OPENMP
 
+bool bench_kernels_parallel() { return true; }
+
 void bench_kernels_axpy(
 	double * restrict a,
 	const double alpha, const double * restrict x,
@@ -30,9 +32,25 @@ void bench_kernels_axpy(
 	assert( a != x );
 	assert( a != y );
 	assert( x != y );
-	#pragma omp parallel for schedule(static,8)
-	for( size_t i = 0; i < n; ++i ) {
-		a[ i ] = alpha * x[ i ] + y[ i ];
+	#pragma omp parallel
+	{
+		const size_t P = omp_get_num_threads();
+		const size_t s = omp_get_thread_num();
+		const size_t chunk = (n % P == 0) ? (n/P) : (n/P) + 1;
+		size_t start = chunk * s;
+		if( start > n - 1 ) {
+			start = n - 1;
+		}
+		size_t end = start + chunk;
+		if( end > n ) {
+			end = n;
+		}
+		assert( start <= end );
+		if( start != end ) {
+			for( size_t i = start; i < end; ++i ) {
+				a[ i ] = alpha * x[ i ] + y[ i ];
+			}
+		}
 	}
 }
 
@@ -45,7 +63,8 @@ void bench_kernels_dot(
 	assert( alpha != xr );
 	assert( alpha != yr );
 	*alpha = xr[ n - 1 ] * yr[ n - 1];
-	#pragma omp parallel
+	double global_alpha = 0;
+	#pragma omp parallel reduction(+:global_alpha)
 	{
 		const size_t P = omp_get_num_threads();
 		const size_t s = omp_get_thread_num();
@@ -64,20 +83,19 @@ void bench_kernels_dot(
 			for( size_t i = start; i < end - 1; ++i ) {
 				local_alpha += xr[ i ] * yr[ i ];
 			}
-			#pragma omp critical
-			{
-				*alpha += local_alpha;
-			}
+			global_alpha += local_alpha;
 		}
 	}
+	*alpha += global_alpha;
 }
 
 void bench_kernels_reduce(
 	double * restrict const alpha, const double * restrict xr, const size_t n
 ) {
 	assert( alpha != xr );
 	*alpha = xr[ n - 1 ];
-	#pragma omp parallel
+	double global_alpha = 0.0;
+	#pragma omp parallel reduction(+:global_alpha)
 	{
 		const size_t P = omp_get_num_threads();
 		const size_t s = omp_get_thread_num();
@@ -96,16 +114,16 @@ void bench_kernels_reduce(
 			for( size_t i = start; i < end - 1; ++i ) {
 				local_alpha += xr[ i ];
 			}
-			#pragma omp critical
-			{
-				*alpha += local_alpha;
-			}
+			global_alpha += local_alpha;
 		}
 	}
+	*alpha += global_alpha;
 }
 
 #else
 
+bool bench_kernels_parallel() { return false; }
+
 void bench_kernels_axpy(
 	double * restrict a,
 	const double alpha, const double * restrict x,
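
The two techniques this file's diff introduces, a static block-wise split of the index range and an OpenMP `reduction` clause in place of a critical section, can be sketched in isolation as follows. This is a simplified illustration and not the committed code: the names `block_range` and `dot_blockwise` are hypothetical, and unlike the committed axpy kernel, which clamps `start` to `n - 1`, this sketch clamps to `n` so that surplus threads receive an empty range.

```c
#include <assert.h>
#include <stddef.h>
#ifdef _OPENMP
 #include <omp.h>
#endif

/*
 * Hypothetical helper: computes the half-open range [*start, *end) owned by
 * thread s out of P when n elements are split into contiguous blocks of
 * ceil(n/P) elements. Threads past the last element get an empty range.
 */
void block_range(
	const size_t n, const size_t P, const size_t s,
	size_t * const start, size_t * const end
) {
	assert( P > 0 );
	assert( s < P );
	const size_t chunk = (n % P == 0) ? (n / P) : (n / P) + 1;
	size_t lo = chunk * s;
	if( lo > n ) { lo = n; }
	size_t hi = lo + chunk;
	if( hi > n ) { hi = n; }
	*start = lo;
	*end = hi;
}

/*
 * Hypothetical dot product: each thread sums its own block into a local
 * variable, and the reduction clause combines the per-thread partial sums.
 * Without OpenMP, the same code runs sequentially as one thread.
 */
double dot_blockwise(
	const double * restrict x, const double * restrict y, const size_t n
) {
	double sum = 0.0;
#ifdef _OPENMP
	#pragma omp parallel reduction(+:sum)
#endif
	{
#ifdef _OPENMP
		const size_t P = (size_t) omp_get_num_threads();
		const size_t s = (size_t) omp_get_thread_num();
#else
		const size_t P = 1;
		const size_t s = 0;
#endif
		size_t start, end;
		block_range( n, P, s, &start, &end );
		double local = 0.0;
		for( size_t i = start; i < end; ++i ) {
			local += x[ i ] * y[ i ];
		}
		sum += local; /* accumulates into this thread's private copy */
	}
	return sum;
}
```

Compiling with an OpenMP flag such as `-fopenmp` (GCC or Clang) enables the parallel path; without it the pragmas are skipped by the `_OPENMP` guards. Because floating-point addition is not associative, the parallel result may differ from the sequential one in the last bits for general inputs.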

tests/performance/bench_kernels.h

Lines changed: 8 additions & 1 deletion
@@ -18,7 +18,7 @@
 
 #include <omp.h>
 #include <assert.h>
-#include <stddef.h> //for size_t
+#include <stddef.h> // for size_t
 
 
 #ifdef __cplusplus
@@ -41,10 +41,14 @@ extern "C" {
 		double * __restrict__ const, const double * __restrict__, const size_t
 	);
 
+	bool bench_kernels_parallel();
+
 }
 
 #else
 
+#include <stdbool.h> // for bool
+
 /**
  * Executes \f$ a = \alpha x + y \f$ for \a a, \a x, and \a y vectors of
  * length \a n.
@@ -89,5 +93,8 @@ void bench_kernels_reduce(
 	double * restrict const alpha, const double * restrict x, const size_t n
 );
 
+/** @returns Whether the kernels defined here are (shared-memory) parallel. */
+bool bench_kernels_parallel();
+
 #endif
 
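
The new `bench_kernels_parallel()` query lets a caller report which variant of the compiler-optimised kernels it was linked against. How `fma.cpp` uses it is not shown in this excerpt, so the following is only a sketch under stated assumptions: the stand-in definition mirrors the two one-line definitions added to `bench_kernels.c`, while `bench_kernels_label` and its strings are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Stand-in for the function added by the commit: returns true only when
 * this file is compiled with BENCH_KERNELS_OPENMP defined.
 */
bool bench_kernels_parallel( void ) {
#ifdef BENCH_KERNELS_OPENMP
	return true;
#else
	return false;
#endif
}

/*
 * Hypothetical helper: maps the query result to a label that test output
 * could print to distinguish sequential from shared-memory parallel runs.
 */
const char * bench_kernels_label( void ) {
	const char * const label = bench_kernels_parallel()
		? "shared-memory parallel"
		: "sequential";
	assert( strlen( label ) > 0 );
	return label;
}
```

Built without `-DBENCH_KERNELS_OPENMP`, the label reads `sequential`; building the same file with that definition switches it, matching the way the CMake targets `bench_kernels` and `bench_kernels_omp` compile one source file twice.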
