
Commit 9bb1350

Add GPU backend synchronization to benchmarks for accurate timing (#24)

1 parent a011ca4 commit 9bb1350

File tree

7 files changed: +155 −41 lines changed

benchmarks/Project.toml

Lines changed: 1 addition & 0 deletions
```diff
@@ -3,5 +3,6 @@ Adapt = "79e6a3ab-5dfb-504d-930d-738a2a938a0e"
 BenchmarkTools = "6e4b80f9-dd63-53aa-95a3-0cdb28fa8baf"
 DeviceSparseArrays = "da3fe0eb-88a8-4d14-ae1a-857c283e9c70"
 JLArrays = "27aeb0d3-9eb9-45fb-866b-73c2ecf80fcb"
+KernelAbstractions = "63c18a36-062a-441e-b654-da1e3ab1ce7c"
 LinearAlgebra = "37e2e46d-f89d-539d-b4ee-838fcccc9c8e"
 SparseArrays = "2f01184e-e22b-5df5-ae63-d93ebab69eaf"
```

benchmarks/README.md

Lines changed: 37 additions & 4 deletions
````diff
@@ -6,8 +6,10 @@ This directory contains benchmark tracking for the DeviceSparseArrays.jl package
 
 - `Project.toml`: Dependencies for running benchmarks
 - `runbenchmarks.jl`: Main script that runs all benchmarks
+- `benchmark_utils.jl`: Utility functions for benchmarking (synchronization helpers)
 - `vector_benchmarks.jl`: Benchmarks for sparse vector operations
 - `matrix_benchmarks.jl`: Benchmarks for sparse matrix operations
+- `conversion_benchmarks.jl`: Benchmarks for format conversion operations
 
 ## Benchmarks Tracked
 
@@ -23,6 +25,11 @@ All matrix operations are benchmarked for CSC, CSR, and COO formats to compare t
 - **Matrix-Vector Multiplication**: `mul!(y, A, x)` for sparse matrix A and dense vectors x, y
 - **Matrix-Matrix Multiplication**: `mul!(C, A, B)` for sparse matrix A and dense matrix B
 - **Three-argument dot**: `dot(x, A, y)` for sparse matrix A and dense vectors x, y
+- **Sparse + Dense Addition**: `A + B` for sparse matrix A and dense matrix B
+
+### Format Conversions
+- **CSC ↔ COO**: Conversions between Compressed Sparse Column and Coordinate formats
+- **CSR ↔ COO**: Conversions between Compressed Sparse Row and Coordinate formats
 
 ## Array Types
 
@@ -88,20 +95,46 @@ To add new benchmarks:
         SUITE[group_name] = BenchmarkGroup()
     end
 
-    SUITE[group_name]["Test Case [$array_type_name]"] =
-        @benchmarkable operation($adapted_data)
+    # IMPORTANT: Wrap operations with synchronization for accurate GPU timing
+    SUITE[group_name]["Test Case [$array_type_name]"] = @benchmarkable begin
+        operation($adapted_data)
+        _synchronize_backend($adapted_data)
+    end
 
     return nothing
 end
 ```
 3. Call your function in `runbenchmarks.jl` for each array type
 4. Test locally with `make benchmark`
 
+## GPU Synchronization
+
+All benchmarks include backend synchronization to ensure accurate timing on GPU backends. GPU operations are often asynchronous, meaning they may return before the computation completes. Without synchronization, benchmarks would underestimate the actual execution time.
+
+The `_synchronize_backend(arr)` helper function:
+- Calls `KernelAbstractions.synchronize(get_backend(arr))` for arrays supporting KernelAbstractions
+- Is a no-op for CPU arrays and arrays without KernelAbstractions support
+- Safely handles any array type, even those without `get_backend` defined
+
+This approach works for:
+- **CPU arrays**: No synchronization needed (no-op)
+- **GPU arrays with KernelAbstractions**: Proper synchronization
+- **Other array types**: Gracefully degrades to no-op
+
+All benchmarks follow the pattern:
+```julia
+@benchmarkable begin
+    my_operation(...)
+    _synchronize_backend($some_array)
+end
+```
+
 ## Notes
 
-- Benchmarks use `BLAS.set_num_threads(1)` to ensure consistent results
-- Default parameters: N=10000, T=Float64, 5% sparsity
+- Benchmarks use `BLAS.set_num_threads(2)` to ensure consistent results
+- Default parameters: N=10000, T=Float64, 1% sparsity
 - Parameters can be customized via keyword arguments
 - Array types are detected automatically (JLArrays is optional)
 - Results are saved in JSON format compatible with github-action-benchmark
 - CUDA benchmarks are not included as GitHub Actions runners don't have GPU support
+- All benchmarks include backend synchronization for accurate GPU timing (see "GPU Synchronization" section)
````
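The asynchrony pitfall described in the "GPU Synchronization" section can be illustrated in plain Julia without a GPU. In this hypothetical sketch, a `Task` stands in for an asynchronous kernel launch and `wait` plays the role of `_synchronize_backend`; neither function below is part of the benchmark suite.

```julia
# Hypothetical illustration: a Task stands in for an async GPU kernel launch,
# and wait() plays the role of KernelAbstractions.synchronize.
function elapsed_without_sync()
    t0 = time_ns()
    task = @async sleep(0.2)          # "launch" returns almost immediately
    dt = (time_ns() - t0) / 1e9       # timer stops before the work finishes
    wait(task)                        # clean up outside the timed region
    return dt
end

function elapsed_with_sync()
    t0 = time_ns()
    task = @async sleep(0.2)
    wait(task)                        # barrier, like _synchronize_backend(arr)
    return (time_ns() - t0) / 1e9
end

println(elapsed_without_sync())       # tiny: measures only the launch cost
println(elapsed_with_sync())          # ≈ 0.2 s: launch plus completion
```

Without the barrier, the "benchmark" reports only launch overhead, which is exactly the underestimate the synchronization wrappers guard against.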

benchmarks/benchmark_utils.jl

Lines changed: 39 additions & 0 deletions
````diff
@@ -0,0 +1,39 @@
+"""
+    _synchronize_backend(arr)
+
+Synchronize the backend associated with array `arr` to ensure all operations
+have completed before benchmarking continues. This is essential for accurate
+GPU timing.
+
+# Implementation
+This function uses multiple dispatch to handle different array types:
+- For arrays with KernelAbstractions backends, it calls `synchronize` on the backend
+- For other array types, it is a no-op (fallback method)
+- New array types can extend this function by adding methods for specific types
+
+# Examples
+```julia
+# GPU array with KernelAbstractions - will synchronize
+gpu_arr = adapt(CuArray, DeviceSparseVector(...))
+_synchronize_backend(gpu_arr)
+
+# CPU array or arrays without KernelAbstractions - no-op
+cpu_arr = DeviceSparseVector(...)
+_synchronize_backend(cpu_arr)
+
+# Extend for custom array types:
+# _synchronize_backend(arr::MyCustomArray) = my_custom_sync(arr)
+```
+"""
+_synchronize_backend(arr) = nothing # Fallback: no-op for arrays without KernelAbstractions
+
+"""
+    _synchronize_backend(arr::AbstractDeviceSparseArray)
+
+Synchronize KernelAbstractions backend for DeviceSparseArray types.
+"""
+function _synchronize_backend(arr::AbstractDeviceSparseArray)
+    backend = KernelAbstractions.get_backend(arr)
+    KernelAbstractions.synchronize(backend)
+    return nothing
+end
````
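The fallback-plus-specific-method design in `benchmark_utils.jl` can be demonstrated standalone. `MockDeviceArray` and `_sync_backend` below are hypothetical stand-ins (returning symbols so the dispatch is observable), not the package's actual helper, which returns `nothing` in both cases.

```julia
# Standalone sketch of the no-op-fallback dispatch pattern.
# MockDeviceArray is a hypothetical stand-in for a GPU-backed array type.
_sync_backend(arr) = :noop            # fallback: safe no-op for any array

struct MockDeviceArray
    data::Vector{Float64}
end

# Specific method, analogous to the AbstractDeviceSparseArray method above
_sync_backend(arr::MockDeviceArray) = :synchronized

@assert _sync_backend(rand(3)) === :noop                       # plain Array hits the fallback
@assert _sync_backend(MockDeviceArray(rand(3))) === :synchronized
```

Because the fallback accepts any argument, calling the helper is always safe, and new backends opt in simply by adding a method for their own type.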

benchmarks/conversion_benchmarks.jl

Lines changed: 16 additions & 8 deletions
```diff
@@ -33,20 +33,28 @@ function benchmark_conversions!(
     dsm_coo = adapt(array_constructor, sm_coo)
 
     # CSC → COO conversion
-    SUITE["Format Conversions"][array_type_name]["CSC → COO"] =
-        @benchmarkable DeviceSparseMatrixCOO($dsm_csc)
+    SUITE["Format Conversions"][array_type_name]["CSC → COO"] = @benchmarkable begin
+        DeviceSparseMatrixCOO($dsm_csc)
+        _synchronize_backend($dsm_csc)
+    end
 
     # COO → CSC conversion
-    SUITE["Format Conversions"][array_type_name]["COO → CSC"] =
-        @benchmarkable DeviceSparseMatrixCSC($dsm_coo)
+    SUITE["Format Conversions"][array_type_name]["COO → CSC"] = @benchmarkable begin
+        DeviceSparseMatrixCSC($dsm_coo)
+        _synchronize_backend($dsm_coo)
+    end
 
     # CSR → COO conversion
-    SUITE["Format Conversions"][array_type_name]["CSR → COO"] =
-        @benchmarkable DeviceSparseMatrixCOO($dsm_csr)
+    SUITE["Format Conversions"][array_type_name]["CSR → COO"] = @benchmarkable begin
+        DeviceSparseMatrixCOO($dsm_csr)
+        _synchronize_backend($dsm_csr)
+    end
 
     # COO → CSR conversion
-    SUITE["Format Conversions"][array_type_name]["COO → CSR"] =
-        @benchmarkable DeviceSparseMatrixCSR($dsm_coo)
+    SUITE["Format Conversions"][array_type_name]["COO → CSR"] = @benchmarkable begin
+        DeviceSparseMatrixCSR($dsm_coo)
+        _synchronize_backend($dsm_coo)
+    end
 
     return nothing
 end
```
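For reference, a COO → CSC conversion like the one timed here amounts to a counting sort on column indices. The function below is a minimal pure-Julia sketch assuming 1-based, in-bounds indices; it is not the package's kernel-based implementation, and it keeps entries in input order within each column rather than sorting by row.

```julia
# Build 1-based CSC arrays (colptr, rowval, nzval) from COO triplets
# over an n-column matrix.
function coo_to_csc(n, rows, cols, vals)
    colptr = zeros(Int, n + 1)
    for c in cols                     # count entries per column
        colptr[c + 1] += 1
    end
    colptr[1] = 1
    for j in 2:(n + 1)                # prefix sum -> column start offsets
        colptr[j] += colptr[j - 1]
    end
    rowval = similar(rows)
    nzval = similar(vals)
    next = copy(colptr)               # next free slot in each column
    for k in eachindex(vals)          # scatter entries into column order
        p = next[cols[k]]
        rowval[p] = rows[k]
        nzval[p] = vals[k]
        next[cols[k]] += 1
    end
    return colptr, rowval, nzval
end

# Entries (1,1)=10, (2,1)=20, (1,2)=30 of a 2×2 matrix:
colptr, rowval, nzval = coo_to_csc(2, [1, 2, 1], [1, 1, 2], [10.0, 20.0, 30.0])
# colptr == [1, 3, 4]; rowval == [1, 2, 1]; nzval == [10.0, 20.0, 30.0]
```

The cost is O(nnz + n), which is why these conversions are cheap enough to benchmark alongside the arithmetic kernels.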

benchmarks/matrix_benchmarks.jl

Lines changed: 50 additions & 26 deletions
```diff
@@ -37,14 +37,20 @@ function benchmark_matrix_vector_mul!(
     x_vec = adapt(array_constructor, randn(T, N))
 
     # Level 3: Format (CSC, CSR, COO - will be plotted together)
-    SUITE["Matrix-Vector Multiplication"][array_type_name]["CSC"] =
-        @benchmarkable mul!($vec, $dsm_csc, $x_vec)
+    SUITE["Matrix-Vector Multiplication"][array_type_name]["CSC"] = @benchmarkable begin
+        mul!($vec, $dsm_csc, $x_vec)
+        _synchronize_backend($dsm_csc)
+    end
 
-    SUITE["Matrix-Vector Multiplication"][array_type_name]["CSR"] =
-        @benchmarkable mul!($vec, $dsm_csr, $x_vec)
+    SUITE["Matrix-Vector Multiplication"][array_type_name]["CSR"] = @benchmarkable begin
+        mul!($vec, $dsm_csr, $x_vec)
+        _synchronize_backend($dsm_csr)
+    end
 
-    SUITE["Matrix-Vector Multiplication"][array_type_name]["COO"] =
-        @benchmarkable mul!($vec, $dsm_coo, $x_vec)
+    SUITE["Matrix-Vector Multiplication"][array_type_name]["COO"] = @benchmarkable begin
+        mul!($vec, $dsm_coo, $x_vec)
+        _synchronize_backend($dsm_coo)
+    end
 
     return nothing
 end
@@ -91,14 +97,20 @@ function benchmark_matrix_matrix_mul!(
     result_mat = adapt(array_constructor, zeros(T, N, M))
 
     # Level 3: Format (CSC, CSR, COO - will be plotted together)
-    SUITE["Matrix-Matrix Multiplication"][array_type_name]["CSC"] =
-        @benchmarkable mul!($result_mat, $dsm_csc, $mat)
+    SUITE["Matrix-Matrix Multiplication"][array_type_name]["CSC"] = @benchmarkable begin
+        mul!($result_mat, $dsm_csc, $mat)
+        _synchronize_backend($dsm_csc)
+    end
 
-    SUITE["Matrix-Matrix Multiplication"][array_type_name]["CSR"] =
-        @benchmarkable mul!($result_mat, $dsm_csr, $mat)
+    SUITE["Matrix-Matrix Multiplication"][array_type_name]["CSR"] = @benchmarkable begin
+        mul!($result_mat, $dsm_csr, $mat)
+        _synchronize_backend($dsm_csr)
+    end
 
-    SUITE["Matrix-Matrix Multiplication"][array_type_name]["COO"] =
-        @benchmarkable mul!($result_mat, $dsm_coo, $mat)
+    SUITE["Matrix-Matrix Multiplication"][array_type_name]["COO"] = @benchmarkable begin
+        mul!($result_mat, $dsm_coo, $mat)
+        _synchronize_backend($dsm_coo)
+    end
 
     return nothing
 end
@@ -142,14 +154,20 @@ function benchmark_three_arg_dot!(
     y_vec = adapt(array_constructor, randn(T, N))
 
     # Level 3: Format (CSC, CSR, COO - will be plotted together)
-    SUITE["Three-argument dot"][array_type_name]["CSC"] =
-        @benchmarkable dot($x_vec, $dsm_csc, $y_vec)
+    SUITE["Three-argument dot"][array_type_name]["CSC"] = @benchmarkable begin
+        dot($x_vec, $dsm_csc, $y_vec)
+        _synchronize_backend($dsm_csc)
+    end
 
-    SUITE["Three-argument dot"][array_type_name]["CSR"] =
-        @benchmarkable dot($x_vec, $dsm_csr, $y_vec)
+    SUITE["Three-argument dot"][array_type_name]["CSR"] = @benchmarkable begin
+        dot($x_vec, $dsm_csr, $y_vec)
+        _synchronize_backend($dsm_csr)
+    end
 
-    SUITE["Three-argument dot"][array_type_name]["COO"] =
-        @benchmarkable dot($x_vec, $dsm_coo, $y_vec)
+    SUITE["Three-argument dot"][array_type_name]["COO"] = @benchmarkable begin
+        dot($x_vec, $dsm_coo, $y_vec)
+        _synchronize_backend($dsm_coo)
+    end
 
     return nothing
 end
@@ -192,14 +210,20 @@ function benchmark_sparse_dense_add!(
     dense_mat = adapt(array_constructor, randn(T, N, N))
 
     # Level 3: Format (CSC, CSR, COO - will be plotted together)
-    SUITE["Sparse + Dense Addition"][array_type_name]["CSC"] =
-        @benchmarkable $dsm_csc + $dense_mat
-
-    SUITE["Sparse + Dense Addition"][array_type_name]["CSR"] =
-        @benchmarkable $dsm_csr + $dense_mat
-
-    SUITE["Sparse + Dense Addition"][array_type_name]["COO"] =
-        @benchmarkable $dsm_coo + $dense_mat
+    SUITE["Sparse + Dense Addition"][array_type_name]["CSC"] = @benchmarkable begin
+        $dsm_csc + $dense_mat
+        _synchronize_backend($dsm_csc)
+    end
+
+    SUITE["Sparse + Dense Addition"][array_type_name]["CSR"] = @benchmarkable begin
+        $dsm_csr + $dense_mat
+        _synchronize_backend($dsm_csr)
+    end
+
+    SUITE["Sparse + Dense Addition"][array_type_name]["COO"] = @benchmarkable begin
+        $dsm_coo + $dense_mat
+        _synchronize_backend($dsm_coo)
+    end
 
     return nothing
 end
```
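The `mul!` operation timed above follows the classic column-oriented traversal when the matrix is stored in CSC form. Below is a pure-CPU sketch of that algorithm (a simplification for illustration, not the package's kernel-based GPU code path), operating on raw CSC arrays with 1-based indices.

```julia
# y = A * x for a CSC matrix given as (colptr, rowval, nzval).
function csc_matvec!(y, colptr, rowval, nzval, x)
    fill!(y, zero(eltype(y)))
    for j in eachindex(x)                       # walk columns of A
        xj = x[j]
        for p in colptr[j]:(colptr[j + 1] - 1)  # nonzeros of column j
            y[rowval[p]] += nzval[p] * xj       # scatter A[i,j] * x[j] into y[i]
        end
    end
    return y
end

# A = [10 30; 20 0] in CSC form, x = [1, 2]
y = csc_matvec!(zeros(2), [1, 3, 4], [1, 2, 1], [10.0, 20.0, 30.0], [1.0, 2.0])
# y == [70.0, 20.0]
```

The scatter into `y[rowval[p]]` is what makes CSC mat-vec awkward to parallelize (multiple columns may update the same row), which is one reason the benchmarks compare CSC against CSR and COO on each backend.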

benchmarks/runbenchmarks.jl

Lines changed: 4 additions & 0 deletions
```diff
@@ -4,11 +4,15 @@ using SparseArrays
 using DeviceSparseArrays
 using Adapt
 using JLArrays
+using KernelAbstractions
 
 BLAS.set_num_threads(2)
 
 const SUITE = BenchmarkGroup()
 
+# Include utility functions
+include("benchmark_utils.jl")
+
 # Include benchmark files
 include("vector_benchmarks.jl")
 include("matrix_benchmarks.jl")
```

benchmarks/vector_benchmarks.jl

Lines changed: 8 additions & 3 deletions
```diff
@@ -24,7 +24,10 @@ function benchmark_vector_sum!(
     dsv = adapt(array_constructor, DeviceSparseVector(sv))
 
     # Level 3: Specific operation (will be plotted together)
-    SUITE["Sparse Vector"][array_type_name]["Sum"] = @benchmarkable sum($dsv)
+    SUITE["Sparse Vector"][array_type_name]["Sum"] = @benchmarkable begin
+        sum($dsv)
+        _synchronize_backend($dsv)
+    end
 
     return nothing
 end
@@ -58,8 +61,10 @@ function benchmark_vector_sparse_dense_dot!(
     dense_vec = adapt(array_constructor, randn(T, N))
 
     # Level 3: Specific operation (will be plotted together)
-    SUITE["Sparse Vector"][array_type_name]["Sparse-Dense dot"] =
-        @benchmarkable dot($dsv, $dense_vec)
+    SUITE["Sparse Vector"][array_type_name]["Sparse-Dense dot"] = @benchmarkable begin
+        dot($dsv, $dense_vec)
+        _synchronize_backend($dsv)
+    end
 
     return nothing
 end
```

0 commit comments
