@@ -19,12 +19,12 @@ which we'll sum up, and benchmark Julia's built-in, non-parallel `sum` function.
using Base.Threads: nthreads
using BenchmarkTools

- data = rand(10_000_000 * nthreads());
+ data = rand(1_000_000 * nthreads());

@btime sum($data);
````

````
- 27.834 ms (0 allocations: 0 bytes)
+ 2.327 ms (0 allocations: 0 bytes)

````
@@ -92,11 +92,11 @@ nthreads()
````

````
- 348.539 ms (221 allocations: 18.47 KiB)
+ 52.919 ms (221 allocations: 18.47 KiB)

````

- A **slowdown**?! Clearly, that's the opposite of what we tried to achieve!
+ A (huge) **slowdown**?! Clearly, that's the opposite of what we tried to achieve!

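The parallel implementation benchmarked here sits outside this excerpt. As a rough, hypothetical sketch, assuming the usual pattern of accumulating directly into a shared `psums` array (the function name and details are assumptions, not the file's actual code), it could look like:

````julia
using Base.Threads: nthreads, @spawn
using ChunkSplitters: chunks

# Hypothetical sketch, not the file's actual code: every task writes
# its running sum straight into the shared array `psums`. Neighboring
# elements of `psums` likely sit on the same CPU cache line, so the
# tasks constantly invalidate each other's caches (false sharing).
function naive_parallel_sum(data; nchunks = nthreads())
    psums = zeros(eltype(data), nchunks)
    @sync for (c, idcs) in enumerate(chunks(data; n = nchunks))
        @spawn for i in idcs
            psums[c] += data[i]
        end
    end
    return sum(psums)
end
````
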
## The issue: False sharing
@@ -135,8 +135,8 @@ function parallel_sum_tasklocal(data; nchunks = nthreads())
@sync for (c, idcs) in enumerate(chunks(data; n = nchunks))
    @spawn begin
        local s = zero(eltype(data))
-       @simd for i in idcs
-           @inbounds s += data[i]
+       for i in idcs
+           s += data[i]
        end
        psums[c] = s
    end
@@ -149,11 +149,11 @@ end
````

````
- 50.021 ms (221 allocations: 18.55 KiB)
+ 1.120 ms (221 allocations: 18.55 KiB)

````

- Finally, there is our expected speed up! 🎉
+ Finally, there is a speed up! 🎉

Two comments are in order.
@@ -179,14 +179,18 @@ end
````

````
- 51.305 ms (64 allocations: 5.72 KiB)
+ 893.396 μs (64 allocations: 5.72 KiB)

````

- This implementation has comparable performance and, more importantly, is conceptually
+ This implementation is conceptually
clearer in that there is no explicit modification of shared state, i.e. no `psums[c] = s`,
anywhere at all. We can't run into false sharing if we don't modify shared state 😉.

+ Note that since we use the built-in `sum` function, which is highly optimized, we might see
+ better runtimes due to other effects - like SIMD and the absence of bounds checks - compared
+ to the simple for-loop accumulation in `parallel_sum_tasklocal` above.
+
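The map-based implementation itself is outside this excerpt; given the description above - one task per chunk, each computing a local `sum`, with the fetched results reduced at the end - a sketch might look as follows (the name `parallel_sum_map` is an assumption):

````julia
using Base.Threads: nthreads, @spawn
using ChunkSplitters: chunks

# Sketch under assumed naming: each task returns its chunk's sum, so
# no task ever writes to shared state and false sharing cannot occur.
function parallel_sum_map(data; nchunks = nthreads())
    tasks = map(chunks(data; n = nchunks)) do idcs
        @spawn @views sum(data[idcs])
    end
    return sum(fetch.(tasks))
end
````
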
## Parallel summation with OhMyThreads

Finally, all of the above is abstracted away for you if you simply use [`treduce`](@ref)
@@ -200,7 +204,7 @@ using OhMyThreads: treduce
````

````
- 50.873 ms (68 allocations: 5.92 KiB)
+ 899.097 μs (68 allocations: 5.92 KiB)

````
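
For completeness, a minimal usage sketch of `treduce` for this summation; the exact benchmarked call is not shown in this excerpt, but a plain invocation reduces `data` with `+` in parallel:

````julia
using OhMyThreads: treduce
using Base.Threads: nthreads

data = rand(1_000_000 * nthreads());

# parallel reduction with `+`; chunking and task-local accumulation
# happen inside treduce, so there is no shared mutable state to race on
s = treduce(+, data)
s ≈ sum(data)  # true, up to floating-point reassociation
````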