
Commit 0f9c12b

fix falsesharing docs
1 parent 1f78cd4 commit 0f9c12b

2 files changed (+26, -16 lines)

docs/src/literate/falsesharing/falsesharing.jl

Lines changed: 11 additions & 5 deletions
@@ -13,8 +13,10 @@
 
 using Base.Threads: nthreads
 using BenchmarkTools
+using ThreadPinning #hide
+pinthreads(:cores) #hide
 
-data = rand(10_000_000 * nthreads());
+data = rand(1_000_000 * nthreads());
 @btime sum($data);
 
 #
@@ -103,8 +105,8 @@ function parallel_sum_tasklocal(data; nchunks = nthreads())
     @sync for (c, idcs) in enumerate(chunks(data; n = nchunks))
         @spawn begin
             local s = zero(eltype(data))
-            @simd for i in idcs
-                @inbounds s += data[i]
+            for i in idcs
+                s += data[i]
             end
             psums[c] = s
         end
@@ -115,7 +117,7 @@
 @test sum(data) ≈ parallel_sum_tasklocal(data)
 @btime parallel_sum_tasklocal($data);
 
-# Finally, there is our expected speed up! 🎉
+# Finally, there is a speed up! 🎉
 #
 # Two comments are in order.
 #
@@ -138,9 +140,13 @@
 @test sum(data) ≈ parallel_sum_map(data)
 @btime parallel_sum_map($data);
 
-# This implementation has comparable performance and, more importantly, is conceptually
+# This implementation is conceptually
 # clearer in that there is no explicit modification of shared state, i.e. no `pums[c] = s`,
 # anywhere at all. We can't run into false sharing if we don't modify shared state 😉.
+#
+# Note that since we use the built-in `sum` function, which is highly optimized, we might see
+# better runtimes due to other effects - like SIMD and the absence of bounds checks - compared
+# to the simple for-loop accumulation in `parallel_sum_tasklocal` above.
 
 #
 # ## Parallel summation with OhMyThreads
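
For readers landing on this commit without the surrounding tutorial, the following is a minimal, self-contained sketch of the two patterns these hunks distinguish. It is a reconstruction under assumptions, not the repository file: `parallel_sum_falsesharing` is a hypothetical name for the slow variant the docs discuss, and `chunks` is assumed to come from ChunkSplitters.jl, as used in the tutorial code above.

```julia
using Base.Threads: nthreads, @spawn
using ChunkSplitters: chunks   # assumption: source of the `chunks` helper used in the tutorial

data = rand(1_000_000 * nthreads());

# False-sharing-prone variant (hypothetical name): each task only ever touches
# its own slot psums[c], yet neighbouring slots live on the same cache line, so
# every `psums[c] += ...` forces cache-line traffic between cores.
function parallel_sum_falsesharing(data; nchunks = nthreads())
    psums = zeros(eltype(data), nchunks)
    @sync for (c, idcs) in enumerate(chunks(data; n = nchunks))
        @spawn for i in idcs
            psums[c] += data[i]
        end
    end
    return sum(psums)
end

# The pattern edited in the hunk above: accumulate into a task-local `s` and
# write to the shared vector only once per task, so the shared cache lines are
# touched just once per chunk.
function parallel_sum_tasklocal(data; nchunks = nthreads())
    psums = zeros(eltype(data), nchunks)
    @sync for (c, idcs) in enumerate(chunks(data; n = nchunks))
        @spawn begin
            local s = zero(eltype(data))
            for i in idcs
                s += data[i]
            end
            psums[c] = s
        end
    end
    return sum(psums)
end
```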

docs/src/literate/falsesharing/falsesharing.md

Lines changed: 15 additions & 11 deletions
@@ -19,12 +19,12 @@ which we'll sum up, and benchmark Julia's built-in, non-parallel `sum` function.
 using Base.Threads: nthreads
 using BenchmarkTools
 
-data = rand(10_000_000 * nthreads());
+data = rand(1_000_000 * nthreads());
 @btime sum($data);
 ````
 
 ````
-27.834 ms (0 allocations: 0 bytes)
+2.327 ms (0 allocations: 0 bytes)
 
 ````
 
@@ -92,11 +92,11 @@ nthreads()
 ````
 
 ````
-348.539 ms (221 allocations: 18.47 KiB)
+52.919 ms (221 allocations: 18.47 KiB)
 
 ````
 
-A **slowdown**?! Clearly, that's the opposite of what we tried to achieve!
+A (huge) **slowdown**?! Clearly, that's the opposite of what we tried to achieve!
 
 ## The issue: False sharing
 
@@ -135,8 +135,8 @@ function parallel_sum_tasklocal(data; nchunks = nthreads())
     @sync for (c, idcs) in enumerate(chunks(data; n = nchunks))
         @spawn begin
             local s = zero(eltype(data))
-            @simd for i in idcs
-                @inbounds s += data[i]
+            for i in idcs
+                s += data[i]
             end
             psums[c] = s
         end
@@ -149,11 +149,11 @@
 ````
 
 ````
-50.021 ms (221 allocations: 18.55 KiB)
+1.120 ms (221 allocations: 18.55 KiB)
 
 ````
 
-Finally, there is our expected speed up! 🎉
+Finally, there is a speed up! 🎉
 
 Two comments are in order.
 
@@ -179,14 +179,18 @@
 ````
 
 ````
-51.305 ms (64 allocations: 5.72 KiB)
+893.396 μs (64 allocations: 5.72 KiB)
 
 ````
 
-This implementation has comparable performance and, more importantly, is conceptually
+This implementation is conceptually
 clearer in that there is no explicit modification of shared state, i.e. no `pums[c] = s`,
 anywhere at all. We can't run into false sharing if we don't modify shared state 😉.
 
+Note that since we use the built-in `sum` function, which is highly optimized, we might see
+better runtimes due to other effects - like SIMD and the absence of bounds checks - compared
+to the simple for-loop accumulation in `parallel_sum_tasklocal` above.
+
 ## Parallel summation with OhMyThreads
 
 Finally, all of the above is abstracted away for you if you simply use [`treduce`](@ref)
@@ -200,7 +204,7 @@ using OhMyThreads: treduce
 ````
 
 ````
-50.873 ms (68 allocations: 5.92 KiB)
+899.097 μs (68 allocations: 5.92 KiB)
 
 ````
 
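As a usage note on the two approaches referenced in the last hunks: below is a short sketch of summing each chunk with the optimized built-in `sum` and of the `OhMyThreads.treduce` one-liner. The body of `parallel_sum_map` is not shown in this diff, so it is an assumed reconstruction of the idea, again taking `chunks` from ChunkSplitters.jl as an assumption.

```julia
using Base.Threads: nthreads, @spawn
using ChunkSplitters: chunks   # assumption, as above
using OhMyThreads: treduce

data = rand(1_000_000 * nthreads());

# Sketch of the chunk-wise `sum` idea behind `parallel_sum_map` (body assumed):
# each task calls the built-in `sum` on a view of its chunk, so no shared state
# is mutated at all, and `sum` itself handles SIMD and bounds checks.
function parallel_sum_map(data; nchunks = nthreads())
    tasks = map(chunks(data; n = nchunks)) do idcs
        @spawn sum(view(data, idcs))
    end
    return sum(fetch, tasks)
end

parallel_sum_map(data)

# The abstraction mentioned in the final hunk: treduce performs the chunking
# and task-local accumulation internally.
treduce(+, data)
```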
