Commit b107907

Revamp section describing how to fix 1:nthreads() pattern (#137)
* remove section describing the partial fix to 1:nthreads() pattern
* add back in the non-recommended fix, but make it use `maxthreadid`
* run TLS doc generator
1 parent 1f27155 commit b107907

File tree: 3 files changed, +93 −62 lines

docs/src/literate/tls/Project.toml

Lines changed: 0 additions & 1 deletion

@@ -2,5 +2,4 @@
 BenchmarkTools = "6e4b80f9-dd63-53aa-95a3-0cdb28fa8baf"
 Bumper = "8ce10254-0962-460f-a3d8-1f77fea1446e"
 OhMyThreads = "67456a42-1dca-4109-a031-0a68de7e3ad5"
-StrideArrays = "d1fa6d79-ef01-42a6-86c9-f7c551f8593b"
 ThreadPinning = "811555cd-349b-4f26-b7bc-1f208b848042"

docs/src/literate/tls/tls.jl

Lines changed: 37 additions & 19 deletions
@@ -229,30 +229,24 @@ sleep(2) #hide
 #
 using Base.Threads: threadid

-function matmulsums_perthread_naive(As, Bs)
+function matmulsums_perthread_incorrect(As, Bs)
     N = size(first(As), 1)
     Cs = [Matrix{Float64}(undef, N, N) for _ in 1:nthreads()]
     tmap(As, Bs) do A, B
         C = Cs[threadid()]
         mul!(C, A, B)
         sum(C)
     end
-end
-
-## non uniform workload
-As_nu = [rand(256, isqrt(i)^2) for i in 1:768];
-Bs_nu = [rand(isqrt(i)^2, 256) for i in 1:768];
-res_nu = matmulsums(As_nu, Bs_nu);
-
-res_pt_naive = matmulsums_perthread_naive(As_nu, Bs_nu)
-res_nu ≈ res_pt_naive
+end;

-# Unfortunately, this approach is [**generally wrong**](https://julialang.org/blog/2023/07/PSA-dont-use-threadid/). The first issue is that `threadid()`
+# This approach is [**wrong**](https://julialang.org/blog/2023/07/PSA-dont-use-threadid/). The first issue is that `threadid()`
 # doesn't necessarily start at 1 (and thus might return a value `> nthreads()`), in which
 # case `Cs[threadid()]` would be an out-of-bounds access attempt. This might be surprising
 # but is a simple consequence of the ordering of different kinds of Julia threads: If Julia
 # is started with a non-zero number of interactive threads, e.g. `--threads 5,2`, the
 # interactive threads come first (look at `Threads.threadpool.(1:Threads.maxthreadid())`).
+# [Starting in Julia v1.12, Julia will launch with at least one interactive thread](https://github.com/JuliaLang/julia/pull/57087),
+# and so the above code will error by default.
 #
 # But even if we account for this offset there is another, more fundamental problem, namely
 # **task-migration**. By default, all spawned parallel tasks are "non-sticky" and can
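To make the thread-ordering issue above concrete, here is a small sketch (an editor-added illustration, not part of the commit) that inspects the thread layout the way the docs suggest with `Threads.threadpool.(1:Threads.maxthreadid())`:

```julia
using Base.Threads

# `maxthreadid()` counts threads across all threadpools, while `nthreads()`
# only counts the current (:default) pool. A buffer vector sized with
# `1:nthreads()` can therefore be indexed out of bounds by `threadid()`.
total = Threads.maxthreadid()
default_only = Threads.nthreads()
println("maxthreadid = $total, nthreads = $default_only")

# Interactive threads, if any, come first in the thread numbering:
println(Threads.threadpool.(1:total))

# Sizing buffers by maxthreadid avoids the out-of-bounds access
# (though not the task-migration race):
@assert total >= default_only
```

Running this under `julia --threads 5,2` would show the two interactive threads listed before the five default ones.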
@@ -266,31 +260,55 @@ res_nu ≈ res_pt_naive
 # (Note that, in practice, this - most likely 😉 - doesn't happen for the very simple example
 # above, but you can't rely on it!)
 #
-# ### The quick fix (with caveats)
+# ### The quick (and non-recommended) fix
 #
 # A simple solution for the task-migration issue is to opt-out of dynamic scheduling with
 # `scheduler=:static` (or `scheduler=StaticScheduler()`). This scheduler statically
 # assigns tasks to threads upfront without any dynamic rescheduling
 # (the tasks are sticky and won't migrate).
 #
+# We'll also need to switch from `nthreads` to `maxthreadid`, since the latter can be
+# greater than `nthreads`, as described above.
+#
+num_to_store() = isdefined(Threads, :maxthreadid) ? Threads.maxthreadid() : Threads.nthreads()
+
 function matmulsums_perthread_static(As, Bs)
     N = size(first(As), 1)
-    Cs = [Matrix{Float64}(undef, N, N) for _ in 1:nthreads()]
+    Cs = [Matrix{Float64}(undef, N, N) for _ in 1:num_to_store()]
+    ## Note!!!
+    ## This code is *incorrect* if used with a non-static scheduler. This
+    ## isn't just true in OhMyThreads but also applies to `Threads.@threads`.
+    ## You *must* use `Threads.@threads :static` or `scheduler = :static` to
+    ## avoid race conditions caused by task migration.
     tmap(As, Bs; scheduler = :static) do A, B
         C = Cs[threadid()]
         mul!(C, A, B)
         sum(C)
     end
 end

+## non uniform workload
+As_nu = [rand(256, isqrt(i)^2) for i in 1:768];
+Bs_nu = [rand(isqrt(i)^2, 256) for i in 1:768];
+res_nu = matmulsums(As_nu, Bs_nu);
+
 res_pt_static = matmulsums_perthread_static(As_nu, Bs_nu)
 res_nu ≈ res_pt_static

-# However, this approach doesn't solve the offset issue and, even worse, makes the parallel code
-# non-composable: If we call other multithreaded functions within the `tmap` or if
-# our parallel `matmulsums_perthread_static` itself gets called from another parallel region
-# we will likely oversubscribe the Julia threads and get subpar performance. Given these
-# caveats, we should therefore generally take a different approach.
+# However, this approach has serious shortcomings.
+#
+# 1. It can easily be broken if someone doesn't know that the `scheduler = :static`
+#    option is required for correctness, and removes it in a refactor.
+# 2. It makes the parallel code non-composable: If we call other multithreaded functions
+#    within the `tmap` or if our parallel `matmulsums_perthread_static` itself gets called
+#    from another parallel region we will likely oversubscribe the Julia threads and get
+#    subpar performance.
+# 3. It can waste memory by creating too many temporary storage slots, since `maxthreadid()`
+#    can give an over-estimate of the number of slots needed for the computation.
+#
+# While the above pattern might be the easiest to migrate to from the incorrect pattern,
+# we do not recommend it. We instead urge you to use task-local storage, or the
+# `Channel`-based techniques described below:
 #
 # ### The safe way: `Channel`
 #
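The `Channel`-based technique that the section refers to can be sketched in plain Base Julia (an editor-added illustration, not part of the commit; `matmulsums_channel` is a made-up name, and OhMyThreads' actual implementation may differ):

```julia
using LinearAlgebra: mul!
using Base.Threads: nthreads, @spawn

function matmulsums_channel(As, Bs)
    N = size(first(As), 1)
    # Keep a fixed pool of buffers in a Channel. take!/put! hand each buffer
    # to exactly one task at a time, so task migration cannot cause a race.
    chnl = Channel{Matrix{Float64}}(nthreads())
    foreach(_ -> put!(chnl, Matrix{Float64}(undef, N, N)), 1:nthreads())
    tasks = map(zip(As, Bs)) do (A, B)
        @spawn begin
            C = take!(chnl)   # check a buffer out of the pool
            mul!(C, A, B)
            s = sum(C)
            put!(chnl, C)     # return it for the next task
            s
        end
    end
    fetch.(tasks)
end
```

The key design point is that ownership of a buffer is transferred through the channel rather than inferred from `threadid()`, so correctness does not depend on which thread a task runs on.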
@@ -389,7 +407,7 @@ sort(res_nu) ≈ sort(res_channel_flipped)
 # we could replace `Channel() do .. end` with
 # `OhMyThreads.ChannelLike(1:length(As))`.

-# ## Bumper.jl (only for the brave)
+# ### Bumper.jl (only for the brave)
 #
 # If you are bold and want to cut down temporary allocations even more you can
 # give [Bumper.jl](https://github.com/MasonProtter/Bumper.jl) a try. Essentially, it
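For contrast with the non-recommended static fix above, the task-local idea the docs recommend can be approximated in plain Base Julia by giving each spawned task its own buffer for a chunk of the work (an editor-added sketch; `matmulsums_pertask` is a made-up name, not OhMyThreads' API):

```julia
using LinearAlgebra: mul!
using Base.Threads: nthreads, @spawn

function matmulsums_pertask(As, Bs; ntasks = nthreads())
    N = size(first(As), 1)
    # Split the indices into at most `ntasks` chunks, one task per chunk.
    chunks = Iterators.partition(eachindex(As, Bs), cld(length(As), ntasks))
    tasks = map(chunks) do idcs
        @spawn begin
            # The buffer belongs to this *task*, not to a thread id, so it
            # stays correct even if the scheduler migrates the task.
            C = Matrix{Float64}(undef, N, N)
            map(idcs) do i
                mul!(C, As[i], Bs[i])
                sum(C)
            end
        end
    end
    reduce(vcat, fetch.(tasks))
end
```

Because each task reuses one buffer across its whole chunk, this allocates only `ntasks` temporaries while remaining safe under the default dynamic scheduler.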

docs/src/literate/tls/tls.md

Lines changed: 56 additions & 42 deletions
@@ -262,12 +262,12 @@ using BenchmarkTools
 ````

 ````
-nthreads() = 10
-61.314 ms (3 allocations: 518.17 KiB)
-22.122 ms (1621 allocations: 384.06 MiB)
-7.620 ms (204 allocations: 10.08 MiB)
-7.702 ms (126 allocations: 5.03 MiB)
-7.600 ms (127 allocations: 5.03 MiB)
+nthreads() = 6
+50.439 ms (6 allocations: 518.14 KiB)
+39.387 ms (2467 allocations: 384.09 MiB)
+9.743 ms (165 allocations: 6.05 MiB)
+9.749 ms (962 allocations: 3.05 MiB)
+9.859 ms (199 allocations: 3.04 MiB)

 ````

@@ -293,35 +293,25 @@ and then to use the `threadid()` to select a buffer for a running task.
 ````julia
 using Base.Threads: threadid

-function matmulsums_perthread_naive(As, Bs)
+function matmulsums_perthread_incorrect(As, Bs)
     N = size(first(As), 1)
     Cs = [Matrix{Float64}(undef, N, N) for _ in 1:nthreads()]
     tmap(As, Bs) do A, B
         C = Cs[threadid()]
         mul!(C, A, B)
         sum(C)
     end
-end
-
-# non uniform workload
-As_nu = [rand(256, isqrt(i)^2) for i in 1:768];
-Bs_nu = [rand(isqrt(i)^2, 256) for i in 1:768];
-res_nu = matmulsums(As_nu, Bs_nu);
-
-res_pt_naive = matmulsums_perthread_naive(As_nu, Bs_nu)
-res_nu ≈ res_pt_naive
-````
-
-````
-true
+end;
 ````

-Unfortunately, this approach is [**generally wrong**](https://julialang.org/blog/2023/07/PSA-dont-use-threadid/). The first issue is that `threadid()`
+This approach is [**wrong**](https://julialang.org/blog/2023/07/PSA-dont-use-threadid/). The first issue is that `threadid()`
 doesn't necessarily start at 1 (and thus might return a value `> nthreads()`), in which
 case `Cs[threadid()]` would be an out-of-bounds access attempt. This might be surprising
 but is a simple consequence of the ordering of different kinds of Julia threads: If Julia
 is started with a non-zero number of interactive threads, e.g. `--threads 5,2`, the
 interactive threads come first (look at `Threads.threadpool.(1:Threads.maxthreadid())`).
+[Starting in Julia v1.12, Julia will launch with at least one interactive thread](https://github.com/JuliaLang/julia/pull/57087),
+and so the above code will error by default.

 But even if we account for this offset there is another, more fundamental problem, namely
 **task-migration**. By default, all spawned parallel tasks are "non-sticky" and can
@@ -335,24 +325,39 @@ condition because both tasks are mutating the same buffer.
 (Note that, in practice, this - most likely 😉 - doesn't happen for the very simple example
 above, but you can't rely on it!)

-### The quick fix (with caveats)
+### The quick (and non-recommended) fix

 A simple solution for the task-migration issue is to opt-out of dynamic scheduling with
 `scheduler=:static` (or `scheduler=StaticScheduler()`). This scheduler statically
 assigns tasks to threads upfront without any dynamic rescheduling
 (the tasks are sticky and won't migrate).

+We'll also need to switch from `nthreads` to `maxthreadid`, since the latter can be
+greater than `nthreads`, as described above.
+
 ````julia
+num_to_store() = isdefined(Threads, :maxthreadid) ? Threads.maxthreadid() : Threads.nthreads()
+
 function matmulsums_perthread_static(As, Bs)
     N = size(first(As), 1)
-    Cs = [Matrix{Float64}(undef, N, N) for _ in 1:nthreads()]
+    Cs = [Matrix{Float64}(undef, N, N) for _ in 1:num_to_store()]
+    # Note!!!
+    # This code is *incorrect* if used with a non-static scheduler. This
+    # isn't just true in OhMyThreads but also applies to `Threads.@threads`.
+    # You *must* use `Threads.@threads :static` or `scheduler = :static` to
+    # avoid race conditions caused by task migration.
     tmap(As, Bs; scheduler = :static) do A, B
         C = Cs[threadid()]
         mul!(C, A, B)
         sum(C)
     end
 end

+# non uniform workload
+As_nu = [rand(256, isqrt(i)^2) for i in 1:768];
+Bs_nu = [rand(isqrt(i)^2, 256) for i in 1:768];
+res_nu = matmulsums(As_nu, Bs_nu);
+
 res_pt_static = matmulsums_perthread_static(As_nu, Bs_nu)
 res_nu ≈ res_pt_static
 ````
@@ -361,11 +366,20 @@ res_nu ≈ res_pt_static
 true
 ````

-However, this approach doesn't solve the offset issue and, even worse, makes the parallel code
-non-composable: If we call other multithreaded functions within the `tmap` or if
-our parallel `matmulsums_perthread_static` itself gets called from another parallel region
-we will likely oversubscribe the Julia threads and get subpar performance. Given these
-caveats, we should therefore generally take a different approach.
+However, this approach has serious shortcomings.
+
+1. It can easily be broken if someone doesn't know that the `scheduler = :static`
+   option is required for correctness, and removes it in a refactor.
+2. It makes the parallel code non-composable: If we call other multithreaded functions
+   within the `tmap` or if our parallel `matmulsums_perthread_static` itself gets called
+   from another parallel region we will likely oversubscribe the Julia threads and get
+   subpar performance.
+3. It can waste memory by creating too many temporary storage slots, since `maxthreadid()`
+   can give an over-estimate of the number of slots needed for the computation.
+
+While the above pattern might be the easiest to migrate to from the incorrect pattern,
+we do not recommend it. We instead urge you to use task-local storage, or the
+`Channel`-based techniques described below:

 ### The safe way: `Channel`

@@ -419,13 +433,13 @@ of which gives us dynamic load balancing.
 ````

 ````
-170.563 ms (126 allocations: 5.03 MiB)
-165.647 ms (108 allocations: 5.02 MiB)
-172.216 ms (114 allocations: 5.02 MiB)
-108.662 ms (237 allocations: 10.05 MiB)
-114.673 ms (185 allocations: 5.04 MiB)
-97.933 ms (1118 allocations: 50.13 MiB)
-96.868 ms (746 allocations: 5.10 MiB)
+212.200 ms (962 allocations: 3.05 MiB)
+212.014 ms (191 allocations: 4.04 MiB)
+211.336 ms (190 allocations: 3.04 MiB)
+168.835 ms (1136 allocations: 6.05 MiB)
+169.097 ms (334 allocations: 3.04 MiB)
+130.469 ms (2530 allocations: 30.17 MiB)
+131.037 ms (1487 allocations: 3.14 MiB)

 ````

@@ -484,9 +498,9 @@ Quick benchmark:
 ````

 ````
-94.389 ms (170 allocations: 5.07 MiB)
-94.580 ms (271 allocations: 10.10 MiB)
-94.768 ms (1071 allocations: 50.41 MiB)
+137.431 ms (133 allocations: 3.04 MiB)
+126.854 ms (211 allocations: 6.06 MiB)
+127.647 ms (836 allocations: 30.29 MiB)

 ````

@@ -497,7 +511,7 @@ need to copy the elements into the `Channel`. Concretely, in the example above,
 we could replace `Channel() do .. end` with
 `OhMyThreads.ChannelLike(1:length(As))`.

-## Bumper.jl (only for the brave)
+### Bumper.jl (only for the brave)

 If you are bold and want to cut down temporary allocations even more you can
 give [Bumper.jl](https://github.com/MasonProtter/Bumper.jl) a try. Essentially, it
@@ -506,8 +520,8 @@ dynamically allocate memory to, and reset them at the end of a code block, just
 Julia's stack.
 Be warned though that Bumper.jl is (1) a rather young package with (likely) some bugs
 and (2) can easily lead to segfaults when used incorrectly. If you can live with the
-risk, Bumper.jl is especially useful for cases where the size of the preallocated matrix
-isn't known ahead of time, and even more useful if we want to do many intermediate
+risk, Bumper.jl is especially useful for cases where we don't know ahead of time how large
+a matrix to pre-allocate, and even more useful if we want to do many intermediate
 allocations on the task, not just one. For our example, this isn't the case but let's
 nonetheless see how one would use Bumper.jl here.

@@ -532,7 +546,7 @@ sort(res) ≈ sort(res_bumper)
 ````

 ````
-7.814 ms (134 allocations: 27.92 KiB)
+9.439 ms (198 allocations: 39.25 KiB)

 ````