Skip to content

Commit e4f7ddf

Browse files
committed
Lab10: Added multithread intro and sum exercise.
1 parent 53cb1cb commit e4f7ddf

File tree

1 file changed

+156
-1
lines changed

1 file changed

+156
-1
lines changed

docs/src/lecture_10/lab.md

Lines changed: 156 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -247,8 +247,163 @@ df[1:50,:]
247247
```
248248

249249
## Threading
250+
The number of threads that Julia can use can be set up in an environmental variable `JULIA_NUM_THREADS` or directly on julia startup with cmd line option `-t ##` or `--threads ##`. If both are specified the latter takes precedence.
251+
```bash
252+
julia -t 8
253+
```
254+
In order to find out how many threads are currently available, there exist the `nthreads` function inside `Base.Threads` library. There is also an analog to the Distributed `myid` example, called `threadid`.
255+
```julia
256+
using Base.Threads
257+
nthreads()
258+
threadid()
259+
```
260+
As opposed to distributed/multiprocessing programming, threads have access to the whole memory of Julia's process, therefore we don't have to deal with separate environment manipulation, code loading and data transfers. However we have to be aware of the fact that memory can be modified from two different places and that there may be some performance penalties of accessing memory that is physically further from a given core (e.g. caches of different core or different NUMA[^2] nodes)
261+
262+
[^2]: NUMA - [https://en.wikipedia.org/wiki/Non-uniform\_memory\_access](https://en.wikipedia.org/wiki/Non-uniform_memory_access)
263+
264+
!!! info "Hyper threads"
265+
In most of today's CPUs the number of threads is larger than the number of physical cores. These additional threads are usually called hyper threads[^3]. The technology relies on the fact, that for a given "instruction" there may be underutilized parts of the CPU core's machinery (such as one of many arithmetic units) and if a suitable work/instruction comes in it can be run simultaneously. In practice this means that adding more threads than physical cores may not accompanied with the expected speed up.
266+
267+
[^3]: Hyperthreading - [https://en.wikipedia.org/wiki/Hyper-threading](https://en.wikipedia.org/wiki/Hyper-threading)
268+
269+
The easiest (not always yielding the correct result) way how to turn a code into multi threaded code is putting the `@threads` macro in front of a for loop, which instructs Julia to run the body on separate threads.
270+
```julia
271+
A = Array{Union{Int,Missing}}(missing, nthreads())
272+
for i in 1:nthreads()
273+
A[threadid()] = threadid()
274+
end
275+
A # only the first element is filled
276+
```
277+
278+
```julia
279+
@threads for i in 1:nthreads()
280+
A[threadid()] = threadid()
281+
end
282+
A # the expected results
283+
```
284+
285+
### Multithreaded sum
286+
Armed with this knowledge let's tackle the problem of a simple sum.
287+
```julia
288+
function threaded_sum_naive(A)
289+
r = zero(eltype(A))
290+
@threads for i in eachindex(A)
291+
@inbounds r += A[i]
292+
end
293+
return r
294+
end
295+
```
296+
Comparing this with the built-in sum we see not an insignificant discrepancy (one that cannot be explained by reordering of computation)
297+
```julia
298+
a = rand(10_000_000);
299+
sum(a), threaded_sum_naive(a)
300+
```
301+
Recalling what has been said above we have to be aware of the fact that the data can be accessed from multiple threads at once, which if not taken into an account means that each thread reads possibly outdated value and overwrites it with its own updated state. There are two solutions which we will tackle in the next two exercises.
302+
303+
304+
```@raw html
305+
<div class="admonition is-category-exercise">
306+
<header class="admonition-header">Exercise</header>
307+
<div class="admonition-body">
308+
```
309+
Implement `threaded_sum_atom`, which uses `Atomic` wrapper around the accumulator variable `r` in order to ensure correct locking of data access.
310+
311+
**HINTS**:
312+
- use `atomic_add!` as a replacement of `r += A[i]`
313+
- "collect" the result by dereferencing variable `r` with empty bracket operator `[]`
314+
315+
!!! info "Side note on dereferencing"
316+
In Julia we can create references to a data types, which are guarranteed to point to correct and allocated type in memory, as long as a reference exists the memory is not garbage collected. These are constructed with `Ref(x)`, `Ref(a, 7)` or `Ref{T}()` for reference to variable `x`, `7`th element of array `a` and an empty reference respectively. Dereferencing aka asking about the underlying value is done using empty bracket operator `[]`.
317+
```@repl lab10_refs
318+
x = 1 # integer
319+
rx = Ref(x) # reference to that particular integer `x`
320+
x == rx[] # dereferencing yields the same value
321+
```
322+
There also exist unsafe references/pointers `Ptr`, however we should not really come into a contact with those.
323+
324+
```@raw html
325+
</div></div>
326+
<details class = "solution-body">
327+
<summary class = "solution-header">Solution:</summary><p>
328+
```
329+
330+
```julia
331+
using BenchmarkTools
332+
a = rand(10^7);
250333

251-
### Sum with threads
334+
function threaded_sum_atom(A)
335+
r = Atomic{eltype(A)}(zero(eltype(A)))
336+
@threads for i in eachindex(A)
337+
@inbounds atomic_add!(r, A[i])
338+
end
339+
return r[]
340+
end
341+
```
342+
There is a fancier and faster way to do this by chunking the array, because this is comparable in speed to sequential code.
343+
344+
```julia
345+
function threaded_sum_fancy_atomic(A)
346+
r = Atomic{eltype(A)}(zero(eltype(A)))
347+
len, rem = divrem(length(A), nthreads())
348+
@threads for t in 1:nthreads()
349+
rₜ = zero(eltype(A))
350+
@simd for i in (1:len) .+ (t-1)*len
351+
@inbounds rₜ += A[i]
352+
end
353+
atomic_add!(r, rₜ)
354+
end
355+
# catch up any stragglers
356+
result = r[]
357+
@simd for i in length(A)-rem+1:length(A)
358+
@inbounds result += A[i]
359+
end
360+
return result
361+
end
362+
```
363+
364+
```@raw html
365+
</p></details>
366+
```
367+
368+
```@raw html
369+
<div class="admonition is-category-exercise">
370+
<header class="admonition-header">Exercise</header>
371+
<div class="admonition-body">
372+
```
373+
Implement `threaded_sum_buffer`, which uses an array of length `nthreads()` (we will call this buffer) for local aggregation of results of individual threads.
374+
375+
**HINTS**:
376+
- use `threadid()` to index the buffer array
377+
- sum the buffer array to obtain final result
378+
379+
380+
```@raw html
381+
</div></div>
382+
<details class = "solution-body">
383+
<summary class = "solution-header">Solution:</summary><p>
384+
```
385+
386+
```julia
387+
using BenchmarkTools
388+
a = rand(10^7);
389+
390+
function threaded_sum_buffer(A)
391+
R = zeros(eltype(A), nthreads())
392+
@threads for i in eachindex(A)
393+
@inbounds R[threadid()] += A[i]
394+
end
395+
r = zero(eltype(A))
396+
# sum the partial results from each thread
397+
for i in eachindex(R)
398+
@inbounds r += R[i]
399+
end
400+
return r
401+
end
402+
```
403+
404+
```@raw html
405+
</p></details>
406+
```
252407

253408
### Multithreaded file processing
254409

0 commit comments

Comments
 (0)