```
## Threading

The number of threads that Julia can use is set via the environment variable `JULIA_NUM_THREADS` or directly at Julia startup with the command line option `-t ##` or `--threads ##`. If both are specified, the latter takes precedence.
```bash
julia -t 8
```
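The same thread count can be set through the environment variable mentioned above; a minimal sketch, assuming a POSIX shell:

```bash
# applies to this invocation only; -t/--threads would take precedence over it
JULIA_NUM_THREADS=8 julia
```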
To find out how many threads are currently available, there exists the `nthreads` function inside the `Base.Threads` library. There is also an analog to the Distributed `myid` example, called `threadid`.
```julia
using Base.Threads
nthreads()
threadid()
```
As opposed to distributed/multiprocessing programming, threads have access to the whole memory of Julia's process, so we don't have to deal with separate environment manipulation, code loading and data transfers. However, we have to be aware that memory can be modified from two different places, and that there may be performance penalties for accessing memory that is physically further from a given core (e.g. caches of a different core or different NUMA[^2] nodes).
[^2]: NUMA - [https://en.wikipedia.org/wiki/Non-uniform\_memory\_access](https://en.wikipedia.org/wiki/Non-uniform_memory_access)
!!! info "Hyper threads"
    In most of today's CPUs the number of threads is larger than the number of physical cores. These additional threads are usually called hyper-threads[^3]. The technology relies on the fact that for a given "instruction" there may be underutilized parts of the CPU core's machinery (such as one of many arithmetic units), and if a suitable work/instruction comes in, it can be run simultaneously. In practice this means that adding more threads than physical cores may not be accompanied by the expected speedup.

The easiest (though not always yielding the correct result) way to turn code into multithreaded code is to put the `@threads` macro in front of a for loop, which instructs Julia to run the body on separate threads.
```julia
A = Array{Union{Int,Missing}}(missing, nthreads())
for i in 1:nthreads()
    A[threadid()] = threadid()
end
A # only the first element is filled
```
```julia
@threads for i in 1:nthreads()
    A[threadid()] = threadid()
end
A # the expected results
```
### Multithreaded sum
Armed with this knowledge, let's tackle the problem of a simple sum.
```julia
function threaded_sum_naive(A)
    r = zero(eltype(A))
    @threads for i in eachindex(A)
        @inbounds r += A[i]
    end
    return r
end
```
Comparing this with the built-in `sum`, we see a not insignificant discrepancy (one that cannot be explained by reordering of the computation):
```julia
a = rand(10_000_000);
sum(a), threaded_sum_naive(a)
```
Recalling what has been said above, we have to be aware that the data can be accessed from multiple threads at once. If this is not taken into account, each thread may read a possibly outdated value and overwrite it with its own updated state. There are two solutions, which we will tackle in the next two exercises.

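To see the race concretely, here is a minimal illustrative sketch with a plain shared counter (the exact result is nondeterministic whenever more than one thread is available):

```julia
using Base.Threads

acc = 0
@threads for i in 1:1_000_000
    global acc += 1  # non-atomic read-modify-write: two threads can read the same value
end
acc  # with several threads this is very likely less than 1_000_000
```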
Implement `threaded_sum_atom`, which uses the `Atomic` wrapper around the accumulator variable `r` in order to ensure correct locking of data access.
**HINTS**:
- use `atomic_add!` as a replacement for `r += A[i]`
- "collect" the result by dereferencing the variable `r` with the empty bracket operator `[]`
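
For orientation before the exercise, the `Atomic` pieces from `Base.Threads` compose as follows (a minimal single-threaded sketch, not the solution itself):

```julia
using Base.Threads

r = Atomic{Int}(0)  # atomic wrapper around an Int, initialized to zero
atomic_add!(r, 5)   # atomically performs r += 5; returns the old value
r[]                 # dereferencing with [] yields the stored value, here 5
```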
!!! info "Side note on dereferencing"
    In Julia we can create references to data, which are guaranteed to point to a valid, allocated value in memory; as long as a reference exists, the memory is not garbage collected. References are constructed with `Ref(x)`, `Ref(a, 7)` or `Ref{T}()` for a reference to the variable `x`, to the `7`th element of the array `a`, or an empty reference respectively. Dereferencing, i.e. asking for the underlying value, is done using the empty bracket operator `[]`.

    ```@repl lab10_refs
    x = 1 # integer
    rx = Ref(x) # reference to that particular integer `x`
    x == rx[] # dereferencing yields the same value
    ```
    There also exist unsafe references/pointers, `Ptr`; however, we should not really come into contact with those.
```@raw html
</div></div>
<details class = "solution-body">
<summary class = "solution-header">Solution:</summary><p>
```
```julia
using BenchmarkTools
a = rand(10^7);
function threaded_sum_atom(A)
    r = Atomic{eltype(A)}(zero(eltype(A)))
    @threads for i in eachindex(A)
        @inbounds atomic_add!(r, A[i])
    end
    return r[]
end
```
There is a fancier and faster way to do this by chunking the array, because the atomic version is only comparable in speed to sequential code.
Implement `threaded_sum_buffer`, which uses an array of length `nthreads()` (we will call it a buffer) for local aggregation of the results of individual threads.
**HINTS**:
- use `threadid()` to index the buffer array
- sum the buffer array to obtain the final result
```@raw html
</div></div>
<details class = "solution-body">
<summary class = "solution-header">Solution:</summary><p>
```
```
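
A minimal sketch of one possible `threaded_sum_buffer`, following the hints above (the name `buffer` for the local array is our choice):

```julia
using Base.Threads

function threaded_sum_buffer(A)
    buffer = zeros(eltype(A), nthreads())  # one accumulator slot per thread
    @threads for i in eachindex(A)
        @inbounds buffer[threadid()] += A[i]
    end
    return sum(buffer)  # combine the per-thread partial sums
end
```

Because each thread writes only to its own slot, no atomic operations are needed, and the final `sum` over the `nthreads()`-element buffer is cheap.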