
Commit 71b8ba8

committed
commit after the lecture
1 parent d4bd895 commit 71b8ba8

3 files changed: +63 / -41 lines changed

docs/src/lecture_10/juliaset_p.jl

Lines changed: 4 additions & 2 deletions
@@ -1,3 +1,5 @@
+using Pkg
+Pkg.activate(@__DIR__)
 using Plots
 using BenchmarkTools
 using Distributed
@@ -69,5 +71,5 @@ function juliaset_shared(x, y, partitions = nworkers(), n = 1000)
 img
 end

-juliaset_shared(-0.79, 0.15)
-juliaset_shared(-0.79, 0.15, 16)
+# juliaset_shared(-0.79, 0.15)
+# juliaset_shared(-0.79, 0.15, 16)

docs/src/lecture_10/lecture.md

Lines changed: 59 additions & 39 deletions
@@ -7,10 +7,10 @@ Julia offers different levels of parallel programming

 In this lecture, we will focus mainly on the first two, since SIMD instructions are mainly used for low-level optimization (such as writing your own very performant BLAS library), and task switching is not a true paralelism, but allows to run a different task when one task is waiting for example for IO.

-**The most important lesson is that before you jump into the parallelism, be certain you have made your squential code as fast as possible.**
+**The most important lesson is that before you jump into parallelism, be certain you have made your sequential code as fast as possible.**

 ## Process-level paralelism
-Process-level paralelism means we run several instances of Julia (in different processes) and they communicate between each other using inter-process communication (IPC). The implementation of IPC differs if parallel julia instances share the same machine, or they are located spread over the network. By default, different processes *do not share any libraries or any variables*. The are loaded as clean and it is up to the user to set-up all needed code and data.
+Process-level parallelism means we run several instances of Julia (in different processes) that communicate with each other using inter-process communication (IPC). The implementation of IPC differs depending on whether the parallel Julia instances share the same machine or are on different machines spread over the network. By default, different processes *do not share any libraries or any variables*. They are loaded clean, and it is up to the user to set up all needed code and data.

 Julia's default modus operandi is a single *main* instance controlling several workers. This main instance has `myid() == 1`, worker processes receive higher numbers. Julia can be started with multiple workers from the very beggining, using `-p` switch as
 ```julia
@@ -31,14 +31,16 @@ As we have mentioned, workers are loaded without libraries. We can see that by r
 ```julia
 @everywhere InteractiveUtils.varinfo()
 ```
-which fails, but
+which fails, but after loading `InteractiveUtils` everywhere
 ```julia
+using Statistics
 @everywhere begin
     using InteractiveUtils
     println(InteractiveUtils.varinfo(;imported = true))
 end
 ```
-`@everywhere` macro allows us to define function and variables, and import libraries on workers as
+we see that `Statistics` was loaded only on the main process. Thus, there is no magical sharing of data and code.
+With the `@everywhere` macro we can define functions and variables, and import libraries on workers as
 ```julia
 @everywhere begin
     foo(x, y) = x * y + sin(y)
@@ -79,7 +81,7 @@ An interesting feature of `fetch` is that it re-throw an exception raised on a d
 end
 r = @spawnat 2 exfoo()
 ```
-where `@spawnat` is a an alternative to `remotecall`, which executes a closure around expression (in this case `exfoo()`) on a specified worker (in this case 2). Fetching the result `r` throws an exception on the main process.
+where we have used `@spawnat` instead of `remotecall`. It is a higher-level alternative that executes a closure around the expression (in this case `exfoo()`) on a specified worker, in this case worker 2. Coming back to the example, when we fetch the result `r`, the exception is thrown on the main process, not on the worker.
 ```julia
 fetch(r)
 ```
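For comparison, a minimal sketch of the lower-level `remotecall`/`fetch` pair that `@spawnat` wraps (the added worker count, the target worker id 2, and the toy expressions are assumptions for illustration, not part of the lecture):
```julia
using Distributed
nprocs() == 1 && addprocs(2)       # assumption: make sure worker 2 exists

# remotecall returns a Future immediately; fetch blocks until the result
# (or a re-thrown remote exception) arrives on the calling process.
f = remotecall(+, 2, 1, 2)         # evaluate 1 + 2 on worker 2
fetch(f)                           # 3

# @spawnat is the macro counterpart for an arbitrary expression
r = @spawnat 2 sum(abs2, 1:10)
fetch(r)                           # 385
```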
@@ -133,14 +135,14 @@ using Plots
 frac = juliaset(-0.79, 0.15)
 plot(heatmap(1:size(frac,1),1:size(frac,2), frac, color=:Spectral))
 ```
-To observe the execution length, we will use `BenchmarkTools.jl` again
+To measure the execution time, we will use `BenchmarkTools.jl`
 ```
 using BenchmarkTools
 julia> @btime juliaset(-0.79, 0.15);
 39.822 ms (2 allocations: 976.70 KiB)
 ```

-Let's now try to speed-up the computation using more processes.
+Let's now try to speed up the computation using more processes. We first make the functions available on the workers
 ```julia
 using Plots
 @everywhere begin
@@ -162,10 +164,8 @@ using Plots
 nothing
 end
 end
-frac = juliaset(-0.79, 0.15)
-plot(heatmap(1:size(frac,1),1:size(frac,2), frac, color=:Spectral))
 ```
-We can split the computation of the whole image into bands, such that each worker computes a smaller portion.
+For the actual parallelisation, we split the computation of the whole image into bands, such that each worker computes a smaller portion.
 ```julia
 @everywhere begin
     function juliaset_columns(c, n, columns)
@@ -208,17 +208,19 @@ julia> @btime juliaset_pmap(-0.79, 0.15);
 which has slightly better timing then the version based on `@spawnat` and `fetch` (as explained below in section about `Threads`, the parallel computation of Julia set suffers from each pixel taking different time to compute, which can be relieved by dividing the work into more parts --- `@btime juliaset_pmap(-0.79, 0.15, 1000, 16);`).

 ## Shared memory
-When all workers and master are located on the same process, and the OS supports sharing memory between processes (by sharing memory pages), we can use `SharedArrays` to avoid sending the matrix with results.
-```julia
-@everywhere using SharedArrays
-function juliaset_shared(x, y, n=1000)
-    c = x + y*im
-    img = SharedArray(Array{UInt8,2}(undef,n,n))
-    @sync @distributed for j in 1:n
-        juliaset_column!(img, c, n, j, j)
-    end
-    return img
-end
+When the main process and all workers are located on the same machine, and the OS supports sharing memory between processes (by sharing memory pages), we can use `SharedArrays` to avoid sending the matrix with results.
+```julia
+@everywhere begin
+    using SharedArrays
+    function juliaset_shared(x, y, n=1000)
+        c = x + y*im
+        img = SharedArray(Array{UInt8,2}(undef,n,n))
+        @sync @distributed for j in 1:n
+            juliaset_column!(img, c, n, j, j)
+        end
+        return img
+    end
+end
 julia> @elapsed juliaset_shared(-0.79, 0.15);
 0.021699503
 ```
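As a hedged, self-contained illustration of the same pattern (the array size and worker count are arbitrary assumptions, independent of the Julia-set helpers), a `SharedArray` filled by a `@distributed` loop looks like this:
```julia
using Distributed
nprocs() == 1 && addprocs(2)              # assumption: two workers for the demo
@everywhere using SharedArrays

# every iteration writes to its own slot of the shared array, so no locking
# is needed; @sync waits until all @distributed chunks have finished
a = SharedArray{Float64}(10)
@sync @distributed for i in eachindex(a)
    a[i] = i^2
end
Array(a)                                  # copy the shared data into an ordinary Array
```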
@@ -254,21 +256,20 @@ The code for the main will look like
 function juliaset_channels(x, y, n = 1000, np = nworkers())
     c = x + y*im
     columns = Iterators.partition(1:n, div(n, np))
-    instructions = RemoteChannel(() -> Channel{Tuple}(np))
+    instructions = RemoteChannel(() -> Channel(np))
     foreach(cols -> put!(instructions, (c, n, cols)), columns)
-    results = RemoteChannel(()->Channel{Tuple}(np))
+    results = RemoteChannel(()->Channel(np))
     rfuns = [@spawnat i juliaset_channel_worker(instructions, results) for i in workers()]

     img = Array{UInt8,2}(undef, n, n)
-    while isready(results)
+    for i in 1:np
         cols, impart = take!(results)
         img[:,cols] .= impart;
     end
     img
 end

 julia> @btime juliaset_channels(-0.79, 0.15);
-254.151 μs (254 allocations: 987.09 KiB)
 ```
 The execution timw is much higher then what we have observed in the previous cases and changing the number of workers does not help much. What went wrong? The reason is that setting up the infrastructure around remote channels is a costly process. Consider the following alternative, where (i) we let workers to run endlessly and (ii) the channel infrastructure is set-up once and wrapped into an anonymous function
 ```julia
@@ -367,6 +368,17 @@ julia> @btime t()
 17.551 ms (774 allocations: 1.94 MiB)
 foreach(i -> put!(t.instructions, :stop), workers())
 ```
+In some use cases, an alternative is to put all jobs into the `RemoteChannel` before the workers are started, and then stop the workers once the remote channel is empty, as
+```julia
+@everywhere begin
+    function juliaset_channel_worker(instructions, results)
+        while isready(instructions)   # keep taking jobs while any are left
+            c, n, cols = take!(instructions)
+            put!(results, (cols, juliaset_columns(c, n, cols)))
+        end
+    end
+end
+```
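A hedged toy sketch of this drain-until-empty pattern (the squaring job, channel sizes, and single worker are assumptions; with several workers the `isready`/`take!` pair can race, which is why the `:stop`-sentinel variant above is more robust):
```julia
using Distributed
nprocs() == 1 && addprocs(1)                     # assumption: one worker keeps the demo race-free

jobs    = RemoteChannel(() -> Channel{Int}(16))
results = RemoteChannel(() -> Channel{Int}(16))

@everywhere function square_worker(jobs, results)
    while isready(jobs)                          # run until the pre-filled channel is drained
        put!(results, take!(jobs)^2)
    end
end

foreach(x -> put!(jobs, x), 1:10)                # enqueue all jobs before the worker starts
wait(@spawnat first(workers()) square_worker(jobs, results))
sort([take!(results) for _ in 1:10])             # [1, 4, 9, ..., 100]
```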

 ## Sending data
 Sending parameters of functions and receiving results from a remotely called functions migh incur a significant cost.
@@ -379,7 +391,7 @@ and
 ```julia
 Bref = @spawnat :any rand(1000,1000)^2;
 ```
-2. It is not only volume of data (in terms of the number of bytes), but also a complexity of objects that are being sent. Serialization can be very time consuming, an efficient converstion to something simple might be wort
+2. It is not only the volume of data (in terms of the number of bytes) that matters, but also the complexity of the objects being sent. Serialization can be very time consuming, so an efficient conversion to something simpler might be worthwhile
 ```julia
 using BenchmarkTools
 @everywhere begin
@@ -422,7 +434,14 @@ remotecall_fetch(g -> eval(:(g = $(g))), 2, g)
 which is implemented in the `ParallelDataTransfer.jl` with other variants, but in general, this construct should be avoided.

 ## Practical advices
-Recall that (i) workers are started as clean processes and (ii) they might not share the same environment with the main process. The latter is due to the possibility of remote machines to have a different directory structure. Our best practices are:
+Recall that (i) workers are started as clean processes and (ii) they might not share the same environment with the main process. The latter is because remote machines may have a different directory structure. We can check the active project on every process with
+```julia
+@everywhere begin
+    using Pkg
+    println(Pkg.project().path)
+end
+```
+Our advice, earned by practice, is:
 - to have shared directory (shared home) with code and to share the location of packages
 - to place all code for workers to one file, let's call it `worker.jl` (author of this includes the code for master as well).
 - put to the beggining of `worker.jl` code activating specified environment as
@@ -550,7 +569,7 @@ end
 julia> @btime juliaset_forkjoin(-0.79, 0.15);
 10.326 ms (142 allocations: 986.83 KiB)
 ```
-Due to task switching overhead, increasing the granularity might not pay off.
+Unfortunately, the `LoggingProfiler` does not handle task migration at the moment, which means that we cannot visualize the results. Due to task switching overhead, increasing the granularity might not pay off.
 ```julia
 4 tasks: 16.262 ms (21 allocations: 978.05 KiB)
 8 tasks: 10.660 ms (45 allocations: 979.80 KiB)
@@ -640,9 +659,8 @@ files = filter(isfile, readdir("/Users/tomas.pevny/Downloads/", join = true))
 is much better.


-## Multi-Threadding
-- Locks / lock-free multi-threadding
-
+## Locks / lock-free multi-threading
+Avoid locks.

 ## Take away message
 When deciding, what kind of paralelism to employ, consider following
652670
- `Transducers` thrives for (almost) the same code to support thread- and process-based paralelism.
653671

654672
### Materials
655-
- http://cecileane.github.io/computingtools/pages/notes1209.html
656-
- https://lucris.lub.lu.se/ws/portalfiles/portal/61129522/julia_parallel.pdf
657-
- http://igoro.com/archive/gallery-of-processor-cache-effects/
658-
- https://www.csd.uwo.ca/~mmorenom/cs2101a_moreno/Parallel_computing_with_Julia.pdf
659-
- Threads: https://juliahighperformance.com/code/Chapter09.html
660-
- Processes: https://juliahighperformance.com/code/Chapter10.html
661-
- Alan Adelman uses FLoops in https://www.youtube.com/watch?v=dczkYlOM2sg
662-
- Examples: ?Heat equation? from https://hpc.llnl.gov/training/tutorials/introduction-parallel-computing-tutorial#Examples
673+
- [http://cecileane.github.io/computingtools/pages/notes1209.html](http://cecileane.github.io/computingtools/pages/notes1209.html)
674+
- [https://lucris.lub.lu.se/ws/portalfiles/portal/61129522/julia_parallel.pdf](https://lucris.lub.lu.se/ws/portalfiles/portal/61129522/julia_parallel.pdf)
675+
- [http://igoro.com/archive/gallery-of-processor-cache-effects/](http://igoro.com/archive/gallery-of-processor-cache-effects/)
676+
- [https://www.csd.uwo.ca/~mmorenom/cs2101a_moreno/Parallel_computing_with_Julia.pdf](https://www.csd.uwo.ca/~mmorenom/cs2101a_moreno/Parallel_computing_with_Julia.pdf)
677+
- Complexity of thread schedulling [https://www.youtube.com/watch?v=YdiZa0Y3F3c](https://www.youtube.com/watch?v=YdiZa0Y3F3c)
678+
- TapIR --- Teaching paralelism to Julia compiler [https://www.youtube.com/watch?v=-JyK5Xpk7jE](https://www.youtube.com/watch?v=-JyK5Xpk7jE)
679+
- Threads: [https://juliahighperformance.com/code/Chapter09.html](https://juliahighperformance.com/code/Chapter09.html)
680+
- Processes: [https://juliahighperformance.com/code/Chapter10.html](https://juliahighperformance.com/code/Chapter10.html)
681+
- Alan Adelman uses FLoops in [https://www.youtube.com/watch?v=dczkYlOM2sg](https://www.youtube.com/watch?v=dczkYlOM2sg)
682+
- Examples: ?Heat equation? from [https://hpc.llnl.gov/training/tutorials/](introduction-parallel-computing-tutorial#Examples(https://hpc.llnl.gov/training/tutorials/)

docs/src/lecture_10/profile.png

-1.92 KB
