
Commit 8c541f4

A few improvements (#114)
* add julia for hpc course
* mention array slices and column major order
* improve section about at-threads
* drop comment about at-spawn
* drop course (separate PR)
* mention ThreadPinning.jl
* mention a few popular external profilers
1 parent ff73c90 commit 8c541f4

File tree: 1 file changed (+13 −7)


optimizing/index.md

Lines changed: 13 additions & 7 deletions
@@ -146,6 +146,10 @@ No matter which tool you use, if your code is too fast to collect samples, you m
 Inspecting the call graph can help identify which types are responsible for the allocations.
 }
 
+### External profilers
+
+Apart from the built-in `Profile` standard library, there are a few external profilers that you can use, including [Intel VTune](https://www.intel.com/content/www/us/en/developer/tools/oneapi/vtune-profiler.html) (in combination with [IntelITT.jl](https://github.com/JuliaPerf/IntelITT.jl)), [NVIDIA Nsight Systems](https://developer.nvidia.com/nsight-systems) (in combination with [NVTX.jl](https://github.com/JuliaGPU/NVTX.jl)), and [Tracy](https://docs.julialang.org/en/v1/devdocs/external_profilers/#Tracy-Profiler).
+
 ## Type stability
 
 \tldr{Use JET.jl to automatically detect type instabilities in your code, and `@code_warntype` or Cthulhu.jl to do so manually. DispatchDoctor.jl can help prevent them altogether.}
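The external profilers added in the hunk above generally require annotating the regions of code you want to trace. As a minimal sketch, assuming NVTX.jl's `NVTX.@range` macro from its README (the function and label here are made up for illustration):

```julia
# Hypothetical hot region, annotated so it shows up as a named range in the
# NVIDIA Nsight Systems timeline; assumes NVTX.jl's `NVTX.@range` macro.
using NVTX

heavy_work(x) = sum(abs2, x)  # illustrative stand-in for real work

NVTX.@range "heavy work" begin
    heavy_work(rand(10^6))
end
```

The annotations only appear when the script is run under the profiler, typically with something like `nsys profile julia script.jl`.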
@@ -220,7 +224,7 @@ A more direct approach is to error whenever a type instability occurs: the macro
 
 After ensuring type stability, one should try to reduce the number of heap allocations a program makes.
 Again, the Julia manual has a series of tricks related to [arrays and allocations](https://docs.julialang.org/en/v1.12-dev/manual/performance-tips/#Memory-management-and-arrays) which you should take a look at.
-In particular, try to modify existing arrays instead of allocating new objects.
+In particular, try to modify existing arrays instead of allocating new objects (be careful with array slices, which allocate copies) and try to access arrays in the right order (column-major order).
 
 And again, you can also choose to error whenever an allocation occurs, with the help of [AllocCheck.jl](https://github.com/JuliaLang/AllocCheck.jl).
 By annotating a function with `@check_allocs`, if the function is run and the compiler detects that it might allocate, it will throw an error.
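To make the two array tips in the hunk above concrete, here is a small sketch (all names are illustrative): slices like `A[:, 1]` allocate copies unless wrapped in `@views`, and iterating down columns in the inner loop matches Julia's column-major memory layout.

```julia
# `col_sums!` writes into a preallocated output vector instead of
# allocating a new one on every call.
function col_sums!(out, A)
    fill!(out, zero(eltype(out)))
    # Column-major layout: fixing the column `j` and walking the rows `i`
    # in the inner loop accesses memory contiguously.
    for j in axes(A, 2), i in axes(A, 1)
        out[j] += A[i, j]
    end
    return out
end

A = rand(1000, 1000)
out = zeros(1000)
col_sums!(out, A)        # reuses `out`, no per-call allocation
s = @views sum(A[:, 1])  # a view: no copy of the column is allocated
```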
@@ -315,8 +319,8 @@ The README of StaticCompiler.jl contains a more [detailed guide](https://github.
 
 \tldr{Use `Threads` or OhMyThreads.jl on a single machine, `Distributed` or MPI.jl on a computing cluster. GPU-compatible code is easy to write and run.}
 
-Code can be made to run faster through parallel execution with [multithreading](https://docs.julialang.org/en/v1/manual/multi-threading/) or [multiprocessing / distributed computing](https://docs.julialang.org/en/v1/manual/distributed-computing/).
-Many common operations such as maps and reductions can be trivially parallelised through either method by using their respective Julia packages.
+Code can be made to run faster through parallel execution with [multithreading](https://docs.julialang.org/en/v1/manual/multi-threading/) (shared-memory parallelism) or [multiprocessing / distributed computing](https://docs.julialang.org/en/v1/manual/distributed-computing/).
+Many common operations such as maps and reductions can be trivially parallelised through either method by using their respective Julia packages (e.g. `pmap` from Distributed.jl and `tmap` from OhMyThreads.jl).
 Multithreading is available on almost all modern hardware, whereas distributed computing is most useful to users of high-performance computing clusters.
 
 ### Multithreading
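The two functions named in the hunk above can be sketched as follows (the worker count and functions are arbitrary, chosen only for illustration):

```julia
# Shared-memory version: `tmap` from OhMyThreads.jl distributes the map
# over the available Julia threads.
using OhMyThreads: tmap
squares = tmap(x -> x^2, 1:100)

# Distributed version: `pmap` from the Distributed standard library sends
# the work to separate worker processes instead.
using Distributed
addprocs(4)                  # spawn 4 local worker processes
@everywhere square(x) = x^2  # make the function available on all workers
squares_dist = pmap(square, 1:100)
```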
@@ -342,10 +346,9 @@ Once Julia is running, you can check if this was successful by calling `Threads.
 }
 
 Regardless of the number of threads, you can parallelise a for loop with the macro `Threads.@threads`.
-The macros `@spawn` and `@async` function similarly, but require more manual management of the results, which can result in bugs and performance footguns.
-For this reason `@threads` is recommended for those who do not wish to use third-party packages.
+The macros `@spawn` and `@async` function similarly, but require more manual management of tasks and their results. For this reason, `@threads` is recommended for those who do not wish to use third-party packages.
 
-When you design multithreaded code, you need to be careful to avoid "race conditions", i.e. situations when competing threads try to write different things to the same memory location.
+When designing multithreaded code, you should generally try to write to shared memory as rarely as possible. Where it cannot be avoided, you need to be careful to avoid "race conditions", i.e. situations when competing threads try to write different things to the same memory location.
 It is usually a good idea to separate memory accesses with loop indices, as in the example below:
 
 ```julia @threads-forloop
@@ -354,12 +357,15 @@ Threads.@threads for i in 1:4
     results[i] = i^2
 end
 ```
+Almost always, it is [**not** a good idea to use `threadid()`](https://julialang.org/blog/2023/07/PSA-dont-use-threadid/).
 
-Managing threads and their memory use is made much easier by [OhMyThreads.jl](https://github.com/JuliaFolds2/OhMyThreads.jl), which provides a user-friendly alternative to `Threads`.
+Even if you manage to avoid any race conditions in your multithreaded code, it is very easy to run into subtle performance issues (like [false sharing](https://en.wikipedia.org/wiki/False_sharing)). For these reasons, you might want to consider using a high-level package like [OhMyThreads.jl](https://github.com/JuliaFolds2/OhMyThreads.jl), which provides a user-friendly alternative to `Threads` and makes managing threads and their memory use much easier.
 The helpful [translation guide](https://juliafolds2.github.io/OhMyThreads.jl/stable/translation/) will get you started in a jiffy.
 
 If the latency of spinning up new threads becomes a bottleneck, check out [Polyester.jl](https://github.com/JuliaSIMD/Polyester.jl) for very lightweight threads that are quicker to start.
 
+If you're on Linux, you should consider using [ThreadPinning.jl](https://github.com/carstenbauer/ThreadPinning.jl) to pin your Julia threads to CPU cores to obtain stable and optimal performance. The package can also be used to visualize where the Julia threads are running on your system (see `threadinfo()`).
+
 \advanced{
 Some widely used parallel programming packages like [LoopVectorization.jl](https://github.com/JuliaSIMD/LoopVectorization.jl) (which also powers [Octavian.jl](https://github.com/JuliaLinearAlgebra/Octavian.jl)) or [ThreadsX.jl](https://github.com/tkf/ThreadsX.jl) are no longer maintained.
 }
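For instance, the `Threads.@threads` loop shown in the hunk above might look as follows with OhMyThreads.jl. This is a sketch based on the package's translation guide; `@tasks` and `tmapreduce` are its documented entry points:

```julia
using OhMyThreads: @tasks, tmapreduce

# Direct translation of the `Threads.@threads` loop from the diff.
results = zeros(Int, 4)
@tasks for i in 1:4
    results[i] = i^2
end

# A reduction sidesteps writing to shared memory entirely.
total = tmapreduce(i -> i^2, +, 1:4)
```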
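And a minimal sketch of the ThreadPinning.jl workflow mentioned above (`pinthreads` accepts several pinning strategies; `:cores` is one documented option):

```julia
using ThreadPinning

pinthreads(:cores)  # pin each Julia thread to its own physical CPU core
threadinfo()        # visualize which cores the threads ended up on
```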
