Commit 4a8bc43

Lab11: Added links to lecture.

1 parent 8c57c53

2 files changed: +6 -6 lines

docs/src/lecture_11/lab.md (3 additions, 3 deletions)
@@ -49,7 +49,7 @@ Toolchain:
 
 [^1]: Disclaimer on `CUDA.jl`'s GitHub page: [url](https://github.com/JuliaGPU/CUDA.jl)
 
-As we have already seen in the lecture *TODO LINK*, we can simply import `CUDA.jl` define some arrays, move them to the GPU and do some computation. In the following code we define two matrices `1000x1000` filled with random numbers and multiply them using usuall `x * y` syntax.
+As we have already seen in the [lecture](@ref gpu_lecture_no_kernel), we can simply import `CUDA.jl`, define some arrays, move them to the GPU and do some computation. In the following code we define two `1000x1000` matrices filled with random numbers and multiply them using the usual `x * y` syntax.
 ```julia
 x = randn(Float32, 60, 60)
 y = randn(Float32, 60, 60)
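
For reference, the array-style usage that the amended paragraph describes looks roughly like the sketch below (assuming a CUDA-capable GPU with `CUDA.jl` installed; the `1000x1000` sizes follow the prose rather than the truncated hunk):

```julia
using CUDA

x = CUDA.randn(Float32, 1000, 1000)  # allocated directly in GPU memory
y = CUDA.randn(Float32, 1000, 1000)

# `*` on two CuArrays dispatches to a cuBLAS matrix-matrix multiply;
# `CUDA.@sync` waits for the asynchronous GPU operation to finish
z = CUDA.@sync x * y

# copy the result back to host memory when an ordinary Array is needed
z_cpu = Array(z)
```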
@@ -241,7 +241,7 @@ Programming GPUs in this way is akin to using NumPy, MATLAB and other array base
 
 Note also that Julia's `CUDA.jl` is not a tensor compiler. With the exception of broadcast fusion, which is easily transferable to GPUs, there is no optimization between different kernels from the compiler point of view. Furthermore, memory allocations on GPU are handled by Julia's GC, which is single threaded and often not as aggressive, therefore similar application code can have different memory footprints on the GPU.
 
-Nowadays there is a big push towards simplifying programming of GPUs, mainly in the machine learning community, which often requires switching between running on GPU/CPU to be a one click deal. However this may not always yield the required results, because the GPU's computation model is different from the CPU, see lecture *TODO LINK*. This being said Julia's `Flux.jl` framework does offer such capabilities [^2]
+Nowadays there is a big push towards simplifying the programming of GPUs, mainly in the machine learning community, which often requires switching between running on GPU/CPU to be a one-click deal. However, this may not always yield the required results, because the GPU's computation model is different from the CPU's; see the [lecture](@ref gpu_lecture). That being said, Julia's `Flux.jl` framework does offer such capabilities [^2]
 
 ```julia
 using Flux, CUDA
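
The `Flux.jl` capability referenced in the added line amounts to something like the following sketch (the model layout and input sizes are made up for illustration):

```julia
using Flux, CUDA

# a small dense network, defined exactly as it would be for the CPU
model = Chain(Dense(784 => 64, relu), Dense(64 => 10))

# `gpu` moves parameters and data to the GPU when CUDA is functional
# (and is a no-op otherwise), so the same script runs in both settings
model = gpu(model)
x = gpu(randn(Float32, 784, 32))

y = model(x)   # the forward pass now runs on the GPU
```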
@@ -258,7 +258,7 @@ There are two paths that lead to the necessity of programming GPUs more directly
 1. We cannot express our algorithm in terms of array operations.
 2. We want to get more out of the code.
 
-Note that the ability to write kernels in the language of your choice is not granted, as this club includes a limited amount of members - C, C++, Fortran and Julia [^3]. Consider then the following comparison between `CUDA C` and `CUDA.jl` implementation of a simple vector addition kernels as seen in the [lecture]() *ADD LINK*. Which one would you choose?
+Note that the ability to write kernels in the language of your choice is not a given, as this club has a limited number of members - C, C++, Fortran and Julia [^3]. Consider then the following comparison between the `CUDA C` and `CUDA.jl` implementations of a simple vector addition kernel, as seen in the [lecture](@ref gpu_lecture_yes_kernel). Which one would you choose?
 
 [^3]: There may be more of them, however these are the main ones.
 ```c
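
The `CUDA.jl` half of that comparison looks roughly like the sketch below (a generic vector-addition kernel written for this note, not necessarily the exact code from the lab; `vadd!` is a made-up name):

```julia
using CUDA

# the kernel is the body of the loop: each thread handles one index `i`
function vadd!(c, a, b)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(c)
        @inbounds c[i] = a[i] + b[i]
    end
    return nothing
end

n = 1024
a = CUDA.rand(Float32, n)
b = CUDA.rand(Float32, n)
c = similar(a)

# launch enough 256-thread blocks to cover all `n` elements
@cuda threads=256 blocks=cld(n, 256) vadd!(c, a, b)
```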

docs/src/lecture_11/lecture.md (3 additions, 3 deletions)
@@ -1,4 +1,4 @@
-# GPU programming
+# [GPU programming](@id gpu_lecture)
 ## How GPU differs from CPU
 ### Hardware perspective
 **CPU** was originally created for maximal throughput of a single threaded program. Therefore the modern CPU has many parts which are not devoted to the actual computation, but to maximizing the utilization of the computing resource (ALU), which now occupies a relatively small part of the die. Below is a picture of a processor of Intel's Core architecture (one of the earliest in the series).
@@ -88,7 +88,7 @@ A thread can stall, because the instruction it depends on has not finished yet,
 ![latency hiding](latency-hiding.jpg)
 [image taken from](https://iq.opengenus.org/key-ideas-that-makes-graphics-processing-unit-gpu-so-fast/)
 
-## using GPU without writing kernels
+## [using GPU without writing kernels](@id gpu_lecture_no_kernel)
 Julia, like many other languages, allows you to perform certain operations on the GPU just as you would on the CPU. Thanks to Julia's multiple dispatch, this is almost invisible and it is sufficient to convert the `Array` to a `CuArray` to notify the system that the array is in the GPU's memory.
 
 For many widely used operations, kernels are already available; for example, below we use multiplication.
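
The multiple-dispatch point in this hunk can be illustrated by a short sketch (the function `f` and the array sizes are made up for illustration):

```julia
using CUDA

# a generic function written without any reference to the GPU
f(x, y) = sum(abs2, x .- y)

a = randn(Float32, 10_000)
b = randn(Float32, 10_000)
f(a, b)                    # runs on the CPU

ca, cb = CuArray(a), CuArray(b)
f(ca, cb)                  # the very same code now runs on the GPU
```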
@@ -218,7 +218,7 @@ naive(cx, bags, cz);
 @btime CUDA.@sync CuArray(builtin(Array(cx), bags, Array(cz)));
 ```
 
-## Writing own CUDA kernels
+## [Writing own CUDA kernels](@id gpu_lecture_yes_kernel)
 Before diving into details, let's recall some basics from the HW section above:
 * In the CUDA programming model, you usually write *kernels*, which represent the *body* of a loop.
 * The `N` iterations of the loop are divided into *block*s and each block into *warp*s. A single warp consists of 32 threads and these threads are executed simultaneously. All threads in a block are executed on the same SM, having access to the shared memory.
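
As a rough illustration of how the `N` loop iterations are mapped onto blocks of threads, the sketch below uses a made-up `scale!` kernel together with CUDA.jl's occupancy query to pick the launch configuration (a sketch, assuming a functional CUDA setup):

```julia
using CUDA

# multiply every element by a scalar; one loop iteration per thread
function scale!(y, a)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(y)
        @inbounds y[i] *= a
    end
    return nothing
end

y = CUDA.rand(Float32, 1_000_000)

# compile the kernel without launching it, then let the occupancy API suggest
# how many threads per block (always a multiple of the 32-thread warp) to use
kernel  = @cuda launch=false scale!(y, 2f0)
config  = launch_configuration(kernel.fun)
threads = min(length(y), config.threads)
blocks  = cld(length(y), threads)

kernel(y, 2f0; threads, blocks)
```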
