Commit 4a8bc43

Lab11: Added links to lecture.

1 parent 8c57c53

2 files changed: +6 -6 lines

docs/src/lecture_11/lab.md (3 additions, 3 deletions)
@@ -49,7 +49,7 @@ Toolchain:
 
 [^1]: Disclaimer on `CUDA.jl`'s GitHub page: [url](https://github.com/JuliaGPU/CUDA.jl)
 
-As we have already seen in the lecture *TODO LINK*, we can simply import `CUDA.jl` define some arrays, move them to the GPU and do some computation. In the following code we define two matrices `1000x1000` filled with random numbers and multiply them using usuall `x * y` syntax.
+As we have already seen in the [lecture](@ref gpu_lecture_no_kernel), we can simply import `CUDA.jl`, define some arrays, move them to the GPU and do some computation. In the following code we define two `1000x1000` matrices filled with random numbers and multiply them using the usual `x * y` syntax.
 ```julia
 x = randn(Float32, 60, 60)
 y = randn(Float32, 60, 60)
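
For reference, the array-style usage that the amended paragraph describes looks roughly like the sketch below (assuming a CUDA-capable GPU with `CUDA.jl` installed; the `1000x1000` sizes follow the prose rather than the truncated hunk):

```julia
using CUDA

x = CUDA.randn(Float32, 1000, 1000)  # allocated directly in GPU memory
y = CUDA.randn(Float32, 1000, 1000)

# `*` on two CuArrays dispatches to a cuBLAS matrix-matrix multiply;
# `CUDA.@sync` waits for the asynchronous GPU operation to finish
z = CUDA.@sync x * y

# copy the result back to host memory when an ordinary Array is needed
z_cpu = Array(z)
```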
@@ -241,7 +241,7 @@ Programming GPUs in this way is akin to using NumPy, MATLAB and other array base
 
 Note also that Julia's `CUDA.jl` is not a tensor compiler. With the exception of broadcast fusion, which is easily transferable to GPUs, there is no optimization between different kernels from the compiler point of view. Furthermore, memory allocations on GPU are handled by Julia's GC, which is single threaded and often not as aggressive, therefore similar application code can have different memory footprints on the GPU.
 
-Nowadays there is a big push towards simplifying programming of GPUs, mainly in the machine learning community, which often requires switching between running on GPU/CPU to be a one click deal. However this may not always yield the required results, because the GPU's computation model is different from the CPU, see lecture *TODO LINK*. This being said Julia's `Flux.jl` framework does offer such capabilities [^2]
+Nowadays there is a big push towards simplifying the programming of GPUs, mainly in the machine learning community, which often requires switching between running on GPU/CPU to be a one-click deal. However, this may not always yield the required results, because the GPU's computation model is different from the CPU's; see the [lecture](@ref gpu_lecture). That being said, Julia's `Flux.jl` framework does offer such capabilities [^2]
 
 ```julia
 using Flux, CUDA
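
The `Flux.jl` capability referenced in the added line amounts to something like the following sketch (the model layout and input sizes are made up for illustration):

```julia
using Flux, CUDA

# a small dense network, defined exactly as it would be for the CPU
model = Chain(Dense(784 => 64, relu), Dense(64 => 10))

# `gpu` moves parameters and data to the GPU when CUDA is functional
# (and is a no-op otherwise), so the same script runs in both settings
model = gpu(model)
x = gpu(randn(Float32, 784, 32))

y = model(x)   # the forward pass now runs on the GPU
```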
@@ -258,7 +258,7 @@ There are two paths that lead to the necessity of programming GPUs more directly
 1. We cannot express our algorithm in terms of array operations.
 2. We want to get more out of the code.
 
-Note that the ability to write kernels in the language of your choice is not granted, as this club includes a limited amount of members - C, C++, Fortran and Julia [^3]. Consider then the following comparison between `CUDA C` and `CUDA.jl` implementation of a simple vector addition kernels as seen in the [lecture]() *ADD LINK*. Which one would you choose?
+Note that the ability to write kernels in the language of your choice is not a given, as this club has a limited number of members - C, C++, Fortran and Julia [^3]. Consider then the following comparison between the `CUDA C` and `CUDA.jl` implementations of a simple vector addition kernel, as seen in the [lecture](@ref gpu_lecture_yes_kernel). Which one would you choose?
 
 [^3]: There may be more of them, however these are the main ones.
 ```c
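
The `CUDA.jl` half of that comparison looks roughly like the sketch below (a generic vector-addition kernel written for this note, not necessarily the exact code from the lab; `vadd!` is a made-up name):

```julia
using CUDA

# the kernel is the body of the loop: each thread handles one index `i`
function vadd!(c, a, b)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(c)
        @inbounds c[i] = a[i] + b[i]
    end
    return nothing
end

n = 1024
a = CUDA.rand(Float32, n)
b = CUDA.rand(Float32, n)
c = similar(a)

# launch enough 256-thread blocks to cover all `n` elements
@cuda threads=256 blocks=cld(n, 256) vadd!(c, a, b)
```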

docs/src/lecture_11/lecture.md (3 additions, 3 deletions)
@@ -1,4 +1,4 @@
-# GPU programming
+# [GPU programming](@id gpu_lecture)
 ## How GPU differs from CPU
 ### Hardware perspective
 **CPU** was originally created for maximal throughput of a single threaded program. Therefore the modern CPU has many parts which are not devoted to the actual computation, but to maximizing the utilization of the computing resource (ALU), which now occupies a relatively small part of the die. Below is a picture of a processor of Intel's Core architecture (one of the earliest in the series).
@@ -88,7 +88,7 @@ A thread can stall, because the instruction it depends on has not finished yet,
 ![latency hiding](latency-hiding.jpg)
 [image taken from](https://iq.opengenus.org/key-ideas-that-makes-graphics-processing-unit-gpu-so-fast/)
 
-## using GPU without writing kernels
+## [using GPU without writing kernels](@id gpu_lecture_no_kernel)
 Julia, like many other languages, allows you to perform certain operations on the GPU just as you would on the CPU. Thanks to Julia's multiple dispatch, this is almost invisible and it is sufficient to convert the `Array` to a `CuArray` to notify the system that the array is in the GPU's memory.
 
 For many widely used operations, kernels are already available; for example, below we use multiplication.
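
The multiple-dispatch point in this hunk can be illustrated by a short sketch (the function `f` and the array sizes are made up for illustration):

```julia
using CUDA

# a generic function written without any reference to the GPU
f(x, y) = sum(abs2, x .- y)

a = randn(Float32, 10_000)
b = randn(Float32, 10_000)
f(a, b)                    # runs on the CPU

ca, cb = CuArray(a), CuArray(b)
f(ca, cb)                  # the very same code now runs on the GPU
```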
@@ -218,7 +218,7 @@ naive(cx, bags, cz);
 @btime CUDA.@sync CuArray(builtin(Array(cx), bags, Array(cz)));
 ```
 
-## Writing own CUDA kernels
+## [Writing own CUDA kernels](@id gpu_lecture_yes_kernel)
 Before diving into details, let's recall some basics from the HW section above:
 * In the CUDA programming model, you usually write *kernels*, which represent the *body* of a loop.
 * The `N` iterations of the loop are divided into *block*s and each block into *warp*s. A single warp consists of 32 threads and these threads are executed simultaneously. All threads in a block are executed on the same SM, having access to the shared memory.
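
As a rough illustration of how the `N` loop iterations are mapped onto blocks of threads, the sketch below uses a made-up `scale!` kernel together with CUDA.jl's occupancy query to pick the launch configuration (a sketch, assuming a functional CUDA setup):

```julia
using CUDA

# multiply every element by a scalar; one loop iteration per thread
function scale!(y, a)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(y)
        @inbounds y[i] *= a
    end
    return nothing
end

y = CUDA.rand(Float32, 1_000_000)

# compile the kernel without launching it, then let the occupancy API suggest
# how many threads per block (always a multiple of the 32-thread warp) to use
kernel  = @cuda launch=false scale!(y, 2f0)
config  = launch_configuration(kernel.fun)
threads = min(length(y), config.threads)
blocks  = cld(length(y), threads)

kernel(y, 2f0; threads, blocks)
```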
