
Commit d8959d8 ("final touch")
1 parent 4a8bc43

2 files changed: +23 -20 lines


docs/src/lecture_11/ffnn.jl
11 additions & 11 deletions

@@ -26,7 +26,7 @@ scatter(x[1,:], x[2,:], color = mapslices(argmax, y, dims = 1)[:])
 #######
 struct TrackedArray{T,N,V<:AbstractArray{T,N}} <: AbstractArray{T,N}
     value::V
-    deriv::Union{Nothing,V}
+    grad::Union{Nothing,V}
     tape::Vector{Any}
 end
 
@@ -37,16 +37,16 @@ Base.size(a::TrackedArray) = size(a.value)
 Base.show(io::IO, ::MIME"text/plain", a::TrackedArray) = show(io, a)
 Base.show(io::IO, a::TrackedArray) = print(io, "TrackedArray($(size(a.value)))")
 value(A::TrackedArray) = A.value
-resetgrad!(A::TrackedArray) = (A.deriv .= 0; empty!(A.tape))
+resetgrad!(A::TrackedArray) = (A.grad .= 0; empty!(A.tape))
 value(A) = A
 track(A) = TrackedArray(A)
 track(a::Number) = TrackedArray(reshape([a], 1, 1))
 
 function accum!(A::TrackedArray)
-    isempty(A.tape) && return(A.deriv)
-    A.deriv .= sum(g(accum!(r)) for (r, g) in A.tape)
+    isempty(A.tape) && return(A.grad)
+    A.grad .= sum(g(accum!(r)) for (r, g) in A.tape)
     empty!(A.tape)
-    A.deriv
+    A.grad
 end
 
 #######
@@ -141,9 +141,9 @@ ps = params(m)
 @elapsed for i in 1:10000
     foreach(resetgrad!, ps)
     loss = mse(m(x), y)
-    fill!(loss.deriv, 1)
+    fill!(loss.grad, 1)
     foreach(accum!, ps)
-    foreach(x -> x.value .-= α .* x.deriv, ps)
+    foreach(x -> x.value .-= α .* x.grad, ps)
     mod(i,250) == 0 && println("loss after $(i) iterations = ", sum(value(loss)))
 end
 
@@ -165,9 +165,9 @@ ps = params(m)
 @elapsed for i in 1:10000
     foreach(resetgrad!, ps)
     loss = mse(m(gx), gy)
-    fill!(loss.deriv, 1)
+    fill!(loss.grad, 1)
     foreach(accum!, ps)
-    foreach(x -> x.value .-= α .* x.deriv, ps)
+    foreach(x -> x.value .-= α .* x.grad, ps)
     mod(i,250) == 0 && println("loss after $(i) iterations = ", sum(value(loss)))
 end
 
@@ -189,11 +189,11 @@ ps = [m₃.w, m₃.b, m₂.w, m₂.b, m₁.w, m₁.b]
 map(ps) do p
     foreach(resetgrad!, ps)
     loss = mse(m(x), y)
-    fill!(loss.deriv, 1)
+    fill!(loss.grad, 1)
     foreach(accum!, ps)
     accum!(p)
     θ = deepcopy(value(p))
-    Δθ = deepcopy(p.deriv)
+    Δθ = deepcopy(p.grad)
     f = θ -> begin
         p.value .= θ
         value(mse(m(x), y))
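The renamed `grad` field is filled lazily: operations on a `TrackedArray` push `(result, pullback)` pairs onto `tape`, and `accum!` recursively sums the pulled-back gradients before clearing the tape. Below is a minimal, self-contained sketch of that mechanism; the outer constructor and `tracked_sum` are illustrative stand-ins, not definitions from `ffnn.jl` (the file's own tracked operations are outside this diff).

```julia
# Re-statement of the tracked-array machinery from the diff above, plus one
# hypothetical operation to show how the tape gets populated.
struct TrackedArray{T,N,V<:AbstractArray{T,N}} <: AbstractArray{T,N}
    value::V
    grad::Union{Nothing,V}
    tape::Vector{Any}
end
TrackedArray(v::AbstractArray) = TrackedArray(v, zero(v), Any[])   # assumed constructor
Base.size(a::TrackedArray) = size(a.value)
Base.show(io::IO, ::MIME"text/plain", a::TrackedArray) = show(io, a)
Base.show(io::IO, a::TrackedArray) = print(io, "TrackedArray($(size(a.value)))")
track(a) = TrackedArray(a)

function accum!(a::TrackedArray)
    isempty(a.tape) && return a.grad
    # every tape entry is a (result, pullback) pair; pull the result's gradient back
    a.grad .= sum(g(accum!(r)) for (r, g) in a.tape)
    empty!(a.tape)
    a.grad
end

# An operation records itself by pushing (result, pullback) onto its argument's tape.
function tracked_sum(a::TrackedArray)
    r = track(fill(sum(a.value), 1, 1))
    push!(a.tape, (r, Δ -> fill(only(Δ), size(a.value))))
    r
end

x = track(rand(2, 3))
l = tracked_sum(x)
fill!(l.grad, 1)     # seed the output gradient, as the training loop does with the loss
accum!(x)            # x.grad is now filled with ones
```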

docs/src/lecture_11/lecture.md
12 additions & 9 deletions

@@ -15,7 +15,7 @@ Notable functionalities / blocks are
 
 * **Branch prediction** predicts which branch will be taken and executes instructions in that branch. Misses are costly, as the processor needs to roll back, emptying the instruction pipeline. The processor state is not fully restored, which leads to side-channel attacks.
 
-* **speculative prefetching** loads instructions / data from memory to caches in the branch that is expected to be taken, in advance, in the hope they will be executed (depends on branch prediction)
+* **speculative prefetching** loads instructions / data from memory to the processor along the branch that is expected to be taken, in advance, in the hope they will be executed (depends on branch prediction)
 
 * **Memory management unit** is not shown, but takes care of the translation of virtual addresses to physical ones, checking the security bits of pages, etc.
 
@@ -36,13 +36,13 @@ end
 and the computation of intensities `compute_insity(i, j)` does not contain many branches. Therefore the GPU goes for massive parallelism while simplifying each core to the bare minimum, leaving all difficulties up to the programmer / compiler. Below we see a high-level view of Nvidia's GPU.
 ![nvidia-gpu](nvidia-gpu.jpg)
 ![nvidia-gpu](nvidia-kepler.jpg)
-1. The chip contains many streaming multi-processors (SM). Normally, each streaming processor would be called a core, as it is an indivisible unit, but NVidia decided to call "a core" a unit inside the streaming multi-processor performing a stream of operations over a set of 32 registers.
+1. The chip contains many streaming multi-processors (SM). Normally, each streaming processor would be called a core, as it is an indivisible unit, but NVidia decided to call "a core" a unit inside the streaming multi-processor performing a stream of operations over a set of 32-bit registers.
 2. Each streaming multi-processor contains (possibly multiple blocks of) 32-way (32-wide) SIMT units (Nvidia calls these *CUDA Cores*), shared memory (managed cache) and a register file (16k registers, but shared between all threads). Therefore a Pascal P-100 can have up to 64×4×32 = 8192 cores, which certainly sounds cool in comparison to, for example, the 24 cores of normal processors, but...
 3. Each streaming multi-processor (SM) has one instruction fetch and decode unit, which means that *all* CUDA cores of that SM *have to* execute the same instruction at a given cycle. This simplifies the design. The execution model therefore roughly corresponds to vector (SIMD) registers in a normal CPU, but CUDA cores are not as restricted as SIMD registers. NVidia therefore calls this computation model single instruction multiple threads (SIMT). Main differences:
     1. SIMD requires the memory to be continuous while SIMT does not.
-    2. SIMT is explicitly scalar while SIMD is explicitly vector.
+    2. The programming model of SIMT is explicitly scalar while that of SIMD is explicitly vector.
     32 CUDA cores each operating over 32-bit registers would be equal to 1024-bit long vector (SIMD) registers. Modern Intel XEON processors have 256 / 512 bit long registers, which seems similar, but as said above, in order to use them the data has to be aligned in memory and sometimes has to sit at a particular offset, which might be difficult to achieve in practice (but to be fair, if the data are not aligned on the GPU, loading them is very inefficient).
-4. 16k registers per SM might seem like a lot, but they are shared between all threads. In modern GPUs, each SM supports up to 2048 threads, which means there might be just 4 32-bit registers per thread.
+4. 16k registers per SM might seem like a lot, but they are shared between all threads. In modern GPUs, each SM supports up to 2048 threads, which means there might be just 8 32-bit registers per thread.
 5. GPUs do not have virtual memory or interrupts, and cannot address external devices like a keyboard or a mouse.
 6. GPUs can switch execution contexts of a "set of threads" at no cost. In comparison, a context switch on the CPU is relatively expensive (we need to at least save the contents of the registers, which is usually sped up by having two sets of registers). This helps to hide latencies when a set of threads is stalled (waiting for memory access, synchronizing with others).
 7. The programmer deals with a "raw" storage hierarchy. This means that the programmer has to manage what will and will not be in the cache; on the GPU, the caches are exposed and you decide what will be loaded into them.
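To make the scalar SIMT programming model from point 3 concrete, here is a small CUDA.jl sketch of the intensity loop mentioned above: the kernel body is written for a single `(i, j)` pixel and the hardware supplies the parallelism. `compute_insity` is only a stand-in scalar function here, and the 16×16 block shape is an arbitrary but warp-aligned choice; neither comes from the lecture's code.

```julia
using CUDA

# Stand-in for the compute_insity(i, j) from the text; any branch-light scalar function works.
compute_insity(i, j) = sin(0.01f0 * i) * cos(0.01f0 * j)

# Each thread handles a single (i, j) pixel: the kernel is scalar code,
# parallelism comes from launching roughly height*width threads.
function insity_kernel!(out)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    j = (blockIdx().y - 1) * blockDim().y + threadIdx().y
    if i <= size(out, 1) && j <= size(out, 2)
        @inbounds out[i, j] = compute_insity(i, j)
    end
    return nothing
end

height, width = 1024, 1024
out = CUDA.zeros(Float32, height, width)
threads = (16, 16)                             # 256 threads per block, a multiple of the warp size (32)
blocks  = (cld(height, 16), cld(width, 16))    # enough blocks to cover the whole image
@cuda threads=threads blocks=blocks insity_kernel!(out)
```

The same index-and-bounds-check pattern appears later in the lecture's `vadd` example, only in one dimension.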
@@ -57,7 +57,7 @@ The GPU works in an asynchronous mode. If we want to execute something on GPU, w
 3. Upload the kernel code to the GPU
 4. Request the computation of the kernel (more on this below). The GPU will put the request for the computation into its queue of work to be performed and, once resources are free, it will do the computation.
 5. Control is immediately returned, which means that the CPU can continue doing its job.
-6. The CPU can issue a blocking wait till the computation is done, for example when fetching the results.
+6. The CPU can issue a blocking wait till the computation is done, for example when fetching the results or synchronizing with other threads in the block.
 
 Let's now look at point 4. in more detail.
 Recall that the GPU is designed for processing data in parallel, something we can abstract as
@@ -84,7 +84,7 @@ As has been mentioned above, all CUDA cores in one SM are processing the same in
 which can significantly decrease the throughput.
 
 **Latency hiding**
-A thread can stall because an instruction it depends on has not finished yet, for example due to loading data from memory, which can be very time consuming (recall that unlike SIMD, SIMT can read data from non-coalesced memory locations at the price of increased latency). Instead of waiting, the SM will switch to a different set of threads that can be executed. This context switch is does not incur any overhead, hence it can occur at single-instruction granularity. It keeps the SM busy, effectively hiding the latency of expensive operations.
+A thread can stall because an instruction it depends on has not finished yet, for example due to loading data from memory, which can be very time consuming (recall that unlike SIMD, SIMT can read data from non-coalesced memory locations at the price of increased latency). Instead of waiting, the SM will switch to a different set of threads that can be executed. This context switch does not incur any overhead, hence it can occur at single-instruction granularity. It keeps the SM busy, effectively hiding the latency of expensive operations.
 ![latency hiding](latency-hiding.jpg)
 [image taken from](https://iq.opengenus.org/key-ideas-that-makes-graphics-processing-unit-gpu-so-fast/)
 
@@ -111,8 +111,11 @@ julia> @btime cx * cy;
 
 julia> @btime CUDA.@sync cx * cy;
   173.704 μs (32 allocations: 624 bytes)
+
+julia> @btime Matrix(CuArray(x) * CuArray(y));
+  1.651 ms (47 allocations: 3.82 MiB)
 ```
-The matrix multiplication on the GPU is about 33x faster, which is likely because it is optimized directly by NVidia, as `cx * cy` calls the `CuBlas` library.
+The matrix multiplication on the GPU is about 33x faster, which is likely because it is optimized directly by NVidia, as `cx * cy` calls the `CuBlas` library. If we add the cost of sending the data to GPU memory and retrieving the result, the speedup drops to 3.5x, but that comparison is not fair: the goal is to compute everything on the GPU.
 
 How much does it cost to send the matrix to GPU's memory? Let's measure the time of the roundtrip
 ```julia
@@ -127,7 +130,7 @@ map(x -> sin(x)^2 + cos(x)^2, cx)
 reduce(max, cx)
 reduce(max, cx, dims = 1)
 ```
-Notice that in this case, the function in `map` is essentially a custom kernel. As such, the code within still has to obey the (not precisely specified) rules on what can be executed as a kernel. Needless to say, the generic `map` over a `CuArray` will try to find a good launch configuration (number of threads and number of blocks), which might not be ideal for your application.
+Notice that in this case, the function in `map` and in broadcasting is essentially a custom kernel. As such, the code within still has to obey the (not precisely specified) rules on what can be executed as a kernel. Needless to say, the generic `map` over a `CuArray` will try to find a good launch configuration (number of threads and number of blocks), which might not be ideal for your application.
 
 Let's now try to use CUDA on the computation of the Julia set, which should benefit a lot from CUDA's parallelization, as we can dispatch each pixel to its own thread --- something GPUs were originally designed for.
 
@@ -251,7 +254,7 @@ where
 * If `N` is not divisible by `blockDim`, then there will **always** be threads not doing anything, therefore we need the `if` statement checking that we are within bounds.
 * The `blockDim` has to be divisible by 32, which is the size of the warp.
 
-While the `vadd` example is nice, it is trivial and can be achieved by `map` as shown above. A simple operation that is everything but trivial to impement is *reduction*, since it ends up in a single operation. It also allows us to demonstrate why efficient kernels need to be written at three levels: warp, block, and grid. The exposition below is based on the [JuliaCon tutorial on GPU programming](https://github.com/maleadt/juliacon21-gpu_workshop/blob/main/deep_dive/CUDA.ipynb).
+While the `vadd` example is nice, it is trivial and can be achieved by `map` as shown above. A simple operation that is everything but trivial to implement is *reduction*, since it ends up in a single operation. It also allows us to demonstrate why efficient kernels need to be written at three levels: warp, block, and grid. The exposition below is based on the [JuliaCon tutorial on GPU programming](https://github.com/maleadt/juliacon21-gpu_workshop/blob/main/deep_dive/CUDA.ipynb).
 
 The first naive implementation might look like
 ```julia
