docs/src/lecture_11/lab.md: 3 additions & 3 deletions
@@ -49,7 +49,7 @@ Toolchain:
[^1]: Disclaimer on `CUDA.jl`'s GitHub page: [url](https://github.com/JuliaGPU/CUDA.jl)
- As we have already seen in the lecture*TODO LINK*, we can simply import `CUDA.jl` define some arrays, move them to the GPU and do some computation. In the following code we define two matrices `1000x1000` filled with random numbers and multiply them using usuall`x * y` syntax.
+ As we have already seen in the [lecture](@ref gpu_lecture_no_kernel), we can simply import `CUDA.jl`, define some arrays, move them to the GPU and do some computation. In the following code we define two matrices `1000x1000` filled with random numbers and multiply them using the usual `x * y` syntax.
```julia
x = randn(Float32, 60, 60)
y = randn(Float32, 60, 60)
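# A sketch of the GPU step the text above describes (assumes `using CUDA` earlier
# in the snippet; the names cx, cy, cz are illustrative, not necessarily the lab's):
cx = cu(x)    # `cu` copies an Array into GPU memory as a CuArray
cy = cu(y)    # same for the second matrix
cz = cx * cy  # the usual `*` syntax now runs the multiplication on the GPU
```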
@@ -241,7 +241,7 @@ Programming GPUs in this way is akin to using NumPy, MATLAB and other array base
Note also that Julia's `CUDA.jl` is not a tensor compiler. With the exception of broadcast fusion, which is easily transferable to GPUs, there is no optimization across different kernels from the compiler's point of view. Furthermore, memory allocations on the GPU are handled by Julia's GC, which is single-threaded and often not as aggressive; therefore, similar application code can have a different memory footprint on the GPU.
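As an illustration of that broadcast fusion (a minimal sketch; the array names and sizes are arbitrary), the fused dot calls below compile into a single GPU kernel rather than one kernel per operation:

```julia
using CUDA

x = CUDA.rand(Float32, 1_000)
y = CUDA.rand(Float32, 1_000)
z = x .+ 2f0 .* y .- sin.(x)   # one fused broadcast kernel, not three separate ones
```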
- Nowadays there is a big push towards simplifying programming of GPUs, mainly in the machine learning community, which often requires switching between running on GPU/CPU to be a one click deal. However this may not always yield the required results, because the GPU's computation model is different from the CPU, see lecture*TODO LINK*. This being said Julia's `Flux.jl` framework does offer such capabilities [^2]
+ Nowadays there is a big push towards simplifying the programming of GPUs, mainly in the machine learning community, which often requires switching between running on GPU/CPU to be a one-click deal. However, this may not always yield the required results, because the GPU's computation model is different from the CPU's; see the [lecture](@ref gpu_lecture). That being said, Julia's `Flux.jl` framework does offer such capabilities [^2].
```julia
using Flux, CUDA
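# A sketch of that "switch to GPU" workflow; the model below is illustrative,
# not necessarily the one used in the original snippet:
m = Chain(Dense(10 => 5, relu), Dense(5 => 2))
gpu_m = m |> gpu                   # move all parameters to the GPU
x = rand(Float32, 10, 32) |> gpu   # move an input batch as well
y = gpu_m(x)                       # the forward pass now runs on the GPU
```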
@@ -258,7 +258,7 @@ There are two paths that lead to the necessity of programming GPUs more directly
1. We cannot express our algorithm in terms of array operations.
2. We want to get more out of the code,
- Note that the ability to write kernels in the language of your choice is not granted, as this club includes a limited amount of members - C, C++, Fortran and Julia [^3]. Consider then the following comparison between `CUDA C` and `CUDA.jl` implementation of a simple vector addition kernels as seen in the [lecture]()*ADD LINK*. Which one would you choose?
+ Note that the ability to write kernels in the language of your choice is not a given, as this club has a limited number of members: C, C++, Fortran, and Julia [^3]. Consider then the following comparison between the `CUDA C` and `CUDA.jl` implementations of a simple vector addition kernel, as seen in the [lecture](@ref gpu_lecture_yes_kernel). Which one would you choose?
[^3]: There may be more of them; however, these are the main ones.
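For reference, the `CUDA.jl` half of that comparison might look roughly like the sketch below; the kernel name `vadd!`, the array sizes and the launch configuration are illustrative rather than the lab's exact listing.

```julia
using CUDA

function vadd!(c, a, b)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x   # global thread index
    if i <= length(c)                                       # guard threads past the end
        @inbounds c[i] = a[i] + b[i]
    end
    return nothing
end

a = CUDA.rand(Float32, 1024)
b = CUDA.rand(Float32, 1024)
c = similar(a)
@cuda threads=256 blocks=cld(length(c), 256) vadd!(c, a, b)
```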
docs/src/lecture_11/lecture.md: 3 additions & 3 deletions
@@ -1,4 +1,4 @@
- # GPU programming
+ # [GPU programming](@id gpu_lecture)
## How GPU differs from CPU
### Hardware perspective
**CPU** was originally created for maximal throughput of a single-threaded program. Therefore, the modern CPU has many parts which are not devoted to the actual computation, but to maximizing the utilization of the computing resource (ALU), which now occupies a relatively small part of the die. Below is a picture of a processor of Intel's Core architecture (one of the earliest in the series).
@@ -88,7 +88,7 @@ A thread can stall, because the instruction it depends on has not finished yet,

[image taken from](https://iq.opengenus.org/key-ideas-that-makes-graphics-processing-unit-gpu-so-fast/)
- ## using GPU without writing kernels
+ ## [using GPU without writing kernels](@id gpu_lecture_no_kernel)
Julia, like many other languages, allows you to perform certain operations on the GPU just as you would on the CPU. Thanks to Julia's multiple dispatch, this is almost invisible: it is sufficient to convert the `Array` to a `CuArray` to notify the system that the array lives in the GPU's memory.
For many widely used operations, kernels are already available; for example, below we use multiplication.
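A minimal sketch of what that looks like (the matrix size is arbitrary and the variable names are illustrative):

```julia
using CUDA

x = randn(Float32, 1000, 1000)
y = randn(Float32, 1000, 1000)
cx, cy = CuArray(x), CuArray(y)   # move both arrays into GPU memory
cz = cx * cy                      # the same `*` call, now dispatched to a GPU kernel
z = Matrix(cz)                    # copy the result back to the host
```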
## [Writing own CUDA kernels](@id gpu_lecture_yes_kernel)
Before diving into details, let's recall some basics from the HW section above:
* In the CUDA programming model, you usually write *kernels*, which represent the *body* of a loop.
* The `N` iterations of the loop are divided into *block*s and each block into *warp*s. A single warp consists of 32 threads, and these threads are executed simultaneously. All threads in a block are executed on the same SM and have access to its shared memory; the mapping from iterations to blocks and threads is sketched below.
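A small sketch of that mapping, assuming a 1-D launch (the numbers are illustrative):

```julia
N = 10_000
threads = 256              # threads per block, a multiple of the warp size (32)
blocks = cld(N, threads)   # enough blocks so that blocks * threads ≥ N
# inside a kernel launched with `@cuda threads=threads blocks=blocks`,
# each thread recovers its own loop iteration as
# i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
```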