
Commit afc80d6

Fix figure ref

1 parent d95de10 commit afc80d6

File tree

2 files changed: +33 -46 lines changed

chapter_accelerator/Components_of_Hardware_Accelerators.md

Lines changed: 6 additions & 3 deletions
@@ -159,13 +159,16 @@ applications.
 
 A Tensor Core is capable of performing one $4\times4$ matrix
 multiply-accumulate operation per clock cycle, as shown in
-Figure [4](#fig:ch06/ch06-tensorcore){reference-type="ref"
-reference="fig:ch06/ch06-tensorcore"}.
+Figure :numref:`ch06/ch06-tensorcore`.
 
 D = A * B + C
 
 ![Tensor Core's $4\times4$ matrix multiply-accumulateoperation](../img/ch06/tensor_core.png)
-:label:`ch06/ch06-tensorcore}$\bf{A}$, $\bf{B}$, $\bf{C}$, and $\bf{D}$ are $4\times4$ matrices.Input matrices $\bf{A}$ and $\bf{B}$ are FP16 matrices, and accumulationmatrices $\bf{C}$ and $\bf{D`$ can be either FP16 or FP32 matrices.
+:label:`ch06/ch06-tensorcore`
+
+$\bf{A}$, $\bf{B}$, $\bf{C}$, and $\bf{D}$ are $4\times4$ matrices.
+Input matrices $\bf{A}$ and $\bf{B}$ are FP16 matrices, and accumulation
+matrices $\bf{C}$ and $\bf{D}$ can be either FP16 or FP32 matrices.
 Tesla V100's Tensor Cores are programmable matrix multiply-accumulate
 units that can deliver up to 125 Tensor Tera Floating-point Operations
 Per Second (TFLOPS) for training and inference applications, resulting
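
In CUDA, this operation is exposed to programmers through the WMMA (warp matrix multiply-accumulate) API, which works on warp-level fragments (for example $16\times16\times16$) that the hardware decomposes into the $4\times4$ Tensor Core operations described above. A minimal illustrative sketch, not part of this commit or the chapter's code:

```cuda
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp computes D = A * B + C for a 16x16 tile on Tensor Cores.
// A and B are FP16; the accumulator is FP32 here (FP16 is also allowed).
__global__ void wmmaTile(const half *A, const half *B, const float *C, float *D) {
  wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> aFrag;
  wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> bFrag;
  wmma::fragment<wmma::accumulator, 16, 16, 16, float> cFrag;

  wmma::load_matrix_sync(aFrag, A, 16);                      // leading dimension 16
  wmma::load_matrix_sync(bFrag, B, 16);
  wmma::load_matrix_sync(cFrag, C, 16, wmma::mem_row_major);

  wmma::mma_sync(cFrag, aFrag, bFrag, cFrag);                // D = A * B + C

  wmma::store_matrix_sync(D, cFrag, 16, wmma::mem_row_major);
}
```

Such a kernel needs at least one full warp, e.g. `wmmaTile<<<1, 32>>>(A, B, C, D)`, and a GPU with Tensor Cores (compute capability 7.0 or higher).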

chapter_accelerator/Performance_Optimization_Methods.md

Lines changed: 27 additions & 43 deletions
@@ -56,8 +56,7 @@ Code [\[lst:gpu\]](#lst:gpu){reference-type="ref" reference="lst:gpu"}.
 C[m * N + n] = result;
 }
 
-Figure [1](#cuda_naive_gemm){reference-type="ref"
-reference="cuda_naive_gemm"} shows the layout of the implementation.
+Figure :numref:`cuda_naive_gemm` shows the layout of the implementation.
 Each element in matrix $C$ is computed by one thread. The row index $m$
 and column index $n$ of the element in matrix $C$ corresponding to the
 thread are computed in lines 5 and 6 of the GPU kernel. Then, in lines 9
@@ -66,9 +65,8 @@ row index and the column vector in matrix $B$ according to the column
 index, computes the vector inner product. The thread also stores the
 result back to $C$ matrix in line 17.
 
-![Simple implementation of
-GEMM](../img/ch06/practise/naive.png){#cuda_naive_gemm
-width=".8\\textwidth"}
+![Simple implementation ofGEMM](../img/ch06/practise/naive.png)
+:label:`cuda_naive_gemm`
 
 The method of launching the kernel function is shown in
 Code [\[lst:launch\]](#lst:launch){reference-type="ref"
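
The line numbers mentioned above refer to the book's listing (`lst:gpu`), not to the sketch below; a simplified kernel with the same naive structure, in which each thread computes a single element of $C$ as an inner product, might look like this:

```cuda
// Naive GEMM sketch: one thread per element of C (row-major, C is M x N).
__global__ void naiveGemm(const float *A, const float *B, float *C,
                          int M, int N, int K) {
  int m = blockIdx.y * blockDim.y + threadIdx.y;  // row index of the element
  int n = blockIdx.x * blockDim.x + threadIdx.x;  // column index of the element
  if (m >= M || n >= N) return;

  float result = 0.0f;
  for (int k = 0; k < K; ++k)                     // inner product over K
    result += A[m * K + k] * B[k * N + n];

  C[m * N + n] = result;                          // write the element back
}
```

A typical launch pairs a 2-D thread block with a 2-D grid covering $C$, e.g. `dim3 block(16, 16); dim3 grid((N + 15) / 16, (M + 15) / 16); naiveGemm<<<grid, block>>>(A, B, C, M, N, K);`.
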
@@ -169,23 +167,21 @@ one) from matrix $A$ and matrix $B$, requiring each thread to process
 $4\times 4$ blocks (`thread tile`) in matrix $C$. Each thread loads data
 from matrix $A$ and matrix $B$ from left to right and from top to
 bottom, computes the data, and stores the data to matrix $C$, as shown
-in Figure [2](#use_float4){reference-type="ref" reference="use_float4"}.
+in Figure :numref:`use_float4`.
 
-![Enhancing arithmetic
-intensity](../img/ch06/practise/use_float4.png){#use_float4
-width="\\textwidth"}
+![Enhancing arithmeticintensity](../img/ch06/practise/use_float4.png)
+:label:`use_float4`
 
 For details about the complete code, see
 [gemm_use_128.cu](https://github.com/openmlsys/openmlsys-cuda/blob/main/gemm_use_128.cu).
 We can further increase the amount of data processed by each thread in
 order to improve the arithmetic intensity more, as shown in
-Figure [3](#use_tile){reference-type="ref" reference="use_tile"}. For
+Figure :numref:`use_tile`. For
 details about the code used to achieve this, see
 [gemm_use_tile.cu](https://github.com/openmlsys/openmlsys-cuda/blob/main/gemm_use_tile.cu).
 
-![Further enhancement of the arithmetic intensity by adding matrix
-blocks processed by each
-thread](../img/ch06/practise/use_tile.png){#use_tile width="\\textwidth"}
+![Further enhancement of the arithmetic intensity by adding matrixblocks processed by eachthread](../img/ch06/practise/use_tile.png)
+:label:`use_tile`
 
 The test results are as follows:
 
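A rough sketch of the thread-tile idea (not the actual gemm_use_128.cu, which also tiles at the block level): each thread accumulates a $4\times 4$ register tile and fetches four consecutive elements of $B$ with a single 128-bit `float4` load, so 16 fused multiply-adds are performed per 8 values read. It assumes $M$, $N$, and the tile origin are multiples of 4 so the `float4` load is aligned.

```cuda
// Each thread owns a 4x4 tile of C (thread tile).
__global__ void gemmThreadTile(const float *A, const float *B, float *C,
                               int M, int N, int K) {
  int row0 = (blockIdx.y * blockDim.y + threadIdx.y) * 4;  // top row of the tile
  int col0 = (blockIdx.x * blockDim.x + threadIdx.x) * 4;  // left column of the tile

  float acc[4][4] = {};
  for (int k = 0; k < K; ++k) {
    float a[4];
    for (int i = 0; i < 4; ++i)
      a[i] = A[(row0 + i) * K + k];                        // one element per tile row
    // Columns col0..col0+3 of row k in B are contiguous: one 128-bit load.
    float4 b = *reinterpret_cast<const float4 *>(&B[k * N + col0]);
    float bv[4] = {b.x, b.y, b.z, b.w};
    for (int i = 0; i < 4; ++i)
      for (int j = 0; j < 4; ++j)
        acc[i][j] += a[i] * bv[j];                         // 4x4 outer product
  }
  for (int i = 0; i < 4; ++i)
    for (int j = 0; j < 4; ++j)
      C[(row0 + i) * N + col0 + j] = acc[i][j];
}
```
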
@@ -238,16 +234,14 @@ number of enabled threads. Other hardware features need to be exploited
 in order to improve performance without compromising the degree of
 parallelism. In earlier code, several thread blocks are enabled, each of
 which processes one or more matrix blocks in matrix $C$. As shown in
-Figure [4](#duplicated_data){reference-type="ref"
-reference="duplicated_data"}, thread $x$ and thread $y$ process the same
+Figure :numref:`duplicated_data`, thread $x$ and thread $y$ process the same
 row in matrix $C$, so they load the same data from matrix $A$. The
 shared memory can be used to improve the program throughput by enabling
 different threads in the same thread block to load unique data and reuse
 shared data.
 
-![Threads loading redundant
-data](../img/ch06/practise/duplicated_data.png){#duplicated_data
-width=".8\\textwidth"}
+![Threads loading redundantdata](../img/ch06/practise/duplicated_data.png)
+:label:`duplicated_data`
 
 We have previously mentioned that the inner product can be computed by
 loading and accumulating data in $K$ loops. Specifically, in each loop,
@@ -257,14 +251,12 @@ the same data from matrix $B$. However, the code needs to be optimized
 by dividing $K$ loops into $\frac{K}{tileK}$ outer loops and $tileK$
 inner loops. In this way, an entire block of data is loaded in each
 outer loop and accumulated in each inner loop.
-Figure [5](#use_smem_store){reference-type="ref"
-reference="use_smem_store"} shows the process of moving data from the
+Figure :numref:`use_smem_store` shows the process of moving data from the
 global memory to the shared memory. Before each inner loop starts, the
 entire `tiles` in matrix $A$ and matrix $B$ is stored in the shared
 memory.
 
-Figure [6](#use_smem_load){reference-type="ref"
-reference="use_smem_load"} shows the process of moving data from the
+Figure :numref:`use_smem_load` shows the process of moving data from the
 shared memory to the register. In each inner loop, data is loaded from
 the shared memory and computed. An advantage of this design is that each
 thread does not need to load all the data it requires from the global
@@ -273,13 +265,11 @@ threads from the global memory and stores the data in the shared memory.
 During computational processes, each thread only needs to load the data
 it requires from the shared memory.
 
-![Writing data to the shared
-memory](../img/ch06/practise/use_smem_store.png){#use_smem_store
-width="\\textwidth"}
+![Writing data to the sharedmemory](../img/ch06/practise/use_smem_store.png)
+:label:`use_smem_store`
 
-![Loading data from the shared
-memory](../img/ch06/practise/use_smem_load.png){#use_smem_load
-width="\\textwidth"}
+![Loading data from the sharedmemory](../img/ch06/practise/use_smem_load.png)
+:label:`use_smem_load`
 
 For details about the complete code, see
 [gemm_use_smem.cu](https://github.com/openmlsys/openmlsys-cuda/blob/main/gemm_use_smem.cu).
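
A compact sketch of this outer/inner loop structure (a textbook tiled GEMM rather than the actual gemm_use_smem.cu, using a thread-per-element layout for brevity; it assumes $M$, $N$, and $K$ are multiples of `TILE` and a `TILE x TILE` thread block):

```cuda
#define TILE 16   // plays the role of tileK in the text

__global__ void gemmSharedMem(const float *A, const float *B, float *C,
                              int M, int N, int K) {
  __shared__ float tileA[TILE][TILE];   // block-wide staging area for A
  __shared__ float tileB[TILE][TILE];   // block-wide staging area for B

  int m = blockIdx.y * TILE + threadIdx.y;
  int n = blockIdx.x * TILE + threadIdx.x;
  float result = 0.0f;

  for (int k0 = 0; k0 < K; k0 += TILE) {              // K / TILE outer loops
    // Each thread loads one unique element of each tile; together the
    // block stages the whole TILE x TILE tiles of A and B.
    tileA[threadIdx.y][threadIdx.x] = A[m * K + k0 + threadIdx.x];
    tileB[threadIdx.y][threadIdx.x] = B[(k0 + threadIdx.y) * N + n];
    __syncthreads();

    for (int kk = 0; kk < TILE; ++kk)                 // TILE inner loops, shared memory only
      result += tileA[threadIdx.y][kk] * tileB[kk][threadIdx.x];
    __syncthreads();                                  // before the tiles are overwritten
  }
  C[m * N + n] = result;
}
```

Each global value is now loaded once per thread block instead of once per thread, which is exactly the reuse described above.
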
@@ -333,27 +323,23 @@ time this instruction is issued during $tileK$ inner loops, the
 mathematical operation that requires the loaded data is performed
 immediately. However, the compute unit has to wait for the data to be
 loaded from the shared memory, as shown in
-Figure [7](#use_smem_pipeline){reference-type="ref"
-reference="use_smem_pipeline"}. Accessing the shared memory may take
+Figure :numref:`use_smem_pipeline`. Accessing the shared memory may take
 dozens of clock cycles, but computation instructions can often be
 completed within only a few clock cycles. In order to significantly
 accelerate memory access, we can hide the shared memory loading latency
 by optimizing the pipeline. Specifically, during $tileK$ inner loops,
 loading instructions that prepare data in the next loop can be loaded at
 the beginning of each loop, as shown in
-Figure [8](#hide_smem_latency){reference-type="ref"
-reference="hide_smem_latency"}. In this way, computation instructions in
+Figure :numref:`hide_smem_latency`. In this way, computation instructions in
 the current operation do not require the data in the next loop. As such,
 the execution of these computation instructions will not be blocked by
 the instructions that load the data for the next loop.
 
-![Pipeline of the previous GPU kernel
-function](../img/ch06/practise/use_smem_pipeline.png){#use_smem_pipeline
-width="\\textwidth"}
+![Pipeline of the previous GPU kernelfunction](../img/ch06/practise/use_smem_pipeline.png)
+:label:`use_smem_pipeline`
 
-![Pipeline that hides the shared memory loading
-latency](../img/ch06/practise/hide_smem_latency.png){#hide_smem_latency
-width="\\textwidth"}
+![Pipeline that hides the shared memory loadinglatency](../img/ch06/practise/hide_smem_latency.png)
+:label:`hide_smem_latency`
 
 For details about the complete code, see
 [gemm_hide_smem_latency.cu](https://github.com/openmlsys/openmlsys-cuda/blob/main/gemm_hide_smem_latency.cu).
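
Continuing the `TILE` sketch above, the inner loop can be software-pipelined so that the shared-memory loads for step `kk + 1` are issued before the multiply-accumulate of step `kk`. This is only a sketch of the idea, not the gemm_hide_smem_latency.cu implementation:

```cuda
#define TILE 16

// Same tiling as the previous sketch, but with the inner loop pipelined:
// the shared-memory loads for step kk + 1 are issued before the FMA for
// step kk, so the FMA does not have to wait on them.
__global__ void gemmSmemPipelined(const float *A, const float *B, float *C,
                                  int M, int N, int K) {
  __shared__ float tileA[TILE][TILE];
  __shared__ float tileB[TILE][TILE];

  int m = blockIdx.y * TILE + threadIdx.y;
  int n = blockIdx.x * TILE + threadIdx.x;
  float result = 0.0f;

  for (int k0 = 0; k0 < K; k0 += TILE) {
    tileA[threadIdx.y][threadIdx.x] = A[m * K + k0 + threadIdx.x];
    tileB[threadIdx.y][threadIdx.x] = B[(k0 + threadIdx.y) * N + n];
    __syncthreads();

    float a = tileA[threadIdx.y][0];          // preload data for step 0
    float b = tileB[0][threadIdx.x];
    for (int kk = 0; kk < TILE; ++kk) {
      float aNext = 0.0f, bNext = 0.0f;
      if (kk + 1 < TILE) {                    // issue loads for the next step first
        aNext = tileA[threadIdx.y][kk + 1];
        bNext = tileB[kk + 1][threadIdx.x];
      }
      result += a * b;                        // compute overlaps with the loads above
      a = aNext;
      b = bNext;
    }
    __syncthreads();
  }
  C[m * N + n] = result;
}
```
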
@@ -388,12 +374,10 @@ wait for the read instruction to be completed, thereby hiding the global
 memory loading latency. We can also enable data in `buffer` to be
 written to `tile` in the last loop in the inner loop after $tileK - 1$
 loops are executed, further reducing the latency of writing data to
-`tile`. Figure [9](#hide_global_latency){reference-type="ref"
-reference="hide_global_latency"} shows the optimized pipeline.
+`tile`. Figure :numref:`hide_global_latency` shows the optimized pipeline.
 
-![Pipeline that hides the global memory loading
-latency](../img/ch06/practise/hide_global_latency.png){#hide_global_latency
-width="\\textwidth"}
+![Pipeline that hides the global memory loadinglatency](../img/ch06/practise/hide_global_latency.png)
+:label:`hide_global_latency`
 
 For details about the complete code, see
 [gemm_final.cu](https://github.com/openmlsys/openmlsys-cuda/blob/main/gemm_final.cu).
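
The same double-buffering idea applied at the global-memory level, again only as an illustrative sketch of the `buffer`/`tile` scheme described above (the real gemm_final.cu combines this with all the previous optimizations): the next tiles of $A$ and $B$ are read from global memory into registers while the current tiles in shared memory are consumed, and are committed to shared memory only afterwards.

```cuda
#define TILE 16

// "buffer" lives in registers and receives the next tile from global
// memory while the current tile (already in shared memory) is consumed;
// it is written to "tile" in shared memory only after the inner loop.
__global__ void gemmGlobalPrefetch(const float *A, const float *B, float *C,
                                   int M, int N, int K) {
  __shared__ float tileA[TILE][TILE];
  __shared__ float tileB[TILE][TILE];

  int m = blockIdx.y * TILE + threadIdx.y;
  int n = blockIdx.x * TILE + threadIdx.x;
  float result = 0.0f;

  // Stage the first tiles before the main loop.
  tileA[threadIdx.y][threadIdx.x] = A[m * K + threadIdx.x];
  tileB[threadIdx.y][threadIdx.x] = B[threadIdx.y * N + n];
  __syncthreads();

  for (int k0 = 0; k0 < K; k0 += TILE) {
    float bufA = 0.0f, bufB = 0.0f;
    if (k0 + TILE < K) {                        // issue next global loads early
      bufA = A[m * K + k0 + TILE + threadIdx.x];
      bufB = B[(k0 + TILE + threadIdx.y) * N + n];
    }
    for (int kk = 0; kk < TILE; ++kk)           // compute on the current tiles
      result += tileA[threadIdx.y][kk] * tileB[kk][threadIdx.x];
    __syncthreads();                            // current tiles fully consumed

    tileA[threadIdx.y][threadIdx.x] = bufA;     // commit buffer -> tile
    tileB[threadIdx.y][threadIdx.x] = bufB;
    __syncthreads();
  }
  C[m * N + n] = result;
}
```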
