@@ -56,8 +56,7 @@ Code [\[lst:gpu\]](#lst:gpu){reference-type="ref" reference="lst:gpu"}.
   C[m * N + n] = result;
 }

-Figure [1](#cuda_naive_gemm){reference-type="ref"
-reference="cuda_naive_gemm"} shows the layout of the implementation.
+Figure :numref:`cuda_naive_gemm` shows the layout of the implementation.
 Each element in matrix $C$ is computed by one thread. The row index $m$
 and column index $n$ of the element in matrix $C$ corresponding to the
 thread are computed in lines 5 and 6 of the GPU kernel. Then, in lines 9
@@ -66,9 +65,8 @@ row index and the column vector in matrix $B$ according to the column
 index, computes the vector inner product. The thread also stores the
 result back to matrix $C$ in line 17.

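For orientation, the kernel that this excerpt closes can be sketched as follows. This is a hedged reconstruction, not the repository's exact listing: the name `gemmNaive`, the `alpha`/`beta` parameters, and the launch shape are assumptions, and the line numbering of the original listing may differ; only the final store appears in the diff above.

```cuda
// Hedged sketch of a naive GEMM kernel (C = alpha * A * B + beta * C);
// only the final store is taken from the excerpt above. A is M x K,
// B is K x N, and C is M x N, all row-major.
__global__ void gemmNaive(const float *A, const float *B, float *C,
                          float alpha, float beta,
                          unsigned M, unsigned N, unsigned K) {
  unsigned m = threadIdx.x + blockDim.x * blockIdx.x;  // row index in C
  unsigned n = threadIdx.y + blockDim.y * blockIdx.y;  // column index in C
  if (m >= M || n >= N) return;
  float acc = 0.0f;
  // Inner product of row m of A with column n of B over the K dimension.
  for (unsigned k = 0; k < K; ++k)
    acc += A[m * K + k] * B[k * N + n];
  float result = alpha * acc + beta * C[m * N + n];
  C[m * N + n] = result;  // the store shown in the excerpt
}

// A matching launch (equally an assumption) could be:
//   dim3 block(16, 16);
//   dim3 grid((M + 15) / 16, (N + 15) / 16);
//   gemmNaive<<<grid, block>>>(A, B, C, alpha, beta, M, N, K);
```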
-![Simple implementation of
-GEMM](../img/ch06/practise/naive.png){#cuda_naive_gemm
-width=".8\\textwidth"}
+![Simple implementation of GEMM](../img/ch06/practise/naive.png)
+:label:`cuda_naive_gemm`

 The method of launching the kernel function is shown in
 Code [\[lst:launch\]](#lst:launch){reference-type="ref"
@@ -169,23 +167,21 @@ one) from matrix $A$ and matrix $B$, requiring each thread to process
 $4\times 4$ blocks (`thread tile`) in matrix $C$. Each thread loads data
 from matrix $A$ and matrix $B$ from left to right and from top to
 bottom, computes the data, and stores the data to matrix $C$, as shown
-in Figure [2](#use_float4){reference-type="ref" reference="use_float4"}.
+in Figure :numref:`use_float4`.

-![Enhancing arithmetic
-intensity](../img/ch06/practise/use_float4.png){#use_float4
-width="\\textwidth"}
+![Enhancing arithmetic intensity](../img/ch06/practise/use_float4.png)
+:label:`use_float4`

 For details about the complete code, see
 [gemm_use_128.cu](https://github.com/openmlsys/openmlsys-cuda/blob/main/gemm_use_128.cu).
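As a rough illustration of this idea (not the exact gemm_use_128.cu code), one $K$-step of a $4\times 4$ thread tile could be organized as below; the helper name `tileStep4x4`, the row-major layouts, and the 16-byte alignment of `&B[k * N + n]` are all assumptions.

```cuda
// Hedged sketch of one K-step for a 4x4 thread tile. Assumes row-major
// A (M x K) and B (K x N), (m, n) at the top-left of this thread's tile,
// and n a multiple of 4 so the float4 load is 16-byte aligned.
__device__ void tileStep4x4(const float *A, const float *B, float c[4][4],
                            unsigned m, unsigned n, unsigned k,
                            unsigned K, unsigned N) {
  // One vectorized load fetches four consecutive elements of row k of B.
  float4 fragB = *reinterpret_cast<const float4 *>(&B[k * N + n]);
  float fragA[4];
#pragma unroll
  for (int i = 0; i < 4; ++i)
    fragA[i] = A[(m + i) * K + k];  // four rows of A, same column k
#pragma unroll
  for (int i = 0; i < 4; ++i) {     // rank-1 update of the 4x4 accumulator
    c[i][0] += fragA[i] * fragB.x;
    c[i][1] += fragA[i] * fragB.y;
    c[i][2] += fragA[i] * fragB.z;
    c[i][3] += fragA[i] * fragB.w;
  }
}
```

Each call performs 16 multiply-accumulates against five load instructions, which is the gain in arithmetic intensity described above.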
 We can increase the amount of data processed by each thread even
 further in order to improve the arithmetic intensity, as shown in
-Figure [3](#use_tile){reference-type="ref" reference="use_tile"}. For
+Figure :numref:`use_tile`. For
 details about the code used to achieve this, see
 [gemm_use_tile.cu](https://github.com/openmlsys/openmlsys-cuda/blob/main/gemm_use_tile.cu).

-![Further enhancement of the arithmetic intensity by adding matrix
-blocks processed by each
-thread](../img/ch06/practise/use_tile.png){#use_tile width="\\textwidth"}
+![Further enhancement of the arithmetic intensity by adding matrix blocks processed by each thread](../img/ch06/practise/use_tile.png)
+:label:`use_tile`

 The test results are as follows:

@@ -238,16 +234,14 @@ number of enabled threads. Other hardware features need to be exploited
 in order to improve performance without compromising the degree of
 parallelism. In earlier code, several thread blocks are enabled, each of
 which processes one or more matrix blocks in matrix $C$. As shown in
-Figure [4](#duplicated_data){reference-type="ref"
-reference="duplicated_data"}, thread $x$ and thread $y$ process the same
+Figure :numref:`duplicated_data`, thread $x$ and thread $y$ process the same
 row in matrix $C$, so they load the same data from matrix $A$. The
 shared memory can be used to improve program throughput by enabling
 different threads in the same thread block to load unique data and reuse
 shared data.

-![Threads loading redundant
-data](../img/ch06/practise/duplicated_data.png){#duplicated_data
-width=".8\\textwidth"}
+![Threads loading redundant data](../img/ch06/practise/duplicated_data.png)
+:label:`duplicated_data`

 We have previously mentioned that the inner product can be computed by
 loading and accumulating data in $K$ loops. Specifically, in each loop,
@@ -257,14 +251,12 @@ the same data from matrix $B$. However, the code needs to be optimized
 by dividing the $K$ loops into $\frac{K}{tileK}$ outer loops and $tileK$
 inner loops. In this way, an entire block of data is loaded in each
 outer loop and accumulated in each inner loop.
-Figure [5](#use_smem_store){reference-type="ref"
-reference="use_smem_store"} shows the process of moving data from the
+Figure :numref:`use_smem_store` shows the process of moving data from the
 global memory to the shared memory. Before the inner loops start, the
 entire `tile` of matrix $A$ and matrix $B$ is stored in the shared
 memory.

-Figure [6](#use_smem_load){reference-type="ref"
-reference="use_smem_load"} shows the process of moving data from the
+Figure :numref:`use_smem_load` shows the process of moving data from the
 shared memory to the register. In each inner loop, data is loaded from
 the shared memory and computed. An advantage of this design is that each
 thread does not need to load all the data it requires from the global
@@ -273,13 +265,11 @@ threads from the global memory and stores the data in the shared memory.
 During computation, each thread only needs to load the data it requires
 from the shared memory.

-![Writing data to the shared
-memory](../img/ch06/practise/use_smem_store.png){#use_smem_store
-width="\\textwidth"}
+![Writing data to the shared memory](../img/ch06/practise/use_smem_store.png)
+:label:`use_smem_store`

-![Loading data from the shared
-memory](../img/ch06/practise/use_smem_load.png){#use_smem_load
-width="\\textwidth"}
+![Loading data from the shared memory](../img/ch06/practise/use_smem_load.png)
+:label:`use_smem_load`

 For details about the complete code, see
 [gemm_use_smem.cu](https://github.com/openmlsys/openmlsys-cuda/blob/main/gemm_use_smem.cu).
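The outer/inner loop structure described above can be sketched schematically as follows; the template parameters, the one-output-per-thread simplification, and the divisibility assumptions are illustrative rather than the exact gemm_use_smem.cu implementation.

```cuda
// Hedged skeleton of shared memory tiling (not the exact repository code).
// Each thread produces one element of C for clarity; M, N, K are assumed
// divisible by the tile sizes. Launch with blockDim = (tileN, tileM).
template <unsigned tileM, unsigned tileN, unsigned tileK>
__global__ void gemmSmem(const float *A, const float *B, float *C,
                         unsigned M, unsigned N, unsigned K) {
  __shared__ float tileA[tileM][tileK];
  __shared__ float tileB[tileK][tileN];
  unsigned row = blockIdx.y * tileM + threadIdx.y;
  unsigned col = blockIdx.x * tileN + threadIdx.x;
  float acc = 0.0f;
  for (unsigned kb = 0; kb < K; kb += tileK) {  // K / tileK outer loops
    // The block cooperatively stages one tile of A and one tile of B, so
    // each element is read from global memory once per thread block.
    for (unsigned k = threadIdx.x; k < tileK; k += blockDim.x)
      tileA[threadIdx.y][k] = A[row * K + kb + k];
    for (unsigned k = threadIdx.y; k < tileK; k += blockDim.y)
      tileB[k][threadIdx.x] = B[(kb + k) * N + col];
    __syncthreads();  // tiles must be complete before any thread reads them
    for (unsigned k = 0; k < tileK; ++k)        // tileK inner loops
      acc += tileA[threadIdx.y][k] * tileB[k][threadIdx.x];
    __syncthreads();  // keep tiles intact until every thread is done
  }
  C[row * N + col] = acc;
}
```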
@@ -333,27 +323,23 @@ time this instruction is issued during $tileK$ inner loops, the
 mathematical operation that requires the loaded data is performed
 immediately. However, the compute unit has to wait for the data to be
 loaded from the shared memory, as shown in
-Figure [7](#use_smem_pipeline){reference-type="ref"
-reference="use_smem_pipeline"}. Accessing the shared memory may take
+Figure :numref:`use_smem_pipeline`. Accessing the shared memory may take
 dozens of clock cycles, but computation instructions can often be
 completed within only a few clock cycles. In order to significantly
 accelerate memory access, we can hide the shared memory loading latency
 by optimizing the pipeline. Specifically, during the $tileK$ inner loops,
 the loading instructions that prepare data for the next loop can be
 issued at the beginning of each loop, as shown in
-Figure [8](#hide_smem_latency){reference-type="ref"
-reference="hide_smem_latency"}. In this way, computation instructions in
+Figure :numref:`hide_smem_latency`. In this way, computation instructions in
 the current loop do not require the data for the next loop. As such,
 their execution will not be blocked by the instructions that load that
 data.

-![Pipeline of the previous GPU kernel
-function](../img/ch06/practise/use_smem_pipeline.png){#use_smem_pipeline
-width="\\textwidth"}
+![Pipeline of the previous GPU kernel function](../img/ch06/practise/use_smem_pipeline.png)
+:label:`use_smem_pipeline`

-![Pipeline that hides the shared memory loading
-latency](../img/ch06/practise/hide_smem_latency.png){#hide_smem_latency
-width="\\textwidth"}
+![Pipeline that hides the shared memory loading latency](../img/ch06/practise/hide_smem_latency.png)
+:label:`hide_smem_latency`

 For details about the complete code, see
 [gemm_hide_smem_latency.cu](https://github.com/openmlsys/openmlsys-cuda/blob/main/gemm_hide_smem_latency.cu).
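The shape of such a pipelined inner loop can be illustrated with the following hedged fragment, which reuses the `tileA`/`tileB`/`acc` names from the earlier skeleton and is not the exact gemm_hide_smem_latency.cu code.

```cuda
// Hedged fragment: double-buffered registers overlap shared memory loads
// with computation inside the tileK inner loops.
float fragA[2], fragB[2];
fragA[0] = tileA[threadIdx.y][0];  // preload the data for iteration 0
fragB[0] = tileB[0][threadIdx.x];
#pragma unroll
for (unsigned k = 0; k < tileK; ++k) {
  unsigned cur = k % 2, nxt = (k + 1) % 2;
  if (k + 1 < tileK) {
    // Issue the loads for iteration k + 1 first; they complete while the
    // multiply-accumulate below executes, hiding the shared memory latency.
    fragA[nxt] = tileA[threadIdx.y][k + 1];
    fragB[nxt] = tileB[k + 1][threadIdx.x];
  }
  acc += fragA[cur] * fragB[cur];  // depends only on registers already loaded
}
```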
@@ -388,12 +374,10 @@ wait for the read instruction to be completed, thereby hiding the global
 memory loading latency. We can also enable the data in `buffer` to be
 written to `tile` in the last inner loop, after $tileK - 1$ inner loops
 are executed, further reducing the latency of writing data to
-`tile`. Figure [9](#hide_global_latency){reference-type="ref"
-reference="hide_global_latency"} shows the optimized pipeline.
+`tile`. Figure :numref:`hide_global_latency` shows the optimized pipeline.

-![Pipeline that hides the global memory loading
-latency](../img/ch06/practise/hide_global_latency.png){#hide_global_latency
-width="\\textwidth"}
+![Pipeline that hides the global memory loading latency](../img/ch06/practise/hide_global_latency.png)
+:label:`hide_global_latency`

 For details about the complete code, see
 [gemm_final.cu](https://github.com/openmlsys/openmlsys-cuda/blob/main/gemm_final.cu).
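Schematically, the resulting outer loop might take the shape below. This is a hedged sketch layered on the earlier fragments, not the exact gemm_final.cu code: `ra`/`ca` and `rb`/`cb` stand for this thread's (assumed) staging coordinates inside `tileA` and `tileB`, and the first tiles are assumed to be staged before the loop.

```cuda
// Hedged schematic of the final pipeline (not the exact repository code).
for (unsigned kb = 0; kb < K; kb += tileK) {
  float bufA = 0.0f, bufB = 0.0f;   // per-thread staging registers ("buffer")
  const bool hasNext = kb + tileK < K;
  if (hasNext) {
    // Issue the global memory reads for the NEXT block immediately; the
    // inner loops below execute while these reads are still in flight.
    bufA = A[(blockIdx.y * tileM + ra) * K + (kb + tileK) + ca];
    bufB = B[(kb + tileK + rb) * N + (blockIdx.x * tileN + cb)];
  }
  for (unsigned k = 0; k < tileK; ++k) {
    // ... tileK inner loops reading tileA/tileB, as in the fragment above ...
  }
  __syncthreads();                  // all reads of the current tiles are done
  if (hasNext) {
    tileA[ra][ca] = bufA;           // commit the prefetched block ("buffer")
    tileB[rb][cb] = bufB;           // to "tile" only after the inner loops
    __syncthreads();                // new tiles are ready for the next round
  }
}
```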