## Implementing General Matrix Multiplication {#sec-accelerator-naive}

Code `lst:cpu` shows a reference implementation of GEMM in C++.

**lst:cpu**
```cpp
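// A minimal reference sketch (the full listing is elided here):
// row-major FP32 GEMM computing C = alpha * A * B + beta * C, with A
// of shape M x K, B of shape K x N, and C of shape M x N.
void gemmReference(const float *A, const float *B, float *C, int M,
                   int N, int K, float alpha, float beta) {
  for (int m = 0; m < M; ++m) {
    for (int n = 0; n < N; ++n) {
      float acc = 0.0f;
      // Dot product of row m of A and column n of B.
      for (int k = 0; k < K; ++k) {
        acc += A[m * K + k] * B[k * N + n];
      }
      C[m * N + n] = alpha * acc + beta * C[m * N + n];
    }
  }
}
```
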
Before the computation in each inner loop begins, the kernel issues
the instructions that load the data for the next loop.

For details about the complete code, see
[gemm_hide_smem_latency.cu](https://github.com/openmlsys/openmlsys-cuda/blob/main/gemm_hide_smem_latency.cu).

The test results are as follows:

    Max Error: 0.000092
    Average Time: 0.585 ms, Average Throughput: 14686.179 GFLOPS

Analysis by Nsight Compute shows that the value of
`Stall Short Scoreboard` decreases by 67% compared with that of the
previous GPU kernel function. As mentioned before, after a GPU memory
load/store instruction is issued, the GPU executes the next instruction
without waiting for the data to land in the register. Instead, it sets
a flag on the scoreboard and resets the flag once the data has landed.
Any instruction that depends on that data is executed only after the
flag is reset. The decrease in `Stall Short Scoreboard` demonstrates
that hiding the access latency of the shared memory is an effective
way to better utilize the GPU.
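
The pattern behind this result can be sketched as follows. This is an
illustrative device function, not the actual code in
`gemm_hide_smem_latency.cu`; the names `tileA`, `fragA`, and `nextA`
and the 4-wide fragment size are assumptions. The shared memory loads
(`LDS`) for iteration $k + 1$ are issued before the multiply-accumulate
work of iteration $k$, so the arithmetic executes while the loads are
still in flight:

```cpp
// Illustrative sketch of hiding shared memory load latency; names and
// the 4-wide fragment size are assumptions, not the actual kernel.
__device__ void computeTile(const float *tileA, const float *tileB,
                            float acc[4][4], int tileK) {
  float fragA[4], fragB[4];  // fragments for the current iteration
  float nextA[4], nextB[4];  // fragments being prefetched

  // Load the fragments for the first inner iteration (LDS).
  for (int i = 0; i < 4; ++i) {
    fragA[i] = tileA[i];
    fragB[i] = tileB[i];
  }

  for (int k = 0; k < tileK; ++k) {
    bool hasNext = (k + 1 < tileK);
    // Issue the shared memory loads for iteration k + 1 first. The
    // scoreboard lets the FMAs below run while these loads are in
    // flight; only the rotation at the end waits on them.
    if (hasNext) {
      for (int i = 0; i < 4; ++i) {
        nextA[i] = tileA[(k + 1) * 4 + i];
        nextB[i] = tileB[(k + 1) * 4 + i];
      }
    }
    // Multiply-accumulate with the already-loaded fragments.
    for (int i = 0; i < 4; ++i)
      for (int j = 0; j < 4; ++j)
        acc[i][j] += fragA[i] * fragB[j];
    // Rotate the double-buffered fragments.
    if (hasNext)
      for (int i = 0; i < 4; ++i) {
        fragA[i] = nextA[i];
        fragB[i] = nextB[i];
      }
  }
}
```

The essential point is the ordering: the loads for the next iteration
are issued before the computation that hides them.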

## Hiding Global Memory Loading Latency

To load data from the global memory, a GPU uses the `LDG` instruction,
whose behavior is similar to that of the `LDS` instruction used to load
data from the shared memory, as discussed in the previous section. At
the beginning of each of the $\frac{K}{tileK}$ outer loops, the
instructions that load the data tiles of matrix $A$ for the next loop
are issued. Because this data is not required by any inner loop of the
current outer loop, the computation in the inner loops does not wait
for the load instructions to complete, thereby hiding the global memory
loading latency. We can also write the data in `buffer` to `tile`
during the last inner loop iteration, that is, after $tileK - 1$ inner
loops have executed, further reducing the latency of writing data to
`tile`. Figure :numref:`hide_global_latency` shows the optimized
pipeline.

![Pipeline that hides the global memory loading latency](../img/ch06/practise/hide_global_latency.png)
:label:`hide_global_latency`
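
A much-simplified version of this pipeline can be sketched as follows.
This is a hypothetical kernel, not `gemm_final.cu` itself: it assumes a
16x16 tiling with one output element per thread and matrix dimensions
that are multiples of the tile size, whereas the actual kernel uses
larger tiles, `float4` loads, and register blocking:

```cpp
// Hypothetical, simplified pipeline: prefetch the next tiles from
// global memory into registers while computing on the current tiles.
constexpr int TILE = 16;

__global__ void gemmPrefetch(const float *A, const float *B, float *C,
                             int N, int K) {
  __shared__ float tileA[TILE][TILE];
  __shared__ float tileB[TILE][TILE];
  int row = blockIdx.y * TILE + threadIdx.y;
  int col = blockIdx.x * TILE + threadIdx.x;
  float acc = 0.0f;

  // Load the first tiles of A and B into shared memory.
  tileA[threadIdx.y][threadIdx.x] = A[row * K + threadIdx.x];
  tileB[threadIdx.y][threadIdx.x] = B[threadIdx.y * N + col];
  __syncthreads();

  for (int k0 = 0; k0 < K; k0 += TILE) {
    // Issue the global loads (LDG) for the next tiles into registers.
    // Nothing in the inner loop depends on them, so computation
    // proceeds while the loads are in flight.
    float bufA = 0.0f, bufB = 0.0f;
    if (k0 + TILE < K) {
      bufA = A[row * K + (k0 + TILE) + threadIdx.x];
      bufB = B[(k0 + TILE + threadIdx.y) * N + col];
    }

    // Compute on the tiles already resident in shared memory.
    for (int k = 0; k < TILE; ++k)
      acc += tileA[threadIdx.y][k] * tileB[k][threadIdx.x];
    __syncthreads();

    // Move the prefetched registers into shared memory, readying the
    // tiles for the next outer iteration.
    if (k0 + TILE < K) {
      tileA[threadIdx.y][threadIdx.x] = bufA;
      tileB[threadIdx.y][threadIdx.x] = bufB;
    }
    __syncthreads();
  }
  C[row * N + col] = acc;
}
```

Under these assumptions, the kernel would be launched with
`dim3 block(TILE, TILE)` and a grid of `(N / TILE, M / TILE)` blocks.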

For details about the complete code, see
[gemm_final.cu](https://github.com/openmlsys/openmlsys-cuda/blob/main/gemm_final.cu).

The test results are as follows:

    Max Error: 0.000092
    Average Time: 0.542 ms, Average Throughput: 15838.302 GFLOPS

Similar to the `Stall Short Scoreboard` results obtained in the
previous section, analysis by Nsight Compute shows that the value of
`Stall Long Scoreboard` (a global memory indicator) decreases by 67%.
Such a significant decrease demonstrates that prefetching data can hide
the global memory loading latency.

## Performance Optimization Principles

So far, we have discussed various methods to enhance the performance of
an accelerator. Although other methods exist, performance optimization
generally adheres to the following principles:

- Increasing parallelism through resource mapping: Multi-level
  parallel resources (`blocks`, `warps`, and `threads`) are mapped to
  the data needing computation and transfer in order to enhance program
  parallelism.

- Reducing memory access latency through memory structure
  optimization: Based on the recognition of data reuse within the same
  `block` during computation, the reused data is stored in local
  memory (such as shared memory and registers) to increase locality.

- Reducing the instruction issue overhead by optimizing instruction
  execution: The `#pragma unroll` directive is used to unroll loops in
  order to improve instruction-level parallelism and reduce logic
  judgments. Vectorized load instructions are used to increase
  bandwidth. On the Ampere architecture, the widest vectorized load
  instruction is `LDG.E.128`, and the corresponding data type for
  loading is `float4` (see the sketch after this list).

- Hiding load/store latency by optimizing the memory access pipeline:
  In instances where the in-memory data undergoes modifications (such
  as the movement of matrix data), we can optimize the memory access
  pipeline so that the accelerator performs computations during the
  intervals between data movements, thereby concealing the latency
  associated with data movement.
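
As an illustration of the vectorized load mentioned above, a minimal,
hypothetical copy kernel can reinterpret `float` pointers as `float4`
so that the compiler emits one 128-bit load per four elements. It
assumes 16-byte-aligned pointers and an element count that is a
multiple of 4:

```cpp
// Hypothetical sketch of a vectorized load: reinterpreting float
// pointers as float4 so one LDG.E.128 instruction moves four floats.
// Assumes src and dst are 16-byte aligned and count is a multiple of 4.
__global__ void copyVectorized(const float *src, float *dst, int count) {
  int i = (blockIdx.x * blockDim.x + threadIdx.x) * 4;
  if (i < count) {
    float4 v = *reinterpret_cast<const float4 *>(src + i);  // LDG.E.128
    *reinterpret_cast<float4 *>(dst + i) = v;               // STG.E.128
  }
}
```

Each thread thus moves four floats per load/store instruction,
quadrupling the data moved per issued instruction.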