
Commit 6a7334f

debug

1 parent 0f6ff03 commit 6a7334f

File tree

1 file changed: +1 -82 lines changed


chapter_accelerator/Performance_Optimization_Methods.md

Lines changed: 1 addition & 82 deletions
@@ -11,8 +11,7 @@ accelerating an FP32 GEMM program.
## Implementing General Matrix Multiplication {#sec-accelerator-naive}

-Code `lst:cpu`
-shows a reference implementation of GEMM in C++.
+Code `lst:cpu` shows a reference implementation of GEMM in C++.

**lst:cpu**
```cpp
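// NOTE: the diff hunk ends here and truncates the original listing; the
// body below is a hedged sketch of a typical naive reference GEMM
// (row-major, C = alpha * A * B + beta * C), not the file's actual code.
void gemmReference(const float *A, const float *B, float *C,
                   int M, int N, int K, float alpha, float beta) {
  for (int m = 0; m < M; ++m) {
    for (int n = 0; n < N; ++n) {
      float acc = 0.f;
      for (int k = 0; k < K; ++k)
        acc += A[m * K + k] * B[k * N + n];  // dot of row m and column n
      C[m * N + n] = alpha * acc + beta * C[m * N + n];
    }
  }
}
```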
@@ -348,83 +347,3 @@ the instructions that load the data for the next loop.
For details about the complete code, see
[gemm_hide_smem_latency.cu](https://github.com/openmlsys/openmlsys-cuda/blob/main/gemm_hide_smem_latency.cu).

The test results are as follows:

    Max Error: 0.000092
    Average Time: 0.585 ms, Average Throughput: 14686.179 GFLOPS

Analysis by Nsight Compute shows that the value of
`Stall Short Scoreboard` decreases by 67% when compared with that of
the previous GPU kernel function. As mentioned before, after a GPU
memory load/store instruction is issued, the GPU executes the next
instruction without waiting for the data to land in the register;
instead, it sets a flag on the scoreboard and clears the flag once the
data has landed. Instructions that depend on such data are executed
only after the data has landed. The decrease in
`Stall Short Scoreboard` demonstrates that hiding the access latency of
the shared memory is an effective way to better utilize the GPU.
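
To make this concrete, here is a minimal sketch of the pattern (a toy
kernel with assumed names and tile size, not the code in
gemm_hide_smem_latency.cu): the shared-memory loads for iteration
$k + 1$ are issued before the multiply-accumulate for iteration $k$, so
the `LDS` instructions retire while the arithmetic executes.

```cpp
#include <cuda_runtime.h>

#define TILE 128  // assumed tile width; launch with blockDim.x == TILE

// Toy kernel: thread 0 accumulates a dot product over a shared-memory
// tile, double-buffering the operands in registers so that each LDS
// overlaps with the FMA on the previously loaded pair instead of
// stalling on the short scoreboard.
__global__ void dotSmemDoubleBuffered(const float *a, const float *b,
                                      float *out) {
  __shared__ float sA[TILE], sB[TILE];
  int t = threadIdx.x;
  sA[t] = a[t];  // stage the tile in shared memory
  sB[t] = b[t];
  __syncthreads();

  if (t == 0) {
    float fragA[2], fragB[2];  // register double buffer
    fragA[0] = sA[0];
    fragB[0] = sB[0];
    float acc = 0.f;
    for (int k = 0; k < TILE; ++k) {
      int cur = k & 1, nxt = cur ^ 1;
      if (k + 1 < TILE) {
        fragA[nxt] = sA[k + 1];  // issue next LDS before consuming current
        fragB[nxt] = sB[k + 1];
      }
      acc += fragA[cur] * fragB[cur];  // FMA overlaps the loads above
    }
    *out = acc;
  }
}
```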

## Hiding Global Memory Loading Latency

To load data from the global memory, a GPU uses the `LDG` instruction,
whose behavior is similar to that of the `LDS` instruction used to load
data from the shared memory, as discussed in the previous section. At
the beginning of each of the $\frac{K}{tileK}$ outer loops, the
instructions that load the data tiles of matrix $A$ for the next loop
are issued. Because this data is not required by any inner loop of the
current outer loop, the computation in the inner loops does not wait
for the load instructions to complete, thereby hiding the global memory
loading latency. We can also have the data in `buffer` written to
`tile` in the last inner loop, after $tileK - 1$ inner loops have
executed, further reducing the latency of writing data to `tile`.
Figure :numref:`hide_global_latency` shows the optimized pipeline.
![Pipeline that hides the global memory loading latency](../img/ch06/practise/hide_global_latency.png)
:label:`hide_global_latency`
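
The pattern can be sketched as follows (a simplified toy, with assumed
names `buffer` and `tile`; not the kernel in gemm_final.cu): the `LDG`
for the next tile is issued before the inner loop runs, and the staged
data is committed to shared memory only after the current tile has been
consumed.

```cpp
#include <cuda_runtime.h>

#define TILE_K 8  // assumed tile size; launch with blockDim.x == TILE_K

// Toy kernel: sums K floats tile by tile. The LDG for the next tile is
// issued into the register `buffer` before the inner loop, so no
// inner-loop instruction waits on it; the data moves to shared memory
// only after the current tile has been consumed. K is assumed to be a
// multiple of TILE_K.
__global__ void sumWithPrefetch(const float *A, float *out, int K) {
  __shared__ float tile[TILE_K];
  int t = threadIdx.x;
  float buffer = 0.f;  // register staging for the next tile

  tile[t] = A[t];  // load the first tile
  __syncthreads();

  float acc = 0.f;
  for (int k0 = 0; k0 < K; k0 += TILE_K) {
    // Prefetch: issue the global load for the next tile first.
    if (k0 + TILE_K < K) buffer = A[k0 + TILE_K + t];

    for (int k = 0; k < TILE_K; ++k)  // compute on the current tile;
      acc += tile[k];                 // does not depend on `buffer`

    __syncthreads();  // all reads of `tile` are done
    if (k0 + TILE_K < K) tile[t] = buffer;  // commit prefetched data
    __syncthreads();
  }
  if (t == 0) *out = acc;
}
```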

For details about the complete code, see
[gemm_final.cu](https://github.com/openmlsys/openmlsys-cuda/blob/main/gemm_final.cu).

The test results are as follows:

    Max Error: 0.000092
    Average Time: 0.542 ms, Average Throughput: 15838.302 GFLOPS

Similar to the `Stall Short Scoreboard` results obtained in the
previous section, analysis by Nsight Compute shows that the value of
`Stall Long Scoreboard` (a global memory indicator) decreases by 67%.
Such a significant decrease demonstrates that prefetching data
effectively hides the global memory loading latency.

## Performance Optimization Principles

So far, we have discussed various methods of enhancing the performance
of an accelerator. Although other methods exist, performance
optimization generally adheres to the following principles:

- Increasing parallelism through resource mapping: Multi-level
  parallel resources (`blocks`, `warps`, and `threads`) are mapped to
  the data that needs to be computed and transferred, enhancing
  program parallelism.

- Reducing memory access latency through memory structure
  optimization: Recognizing that data is reused within the same
  `block` during computation, the reused data is stored in local
  memory (such as shared memory and registers) to increase locality.

- Reducing instruction issue overhead by optimizing instruction
  execution: The `#pragma unroll` directive unrolls loops to improve
  instruction-level parallelism and reduce branching logic, and
  vectorized load instructions increase bandwidth. On the Ampere
  architecture, the widest vectorized load instruction is
  `LDG.E.128`, which loads data as `float4` (see the sketch after
  this list).

- Hiding load/store latency by optimizing the memory access pipeline:
  Where the in-memory data undergoes modification (such as the
  movement of matrix data), we can optimize the memory access
  pipeline so that the accelerator performs computations during the
  intervals between data movements, thereby concealing the latency
  associated with data movement.
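
As a minimal illustration of the third principle (a sketch with assumed
names, not code from the repository), the following kernel unrolls its
loop with `#pragma unroll` and widens its accesses to `float4`, which
the compiler can emit as `LDG.E.128` and `STG.E.128`:

```cpp
#include <cuda_runtime.h>

// Copies N floats using 128-bit vectorized accesses and an unrolled
// loop. Assumes `src` and `dst` are 16-byte aligned and N is a
// multiple of 4.
__global__ void vectorizedCopy(const float *src, float *dst, int N) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = gridDim.x * blockDim.x;

  const float4 *src4 = reinterpret_cast<const float4 *>(src);
  float4 *dst4 = reinterpret_cast<float4 *>(dst);

  #pragma unroll 4  // unroll to raise instruction-level parallelism
  for (int j = i; j < N / 4; j += stride) {
    dst4[j] = src4[j];  // one LDG.E.128 and one STG.E.128 per iteration
  }
}
```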
