Commit 8f07628

debug
1 parent 8028831 commit 8f07628

1 file changed: +11 -0 lines changed

chapter_accelerator/Performance_Optimization_Methods.md

Lines changed: 11 additions & 0 deletions
@@ -96,3 +96,14 @@ the GPU computing result. For details, see
 [first_attempt.cu](https://github.com/openmlsys/openmlsys-cuda/blob/main/first_attempt.cu).
 After the program is compiled and executed, output results are as
 follows:
+
+FP32 peak throughput 29767.680 GFLOPS
+Average Throughput: 185.313 GFLOPS
+
+A significant gap exists between the performance that can be achieved by
+the current code and the peak device performance. In the entire computing
+process, the step with the highest computing density is matrix
+multiplication $A\times B$. Its time complexity is $O(M*N*K)$, whereas
+the time complexity of the entire computing process is
+$O(M*N*K+2*M*N)$. Therefore, optimizing matrix multiplication is key to
+improving performance.
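
To make the size of the gap concrete, the following is a minimal host-side C++ sketch, not part of the commit or of `first_attempt.cu`. It computes the fraction of peak throughput reached and the share of the $O(M*N*K+2*M*N)$ operation count that belongs to $A\times B$. The function names `utilization` and `matmul_share` are hypothetical, and the sketch assumes the two terms of the complexity expression can be compared directly as operation counts.

```cpp
#include <cstdint>

// Hypothetical helper: fraction of the device's peak throughput
// that the measured kernel actually reached.
double utilization(double average_gflops, double peak_gflops) {
    return average_gflops / peak_gflops;
}

// Hypothetical helper: share of the M*N*K + 2*M*N operation count
// that comes from the matrix multiplication A x B (the M*N*K term).
// Assumption: both terms are treated as plain operation counts.
double matmul_share(std::int64_t m, std::int64_t n, std::int64_t k) {
    const double matmul = static_cast<double>(m) * n * k;
    const double rest = 2.0 * static_cast<double>(m) * n;
    return matmul / (matmul + rest);
}
```

With the figures reported above, `utilization(185.313, 29767.680)` is roughly 0.006, that is, about 0.6% of peak. The share simplifies to $K/(K+2)$, so for an illustrative $K = 1024$ it is about 0.998, which is consistent with the conclusion that matrix multiplication dominates the cost.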
