Commit f00939c ("debug"), parent eae04f5

chapter_accelerator/Performance_Optimization_Methods.md: 108 additions, 0 deletions

shared memory, optimizing the memory load/store pipeline, among others.
The subsequent sections illustrate these methods through practical
programming examples, all aimed at a single objective: accelerating an
FP32 GEMM program.

## Implementing General Matrix Multiplication

Code `lst:cpu` shows a reference implementation of GEMM in C++.

**lst:cpu**
```cpp
float A[M][K];
float B[K][N];
float C[M][N];
float alpha, beta;

for (unsigned m = 0; m < M; ++m) {
  for (unsigned n = 0; n < N; ++n) {
    float c = 0;
    for (unsigned k = 0; k < K; ++k) {
      c += A[m][k] * B[k][n];
    }
    // C = alpha * A * B + beta * C
    C[m][n] = alpha * c + beta * C[m][n];
  }
}
```

Each element of matrix $C$ is computed independently, so numerous GPU
threads can be launched to compute the corresponding elements of matrix
$C$ in parallel. The GPU kernel function is shown in Code `lst:gpu`.

**lst:gpu**
```cpp
__global__ void gemmKernel(const float *A, const float *B, float *C,
                           float alpha, float beta, unsigned M,
                           unsigned N, unsigned K) {
  // Row and column of the C element handled by this thread.
  unsigned int m = threadIdx.x + blockDim.x * blockIdx.x;
  unsigned int n = threadIdx.y + blockDim.y * blockIdx.y;
  if (m >= M || n >= N)
    return;
  // Inner product of row m of A and column n of B.
  float c = 0;
  for (unsigned k = 0; k < K; ++k) {
    c += A[m * K + k] * B[k * N + n];
  }
  c = c * alpha;
  float result = c;
  if (beta != 0) {
    result = result + C[m * N + n] * beta;
  }
  // Store the scaled result back to C.
  C[m * N + n] = result;
}
```

Figure :numref:`cuda_naive_gemm` shows the layout of this implementation.
Each element of matrix $C$ is computed by one thread. The kernel first
derives the row index $m$ and the column index $n$ of the thread's element
in matrix $C$ from the thread and block indices. The thread then loads the
row vector of matrix $A$ selected by the row index and the column vector of
matrix $B$ selected by the column index, computes their inner product, and
stores the scaled result back to matrix $C$.

![Simple implementation of GEMM](../img/ch06/practise/naive.png)
:label:`cuda_naive_gemm`

The host-side code that launches the kernel is shown in Code `lst:launch`.

**lst:launch**
```cpp
void gemmNaive(const float *A, const float *B, float *C,
               float alpha, float beta, unsigned M,
               unsigned N, unsigned K) {
  // One 16x16 thread block per 16x16 tile of C; round the grid size up
  // so that every element of C is covered.
  dim3 block(16, 16);
  dim3 grid((M - 1) / block.x + 1, (N - 1) / block.y + 1);

  gemmKernel<<<grid, block>>>(A, B, C, alpha, beta, M, N, K);
}
```

Each thread block processes a $16\times16$ tile of matrix $C$. Therefore,
$\lceil M/16 \rceil \times \lceil N/16 \rceil$ thread blocks are used to
compute the entire matrix $C$. For example, for $M = N = 1024$, the grid
consists of $64 \times 64 = 4096$ blocks of $256$ threads each.

Eigen is used to generate the input data and to compute a reference GEMM
result on the CPU. In addition, error-measurement and time-profiling code
is implemented for the GPU result. For details, see
[first_attempt.cu](https://github.com/openmlsys/openmlsys-cuda/blob/main/first_attempt.cu).
After the program is compiled and executed, the output is as follows:

    Average time: 48.961 ms
    Max error: 0.000092
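
The verification harness itself is not reproduced here. The following is a
minimal sketch of how such a harness could look, assuming the `gemmKernel`
and `gemmNaive` definitions above are in the same file; the matrix sizes are
illustrative, and the CUDA error checking and timing loop found in
`first_attempt.cu` are omitted.

```cpp
#include <Eigen/Dense>
#include <cuda_runtime.h>
#include <cstdio>

using RowMajorMatrix =
    Eigen::Matrix<float, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor>;

int main() {
  const unsigned M = 1024, N = 1024, K = 1024;  // illustrative sizes
  const float alpha = 1.0f, beta = 0.0f;

  // Generate random inputs and a CPU reference result with Eigen.
  RowMajorMatrix A = RowMajorMatrix::Random(M, K);
  RowMajorMatrix B = RowMajorMatrix::Random(K, N);
  RowMajorMatrix C = RowMajorMatrix::Zero(M, N);
  RowMajorMatrix ref = alpha * (A * B) + beta * C;

  // Copy the operands to the GPU, run the naive GEMM, and copy C back.
  float *dA, *dB, *dC;
  cudaMalloc(&dA, sizeof(float) * M * K);
  cudaMalloc(&dB, sizeof(float) * K * N);
  cudaMalloc(&dC, sizeof(float) * M * N);
  cudaMemcpy(dA, A.data(), sizeof(float) * M * K, cudaMemcpyHostToDevice);
  cudaMemcpy(dB, B.data(), sizeof(float) * K * N, cudaMemcpyHostToDevice);
  cudaMemcpy(dC, C.data(), sizeof(float) * M * N, cudaMemcpyHostToDevice);

  gemmNaive(dA, dB, dC, alpha, beta, M, N, K);
  cudaDeviceSynchronize();

  RowMajorMatrix gpu(M, N);
  cudaMemcpy(gpu.data(), dC, sizeof(float) * M * N, cudaMemcpyDeviceToHost);

  // Report the maximum absolute element-wise error against the reference.
  float maxError = (gpu - ref).cwiseAbs().maxCoeff();
  std::printf("Max error: %f\n", maxError);

  cudaFree(dA);
  cudaFree(dB);
  cudaFree(dC);
  return 0;
}
```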

The peak GPU throughput can be approximated with the formula
$2 \times \text{frequency} \times \text{number of single-precision compute units}$.
The number of single-precision compute units equals the number of SMs in
the GPU multiplied by the number of single-precision compute units in each
SM. The results are as follows:

    FP32 peak throughput 29767.680 GFLOPS
    Average Throughput: 185.313 GFLOPS
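
As a rough illustration of this formula, the clock frequency and SM count
can be queried from the CUDA runtime. The snippet below is only a sketch:
the number of FP32 compute units per SM is architecture dependent and is
hard-coded here as an assumption, so it must be adjusted for the actual GPU.

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main() {
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);

  // Assumption: FP32 compute units per SM vary by architecture
  // (e.g., 64 on some GPUs, 128 on others); adjust for your device.
  const int kFp32CoresPerSm = 128;

  // prop.clockRate is reported in kHz.
  double peakGflops = 2.0 * (prop.clockRate * 1e3) *
                      prop.multiProcessorCount * kFp32CoresPerSm / 1e9;
  std::printf("FP32 peak throughput %.3f GFLOPS\n", peakGflops);
  return 0;
}
```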
A significant gap exists between the performance that can be achieved by
113+
the current code and the peak device performance. In an entire computing
114+
process, the process with the highest computing density is matrix
115+
multiplication $A\times B$. Its time complexity is $O(M*N*K)$, whereas
116+
that time complexity of the entire computing process is
117+
$O(M*N*K+2*M*N)$. Therefore, optimizing matrix multiplication is key to
118+
improving performance.
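
As a quick sanity check of these bounds (a sketch, using the same
per-element operation counting as above): the triple loop performs one
multiply-add for each of the $M \cdot N \cdot K$ index triples, while the
epilogue performs a constant number of operations for each of the
$M \cdot N$ output elements:

$$
\underbrace{M N K}_{\text{inner products of } A \times B}
\;+\;
\underbrace{2 M N}_{\text{scaling by } \alpha \text{ and update with } \beta C}
\;\in\; O(MNK + 2MN).
$$

For $M = N = K$, the cubic term dominates.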
