shared memory, optimizing the memory load/store pipeline, among others.
The subsequent sections will elucidate these methods through practical
programming examples, all aimed towards a singular objective:
accelerating an FP32 GEMM program.

## Implementing General Matrix Multiplication

Code `lst:cpu` shows a reference implementation of GEMM in C++.

**lst:cpu**
```cpp
// GEMM computes C = alpha * A * B + beta * C; M, N, and K are assumed
// to be defined constants.
float A[M][K];
float B[K][N];
float C[M][N];
float alpha, beta;

for (unsigned m = 0; m < M; ++m) {
  for (unsigned n = 0; n < N; ++n) {
    float c = 0;
    for (unsigned k = 0; k < K; ++k) {
      c += A[m][k] * B[k][n];
    }
    C[m][n] = alpha * c + beta * C[m][n];
  }
}
```

Each element of matrix $C$ is computed independently, so numerous GPU
threads can be launched to compute the corresponding elements of matrix
$C$ in parallel. The GPU kernel function is shown in Code `lst:gpu`.

**lst:gpu**
```cpp
__global__ void gemmKernel(const float *A,
                           const float *B, float *C,
                           float alpha, float beta, unsigned M, unsigned N,
                           unsigned K) {
  unsigned int m = threadIdx.x + blockDim.x * blockIdx.x;
  unsigned int n = threadIdx.y + blockDim.y * blockIdx.y;
  if (m >= M || n >= N)
    return;
  float c = 0;
  for (unsigned k = 0; k < K; ++k) {
    c += A[m * K + k] * B[k * N + n];
  }
  c = c * alpha;
  float result = c;
  if (beta != 0) {
    result = result + C[m * N + n] * beta;
  }
  C[m * N + n] = result;
}
```

:numref:`cuda_naive_gemm` shows the layout of this implementation.
Each element of matrix $C$ is computed by one thread. The row index $m$
and the column index $n$ of the element in matrix $C$ assigned to the
thread are computed in lines 5 and 6 of the GPU kernel. Then, in lines 9
to 11, the thread loads the row vector of matrix $A$ given by the row
index and the column vector of matrix $B$ given by the column index, and
computes their inner product. Finally, the thread scales the result and
stores it back to matrix $C$ in line 18.
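
For example, with `blockDim = (16, 16)` (an illustrative configuration,
not taken from the source), the thread with `threadIdx = (3, 5)` in
block `blockIdx = (2, 1)` computes $m = 3 + 16 \times 2 = 35$ and
$n = 5 + 16 \times 1 = 21$, and is therefore responsible for element
$C_{35,21}$.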

![naive gemm](../img/ch06/cuda_naive_gemm.png)
:label:`cuda_naive_gemm`

The method of launching the kernel function is shown in Code
`lst:launch`.

**lst:launch**
```cpp
void gemmNaive(const float *A, const float *B, float *C,
               float alpha, float beta, unsigned M,
               unsigned N, unsigned K) {
  dim3 block(16, 16);
  // Ceiling division: enough 16x16 blocks to cover all M x N elements.
  dim3 grid((M - 1) / block.x + 1, (N - 1) / block.y + 1);

  gemmKernel<<<grid, block>>>(A, B, C, alpha, beta, M, N, K);
}
```

Each thread block processes $16 \times 16$ elements of matrix $C$.
Therefore, $((M - 1) / 16 + 1) \times ((N - 1) / 16 + 1)$ thread blocks
(using integer division, i.e., $\lceil M / 16 \rceil \times \lceil N /
16 \rceil$) are used to compute the entire matrix $C$.
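
For instance, with hypothetical sizes $M = N = 1000$, the launch creates
$((1000 - 1) / 16 + 1) \times ((1000 - 1) / 16 + 1) = 63 \times 63 =
3969$ thread blocks, and the bounds check at the top of the kernel masks
out the threads that fall outside the matrix.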

Eigen is used to generate the input data and to compute the reference
GEMM result on the CPU. In addition, error measurement and time
profiling are implemented for the GPU result. For details, see
[first_attempt.cu](https://github.com/openmlsys/openmlsys-cuda/blob/main/first_attempt.cu).
After the program is compiled and executed, the output is as follows:

    Average time: 48.961 ms
    Max error: 0.000092
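
The harness itself is not reproduced in the text; the following is a
minimal sketch of what such verification and timing could look like,
using Eigen and CUDA events. The matrix sizes are assumptions for
illustration, and `gemmNaive` is the launcher from Code `lst:launch`.

```cpp
#include <Eigen/Core>
#include <cstdio>
#include <cuda_runtime.h>

// Defined in Code `lst:launch`.
void gemmNaive(const float *A, const float *B, float *C, float alpha,
               float beta, unsigned M, unsigned N, unsigned K);

int main() {
  const unsigned M = 1024, N = 1024, K = 1024;  // assumed sizes
  const float alpha = 1.f, beta = 0.f;

  // Row-major Eigen matrices match the kernel's A[m * K + k] indexing.
  using RowMajor =
      Eigen::Matrix<float, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor>;
  RowMajor A = RowMajor::Random(M, K);
  RowMajor B = RowMajor::Random(K, N);
  RowMajor C = RowMajor::Zero(M, N);
  RowMajor ref = alpha * (A * B) + beta * C;  // CPU reference result

  float *dA, *dB, *dC;
  cudaMalloc(&dA, M * K * sizeof(float));
  cudaMalloc(&dB, K * N * sizeof(float));
  cudaMalloc(&dC, M * N * sizeof(float));
  cudaMemcpy(dA, A.data(), M * K * sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(dB, B.data(), K * N * sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(dC, C.data(), M * N * sizeof(float), cudaMemcpyHostToDevice);

  // CUDA events bracket the kernel launch to measure its execution time.
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  cudaEventRecord(start);
  gemmNaive(dA, dB, dC, alpha, beta, M, N, K);
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  float ms = 0.f;
  cudaEventElapsedTime(&ms, start, stop);

  // Copy the GPU result back and compare against the CPU reference.
  cudaMemcpy(C.data(), dC, M * N * sizeof(float), cudaMemcpyDeviceToHost);
  printf("Time: %.3f ms, max error: %f\n", ms,
         (C - ref).cwiseAbs().maxCoeff());

  cudaFree(dA);
  cudaFree(dB);
  cudaFree(dC);
  return 0;
}
```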

The peak FP32 throughput of the GPU can be approximated with the
following formula: 2 $\times$ frequency $\times$ number of
single-precision compute units, where the factor of 2 accounts for the
multiply and the add of a fused multiply-add executed per unit per
cycle. The number of single-precision compute units equals the number
of SMs in the GPU multiplied by the number of single-precision compute
units in each SM. The results are as follows:

    FP32 peak throughput 29767.680 GFLOPS
    Average Throughput: 185.313 GFLOPS
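
The source does not show how the peak figure is computed; one way to
approximate it is to query the device properties at runtime, as in the
sketch below. The number of FP32 units per SM is not exposed by the CUDA
runtime API, so the value of 128 used here is an assumption that must be
adjusted to the actual compute capability.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);
  // Assumption: 128 FP32 units per SM; look this up for the actual GPU.
  const int fp32UnitsPerSM = 128;
  // prop.clockRate is in kHz; the factor of 2 counts the multiply and
  // the add of one fused multiply-add per unit per cycle.
  const double peakGFLOPS = 2.0 * (prop.clockRate * 1e3) *
                            prop.multiProcessorCount * fp32UnitsPerSM / 1e9;
  printf("FP32 peak throughput %.3f GFLOPS\n", peakGFLOPS);
  return 0;
}
```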

A significant gap exists between the performance achieved by the
current code and the peak device performance. In the entire
computation, the step with the highest computational density is the
matrix multiplication $A \times B$: its time complexity is $O(MNK)$,
whereas that of the entire computation, including the scaling by
$\alpha$ and the accumulation of $\beta C$, is $O(MNK + 2MN)$.
Therefore, optimizing the matrix multiplication is the key to improving
performance.
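
As a rough accounting, each innermost iteration performs one multiply
and one add, so the kernel executes about $2MNK$ floating-point
operations for the product plus $2MN$ for the scaling and accumulation.
The average throughput reported above thus corresponds to roughly
$2MNK / t$ for elapsed time $t$, which is why the subsequent
optimizations focus on the $2MNK$ term.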