}
```

Each element of matrix $C$ is computed independently, so numerous GPU
threads can be launched to compute the corresponding elements of
matrix $C$ in parallel. The GPU kernel function is shown in
Code `lst:gpu`.

**lst:gpu**
```cpp
__global__ void gemmKernel(const float *A, const float *B, float *C,
                           float alpha, float beta, unsigned M,
                           unsigned N, unsigned K) {
  // Row and column of the element of C computed by this thread.
  unsigned int m = threadIdx.x + blockDim.x * blockIdx.x;
  unsigned int n = threadIdx.y + blockDim.y * blockIdx.y;
  if (m >= M || n >= N)
    return;
  // Inner product of row m of A and column n of B.
  float c = 0;
  for (unsigned k = 0; k < K; ++k) {
    c += A[m * K + k] * B[k * N + n];
  }
  c = c * alpha;
  float result = c;
  if (beta != 0) {
    result = result + C[m * N + n] * beta;
  }
  C[m * N + n] = result;
}
```

Figure :numref:`cuda_naive_gemm` shows the layout of the
implementation: each element of matrix $C$ is computed by one thread.
The kernel first computes the row index $m$ and column index $n$ of
the element of $C$ assigned to the thread from its block and thread
indices. The thread then loads row $m$ of matrix $A$ and column $n$ of
matrix $B$, computes their inner product, and finally stores the
scaled result back to matrix $C$.


:label:`cuda_naive_gemm`

The method of launching the kernel function is shown in
Code `lst:launch`.

**lst:launch**
```cpp
void gemmNaive(const float *A, const float *B, float *C,
               float alpha, float beta, unsigned M,
               unsigned N, unsigned K) {
  // Each block computes a 16x16 tile of C; the grid size is rounded
  // up so that all M x N elements are covered even when M or N is not
  // a multiple of 16.
  dim3 block(16, 16);
  dim3 grid((M - 1) / block.x + 1, (N - 1) / block.y + 1);

  gemmKernel<<<grid, block>>>(A, B, C, alpha, beta, M, N, K);
}
```

Each thread block processes a $16 \times 16$ tile of matrix $C$.
Therefore, $\lceil M/16 \rceil \times \lceil N/16 \rceil$ thread blocks
are used to compute the entire matrix $C$; the expression
$(M - 1)/16 + 1$ in the code implements this ceiling division in
integer arithmetic.

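For illustration, with hypothetical sizes $M = N = 1000$ (chosen here
only as an example, not taken from the benchmark below), the launch
would use

$$\left\lceil \frac{1000}{16} \right\rceil \times \left\lceil \frac{1000}{16} \right\rceil = 63 \times 63 = 3969$$

thread blocks of $16 \times 16 = 256$ threads each.
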
Eigen is used to generate the input data and to compute the reference
GEMM result on the CPU. In addition, code for measuring the error of
the GPU result and profiling its execution time is implemented. For
details, see
[first_attempt.cu](https://github.com/openmlsys/openmlsys-cuda/blob/main/first_attempt.cu).
After the program is compiled and executed, the output is as follows:

    Average time: 48.961 ms
    Max error: 0.000092

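The `Max error` figure above could be produced by a comparison along
the following lines; this is a rough sketch with illustrative names,
not the actual code of `first_attempt.cu`. Row-major Eigen matrices
are used so that the CPU reference matches the kernel's
`A[m * K + k]` indexing:

```cpp
#include <Eigen/Dense>

// Row-major storage matches the linear indexing used by the kernel.
using RowMajorMatrix =
    Eigen::Matrix<float, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor>;

// Hypothetical helper: largest absolute element-wise deviation of the
// GPU result from the Eigen CPU reference alpha * A * B + beta * C.
float maxError(const RowMajorMatrix &gpuResult, const RowMajorMatrix &A,
               const RowMajorMatrix &B, const RowMajorMatrix &C,
               float alpha, float beta) {
  RowMajorMatrix ref = alpha * (A * B) + beta * C;
  return (gpuResult - ref).cwiseAbs().maxCoeff();
}
```
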
The peak GPU throughput can be approximated by using the following
formula: 2 $\times$ frequency $\times$ number of single-precision
compute units, where the factor of 2 accounts for the multiply and the
add of a fused multiply-add (FMA) counting as two floating-point
operations. The number of single-precision compute units equals the
number of SMs in the GPU multiplied by the number of single-precision
compute units in each SM. The results are as follows:

    FP32 peak throughput 29767.680 GFLOPS
    Average Throughput: 185.313 GFLOPS

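As a rough sketch, such a peak-throughput estimate could be computed
with the CUDA runtime API as follows; the value of 128 FP32 units per
SM is an assumption (it holds for many Ampere-class GPUs but is
architecture-dependent), and `cudaDeviceProp::clockRate` reports the
clock in kHz:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  cudaDeviceProp prop{};
  cudaGetDeviceProperties(&prop, 0);
  // Assumed number of FP32 units per SM; architecture-dependent
  // (128 on many Ampere GPUs) -- adjust for the device at hand.
  const double fp32UnitsPerSM = 128.0;
  // clockRate is in kHz; the factor 2 counts the multiply and the add
  // of one FMA as two floating-point operations.
  double peakGFlops = 2.0 * prop.clockRate * 1e3 *
                      prop.multiProcessorCount * fp32UnitsPerSM / 1e9;
  std::printf("FP32 peak throughput %.3f GFLOPS\n", peakGFlops);
  return 0;
}
```
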
A significant gap exists between the performance achieved by the
current code and the peak device performance. Within the entire
computation, the part with the highest computational density is the
matrix multiplication $A \times B$: its time complexity is
$O(M \times N \times K)$, whereas the time complexity of the entire
computation is $O(M \times N \times K + 2 \times M \times N)$.
Therefore, optimizing the matrix multiplication is key to improving
performance.

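To make this dominance concrete, counting one fused multiply-add as
two floating-point operations, the work decomposes roughly as

$$\underbrace{2MNK}_{A \times B} + \underbrace{2MN}_{\text{scaling by } \alpha,\, \beta \text{ and accumulation}} = 2MN(K + 1),$$

so the $A \times B$ term accounts for a fraction $K/(K+1)$ of the
total operations and dominates for any realistic $K$.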