Commit f005d16 (1 parent: f00939c)

debug

chapter_accelerator/Performance_Optimization_Methods.md

Lines changed: 0 additions & 85 deletions

@@ -31,88 +31,3 @@ for (unsigned m = 0; m < M; ++m) {
}
```

Each element in matrix $C$ is computed independently, so numerous GPU
threads can be launched to compute the corresponding elements of matrix
$C$ in parallel. The GPU kernel function is shown in Code `lst:gpu`.

**lst:gpu**
```cpp
__global__ void gemmKernel(const float *A,
                           const float *B, float *C,
                           float alpha, float beta, unsigned M, unsigned N,
                           unsigned K) {
  unsigned int m = threadIdx.x + blockDim.x * blockIdx.x;  // row index in C
  unsigned int n = threadIdx.y + blockDim.y * blockIdx.y;  // column index in C
  if (m >= M || n >= N)  // threads mapped outside C do no work
    return;
  float c = 0;
  for (unsigned k = 0; k < K; ++k) {  // inner product of row m of A, column n of B
    c += A[m * K + k] * B[k * N + n];
  }
  c = c * alpha;
  float result = c;
  if (beta != 0) {  // read C only when beta contributes
    result = result + C[m * N + n] * beta;
  }
  C[m * N + n] = result;  // store the result back to C
}
```

Figure :numref:`cuda_naive_gemm` shows the layout of this
implementation. Each element in matrix $C$ is computed by one thread.
The row index $m$ and the column index $n$ of the element in matrix $C$
assigned to the thread are computed in lines 5 and 6 of the GPU kernel.
Then, in lines 9 to 11, the thread loads the row vector of matrix $A$
selected by the row index and the column vector of matrix $B$ selected
by the column index, and computes their inner product. Finally, in
line 18, the thread stores the result back to matrix $C$.

![Simple implementation of GEMM](../img/ch06/practise/naive.png)
:label:`cuda_naive_gemm`

The method of launching the kernel function is shown in
Code `lst:launch`.

**lst:launch**
```cpp
void gemmNaive(const float *A, const float *B, float *C,
               float alpha, float beta, unsigned M,
               unsigned N, unsigned K) {
  dim3 block(16, 16);
  // Ceiling division: the grid covers all of C even when M or N is
  // not a multiple of 16.
  dim3 grid((M - 1) / block.x + 1, (N - 1) / block.y + 1);

  gemmKernel<<<grid, block>>>(A, B, C, alpha, beta, M, N, K);
}
```

Each thread block processes $16\times16$ elements of matrix $C$.
Therefore, $((M - 1) / 16 + 1) \times ((N - 1) / 16 + 1)$ thread
blocks, i.e. $\lceil M/16 \rceil \times \lceil N/16 \rceil$, are used
to compute the entire matrix $C$. For example, with $M = N = 1024$,
the grid contains $64 \times 64 = 4096$ blocks of $256$ threads each.

Eigen is used to generate the input data and to compute a reference
GEMM result on the CPU. In addition, error-checking and time-profiling
code is implemented for the GPU result. For details, see
[first_attempt.cu](https://github.com/openmlsys/openmlsys-cuda/blob/main/first_attempt.cu).
After the program is compiled and executed, the output is as follows:

    Average time: 48.961 ms
    Max error: 0.000092
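
The harness behind these numbers is sketched below. This is a minimal
illustration rather than the exact code in
[first_attempt.cu](https://github.com/openmlsys/openmlsys-cuda/blob/main/first_attempt.cu):
the matrix sizes, the row-major Eigen storage, and timing a single
launch (the real program averages over repeated runs) are assumptions,
and `gemmKernel` and `gemmNaive` from above are presumed to be defined
in the same file.

```cpp
#include <cstdio>
#include <cuda_runtime.h>
#include <Eigen/Core>

// Row-major storage so that A.data() matches the kernel's
// A[m * K + k] indexing; Eigen defaults to column-major.
using RowMajorMatrix =
    Eigen::Matrix<float, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor>;

int main() {
  const unsigned M = 1024, N = 1024, K = 1024;  // assumed sizes
  const float alpha = 1.0f, beta = 0.0f;
  RowMajorMatrix A = RowMajorMatrix::Random(M, K);
  RowMajorMatrix B = RowMajorMatrix::Random(K, N);
  RowMajorMatrix C = RowMajorMatrix::Zero(M, N);
  RowMajorMatrix ref = alpha * (A * B) + beta * C;  // CPU reference

  float *dA, *dB, *dC;
  cudaMalloc(&dA, M * K * sizeof(float));
  cudaMalloc(&dB, K * N * sizeof(float));
  cudaMalloc(&dC, M * N * sizeof(float));
  cudaMemcpy(dA, A.data(), M * K * sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(dB, B.data(), K * N * sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(dC, C.data(), M * N * sizeof(float), cudaMemcpyHostToDevice);

  // Time one kernel launch with CUDA events.
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  cudaEventRecord(start);
  gemmNaive(dA, dB, dC, alpha, beta, M, N, K);
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);

  // Copy the result back, compare against the CPU reference, and
  // report achieved throughput (2 * M * N * K FLOPs per GEMM).
  cudaMemcpy(C.data(), dC, M * N * sizeof(float), cudaMemcpyDeviceToHost);
  float maxError = (C - ref).cwiseAbs().maxCoeff();
  double gflops = 2.0 * M * N * K / (ms * 1e6);
  std::printf("Time: %.3f ms\n", ms);
  std::printf("Max error: %f\n", maxError);
  std::printf("Throughput: %.3f GFLOPS\n", gflops);
  return 0;
}
```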

The peak GPU throughput can be approximated with the following
formula: 2 $\times$ frequency $\times$ number of single-precision
compute units, where the factor of 2 counts a fused multiply-add as
two floating-point operations. The number of single-precision compute
units equals the number of SMs in the GPU multiplied by the number of
single-precision compute units in each SM. The results are as follows:

    FP32 peak throughput 29767.680 GFLOPS
    Average Throughput: 185.313 GFLOPS
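
As a hedged illustration of this estimate, the sketch below queries the
device properties at run time. The `coresPerSM` value is an assumption:
`cudaDeviceProp` does not report the number of FP32 units per SM, so it
must be looked up for the target architecture (128 holds for Ampere
GA10x GPUs, for example).

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);
  // Assumption: FP32 compute units per SM; architecture-dependent.
  const int coresPerSM = 128;
  // prop.clockRate is in kHz; the factor 2 counts an FMA as 2 FLOPs.
  double peakGflops =
      2.0 * (prop.clockRate * 1e-6) * prop.multiProcessorCount * coresPerSM;
  std::printf("FP32 peak throughput %.3f GFLOPS\n", peakGflops);
  return 0;
}
```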

A significant gap exists between the performance achieved by the
current code and the peak device performance: the measured throughput
is roughly 0.6% of peak. In the entire computing process, the stage
with the highest computational density is the matrix multiplication
$A\times B$. Its time complexity is $O(M*N*K)$, whereas the time
complexity of the entire computing process is $O(M*N*K+2*M*N)$.
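Spelling out the operation count behind these asymptotics (the split of
the $2*M*N$ term is read off the kernel above, which scales by $\alpha$
and then applies the $\beta C$ update):

$$O(\underbrace{M*N*K}_{\text{inner products}} + \underbrace{M*N}_{\alpha\text{-scaling}} + \underbrace{M*N}_{\beta C\ \text{update}}) = O(M*N*K + 2*M*N)$$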
Therefore, optimizing the matrix multiplication is key to improving
performance.
