}
```

Each element of matrix $C$ is computed independently, so numerous GPU
threads can be launched to compute the corresponding elements of
matrix $C$ in parallel. The GPU kernel function is shown in
Code `lst:gpu`.

**lst:gpu**
```cpp
__global__ void gemmKernel(const float *A, const float *B, float *C,
                           float alpha, float beta, unsigned M,
                           unsigned N, unsigned K) {
  // Row and column of the element of C computed by this thread.
  unsigned int m = threadIdx.x + blockDim.x * blockIdx.x;
  unsigned int n = threadIdx.y + blockDim.y * blockIdx.y;
  if (m >= M || n >= N)
    return;
  // Inner product of row m of A and column n of B.
  float c = 0;
  for (unsigned k = 0; k < K; ++k) {
    c += A[m * K + k] * B[k * N + n];
  }
  c = c * alpha;
  float result = c;
  if (beta != 0) {
    result = result + C[m * N + n] * beta;
  }
  C[m * N + n] = result;
}
```

Figure :numref:`cuda_naive_gemm` shows the layout of the
implementation: each element of matrix $C$ is computed by one thread.
The kernel first computes the row index $m$ and column index $n$ of
the element of $C$ assigned to the thread from its block and thread
indices. The thread then loads row $m$ of matrix $A$ and column $n$ of
matrix $B$, computes their inner product, and finally stores the
scaled result back to matrix $C$.


:label:`cuda_naive_gemm`

The method of launching the kernel function is shown in
Code `lst:launch`.

**lst:launch**
```cpp
void gemmNaive(const float *A, const float *B, float *C,
               float alpha, float beta, unsigned M,
               unsigned N, unsigned K) {
  // Each block computes a 16x16 tile of C; the grid size is rounded
  // up so that all M x N elements are covered even when M or N is not
  // a multiple of 16.
  dim3 block(16, 16);
  dim3 grid((M - 1) / block.x + 1, (N - 1) / block.y + 1);

  gemmKernel<<<grid, block>>>(A, B, C, alpha, beta, M, N, K);
}
```

Each thread block processes a $16 \times 16$ tile of matrix $C$.
Therefore, $\lceil M/16 \rceil \times \lceil N/16 \rceil$ thread blocks
are used to compute the entire matrix $C$; the expression
$(M - 1)/16 + 1$ in the code implements this ceiling division in
integer arithmetic.

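For illustration, with hypothetical sizes $M = N = 1000$ (chosen here
only as an example, not taken from the benchmark below), the launch
would use

$$\left\lceil \frac{1000}{16} \right\rceil \times \left\lceil \frac{1000}{16} \right\rceil = 63 \times 63 = 3969$$

thread blocks of $16 \times 16 = 256$ threads each.
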
Eigen is used to generate the input data and to compute the reference
GEMM result on the CPU. In addition, code for measuring the error of
the GPU result and profiling its execution time is implemented. For
details, see
[first_attempt.cu](https://github.com/openmlsys/openmlsys-cuda/blob/main/first_attempt.cu).
After the program is compiled and executed, the output is as follows:

    Average time: 48.961 ms
    Max error: 0.000092

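The `Max error` figure above could be produced by a comparison along
the following lines; this is a rough sketch with illustrative names,
not the actual code of `first_attempt.cu`. Row-major Eigen matrices
are used so that the CPU reference matches the kernel's
`A[m * K + k]` indexing:

```cpp
#include <Eigen/Dense>

// Row-major storage matches the linear indexing used by the kernel.
using RowMajorMatrix =
    Eigen::Matrix<float, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor>;

// Hypothetical helper: largest absolute element-wise deviation of the
// GPU result from the Eigen CPU reference alpha * A * B + beta * C.
float maxError(const RowMajorMatrix &gpuResult, const RowMajorMatrix &A,
               const RowMajorMatrix &B, const RowMajorMatrix &C,
               float alpha, float beta) {
  RowMajorMatrix ref = alpha * (A * B) + beta * C;
  return (gpuResult - ref).cwiseAbs().maxCoeff();
}
```
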
The peak GPU throughput can be approximated by using the following
formula: 2 $\times$ frequency $\times$ number of single-precision
compute units, where the factor of 2 accounts for the multiply and the
add of a fused multiply-add (FMA) counting as two floating-point
operations. The number of single-precision compute units equals the
number of SMs in the GPU multiplied by the number of single-precision
compute units in each SM. The results are as follows:

    FP32 peak throughput 29767.680 GFLOPS
    Average Throughput: 185.313 GFLOPS

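As a rough sketch, such a peak-throughput estimate could be computed
with the CUDA runtime API as follows; the value of 128 FP32 units per
SM is an assumption (it holds for many Ampere-class GPUs but is
architecture-dependent), and `cudaDeviceProp::clockRate` reports the
clock in kHz:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  cudaDeviceProp prop{};
  cudaGetDeviceProperties(&prop, 0);
  // Assumed number of FP32 units per SM; architecture-dependent
  // (128 on many Ampere GPUs) -- adjust for the device at hand.
  const double fp32UnitsPerSM = 128.0;
  // clockRate is in kHz; the factor 2 counts the multiply and the add
  // of one FMA as two floating-point operations.
  double peakGFlops = 2.0 * prop.clockRate * 1e3 *
                      prop.multiProcessorCount * fp32UnitsPerSM / 1e9;
  std::printf("FP32 peak throughput %.3f GFLOPS\n", peakGFlops);
  return 0;
}
```
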
A significant gap exists between the performance achieved by the
current code and the peak device performance. Within the entire
computation, the part with the highest computational density is the
matrix multiplication $A \times B$: its time complexity is
$O(M \times N \times K)$, whereas the time complexity of the entire
computation is $O(M \times N \times K + 2 \times M \times N)$.
Therefore, optimizing the matrix multiplication is key to improving
performance.

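To make this dominance concrete, counting one fused multiply-add as
two floating-point operations, the work decomposes roughly as

$$\underbrace{2MNK}_{A \times B} + \underbrace{2MN}_{\text{scaling by } \alpha,\, \beta \text{ and accumulation}} = 2MN(K + 1),$$

so the $A \times B$ term accounts for a fraction $K/(K+1)$ of the
total operations and dominates for any realistic $K$.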