:label:`use_tile`

The test results are as follows:

```
Max Error: 0.000092
Average Time: 6.232 ms, Average Throughput: 1378.317 GFLOPS
```

To sample and analyze performance indicators, we use Nsight Compute, a
profiling tool released by NVIDIA. Designed for GPU kernel functions, it
samples and collects GPU activity data by hooking the driver. The
following command can be used to analyze the performance:

```bash
ncu --set full -o <profile_output_file> <profile_process>
```

`--set full` indicates that all data is sampled. `-o` indicates that the
result is output as a file. `<profile_output_file>` indicates the output
file name without the file name extension. `<profile_process>` indicates
the executable file to be analyzed and its arguments. For example, to
analyze `first_attempt` and name the output result
`first_attempt_prof_result`, run the following command:

```bash
ncu --set full -o first_attempt_prof_result ./first_attempt
```

If the system displays a message indicating that you do not have
permission to run this command, prefix it with `sudo` and run it again.
After obtaining the output file, the program `nv-nsight-cu` can be used
to view the file. We then compare the profiling results of the new GPU
kernel function with those of the previous one.

The result shows that the number of `LDG` instructions decreases by 84%,
and the value of `Stall LG Throttle` decreases by 33%. By using wide
instructions to increase the compute density, we are able to reduce the
number of global load/store instructions, thereby cutting the amount of
time needed to wait before issuing instructions. The improvement in
`Arithmetic Intensity` confirms that our analysis of the arithmetic
intensity is correct. The test results of `gemm_use_tile.cu` are as
follows:

```
Max Error: 0.000092
Average Time: 3.188 ms, Average Throughput: 2694.440 GFLOPS
```

The analysis using Nsight Compute also shows that this version improves
other indicators, such as `Stall LG Throttle`.

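To make the idea of wide instructions concrete, here is a minimal,
hypothetical sketch (not the `gemm_use_tile.cu` kernel itself): by
reinterpreting `float` pointers as `float4`, each thread moves four
floats with a single 128-bit load and store, so the kernel issues a
quarter of the global load/store instructions that a scalar version
would.

```cpp
// Illustrative only: each thread copies four consecutive floats with one
// 128-bit load and one 128-bit store. The pointers must be 16-byte aligned
// and n must be a multiple of 4.
__global__ void copyWide(const float *__restrict__ src,
                         float *__restrict__ dst, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;  // index in units of float4
  if (i * 4 < n) {
    // The compiler can emit LDG.E.128 / STG.E.128 for these accesses,
    // cutting the number of load/store instructions by a factor of four.
    float4 v = reinterpret_cast<const float4 *>(src)[i];
    reinterpret_cast<float4 *>(dst)[i] = v;
  }
}
```
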
## Caching Data in Shared Memory

By increasing the amount of data that a thread can load in one go, we
can improve the arithmetic intensity and performance. However, this
method decreases the degree of parallelism because it reduces the total
number of enabled threads. Other hardware features need to be exploited
in order to improve performance without compromising the degree of
parallelism. In earlier code, several thread blocks are enabled, each of
which processes one or more matrix blocks in matrix $C$. As shown in
Figure :numref:`duplicated_data`, thread $x$ and thread $y$ process the
same row in matrix $C$, so they load the same data from matrix $A$. The
shared memory can be used to improve the program throughput by enabling
different threads in the same thread block to load unique data and reuse
shared data.

![Threads loading redundant data](../img/ch06/practise/duplicated_data.png)
:label:`duplicated_data`

We have previously mentioned that the inner product can be computed by
loading and accumulating data in $K$ loops. Specifically, in each loop,
threads that process the same row in matrix $C$ load the same data from
matrix $A$, and threads that process the same column in matrix $C$ load
the same data from matrix $B$. To exploit this reuse, the $K$ loops are
divided into $\frac{K}{tileK}$ outer loops and $tileK$ inner loops. In
this way, an entire block of data is loaded in each outer loop and
accumulated over the inner loops.
Figure :numref:`use_smem_store` shows the process of moving data from the
global memory to the shared memory. Before the inner loops start, the
entire `tile` blocks of matrix $A$ and matrix $B$ are stored in the
shared memory.

Figure :numref:`use_smem_load` shows the process of moving data from the
shared memory to the register. In each inner loop, data is loaded from
the shared memory and computed. An advantage of this design is that each
thread does not need to load all the data it requires from the global
memory. Instead, the entire thread block loads the data required for all
threads from the global memory and stores the data in the shared memory.
During computation, each thread only needs to load the data it requires
from the shared memory.

![Writing data to the shared memory](../img/ch06/practise/use_smem_store.png)
:label:`use_smem_store`

![Loading data from the shared memory](../img/ch06/practise/use_smem_load.png)
:label:`use_smem_load`

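The following simplified sketch shows this structure: in each outer
loop, the thread block cooperatively stages one tile of matrix $A$ and
one tile of matrix $B$ in the shared memory, and the inner loops then
read operands only from the shared memory. The tile size, the
one-output-element-per-thread mapping, and the kernel name are
illustrative simplifications of ours, not the actual `gemm_use_smem.cu`
implementation.

```cpp
// Simplified tiled GEMM (row-major A: MxK, B: KxN, C: MxN), one C element
// per thread. Launch with dim3 block(TILE, TILE) and grid(N / TILE, M / TILE).
constexpr int TILE = 16;

__global__ void gemmSmemSketch(const float *A, const float *B, float *C,
                               int M, int N, int K) {
  __shared__ float tileA[TILE][TILE];
  __shared__ float tileB[TILE][TILE];

  int row = blockIdx.y * TILE + threadIdx.y;
  int col = blockIdx.x * TILE + threadIdx.x;
  float acc = 0.f;

  // Outer loops: the whole thread block stages one tile of A and one tile
  // of B in the shared memory.
  for (int k0 = 0; k0 < K; k0 += TILE) {
    tileA[threadIdx.y][threadIdx.x] =
        (row < M && k0 + threadIdx.x < K) ? A[row * K + k0 + threadIdx.x] : 0.f;
    tileB[threadIdx.y][threadIdx.x] =
        (k0 + threadIdx.y < K && col < N) ? B[(k0 + threadIdx.y) * N + col] : 0.f;
    __syncthreads();

    // Inner loops: every operand now comes from the shared memory, so each
    // element of A and B is read from the global memory once per thread
    // block instead of once per thread.
    for (int kk = 0; kk < TILE; ++kk)
      acc += tileA[threadIdx.y][kk] * tileB[kk][threadIdx.x];
    __syncthreads();
  }
  if (row < M && col < N)
    C[row * N + col] = acc;
}
```
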
For details about the complete code, see
[gemm_use_smem.cu](https://github.com/openmlsys/openmlsys-cuda/blob/main/gemm_use_smem.cu).

The test results are as follows:

```
Max Error: 0.000092
Average Time: 0.617 ms, Average Throughput: 13925.168 GFLOPS
```

Again, we use Nsight Compute to profile the kernel function and compare
the results with the previous ones. The analysis shows some major
improvements. Specifically, the number of `LDG` instructions decreases
by 97%, which is consistent with this design, and the value of
`SM Utilization` increases by 218%, which shows that using the shared
memory reduces the memory access latency and improves the memory
utilization. Furthermore, other indicators such as
`Pipe Fma Cycles Active` also improve significantly, demonstrating the
benefits of the shared memory.

## Reducing Register Usage

In previous sections, the data blocks that store matrix $A$ in the
shared memory are arranged in a row-first manner, and the shared memory
is loaded by row. We can instead adopt a column-first manner in order to
reduce loops and loop variables, thereby reducing the number of
registers and improving performance.

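The sketch below illustrates the column-first idea under simplifying
assumptions of ours (the tile sizes, the 4x1 strip of $C$ computed per
thread, and the requirement that $M$, $N$, and $K$ be multiples of the
tile sizes); it is not the exact layout used in
`gemm_transpose_smem.cu`. Because the tile of $A$ is stored as `[k][m]`,
the four $A$ values a thread needs at each step are contiguous in the
shared memory and can be fetched with a single vector read, removing the
extra index arithmetic and loop variables of the row-first version.

```cpp
// Column-first (transposed) A tile: tileA[k][m] instead of tileA[m][k].
// Launch with dim3 block(TILE_N, TILE_M / 4) and grid(N / TILE_N, M / TILE_M);
// M, N, and K must be multiples of the tile sizes.
constexpr int TILE_M = 64, TILE_N = 16, TILE_K = 16;

__global__ void gemmColFirstATile(const float *A, const float *B, float *C,
                                  int M, int N, int K) {
  __shared__ __align__(16) float tileA[TILE_K][TILE_M];  // [k][m] layout
  __shared__ float tileB[TILE_K][TILE_N];

  const int tid = threadIdx.y * TILE_N + threadIdx.x;  // 0..255
  const int rowBase = blockIdx.y * TILE_M;
  const int col = blockIdx.x * TILE_N + threadIdx.x;
  float acc[4] = {0.f, 0.f, 0.f, 0.f};  // each thread owns a 4x1 strip of C

  for (int k0 = 0; k0 < K; k0 += TILE_K) {
    // Cooperative copy with swapped indices: k is the slow dimension of the
    // shared array, so consecutive m values stay contiguous.
    for (int i = 0; i < 4; ++i) {
      int idx = tid * 4 + i;                  // 0..1023
      int k = idx / TILE_M, m = idx % TILE_M;
      tileA[k][m] = A[(rowBase + m) * K + k0 + k];
    }
    tileB[threadIdx.y][threadIdx.x] = B[(k0 + threadIdx.y) * N + col];
    __syncthreads();

    for (int kk = 0; kk < TILE_K; ++kk) {
      // One contiguous 128-bit read fetches the 4 A values for this step.
      float4 a = *reinterpret_cast<const float4 *>(&tileA[kk][threadIdx.y * 4]);
      float b = tileB[kk][threadIdx.x];
      acc[0] += a.x * b; acc[1] += a.y * b;
      acc[2] += a.z * b; acc[3] += a.w * b;
    }
    __syncthreads();
  }
  for (int i = 0; i < 4; ++i)
    C[(rowBase + threadIdx.y * 4 + i) * N + col] = acc[i];
}
```
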
For details about the complete code, see
[gemm_transpose_smem.cu](https://github.com/openmlsys/openmlsys-cuda/blob/main/gemm_transpose_smem.cu).

The test results are as follows:

```
Max Error: 0.000092
Average Time: 0.610 ms, Average Throughput: 14083.116 GFLOPS
```

Analysis by Nsight Compute shows that `Occupancy` increases by 1.3%.
This is because only 111 registers are used (17 fewer than used by the
previous GPU kernel function). The benefit of reducing the number of
registers varies depending on the GPU architecture. Observations have
shown that the number of `STS` instructions increases and bank conflicts
occur, meaning that using fewer registers may not have a positive impact
on other GPU architectures.

## Hiding Shared Memory Loading Latency

To load data from the shared memory, a GPU uses the `LDS` instruction.
After issuing this instruction, the GPU will execute the following
instructions without waiting for the data to be loaded into the register
unless those instructions require such data. In the previous section,
each time this instruction is issued during the $tileK$ inner loops, the
mathematical operation that requires the loaded data is performed
immediately. However, the compute unit then has to wait for the data to
be loaded from the shared memory, as shown in
Figure :numref:`use_smem_pipeline`. Accessing the shared memory may take
dozens of clock cycles, whereas computation instructions can often be
completed within only a few clock cycles. In order to significantly
accelerate memory access, we can hide the shared memory loading latency
by optimizing the pipeline. Specifically, during the $tileK$ inner
loops, the loading instructions that prepare the data for the next loop
can be issued at the beginning of each loop, as shown in
Figure :numref:`hide_smem_latency`. Because the computation instructions
in the current loop do not require the data of the next loop, their
execution will not be blocked by the instructions that load that data.

![Pipeline of the previous GPU kernel function](../img/ch06/practise/use_smem_pipeline.png)
:label:`use_smem_pipeline`

![Pipeline that hides the shared memory loading latency](../img/ch06/practise/hide_smem_latency.png)
:label:`hide_smem_latency`

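The sketch below applies this idea to the simple tiled kernel shown
earlier; it is an illustration rather than the
`gemm_hide_smem_latency.cu` kernel. The operands of step `kk + 1` are
requested from the shared memory before the multiply-add of step `kk` is
issued, so the `LDS` latency overlaps with computation.

```cpp
// Tiled GEMM with a software-pipelined inner loop: a two-entry register
// buffer holds the operands of the current step and of the next step.
constexpr int TILE = 16;

__global__ void gemmSmemPipelined(const float *A, const float *B, float *C,
                                  int M, int N, int K) {
  __shared__ float tileA[TILE][TILE];
  __shared__ float tileB[TILE][TILE];

  int row = blockIdx.y * TILE + threadIdx.y;
  int col = blockIdx.x * TILE + threadIdx.x;
  float acc = 0.f;

  for (int k0 = 0; k0 < K; k0 += TILE) {
    tileA[threadIdx.y][threadIdx.x] =
        (row < M && k0 + threadIdx.x < K) ? A[row * K + k0 + threadIdx.x] : 0.f;
    tileB[threadIdx.y][threadIdx.x] =
        (k0 + threadIdx.y < K && col < N) ? B[(k0 + threadIdx.y) * N + col] : 0.f;
    __syncthreads();

    // Prefetch the operands of step 0 before entering the inner loop.
    float aFrag[2] = {tileA[threadIdx.y][0], 0.f};
    float bFrag[2] = {tileB[0][threadIdx.x], 0.f};
    for (int kk = 0; kk < TILE; ++kk) {
      int cur = kk & 1, nxt = cur ^ 1;
      if (kk + 1 < TILE) {
        // Issue the shared memory loads for the next step first; nothing
        // below depends on them, so the compute unit is not stalled.
        aFrag[nxt] = tileA[threadIdx.y][kk + 1];
        bFrag[nxt] = tileB[kk + 1][threadIdx.x];
      }
      acc += aFrag[cur] * bFrag[cur];  // uses data requested one step earlier
    }
    __syncthreads();
  }
  if (row < M && col < N)
    C[row * N + col] = acc;
}
```
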
For details about the complete code, see
[gemm_hide_smem_latency.cu](https://github.com/openmlsys/openmlsys-cuda/blob/main/gemm_hide_smem_latency.cu).

The test results are as follows:

```
Max Error: 0.000092
Average Time: 0.585 ms, Average Throughput: 14686.179 GFLOPS
```

Analysis by Nsight Compute shows that the value of
`Stall Short Scoreboard` decreases by 67% when compared with that of the
previous GPU kernel function. As mentioned before, after a GPU memory
load/store instruction is issued, the GPU executes the next instruction
without waiting for the data to land in the register. However, it sets a
flag on the scoreboard and resets the flag once the data has landed. If
instructions that require such data need to be executed, the GPU will
execute them only after the data has landed. The decrease of
`Stall Short Scoreboard` demonstrates that hiding the access latency of
the shared memory is an effective method to better utilize the GPU.

## Hiding Global Memory Loading Latency

To load data from the global memory, a GPU uses the `LDG` instruction,
the behavior of which is similar to the `LDS` instruction used to load
data from the shared memory as discussed in the previous section. At the
beginning of each of the $\frac{K}{tileK}$ outer loops, instructions
that load the data tiles in matrix $A$ for the next loop are issued.
Because this data is not required by any inner loop in the current outer
loop, the computation in the inner loops will not wait for the read
instructions to be completed, thereby hiding the global memory loading
latency. We can also defer writing the data in `buffer` to `tile` until
the last of the inner loops, i.e., after $tileK - 1$ loops have been
executed, further reducing the latency of writing data to `tile`.
Figure :numref:`hide_global_latency` shows the optimized pipeline.

![Pipeline that hides the global memory loading latency](../img/ch06/practise/hide_global_latency.png)
:label:`hide_global_latency`

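The following sketch illustrates the prefetching scheme with a
double-buffered variant of the simple tiled kernel. It assumes that $M$,
$N$, and $K$ are multiples of the tile size and writes the prefetched
data to a second shared memory buffer after the inner loops, rather than
folding the write into the last inner loop as described above, so it
should be read as an illustration rather than the `gemm_final.cu`
kernel.

```cpp
// Tiled GEMM that prefetches the next K-tile from the global memory into
// per-thread registers while the current tiles are being consumed.
// Launch with dim3 block(TILE, TILE) and grid(N / TILE, M / TILE).
constexpr int TILE = 16;

__global__ void gemmGlobalPrefetch(const float *A, const float *B, float *C,
                                   int M, int N, int K) {
  __shared__ float tileA[2][TILE][TILE];  // two alternating shared buffers
  __shared__ float tileB[2][TILE][TILE];

  int row = blockIdx.y * TILE + threadIdx.y;
  int col = blockIdx.x * TILE + threadIdx.x;
  float acc = 0.f;

  // Stage the first K-tile in shared buffer 0.
  tileA[0][threadIdx.y][threadIdx.x] = A[row * K + threadIdx.x];
  tileB[0][threadIdx.y][threadIdx.x] = B[threadIdx.y * N + col];
  __syncthreads();

  int cur = 0;
  for (int k0 = 0; k0 < K; k0 += TILE) {
    float bufA = 0.f, bufB = 0.f;
    if (k0 + TILE < K) {
      // Issue the global loads for the next K-tile now; the inner loop
      // below does not depend on them, so it is not stalled by their latency.
      bufA = A[row * K + k0 + TILE + threadIdx.x];
      bufB = B[(k0 + TILE + threadIdx.y) * N + col];
    }

    for (int kk = 0; kk < TILE; ++kk)
      acc += tileA[cur][threadIdx.y][kk] * tileB[cur][kk][threadIdx.x];

    if (k0 + TILE < K) {
      // Only here is the prefetched data consumed: it is written to the
      // other shared buffer, which the next iteration will read.
      tileA[cur ^ 1][threadIdx.y][threadIdx.x] = bufA;
      tileB[cur ^ 1][threadIdx.y][threadIdx.x] = bufB;
      __syncthreads();
      cur ^= 1;
    }
  }
  C[row * N + col] = acc;
}
```
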
For details about the complete code, see
[gemm_final.cu](https://github.com/openmlsys/openmlsys-cuda/blob/main/gemm_final.cu).

The test results are as follows:

```
Max Error: 0.000092
Average Time: 0.542 ms, Average Throughput: 15838.302 GFLOPS
```

Similar to the `Stall Short Scoreboard` results obtained in the previous
section, analysis by Nsight Compute shows that the value of
`Stall Long Scoreboard` (a global memory indicator) decreases by 67%.
Such a significant decrease demonstrates that prefetching data can hide
the global memory loading latency.

## Performance Optimization Principles

So far, we have discussed various methods to enhance the performance of
an accelerator. Even though other methods exist, performance
optimization generally adheres to the following principles:

- Increasing parallelism through resource mapping: Multi-level
  parallel resources (`blocks`, `warps`, and `threads`) are mapped to
  the data needing computation and transfer to enhance program
  parallelism.

- Reducing memory access latency through memory structure
  optimization: When data is reused within the same `block` during
  computation, the reused data is stored in fast on-chip memory (such
  as the shared memory and registers) to increase locality.

- Reducing the instruction issue overhead by optimizing instruction
  execution: The `#pragma unroll` directive is used to unroll loops in
  order to improve the degree of parallelism at the instruction level
  and reduce logic judgments. The vectorized load instruction is used to
  increase bandwidth. For the Ampere architecture, the widest vectorized
  load instruction is `LDG.E.128`, and the corresponding data type for
  loading is `float4` (a short sketch follows this list).

- Hiding load/store latency by optimizing the memory access pipeline:
  When the in-memory data undergoes modifications (such as the movement
  of matrix data), we can optimize the memory access pipeline so that
  the accelerator performs computations during the intervals between
  data movements, thereby concealing the latency associated with data
  movement.
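
As a small, hypothetical illustration of the third point above (not
taken from the openmlsys-cuda kernels), the directive below asks the
compiler to fully unroll a fixed-trip-count loop, removing the loop
counter and branch and exposing independent multiply-add instructions to
the scheduler:

```cpp
// Hypothetical helper: the trip count is known at compile time, so the loop
// is fully unrolled and no loop counter or branch instructions remain.
__device__ float dot8(const float *a, const float *b) {
  float acc = 0.f;
#pragma unroll
  for (int k = 0; k < 8; ++k)
    acc += a[k] * b[k];
  return acc;
}
```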