Commit 7c29f6a

debug
1 parent ed14f82 commit 7c29f6a

File tree

1 file changed: +255 -0 lines changed


chapter_accelerator/Performance_Optimization_Methods.md

Lines changed: 255 additions & 0 deletions
@@ -192,3 +192,258 @@ details about the code used to achieve this, see
:label:`use_tile`

The test results are as follows:

```
Max Error: 0.000092
Average Time: 6.232 ms, Average Throughput: 1378.317 GFLOPS
```

To sample and analyze performance indicators, we will use the analysis tool Nsight Compute released by NVIDIA. This tool, designed for GPU kernel functions, samples and collects GPU activity data by hooking drivers. The following command can be used to analyze the performance:

```bash
ncu --set full -o <profile_output_file> <profile_process>
```

`--set full` indicates that all data is sampled. `-o` indicates that the result is output as a file. `<profile_output_file>` indicates the output file name without the file name extension. `<profile_process>` indicates the executable file to be analyzed and its arguments. For example, to analyze `first_attempt` and name the output result `first_attempt_prof_result`, run the following command:

```bash
ncu --set full -o first_attempt_prof_result ./first_attempt
```

If the system displays a message indicating that you do not have permission to run this command, prefix it with `sudo` and run it again.

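For example, with the output name used above, the elevated command is:

```bash
# Same profiling command as before, re-run with elevated privileges.
sudo ncu --set full -o first_attempt_prof_result ./first_attempt
```
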
After obtaining the output file, the program `nv-nsight-cu` can be used to view it. We then compared the profiling results of the new GPU kernel function with those of the previous one.

The result shows that the number of `LDG` instructions decreases by 84%, and the value of `Stall LG Throttle` decreases by 33%. By using wide instructions to increase the compute density, we are able to reduce the number of global load/store instructions, thereby cutting the time spent waiting before instructions can be issued. The improvement in `Arithmetic Intensity` confirms that our analysis of the arithmetic intensity is correct. The `gemm_use_tile.cu` test results are as follows:

```
Max Error: 0.000092
Average Time: 3.188 ms, Average Throughput: 2694.440 GFLOPS
```

The analysis using Nsight Compute shows that the code also improves other indicators, such as `Stall LG Throttle`.

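The idea behind the wide-load optimization can be sketched as follows. This is not the code of `gemm_use_tile.cu` itself: the kernel name and indexing are illustrative, and it assumes 16-byte-aligned pointers and an element count that is a multiple of four.

```cpp
// Illustrative sketch: one float4 access replaces four scalar LDG/STG
// instructions, raising the compute density per memory instruction.
__global__ void copyWide(const float *__restrict__ src,
                         float *__restrict__ dst, int n) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;  // each thread owns 4 floats
  if (idx * 4 + 3 < n) {
    // One 128-bit load (LDG.E.128) instead of four scalar loads.
    float4 v = reinterpret_cast<const float4 *>(src)[idx];
    // ... per-element arithmetic would use v.x, v.y, v.z and v.w here ...
    reinterpret_cast<float4 *>(dst)[idx] = v;  // one 128-bit store
  }
}
```
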
## Caching Data in Shared Memory

By increasing the amount of data that a thread loads in one go, we can improve the arithmetic intensity and performance. However, this method decreases the degree of parallelism because it reduces the total number of enabled threads. Other hardware features need to be exploited in order to improve performance without compromising the degree of parallelism. In the earlier code, several thread blocks are enabled, each of which processes one or more matrix blocks in matrix $C$. As shown in Figure :numref:`duplicated_data`, thread $x$ and thread $y$ process the same row in matrix $C$, so they load the same data from matrix $A$. The shared memory can be used to improve the program throughput by having different threads in the same thread block load distinct data and then reuse the shared data.

![Threads loading redundant data](../img/ch06/practise/duplicated_data.png)
:label:`duplicated_data`

We have previously mentioned that the inner product can be computed by loading and accumulating data in $K$ loops. Specifically, in each loop, threads that process the same row in matrix $C$ load the same data from matrix $A$, and threads that process the same column in matrix $C$ load the same data from matrix $B$. To exploit this, the code is restructured by dividing the $K$ loops into $\frac{K}{tileK}$ outer loops and $tileK$ inner loops. In this way, an entire block of data is loaded in each outer loop and accumulated in the inner loops. Figure :numref:`use_smem_store` shows the process of moving data from the global memory to the shared memory. Before the inner loops start, the entire tiles of matrix $A$ and matrix $B$ are stored in the shared memory.

Figure :numref:`use_smem_load` shows the process of moving data from the shared memory to the registers. In each inner loop, data is loaded from the shared memory and computed. An advantage of this design is that each thread does not need to load all the data it requires from the global memory. Instead, the entire thread block loads the data required by all of its threads from the global memory and stores that data in the shared memory. During computation, each thread only needs to load the data it requires from the shared memory.

![Writing data to the shared memory](../img/ch06/practise/use_smem_store.png)
:label:`use_smem_store`

![Loading data from the shared memory](../img/ch06/practise/use_smem_load.png)
:label:`use_smem_load`

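A minimal sketch of this tiling scheme is shown below. It is a simplified stand-in for the real kernel in `gemm_use_smem.cu` (linked right after this sketch): the kernel name, `tileA`/`tileB` and the thread-to-data mapping are illustrative, bounds checks are omitted, and it assumes `blockDim == (tileN, tileM)` with `tileK` no larger than either tile dimension and matrix sizes that are multiples of the tile sizes.

```cpp
// Simplified sketch of caching tiles of A and B in shared memory.
template <int tileM, int tileN, int tileK>
__global__ void gemmSmemSketch(const float *A, const float *B, float *C,
                               int M, int N, int K) {
  __shared__ float tileA[tileM][tileK];  // tile of A staged per outer loop
  __shared__ float tileB[tileK][tileN];  // tile of B staged per outer loop

  int row = blockIdx.y * tileM + threadIdx.y;  // row of C owned by this thread
  int col = blockIdx.x * tileN + threadIdx.x;  // column of C owned by this thread
  float acc = 0.f;

  for (int k0 = 0; k0 < K; k0 += tileK) {      // K / tileK outer loops
    // Cooperative staging: each thread copies at most one element per tile.
    if (threadIdx.x < tileK)
      tileA[threadIdx.y][threadIdx.x] = A[row * K + k0 + threadIdx.x];
    if (threadIdx.y < tileK)
      tileB[threadIdx.y][threadIdx.x] = B[(k0 + threadIdx.y) * N + col];
    __syncthreads();  // the tiles are now visible to the whole thread block

    for (int kk = 0; kk < tileK; ++kk)         // tileK inner loops
      acc += tileA[threadIdx.y][kk] * tileB[kk][threadIdx.x];
    __syncthreads();  // reads must finish before the tiles are overwritten
  }
  C[row * N + col] = acc;
}
```
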
289+
290+
For details about the complete code, see
291+
[gemm_use_smem.cu](https://github.com/openmlsys/openmlsys-cuda/blob/main/gemm_use_smem.cu).
292+
293+
The test results are as follows:
294+
295+
```
296+
Max Error: 0.000092
297+
Average Time: 0.617 ms, Average Throughput: 13925.168 GFLOPS
298+
```
299+
300+
Again, we use Nsight Compute to profile the kernel function and compare
301+
the results with the previous ones. The analysis shows some major
302+
improvements. Specifically, the number of `LDG` instructions decreases
303+
by 97%, which is consistent with this design. And the value of
304+
`SM Utilization` increases by 218%, which proves that using the shared
305+
memory can reduce the memory access latency and improve the memory
306+
utilization. Furthermore, the performance of other indicators such as
307+
`Pipe Fma Cycles Active` also improves significantly, demonstrating the
308+
benefits of the shared memory.
309+
310+
## Reducing Register Usage

In previous sections, the data blocks that store matrix $A$ in the shared memory are arranged in a row-first manner, and the shared memory is loaded by row. We can instead adopt a column-first manner in order to reduce loops and loop variables, thereby reducing the number of registers and improving performance.

For details about the complete code, see
[gemm_transpose_smem.cu](https://github.com/openmlsys/openmlsys-cuda/blob/main/gemm_transpose_smem.cu).

The test results are as follows:

```
Max Error: 0.000092
Average Time: 0.610 ms, Average Throughput: 14083.116 GFLOPS
```

Analysis by Nsight Compute shows that `Occupancy` increases by 1.3%. This is because only 111 registers are used (17 fewer than used by the previous GPU kernel function). The benefit of reducing the number of registers varies depending on the GPU architecture. Observations have shown that the number of `STS` instructions increases and bank conflicts occur, meaning that using fewer registers may not have a positive impact on other GPU architectures.

## Hiding Shared Memory Loading Latency

To load data from the shared memory, a GPU uses the `LDS` instruction. After issuing this instruction, the GPU executes the following instructions without waiting for the data to be loaded into the registers, unless those instructions require the data. In the previous section, each time this instruction is issued during the $tileK$ inner loops, the mathematical operation that requires the loaded data is performed immediately, so the compute unit has to wait for the data to arrive from the shared memory, as shown in Figure :numref:`use_smem_pipeline`. Accessing the shared memory may take dozens of clock cycles, whereas computation instructions can often be completed within only a few clock cycles. To significantly accelerate memory access, we can hide the shared memory loading latency by optimizing the pipeline. Specifically, during the $tileK$ inner loops, the loading instructions that prepare the data for the next loop can be issued at the beginning of each loop, as shown in Figure :numref:`hide_smem_latency`. Because the computation instructions in the current loop do not require the data of the next loop, their execution will not be blocked by the instructions that load that data.

![Pipeline of the previous GPU kernel function](../img/ch06/practise/use_smem_pipeline.png)
:label:`use_smem_pipeline`

![Pipeline that hides the shared memory loading latency](../img/ch06/practise/hide_smem_latency.png)
:label:`hide_smem_latency`

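The pipelined inner loops can be sketched as follows, reusing the illustrative tile names and layout from the earlier shared-memory sketch (this is not the exact code of `gemm_hide_smem_latency.cu`): the operands for step `kk + 1` are requested from the shared memory at the top of step `kk`, so the multiply-add for step `kk` never waits on an `LDS` instruction it depends on.

```cpp
// Illustrative pipelined inner loops over one pair of shared-memory tiles.
template <int tileM, int tileN, int tileK>
__device__ float innerLoopsPipelined(const float (&tileA)[tileM][tileK],
                                     const float (&tileB)[tileK][tileN],
                                     int threadRow, int threadCol) {
  // Preload the operands of the first step before entering the loop.
  float aCur = tileA[threadRow][0];
  float bCur = tileB[0][threadCol];
  float acc = 0.f;

#pragma unroll
  for (int kk = 0; kk < tileK; ++kk) {
    float aNext = 0.f, bNext = 0.f;
    if (kk + 1 < tileK) {
      // Issue the shared-memory loads for step kk + 1 first; they overlap
      // with the multiply-add below, which only uses already-loaded values.
      aNext = tileA[threadRow][kk + 1];
      bNext = tileB[kk + 1][threadCol];
    }
    acc += aCur * bCur;  // compute for step kk
    aCur = aNext;
    bCur = bNext;
  }
  return acc;
}
```
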
For details about the complete code, see
[gemm_hide_smem_latency.cu](https://github.com/openmlsys/openmlsys-cuda/blob/main/gemm_hide_smem_latency.cu).

The test results are as follows:

```
Max Error: 0.000092
Average Time: 0.585 ms, Average Throughput: 14686.179 GFLOPS
```

Analysis by Nsight Compute shows that the value of `Stall Short Scoreboard` decreases by 67% compared with that of the previous GPU kernel function. As mentioned before, after a GPU memory load/store instruction is issued, the GPU executes the next instruction without waiting for the data to land in the register. However, it sets a flag on the scoreboard and resets the flag once the data has landed. If an instruction that requires the data is to be executed, the GPU executes it only after the data has landed. The decrease in `Stall Short Scoreboard` demonstrates that hiding the access latency of the shared memory is an effective way to better utilize the GPU.

## Hiding Global Memory Loading Latency

To load data from the global memory, a GPU uses the `LDG` instruction, whose behavior is similar to that of the `LDS` instruction used to load data from the shared memory, as discussed in the previous section. At the beginning of each of the $\frac{K}{tileK}$ outer loops, the instructions that load the data tiles of matrix $A$ for the next outer loop are issued. Because this data is not required by any inner loop of the current outer loop, the computation in the inner loops does not wait for these load instructions to complete, thereby hiding the global memory loading latency. We can also write the data in `buffer` to `tile` during the last inner loop, that is, after $tileK - 1$ inner loops have been executed, further reducing the latency of writing data to `tile`. Figure :numref:`hide_global_latency` shows the optimized pipeline.

![Pipeline that hides the global memory loading latency](../img/ch06/practise/hide_global_latency.png)
:label:`hide_global_latency`

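A simplified sketch of this prefetching structure is given below, under the same illustrative assumptions as the earlier shared-memory sketch. Unlike the description above, it commits the prefetched `buffer` registers to shared memory at the top of the next outer loop rather than during the last inner loop; the names `bufA`, `bufB` and `gemmPrefetchSketch` are hypothetical and do not mirror `gemm_final.cu`.

```cpp
// Illustrative sketch: global-memory loads for the next tiles are issued
// before the inner loops over the current tiles, hiding the LDG latency.
template <int tileM, int tileN, int tileK>
__global__ void gemmPrefetchSketch(const float *A, const float *B, float *C,
                                   int M, int N, int K) {
  __shared__ float tileA[tileM][tileK];
  __shared__ float tileB[tileK][tileN];

  int row = blockIdx.y * tileM + threadIdx.y;
  int col = blockIdx.x * tileN + threadIdx.x;
  float acc = 0.f;

  // Per-thread staging registers ("buffer"): fetch the first tiles up front.
  float bufA = (threadIdx.x < tileK) ? A[row * K + threadIdx.x] : 0.f;
  float bufB = (threadIdx.y < tileK) ? B[threadIdx.y * N + col] : 0.f;

  for (int k0 = 0; k0 < K; k0 += tileK) {
    // Commit the previously fetched data from the registers to shared memory.
    if (threadIdx.x < tileK) tileA[threadIdx.y][threadIdx.x] = bufA;
    if (threadIdx.y < tileK) tileB[threadIdx.y][threadIdx.x] = bufB;
    __syncthreads();

    // Issue the LDG instructions for the next outer loop now; the inner
    // loops below never read bufA or bufB, so they are not stalled by them.
    if (k0 + tileK < K) {
      if (threadIdx.x < tileK) bufA = A[row * K + k0 + tileK + threadIdx.x];
      if (threadIdx.y < tileK) bufB = B[(k0 + tileK + threadIdx.y) * N + col];
    }

    for (int kk = 0; kk < tileK; ++kk)  // compute with the current tiles
      acc += tileA[threadIdx.y][kk] * tileB[kk][threadIdx.x];
    __syncthreads();  // old tiles may be overwritten only after this point
  }
  C[row * N + col] = acc;
}
```
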
For details about the complete code, see
[gemm_final.cu](https://github.com/openmlsys/openmlsys-cuda/blob/main/gemm_final.cu).

The test results are as follows:

```
Max Error: 0.000092
Average Time: 0.542 ms, Average Throughput: 15838.302 GFLOPS
```

Similar to the `Stall Short Scoreboard` results obtained in the previous section, analysis by Nsight Compute shows that the value of `Stall Long Scoreboard` (a global memory indicator) decreases by 67%. Such a significant decrease demonstrates that prefetching data effectively hides the global memory loading latency.

## Performance Optimization Principles

So far, we have discussed various methods to enhance the performance of an accelerator. Even though other methods exist, performance optimization generally adheres to the following principles:

- Increasing parallelism through resource mapping: Multi-level parallel resources (`blocks`, `warps`, and `threads`) are mapped to the data that needs to be computed and transferred, in order to enhance program parallelism.

- Reducing memory access latency through memory structure optimization: Based on the recognition that data is reused within the same `block` during computation, the reused data is stored in local memory (such as shared memory and registers) to increase locality.

- Reducing the instruction issue overhead by optimizing instruction execution: The `#pragma unroll` directive is used to unroll loops in order to improve instruction-level parallelism and reduce loop-control logic (see the sketch after this list). Vectorized load instructions are used to increase bandwidth; for the Ampere architecture, the widest vectorized load instruction is `LDG.E.128`, and the corresponding data type for loading is `float4`.

- Hiding load/store latency by optimizing the memory access pipeline: In cases where the in-memory data undergoes movement (such as the movement of matrix data), we can optimize the memory access pipeline so that the accelerator performs computations during the intervals between data movements, thereby concealing the latency associated with data movement.

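As a small illustration of the third principle, a hypothetical fixed-length inner product can be fully unrolled with `#pragma unroll`:

```cpp
// The trip count is known at compile time, so the loop is fully unrolled:
// no loop counter, no branch, and the multiply-adds can issue back to back.
__device__ float dot8(const float *a, const float *b) {
  float acc = 0.f;
#pragma unroll
  for (int k = 0; k < 8; ++k)
    acc += a[k] * b[k];
  return acc;
}
```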
