
Commit c64dd7a ("debug"), 1 parent: c0f62ea


chapter_accelerator/Components_of_Hardware_Accelerators.md

Lines changed: 168 additions & 0 deletions

@@ -34,3 +34,171 @@ Volta GV100. This architecture has:
4. 8 Tensor Cores
5. 4 texture units
3. 8 512-bit memory controllers.

As shown in Figure :numref:`ch06/ch06-gv100`, a GV100 GPU contains 84 SMs (Streaming Multiprocessors), 5376 32-bit floating-point arithmetic units, 5376 32-bit integer arithmetic units, 2688 64-bit floating-point arithmetic units, 672 Tensor Cores, and 336 texture units. A pair of memory controllers controls one HBM2 DRAM stack. Products based on the same chip may ship with different configurations enabled (e.g., the Tesla V100 has 80 SMs).
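
These chip-level totals follow directly from the per-SM resources (64 FP32 units, 64 INT32 units, 32 FP64 units, 8 Tensor Cores, and 4 texture units per SM, as listed above and in the SM breakdown later in this section) multiplied by the 84 SMs:

$$
\begin{aligned}
84 \times 64 &= 5376 \ \text{FP32 units}, & 84 \times 64 &= 5376 \ \text{INT32 units},\\
84 \times 32 &= 2688 \ \text{FP64 units}, & 84 \times 8 &= 672 \ \text{Tensor Cores},\\
84 \times 4  &= 336 \ \text{texture units}.
\end{aligned}
$$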

## Memory Units

The memory units of a hardware accelerator resemble a CPU's memory controller. However, they face a bottleneck when fetching data from the system's DRAM, which is slow relative to the processor's computational speed. Without a cache for quick access, the DRAM bandwidth is inadequate to serve all of the accelerator's transactions. Consequently, if program instructions or data cannot be retrieved from DRAM quickly enough, the accelerator's efficiency drops because of prolonged idle time. To tackle this bandwidth problem, GPUs employ a hierarchy of memory units, each offering its own maximum bandwidth and latency. To fully exploit the available computing power and enhance processing speed, programmers must choose among these memory units and optimize memory utilization according to their different access speeds.

1. **Register file**: Registers are the fastest on-chip memories. In contrast to CPUs, each SM in a GPU possesses tens of thousands of registers. Nevertheless, using too many registers per thread reduces the number of thread blocks that can be scheduled on an SM, leaving fewer threads to execute. This underutilization of the hardware hurts performance considerably. Programmers must therefore judiciously decide how many registers to use, taking the algorithm's demands into account.

2. **Shared memory**: Shared memory is a user-controllable level-1 cache. Each SM features a 128 KB level-1 cache, of which programmers can manage up to 96 KB as shared memory. Shared memory offers a low access latency of only a few dozen clock cycles and a bandwidth of up to 1.5 TB/s, far higher than the global memory's peak bandwidth of 900 GB/s. In high-performance computing (HPC) scenarios, engineers must thoroughly understand how to leverage shared memory effectively (see the sketch after this list).

3. **Global memory**: Both GPUs and CPUs can read from and write to global memory. Global memory is visible to and accessible by all threads on a GPU, whereas other devices such as CPUs must traverse buses like PCIe and NVLink to access it. Global memory is the largest memory space on a GPU, with capacities exceeding 80 GB, but it also has the longest latency: a load/store can take hundreds of clock cycles.

4. **Constant memory**: Constant memory is a virtual address space within global memory and does not occupy a separate physical memory block. It serves as a high-speed memory designed for rapid caching and efficient broadcasting of a single value to all threads within a warp.

5. **Texture memory**: Texture memory is a specialized form of global memory that is accessed through a dedicated texture cache to enhance performance. In earlier GPUs without general-purpose caches, the texture cache on each SM was the only cache available for data; with the introduction of level-1 and level-2 caches in modern GPUs, texture memory's role as a cache has become obsolete. Texture memory is now most beneficial for hardware-accelerated operations performed while accessing memory. For instance, arrays can be addressed with normalized coordinates, and the fetched data can be interpolated automatically by the hardware. Texture memory supports hardware-accelerated bilinear and trilinear interpolation for 2D and 3D arrays, respectively. It also handles boundary conditions automatically based on array indices, so operations on array elements need no explicit boundary checks, avoiding extra conditional branches in a thread.
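
To make these memory spaces concrete, the following minimal CUDA sketch computes a scaled per-block sum and touches constant memory, registers, shared memory, and global memory. The kernel name, the block size of 256, and the scaling computation are illustrative assumptions, not code from this chapter.

```cuda
#include <cuda_runtime.h>

// Illustrative sketch only: launch with 256 threads per block.
__constant__ float kScale;                    // constant memory: one value broadcast to all threads

__global__ void scaled_block_sum(const float* in, float* out, int n) {
    __shared__ float tile[256];               // shared memory: on-chip, user-managed
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (gid < n) ? in[gid] : 0.0f;     // global-memory load; v lives in a register
    tile[threadIdx.x] = kScale * v;           // register value written to shared memory
    __syncthreads();

    // Tree reduction within the block, entirely in shared memory and registers.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        out[blockIdx.x] = tile[0];            // block result written back to global memory
}
```

Texture memory is not shown here; it would be accessed through a texture object rather than a plain pointer.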

## Compute Units

Hardware accelerators offer a variety of compute units to efficiently handle different neural networks. Figure :numref:`ch06/ch06-compute-unit` demonstrates how different layers of a neural network select the appropriate compute units.

![Compute units](../img/ch06/compute_unit.png)
:label:`ch06/ch06-compute-unit`

1. **Scalar Unit**: calculates one scalar element at a time, similar to a standard reduced instruction set computer (RISC) core.

2. **1D Vector Unit**: computes multiple elements at a time, similar to the SIMD units used in traditional CPU and GPU architectures. It has been widely used in HPC and signal processing (see the sketch after this list).

3. **2D Matrix Unit**: computes the inner product of a matrix and a vector, or the outer product of two vectors, within one operation. It reuses data to reduce communication costs and memory footprint, which improves the performance of matrix multiplication.

4. **3D Cube Unit**: completes a matrix multiplication within one operation. Specially designed for neural network applications, it reuses data to bridge the gap between data communication bandwidth and compute throughput.
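
As a rough illustration of the difference between scalar and 1D vector processing, the CUDA sketch below computes the same $ax + y$ update element by element and then four elements at a time using `float4`. The kernel names and the axpy operation are illustrative assumptions; on a GPU the vectorized form mainly widens each memory transaction rather than changing the SIMT execution model.

```cuda
#include <cuda_runtime.h>

// Illustrative sketch: scalar vs. vectorized (four-wide) processing.
__global__ void axpy_scalar(const float* x, const float* y, float* out,
                            float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = a * x[i] + y[i];                 // one scalar multiply-add per thread
}

__global__ void axpy_vec4(const float4* x, const float4* y, float4* out,
                          float a, int n4) {      // n4 = n / 4
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) {
        float4 xv = x[i], yv = y[i];              // one 128-bit vector load per operand
        out[i] = make_float4(a * xv.x + yv.x, a * xv.y + yv.y,
                             a * xv.z + yv.z, a * xv.w + yv.w);
    }
}
```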

The compute units on a GPU mainly comprise Scalar Units and 3D Cube Units. As shown in Figure :numref:`ch06/ch06-SM`, each SM has 64 32-bit floating-point arithmetic units, 64 32-bit integer arithmetic units, and 32 64-bit floating-point arithmetic units (all Scalar Units), as well as 8 Tensor Cores (3D Cube Units specially designed for neural network applications).

![Volta GV100 SM](../img/ch06/SM.png)
:label:`ch06/ch06-SM`

A Tensor Core can perform one $4\times4$ matrix multiply-accumulate operation per clock cycle, as shown in Figure :numref:`ch06/ch06-tensorcore`.

```
D = A * B + C
```

![Tensor Core's $4\times4$ matrix multiply-accumulate operation](../img/ch06/tensor_core.png)
:label:`ch06/ch06-tensorcore`

$\bf{A}$, $\bf{B}$, $\bf{C}$, and $\bf{D}$ are $4\times4$ matrices. Input matrices $\bf{A}$ and $\bf{B}$ are FP16 matrices, while accumulation matrices $\bf{C}$ and $\bf{D}$ can be either FP16 or FP32 matrices. Tesla V100's Tensor Cores are programmable matrix multiply-accumulate units that deliver up to 125 tera floating-point operations per second (Tensor TFLOPS) for training and inference, roughly an order-of-magnitude speedup over its ordinary FP32 compute units.
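
In CUDA, programmers reach Tensor Cores through the warp-level WMMA API in `<mma.h>`, which operates on $16\times16\times16$ tiles that the hardware decomposes into the $4\times4$ operations described above. The sketch below computes one such tile; the kernel name, the fixed leading dimension of 16, and the chosen matrix layouts are illustrative assumptions.

```cuda
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// Illustrative sketch: one 16x16x16 tile of D = A * B + C on Tensor Cores.
// Launch with a single warp (32 threads); requires compute capability 7.0+.
__global__ void wmma_tile(const half* A, const half* B, const float* C, float* D) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

    wmma::load_matrix_sync(a_frag, A, 16);                        // FP16 input tiles
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::load_matrix_sync(acc_frag, C, 16, wmma::mem_row_major); // FP32 accumulator
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);           // D = A * B + C
    wmma::store_matrix_sync(D, acc_frag, 16, wmma::mem_row_major);
}
```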

## Domain Specific Architecture

![Da Vinci architecture](../img/ch06/davinci_architecture.png)
:label:`ch06/ch06-davinci_architecture`

Domain Specific Architecture (DSA) has been an area of great interest for meeting the fast-growing demand for computing power from deep neural networks. As typical DSA designs targeting image, video, voice, and text processing, neural network processing units (also known as deep learning hardware accelerators) are systems-on-chip (SoCs) containing special compute units, large memory units, and the corresponding control units. A neural processing unit, for example the Ascend chip, typically consists of a control CPU, a number of AI computing engines, multi-level on-chip caches or buffers, and a digital vision pre-processing (DVPP) module.

The computing core of such AI chips is composed of AI Cores, which execute scalar- and tensor-based arithmetic-intensive computing. Consider the Ascend chip as an example: its AI Core adopts the Da Vinci architecture. Figure :numref:`ch06/ch06-davinci_architecture` shows the architecture of an AI Core, which can be regarded as a simplified version of a modern microprocessor architecture from the control perspective. It includes three types of basic computing units: the Cube Unit, the Vector Unit, and the Scalar Unit. These units compute on tensors, vectors, and scalars, respectively, in three independent pipelines that are centrally scheduled by the system software to coordinate with one another for higher efficiency. Similar to GPU designs, the Cube Unit is the computational core of the AI Core and delivers parallel acceleration of matrix multiply-accumulate operations. Specifically, it can multiply two $16\times16$ matrices in a single instruction, completing 4096 ($=16\times16\times16$) multiply-accumulate operations in an extremely short time, with precision comparable to FP16 operations.
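
Conceptually, the work packed into one such Cube instruction is the familiar triple loop below, written out only to make the multiply-accumulate count explicit. The function name and the use of `float` elements are simplifications for illustration (the Cube Unit itself works at FP16-comparable precision), not how the hardware is actually programmed.

```cuda
// Reference sketch: the 16x16x16 matrix multiply-accumulate that a single
// Cube instruction completes. The inner statement runs 16 * 16 * 16 = 4096
// times, one multiply-accumulate per iteration.
void cube_mma_reference(const float A[16][16], const float B[16][16],
                        float C[16][16]) {
    for (int i = 0; i < 16; ++i)
        for (int j = 0; j < 16; ++j)
            for (int k = 0; k < 16; ++k)
                C[i][j] += A[i][k] * B[k][j];
}
```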
