# Components of Hardware Accelerators

A hardware accelerator typically comprises multiple on-chip caches and various types of arithmetic units. In this section, we'll examine the fundamental components of hardware accelerators, using the Nvidia Volta GPU architecture as a representative example.

## Architecture of Accelerators

Contemporary graphics processing units (GPUs) offer remarkable computing speed, ample memory storage, and impressive I/O bandwidth. A top-tier GPU frequently surpasses a conventional CPU by housing double the number of transistors, boasting a memory capacity of 16 GB or greater, and operating at frequencies reaching up to 1 GHz. The architecture of a GPU comprises streaming processors and a memory system, interconnected through an on-chip network. These components can be expanded independently, allowing for customized configurations tailored to the target market of the GPU.

Figure :numref:`ch06/ch06-gv100` illustrates the architecture of the Volta GV100. This architecture has:

![Volta GV100](../img/ch06/V100.png)
:label:`ch06/ch06-gv100`

1. 6 GPU processing clusters (GPCs), each containing:

   1. 7 texture processing clusters (TPCs), each containing two streaming multiprocessors (SMs).

   2. 14 SMs.

2. 84 SMs, each containing:

   1. 64 32-bit floating-point arithmetic units

   2. 64 32-bit integer arithmetic units

   3. 32 64-bit floating-point arithmetic units

   4. 8 Tensor Cores

   5. 4 texture units

3. 8 512-bit memory controllers.

As shown in Figure :numref:`ch06/ch06-gv100`, a GV100 GPU contains 84 SMs, 5376 32-bit floating-point arithmetic units, 5376 32-bit integer arithmetic units, 2688 64-bit floating-point arithmetic units, 672 Tensor Cores, and 336 texture units. A pair of memory controllers controls an HBM2 DRAM stack. Different vendors may use different configurations (e.g., Tesla V100 has 80 SMs).

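These per-chip totals can be inspected at run time. The following host-side sketch is a minimal example, assuming a CUDA-capable system with the CUDA runtime installed; it queries device 0 for its SM count, global memory capacity, memory bus width, and per-SM shared memory and register resources. The printed values depend on the installed GPU.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    // Query the properties of GPU 0; other devices can be selected by index.
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        std::printf("cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    std::printf("Device name                     : %s\n", prop.name);
    std::printf("Streaming multiprocessors (SMs) : %d\n", prop.multiProcessorCount);
    std::printf("Global memory                   : %.1f GB\n",
                prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
    std::printf("Memory bus width                : %d bits\n", prop.memoryBusWidth);
    std::printf("Shared memory per SM            : %zu KB\n",
                prop.sharedMemPerMultiprocessor / 1024);
    std::printf("32-bit registers per SM         : %d\n", prop.regsPerMultiprocessor);
    return 0;
}
```
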
## Memory Units

The memory units of a hardware accelerator resemble those of a CPU. However, they face a bottleneck when retrieving data from the system's DRAM, which is slow compared with the processor's computational speed. Without a cache for quick access, the DRAM bandwidth is inadequate to handle all of the accelerator's memory transactions. Consequently, if program instructions or data cannot be swiftly retrieved from the DRAM, the accelerator's efficiency diminishes due to prolonged idle time. To tackle this DRAM bandwidth issue, GPUs employ a hierarchical design of memory units, each offering its own maximum bandwidth and latency. To fully exploit the available computing power and enhance processing speed, programmers must select from the available memory units and optimize memory utilization based on their differing access speeds.

1. **Register file**: Registers are the fastest on-chip memories. In contrast to CPUs, each SM in a GPU possesses tens of thousands of registers. Nevertheless, excessively using registers for every thread reduces the number of thread blocks that can be scheduled on the SM, leading to fewer executable threads. This underutilization of hardware capabilities hampers performance considerably. Consequently, programmers must judiciously determine the appropriate number of registers to employ, taking into account the algorithm's demands.

2. **Shared memory**: The shared memory is a user-controllable level-1 cache. Each SM features a 128 KB level-1 cache, of which programmers can manage up to 96 KB as shared memory. The shared memory offers a low access latency of only a few dozen clock cycles and a bandwidth of up to 1.5 TB/s, significantly higher than the peak bandwidth of the global memory, which stands at 900 GB/s. In high-performance computing (HPC) scenarios, engineers must possess a thorough understanding of how to leverage shared memory effectively. (A short usage sketch of these memory spaces follows this list.)

3. **Global memory**: Both GPUs and CPUs are capable of reading from and writing to global memory. Global memory is visible to and accessible by all threads on a GPU, whereas other devices such as CPUs need to traverse buses like PCIe and NVLink to access it. The global memory is the largest memory space available on a GPU, with capacities exceeding 80 GB, but it also exhibits the longest memory latency, with load/store latencies reaching hundreds of clock cycles.

4. **Constant memory**: The constant memory is a virtual address space within the global memory and does not occupy a dedicated physical memory block. It serves as a high-speed memory, specifically designed for rapid caching and efficient broadcasting of a single value to all threads within a warp.

5. **Texture memory**: Texture memory is a specialized form of global memory that is accessed through a dedicated texture cache to enhance performance. In earlier GPUs without caches, the texture memory on each SM served as the sole cache for data. However, the introduction of level-1 and level-2 caches in modern GPUs has rendered the texture memory's role as a cache obsolete. The texture memory proves most beneficial in enabling GPUs to execute hardware-accelerated operations while accessing memory units. For instance, it allows arrays to be accessed using normalized addresses, and the retrieved data can be automatically interpolated by the hardware. Additionally, the texture memory supports hardware-accelerated bilinear and trilinear interpolation for 2D and 3D arrays, respectively. Moreover, the texture memory facilitates automatic handling of boundary conditions based on array indices. This means that operations on array elements can be carried out without explicit consideration of boundary situations, avoiding the need for extra conditional branches in a thread.

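The following minimal CUDA sketch shows how these memory spaces appear to a programmer; the kernel name, tile size, and coefficient are illustrative choices rather than anything prescribed above. A coefficient held in constant memory is broadcast to all threads, a tile of global memory is staged in user-managed shared memory, and per-thread temporaries live in registers.

```cpp
#include <cuda_runtime.h>

#define TILE 256  // threads per block; one shared-memory tile per block

// A single coefficient placed in constant memory: it is cached and
// efficiently broadcast to every thread of a warp that reads it.
__constant__ float kScale;

__global__ void scale_shift(const float* __restrict__ in,
                            float* __restrict__ out,
                            float shift, int n) {
    // User-managed level-1 cache (shared memory): one tile per thread block.
    __shared__ float tile[TILE];

    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // kept in registers
    if (idx < n) {
        tile[threadIdx.x] = in[idx];    // global memory -> shared memory
    }
    __syncthreads();                    // wait until the whole tile is loaded

    if (idx < n) {
        // Per-thread temporaries live in registers; kScale comes from constant memory.
        float v = tile[threadIdx.x] * kScale + shift;
        out[idx] = v;                   // write the result back to global memory
    }
}

// Host-side usage sketch: copy the coefficient into constant memory,
// then launch one thread per element.
void launch(const float* d_in, float* d_out, int n) {
    float scale = 2.0f;
    cudaMemcpyToSymbol(kScale, &scale, sizeof(float));
    int blocks = (n + TILE - 1) / TILE;
    scale_shift<<<blocks, TILE>>>(d_in, d_out, 0.5f, n);
}
```
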
## Compute Units {#Compute Units}

Hardware accelerators offer a variety of compute units to efficiently handle various neural networks. Figure :numref:`ch06/ch06-compute-unit` demonstrates how different layers of neural networks select appropriate compute units.

![Compute units](../img/ch06/compute_unit.png)
:label:`ch06/ch06-compute-unit`

1. **Scalar Unit**: calculates one scalar element at a time, similar to a standard reduced instruction set computer (RISC) core.

2. **1D Vector Unit**: computes multiple elements at a time, similar to the SIMD units used in traditional CPU and GPU architectures. It has been widely used in HPC and signal processing.

3. **2D Matrix Unit**: computes the product of a matrix and a vector, or the outer product of two vectors, within one operation. It reuses data to reduce communication costs and memory footprint, which improves the performance of matrix multiplication.

4. **3D Cube Unit**: completes a matrix-matrix multiplication within one operation. Specially designed for neural network applications, it can reuse data to compensate for the gap between the data communication bandwidth and the computing throughput. (A loop-level view of the work covered by one operation of each unit class is sketched after this list.)

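The loop-level sketch below is plain reference code of our own, not a vendor intrinsic; it only illustrates how much multiply-accumulate (MAC) work a single operation of each unit class covers, from one scalar MAC up to a full matrix-matrix product.

```cpp
// Reference loops showing the work covered by one operation of each unit class.

constexpr int N = 16;  // example vector/matrix dimension

// Scalar Unit: one MAC per operation.
float scalar_mac(float a, float b, float c) { return a * b + c; }

// 1D Vector Unit: one operation covers a whole vector of element-wise MACs (SIMD).
void vector_mac(const float a[N], const float b[N], float c[N]) {
    for (int i = 0; i < N; ++i) c[i] += a[i] * b[i];
}

// 2D Matrix Unit: one operation covers a matrix-vector product (N*N MACs).
void matrix_vector(const float A[N][N], const float x[N], float y[N]) {
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) y[i] += A[i][j] * x[j];
}

// 3D Cube Unit: one operation covers a full matrix-matrix product (N*N*N MACs).
void matrix_matrix(const float A[N][N], const float B[N][N], float C[N][N]) {
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            for (int k = 0; k < N; ++k) C[i][j] += A[i][k] * B[k][j];
}
```
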
The compute units on a GPU mostly include Scalar Units and 3D Cube Units. As shown in Figure :numref:`ch06/ch06-SM`, each SM has 64 32-bit floating-point arithmetic units, 64 32-bit integer arithmetic units, and 32 64-bit floating-point arithmetic units, which are Scalar Units, as well as 8 Tensor Cores, which are 3D Cube Units specially designed for neural network applications.

![Volta GV100 SM](../img/ch06/SM.png)
:label:`ch06/ch06-SM`

A Tensor Core is capable of performing one $4\times4$ matrix multiply-accumulate operation per clock cycle, as shown in Figure :numref:`ch06/ch06-tensorcore`:

$$\bf{D} = \bf{A} \times \bf{B} + \bf{C}$$

![Tensor Core's $4\times4$ matrix multiply-accumulate operation](../img/ch06/tensor_core.png)
:label:`ch06/ch06-tensorcore`

$\bf{A}$, $\bf{B}$, $\bf{C}$, and $\bf{D}$ are $4\times4$ matrices. Input matrices $\bf{A}$ and $\bf{B}$ are FP16 matrices, and accumulation matrices $\bf{C}$ and $\bf{D}$ can be either FP16 or FP32 matrices. Tesla V100's Tensor Cores are programmable matrix multiply-accumulate units that can deliver up to 125 Tensor TFLOPS (tera floating-point operations per second) for training and inference applications, a ten-fold increase in computing speed compared with common FP32 compute units.

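In CUDA, Tensor Cores are programmed through the warp-level WMMA (warp matrix multiply-accumulate) API, which aggregates the hardware's $4\times4$ steps into $16\times16\times16$ tile operations. The sketch below is a minimal single-tile example; the kernel name and launch configuration are illustrative. A single warp computes $\bf{D} = \bf{A} \times \bf{B} + \bf{C}$ for one $16\times16$ tile with FP16 inputs and FP32 accumulation.

```cpp
#include <mma.h>
#include <cuda_fp16.h>

using namespace nvcuda;

// One warp cooperatively multiplies a 16x16 FP16 tile pair, accumulating in FP32.
__global__ void wmma_16x16x16(const half *a, const half *b, float *c) {
    // Fragments are per-warp register tiles for the A, B, and accumulator operands.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);               // C = 0
    wmma::load_matrix_sync(a_frag, a, 16);           // load A tile (leading dimension 16)
    wmma::load_matrix_sync(b_frag, b, 16);           // load B tile
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // D = A * B + C on Tensor Cores
    wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);  // write the result
}

// Usage: launch with a single warp (32 threads) on a GPU of compute capability 7.0+:
//   wmma_16x16x16<<<1, 32>>>(d_a, d_b, d_c);
```
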
## Domain Specific Architecture

![Da Vinci architecture](../img/ch06/davinci_architecture.png)
:label:`ch06/ch06-davinci_architecture`

Domain Specific Architecture (DSA) has been an area of interest in meeting the fast-growing demand for computing power by deep neural networks. As a typical DSA design targeting image, video, voice, and text processing, neural network processing units (also known as deep learning hardware accelerators) are systems-on-chip (SoCs) containing special compute units, large memory units, and the corresponding control units. A neural processing unit, for example an Ascend chip, typically consists of a control CPU, a number of AI computing engines, multi-level on-chip caches or buffers, and a digital vision pre-processing (DVPP) module.

The computing core of an AI chip is composed of AI Cores, which are responsible for executing scalar- and tensor-based arithmetic-intensive computing. Consider the Ascend chip as an example. Its AI Core adopts the Da Vinci architecture. Figure :numref:`ch06/ch06-davinci_architecture` shows the architecture of an AI Core, which can be regarded as a simplified version of a modern microprocessor architecture from the control perspective. It includes three types of basic computing units: Cube Unit, Vector Unit, and Scalar Unit. These units compute on tensors, vectors, and scalars, respectively, in three independent pipelines that are centrally scheduled by the system software to coordinate with each other for higher efficiency. Similar to GPU designs, the Cube Unit functions as the computational core of the AI Core and delivers parallel acceleration for matrix multiply-accumulate operations. Specifically, it can multiply two $16\times16$ matrices in a single instruction, at a precision comparable to FP16 operations, which is equivalent to completing 4096 ($16\times16\times16$) multiply-accumulate operations within an extremely short time.
