Commit adf5e35

debug

committed (1 parent: b6510d0)

File tree

1 file changed (+0, -189 lines)


chapter_accelerator/Components_of_Hardware_Accelerators.md

Lines changed: 0 additions & 189 deletions
@@ -22,192 +22,3 @@ Volta GV100 . This architecture has:
![Volta GV100](../img/ch06/V100.png)
:label:`ch06/ch06-gv100`

1. 6 GPU processing clusters (GPCs), each containing:

   a. 7 texture processing clusters (TPCs), each containing two streaming multiprocessors (SMs).

   b. 14 SMs.

2. 84 SMs, each containing:

   a. 64 32-bit floating-point arithmetic units

   b. 64 32-bit integer arithmetic units

   c. 32 64-bit floating-point arithmetic units

   d. 8 Tensor Cores

   e. 4 texture units

3. 8 512-bit memory controllers.

As shown in Figure :numref:`ch06/ch06-gv100`, a GV100 GPU contains 84 SMs, 5376 32-bit floating-point arithmetic units, 5376 32-bit integer arithmetic units, 2688 64-bit floating-point arithmetic units, 672 Tensor Cores, and 336 texture units. Each pair of memory controllers drives one HBM2 DRAM stack. Different products may use different configurations (e.g., the Tesla V100 has 80 SMs).
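
The die-level totals quoted here follow directly from the per-SM counts listed above:

```
84 SMs x 64 FP32 units   = 5376 FP32 units
84 SMs x 64 INT32 units  = 5376 INT32 units
84 SMs x 32 FP64 units   = 2688 FP64 units
84 SMs x 8 Tensor Cores  =  672 Tensor Cores
84 SMs x 4 texture units =  336 texture units
```
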
## Memory Units

The memory units of a hardware accelerator resemble a CPU's memory subsystem. However, they face a bottleneck when fetching data from the computer system's DRAM, which is slow compared with the processor's computational speed. Without a cache for fast access, the DRAM bandwidth is inadequate to serve all of the accelerator's transactions. Consequently, if program instructions or data cannot be retrieved from DRAM quickly enough, the accelerator spends long periods idle and its efficiency drops. To tackle this DRAM bandwidth problem, GPUs employ a hierarchy of memory units, each offering its own maximum bandwidth and latency. To fully exploit the available computing power and improve processing speed, programmers must choose among these memory units and optimize memory usage according to their different access speeds.

1. **Register file**: Registers are the fastest on-chip memories. In contrast to a CPU, each SM in a GPU contains tens of thousands of registers. Nevertheless, if every thread uses too many registers, fewer thread blocks can be scheduled on the SM, leaving fewer executable threads. This underutilization of the hardware hurts performance considerably, so programmers must judiciously decide how many registers to use per thread, taking the algorithm's demands into account.

2. **Shared memory**: Shared memory is a user-controllable level-1 cache. Each SM features a 128 KB level-1 cache, of which programmers can manage up to 96 KB as shared memory. Shared memory offers a low access latency of only a few dozen clock cycles and a bandwidth of up to 1.5 TB/s, significantly higher than the peak global-memory bandwidth of 900 GB/s. In high-performance computing (HPC) scenarios, engineers must understand how to leverage shared memory effectively (see the sketch after this list).

3. **Global memory**: Both the GPU and the CPU can read from and write to global memory. Global memory is visible to and accessible by all threads on a GPU, whereas other devices such as CPUs must traverse buses like PCIe or NVLink to access it. Global memory is the largest memory space on a GPU, with capacities exceeding 80 GB, but it also has the longest latency: a load or store can take hundreds of clock cycles.

4. **Constant memory**: Constant memory is a virtual address space within global memory and does not occupy a dedicated physical memory block. It serves as high-speed memory designed for rapid caching and for efficiently broadcasting a single value to all threads in a warp.

5. **Texture memory**: Texture memory is a specialized form of global memory that is accessed through a dedicated texture cache to improve performance. In early GPUs without general-purpose caches, the texture memory on each SM served as the only data cache; the introduction of level-1 and level-2 caches in modern GPUs has made that caching role obsolete. Texture memory is now most useful for hardware-accelerated memory accesses. For instance, arrays can be addressed with normalized coordinates, and the fetched data can be interpolated automatically by the hardware: texture memory supports hardware-accelerated bilinear and trilinear interpolation for 2D and 3D arrays, respectively. It also handles boundary conditions automatically based on the array index, so operations on array elements can be written without explicit boundary checks, avoiding extra conditional branches in a thread.
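
As a minimal sketch of the shared-memory usage highlighted in item 2 above, the following CUDA kernel stages data from global memory into the SM's shared memory before reducing it. The kernel name and buffers are illustrative, not taken from any particular library.

```
// Illustrative block-wise sum: each block stages 256 elements in shared
// memory so the tree reduction avoids repeated global-memory traffic.
// Launch with 256 threads per block.
__global__ void blockSum(const float *in, float *out, int n) {
    __shared__ float tile[256];                // resides in the SM's shared memory
    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + tid;

    // One read per thread from slow global memory into fast shared memory.
    tile[tid] = (idx < n) ? in[idx] : 0.0f;
    __syncthreads();

    // Tree reduction carried out entirely in shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            tile[tid] += tile[tid + stride];
        __syncthreads();
    }

    // A single global-memory write per block instead of one per element.
    if (tid == 0)
        out[blockIdx.x] = tile[0];
}
```

Each input element is read from global memory exactly once, while all intermediate sums stay in shared memory, whose latency is a few dozen cycles rather than hundreds.
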
## Compute Units

Hardware accelerators offer a variety of compute units to handle different neural networks efficiently. Figure :numref:`ch06/ch06-compute-unit` shows how different layers of a neural network map onto the appropriate compute units; plain reference loops contrasting the four granularities are sketched after the list below.

![Compute units](../img/ch06/compute_unit.png)
:label:`ch06/ch06-compute-unit`

1. **Scalar Unit**: computes one scalar element at a time, similar to a standard reduced instruction set computer (RISC) core.

2. **1D Vector Unit**: computes multiple elements at a time, similar to the SIMD units of traditional CPU and GPU architectures. It is widely used in HPC and signal processing.

3. **2D Matrix Unit**: computes a matrix-vector inner product or a vector outer product in a single operation. It reuses data to reduce communication cost and memory footprint, which improves the performance of matrix multiplication.

4. **3D Cube Unit**: completes an entire matrix multiplication in a single operation. Designed specifically for neural network applications, it reuses data to compensate for the gap between data communication bandwidth and compute throughput.
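
The following host-side C loops (valid in a CUDA source file) contrast the four granularities; function names and array shapes are illustrative, not any vendor's instruction set.

```
// Scalar Unit: one multiply-accumulate per operation.
void scalar_mac(float *c, float a, float b) { *c += a * b; }

// 1D Vector Unit: n element-wise multiply-accumulates per operation.
void vector_mac(float *c, const float *a, const float *b, int n) {
    for (int i = 0; i < n; ++i) c[i] += a[i] * b[i];
}

// 2D Matrix Unit: a matrix-vector product per operation; each x[j] is reused m times.
void matrix_vector(float *y, const float *A, const float *x, int m, int n) {
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < n; ++j)
            y[i] += A[i * n + j] * x[j];
}

// 3D Cube Unit: a full matrix multiplication per operation; every element of
// A and B is reused across an entire row or column of C.
void matrix_matrix(float *C, const float *A, const float *B, int m, int n, int k) {
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < n; ++j)
            for (int p = 0; p < k; ++p)
                C[i * n + j] += A[i * k + p] * B[p * n + j];
}
```

Moving down this list, each operation amortizes instruction issue and data movement over more multiply-accumulates, which is exactly the data reuse the 2D and 3D units exploit.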

The compute units on a GPU mainly comprise Scalar Units and 3D Cube Units. As shown in Figure :numref:`ch06/ch06-SM`, each SM has 64 32-bit floating-point arithmetic units, 64 32-bit integer arithmetic units, and 32 64-bit floating-point arithmetic units, which are Scalar Units, as well as 8 Tensor Cores, which are 3D Cube Units specially designed for neural network applications.

![Volta GV100 SM](../img/ch06/SM.png)
:label:`ch06/ch06-SM`

A Tensor Core is capable of performing one $4\times4$ matrix multiply-accumulate operation per clock cycle, as shown in Figure :numref:`ch06/ch06-tensorcore`.

```
D = A * B + C
```

![Tensor Core's $4\times4$ matrix multiply-accumulate operation](../img/ch06/tensor_core.png)
:label:`ch06/ch06-tensorcore`

$\bf{A}$, $\bf{B}$, $\bf{C}$, and $\bf{D}$ are $4\times4$ matrices. The input matrices $\bf{A}$ and $\bf{B}$ are FP16 matrices, while the accumulation matrices $\bf{C}$ and $\bf{D}$ can be either FP16 or FP32 matrices. Tesla V100's Tensor Cores are programmable matrix multiply-accumulate units that deliver up to 125 tensor TFLOPS (tera floating-point operations per second) for training and inference, roughly a ten-fold speedup over ordinary FP32 compute units.
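
On NVIDIA GPUs this operation is exposed to programmers through the warp-level matrix multiply-accumulate (WMMA) API in CUDA. The minimal sketch below computes one $16\times16$ tile, the tile size the API exposes, which the hardware decomposes into the per-cycle $4\times4$ operations described above; pointer names and matrix layouts are illustrative.

```
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp (32 threads) computes D = A * B + C for a 16x16 tile:
// FP16 inputs A and B, FP32 accumulators C and D, leading dimension 16.
__global__ void wmma_tile(const half *A, const half *B, const float *C, float *D) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::load_matrix_sync(a_frag, A, 16);                        // A tile into warp registers
    wmma::load_matrix_sync(b_frag, B, 16);                        // B tile
    wmma::load_matrix_sync(c_frag, C, 16, wmma::mem_row_major);   // accumulator C

    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);               // Tensor Cores: D = A * B + C

    wmma::store_matrix_sync(D, c_frag, 16, wmma::mem_row_major);  // write the 16x16 result
}
```

A kernel like this is launched with one warp per output tile; full-size matrices are covered by tiling them across many warps and blocks.
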
## Domain Specific Architecture

![Da Vinci architecture](../img/ch06/davinci_architecture.png)
:label:`ch06/ch06-davinci_architecture`

Domain Specific Architecture (DSA) has been an area of interest for meeting the fast-growing demand for computing power from deep neural networks. As a typical DSA design targeting image, video, voice, and text processing, neural network processing units (also known as deep learning hardware accelerators) are system-on-chips (SoCs) containing special compute units, large memory units, and the corresponding control units. A neural processing unit, such as an Ascend chip, typically consists of a control CPU, a number of AI computing engines, multi-level on-chip caches or buffers, and a digital vision pre-processing (DVPP) module.

The computing core of such AI chips is the AI Core, which is responsible for executing scalar- and tensor-based, arithmetic-intensive computation. Consider the Ascend chip as an example: its AI Core adopts the Da Vinci architecture. Figure :numref:`ch06/ch06-davinci_architecture` shows the architecture of an AI Core, which, from the control perspective, can be regarded as a simplified version of a modern microprocessor architecture. It includes three types of basic computing units: the Cube Unit, the Vector Unit, and the Scalar Unit. These units compute on tensors, vectors, and scalars, respectively, in three independent pipelines that are centrally scheduled by the system software to coordinate with each other for higher efficiency. Similar to GPU designs, the Cube Unit is the computational core of the AI Core and delivers parallel acceleration for matrix multiply-accumulate operations. Specifically, it can multiply two $16\times16$ matrices in a single instruction, which is equivalent to completing 4096 ($=16\times16\times16$) multiply-accumulate operations in an extremely short time, with precision comparable to that of FP16 operations.
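
A worked count behind the 4096 figure: each of the $16\times16$ output elements is an inner product of length 16,

$$C_{ij} = \sum_{k=1}^{16} A_{ik} B_{kj}, \qquad i, j \in \{1, \dots, 16\},$$

so one instruction performs $16 \times 16 \times 16 = 4096$ multiply-accumulate operations.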
