Volta GV100. This architecture has:

![Volta GV100](../img/ch06/V100.png)
:label:`ch06/ch06-gv100`

1. 6 GPU processing clusters (GPCs), each containing:

   a. 7 texture processing clusters (TPCs), each containing two
      streaming multiprocessors (SMs).

   b. 14 SMs.

2. 84 SMs, each containing:

   a. 64 32-bit floating-point arithmetic units

   b. 64 32-bit integer arithmetic units

   c. 32 64-bit floating-point arithmetic units

   d. 8 Tensor Cores

   e. 4 texture units

3. 8 512-bit memory controllers.

As shown in Figure :numref:`ch06/ch06-gv100`, a GV100 GPU contains 84
SMs, 5376 32-bit floating-point arithmetic units, 5376 32-bit integer
arithmetic units, 2688 64-bit floating-point arithmetic units, 672
Tensor Cores, and 336 texture units. Each pair of memory controllers
drives one HBM2 DRAM stack. Products based on the GV100 may enable
different configurations (e.g., the Tesla V100 has 80 SMs).
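
These chip-level totals follow directly from the per-GPC and per-SM
counts listed above; as a quick derivation:

$$
\begin{aligned}
\text{SMs} &= 6 \text{ GPCs} \times 14 \text{ SMs per GPC} = 84, \\
\text{FP32 units} &= 84 \times 64 = 5376, \qquad
\text{INT32 units} = 84 \times 64 = 5376, \\
\text{FP64 units} &= 84 \times 32 = 2688, \qquad
\text{Tensor Cores} = 84 \times 8 = 672, \qquad
\text{texture units} = 84 \times 4 = 336.
\end{aligned}
$$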

## Memory Units

The memory units of a hardware accelerator resemble a CPU's memory
controller. However, they face a bottleneck when fetching data from the
system's DRAM, which is slow relative to the processor's computational
speed. Without a cache for quick access, DRAM bandwidth is inadequate
for all of the accelerator's memory transactions. Consequently, if
instructions or data cannot be retrieved from DRAM quickly enough, the
accelerator sits idle and its efficiency drops. To tackle this DRAM
bandwidth problem, GPUs employ a hierarchy of memory units, each with
its own maximum bandwidth and latency. To fully exploit the available
computing power and enhance processing speed, programmers must choose
among these memory units and optimize memory usage according to their
different access speeds.

1. **Register file**: Registers are the fastest on-chip memories. In
   contrast to CPUs, each SM in a GPU has tens of thousands of
   registers. Nevertheless, using too many registers per thread reduces
   the number of thread blocks that can be scheduled on an SM and hence
   the number of threads that can execute concurrently. This
   underutilization of the hardware hampers performance considerably, so
   programmers must judiciously determine the number of registers to
   use, taking into account the algorithm's demands.

2. **Shared memory**: Shared memory is a user-controllable level-1
   cache. Each SM features 128 KB of level-1 cache, of which programmers
   can manage up to 96 KB as shared memory. Shared memory offers a low
   access latency of only a few dozen clock cycles and a bandwidth of up
   to 1.5 TB/s, far higher than the global memory's peak bandwidth of
   900 GB/s. In high-performance computing (HPC) scenarios, engineers
   must possess a thorough understanding of how to use shared memory
   effectively (a CUDA sketch illustrating shared and constant memory
   follows this list).

3. **Global memory**: Both the GPU and the CPU can read from and write
   to global memory. Global memory is visible to and accessible by all
   threads on a GPU, whereas other devices such as CPUs must traverse
   buses like PCIe or NVLink to access it. Global memory is the largest
   memory space on a GPU, with capacities exceeding 80 GB, but it also
   has the longest latency: a load or store can take hundreds of clock
   cycles.

4. **Constant memory**: Constant memory is a virtual address space
   within global memory and does not occupy a separate physical memory
   block. It serves as a high-speed memory designed for rapid caching
   and efficient broadcasting of a single value to all threads within a
   warp.

5. **Texture memory**: Texture memory is a specialized form of global
   memory that is accessed through a dedicated texture cache to enhance
   performance. In earlier GPUs without general-purpose caches, the
   texture memory on each SM served as the sole cache for data, but the
   introduction of level-1 and level-2 caches in modern GPUs has made
   this caching role obsolete. Texture memory is now most useful for the
   hardware-accelerated operations the GPU can perform while accessing
   it. For instance, arrays can be accessed using normalized addresses,
   and the retrieved data can be interpolated automatically by the
   hardware: texture memory supports hardware-accelerated bilinear and
   trilinear interpolation for 2D and 3D arrays, respectively. It also
   handles boundary conditions on array indices automatically, so
   operations on array elements can be written without explicit
   consideration of boundary situations, avoiding extra conditional
   branches in a thread.
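
To make these distinctions concrete, the following is a minimal CUDA
sketch of a block-level sum reduction: inputs are staged from global
memory into shared memory, a scale factor is broadcast from constant
memory, and per-thread temporaries live in registers. The kernel and
symbol names (`block_sum`, `c_scale`, `TILE`) are illustrative and not
part of any particular library.

```cpp
#include <cuda_runtime.h>

#define TILE 256  // threads per block (power of two)

// A single coefficient placed in constant memory: it is cached and
// broadcast to every thread of a warp in one transaction.
__constant__ float c_scale;

// Each block sums TILE scaled elements of `in` and writes one partial
// sum per block to `block_sums`.
__global__ void block_sum(const float* in, float* block_sums, int n) {
    // Shared memory: user-managed, low-latency, visible to the whole block.
    __shared__ float tile[TILE];

    int idx = blockIdx.x * blockDim.x + threadIdx.x;           // register-resident
    tile[threadIdx.x] = (idx < n) ? in[idx] * c_scale : 0.0f;  // global -> shared
    __syncthreads();

    // Tree reduction inside shared memory; each step halves the active threads.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        block_sums[blockIdx.x] = tile[0];                      // shared -> global
}

int main() {
    const int n = 1 << 20;
    const int blocks = (n + TILE - 1) / TILE;
    float *d_in, *d_sums;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_sums, blocks * sizeof(float));
    // (Filling d_in with real data is omitted here.)

    float scale = 2.0f;
    cudaMemcpyToSymbol(c_scale, &scale, sizeof(float));  // host -> constant memory

    block_sum<<<blocks, TILE>>>(d_in, d_sums, n);
    cudaDeviceSynchronize();

    cudaFree(d_in);
    cudaFree(d_sums);
    return 0;
}
```

Because each partial sum is read and updated by many threads during the
reduction, keeping it in shared memory avoids repeated trips to the much
slower global memory.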

## Compute Units

Hardware accelerators offer a variety of compute units to handle
different neural networks efficiently.
Figure :numref:`ch06/ch06-compute-unit` illustrates how different layers
of a neural network map onto appropriate compute units.

![Compute units](../img/ch06/compute_unit.png)
:label:`ch06/ch06-compute-unit`

1. **Scalar Unit**: computes one scalar element at a time, similar to a
   standard reduced instruction set computer (RISC) core.

2. **1D Vector Unit**: computes multiple elements at a time, similar to
   the SIMD units used in traditional CPU and GPU architectures. It has
   been widely used in HPC and signal processing.

3. **2D Matrix Unit**: computes a matrix-vector inner product or a
   vector outer product in one operation. It reuses data to reduce
   communication costs and memory footprint, which improves the
   performance of matrix multiplication.

4. **3D Cube Unit**: completes a matrix multiplication in one operation.
   Specially designed for neural network applications, it reuses data to
   compensate for the gap between data communication bandwidth and
   compute throughput (see the arithmetic sketch after this list).
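
The data-reuse advantage of the 2D and 3D units can be quantified with a
little arithmetic: multiplying two $n\times n$ matrices performs $n^3$
multiply-accumulate (MAC) operations while reading only $2n^2$ input
elements, so the work done per element loaded grows linearly with the
tile size:

$$
\frac{\text{MAC operations}}{\text{input elements}} = \frac{n^3}{2n^2} = \frac{n}{2}.
$$

For a $16\times16$ tile, every loaded element already supports 8 MAC
operations, which is how matrix and cube units narrow the gap between
memory bandwidth and compute throughput.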

The compute units on a GPU mainly comprise Scalar Units and 3D Cube
Units. As shown in Figure :numref:`ch06/ch06-SM`, each SM has 64 32-bit
floating-point arithmetic units, 64 32-bit integer arithmetic units, and
32 64-bit floating-point arithmetic units, which are Scalar Units, as
well as 8 Tensor Cores, which are 3D Cube Units specially designed for
neural network applications.

![Volta GV100 SM](../img/ch06/SM.png)
:label:`ch06/ch06-SM`

A Tensor Core can perform one $4\times4$ matrix multiply-accumulate
operation per clock cycle, as shown in
Figure :numref:`ch06/ch06-tensorcore`:

$$
\mathbf{D} = \mathbf{A} \times \mathbf{B} + \mathbf{C}
$$

![Tensor Core's $4\times4$ matrix multiply-accumulate operation](../img/ch06/tensor_core.png)
:label:`ch06/ch06-tensorcore`

$\bf{A}$, $\bf{B}$, $\bf{C}$, and $\bf{D}$ are $4\times4$ matrices. The
input matrices $\bf{A}$ and $\bf{B}$ are FP16 matrices, while the
accumulation matrices $\bf{C}$ and $\bf{D}$ can be either FP16 or FP32
matrices. Tesla V100's Tensor Cores are programmable matrix
multiply-accumulate units that can deliver up to 125 Tensor TFLOPS
(tera floating-point operations per second) for training and inference
(80 SMs $\times$ 8 Tensor Cores $\times$ 128 floating-point operations
per clock at a boost clock of roughly 1.5 GHz), roughly an order of
magnitude more than the standard FP32 compute units.
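
For illustration, the sketch below drives the Tensor Cores through
CUDA's warp-level `nvcuda::wmma` API, which exposes the hardware's
$4\times4$ operations as $16\times16\times16$ warp-wide tiles. It is a
minimal single-tile example assuming compute capability 7.0 or higher;
the kernel name `wmma_tile` and the fixed leading dimension of 16 are
illustrative choices, not a tuned GEMM.

```cpp
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes D = A * B + C for a single 16x16 tile.
// A and B are FP16; the accumulator C/D is FP32, matching the
// mixed-precision mode described above.
__global__ void wmma_tile(const half* A, const half* B,
                          const float* C, float* D) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

    // Load the operand tiles from global memory (leading dimension 16).
    wmma::load_matrix_sync(a_frag, A, 16);
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::load_matrix_sync(acc_frag, C, 16, wmma::mem_row_major);

    // One warp-synchronous multiply-accumulate executed on the Tensor Cores.
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);

    // Write the FP32 result tile back to global memory.
    wmma::store_matrix_sync(D, acc_frag, 16, wmma::mem_row_major);
}

// Launched with a single warp, e.g.: wmma_tile<<<1, 32>>>(A, B, C, D);
// Compile for Volta or newer, e.g.: nvcc -arch=sm_70 -c wmma_tile.cu
```

In practice, libraries such as cuBLAS and cuDNN issue these Tensor Core
operations on your behalf; writing WMMA code directly is mainly useful
in custom kernels.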

## Domain Specific Architecture

![Da Vinci architecture](../img/ch06/davinci_architecture.png)
:label:`ch06/ch06-davinci_architecture`

Domain Specific Architecture (DSA) has been an area of interest for
meeting the fast-growing demand for computing power from deep neural
networks. As typical DSA designs targeting image, video, voice, and text
processing, neural network processing units (also called deep learning
hardware accelerators) are systems-on-chip (SoCs) containing special
compute units, large memory units, and the corresponding control units.
A neural processing unit such as the Ascend chip typically consists of a
control CPU, a number of AI computing engines, multi-level on-chip
caches or buffers, and a digital vision pre-processing (DVPP) module.

The computing core of such AI chips consists of AI Cores, which are
responsible for executing scalar- and tensor-based arithmetic-intensive
computations. Consider the Ascend chip as an example: its AI Core adopts
the Da Vinci architecture.
Figure :numref:`ch06/ch06-davinci_architecture` shows the architecture
of an AI Core, which can be regarded as a simplified version of a modern
microprocessor architecture from the control perspective. It includes
three types of basic computing units: the Cube Unit, the Vector Unit,
and the Scalar Unit. These units operate on tensors, vectors, and
scalars, respectively, in three independent pipelines that the system
software schedules centrally so that they coordinate with each other for
higher efficiency. Similar to GPU designs, the Cube Unit is the
computational core of the AI Core and delivers parallel acceleration for
matrix multiply-accumulate operations. Specifically, it can multiply two
$16\times16$ matrices in a single instruction, completing 4096
($=16\times16\times16$) multiply-accumulate operations in an extremely
short time with precision comparable to FP16 operations.
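
A short calculation, based only on the figures above, shows why this
organization hides the bandwidth gap so well: one Cube instruction reads
two $16\times16$ FP16 operand tiles, i.e. $2\times256 = 512$ input
elements, yet performs $16\times16\times16 = 4096$ multiply-accumulate
operations,

$$
\frac{4096 \text{ MACs}}{512 \text{ inputs}} = 8 \text{ MACs per loaded element},
$$

and each operand element, once loaded, is reused in 16 of those
operations, so on-chip data reuse rather than raw memory bandwidth
sustains the compute units.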