Volta GV100. This architecture has:

4. 8 Tensor Cores
5. 4 texture units
6. 8 512-bit memory controllers.

As shown in Figure :numref:`ch06/ch06-gv100`, a GV100 GPU contains 84 SMs (Streaming
Multiprocessors), 5376 32-bit floating-point arithmetic units, 5376 32-bit integer
arithmetic units, 2688 64-bit floating-point arithmetic units, 672 Tensor Cores, and 336
texture units. A pair of memory controllers controls one HBM2 DRAM stack. Different
products may enable different configurations (e.g., the Tesla V100 enables 80 SMs).
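
These chip-level totals are simply the per-SM resources scaled by the number of SMs, as
the following quick check shows:

$$
84 \times 64 = 5376 \ \text{FP32 units}, \qquad
84 \times 32 = 2688 \ \text{FP64 units}, \qquad
84 \times 8 = 672 \ \text{Tensor Cores}, \qquad
84 \times 4 = 336 \ \text{texture units}.
$$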

## Memory Units

The memory units of a hardware accelerator resemble a CPU's memory controller. However,
they face a bottleneck when fetching data from the system's DRAM, which is slow relative
to the processor's computational speed. Without a cache for quick access, the DRAM
bandwidth is inadequate to handle all of the accelerator's transactions. Consequently, if
program instructions or data cannot be retrieved from DRAM quickly enough, the
accelerator's efficiency drops due to prolonged idle time. To tackle this DRAM bandwidth
issue, GPUs employ a hierarchical design of memory units, each offering its own maximum
bandwidth and latency. To fully exploit the computing power and improve processing speed,
programmers must choose among the available memory units and optimize memory usage
according to their different access speeds; the kernel sketch after the following list
shows how several of these memory spaces are used together.

1. **Register file**: Registers are the fastest on-chip memories. In contrast to CPUs,
   each SM in a GPU possesses tens of thousands of registers. Nevertheless, using too
   many registers per thread reduces the number of thread blocks that can be scheduled on
   the SM, leading to fewer executable threads. This underutilization of the hardware
   hampers performance considerably. Consequently, programmers must judiciously determine
   the appropriate number of registers to use, taking the algorithm's demands into
   account.

2. **Shared memory**: Shared memory is a user-controllable level-1 cache. Each SM
   features 128 KB of level-1 storage, of which programmers can manage up to 96 KB as
   shared memory. Shared memory offers low access latency, requiring only a few dozen
   clock cycles, and a bandwidth of up to 1.5 TB/s, which is significantly higher than
   the peak bandwidth of the global memory (900 GB/s). In high-performance computing
   (HPC) scenarios, engineers must have a thorough understanding of how to use shared
   memory effectively.

3. **Global memory**: Both GPUs and CPUs can read from and write to global memory.
   Global memory is visible to and accessible by all threads on a GPU, whereas other
   devices such as CPUs must traverse buses like PCIe or NVLink to access it. Global
   memory is the largest memory space on a GPU, with capacities reaching over 80 GB, but
   it also has the longest memory latency: a load/store can take hundreds of clock
   cycles.

4. **Constant memory**: Constant memory is a virtual address space within the global
   memory and does not occupy a separate physical memory block. It serves as a high-speed
   memory designed for rapid caching and efficient broadcasting of a single value to all
   threads within a warp.

5. **Texture memory**: Texture memory is a specialized form of global memory that is
   accessed through a dedicated texture cache to improve performance. In earlier GPUs
   without general-purpose caches, the texture memory on each SM served as the sole data
   cache. The introduction of level-1 and level-2 caches in modern GPUs has made this
   caching role obsolete. Texture memory is now most useful for hardware-accelerated
   operations performed while accessing memory: for instance, arrays can be addressed
   with normalized coordinates, and the retrieved data can be interpolated automatically
   by the hardware. Texture memory supports hardware-accelerated bilinear and trilinear
   interpolation for 2D and 3D arrays, respectively, and it handles boundary conditions
   based on array indices automatically, so operations on array elements can be carried
   out without explicit boundary checks, avoiding extra conditional branches in a thread.
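
As a concrete illustration, the following minimal CUDA sketch exercises several of these
memory spaces at once (the kernel name, tile size, and weights are illustrative choices,
not prescribed by this chapter): per-thread indices live in registers, a small halo tile
is staged in shared memory, the three stencil weights are broadcast from constant memory,
and the input and output arrays reside in global memory.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Constant memory: a few read-only values broadcast to all threads in a warp.
__constant__ float kWeights[3];

__global__ void stencil1d(const float *in, float *out, int n) {
    // Shared memory: a user-managed tile carved out of the SM's level-1 storage.
    __shared__ float tile[258];                         // blockDim.x (256) + 2 halo cells

    int gid = blockIdx.x * blockDim.x + threadIdx.x;    // registers: per-thread scalars
    int lid = threadIdx.x + 1;

    // Global memory: large capacity, but hundreds of cycles of load latency.
    tile[lid] = (gid < n) ? in[gid] : 0.0f;
    if (threadIdx.x == 0)
        tile[0] = (gid > 0) ? in[gid - 1] : 0.0f;
    if (threadIdx.x == blockDim.x - 1)
        tile[blockDim.x + 1] = (gid + 1 < n) ? in[gid + 1] : 0.0f;
    __syncthreads();

    // Neighboring values are re-read from fast shared memory instead of DRAM,
    // and the weights are broadcast from constant memory.
    if (gid < n)
        out[gid] = kWeights[0] * tile[lid - 1]
                 + kWeights[1] * tile[lid]
                 + kWeights[2] * tile[lid + 1];
}

int main() {
    const int n = 1 << 20;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;

    const float w[3] = {0.25f, 0.5f, 0.25f};
    cudaMemcpyToSymbol(kWeights, w, sizeof(w));         // populate constant memory

    stencil1d<<<(n + 255) / 256, 256>>>(in, out, n);
    cudaDeviceSynchronize();
    printf("out[1] = %f\n", out[1]);                    // expect 1.0 for all-ones input

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```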

## Compute Units

Hardware accelerators offer a variety of compute units to handle different neural
network workloads efficiently.
Figure :numref:`ch06/ch06-compute-unit` shows how different layers of a neural network
select the appropriate compute units.

![Compute units](../img/ch06/compute_unit.png)
:label:`ch06/ch06-compute-unit`

1. **Scalar Unit**: computes one scalar element at a time, similar to a standard reduced
   instruction set computer (RISC) core.

2. **1D Vector Unit**: computes multiple elements at a time, similar to the SIMD units in
   traditional CPU and GPU architectures. It has been widely used in HPC and signal
   processing.

3. **2D Matrix Unit**: computes the inner product of a matrix and a vector, or the outer
   product of two vectors, within one operation. It reuses data to reduce communication
   costs and memory footprint, which improves the performance of matrix multiplication.

4. **3D Cube Unit**: completes a matrix multiplication within one operation. Specially
   designed for neural network applications, it reuses data to compensate for the gap
   between data communication bandwidth and compute throughput (see the loop-nest sketch
   after this list).
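
The plain C++ loop nests below are a conceptual sketch of what one operation of each unit
corresponds to; the function names and sizes are illustrative and are not a vendor API.
Reading them top to bottom makes the growing data reuse explicit: the matrix-matrix case
performs on the order of $n^3$ multiply-accumulates while touching only on the order of
$n^2$ data.

```cpp
#include <cstddef>
#include <cstdio>

// Scalar Unit: one multiply-accumulate on scalar operands per operation.
float scalar_fma(float a, float b, float c) { return a * b + c; }

// 1D Vector Unit: one element-wise operation over a whole vector (SIMD-style).
void vector_fma(const float *a, const float *b, float *c, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        c[i] = a[i] * b[i] + c[i];
}

// 2D Matrix Unit: one matrix-vector product; each element of x is reused across an
// entire column of A, reducing memory traffic per arithmetic operation.
void matvec(const float *A, const float *x, float *y,
            std::size_t rows, std::size_t cols) {
    for (std::size_t i = 0; i < rows; ++i) {
        float acc = y[i];
        for (std::size_t j = 0; j < cols; ++j)
            acc += A[i * cols + j] * x[j];
        y[i] = acc;
    }
}

// 3D Cube Unit: one (small) matrix-matrix multiply-accumulate; every element of A and B
// is reused many times, which keeps the arithmetic units busy despite limited bandwidth.
void matmul_acc(const float *A, const float *B, float *C,
                std::size_t m, std::size_t n, std::size_t k) {
    for (std::size_t i = 0; i < m; ++i)
        for (std::size_t j = 0; j < n; ++j) {
            float acc = C[i * n + j];
            for (std::size_t p = 0; p < k; ++p)
                acc += A[i * k + p] * B[p * n + j];
            C[i * n + j] = acc;
        }
}

int main() {
    float A[2 * 3] = {1, 2, 3, 4, 5, 6};
    float B[3 * 2] = {1, 0, 0, 1, 1, 1};
    float C[2 * 2] = {0, 0, 0, 0};
    matmul_acc(A, B, C, 2, 2, 3);
    std::printf("C[0][0] = %.1f, C[1][1] = %.1f\n", C[0], C[3]);  // expect 4.0 and 11.0
    return 0;
}
```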

The compute units on a GPU mostly include Scalar Units and 3D Cube Units. As shown in
Figure :numref:`ch06/ch06-SM`, each SM has 64 32-bit floating-point arithmetic units,
64 32-bit integer arithmetic units, and 32 64-bit floating-point arithmetic units, which
are Scalar Units, as well as 8 Tensor Cores, which are 3D Cube Units specially designed
for neural network applications.

![Volta GV100 SM](../img/ch06/SM.png)
:label:`ch06/ch06-SM`

A Tensor Core can perform one $4\times4$ matrix multiply-accumulate operation per clock
cycle, as shown in Figure :numref:`ch06/ch06-tensorcore`:

```
D = A * B + C
```

![Tensor Core's $4\times4$ matrix multiply-accumulate operation](../img/ch06/tensor_core.png)
:label:`ch06/ch06-tensorcore`

$\bf{A}$, $\bf{B}$, $\bf{C}$, and $\bf{D}$ are $4\times4$ matrices. The input matrices
$\bf{A}$ and $\bf{B}$ are FP16 matrices, while the accumulation matrices $\bf{C}$ and
$\bf{D}$ can be either FP16 or FP32 matrices. Tesla V100's Tensor Cores are programmable
matrix multiply-accumulate units that can deliver up to 125 tensor TFLOPS (tera
floating-point operations per second) for training and inference, roughly a ten-fold
increase in computing speed compared with common FP32 compute units.
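
In CUDA, programmers do not issue $4\times4$ Tensor Core operations directly; the
warp-level WMMA API exposes them at the granularity of $16\times16$ tiles. The sketch
below is a minimal example assuming a Volta-or-newer GPU (compile with
`nvcc -arch=sm_70`); it computes $D = A \times B + C$ for a single tile with FP16 inputs
and FP32 accumulation, and the kernel name and all-ones test data are illustrative.

```cpp
#include <mma.h>
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <cstdio>

using namespace nvcuda;

// One warp cooperatively computes D = A * B + C for a single 16x16 tile on Tensor Cores.
__global__ void wmma_16x16(const half *a, const half *b, const float *c, float *d) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

    wmma::load_matrix_sync(a_frag, a, 16);                         // FP16 inputs
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::load_matrix_sync(acc_frag, c, 16, wmma::mem_row_major);  // FP32 accumulator
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);            // D = A * B + C
    wmma::store_matrix_sync(d, acc_frag, 16, wmma::mem_row_major);
}

int main() {
    half *a, *b;
    float *c, *d;
    cudaMallocManaged(&a, 256 * sizeof(half));
    cudaMallocManaged(&b, 256 * sizeof(half));
    cudaMallocManaged(&c, 256 * sizeof(float));
    cudaMallocManaged(&d, 256 * sizeof(float));
    for (int i = 0; i < 256; ++i) {
        a[i] = __float2half(1.0f);
        b[i] = __float2half(1.0f);
        c[i] = 1.0f;
    }

    wmma_16x16<<<1, 32>>>(a, b, c, d);  // a single warp drives the Tensor Cores
    cudaDeviceSynchronize();
    printf("d[0] = %.1f\n", d[0]);      // expect 17.0: a 16-term dot product of ones, plus 1
    return 0;
}
```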

## Domain Specific Architecture

![Da Vinci architecture](../img/ch06/davinci_architecture.png)
:label:`ch06/ch06-davinci_architecture`

Domain Specific Architecture (DSA) has been an area of interest for meeting the
fast-growing demand of deep neural networks for computing power. As a typical DSA design
targeting image, video, voice, and text processing, neural network processing units
(also known as deep learning hardware accelerators) are systems-on-chip (SoCs) containing
dedicated compute units, large memory units, and the corresponding control units. A
neural network processing unit, such as the Ascend chip, typically consists of a control
CPU, a number of AI computing engines, multi-level on-chip caches or buffers, and a
digital vision pre-processing (DVPP) module.

The computing core of an AI chip consists of AI Cores, which are responsible for
executing arithmetic-intensive scalar- and tensor-based computations. Consider the Ascend
chip as an example: its AI Core adopts the Da Vinci architecture.
Figure :numref:`ch06/ch06-davinci_architecture` shows the architecture of an AI Core,
which, from the control perspective, can be regarded as a simplified version of a modern
microprocessor architecture. It includes three types of basic computing units: the Cube
Unit, the Vector Unit, and the Scalar Unit, which operate on tensors, vectors, and
scalars, respectively. They form three independent pipelines that are centrally scheduled
by the system software and coordinate with one another for higher efficiency. Similar to
GPU designs, the Cube Unit is the computational core of the AI Core and provides parallel
acceleration for matrix multiply-accumulate operations. Specifically, it can multiply two
$16\times16$ matrices in a single instruction, completing 4096
($=16\times16\times16$) multiply-accumulate operations in an extremely short time with
precision comparable to FP16 operations.
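
A back-of-the-envelope arithmetic-intensity estimate (not a figure quoted in this
chapter) shows why packing a whole $16\times16\times16$ multiply-accumulate into one
instruction helps bridge the bandwidth gap. Counting only the two FP16 input matrices,
one Cube operation reads $2\times16\times16\times2 = 1024$ bytes and performs
$2\times4096 = 8192$ floating-point operations:

$$
\frac{8192\ \text{FLOPs}}{1024\ \text{bytes}} = 8\ \text{FLOPs per byte},
$$

whereas an isolated scalar multiply-accumulate performs 2 FLOPs on 4 bytes of fresh FP16
inputs, i.e., only 0.5 FLOPs per byte.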