# Components of Hardware Accelerators

A hardware accelerator typically comprises multiple on-chip caches and
various types of arithmetic units. In this section, we'll examine the
fundamental components of hardware accelerators, using the NVIDIA Volta
GPU architecture as a representative example.

## Architecture of Accelerators

Contemporary graphics processing units (GPUs) offer high computing
speed, large memory capacity, and high I/O bandwidth. A top-tier GPU
often packs more than twice as many transistors as a conventional CPU,
provides a memory capacity of 16 GB or greater, and runs at clock
frequencies of 1 GHz or higher. The architecture of a GPU comprises
streaming processors and a memory system, interconnected through an
on-chip network. These two components can be scaled independently,
allowing configurations tailored to the GPU's target market.

Figure :numref:`ch06/ch06-gv100` illustrates the architecture of the
Volta GV100. This architecture has:


:label:`ch06/ch06-gv100`

1. 6 GPU processing clusters (GPCs), each containing:

   1. 7 texture processing clusters (TPCs), each containing two
      streaming multiprocessors (SMs).

   2. 14 SMs.

2. 84 SMs in total, each containing:

   1. 64 32-bit floating-point arithmetic units

   2. 64 32-bit integer arithmetic units

   3. 32 64-bit floating-point arithmetic units

   4. 8 Tensor Cores

   5. 4 texture units

3. 8 512-bit memory controllers.

As shown in Figure :numref:`ch06/ch06-gv100`, a full GV100 GPU contains
84 SMs, 5376 32-bit floating-point arithmetic units, 5376 32-bit integer
arithmetic units, 2688 64-bit floating-point arithmetic units, 672
Tensor Cores, and 336 texture units. Each pair of memory controllers
drives one HBM2 DRAM stack. Products based on the GV100 may enable
different configurations (e.g., the Tesla V100 exposes 80 SMs).
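
The per-chip totals above can also be inspected at run time. The
following host-code sketch (a minimal illustration; the output format is
an assumption, not part of the text above) queries the SM count, memory
bus width, and global-memory size through the CUDA runtime API; on a
Tesla V100 it reports 80 SMs rather than the 84 of the full GV100 die.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    // Query the properties of device 0.
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        std::fprintf(stderr, "no CUDA device found\n");
        return 1;
    }
    std::printf("device           : %s\n", prop.name);
    std::printf("SM count         : %d\n", prop.multiProcessorCount);
    std::printf("memory bus width : %d bits\n", prop.memoryBusWidth);
    std::printf("global memory    : %.1f GB\n",
                prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
    return 0;
}
```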

## Memory Units

The memory units of a hardware accelerator resemble those managed by a
CPU's memory controller, and they face the same bottleneck: fetching
data from the system's DRAM is far slower than the processor's
computational speed. Without caches for fast access, DRAM bandwidth
cannot keep up with all of the accelerator's memory transactions, so
whenever instructions or data cannot be fetched quickly from DRAM, the
accelerator sits idle and its efficiency drops. To tackle this DRAM
bandwidth problem, GPUs employ a hierarchy of memory units, each with
its own maximum bandwidth and latency. To fully exploit the available
computing power and enhance processing speed, programmers must choose
among these memory units and optimize memory usage according to their
different access speeds.

1. **Register file**: Registers are the fastest on-chip memories. In
    contrast to CPUs, each SM in a GPU provides tens of thousands of
    registers. Nevertheless, using too many registers per thread reduces
    the number of thread blocks that can be scheduled on an SM, and
    therefore the number of threads that can execute concurrently. This
    underutilization of the hardware hurts performance considerably, so
    programmers must choose the number of registers per thread
    judiciously, based on the algorithm's demands.

2. **Shared memory**: Shared memory is a user-controllable level-1
    cache. Each SM features a 128 KB level-1 cache, of which programmers
    can configure up to 96 KB as shared memory. Shared memory offers low
    access latency, on the order of a few dozen clock cycles, and a
    bandwidth of up to 1.5 TB/s, far higher than the 900 GB/s peak
    bandwidth of global memory. In high-performance computing (HPC)
    scenarios, engineers must understand how to use shared memory
    effectively (see the CUDA sketch after this list).

3. **Global memory**: Both the GPU and the CPU can read from and write
    to global memory. Global memory is visible to and accessible by all
    threads on a GPU, whereas other devices such as CPUs must traverse
    buses like PCIe or NVLink to access it. Global memory is the largest
    memory space on a GPU, with capacities exceeding 80 GB on recent
    devices, but it also has the longest latency: a load or store can
    take hundreds of clock cycles.

4. **Constant memory**: The constant memory is a virtual address space
    in the global memory and does not occupy a physical memory block. It
    serves as a high-speed memory, specifically designed for rapid
    caching and efficient broadcasting of a single value to all threads
    within a warp.

5. **Texture memory**: Texture memory is a specialized form of global
    memory that is accessed through a dedicated texture cache to improve
    performance. In early GPUs without general-purpose caches, the
    texture cache on each SM was the only cache available for data; the
    introduction of level-1 and level-2 caches in modern GPUs has made
    this caching role largely obsolete. Texture memory is now most
    useful for the hardware-accelerated operations it provides on memory
    accesses. For instance, arrays can be addressed using normalized
    coordinates, and the retrieved data can be interpolated
    automatically by the hardware; both bilinear interpolation for 2D
    arrays and trilinear interpolation for 3D arrays are
    hardware-accelerated. Texture memory also handles boundary
    conditions automatically based on array indices, so operations on
    array elements near a boundary need no explicit special-casing,
    avoiding extra conditional branches in a thread.

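The kernel below is a minimal CUDA sketch of how these memory spaces are
typically combined; the kernel name, tile size, and zero-padded boundary
handling are illustrative assumptions rather than details taken from the
text above. A coefficient is broadcast from constant memory, a tile of
the input is staged in shared memory, per-thread temporaries live in
registers, and the input and output arrays reside in global memory.

```cuda
#include <cuda_runtime.h>

#define TILE 256

// A single coefficient broadcast from constant memory to every thread.
__constant__ float coeff;

// out[i] = coeff * (in[i] + in[i + 1]), with the out-of-range neighbor
// treated as zero. Illustrates global memory (in/out), shared memory
// (tile), constant memory (coeff), and registers (gid, right).
__global__ void scale_neighbors(const float *in, float *out, int n) {
    __shared__ float tile[TILE + 1];            // user-managed level-1 cache

    int gid = blockIdx.x * TILE + threadIdx.x;  // index held in a register

    // Stage one tile of the input from slow global memory into fast
    // shared memory; thread 0 also loads one halo element on the right.
    tile[threadIdx.x] = (gid < n) ? in[gid] : 0.0f;
    if (threadIdx.x == 0)
        tile[TILE] = (gid + TILE < n) ? in[gid + TILE] : 0.0f;
    __syncthreads();

    if (gid < n) {
        float right = tile[threadIdx.x + 1];    // register-resident temporary
        out[gid] = coeff * (tile[threadIdx.x] + right);
    }
}
```

The kernel would be launched with one thread per element and
`blockDim.x == TILE`, e.g.
`scale_neighbors<<<(n + TILE - 1) / TILE, TILE>>>(d_in, d_out, n);`,
after setting the coefficient with
`cudaMemcpyToSymbol(coeff, &h_coeff, sizeof(float));`.
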
## Compute Units {#Compute Units}

Hardware accelerators offer a variety of compute units to efficiently
handle various neural networks.
Figure :numref:`ch06/ch06-compute-unit` demonstrates how different
layers of neural networks select appropriate compute units.


:label:`ch06/ch06-compute-unit`

1. **Scalar Unit**: calculates one scalar element at a time, similar to
    a standard reduced instruction set computer (RISC) core.

2. **1D Vector Unit**: computes multiple elements at a time, similar to
    the SIMD units used in traditional CPU and GPU architectures. Vector
    units are widely used in HPC and signal processing.

3. **2D Matrix Unit**: computes a matrix-vector product or a vector
    outer product within one operation. It reuses data to reduce
    communication costs and memory footprint, which improves the
    performance of matrix multiplication (a rough estimate of this reuse
    appears after the list).

4. **3D Cube Unit**: completes a matrix multiplication within one
    operation. Specially designed for neural network applications, it
    reuses data to compensate for the gap between data communication
    bandwidth and compute throughput.

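To see why matrix and cube units benefit so much from data reuse, a
rough estimate (not taken from the text above) helps: multiplying two
$N\times N$ matrices performs $2N^3$ arithmetic operations while
touching only $3N^2$ matrix elements, so the arithmetic intensity grows
linearly with $N$:

$$\frac{\text{operations}}{\text{elements accessed}} = \frac{2N^3}{3N^2} = \frac{2N}{3}.$$

For the $16\times16$ tiles used by the Cube Unit described later, each
element loaded can take part in roughly ten operations, which is what
allows such units to keep their multipliers busy despite limited memory
bandwidth.
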
The compute units on a GPU mainly comprise Scalar Units and 3D Cube
Units. As shown in Figure :numref:`ch06/ch06-SM`, each SM has 64 32-bit
floating-point arithmetic units, 64 32-bit integer arithmetic units, and
32 64-bit floating-point arithmetic units, which are Scalar Units, as
well as 8 Tensor Cores, which are 3D Cube Units specially designed for
neural network applications.


:label:`ch06/ch06-SM`

A Tensor Core is capable of performing one $4\times4$ matrix
multiply-accumulate operation per clock cycle, as shown in
Figure :numref:`ch06/ch06-tensorcore`:

$$\mathbf{D} = \mathbf{A} \times \mathbf{B} + \mathbf{C}$$


:label:`ch06/ch06-tensorcore`

$\mathbf{A}$, $\mathbf{B}$, $\mathbf{C}$, and $\mathbf{D}$ are
$4\times4$ matrices. Input matrices $\mathbf{A}$ and $\mathbf{B}$ are
FP16 matrices, while the accumulation matrices $\mathbf{C}$ and
$\mathbf{D}$ can be either FP16 or FP32 matrices. Tesla V100's Tensor
Cores are programmable matrix multiply-accumulate units that can deliver
up to 125 tensor TFLOPS (tera floating-point operations per second) for
training and inference, roughly an order of magnitude faster than its
standard FP32 compute units.

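Programmers do not issue these $4\times4$ operations directly. CUDA
exposes Tensor Cores through the `wmma` (warp matrix multiply-accumulate)
API, which operates on larger tiles. The sketch below is a minimal
illustration of $\mathbf{D} = \mathbf{A} \times \mathbf{B} + \mathbf{C}$
for a single $16\times16\times16$ tile; the kernel name, matrix layouts,
and leading dimension of 16 are assumptions chosen only to keep the
example short.

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes D = A * B + C for a single 16x16 tile.
// A and B are FP16 inputs; C and D are FP32 accumulators.
__global__ void tensor_core_mma(const half *A, const half *B,
                                const float *C, float *D) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    // Load the 16x16 input tiles from global memory (leading dimension 16).
    wmma::load_matrix_sync(a_frag, A, 16);
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::load_matrix_sync(c_frag, C, 16, wmma::mem_row_major);

    // The multiply-accumulate itself runs on the SM's Tensor Cores.
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);

    // Write the 16x16 result tile back to global memory.
    wmma::store_matrix_sync(D, c_frag, 16, wmma::mem_row_major);
}
```

Launching the kernel with a single warp, e.g.
`tensor_core_mma<<<1, 32>>>(dA, dB, dC, dD);`, requires a GPU of compute
capability 7.0 (Volta) or later.
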
## Domain Specific Architecture


:label:`ch06/ch06-davinci_architecture`

Domain Specific Architecture (DSA) has attracted much interest as a way
to meet the fast-growing demand for computing power from deep neural
networks. As a typical DSA design targeting image, video, voice, and
text processing, neural network processing units (also called deep
learning hardware accelerators) are systems-on-chips (SoCs) containing
specialized compute units, large memory units, and the corresponding
control units. A neural processing unit such as the Ascend chip
typically consists of a control CPU, a number of AI computing engines,
multi-level on-chip caches or buffers, and a digital vision
pre-processing (DVPP) module.

The computing core of such an AI chip is the AI Core, which executes
arithmetic-intensive scalar, vector, and tensor computations. Consider
the Ascend chip as an example. Its AI Core adopts the Da Vinci
architecture.
Figure :numref:`ch06/ch06-davinci_architecture` shows the architecture
of an AI Core, which, from the control perspective, can be regarded as a
simplified version of a modern microprocessor architecture. It includes
three types of basic computing units: the Cube Unit, the Vector Unit,
and the Scalar Unit, which operate on tensors, vectors, and scalars,
respectively. They run as three independent pipelines, centrally
scheduled by the system software and coordinated with each other for
higher efficiency. Similar to GPU designs, the Cube Unit functions as
the computational core of the AI Core and delivers parallel acceleration
for matrix multiply-accumulate operations. Specifically, it can multiply
two $16\times16$ FP16 matrices in a single instruction, completing 4096
($=16\times16\times16$) multiply-accumulate operations within an
extremely short time.
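
As a point of reference (a plain reference loop, not Ascend's actual
instruction set or data layout), the function below spells out the
computation that a single Cube instruction replaces: three nested loops
of depth 16, i.e. exactly $16\times16\times16 = 4096$
multiply-accumulate operations per instruction.

```cuda
// Reference 16x16x16 matrix multiply-accumulate: C += A * B.
// The real Cube Unit consumes FP16 inputs with wider accumulation;
// plain float is used here only to keep the sketch self-contained.
void cube_reference(const float A[16][16], const float B[16][16],
                    float C[16][16]) {
    // 16 * 16 * 16 = 4096 multiply-accumulate operations in total:
    // the amount of work one Cube instruction completes at once.
    for (int i = 0; i < 16; ++i)
        for (int j = 0; j < 16; ++j)
            for (int k = 0; k < 16; ++k)
                C[i][j] += A[i][k] * B[k][j];
}
```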