# Operator Selection

Following graph optimization, the compiler backend generates a sequence of operators that can be executed on hardware. This is achieved by selecting the most suitable operator from a set of candidate operators for each node in the IR. Since these candidate operators have diverse specifications, their execution efficiency varies depending on the scenario. Therefore, the primary objective of operator selection is to choose the operators that are most appropriate for the target device, based on the information provided by the IR.

## Basic Concepts of Operator Selection

We can think of the nodes in a backend-optimized IR as units of execution that are visible to the user, each representing a hardware-agnostic operation in the user code. In essence, operator selection means choosing appropriate hardware-specific information for each such node; this information is referred to as operator information, and it defines the following:

1. The format of an operator, which is a determinant of the operator's performance on the target platform. Machine learning systems commonly use the NCHW and NHWC formats.

2. The data type (such as float32, float16, or int32) of an operator on the target platform. The operators selected are those with data types close to (or the same as) the user definitions.

### Data Formats

In machine learning systems, many operations (such as convolution) are converted into matrix multiplication for faster computation. Matrix multiplication in the form of $\mathbf{A}\times \mathbf{B} = \mathbf{C}$ is essentially a row-by-column multiplication: the entry *ij* of **C** is obtained by multiplying the entries in the *i*th row of **A** by the corresponding entries in the *j*th column of **B** and then adding the results together. Consider the example shown in Figure :numref:`ch07/ch07-compiler-backend-06`. Matrix data is stored in row-major order by default, as shown at the top of the figure. However, matrix **B** is read in column-major order during the multiplication, as shown at the bottom.

![Matrix data layouts in matrix multiplication](../img/ch07/matmuldatalayout.png)
:label:`ch07/ch07-compiler-backend-06`

Storing matrix **B** in the order in which it is read increases computation efficiency, because access to contiguous blocks of memory is faster. We can therefore see that data formats play an important role in performance improvement.

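To make the layout effect concrete, the following NumPy sketch (with an arbitrary matrix size chosen purely for illustration) shows that a column of a row-major matrix is a strided view of memory, whereas storing the transpose of **B** contiguously turns every column read into a contiguous row access.

```python
import numpy as np

A = np.random.rand(512, 512)
B = np.random.rand(512, 512)

# B is stored in row-major (C) order: each row is contiguous, but the
# columns read during A @ B stride across memory.
print(B.flags["C_CONTIGUOUS"])        # True
print(B[:, 0].flags["C_CONTIGUOUS"])  # False: a column is a strided view

# Storing B in its reading order (a contiguous copy of its transpose)
# makes each "column" a contiguous row in memory.
B_read_order = np.ascontiguousarray(B.T)
print(B_read_order[0].flags["C_CONTIGUOUS"])  # True

# Entry C[i, j] is the dot product of row i of A and column j of B.
i, j = 3, 7
c_ij = np.dot(A[i, :], B_read_order[j, :])
assert np.isclose(c_ij, (A @ B)[i, j])
```
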
There are two major formats in machine learning systems: NCHW and NHWC. For an image input, N denotes the batch size, C denotes the number of channels, and H and W denote the height and width, respectively. Figure :numref:`ch07/ch07-compiler-backend-07` depicts the logical layout of an input with a batch size of 2, 16 channels, a height of 5, and a width of 4.

![Format diagram](../img/ch07/data_format.png)
:label:`ch07/ch07-compiler-backend-07`

A multidimensional matrix is flattened into 1D format before it is written to memory. This involves indexing, which maps logical data to physical memory.

Access to machine learning data is performed in an axis-wise order from the last axis forward. For instance, data in NCHW format is read in the axis order of W, H, C, and N. Equation :eqref:`ch05/equation-01` denotes the mapping between logical memory and physical memory for this format of data.

$$
\text{offset}_\text{nchw}(n,c,h,w) = n \times C \times H \times W + c \times H \times W + h \times W + w
$$
:eqlabel:`equation:ch05/equation-01`

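As a quick sanity check of this mapping, the following sketch (with an arbitrarily chosen shape) implements the NCHW offset from Equation :eqref:`ch05/equation-01` and compares it with NumPy's default row-major flattening.

```python
import numpy as np

def offset_nchw(n, c, h, w, N, C, H, W):
    """Map a logical (n, c, h, w) index to its physical offset in NCHW layout."""
    return n * C * H * W + c * H * W + h * W + w

# Check against NumPy's default row-major flattening of an NCHW tensor.
N, C, H, W = 8, 3, 32, 32
x = np.arange(N * C * H * W).reshape(N, C, H, W)
flat = x.ravel()  # row-major (C-order) flattening

assert flat[offset_nchw(2, 1, 5, 7, N, C, H, W)] == x[2, 1, 5, 7]
```
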
As shown in Figure :numref:`ch07/ch07-compiler-backend-08`, matrix elements are flattened from the lowest dimension (i.e., the W axis) forward, and neighboring elements of an axis reside next to each other in memory. To reach the same element at the same location in the next image, the whole image size ($C \times H \times W$) has to be jumped over. Assume we have a batch of eight RGB images of size $32 \times 32$, that is, a matrix with $N=8, C=3, H=32, W=32$. Memory storage of these images begins at the first channel of the first image: the matrix is flattened along axis W, and its elements are then arranged along axis H, before the next channel is processed. The same procedure is repeated until the last channel of the last image is processed. NCHW is the default format of PyTorch and MindSpore.

![RGB image data in NCHW format](../img/ch07/nchw.png)
:label:`ch07/ch07-compiler-backend-08`

Access to data in NHWC format also begins at the lowest dimension (in this case, the C axis) and proceeds forward. NHWC is the default format of TensorFlow (PyTorch refers to it as the channels-last format). Equation :eqref:`ch05/equation-02` denotes the mapping from logical memory to physical memory for this format of data.

$$
\text{offset}_\text{nhwc}(n,h,w,c) = n \times H \times W \times C + h \times W \times C + w \times C + c
$$
:eqlabel:`equation:ch05/equation-02`

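The NHWC offset can be checked the same way. The sketch below (again with an arbitrary shape) builds an NHWC copy of an NCHW tensor and verifies that the same logical element is found at the offset given by Equation :eqref:`ch05/equation-02`.

```python
import numpy as np

def offset_nhwc(n, h, w, c, N, H, W, C):
    """Map a logical (n, h, w, c) index to its physical offset in NHWC layout."""
    return n * H * W * C + h * W * C + w * C + c

N, C, H, W = 8, 3, 32, 32
x_nchw = np.arange(N * C * H * W).reshape(N, C, H, W)
x_nhwc = x_nchw.transpose(0, 2, 3, 1).copy()  # same data, NHWC layout

# The same logical element lands at a different physical offset in this layout.
flat = x_nhwc.ravel()
assert flat[offset_nhwc(2, 5, 7, 1, N, H, W, C)] == x_nchw[2, 1, 5, 7]
```
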
Figure :numref:`ch07/ch07-compiler-backend-nchwandnhwc` compares the logical indexing of the NCHW and NHWC formats. The \[x:1\] marks refer to the jumps from the innermost axis to the next. For example, \[a:1\] indicates the jump from axis W to axis H, and \[b:1\] indicates the jump from axis C (the innermost) to axis W.

![NCHW and NHWC formats](../img/ch07/nchwandnhwc.png)
:label:`ch07/ch07-compiler-backend-nchwandnhwc`

These two formats offer a high degree of flexibility and are therefore used by many frameworks. However, to accelerate computation on hardware, further optimization is needed. In a machine learning system, if the size of the user input exceeds what the compute component can pass through the network at a time (which is often the case), the input is batched before computation. For further optimization, many frameworks introduce blocked formats, which are more hardware-friendly, such as the nChw16c and nChw8c formats of the oneAPI Deep Neural Network Library (oneDNN) and the NC1HWC0 format of the Ascend platform. By leveraging hardware acceleration instructions to move and compute data, matrices can be quickly transformed into vectors, increasing the utilization of the on-chip cache.

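To illustrate the idea of a blocked format, the sketch below re-blocks an NCHW tensor into an NC1HWC0-style layout with a block size of 16. The helper name, the zero-padding of the channel axis, and the fixed block size are illustrative assumptions rather than the exact behavior of oneDNN or the Ascend software stack.

```python
import numpy as np

def nchw_to_nc1hwc0(x, c0=16):
    """Re-block an NCHW tensor into NC1HWC0, padding C up to a multiple of c0.

    NC1HWC0 splits the channel axis into C1 = ceil(C / c0) groups of c0
    channels and makes the c0 block the innermost (contiguous) axis.
    """
    n, c, h, w = x.shape
    c1 = (c + c0 - 1) // c0
    padded = np.zeros((n, c1 * c0, h, w), dtype=x.dtype)
    padded[:, :c, :, :] = x
    # (N, C1*C0, H, W) -> (N, C1, C0, H, W) -> (N, C1, H, W, C0)
    return padded.reshape(n, c1, c0, h, w).transpose(0, 1, 3, 4, 2)

x = np.random.rand(8, 3, 32, 32).astype(np.float32)
print(nchw_to_nc1hwc0(x).shape)  # (8, 1, 32, 32, 16)
```
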
### Data Types

Single precision (float32), which occupies 32 bits in memory, is the most commonly used data type in machine learning systems. In applications where higher precision is not essential, the half-precision (float16) data type may be used instead, occupying only 16 bits in memory. On suitable hardware, float16 offers up to 7 times more arithmetic throughput with a smaller memory footprint compared with the single-precision data type; this allows for larger batch sizes and consequently reduced training time. Next, we will look at the differences between half-precision and single-precision floating-point numbers.

In Figure :numref:`ch07/ch07-float32andfloat16`, *Sig* refers to the sign bit that indicates the sign of a number, *Exponent* refers to the exponent bits, and *Mantissa* refers to the mantissa bits.

![Binary representation of floating-point numbers](../img/ch07/floatdtype.png)
:label:`ch07/ch07-float32andfloat16`

Applying Equation :eqref:`ch05/equation-03` converts a normalized float16 number from its binary representation to decimal format.

$$
(-1)^{\text{Sig}}\times 2^{\text{Exponent}-15}\times \left(\frac{\text{Mantissa}}{1024}+1\right)
$$
:eqlabel:`equation:ch05/equation-03`

If the exponent bits and mantissa bits are all 0s, the number is 0. If the exponent bits are all 0s but the mantissa bits are not, the number is subnormal, i.e., a very small value close to zero. If the exponent bits are all 1s and the mantissa bits are all 0s, the number is an infinity, either positive or negative depending on the sign bit. Not a Number (NaN) is denoted by the exponent bits being all 1s while the mantissa bits are not all 0s. bfloat16 is a special data type developed by Google for machine learning on its tensor processing units (TPUs). Although bfloat16 is not an IEEE-standard 16-bit floating-point data type, it has the same exponent size as float32, meaning that it can be easily converted to and from float32.

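The ease of conversion can be illustrated with bit manipulation: the sketch below truncates a float32 to its upper 16 bits to obtain a bfloat16-style bit pattern and then widens it back. Real conversions usually apply round-to-nearest rather than the plain truncation used here, so treat this only as an illustration of the shared exponent layout.

```python
import numpy as np

def float32_to_bfloat16_bits(x):
    """Keep the upper 16 bits of a float32, i.e., a bfloat16 bit pattern.

    bfloat16 keeps float32's 8 exponent bits and the top 7 mantissa bits,
    so conversion is (up to rounding) just dropping the low 16 bits.
    """
    bits = np.array(x, dtype=np.float32).view(np.uint32)
    return np.uint16(bits >> 16)

def bfloat16_bits_to_float32(b):
    """Widen a bfloat16 bit pattern back to float32 by appending 16 zero bits."""
    return np.array(np.uint32(b) << 16, dtype=np.uint32).view(np.float32)

x = np.float32(3.1415927)
b = float32_to_bfloat16_bits(x)
print(bfloat16_bits_to_float32(b))  # 3.140625: only mantissa precision is lost
```
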
### Operator Information Library

Hardware devices support different operators, depending on their data format and data type requirements. Each device therefore maintains an operator information library that contains a comprehensive list of the operators supported by that device. During the operator selection process, the most suitable operators are chosen from this library; it serves as the reference for determining which operators are compatible with, and can be efficiently executed on, a particular hardware device.

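A minimal sketch of what such a library might look like is shown below. The class, the entries, and the field names are hypothetical; they only illustrate the idea of recording (device, data type, data format) combinations per operator and do not correspond to any framework's actual API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class OperatorInfo:
    """One entry of a (hypothetical) operator information library."""
    op: str      # operation name in the IR, e.g., "Conv2D"
    device: str  # target device, e.g., "GPU", "Ascend", "CPU"
    dtype: str   # supported data type, e.g., "float16", "float32"
    fmt: str     # supported data format, e.g., "NCHW", "NHWC", "NC1HWC0"

# A toy library: each device lists the operator variants it supports.
OPERATOR_LIBRARY = [
    OperatorInfo("Conv2D", "GPU",    "float32", "NCHW"),
    OperatorInfo("Conv2D", "GPU",    "float16", "NHWC"),
    OperatorInfo("Conv2D", "Ascend", "float16", "NC1HWC0"),
    OperatorInfo("MatMul", "GPU",    "float32", "NCHW"),
]

def candidates(op, device):
    """Return all registered operator variants for a node on a given device."""
    return [info for info in OPERATOR_LIBRARY
            if info.op == op and info.device == device]

print(candidates("Conv2D", "Ascend"))
```
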
## Process of Operator Selection

Operator selection involves choosing the most appropriate operator for each operation node in an IR. Operator information contains the supported device type, data type, and data format. After the compiler frontend completes type inference and static analysis, the data type of the user code can be derived from the IR.

Figure :numref:`ch07/ch07-compiler-backend-select` shows the operator selection process. First, the target hardware needs to be selected (or this step can be skipped in order to keep the default hardware selection defined in the compiler backend). The implementation, supported data types, and execution efficiency of a given operator vary depending on the target hardware. Then, the compiler backend selects an operator based on the data type and data format derived from the IR.

![Operator selection process (using GPU as an example)](../img/ch07/select_kernel.png)
:label:`ch07/ch07-compiler-backend-select`

The result of the operator selection process might not be as expected due to software or hardware limitations. Sometimes, we might need to adjust the precision of a particular node to find an operator with the right data type. For example, the Conv2D operator supported by Ascend (i.e., the backend of MindSpore) allows only the float16 data type. When used in a float32 network on Ascend, the Conv2D operator is executable only after its input precision is reduced from float32 to float16.

Converting data from the format of one operator to that of another can be time-consuming and incurs memory movement overheads. To avoid this, data should be transferred between operators of the same format whenever possible. In addition, data type inconsistency may lead to reduced precision, potentially slowing down or even preventing network convergence. As such, thorough operator analysis is needed to ensure that the right data type is selected.

Simply put, an operator selection algorithm is considered optimal if it keeps the data type as consistent as possible with user settings while also minimizing data format conversion.
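
This rule of thumb can be illustrated with a deliberately simplified selection pass. The ordering of the three matching steps, the `CAST_ORDER` list, and the candidate tuples below are illustrative assumptions, not any framework's actual algorithm.

```python
# A toy operator-selection pass: prefer a candidate that matches both the
# IR-derived data type and format, then one that matches only the data type
# (paying a format conversion), and fall back to a reduced-precision
# candidate (e.g., float32 -> float16) otherwise.

CAST_ORDER = ["float32", "float16"]  # allowed precision reductions, in order

def select_operator(node_dtype, node_fmt, candidates):
    """candidates: list of (dtype, fmt) pairs supported on the target device."""
    # 1. Exact data type and format match: no cast, no format conversion.
    for dtype, fmt in candidates:
        if dtype == node_dtype and fmt == node_fmt:
            return dtype, fmt
    # 2. Same data type, different format: pay only a format conversion.
    for dtype, fmt in candidates:
        if dtype == node_dtype:
            return dtype, fmt
    # 3. Otherwise, reduce precision to the closest supported data type.
    for dtype in CAST_ORDER[CAST_ORDER.index(node_dtype) + 1:]:
        for cand_dtype, fmt in candidates:
            if cand_dtype == dtype:
                return cand_dtype, fmt
    raise ValueError("no suitable operator found for this node")

# A Conv2D node on a float32 NCHW network where the device offers only float16:
print(select_operator("float32", "NCHW", [("float16", "NC1HWC0")]))
# -> ('float16', 'NC1HWC0'): the input must be cast down to float16.
```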
