# Operator Compiler {#sec:operator-compiler}

Operator compilers are used for compiling and optimizing operators, which may be part of a neural network or come from code implemented in a domain-specific language (DSL). Compilation is the process of *transforming* the source code from one *representation* into another.

The objective of an operator compiler is to improve the *execution performance* of operators. An operator compiler accepts tensor computation logic described in *dynamic languages* (e.g., Python) as input and outputs executable files for *specific AI processors*.

## Scheduling Strategy

An operator compiler abstracts the execution of statements in an operator implementation into "scheduling strategies". Since an operator typically consists of multiple statements, the focus lies in determining the scheduling strategy for the statements within the operator. This strategy encompasses considerations such as the calculation order, data block movement, and other relevant factors.

If we ignore the specific processor architecture, achieving the best performance only requires loading all input tensors into the computation core according to the *computational logic* of the operator and then retrieving the result from the core for storage. *Computational logic* refers to basic arithmetic operations (e.g., addition, subtraction, multiplication, and division) and other function expressions (e.g., convolution, transposition, and loss functions).

The modern computer memory hierarchy resembles a pyramid, as shown in Figure :numref:`ch05/ch05-memory_architecture`. As we move up the pyramid, the storage elements have a higher cost but a faster access time.

![Modern computer memory hierarchy](../img/ch05/memory_architecture.png)
:label:`ch05/ch05-memory_architecture`

Such a hardware design leads to two basic types of locality:

(1) Temporal locality: the tendency to access the same memory location several times in quick succession. As such, accessing the same location in the L1 cache several times is more efficient than accessing different locations in the L1 cache several times.

(2) Spatial locality: the tendency to access nearby memory locations in quick succession. As such, accessing nearby locations in the L1 cache several times is more efficient than moving back and forth between the L1 cache and the main memory.

Both types of locality help improve system performance. Specifically, to improve data access speed, data that is processed repeatedly should be placed in fixed, nearby memory locations whenever possible.
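
To make the effect of locality concrete, the following sketch (an illustrative measurement of our own, not part of any operator compiler) times row-wise versus column-wise traversal of a row-major NumPy array; the contiguous row accesses are typically faster because neighboring elements share cache lines:

```
import time

import numpy as np

a = np.random.rand(4096, 4096)  # row-major (C-contiguous) layout

# Row-wise traversal: each slice a[i, :] is contiguous in memory.
start = time.perf_counter()
row_total = sum(float(a[i, :].sum()) for i in range(a.shape[0]))
row_time = time.perf_counter() - start

# Column-wise traversal: each slice a[:, j] jumps across rows in memory.
start = time.perf_counter()
col_total = sum(float(a[:, j].sum()) for j in range(a.shape[1]))
col_time = time.perf_counter() - start

print(f"row-wise: {row_time:.3f} s, column-wise: {col_time:.3f} s")
```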

For a serial computational task, it is also possible to decouple the data part from the logic part and generate a range of independent groups of data that can be executed in parallel, as shown in Figure :numref:`ch05/ch05-parallel_computing`.

![Serial computing and parallel computing](../img/ch05/parallel_computing.png)
:label:`ch05/ch05-parallel_computing`

These specific data-oriented operations performed at program runtime are referred to as *schedules*. A schedule defines the following aspects:

(1) When and where should each value in a function be calculated?

(2) Where is data stored?

(3) How long is each value stored between the time it is computed and the time it is consumed by later computations, and when is such a value recomputed instead of stored?

Simply put, a scheduling strategy is a set of algorithms, designed during compilation based on the characteristics of the target hardware architecture, that improve locality and parallelism. The purpose is to ensure that the resulting executable file delivers optimal performance at runtime. These algorithms have no effect on the computation result; they only adjust the computation process in order to shorten the computation time.
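
As a toy illustration (plain Python, with deliberately small sizes of our choosing), the two loop nests below implement the same multiply-accumulate logic under two different schedules, a naive one and a tiled one; only the order in which the iteration space is visited changes, and the result is identical:

```
import random

N, TILE = 64, 16
A = [[random.random() for _ in range(N)] for _ in range(N)]
B = [[random.random() for _ in range(N)] for _ in range(N)]

# Naive schedule: visit (m, n, k) in their natural order.
C_naive = [[0.0] * N for _ in range(N)]
for m in range(N):
    for n in range(N):
        for k in range(N):
            C_naive[m][n] += A[m][k] * B[k][n]

# Tiled schedule: same arithmetic, different visiting order of (m, n, k).
C_tiled = [[0.0] * N for _ in range(N)]
for m0 in range(0, N, TILE):
    for n0 in range(0, N, TILE):
        for k0 in range(0, N, TILE):
            for m in range(m0, m0 + TILE):
                for n in range(n0, n0 + TILE):
                    for k in range(k0, k0 + TILE):
                        C_tiled[m][n] += A[m][k] * B[k][n]

# Both schedules compute the same result.
assert all(abs(C_naive[m][n] - C_tiled[m][n]) < 1e-9
           for m in range(N) for n in range(N))
```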

## Combining Scheduling Strategies

In the realm of operator compilers, a common optimization technique is to combine multiple abstracted scheduling strategies into a comprehensive and efficient scheduling set through manual template matching. However, this approach may not be fine-grained enough, and it can be labor-intensive to achieve refined optimization across different operators. To illustrate this, let us consider an optimization implemented in the Tensor Virtual Machine (TVM), which accelerates a multiply-accumulate code segment on the CPU by combining several fundamental scheduling strategies.

In Code `lst:before_tvm`, the basic computational logic is as follows: initialize tensor C, multiply tensor A by tensor B, and accumulate the results into tensor C.

**lst:before_tvm**
```
for (m: int32, 0, 1024) {
  for (n: int32, 0, 1024) {
    C[((m*1024) + n)] = 0f32
    for (k: int32, 0, 1024) {
      let cse_var_2: int32 = (m*1024)
      let cse_var_1: int32 = (cse_var_2 + n)
      C[cse_var_1] = (C[cse_var_1] + (A[(cse_var_2 + k)]*B[((k*1024) + n)]))
    }
  }
}
```

Assuming that the data type is float and that tensors A, B, and C are each of size 1024 $\times$ 1024, the total memory required by the three tensors is 1024 $\times$ 1024 $\times$ 3 $\times$ sizeof(float) = 12 MB. This far exceeds the capacity of common caches (e.g., a typical L1 cache is 32 KB). Therefore, if we want to operate on tensors A, B, and C in a single pass, we must store them in the main memory. However, the main memory is distant from the compute core, so accessing it is significantly less efficient than accessing the cache.
120+
121+
There are several scheduling strategies that can help improve
122+
performance: tile, reorder, and split. The size of the L1 cache is 32
123+
KB. To ensure that data used in every computation step is stored in the
124+
cache, tiling based on the factors of 32 is performed. In this way, only
125+
the tiny block formed by `m.inner `$\times$` n.inner` needs to be taken
126+
into account, and memory access of the innermost tiny block is
127+
independent of the outer loops. A tiny block will occupy only 32
128+
$\times$ 32 $\times$ 3 $\times$ sizeof(float), which is 12 KB in the
129+
cache. The optimized code is shown in Code
130+
`lst:after_tvm`. We perform tiling on loops m and n based on
131+
factor 32 as the previous analysis. Similarly, we tile the loop k based
132+
on factor 4, then reorder the k.outer and k.inner axis as the outermost
133+
axis.

**lst:after_tvm**
```
// Obtain an outer loop by tiling for (m: int32, 0, 1024) based on factor 32.
for (m.outer: int32, 0, 32) {
  // Obtain an outer loop by tiling for (n: int32, 0, 1024) based on factor 32.
  for (n.outer: int32, 0, 32) {
    // Obtain an inner loop by tiling for (m: int32, 0, 1024) based on factor 32.
    for (m.inner.init: int32, 0, 32) {
      // Obtain an inner loop by tiling for (n: int32, 0, 1024) based on factor 32.
      for (n.inner.init: int32, 0, 32) {
        // Initialize C using the offsets derived from the tiling factors.
        C[((((m.outer*32768) + (m.inner.init*1024)) + (n.outer*32)) + n.inner.init)] = 0f32
      }
    }
    // Obtain an outer loop by splitting for (k: int32, 0, 1024) based on factor 4, with reorder.
    for (k.outer: int32, 0, 256) {
      // Obtain an inner loop by splitting for (k: int32, 0, 1024) based on factor 4, with reorder.
      for (k.inner: int32, 0, 4) {
        // Obtain an inner loop by tiling for (m: int32, 0, 1024) based on factor 32.
        for (m.inner: int32, 0, 32) {
          // Obtain an inner loop by tiling for (n: int32, 0, 1024) based on factor 32.
          for (n.inner: int32, 0, 32) {
            // Outer axis factor obtained by tiling along axis n
            let cse_var_3: int32 = (n.outer*32)
            // Outer axis & inner axis factors obtained by tiling along axis m
            let cse_var_2: int32 = ((m.outer*32768) + (m.inner*1024))
            // Outer axis & inner axis factors obtained by tiling along axes m & n
            let cse_var_1: int32 = ((cse_var_2 + cse_var_3) + n.inner)
            // The computation is split into layers so that the data involved in each loop can be stored in the cache.
            C[cse_var_1] = (C[cse_var_1] + (A[((cse_var_2 + (k.outer*4)) + k.inner)] * B[((((k.outer*4096) + (k.inner*1024)) + cse_var_3) + n.inner)]))
          }
        }
      }
    }
  }
}
```
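
A loop nest like the one above is typically not written by hand; it is generated by composing schedule primitives. As a rough sketch (using TVM's tensor expression `te` API; the variable names and the exact lowered output are our assumptions and may differ across TVM versions), the tiling, splitting, and reordering could be expressed as follows:

```
import tvm
from tvm import te

M = N = K = 1024
bn = 32  # tile size chosen so the working set of a block fits in the L1 cache

A = te.placeholder((M, K), name="A")
B = te.placeholder((K, N), name="B")
k = te.reduce_axis((0, K), name="k")
C = te.compute((M, N), lambda m, n: te.sum(A[m, k] * B[k, n], axis=k), name="C")

s = te.create_schedule(C.op)
# Tile the m and n loops with factor 32 and split the reduction loop k with factor 4.
mo, no, mi, ni = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn)
ko, ki = s[C].split(s[C].op.reduce_axis[0], factor=4)
# Move the k loops outside the inner tile so the innermost block stays cache-resident.
s[C].reorder(mo, no, ko, ki, mi, ni)

# Print the lowered loop nest, which should resemble lst:after_tvm.
print(tvm.lower(s, [A, B, C], simple_mode=True))
```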

## Finding Optimized Strategies with Polyhedral Models

Another optimization approach is to automatically select an operator schedule from a schedule search space. A good example of this idea is polyhedral compilation. Such techniques improve the generalization of operator compilation at the expense of longer compile times.

Polyhedral compilation mainly optimizes the loops in user code by abstracting each loop nest into a multidimensional space, each computation instance into a point in that space, and each dependency between instances into a line in that space. The main idea of the algorithm is to model the memory access characteristics of the code and adjust the execution order of the instances within each loop, so that the loop code achieves better locality and parallelism under the new schedule.

Code `lst:before_poly` is used as an example to describe the algorithm.

**lst:before_poly**
```
for (int i = 0; i < N; i++)
  for (int j = 1; j < N; j++)
    a[i+1][j] = a[i][j+1] - a[i][j] + a[i][j-1];
```

As shown in Figure :numref:`ch05/ch05-poly_test`, a memory access structure is first modeled using the polyhedral model algorithm, and then dependencies (denoted by arrows) between instances (denoted by nodes) are analyzed.

![Polyhedral model of the sample code](../img/ch05/poly_test.png)
:label:`ch05/ch05-poly_test`

Complex dependency analysis and schedule transformation are then performed to obtain an optimal solution that fits the memory model. Using the polyhedral model algorithm, the code is optimized into the form shown in Code `lst:after_poly`.

**lst:after_poly**
```
for (int i_new = 0; i_new < N; i_new++)
  for (int j_new = i_new+1; j_new < i_new+N; j_new++)
    a[i_new+1][j_new-i_new] = a[i_new][j_new-i_new+1] - a[i_new][j_new-i_new] + a[i_new][j_new-i_new-1];
```
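
One way to read this transformation (our interpretation of the example, not a statement about any particular polyhedral tool) is as a loop-skewing schedule

$$(i_{\mathrm{new}},\ j_{\mathrm{new}}) = (i,\ i + j), \qquad \text{so that} \qquad j = j_{\mathrm{new}} - i_{\mathrm{new}},$$

which is why every occurrence of index `j` in the statement body becomes `j_new - i_new` and the bounds of the inner loop are shifted by `i_new`.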

The resulting code looks relatively complex. We can model the code (as shown in Figure :numref:`ch05/ch05-poly`) to determine its performance improvements. Through dependency analysis, we find that the loop dependencies present in the source code are removed in the optimized code, thereby increasing the opportunities for parallel computing. Specifically, parallel computing becomes possible when the iteration space is partitioned along the dashed lines into the green blocks, as shown in Figure :numref:`ch05/ch05-poly`.

![Optimization result with the polyhedral model](../img/ch05/poly.png)
:label:`ch05/ch05-poly`

We have only introduced the polyhedral compilation technique in this section. However, other optimization techniques are available, such as Ansor, a heuristic search method with pruning.

## Adaptation to Instruction Sets

We have previously explored the optimization techniques of operator compilers. In this section, we build on this foundation to examine how operator compilers adapt to instruction sets on different chips. Typically, a general-purpose compiler is designed to be compatible with as many backend architectures and instruction sets as possible. However, this can present challenges when the compiler must handle backends with different architectures and instruction sets.

Two common programming models adopted by AI processors are single instruction, multiple data (SIMD) and single instruction, multiple threads (SIMT). As shown in Figures :numref:`ch05/ch05-SIMD` and :numref:`ch05/ch05-SIMT`, respectively, SIMD corresponds to chips with vector instructions, while SIMT corresponds to chips that support multiple threads. Recently, some chips have begun to combine both programming models in order to support both multithreaded parallel computing and vector instructions. When handling different programming models, an operator compiler adopts different optimization strategies, such as vectorization.

![SIMD diagram](../img/ch05/SIMD.png)
:label:`ch05/ch05-SIMD`

![SIMT diagram](../img/ch05/SIMT.png)
:label:`ch05/ch05-SIMT`
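
As a rough sketch of how these two programming models surface in a scheduling API (again using TVM's `te` interface as an example; the split factors and the choice of thread axes are our assumptions), the same element-wise operator can be vectorized for a SIMD target or bound to thread axes for a SIMT target:

```
import tvm
from tvm import te

n = te.var("n")
A = te.placeholder((n,), name="A")
B = te.placeholder((n,), name="B")
C = te.compute((n,), lambda i: A[i] + B[i], name="C")

# SIMD-style target: split the loop and vectorize the inner part
# so that it maps onto vector instructions.
s_simd = te.create_schedule(C.op)
outer, inner = s_simd[C].split(C.op.axis[0], factor=8)
s_simd[C].vectorize(inner)

# SIMT-style target: bind the loop parts to GPU blocks and threads
# so that each instance runs on its own thread.
s_simt = te.create_schedule(C.op)
bx, tx = s_simt[C].split(C.op.axis[0], factor=64)
s_simt[C].bind(bx, te.thread_axis("blockIdx.x"))
s_simt[C].bind(tx, te.thread_axis("threadIdx.x"))

print(tvm.lower(s_simd, [A, B, C], simple_mode=True))
print(tvm.lower(s_simt, [A, B, C], simple_mode=True))
```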

Operator compilers place a strong emphasis on differentiated support in the frontend, midend, and backend. In the frontend, support for multiple backend instruction sets is added, allowing AI programmers to focus on algorithm logic without having to worry about chip differences. In the midend, the architectures of different chips are identified, which allows for specific optimization methods to be implemented for each chip. When generating backend code, the instruction sets of different chips are further identified to ensure efficient execution on target chips.
## Expression Ability

The representation capability of an operator compiler is important because it determines how well the frontend can express the input code in an IR without loss of syntax information. The frontend of an operator compiler is often fed code written in flexible languages (e.g., PyTorch code written in Python). However, flexible expressions (e.g., indexing and view syntax in Python) place high demands on the frontend expression ability of operator compilers. From the model perspective, the code that manages the inputs of an operator often contains many control flow statements. Also, some models allow for dynamic-shape operators whose shapes vary with control flow decisions across iterations.

Additionally, a large number of operators may not have optimized implementations provided directly by accelerator libraries (e.g., cuDNN); these are referred to as long-tail operators. Long-tail operators can have highly flexible syntax or abundant control flow statements and sometimes involve dynamic shapes, making it extremely difficult for the frontend of existing operator compilers to express, optimize, or accelerate them. Consequently, such operators have to be executed by the Python interpreter or slow virtual machines, creating a performance bottleneck in network execution. This is why it is imperative to improve the expression ability of the operator compiler frontend.
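
As a purely hypothetical illustration of such a long-tail operator, the following PyTorch-style function (our own example, not drawn from any particular model) mixes data-dependent control flow with a dynamic-shape output, which is exactly the kind of code that is hard for an operator compiler frontend to capture in a static IR:

```
import torch

def filter_rows(x: torch.Tensor, threshold: float) -> torch.Tensor:
    # Data-dependent control flow: which branch runs depends on runtime values.
    if x.abs().max() < threshold:
        return x
    # Dynamic-shape output: how many rows survive is unknown until runtime.
    mask = x.sum(dim=-1) > 0
    return x[mask]

# The output shape differs from call to call.
print(filter_rows(torch.randn(8, 4), threshold=10.0).shape)
print(filter_rows(torch.randn(8, 4), threshold=0.1).shape)
```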
