
Commit e9a2ab6

committed
Upload replaced sections
1 parent 27c2fe6 commit e9a2ab6

File tree

5 files changed: +357 -1 lines changed


chapter_computatioinal_graph/Computational_Graph_Basics.md

Lines changed: 1 addition & 1 deletion
@@ -38,7 +38,7 @@ tensor.
| dtype | Data type, such as bool, uint8, int16, float32, and float64. |
| device | Target device, such as a CPU or GPU. |
| name | Tensor name. |
- :label:ch04/ch4-tensor
+ :label:`ch04/ch4-tensor`

In the following, we explore each tensor attribute with image data as an
Lines changed: 64 additions & 0 deletions
@@ -0,0 +1,64 @@
# Computational Graph Functions

Early machine learning frameworks were designed mainly for fully
connected networks and convolutional neural networks (CNNs). Such neural
networks consist of serial layers, whose topologies can be represented
in simple configuration files (e.g., Caffe model definitions in Protocol
Buffers format).

Conversely, modern machine learning models have increasingly complex
structures. Prominent examples include mixture-of-experts (MoE) models,
generative adversarial networks (GANs), and attention models. To improve
training efficiency for complex model structures (e.g., loops with
branching), machine learning frameworks are expected to quickly analyze
operator dependencies, gradient computation, and training parameters in
order to facilitate model optimization, formulate scheduling strategies,
and automate gradient computation. As such, machine learning system
designers call for a common data structure to understand, represent, and
execute machine learning models. To this end, machine learning
frameworks introduce computational graph technology while still
decoupling the frontend and backend languages in design, as shown in
Figure :numref:`ch04/ch04-DAG`. From a top-level view, computational
graph technology provides the following key functions:

![Computational graph-based architecture](../img/ch04/graph.png)
:label:`ch04/ch04-DAG`
1. **Unified representation of the computation process.** Developers
    tend to write machine learning programs in high-level programming
    languages (e.g., Python, Julia, and C++). However, because most
    devices such as hardware accelerators provide only C/C++ APIs,
    implementations of machine learning systems are largely restricted
    to C/C++. Computational graph technology makes it possible to run
    programs written in different high-level languages on common
    low-level C/C++ system modules. As a unified representation, a
    computational graph describes a model's input data, computational
    logic (usually referred to as operators), and the execution sequence
    of operators.

2. **Automatic gradient computation.** The training program receives
    data samples (or the training dataset), performs forward computation
    through the network, and then calculates the loss value. Based on
    the loss value, the machine learning system computes the gradient
    for each model parameter and then updates the model parameters. The
    gradient computation method should apply universally and run
    automatically, regardless of the model topology and loss computation
    method. Based on the computational graph, the machine learning
    system can quickly analyze the gradient transfer relations between
    parameters, thereby achieving automatic gradient computation (see
    the sketch after this list).

3. **Lifetime analysis of model variables.** During model training,
    many intermediate variables are generated, for example, the
    activation values in the forward pass and the gradients in the
    backward pass. Some of the intermediate variables generated in the
    forward pass are used in conjunction with the gradients for updating
    model parameters. With a computational graph, the machine learning
    system can accurately analyze the lifetime of each intermediate
    variable (i.e., from the time the variable is generated to the time
    it is destroyed), helping the framework optimize memory management.

4. **Execution optimization.** User programs can have different network
    structures. With computational graph technology, the machine
    learning framework can analyze the model topology and operator
    dependencies, and automatically search for operator parallelization
    strategies to improve model execution efficiency.
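To make the unified-representation and automatic-gradient ideas concrete,
below is a minimal, self-contained Python sketch of a computational graph
with reverse-mode gradient computation. It is illustrative only: the
`Node`, `add`, `mul`, and `backward` names are ours, not the API of any
real framework. Each node records its inputs and the local derivative
with respect to each input, so gradients can be accumulated by walking
the recorded graph in reverse topological order.

```python
class Node:
    """One operator output in a recorded computational graph."""
    def __init__(self, value, parents=(), local_grads=()):
        self.value = value              # forward result
        self.parents = parents          # upstream nodes this node depends on
        self.local_grads = local_grads  # d(self)/d(parent) for each parent
        self.grad = 0.0                 # filled in by backward()

def add(a, b):
    return Node(a.value + b.value, parents=(a, b), local_grads=(1.0, 1.0))

def mul(a, b):
    return Node(a.value * b.value, parents=(a, b), local_grads=(b.value, a.value))

def backward(output):
    """Accumulate gradients by traversing the graph in reverse topological order."""
    order, visited = [], set()
    def topo(node):
        if node not in visited:
            visited.add(node)
            for parent in node.parents:
                topo(parent)
            order.append(node)
    topo(output)
    output.grad = 1.0
    for node in reversed(order):
        for parent, local in zip(node.parents, node.local_grads):
            parent.grad += node.grad * local

# y = (x1 + x2) * x1
x1, x2 = Node(2.0), Node(3.0)
y = mul(add(x1, x2), x1)
backward(y)
print(y.value, x1.grad, x2.grad)  # 10.0, dy/dx1 = 7.0, dy/dx2 = 2.0
```

The same recorded structure also supports the other functions listed
above: because every intermediate `Node` is explicit, the framework can
reason about when each value is last used (lifetime analysis) and which
nodes are independent of one another (execution optimization).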
Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
# Further Reading

1. Computational graph technology is fundamentally important to major
    machine learning frameworks. For the design details of major machine
    learning frameworks, see *TensorFlow: Large-Scale Machine Learning
    on Heterogeneous Distributed Systems*[^1] and *PyTorch: An
    Imperative Style, High-Performance Deep Learning Library*.

2. Out-of-graph control flows are created using the frontend language
    and are easy to grasp for most programmers. However, implementing
    control flows using the in-graph approach can be challenging. For
    more on this topic, see *Implementation of Control Flow in
    TensorFlow*[^2].

3. For the design and practices of dynamic and static graphs, see
    *TensorFlow Eager: A Multi-Stage, Python-Embedded DSL for Machine
    Learning*[^3], *Eager Execution: An Imperative, Define-by-Run
    Interface to TensorFlow*[^4], *Introduction to Graphs and
    tf.function*[^5], and *MindSpore Computational Graph*[^6].

[^1]: <https://arxiv.org/pdf/1603.04467.pdf>

[^2]: <http://download.tensorflow.org/paper/white_paper_tf_control_flow_implementation_2017_11_1.pdf>

[^3]: <https://arxiv.org/pdf/1903.01855.pdf>

[^4]: <https://ai.googleblog.com/2017/10/eager-execution-imperative-define-by.html>

[^5]: <https://www.tensorflow.org/guide/intro_to_graphs>

[^6]: <https://www.mindspore.cn/tutorials/en/master/advanced/compute_graph.html>
Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
# Computational Graph

In this chapter, we look at the following question: How does a machine
learning system efficiently execute a machine learning program on
hardware? We can break this down into three sub-questions: How do we
schedule and execute the model described by a machine learning program?
How do we improve the model scheduling and execution efficiency? And can
we implement automatic gradient computation for updating the model? The
key to answering these questions is computational graph technology. To
explain this technology, this chapter covers the following key aspects:

1. Computational graph basics

2. Generation of static and dynamic computational graphs

3. Common execution methods of computational graphs
Lines changed: 245 additions & 0 deletions
@@ -0,0 +1,245 @@
# Scheduling and Executing Computational Tasks

Training a model is carried out by scheduling the execution of the
operators in a computational graph. From a broad perspective, a training
job runs a computational graph for a defined number of iterations,
relying on optimal scheduling of tasks such as data loading and training
(or inference) execution. Within each iteration, we need to analyze
operator-level scheduling based on the graph topology, computational
dependencies, and control flows. We optimize the scheduling and
execution of computational graphs to make full use of computing
resources, improve computational efficiency, and shorten the model
training and inference time. The following introduces the typical
techniques of computational graph scheduling.

The scheduling and execution of a computational graph can be divided
into three modes according to how the graph is generated: operator
scheduling, whole-graph scheduling, and combined operator and subgraph
scheduling. These three modes correspond to the dynamic graph, static
graph, and combined dynamic-and-static mechanisms of computational graph
generation.

Next, we introduce the scheduling and execution of computational graphs
in detail.
## Operator Scheduling

Operator scheduling means that the operators contained in the algorithm
or model are scheduled and executed one by one by the Python runtime.
This scheduling mechanism is used when the computational graph is
executed in dynamic graph mode, such as PyTorch's default execution mode
and TensorFlow's eager mode.

Operator scheduling involves two steps. First, based on the call
sequence of the operators declared in the model, the dynamic
computational graph derives a linear operator scheduling sequence.
Second, the ordered operators are distributed to instruction streams.

In Figure :numref:`ch04/ch04-diaoduzhixing`, the directed acyclic graph on
the left contains five nodes a, b, c, d, and e and four dependency edges
a-\>d, b-\>c, c-\>d, and d-\>e (e.g., a-\>d indicates that d depends on
a). According to the operator call sequence of the model code, such as
a-\>b-\>c-\>d-\>e, all operator nodes are put into the queue in turn,
and the scheduling ends.

![Operator scheduling and execution](../img/ch04/schedule.png)
:label:`ch04/ch04-diaoduzhixing`

With the ordering, we then prepare to distribute the operators in the
ordering and their related data to the GPU hardware for execution.
Figure :numref:`ch04/ch04-single-op-exec` shows the trace of operator
scheduling. Once the Python runtime calls an operator, the machine
learning framework initializes the operator by determining information
such as the operator precision, the type and size of each input/output,
and the target device. It then allocates memory for the operator before
copying the data to the specific device for execution.

![Operator scheduling trace](../img/ch05/single_op_exec.PNG)
:label:`ch04/ch04-single-op-exec`
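As a concrete illustration, the short PyTorch snippet below runs in the
default eager mode; the shapes and operators are arbitrary examples, not
taken from the figures above. Each line is dispatched and executed
immediately by the Python runtime, in exactly the order it is called:

```python
import torch

# Pick a target device; fall back to the CPU when no GPU is present.
device = "cuda" if torch.cuda.is_available() else "cpu"

x = torch.randn(32, 64, device=device)   # operator: create an input tensor
w = torch.randn(64, 128, device=device)  # operator: create a weight tensor
h = torch.matmul(x, w)                   # executed as soon as it is called
y = torch.relu(h)                        # consumes the already-materialized matmul result
print(y.shape)                           # torch.Size([32, 128])
```

Because each operator returns a concrete tensor before the next line
runs, Python-native control flow (loops, `if` statements, debuggers) can
freely inspect intermediate results, which is the flexibility discussed
below.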
The operator scheduling method offers high flexibility because operators
are directly scheduled by the Python runtime. It facilitates the
representation of complex computational logic (such as control flows)
and the use of Python-native data structures for implementing complex
algorithms. Operators are driven by the Python runtime to finish
computational tasks, enabling easy collaboration with Python's large,
rich ecosystem.

Despite these advantages, operator scheduling also has some
disadvantages. One is that context-based runtime optimizations such as
operator fusion and algebraic simplification become difficult, because
global information about the computational graph is unavailable. Another
is that computational tasks have to run serially rather than in
parallel, owing to the lack of knowledge of the computational topology.
## Graph Scheduling

When a computational graph uses the static graph mechanism for
whole-graph scheduling and execution, operators are still sent to the
hardware for execution one by one in a certain order. However, global
information about the computational graph is available: the framework
can analyze operator dependencies and the number of computing devices,
and complete the scheduling and execution of the entire graph in the
following two ways:

1. **Serial**: executes tasks one at a time, in the order in which they
    are added to the queue. This method expands a computational graph
    into a sequence of operators, which are then run separately.
    Operators are executed in a static order using a single thread,
    thereby requiring fewer resources.

2. **Parallel**: executes tasks concurrently for higher efficiency.
    This method expands a computational graph based on operator
    dependencies. Operators are executed in the order defined by their
    input dependencies, and those without pending input dependencies are
    executed concurrently. This method executes operators in a dynamic
    order (which may vary in each iteration) using multiple threads,
    thereby consuming more system resources (see the sketch after this
    list).
Within a computational graph, most operators depend on each other
directly or indirectly. When scheduling such operators, their execution
sequence must be guaranteed. Figure
:numref:`ch04/ch04-diaodu` shows a computational graph in which a
forward pass is run on the input data to produce a predicted value and
then the gradient of the loss function is computed for backpropagation.
In general, downstream operators depend on the output of their upstream
operators. As such, we have to schedule the operators in this
computational graph into a serial queue in order to ensure that each
operator receives the necessary input.

![Serial operator scheduling](../img/ch04/order.png)
:label:`ch04/ch04-diaodu`
A computational graph may also contain operators that are independent of
each other, for example, op1 and op2 shown in Figure
:numref:`ch04/ch04-para`. We can have each such operator run on a
different hardware device to implement parallel computing. Compared with
the serial mode, parallel computing decreases execution time by
leveraging more computing resources at the same time.

![Parallel operator scheduling](../img/ch04/para.png)
:label:`ch04/ch04-para`

Serial execution and parallel execution have their own advantages and
disadvantages, as summarized in Table
:numref:`ch04/ch4-graph`.

:Comparison between serial execution and parallel execution

| Execution Method     | Serial execution | Parallel execution |
|----------------------|------------------|--------------------|
| Execution Order      | Static           | Dynamic            |
| Execution Threads    | Single thread    | Multiple threads   |
| Resource Consumption | Low              | High               |
:label:`ch04/ch4-graph`
A computing environment contains more than one type of computing device,
such as a CPU, a GPU, or other accelerators. A computational graph
consisting of operators that run on more than one type of computing
device is referred to as a heterogeneous computational graph.

Such a graph contains the following types of operators, classified by
the computing hardware (a combined example follows this list):

- **CPU operators**: C++ operators that run on the host CPU. The
    computing performance of the CPU depends on the extent to which its
    multi-core capability is utilized.

- **GPU operators**: Operators that run on the GPU (e.g., an NVIDIA
    GPU). GPU kernels are delivered to the GPU one by one for execution.
    The GPU features ample parallel computing units that offer
    significant speedup to parallel algorithms.

- **Python operators**: Operators that run on the host CPU. Unlike CPU
    operators, Python operators are interpreted and executed by the
    Python runtime interpreter.
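The minimal PyTorch sketch below combines the three kinds of operators
in one heterogeneous computation; it is an illustration only, and the
GPU branch is taken only when a CUDA device is actually available.

```python
import torch

cpu = torch.device("cpu")
gpu = torch.device("cuda") if torch.cuda.is_available() else cpu

x = torch.randn(1024, 1024, device=cpu)  # CPU operator: tensor created on the host
y = x.to(gpu) @ x.to(gpu).T              # GPU operator (if a GPU is present): matmul kernel
z = y.to(cpu)                            # explicit device-to-host copy
threshold = float(z.mean())              # Python-level logic run by the interpreter
mask = z > threshold                     # back to framework operators on the CPU
print(mask.sum().item())
```

The explicit `.to(...)` calls make the device-to-device copies visible;
as discussed next, such transfers are exactly what an efficient schedule
tries to minimize.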
We mentioned earlier that the dynamic graph mechanism relies on the
Python interpreter to distribute operators and execute them serially
according to the order of operators defined by the model code. In this
mode, data often has to be transferred between different computing
devices. Communication bottlenecks may increase the time operators spend
waiting for data, reducing the overall execution efficiency of the
computational graph. Therefore, the first condition for efficient
execution of a computational graph is to accurately identify the device
on which each operator executes and to avoid data transfers between
devices as much as possible; independent operators can then be scheduled
on different devices in parallel. The static graph mechanism is free of
the constraints of the Python interpreter: the computational graph is
dispatched to the device in one pass, which reduces the number of
interactions between the host and the computing chip and improves
computing efficiency and performance.
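For example, TensorFlow's `tf.function` captures a Python function as a
whole graph; the sketch below is illustrative (the computation itself is
arbitrary), but it shows how, after the first tracing call, the captured
graph can be dispatched and optimized as a unit rather than operator by
operator from the interpreter.

```python
import tensorflow as tf

@tf.function  # traces the Python function into a static graph on the first call
def forward(x, w):
    h = tf.matmul(x, w)
    return tf.nn.relu(h)

x = tf.random.normal((32, 64))
w = tf.random.normal((64, 128))
y = forward(x, w)  # later calls with the same signature reuse the captured graph
print(y.shape)     # (32, 128)
```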
The combined operator-and-subgraph scheduling mode is a combination of
the previous two execution modes. Because computational graph structures
can be highly flexible, executing an entire complex graph on a single
accelerator chip may not yield optimal efficiency. For example,
accelerator chips excel at floating-point operations, whereas CPUs are
good at processing logical judgments. Therefore, the parts of a graph
that execute inefficiently on the accelerator can be separated out and
handed over to devices that execute them more efficiently, such as the
CPU, thereby balancing performance and flexibility.
There are different levels of parallelism: operator parallelism, model
parallelism, and data parallelism. Operator parallelism is not just
about executing independent operators in parallel. Where applicable, we
can further partition an operator into multiple parallel child
operations. Model parallelism refers to partitioning a computational
graph among several devices in order to shorten the time taken by each
training iteration. And data parallelism involves training the same
computational graph on different data, reducing the total number of
iterations and improving training efficiency. We will discuss these
three parallelism methods in the Distributed Training chapter.
## Synchronous and Asynchronous Data Loading

As previously mentioned, a single training iteration of a computational
graph goes through three serial tasks: data loading, data preprocessing,
and model training. Each task depends on the output of the previous one.
To schedule these three types of tasks across iterations of graph
training, we can use either a synchronous or an asynchronous mechanism
at the iteration level.

1. **Synchronous**: Tasks are executed in order, one after the other.
    Tasks have to wait for and coordinate with each other.

2. **Asynchronous**: As soon as a task completes, the same task for the
    next iteration can start immediately.
If the synchronous mechanism is adopted to train the computational graph
shown in Figure :numref:`ch04/ch04-tongbu`, in each iteration, a batch of input
data is loaded, preprocessed, and then passed to the computational graph
for model training and parameter update. Tasks in the next iteration
wait until the current iteration is complete. The synchronous mechanism
wastes computation and communication resources because the data
preprocessing and model training tasks must wait until a batch of data
is completely loaded, and because the I/O channel for data loading is
idle at model training time.

![Synchronous mechanism](../img/ch04/sync.png)
:label:`ch04/ch04-tongbu`
In the asynchronous setting shown in Figure
:numref:`ch04/ch04-yibu`, after loading and passing a batch of
input data to the subsequent data preprocessing task, the I/O channel
immediately moves on to the next batch without waiting for the current
iteration to complete. In contrast with the synchronous mechanism, the
idle time between data loading, data preprocessing, and model training
in the asynchronous mechanism is notably reduced, thereby shortening the
overall training time with improved execution efficiency.

![Asynchronous mechanism](../img/ch04/async.png)
:label:`ch04/ch04-yibu`
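The asynchronous mechanism can be sketched in plain Python with a
background loading thread and a bounded buffer; the function names and
buffer size here are illustrative, not part of any framework.

```python
import queue
import threading

def load_batch(i):
    # placeholder for real I/O: read and decode one batch from storage
    return [i] * 4

def loader(num_batches, buffer):
    # data-loading task: runs ahead of training, keeping the buffer filled
    for i in range(num_batches):
        buffer.put(load_batch(i))  # blocks only when the buffer is already full
    buffer.put(None)               # sentinel: no more data

buffer = queue.Queue(maxsize=2)    # bounded buffer between loading and training
threading.Thread(target=loader, args=(8, buffer), daemon=True).start()

while True:
    batch = buffer.get()           # training consumes batches as they become ready
    if batch is None:
        break
    # train_step(batch) would run here; loading of later batches overlaps with it
```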
To further shorten the training time and improve the execution
efficiency, we can combine the asynchronous mechanism with parallel
computing, as shown in Figure
:numref:`ch04/ch04-yibubingxing`. On the one hand, the
asynchronous mechanism reduces the model's wait time for data loading
and preprocessing, allowing the model to quickly traverse the entire
dataset. On the other hand, parallel computing increases the batch size
in iterative training, increasing the efficiency of computing resources.

![Asynchronous mechanism combined with parallel computing](../img/ch04/para-async.png)
:label:`ch04/ch04-yibubingxing`
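As a concrete sketch of combining the two ideas, a `tf.data` input
pipeline can preprocess samples in parallel while asynchronously
prefetching later batches; `parse_example` below is a stand-in for real
preprocessing, not code from the text.

```python
import tensorflow as tf

def parse_example(x):
    # stand-in preprocessing; a real pipeline would decode and augment data here
    return tf.cast(x, tf.float32) / 255.0

dataset = (
    tf.data.Dataset.range(10_000)
    .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)  # parallel preprocessing
    .batch(32)
    .prefetch(tf.data.AUTOTUNE)  # asynchronously prepares later batches during training
)

for batch in dataset.take(2):
    pass  # a train_step(batch) would run here, overlapping with prefetching
```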
