## Automatic Differentiation
A key challenge in deep learning is to automatically derive the backward pass from the forward pass, which is given as a program. This problem had long been studied in the field of [automatic differentiation](https://arxiv.org/pdf/1502.05767.pdf), or autodiff, well before the recent prosperity of deep learning.
## Program Transformation vs. Backtracking
Given the forward pass program, there are two strategies to derive the backward pass:
1. by transforming the forward pass program without executing it, or
1. by backtracking the execution process of the forward pass program.
This article is about the latter strategy.
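
To make the contrast concrete, here is a minimal, hand-written sketch of the two strategies applied to f(x) = x * x. None of this code comes from the surveyed systems; the function names and the `tape` representation are chosen only for illustration:

```python
# Strategy 1: transform the program without executing it.
# For f(x) = x * x, a source-to-source tool (or a human) derives f'(x) = 2 * x.
def f(x):
    return x * x

def f_grad(x):                 # the transformed (derived) program
    return 2 * x

# Strategy 2: execute the program, record its trace, and backtrack the trace.
def f_traced(x):
    tape = []
    y = x * x
    # record the operator's inputs and its local partial derivatives:
    # d(a*b)/da = b and d(a*b)/db = a
    tape.append({"inputs": [x, x], "partials": [x, x]})
    return y, tape

def backtrack(tape):
    dy = 1.0                   # d(output)/d(output) = 1
    dx = 0.0
    for record in reversed(tape):
        for partial in record["partials"]:
            dx += dy * partial # chain rule, accumulated per input occurrence
    return dx

y, tape = f_traced(3.0)
print(f_grad(3.0), backtrack(tape))  # both print 6.0
```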
## The Tape and Dynamic Networks
We refer to the trace of the execution of the forward pass program as a *tape*[[1]](http://www.bcl.hamilton.ie/~barak/papers/toplas-reverse.pdf). When we train a deep learning model, the tape changes every iteration as the input data change, so we have to re-derive the backward pass every iteration. This re-derivation is time-consuming, but it easily accommodates forward programs that contain control flow such as if-else and for/while loops: with control flow, the execution trace may change from iteration to iteration. Such changing traces are known as *dynamic networks* in the field of deep learning.
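
The following is a minimal sketch of this effect; the `forward` function and its `tape` list are hypothetical and not taken from any of the surveyed systems. Because the forward program branches and loops on the data, the recorded trace differs from iteration to iteration:

```python
import random

def forward(x, tape):
    """A toy forward pass whose recorded trace depends on the input data."""
    if x > 0:                            # data-dependent branch
        tape.append("identity")
        y = x
    else:
        tape.append("scale")
        y = 0.1 * x
    for _ in range(int(abs(x)) % 3):     # data-dependent loop count
        tape.append("square")
        y = y * y
    return y

for step in range(3):
    tape = []                            # a fresh tape every iteration
    forward(random.uniform(-3, 3), tape)
    print(step, tape)                    # the list of recorded operators varies
```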
## Typical Systems
Deep learning systems that build on the idea of dynamic networks have gained popularity in recent years. This article surveys the following representative systems:
- [DyNet](https://dynet.readthedocs.io/en/latest/)
- [PyTorch](https://pytorch.org/)
- Chainer
- Autograd from HIPS
Before diving into these systems, let us set up an example forward pass program:
```python
x = Variable(randn(20, 1))
label = Variable(randint(1, 20))
W_1, W_2 = Variable(randn(20, 20)), Variable(randn(20, 20))
h = matmul(W_1, x)
pred = matmul(W_2, h)
loss = softmax(pred, label)
loss.backward()
```
## The Representation of Tapes
### DyNet: the Tape as a List
DyNet uses a linear data structure, a list, to represent the tape. During the execution of the above example, the tape becomes a list of operators: `matmul`, `matmul`, and `softmax`. The list also includes the information needed for the backward pass, such as pointers to each operator's inputs and outputs. The tape is then played in reverse order at `loss.backward()`.
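
Below is a minimal sketch of a tape-as-list design; the `TapeEntry` class and the `record`/`backward` helpers are hypothetical and only illustrate the idea, not DyNet's actual (C++) implementation:

```python
class TapeEntry:
    """One recorded operator, e.g. matmul or softmax."""
    def __init__(self, op_name, inputs, output, backward_fn):
        self.op_name = op_name            # operator name, for bookkeeping
        self.inputs = inputs              # pointers to the input values
        self.output = output              # pointer to the output value
        self.backward_fn = backward_fn    # maps the output grad to input grads

tape = []                                 # the tape is just a flat list

def record(op_name, inputs, output, backward_fn):
    """Called by every operator during the forward pass."""
    tape.append(TapeEntry(op_name, inputs, output, backward_fn))

def backward(loss):
    """Play the tape in reverse; no topological sort is needed."""
    grads = {id(loss): 1.0}
    for entry in reversed(tape):
        output_grad = grads.get(id(entry.output), 0.0)
        for inp, g in zip(entry.inputs, entry.backward_fn(output_grad)):
            grads[id(inp)] = grads.get(id(inp), 0.0) + g
    return grads
```

Because the reverse walk visits every recorded entry, the list has to be reset (DyNet's `dy.renew_cg()`) at the start of each iteration.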
### PyTorch: the Tape as a Graph
The graph is composed of `Variable`s and `Function`s. During the forward execution, a `Variable` records its creator function, e.g. `h.creator = matmul`, and a `Function` records the creators of its inputs as its previous/dependent functions `prev_func`, e.g. `matmul.prev_func = matmul1`. At `loss.backward()`, a topological sort is performed over all reachable `prev_func`s, and the grad ops are then executed in the sorted order. Note that a `Function` might have more than one `prev_func`.
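
A minimal sketch of this bookkeeping follows. The classes are simplified and hypothetical (PyTorch's autograd engine is implemented in C++ and records richer metadata), but they show `creator`, `prev_func`, and the topological sort:

```python
class Variable:
    def __init__(self, value, creator=None):
        self.value = value
        self.creator = creator           # the Function that produced this Variable
        self.grad = 0.0

class Function:
    def __init__(self, name, inputs, grad_fn):
        self.name = name                 # e.g. "matmul"
        self.inputs = inputs             # input Variables
        self.grad_fn = grad_fn           # maps the output grad to input grads
        # prev_func: the creators of my inputs; there can be more than one
        self.prev_func = [v.creator for v in inputs if v.creator is not None]
        self.output = Variable(None, creator=self)

def topo_sort(root_func):
    """Post-order DFS over prev_func links, reversed, so dependents come first."""
    order, seen = [], set()
    def visit(fn):
        if fn is not None and id(fn) not in seen:
            seen.add(id(fn))
            for prev in fn.prev_func:
                visit(prev)
            order.append(fn)
    visit(root_func)
    return reversed(order)

def backward(loss):
    loss.grad = 1.0
    for fn in topo_sort(loss.creator):   # from the loss back towards the inputs
        for var, g in zip(fn.inputs, fn.grad_fn(fn.output.grad)):
            var.grad += g                # accumulate: a Variable may feed many Functions
```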
Chainer and Autograd use similar techniques to record the forward pass. For details, please refer to the appendix.
## Comparison: List vs. Graph
DyNet's list can be considered the result of a topological sort over PyTorch's graph. Put differently, the graph is the raw representation of the tape, which gives us the chance to *prune* the part of the graph that is irrelevant to the backward pass before the topological sort [[2]](https://openreview.net/pdf?id=BJJsrmfCZ). Consider the following example: PyTorch only runs the backward pass over `SmallNet`, while DyNet runs it over both `SmallNet` and `BigNet`:
```python
result = BigNet(data)
loss = SmallNet(data)
loss.backward()
```
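
The pruning argument can be illustrated with a self-contained toy graph; the node names below are hypothetical stand-ins for the two networks. Starting the backward traversal from `loss` only reaches the ancestors of `loss`, so nothing recorded for `BigNet` is ever visited:

```python
class Node:
    """A toy graph node standing in for a Variable/Function pair."""
    def __init__(self, name, parents=()):
        self.name, self.parents = name, parents

def visited_by_backward(node, seen=None):
    """Everything a graph-based backward pass would have to touch."""
    seen = set() if seen is None else seen
    for p in node.parents:
        if p.name not in seen:
            seen.add(p.name)
            visited_by_backward(p, seen)
    return seen

data = Node("data")
big_out = Node("BigNet_out", parents=(Node("BigNet_layer", parents=(data,)),))
loss = Node("SmallNet_loss", parents=(Node("SmallNet_layer", parents=(data,)),))

print(visited_by_backward(loss))   # {'SmallNet_layer', 'data'}: BigNet's nodes are pruned away
```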
## Lazy vs. Immediate Evaluation
Another difference between DyNet and PyTorch is that DyNet lazily evaluates the forward pass, whereas PyTorch executes it immediately. Consider the following example:
```python
for epoch in range(num_epochs):
    for in_words, out_label in training_data:
        dy.renew_cg()
        # ... build the loss expression symbolically: lookup, concat, matmul, softmax ...
        loss_val = loss_sym.value()
        loss_sym.backward()
```
The computation of `lookup`, `concat`, `matmul`, and `softmax` does not happen until the call to `loss_sym.value()`. This deferred execution is useful because it makes graph-level optimizations, e.g. kernel fusion, possible.
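
A minimal sketch of deferred evaluation follows; the `Expr` class is hypothetical, not DyNet's API. Constructing an expression only records a node in the graph, and the arithmetic runs when `.value()` is finally requested, at which point an engine can see, and potentially fuse, the whole graph:

```python
class Expr:
    """A symbolic expression node: building it computes nothing."""
    def __init__(self, op, args):
        self.op, self.args = op, args
        self._cached = None

    def value(self):
        if self._cached is None:         # evaluation happens only on demand
            vals = [a.value() if isinstance(a, Expr) else a for a in self.args]
            self._cached = self.op(*vals)
        return self._cached

def add(a, b): return Expr(lambda x, y: x + y, [a, b])
def mul(a, b): return Expr(lambda x, y: x * y, [a, b])

loss_sym = add(mul(2, 3), 4)   # only the graph is built here
print(loss_sym.value())        # 10: the whole graph is evaluated at this call
```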
PyTorch chooses immediate evaluation. It avoids ever materializing a "forward graph"/"tape" (no need to explicitly call `dy.renew_cg()` to reset the list), recording only what is necessary to differentiate the computation, i.e. `creator` and `prev_func`.