Commit f15504e

author
Yang Yang(Tony)
authored
Dynamic graph survey (#11019)
* Create dynamic_graph.md * Update dynamic_graph.md * Update dynamic_graph.md * Update dynamic_graph.md * follow comments, add graph * Update dynamic_graph.md * Update dynamic_graph.md
1 parent a6950e0 commit f15504e

File tree

1 file changed

+378
-0
lines changed

doc/survey/dynamic_graph.md

Lines changed: 378 additions & 0 deletions
@@ -0,0 +1,378 @@
# Automatic Differentiation with the Tape

## Automatic Differentiation

A key challenge in deep learning is to automatically derive the backward pass from the forward pass that researchers describe algorithmically. Such a derivation, which is a transformation of the forward pass program, was studied long before the recent rise of deep learning, in the field known as [automatic differentiation](https://arxiv.org/pdf/1502.05767.pdf).

## The Tape

Given the forward pass program (usually written in Python in practice), there are two strategies to derive the backward pass:

1. from the forward pass program itself, or
1. from the execution trace of the forward pass program, which is often known as the *tape*.

This article surveys systems that follow the latter strategy.

## Dynamic Network

When we train a deep learning model, the tape changes every iteration as the input data change, so we have to re-derive the backward pass every iteration. This is known as a *dynamic network*.

Deep learning systems that build on the idea of dynamic networks have gained popularity in recent years. This article surveys two representative systems: [PyTorch](https://pytorch.org/) and [DyNet](https://dynet.readthedocs.io/en/latest/).

## An Overview

Both frameworks record a ‘tape’ of the computation and interpret (or run-time compile) a transformation of the tape played back in reverse. This tape is a different kind of entity than the original program. [[link]](http://www.bcl.hamilton.ie/~barak/papers/toplas-reverse.pdf)

Consider the following feedforward model.

```python
# Pseudocode: Variable, randn, randint, matmul and softmax are framework-agnostic placeholders
x = Variable(randn(20, 1))
label = Variable(randint(1))
W_1, W_2 = Variable(randn(20, 20)), Variable(randn(10, 20))
h = matmul(W_1, x)
pred = matmul(W_2, h)  # the second layer consumes h, not x
loss = softmax(pred, label)
loss.backward()
```

### 1) DyNet uses a List to encode the Tape

During the forward execution, a list of operators (in this case `matmul`, `matmul` and `softmax`) is recorded in the tape, along with the information needed for the backward pass, such as pointers to the inputs and outputs. The tape is then played back in reverse order at `loss.backward()`.

<details>
<summary></summary>
digraph g {
  graph [
    rankdir = "LR"
  ];
  node [
    fontsize = "16"
    shape = "ellipse"
  ];
  edge [];
  "node0" [
    label = "<f0> type: matmul | <f1> input: W_1, x | <f2> output: h"
    shape = "record"
  ];
  "node1" [
    label = "<f0> type: matmul | <f1> input: W_2, h | <f2> output: pred"
    shape = "record"
  ];
  "node2" [
    label = "<f0> type: softmax | <f1> input: pred, label | <f2> output: loss"
    shape = "record"
  ];
  "node0":f0 -> "node1":f0 [];
  "node1":f0 -> "node2":f0 [];
}
</details>

![Alt text](https://g.gravizo.com/svg?digraph%20g%20{%20graph%20[%20rankdir%20=%20%22LR%22%20];%20node%20[%20fontsize%20=%20%2216%22%20shape%20=%20%22ellipse%22%20];%20edge%20[];%20%22node0%22%20[%20label%20=%20%22%3Cf0%3E%20type:%20matmul%20|%20%3Cf1%3E%20input:%20W_1,%20x%20|%20%3Cf2%3E%20output:%20h%22%20shape%20=%20%22record%22%20];%20%22node1%22%20[%20label%20=%20%22%3Cf0%3E%20type:%20matmul%20|%20%3Cf1%3E%20input:%20W_2,%20h%20|%20%3Cf2%3E%20output:%20pred%22%20shape%20=%20%22record%22%20];%20%22node2%22%20[%20label%20=%20%22%3Cf0%3E%20type:%20softmax%20|%20%3Cf1%3E%20input:%20pred,%20label%20|%20%3Cf2%3E%20output:%20loss%22%20shape%20=%20%22record%22%20];%20%22node0%22:f0%20-%3E%20%22node1%22:f0%20[%20id%20=%200%20];%20%22node1%22:f0%20-%3E%20%22node2%22:f0%20[%20id%20=%201%20];%20})
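
The list-based tape described above can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions, not DyNet's implementation: values are scalars, the only operator is a hypothetical `mul`, and the names `Tape`, `Var` and `mul` are invented for this example.

```python
class Tape:
    """A flat list of recorded operations, played back in reverse."""

    def __init__(self):
        self.entries = []  # one (op name, backward closure) pair per executed op

    def record(self, name, backward_fn):
        self.entries.append((name, backward_fn))

    def backward(self):
        # No topological sort is needed: the recording order is already
        # a valid execution order, so its reverse is valid for backward.
        for _, backward_fn in reversed(self.entries):
            backward_fn()


class Var:
    def __init__(self, value):
        self.value = value
        self.grad = 0.0


def mul(tape, a, b):
    """Hypothetical scalar op: computes a * b and records how to backpropagate."""
    out = Var(a.value * b.value)

    def backward_fn():
        a.grad += b.value * out.grad
        b.grad += a.value * out.grad

    tape.record("mul", backward_fn)
    return out


tape = Tape()
x, w1, w2 = Var(3.0), Var(2.0), Var(-1.0)
h = mul(tape, w1, x)          # h = w1 * x
pred = mul(tape, w2, h)       # pred = w2 * h
loss = mul(tape, pred, pred)  # loss = pred ** 2

loss.grad = 1.0               # seed the gradient of the loss
tape.backward()
```

After `tape.backward()`, `w1.grad`, `w2.grad` and `x.grad` hold the derivatives of `loss` with respect to each input.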

### 2) PyTorch uses a Node Graph to encode the Tape

The graph is composed of `Variable`s and `Function`s. During the forward execution, a `Variable` records its creator function, e.g. `h.creator = matmul`. A `Function` records the functions that produced its inputs, `prev_func`, by following each input's `creator`; in the graph below, `matmul1.prev_func = matmul0`. At `loss.backward()`, a topological sort is performed over all `prev_func`s, and the gradient ops are then executed in the sorted order.

<details>
<summary></summary>
digraph g {
  graph [
    rankdir = "LR"
  ];

  subgraph function {
    node [
      fontsize = "16"
      style = filled
      shape = "record"
    ];
    "matmul0" [ label = "<f0> type: matmul | prev_func: None" ];
    "matmul1" [ label = "<f0> type: matmul | prev_func: matmul" ];
    "softmax" [ label = "<f0> type: softmax | prev_func: matmul" ];
  }

  subgraph variable {
    node [
      fontsize = "16"
      shape = "Mrecord"
      style = filled
      fillcolor = white
    ];
    "x" [ label = "<f0> x | <f1> creator: None" ];
    "label" [ label = "<f0> label | <f1> creator: None" ];
    "W_1" [ label = "<f0> W_1 | <f1> creator: None" ];
    "W_2" [ label = "<f0> W_2 | <f1> creator: None" ];
    "h" [ label = "<f0> h | <f1> creator: None" ];
    "pred" [ label = "<f0> pred | <f1> creator: matmul" ];
    "loss" [ label = "<f0> loss | <f1> creator: softmax" ];
  }

  subgraph data_flow {
    "x":f0 -> "matmul0":f0;
    "W_1":f0 -> "matmul0":f0;
    "matmul0":f0 -> "h":f0;

    "h":f0 -> "matmul1":f0;
    "W_2":f0 -> "matmul1":f0;
    "matmul1":f0 -> "pred":f0;

    "pred":f0 -> "softmax":f0;
    "label":f0 -> "softmax":f0;
    "softmax":f0 -> "loss":f0;
  }

  subgraph prev_func {
    edge [color="red", arrowsize="0.6", penwidth="1", constraint=false];
    "matmul1":f1 -> "matmul0":f0;
    "softmax":f1 -> "matmul1":f0;
    label = "prev_func";
  }
}
</details>

![Alt text](https://g.gravizo.com/svg?digraph%20g%20{%20graph%20[%20rankdir%20=%20%22LR%22%20];%20subgraph%20function%20{%20node%20[%20fontsize%20=%20%2216%22%20style%20=%20filled%20shape%20=%20%22record%22%20];%20%22matmul0%22%20[%20label%20=%20%22%3Cf0%3E%20type:%20matmul%20|%20prev_func:%20None%22%20];%20%22matmul1%22%20[%20label%20=%20%22%3Cf0%3E%20type:%20matmul%20|%20prev_func:%20matmul%22%20];%20%22softmax%22%20[%20label%20=%20%22%3Cf0%3E%20type:%20softmax%20|%20prev_func:%20matmul%22%20];%20}%20subgraph%20variable%20{%20node%20[%20fontsize%20=%20%2216%22%20shape%20=%20%22Mrecord%22%20style%20=%20filled%20fillcolor%20=%20white%20];%20%22x%22%20[%20label%20=%20%22%3Cf0%3E%20x%20|%20%3Cf1%3E%20creator:%20None%22%20];%20%22label%22%20[%20label%20=%20%22%3Cf0%3E%20label%20|%20%3Cf1%3E%20creator:%20None%22%20];%20%22W_1%22%20[%20label%20=%20%22%3Cf0%3E%20W_1%20|%20%3Cf1%3E%20creator:%20None%22%20];%20%22W_2%22%20[%20label%20=%20%22%3Cf0%3E%20W_2%20|%20%3Cf1%3E%20creator:%20None%22%20];%20%22h%22%20[%20label%20=%20%22%3Cf0%3E%20h%20|%20%3Cf1%3E%20creator:%20None%22%20];%20%22pred%22%20[%20label%20=%20%22%3Cf0%3E%20pred%20|%20%3Cf1%3E%20creator:%20matmul%22%20];%20%22loss%22%20[%20label%20=%20%22%3Cf0%3E%20loss%20|%20%3Cf1%3E%20creator:%20softmax%22%20];%20}%20subgraph%20data_flow%20{%20%22x%22:f0%20-%3E%20%22matmul0%22:f0;%20%22W_1%22:f0%20-%3E%20%22matmul0%22:f0;%20%22matmul0%22:f0%20-%3E%20%22h%22:f0;%20%22h%22:f0%20-%3E%20%22matmul1%22:f0;%20%22W_2%22:f0%20-%3E%20%22matmul1%22:f0;%20%22matmul1%22:f0%20-%3E%20%22pred%22:f0;%20%22pred%22:f0%20-%3E%20%22softmax%22:f0;%20%22label%22:f0%20-%3E%20%22softmax%22:f0;%20%22softmax%22:f0%20-%3E%20%22loss%22:f0;%20}%20subgraph%20prev_func%20{%20edge%20[color=%22red%22,%20arrowsize=%220.6%22,%20penwidth=%221%22,%20constraint=false];%20%22matmul1%22:f1%20-%3E%20%22matmul0%22:f0;%20%22softmax%22:f1%20-%3E%20%22matmul1%22:f0;%20label%20=%20%22prev_func%22;%20}%20})
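
The node-graph encoding can be sketched in the same spirit. The following is a minimal illustration, assuming scalar values and a single hypothetical `Mul` function; the names `Variable`, `Mul`, `prev_funcs`, `toposort` and `backward` are chosen for this example and do not reproduce PyTorch's actual classes.

```python
class Variable:
    def __init__(self, value, creator=None):
        self.value = value
        self.grad = 0.0
        self.creator = creator  # Function that produced this Variable, or None for a leaf


class Mul:
    """Hypothetical scalar function node: output = a * b."""

    def __init__(self, name):
        self.name = name

    def __call__(self, a, b):
        self.inputs = (a, b)
        # Follow each input's creator to find the previous functions (leaves have none).
        self.prev_funcs = [v.creator for v in (a, b) if v.creator is not None]
        self.output = Variable(a.value * b.value, creator=self)
        return self.output

    def backward(self):
        a, b = self.inputs
        a.grad += b.value * self.output.grad
        b.grad += a.value * self.output.grad


def toposort(root):
    """Functions reachable from root, each listed after all of its prev_funcs."""
    order, seen = [], set()

    def visit(func):
        if id(func) in seen:
            return
        seen.add(id(func))
        for prev in func.prev_funcs:
            visit(prev)
        order.append(func)

    visit(root)
    return order


def backward(loss):
    loss.grad = 1.0
    if loss.creator is None:
        return []
    # Reverse topological order: a function runs after every function that consumes its output.
    order = list(reversed(toposort(loss.creator)))
    for func in order:
        func.backward()
    return [func.name for func in order]


x, w1, w2 = Variable(3.0), Variable(2.0), Variable(-1.0)
h = Mul("matmul0")(w1, x)
pred = Mul("matmul1")(w2, h)
loss = Mul("square")(pred, pred)
visited = backward(loss)
```

Nothing here keeps a global record of the computation: the only state is the chain of `creator` and `prev_funcs` links hanging off `loss`.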

Chainer and Autograd use similar techniques to record the forward pass. For details, please refer to the appendix.

## Design choices

### 1) DyNet's List vs PyTorch's Node Graph

What's good about List:
1. It avoids a topological sort. One only needs to traverse the list of operators in reverse and call the corresponding backward operator.
1. It promises efficient data-parallel implementations. One can count how many times each variable is used while the list is being constructed. During playback, one then knows when the computation of a variable's gradient has completed, which enables overlapping communication with computation.
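
The usage-counting idea in the second point can be sketched as follows. This is a hedged illustration, not code from DyNet: the tape is reduced to variable names, the weight `W` is assumed to be shared by two layers so that it is used twice, and `on_grad_ready` is a hypothetical hook where a real system might start communicating a finished gradient.

```python
from collections import Counter

# Tape entries in forward order: (output name, input names).
# W is deliberately used by two ops to show a use count greater than one.
tape = [
    ("h", ["W", "x"]),
    ("pred", ["W", "h"]),
    ("loss", ["pred", "label"]),
]

# Count how often each variable is consumed while the tape is built.
pending = Counter(name for _, inputs in tape for name in inputs)
use_counts = dict(pending)

ready_order = []  # variables whose gradient is final, in completion order


def on_grad_ready(name):
    # Hypothetical hook: a data-parallel system could launch an all-reduce here
    # while the remaining backward ops keep running.
    ready_order.append(name)


# Play the tape in reverse. Each backward op contributes to its inputs' gradients.
for output, inputs in reversed(tape):
    for name in inputs:
        pending[name] -= 1
        if pending[name] == 0:
            on_grad_ready(name)
```

In this sketch `W` is reported as ready only after both of its uses have been processed, whereas `pred` is ready after the very first backward op.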

What's good about Node Graph:
1. More flexibility. PyTorch users can mix and match independent graphs however they like, in whatever threads they like (without explicit synchronization). An added benefit of structuring graphs this way is that when a portion of the graph becomes dead, it is automatically freed. [[2]](https://openreview.net/pdf?id=BJJsrmfCZ) In the following example, PyTorch only runs backward on SmallNet, while DyNet runs backward on both BigNet and SmallNet.
```python
result = BigNet(data)
loss = SmallNet(data)
loss.backward()
```
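
A toy model of this difference, under the assumption that a list-based tape replays every recorded op while a node graph only follows links reachable from the loss. The `Op` class, the op names and the `reachable` helper are hypothetical and only track which ops would be visited.

```python
tape = []  # global recording order, as a list-based tape would keep it


class Op:
    def __init__(self, name, *prev_ops):
        self.name = name
        self.prev_ops = prev_ops  # links a node graph would keep
        tape.append(self)         # entry a list-based tape would keep


def reachable(op):
    """Names of the ops reachable from `op` through prev_ops links."""
    seen, stack = [], [op]
    while stack:
        cur = stack.pop()
        if cur.name not in seen:
            seen.append(cur.name)
            stack.extend(cur.prev_ops)
    return seen


# result = BigNet(data): three chained ops
big = Op("big3", Op("big2", Op("big1")))
# loss = SmallNet(data): a single op, independent of BigNet
small = Op("small1")

list_visits = [op.name for op in reversed(tape)]  # replay the whole tape
graph_visits = reachable(small)                   # follow links from the loss only
```

Here `list_visits` contains all four ops, while `graph_visits` contains only the SmallNet op.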

### 2) DyNet's Lazy evaluation vs PyTorch's Immediate evaluation

DyNet builds the list in a symbolic manner. Consider the following example:
```python
# W_p, b_p and E are model parameters, defined as in the DyNet example in the appendix
for epoch in range(num_epochs):
    for in_words, out_label in training_data:
        dy.renew_cg()
        W = dy.parameter(W_p)
        b = dy.parameter(b_p)
        score_sym = dy.softmax(W*dy.concatenate([E[in_words[0]],E[in_words[1]]])+b)
        loss_sym = dy.pickneglogsoftmax(score_sym, out_label)
        loss_val = loss_sym.value()
        loss_sym.backward()
```
The computation of `lookup`, `concat`, `matmul` and `softmax` does not happen until `loss_sym.value()` is called. This deferred execution is useful because it makes graph-level optimizations possible, e.g. kernel fusion.

PyTorch chooses immediate evaluation. It avoids ever materializing a "forward graph"/"tape" (so there is no counterpart to explicitly calling `dy.renew_cg()` to reset the list), recording only what is necessary to differentiate the computation, i.e. `creator` and `prev_func`.
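
The deferred style can be sketched with a small symbolic expression class. This is an illustration only, assuming scalar values; the `Lazy` class, its `evaluations` counter and the `const` helper are invented for this example and are not DyNet's API.

```python
class Lazy:
    """Symbolic expression: building it records structure, not results."""

    evaluations = 0  # how many expression nodes have actually been computed

    def __init__(self, fn, *args):
        self.fn, self.args = fn, args
        self._cache = None

    def __add__(self, other):
        return Lazy(lambda a, b: a + b, self, other)

    def __mul__(self, other):
        return Lazy(lambda a, b: a * b, self, other)

    def value(self):
        # Computation happens only here, on demand, and is cached afterwards.
        if self._cache is None:
            inputs = [arg.value() for arg in self.args]
            self._cache = self.fn(*inputs)
            Lazy.evaluations += 1
        return self._cache


def const(x):
    """Wrap a plain number as a leaf expression."""
    return Lazy(lambda: x)


w, x, b = const(2.0), const(3.0), const(1.0)
score = w * x + b                      # only builds the expression
evaluations_before = Lazy.evaluations  # nothing has been computed yet
result = score.value()                 # the forward pass runs now
evaluations_after = Lazy.evaluations
```

Because the whole expression exists before anything is computed, a lazy system has the opportunity to rewrite it first (for example, fusing the multiply and the add). An immediate-evaluation system would instead have computed `w * x` as soon as that line ran.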


## What can Fluid learn from them?

TBD

# Appendix

### Overview

| Framework | Has Tape | Core in C++ | First Release Date |
|-----------|----------|-------------|--------------------|
| Autograd  | No       | No          | Mar 5, 2015        |
| Chainer   | No       | No          | Jun 5, 2015        |
| PyTorch   | No       | Yes         | Aug 31, 2016       |
| DyNet     | Yes      | Yes         | Oct 12, 2016       |

### Source Code
#### Autograd
[Backward code](https://github.com/HIPS/autograd/blob/442205dfefe407beffb33550846434baa90c4de7/autograd/core.py#L8-L40). In the forward pass, a graph of `VJPNode`s is constructed.
```python
# User API (simplified excerpt)
def make_grad(fun, x):
    start_node = VJPNode.new_root()
    end_value, end_node = trace(start_node, fun, x)
    # g is the gradient w.r.t. the output, supplied by the caller
    def vjp(g):
        return backward_pass(g, end_node)
    return vjp, end_value

# trace the forward pass by creating VJPNodes
def trace(start_node, fun, x):
    with trace_stack.new_trace() as t:
        start_box = new_box(x, t, start_node)
        end_box = fun(start_box)
        return end_box._value, end_box._node

def backward_pass(g, end_node):
    outgrads = {end_node : (g, False)}
    for node in toposort(end_node):
        outgrad = outgrads.pop(node)
        ingrads = node.vjp(outgrad[0])
        for parent, ingrad in zip(node.parents, ingrads):
            outgrads[parent] = add_outgrads(outgrads.get(parent), ingrad)
    return outgrad[0]

# Every VJPNode corresponds to an op_grad
class VJPNode(Node):
    __slots__ = ['parents', 'vjp']
    def __init__(self, value, fun, args, kwargs, parent_argnums, parents):
        self.parents = parents
        vjpmaker = primitive_vjps[fun]
        self.vjp = vjpmaker(parent_argnums, value, args, kwargs)
```
#### Chainer
Example Code
```python
# (1) Function Set definition, creates FunctionNode
model = FunctionSet(
    l1=F.Linear(784, 100),
    l2=F.Linear(100, 100),
    l3=F.Linear(100, 10)).to_gpu()

# (2) Optimizer Setup
opt = optimizers.SGD()
opt.setup(model)

# (3) Forward computation
def forward(x, t):
    h1 = F.relu(model.l1(x))
    h2 = F.relu(model.l2(h1))
    y = model.l3(h2)
    return F.softmax_cross_entropy(y, t)

# (4) Training loop
for epoch in range(n_epoch):
    for i in range(0, N, b_size):
        x = Variable(to_gpu(...))
        t = Variable(to_gpu(...))
        opt.zero_grads()
        loss = forward(x, t)
        loss.backward()
        opt.update()
```
In `forward(x, t)`, a graph of [`VariableNode`](https://github.com/chainer/chainer/blob/master/chainer/variable.py#L110) and [`FunctionNode`](https://github.com/chainer/chainer/blob/a69103a4aa59d5b318f39b01dbcb858d465b89cf/chainer/function_node.py#L19) objects is constructed. Every output's `VariableNode.creator` points to the `FunctionNode` that produced it.
```python
class FunctionNode(object):
    ...
    def apply(self, inputs):
        # input_vars (the inputs wrapped as Variables) and requires_grad
        # are derived from `inputs` in code elided from this excerpt
        outputs = self.forward(inputs)
        ret = tuple([variable.Variable(y, requires_grad=requires_grad)
                     for y in outputs])
        # Topological ordering
        self.rank = max([x.rank for x in input_vars]) if input_vars else 0
        # Add backward edges
        for y in ret:
            y.creator_node = self
        self.inputs = tuple([x.node for x in input_vars])
        self.outputs = tuple([y.node for y in ret])

        return ret
```
`loss.backward()` calculates the accumulated gradient of all variables. The backward methods of the `FunctionNode`s are called in topological order, from the highest rank to the lowest.
```python
class VariableNode(object):
    ...
    def backward(self, retain_grad, loss_scale):
        if self.creator_node is None:
            return

        cand_funcs = []
        seen_set = set()
        grads = {}

        # Initialize error by 1, if this is a loss variable
        if self.data.size == 1 and self._grad_var is None:
            self.grad = numpy.ones_like(self.data)
        grads[self._node] = self._grad_var

        def add_cand(cand):
            if cand not in seen_set:
                # Negate the rank since heapq is a min-heap; len(seen_set) breaks ties
                heapq.heappush(cand_funcs, (-cand.rank, len(seen_set), cand))
                seen_set.add(cand)

        add_cand(self.creator_node)

        while cand_funcs:
            _, _, func = heapq.heappop(cand_funcs)
            gxs = func.backward_accumulate(func.inputs, func.outputs, func.outputs.grad)

            # pair each input with its gradient
            for x, gx in zip(func.inputs, gxs):
                if x in grads:
                    grads[x] += gx
                else:
                    grads[x] = gx

                if x.creator_node is not None:
                    add_cand(x.creator_node)
```
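
The rank-and-heap ordering used in the excerpt above can be tried in isolation. The sketch below is a simplified stand-in rather than Chainer code: `Func` and `backward_order` are hypothetical names, and the rank is assumed to be the length of the longest path from a leaf function.

```python
import heapq


class Func:
    def __init__(self, name, *input_funcs):
        self.name = name
        self.input_funcs = input_funcs
        # Assigned during the forward pass: one more than the highest input rank.
        self.rank = 1 + max((f.rank for f in input_funcs), default=-1)


def backward_order(last_func):
    """Visit functions from the highest rank to the lowest using a heap."""
    heap, seen, order = [], set(), []

    def add_cand(func):
        if func not in seen:
            # heapq is a min-heap, so push the negated rank;
            # len(seen) is unique per push and breaks ties between equal ranks.
            heapq.heappush(heap, (-func.rank, len(seen), func))
            seen.add(func)

    add_cand(last_func)
    while heap:
        _, _, func = heapq.heappop(heap)
        order.append(func.name)
        for prev in func.input_funcs:
            add_cand(prev)
    return order


# Diamond-shaped graph: a feeds b and c, which both feed d.
a = Func("a")
b = Func("b", a)
c = Func("c", a)
d = Func("d", b, c)
order = backward_order(d)
```

Because `a` has a lower rank than both `b` and `c`, it is popped only after both of its consumers have run, so its gradient is complete by the time it is processed. No separate sorting pass is needed at backward time.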

#### PyTorch
Example Code
```python
x = Variable(torch.ones(5, 5))
y = Variable(torch.ones(5, 5) * 4)
z = x ** 2 + x * 2 + x * y + y
z.backward(torch.ones(5, 5))
```
The trace is done by `Variable.creator` and `Function.previous_functions`.
```python
class Variable(object):
    def __init__(self, tensor, creator=None, requires_grad=True):
        if creator is None:
            creator = Leaf(self, requires_grad)
        self.data = tensor
        self.creator = creator
        self._grad = None

    def backward(self, gradient=None):
        if gradient is None:
            if self.data.numel() != 1:
                raise RuntimeError('backward should be called only on a scalar (i.e. 1-element tensor) or with gradient w.r.t. the variable')
            gradient = self.data.new(1).fill_(1)
        self._execution_engine.run_backward(self, gradient)

class Function(object):
    # ...
    def _do_forward(self, *input):
        unpacked_input = tuple(arg.data for arg in input)
        raw_output = self.forward(*unpacked_input)

        # mark output.creator = self for backward trace
        output = tuple(Variable(tensor, self) for tensor in raw_output)

        self.previous_functions = [(arg.creator, id(arg)) for arg in input]
        self.output_ids = {id(var): i for i, var in enumerate(output)}
        return output

    def _do_backward(self, grad_output):
        return self.backward(grad_output)
```
The [backward](https://github.com/pytorch/pytorch/blob/v0.1.1/torch/autograd/engine.py) is similar to Autograd.

#### DyNet
Example code
```python
model = dy.Model()
W_p = model.add_parameters((20, 100))
b_p = model.add_parameters(20)
E = model.add_lookup_parameters((20000, 50))
for epoch in range(num_epochs):
    for in_words, out_label in training_data:
        dy.renew_cg()  # init tape
        W = dy.parameter(W_p)
        b = dy.parameter(b_p)
        score_sym = dy.softmax(W*dy.concatenate([E[in_words[0]],E[in_words[1]]])+b)
        loss_sym = dy.pickneglogsoftmax(score_sym, out_label)
        loss_val = loss_sym.value()
        loss_sym.backward()
```
[forward](https://github.com/clab/dynet/blob/740a9626a13a2732544de142e256ad0d0a166658/dynet/exec.cc#L84-L158), [backward](https://github.com/clab/dynet/blob/740a9626a13a2732544de142e256ad0d0a166658/dynet/exec.cc#L166-L284). The trace is done by creating a tape of expressions in every iteration. Backward is done by traversing the tape in reverse order.
```c++
void SimpleExecutionEngine::backward(VariableIndex from_where, bool full) {
  ...
  for (int i = num_nodes - 1; i >= 0; --i) {
    // each node corresponds to an op
    node->backward(xs, node_fx, node_dEdfx, ai, node_dEdxai);
  }
  ...
}
```
