Commit d827c6e

Author: Yang Yang (Tony)
Dynamic Graph first prototype (#11415)
1 parent a77dfee commit d827c6e

File tree

13 files changed: +914, -3 lines


doc/survey/dynamic_graph.md

Lines changed: 1 addition & 1 deletion
@@ -171,7 +171,7 @@ Pytorch chooses immediate evaluation. It avoids ever materializing a "forward gr
## What can fluid learn from them?

-TBD
+Please refer to `paddle/contrib/dynamic/`.

# Appendix

paddle/contrib/CMakeLists.txt

Lines changed: 1 addition & 0 deletions
@@ -14,3 +14,4 @@
#

add_subdirectory(inference)
+add_subdirectory(dynamic)

paddle/contrib/dynamic/CMakeLists.txt

Lines changed: 25 additions & 0 deletions
@@ -0,0 +1,25 @@
# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

if(APPLE)
  set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Wno-error=pessimizing-move")
endif(APPLE)

cc_library(tape_variable SRCS variable.cc DEPS ${FLUID_CORE_MODULES})
cc_library(tape SRCS tape.cc DEPS ${FLUID_CORE_MODULES} ${GLOB_OP_LIB} tape_variable)

cc_test(test_tape
  SRCS test_tape.cc
  DEPS tape tape_variable)

paddle/contrib/dynamic/README.md

Lines changed: 246 additions & 0 deletions
@@ -0,0 +1,246 @@
# Dynamic Graph on Fluid

PaddlePaddle Fluid is targeting autodiff without a tape, which, however, is very challenging, and we are still a long way from there. DyNet and PyTorch provide a good design idea, the *tape*, that significantly eases the challenge. DyNet also provides a C++ API that is as convenient as Python but more efficient, and it integrates conveniently with industrial/production systems. This package, `tape`, combines the best of:

1. the tape from PyTorch and DyNet;
2. the C++ API and core from DyNet;
3. the rich set of operators from PaddlePaddle.

## Overview

We can implement a DyNet-like Tape (see this survey) by wrapping Paddle Fluid's `Operator`
and `Variable`.

The user API is straightforward since

1. it is imperative and uses the host language's control flow logic;
1. it avoids extra concepts such as `Scope` and `Executor`.

All of these benefits come at the cost of just adding one line, `reset_global_tape`,
at every iteration.

## Code Structure

In short, the `Tape` contains a vector of `OpHandle`s, and an `OpHandle` contains its
`type`, the pointers to its `Variable`s, and the necessary attributes.

```c++
class Variable {
 public:
  VariableHandle Grad();  // returns its gradient variable

 private:
  framework::VarDesc desc_;  // compile-time InferShape, necessary for lazy execution
  framework::Variable var_;  // run-time variable, holds data memory
};

using VariableHandle = shared_ptr<Variable>;

struct OpHandle {
  string type_;
  map<string, vector<VariableHandle>> inputs_;
  map<string, vector<VariableHandle>> outputs_;
  AttributeMap attrs_;
};

class Tape {
 public:
  void AddOp(OpHandle);  // add op
  void Forward();        // execute the tape_
  void Backward();       // execute the backward of the tape_

 private:
  vector<OpHandle> tape_;
};
```
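
To make the record-and-replay mechanism concrete, here is a minimal standalone sketch (illustrative only, not the actual `tape` implementation: `RecordedOp` and `MiniTape` are stand-ins for `OpHandle` and `Tape`, and the real `Backward` constructs a backward tape of gradient operators rather than storing closures):

```c++
#include <functional>
#include <iostream>
#include <string>
#include <vector>

// Each recorded entry knows how to run its forward and backward computation.
// The real OpHandle stores the op type, variable handles, and attributes instead.
struct RecordedOp {
  std::string type;
  std::function<void()> forward;
  std::function<void()> backward;
};

class MiniTape {
 public:
  void AddOp(RecordedOp op) { tape_.push_back(std::move(op)); }

  // Replay the recorded ops in insertion order.
  void Forward() {
    for (auto &op : tape_) op.forward();
  }

  // Walk the tape in reverse order to propagate gradients.
  void Backward() {
    for (auto it = tape_.rbegin(); it != tape_.rend(); ++it) it->backward();
  }

 private:
  std::vector<RecordedOp> tape_;
};

int main() {
  MiniTape tape;
  tape.AddOp({"mul", [] { std::cout << "mul forward\n"; },
              [] { std::cout << "mul_grad backward\n"; }});
  tape.AddOp({"relu", [] { std::cout << "relu forward\n"; },
              [] { std::cout << "relu_grad backward\n"; }});
  tape.Forward();   // mul forward, relu forward
  tape.Backward();  // relu_grad backward, mul_grad backward
  return 0;
}
```

The essential point is the same as in the real design: work is only captured as data when an op is added, and nothing runs until the tape is replayed.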

We use `Function` to represent layers. A `Function` takes care of parameter
initialization and calls `AddOp` on the tape when it is invoked.

```c++
class Linear {
 public:
  Linear(int in_dim, int out_dim, const std::string &act)
      : w_(new Variable("LinearWeight")),
        b_(new Variable("LinearBias")),
        act_(act) {
    Tape init_tape;

    std::string initializer = "fill_constant";
    framework::AttributeMap attrs;
    attrs["dtype"] = paddle::framework::proto::VarType::Type::VarType_Type_FP32;
    attrs["shape"] = std::vector<int>{in_dim, out_dim};
    attrs["value"] = 1.0f;
    init_tape.AddOp(initializer, {}, {{"Out", {w_}}}, attrs);

    attrs["dtype"] = paddle::framework::proto::VarType::Type::VarType_Type_FP32;
    attrs["shape"] = std::vector<int>{out_dim};
    attrs["value"] = 1.0f;
    init_tape.AddOp(initializer, {}, {{"Out", {b_}}}, attrs);

    init_tape.Forward();
  }

  VariableHandle operator()(VariableHandle input) {
    VariableHandle pre_bias(new Variable("linear"));
    get_global_tape().AddOp("mul",
                            {{"X", {input}}, {"Y", {w_}}},
                            {{"Out", {pre_bias}}},
                            {{"x_num_col_dims", 1}, {"y_num_col_dims", 1}});
    VariableHandle pre_act(new Variable("linear"));
    get_global_tape().AddOp("elementwise_add",
                            {{"X", {pre_bias}}, {"Y", {b_}}},
                            {{"Out", {pre_act}}},
                            {{"axis", 1}});
    VariableHandle post_act(new Variable("linear"));
    get_global_tape().AddOp(act_,
                            {{"X", {pre_act}}},
                            {{"Out", {post_act}}},
                            {});
    return post_act;
  }

  std::vector<VariableHandle> Params() { return {w_, b_}; }

 private:
  VariableHandle w_;
  VariableHandle b_;
  std::string act_;
};
```

## User API

```c++
// Model function
paddle::tape::Linear linear1(3, 3, "relu");  // init weight and bias
paddle::tape::Linear linear2(3, 3, "relu");  // init weight and bias
paddle::tape::Mean mean;

// Optimizer
paddle::tape::SGD sgd(0.001);

// Data feeder
paddle::tape::Fill data_feeder(...);
VariableHandle input(new paddle::tape::Variable("input"));

for (int i = 0; i < 2; ++i) {
  reset_global_tape();

  data_feeder(input);

  auto loss = mean(linear2(linear1(input)));  // compile-time InferShape & InferVarType
  LOG(INFO) << loss.value();                  // run forward up to loss

  // Run backward, store the gradient of w at w->Grad()
  get_global_tape().Backward(loss);

  // Update w
  sgd(linear1.Params());
  sgd(linear2.Params());
}
```

<details>
<summary></summary>
digraph G {

  subgraph cluster_0 {
    node [shape=record,style=filled];
    style=filled;
    color=lightgrey;
    linear1 [label="{type: mul | {input | {<before_mul1>X: before_mul1 |<weight1> Y: weight1}} | {output |<before_bias1> Out: before_bias1}}"];
    elementwise_add1 [label="{type: elementwise_add | {input | {<before_bias1>X: before_bias1 |<bias1> Y: bias1}} | {output |<before_act1> Out: before_act1}}"];
    relu1 [label="{type: relu | {input | {<before_act1>X: before_act1 }} | {output |<after_act1> Out: after_act1}}"];

    linear1 -> elementwise_add1 -> relu1;
    label = "forward tape";
  }

  linear1:before_mul1->before_mul1
  linear1:weight1->weight1
  linear1:before_bias1->before_bias1

  elementwise_add1:bias1->bias1
  elementwise_add1:before_bias1->before_bias1
  elementwise_add1:before_act1->before_act1

  relu1:before_act1->before_act1
  relu1:after_act1->after_act1

  subgraph cluster_1 {
    node [shape=record,style=filled];
    style=filled;
    color=lightgrey;
    linear1_grad [label="{type: mul_grad | {input | {<before_mul1>X: before_mul1 |<weight1> Y: weight1|<before_bias1_grad> Out_grad: before_bias1_grad}} | {output |{<before_mul1_grad>X_grad: before_mul1_grad |<weight1_grad> Y_grad: weight1_grad}}}"];

    elementwise_add1_grad [label="{type: elementwise_add_grad | {input | <before_act1_grad> Out_grad: before_act1_grad} | {output |{<before_bias1_grad>X_grad: before_bias1_grad |<bias1_grad> Y_grad: bias1_grad}}}"];

    relu1_grad [label="{type: relu_grad | {input |<after_act1_grad> Out_grad: after_act1_grad} | {output | {<before_act1_grad>X_grad: before_act1_grad }}}"];

    linear1_grad -> elementwise_add1_grad -> relu1_grad [dir=back];
    label = "backward tape";
  }

  relu1_grad:after_act1_grad->after_act1_grad
  relu1_grad:before_act1_grad->before_act1_grad

  elementwise_add1_grad:before_act1_grad->before_act1_grad
  elementwise_add1_grad:before_bias1_grad->before_bias1_grad
  elementwise_add1_grad:bias1_grad->bias1_grad

  linear1_grad:before_mul1->before_mul1
  linear1_grad:weight1->weight1
  linear1_grad:before_bias1_grad->before_bias1_grad
  linear1_grad:before_mul1_grad->before_mul1_grad
  linear1_grad:weight1_grad->weight1_grad

  subgraph cluster_2 {
    node [shape=record];
    label = "Linear1";
    weight1
    bias1
  }

  weight1 -> weight1_grad [ label="Grad()", style="dashed" ];
  bias1 -> bias1_grad [ label="Grad()", style="dashed" ];

}
</details>

![Image](https://github.com/tonyyang-svail/Paddle/blob/cpp_tap/paddle/contrib/dynamic/computation_graph.png)

## Code Reuse

We want to stay as close to Paddle Fluid as possible.

### Reuse All Operators

Since all Ops are registered in `OpInfoMap`, adding a new `Function` takes
about 10 lines of code, similar to exposing an operator to Python.
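
For example, here is a sketch of what such a `Function` could look like (a hypothetical `Mean` wrapper, written against the tape API used by the `Linear` example above; it is illustrative, not the package's actual code):

```c++
// Hypothetical Function that wraps the already-registered "mean" operator.
// It only records the op on the global tape; the existing kernel is reused as-is.
class Mean {
 public:
  VariableHandle operator()(VariableHandle input) {
    VariableHandle output(new Variable("mean"));
    get_global_tape().AddOp("mean",
                            {{"X", {input}}},     // reuse the operator's input name
                            {{"Out", {output}}},  // and its output name
                            {});                  // no attributes needed
    return output;
  }
};
```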

### Reuse Compile Time InferShape and InferVarType

Note that all the symbolic information is stored in `tape::Variable::desc_` instead
of `ProgramDesc.block.vars`, so we create a temporary `BlockDesc` to run `InferShape` and
`InferVarType` every time we `AddOp` to the tape.
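
The following standalone sketch models the idea of running shape inference at record time (illustrative only; it does not use the real `BlockDesc`/`OpDesc` classes, and `SymbolicVar`, `ShapeRules`, and `AddOpWithInferShape` are made-up names):

```c++
#include <cstdint>
#include <functional>
#include <iostream>
#include <map>
#include <string>
#include <vector>

// Each variable keeps a symbolic shape (playing the role of VarDesc in desc_).
struct SymbolicVar {
  std::string name;
  std::vector<int64_t> shape;
};

using InferShapeFn = std::function<std::vector<int64_t>(
    const std::vector<std::vector<int64_t>> &in_shapes)>;

// Per-op-type shape rules, standing in for the operators' compile-time InferShape.
std::map<std::string, InferShapeFn> &ShapeRules() {
  static std::map<std::string, InferShapeFn> rules = {
      // "mul" treated as a 2-D matmul: [m, k] x [k, n] -> [m, n].
      {"mul",
       [](const std::vector<std::vector<int64_t>> &in) {
         return std::vector<int64_t>{in[0][0], in[1][1]};
       }},
      // Elementwise activations keep the shape of their input.
      {"relu",
       [](const std::vector<std::vector<int64_t>> &in) { return in[0]; }},
  };
  return rules;
}

// Record-time inference: the output's symbolic shape is filled in when the op
// is added to the tape, so later ops can rely on it before anything executes.
void AddOpWithInferShape(const std::string &type,
                         const std::vector<SymbolicVar *> &inputs,
                         SymbolicVar *output) {
  std::vector<std::vector<int64_t>> in_shapes;
  for (auto *v : inputs) in_shapes.push_back(v->shape);
  output->shape = ShapeRules().at(type)(in_shapes);
  // ... then push the op onto the tape for lazy execution ...
}

int main() {
  SymbolicVar x{"x", {32, 100}}, w{"w", {100, 10}}, out{"out", {}};
  AddOpWithInferShape("mul", {&x, &w}, &out);
  std::cout << out.shape[0] << "x" << out.shape[1] << "\n";  // 32x10, known before any kernel runs
  return 0;
}
```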

### Reuse Operator::Run

We use smart pointers, instead of `Scope`, to manage memory, so we create a temporary
`Scope` for every `Operator::Run()`.

## Possible Features

### Release Memory on Backward

We can release memory aggressively: during backward, we can delete each `OpHandle` as soon
as its backward step has finished. Since every variable is managed by a smart pointer, its
memory is automatically released when its `ref_count` goes to 0.
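
A standalone sketch of the idea (illustrative only; `MiniOp` is a made-up stand-in for an `OpHandle` that keeps an intermediate activation alive for backward):

```c++
#include <iostream>
#include <memory>
#include <vector>

// Intermediate results are held by shared_ptr inside the recorded ops. Popping
// an op off the tape right after its backward step drops the last reference to
// its stored activation, so that memory is freed eagerly.
struct MiniOp {
  std::shared_ptr<std::vector<float>> activation;  // output kept for backward
  void Backward() { /* consume `activation` to compute gradients */ }
};

int main() {
  std::vector<MiniOp> tape;
  for (int i = 0; i < 3; ++i)
    tape.push_back({std::make_shared<std::vector<float>>(1024, 0.f)});

  // Walk backward in reverse order, releasing each op (and its activation) as we go.
  while (!tape.empty()) {
    tape.back().Backward();
    tape.pop_back();  // the activation's ref_count hits 0 here -> memory released
  }
  std::cout << "all intermediate activations freed during backward\n";
  return 0;
}
```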

### Kernel Fusion

Since a symbolic representation of the tape is constructed before the actual
execution, it would be possible to perform graph optimizations on it. One use case is kernel
fusion.
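
As a toy illustration of such a pass (standalone and illustrative only; `FuseAddRelu` and the fused op name are made up), one could scan the recorded op types and merge an `elementwise_add` that is immediately followed by a `relu`:

```c++
#include <iostream>
#include <string>
#include <vector>

// Because the ops are only symbolic entries before execution, the sequence can
// be rewritten before any kernel runs. Here "elementwise_add" followed by
// "relu" is collapsed into a single hypothetical fused op.
std::vector<std::string> FuseAddRelu(const std::vector<std::string> &ops) {
  std::vector<std::string> fused;
  for (size_t i = 0; i < ops.size(); ++i) {
    if (ops[i] == "elementwise_add" && i + 1 < ops.size() && ops[i + 1] == "relu") {
      fused.push_back("fused_add_relu");  // hypothetical fused kernel name
      ++i;                                // skip the relu we just absorbed
    } else {
      fused.push_back(ops[i]);
    }
  }
  return fused;
}

int main() {
  const std::vector<std::string> tape = {"mul", "elementwise_add", "relu"};
  for (const auto &op : FuseAddRelu(tape)) std::cout << op << "\n";  // mul, fused_add_relu
  return 0;
}
```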
