# Dynamic Graph on Fluid

PaddlePaddle Fluid aims at autodiff without a tape, which is very challenging and which we are still far from achieving. DyNet and PyTorch offer a good design idea, the *tape*, that significantly eases the challenge. DyNet also provides a C++ API that is as convenient as Python but more efficient, and that integrates easily with industrial/production systems. This package, `tape`, combines the best of

1. the tape from PyTorch and DyNet
2. the C++ API and core from DyNet
3. the rich set of operators from PaddlePaddle

## Overview

We can implement a DyNet-like tape (see this survey) by wrapping Paddle Fluid's `Operator`
and `Variable`.

The user API is straightforward since

1. it is imperative and uses the host language's control flow logic;
1. it avoids extra concepts such as `Scope` and `Executor`.

All of these benefits come at the cost of just adding one line, `reset_global_tape`,
at every iteration.

## Code Structure

In short, the `Tape` contains a vector of `OpHandle`s, and an `OpHandle` contains its
`type`, pointers to its `Variable`s, and the necessary attributes.

```c++
class Variable;
using VariableHandle = shared_ptr<Variable>;

class Variable {
 public:
  VariableHandle Grad();  // returns its gradient variable

 private:
  framework::VarDesc desc_;  // compile-time InferShape, necessary for lazy execution
  framework::Variable var_;  // run-time variable, holds the data memory
};

struct OpHandle {
  string type_;
  map<string, vector<VariableHandle>> inputs_;
  map<string, vector<VariableHandle>> outputs_;
  AttributeMap attrs_;
};

class Tape {
 public:
  void AddOp(OpHandle);  // record an op on the tape
  void Forward();        // execute the tape_
  void Backward();       // execute the backward pass of the tape_

 private:
  vector<OpHandle> tape_;
};
```
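The listing above only declares the interface. As an illustration of the intended semantics (a sketch, not the actual implementation), `Backward()` could record the gradient op of every forward `OpHandle` in reverse order onto a second tape and then execute it, which is exactly the forward-tape/backward-tape picture drawn later in this document. `CreateGradOp` below is a hypothetical stand-in for Fluid's `GradOpMaker` machinery:

```c++
// Hypothetical helper: build the gradient OpHandle for a recorded forward op.
// In Fluid this would go through the op's registered GradOpMaker.
OpHandle CreateGradOp(const OpHandle &forward_op);

// Sketch only: one possible realization of Tape::Backward() on top of the
// declarations above.
void Tape::Backward() {
  Forward();  // make sure every recorded forward op has been executed

  Tape backward_tape;
  // Visit the forward records in reverse and record their gradient ops.
  for (auto it = tape_.rbegin(); it != tape_.rend(); ++it) {
    backward_tape.AddOp(CreateGradOp(*it));
  }
  backward_tape.Forward();  // run the gradient ops; results land in Var->Grad()
}
```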
We use `Function` to represent layers. A `Function` takes care of parameter
initialization and calls `AddOp` on the tape when it is invoked.

```c++
class Linear {
 public:
  Linear(int in_dim, int out_dim, const std::string &act)
      : w_(new Variable("LinearWeight")),
        b_(new Variable("LinearBias")),
        act_(act) {
    Tape init_tape;

    std::string initializer = "fill_constant";
    framework::AttributeMap attrs;
    attrs["dtype"] = paddle::framework::proto::VarType::Type::VarType_Type_FP32;
    attrs["shape"] = std::vector<int>{in_dim, out_dim};
    attrs["value"] = 1.0f;
    init_tape.AddOp(initializer, {}, {{"Out", {w_}}}, attrs);

    attrs["dtype"] = paddle::framework::proto::VarType::Type::VarType_Type_FP32;
    attrs["shape"] = std::vector<int>{out_dim};
    attrs["value"] = 1.0f;
    init_tape.AddOp(initializer, {}, {{"Out", {b_}}}, attrs);

    init_tape.Forward();
  }

  VariableHandle operator()(VariableHandle input) {
    VariableHandle pre_bias(new Variable("linear"));
    get_global_tape().AddOp("mul",
                            {{"X", {input}}, {"Y", {w_}}},
                            {{"Out", {pre_bias}}},
                            {{"x_num_col_dims", 1}, {"y_num_col_dims", 1}});
    VariableHandle pre_act(new Variable("linear"));
    get_global_tape().AddOp("elementwise_add",
                            {{"X", {pre_bias}}, {"Y", {b_}}},
                            {{"Out", {pre_act}}},
                            {{"axis", 1}});
    VariableHandle post_act(new Variable("linear"));
    get_global_tape().AddOp(act_,
                            {{"X", {pre_act}}},
                            {{"Out", {post_act}}},
                            {});
    return post_act;
  }

  std::vector<VariableHandle> Params() { return {w_, b_}; }

 private:
  VariableHandle w_;
  VariableHandle b_;
  std::string act_;
};
```
## User API

```c++
// Model functions
paddle::tape::Linear linear1(3, 3, "relu");  // initializes weight and bias
paddle::tape::Linear linear2(3, 3, "relu");  // initializes weight and bias
paddle::tape::Mean mean;

// Optimizer
paddle::tape::SGD sgd(0.001);

// Data feeder
paddle::tape::Fill data_feeder(...);
VariableHandle input(new paddle::tape::Variable("input"));

for (int i = 0; i < 2; ++i) {
  reset_global_tape();

  data_feeder(input);

  auto loss = mean(linear2(linear1(input)));  // compile-time InferShape & InferVarType
  LOG(INFO) << loss->value();                 // runs the forward pass up to loss

  // Run backward; the gradient of w is stored at w->Grad()
  get_global_tape().Backward(loss);

  // Update the parameters
  sgd(linear1.Params());
  sgd(linear2.Params());
}
```
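`get_global_tape()` and `reset_global_tape()` are not defined anywhere in this document; a minimal sketch of how they might be provided, assuming a single process-wide tape (the names follow the calls above, everything else is illustrative):

```c++
#include <memory>

// Sketch only: a process-wide Tape that every Function records onto.
static std::shared_ptr<Tape> g_tape = std::make_shared<Tape>();

Tape &get_global_tape() { return *g_tape; }

// Called once per iteration; dropping the old tape releases the OpHandles
// (and, transitively, the intermediate Variables) recorded last iteration.
void reset_global_tape() { g_tape = std::make_shared<Tape>(); }
```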
<details>
<summary>Graphviz source of the forward and backward tapes for linear1</summary>
digraph G {

  subgraph cluster_0 {
    node [shape=record,style=filled];
    style=filled;
    color=lightgrey;
    linear1 [label="{type: mul | {input | {<before_mul1>X: before_mul1 |<weight1> Y: weight1}} | {output |<before_bias1> Out: before_bias1}}"];
    elementwise_add1 [label="{type: elementwise_add | {input | {<before_bias1>X: before_bias1 |<bias1> Y: bias1}} | {output |<before_act1> Out: before_act1}}"];
    relu1 [label="{type: relu | {input | {<before_act1>X: before_act1 }} | {output |<after_act1> Out: after_act1}}"];

    linear1 -> elementwise_add1->relu1;
    label = "forward tape";
  }

  linear1:before_mul1->before_mul1
  linear1:weight1->weight1
  linear1:before_bias1->before_bias1

  elementwise_add1:bias1->bias1
  elementwise_add1:before_bias1->before_bias1
  elementwise_add1:before_act1->before_act1

  relu1:before_act1->before_act1
  relu1:after_act1->after_act1

  subgraph cluster_1 {
    node [shape=record,style=filled];
    style=filled;
    color=lightgrey;
    linear1_grad [label="{type: mul_grad | {input | {<before_mul1>X: before_mul1 |<weight1> Y: weight1|<before_bias1_grad> Out_grad: before_bias1_grad}} | {output |{<before_mul1_grad>X_grad: before_mul1_grad |<weight1_grad> Y_grad: weight1_grad}}}"];

    elementwise_add1_grad [label="{type: elementwise_add_grad | {input | <before_act1_grad> Out_grad: before_act1_grad} | {output |{<before_bias1_grad>X_grad: before_bias1_grad |<bias1_grad> Y_grad: bias1_grad}}}"];

    relu1_grad [label="{type: relu_grad | {input |<after_act1_grad> Out_grad: after_act1_grad} | {output | {<before_act1_grad>X_grad: before_act1_grad }}}"];

    linear1_grad -> elementwise_add1_grad ->relu1_grad [dir=back];
    label = "backward tape";
  }

  relu1_grad:after_act1_grad->after_act1_grad
  relu1_grad:before_act1_grad->before_act1_grad

  elementwise_add1_grad:before_act1_grad->before_act1_grad
  elementwise_add1_grad:before_bias1_grad->before_bias1_grad
  elementwise_add1_grad:bias1_grad->bias1_grad

  linear1_grad:before_mul1->before_mul1
  linear1_grad:weight1->weight1
  linear1_grad:before_bias1_grad->before_bias1_grad
  linear1_grad:before_mul1_grad->before_mul1_grad
  linear1_grad:weight1_grad->weight1_grad

  subgraph cluster_2 {
    node [shape=record];
    label = "Linear1";
    weight1
    bias1
  }

  weight1 -> weight1_grad [ label="Grad()", style="dashed" ];
  bias1 -> bias1_grad [ label="Grad()", style="dashed"];

}
</details>

## Code Reuse

We want to stay as close to Paddle Fluid as possible.

### Reuse All Operators

Since all operators are registered in `OpInfoMap`, the effort of adding a new `Function`
is about 10 lines of code, similar to exposing an operator to Python.
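As a rough illustration of that claim, here is what the `Mean` used in the User API section might look like when written against the tape API sketched above (`mean` is an existing Fluid operator; the rest is an assumption, not the package's actual code):

```c++
// Sketch only: a Function that simply forwards to an operator already
// registered in OpInfoMap.
class Mean {
 public:
  VariableHandle operator()(VariableHandle input) {
    VariableHandle out(new Variable("mean"));
    get_global_tape().AddOp("mean",
                            {{"X", {input}}},
                            {{"Out", {out}}},
                            {});  // the mean op needs no attributes
    return out;
  }
};
```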
### Reuse Compile Time InferShape and InferVarType

Since all the symbolic information is stored in `tape::Variable::desc_` rather than in
`ProgramDesc.block.vars`, we create a temporary `BlockDesc` to run `InferShape` and
`InferVarType` every time we `AddOp` to the tape.
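A minimal sketch of that step, assuming `AddOp` receives the op type, inputs, outputs, and attributes as in the `Linear` example. `Variable::Name()` and `Variable::MutableDesc()` are hypothetical accessors, and the exact `framework::` signatures should be treated as assumptions:

```c++
// Sketch only: run Fluid's compile-time passes against a throwaway BlockDesc.
void CompileTimeInfer(const std::string &type,
                      const map<string, vector<VariableHandle>> &ins,
                      const map<string, vector<VariableHandle>> &outs,
                      const framework::AttributeMap &attrs) {
  framework::ProgramDesc program;
  framework::BlockDesc *block = program.MutableBlock(0);
  framework::OpDesc *op = block->AppendOp();
  op->SetType(type);

  // Register every involved variable's VarDesc in the temporary block and
  // tell the OpDesc which names play which parameter role.
  auto bind = [&](const map<string, vector<VariableHandle>> &m, bool is_input) {
    for (const auto &kv : m) {
      std::vector<std::string> names;
      for (const auto &v : kv.second) {
        *block->Var(v->Name()) = *v->MutableDesc();  // copy the VarDesc in
        names.push_back(v->Name());
      }
      if (is_input) {
        op->SetInput(kv.first, names);
      } else {
        op->SetOutput(kv.first, names);
      }
    }
  };
  bind(ins, /*is_input=*/true);
  bind(outs, /*is_input=*/false);

  op->SetAttrMap(attrs);
  op->InferVarType(block);  // fills in the outputs' variable types
  op->InferShape(*block);   // fills in the outputs' shapes

  // Finally, copy each inferred output VarDesc back into the corresponding
  // tape::Variable::desc_ (omitted here).
}
```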
### Reuse Operator::Run

We use smart pointers instead of a `Scope` to manage memory, so we create a temporary
`Scope` for every `Operator::Run()`.
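A sketch of what running one recorded `OpHandle` could look like under this scheme. `ExposeInScope()` and `ToNameMap()` are hypothetical helpers (the first would make the scope's variables alias the tape `Variable`s' memory, the second would map parameter names to variable names); `OpRegistry::CreateOp` and `OperatorBase::Run` are Fluid's:

```c++
// Sketch only: each execution gets its own short-lived Scope, so no global
// Scope has to outlive the iteration.
void RunOp(const OpHandle &op, const platform::Place &place) {
  framework::Scope scope;  // temporary, destroyed at the end of this function

  // Make the scope's variables point at the tape Variables' memory
  // (hypothetical helper; this aliasing is the detail this section refers to).
  ExposeInScope(&scope, op.inputs_);
  ExposeInScope(&scope, op.outputs_);

  // Create the real Fluid operator and run it; kernel selection and data
  // transforms are reused unchanged.
  auto runnable = framework::OpRegistry::CreateOp(
      op.type_, ToNameMap(op.inputs_), ToNameMap(op.outputs_), op.attrs_);
  runnable->Run(scope, place);
}  // scope dies here; results stay alive in the shared_ptr-managed Variables
```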
## Possible Features

### Release Memory on Backward

We can release memory aggressively: during the backward pass, we can delete each
`OpHandle` as soon as its backward op has finished. Since every variable is managed by a
smart pointer, its memory is released automatically once the `ref_count` drops to 0.
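A sketch of the idea, reusing the hypothetical `CreateGradOp()`/`RunOp()` helpers from the earlier sketches (inside `Tape` this would operate on `tape_` directly):

```c++
// Sketch only: release forward records as soon as their gradient op has run,
// so intermediate activations can be freed in the middle of the backward pass.
void BackwardAndRelease(std::vector<OpHandle> *tape, const platform::Place &place) {
  while (!tape->empty()) {
    RunOp(CreateGradOp(tape->back()), place);
    tape->pop_back();  // may drop the last references to intermediate Variables
  }
}
```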
### Kernel Fusion

Since a symbolic representation of the tape is constructed before the actual
execution, it would be possible to perform graph optimizations. One use case is kernel
fusion.