# PaddlePaddle Design Doc

## Ingredients

As the first step of our design, we list important concepts in deep
learning and try to figure out their relationships, as shown below:

```
Model = {topology, parameters}

Evaluator = {Model*, activations}
- forward
- test(cost, ...)

GradientMachine = {Evaluator*, gradients}
- backward

Optimizer = {GradientMachine*}
- train(cost, ...)
- update
- checkpoint
```

where the pair of curly braces `{` and `}` indicates *composition*, `*`
indicates a *reference*, and `-` marks a "class method".
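
For concreteness, these relationships could be pictured as the following Python-like pseudo-classes. This is only an illustrative sketch of the composition/reference structure above; the class, field, and method names mirror the diagram but are assumptions, not the actual PaddlePaddle implementation.

```python
# Illustrative pseudo-classes mirroring the diagram above; all names are
# assumptions for exposition, not the real implementation.
class Model(object):
    def __init__(self, topology, parameters):
        self.topology = topology      # composed
        self.parameters = parameters  # composed

class Evaluator(object):
    def __init__(self, model):
        self.model = model            # reference to a (possibly shared) Model
        self.activations = {}         # per-evaluator layer outputs
    def forward(self, input): ...
    def test(self, cost): ...

class GradientMachine(object):
    def __init__(self, evaluator):
        self.evaluator = evaluator    # reference to an Evaluator
        self.gradients = {}
    def backward(self, cost): ...

class Optimizer(object):
    def __init__(self, gradient_machine):
        self.gradient_machine = gradient_machine  # reference
    def train(self, cost): ...
    def update(self): ...
    def checkpoint(self): ...
```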


### Model

We used to think that parameters are part of the topology (or layers).
But that is not true, because multiple layers could share the same
parameter matrix. An example is a network that compares two text
segments in a semantic space:

```
          semantic
text A -> projection ---\
          layer A        \
                          cosine
                          similarity -> output
                          layer
          semantic       /
text B -> projection ---/
          layer B
```

In this network, the two semantic projection layers (A and B) share
the same parameter matrix.

For more information about our API that specifies topology and
parameter sharing, please refer to [TODO: API].


### Evaluator

Suppose that we have a trained ranking model; we should be able to use
it in our search engine. The search engine's Web server is a
concurrent program that serves many HTTP requests simultaneously. It
doesn't make sense for each of these threads to have its own copy of
the model, because that would duplicate topologies and parameters.
However, each thread should be able to record layer outputs, i.e.,
activations, computed from an input derived from the request. With an
*Evaluator* that saves activations, we can write the over-simplified
server program as:

```python
m = paddle.model.load("trained.model")

def handler(req):
    e = paddle.evaluator.create(m)
    e.forward(req)
    return e.activations(layer="output")  # returns the activations of layer "output"

http.handle("/", handler)
```

### GradientMachine

Similar to evaluation, training needs to compute gradients so as to
update model parameters. Because an [optimizer](#optimizer) might run
multiple simultaneous threads to update the same model, gradients
should be separated from the model. And because gradients are only
used in training, not in serving, they should be separate from the
Evaluator. Hence the `GradientMachine`.
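
As a rough illustration, a single training thread might combine these pieces as follows. This is a sketch under the assumption of a `paddle.gradient_machine.create` constructor analogous to `paddle.evaluator.create`; `minibatch` and `cost` are also assumed names, and the actual API may differ.

```python
# A minimal sketch; `paddle.gradient_machine.create`, `minibatch`, and `cost`
# are assumed names for illustration only.
m = paddle.model.load("initial.model")  # topology + parameters, shared by all threads
e = paddle.evaluator.create(m)          # per-thread activations
g = paddle.gradient_machine.create(e)   # per-thread gradients

e.forward(minibatch)  # compute activations for this thread's input
g.backward(cost)      # compute gradients; the shared parameters are not modified here
```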

### Optimizer

None of Model, Evaluator, or GradientMachine implements the training
loop, hence the Optimizer. We can define a concurrent optimizer that
runs multiple simultaneous threads to train a model -- just let each
thread have its own GradientMachine object.

Most models should be trainable with `paddle.optimizer.SGD` by calling
its `train` method. Many customizations of the SGD algorithm happen in
the update equation, e.g., momentum and the Adam SGD algorithm. We
make `train` call `update` to do an update, so that we can derive
`paddle.optimizer.Adam` from `paddle.optimizer.SGD` by overriding only
the `update` method.
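
The intended relationship could be sketched as below. The internals (how gradients are obtained via `compute_gradients`, what Adam stores) are assumptions for illustration only; the point is just that `train` drives the loop and delegates every parameter update to `update`.

```python
# Illustrative sketch only: `train` drives the loop and delegates each
# parameter update to `update`, so Adam overrides nothing but `update`.
# `compute_gradients` is an assumed helper, e.g., backed by a GradientMachine.
class SGD(object):
    def __init__(self, learning_rate=0.01):
        self.learning_rate = learning_rate

    def train(self, model, reader, cost):
        for minibatch, end_pass in reader({}):
            gradients = compute_gradients(model, minibatch, cost)
            self.update(model, gradients)
            if end_pass:
                self.checkpoint(model)

    def update(self, model, gradients):
        # vanilla SGD step
        for name, grad in gradients.items():
            model.parameters[name] -= self.learning_rate * grad

    def checkpoint(self, model):
        ...


class Adam(SGD):
    def update(self, model, gradients):
        # maintain per-parameter first/second moment estimates, then take a
        # rescaled step (details elided)
        ...
```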


## Programming Interface

A fictitious example of a PaddlePaddle program looks like the following:

```python
import paddle

def read(args):
    f = open_file(args["filename"])
    while True:
        mb = read_a_minibatch(f)
        end_pass = eof(f)
        if end_pass:
            f = open_file(args["filename"])  # rewind for reading again
        yield mb, end_pass

input = paddle.layer.data(...)
intermediate = paddle.layer.fc(input)
output = paddle.layer.softmax(intermediate)

model = paddle.model.create(output)

paddle.train(model, data_provider=read)
```

This shows some important parts of a program:

1. Define how to read (and augment) data by defining a function, in
   this example `read`, that yields a minibatch and a boolean flag
   `end_pass`.

1. Define the topology: `input`, `intermediate`, and `output` in this
   example.

1. Create parameters from the topology, thus forming the model, by
   calling `paddle.model.create`.

1. Train the model by calling `paddle.train`.


### Reader

Not all programming frameworks allow users to define I/O functions.
An example is Google MapReduce, which can only read from text,
SSTable, and RecordIO files. Hadoop MapReduce allows users to define
readers and writers by deriving from the base classes `Reader` and
`Writer`. The former approach is less flexible but also less
error-prone. We decided to give users the flexibility to define their
own readers.
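
The contract implied by the reader examples in this document is simply that a reader is a generator of `(minibatch, end_pass)` pairs. A hypothetical consumer-side sketch, not the actual trainer code, where `train_on_minibatch` is an assumed helper:

```python
# Illustrative only: how a trainer could drive one pass over the data using
# a user-defined reader; `train_on_minibatch` is an assumed helper.
def run_one_pass(reader, reader_args):
    for minibatch, end_pass in reader(reader_args):
        train_on_minibatch(minibatch)
        if end_pass:
            break  # the reader signalled the end of one pass over the data
```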


#### A Synthetic Data Reader

Sometimes we want to test a topology and/or a training algorithm using
synthetic data. We can do this by defining the reader as a synthesizer:

```python
def read(args):
    while True:
        x = sample_from_uniform(0.0, 1.0)
        y = sample_from_gauss(2 * x, sigma)
        yield (x, y), False  # no end-of-file, so no end-of-pass
```

#### A Reader for Online Learning

Readers can also read an infinite data stream, e.g., a log stream from
a search engine collected by Kafka:

```python
def read(args):
    log_stream = kafka.open_channel(args["kafka channel name"])
    while True:
        yield log_stream.read(), False  # no end-of-pass in online learning
```

### Topology

By default, layers don't have names. But if we want to refer to a
layer later, for example, when we serve the model and want the
activations/outputs of a layer, we should give it a name.

```python
input = paddle.layer.data(...)
intermediate = paddle.layer.fc(input, name="inter", ...)
output = paddle.layer.softmax(intermediate, name="output", ...)

m = paddle.model.create(output)
e = paddle.evaluator.create(m)
e.forward(read_an_input())          # compute activations of all layers.
print e.activations(layer="inter")  # retrieve the activations of layer "inter"
print e.activations(layer="output") # retrieve the activations of layer "output"
```

#### Sharing Parameters

In the [Model](#model) section above, we showed a network whose two
semantic projection layers share the same parameter matrix. To specify
such cases, we give "parameter names" to layers. If some layers have
the same parameter names, `paddle.model.create` creates a single
parameter matrix for these layers:

```python
text1 = paddle.layer.data(...)
semantic1 = paddle.layer.fc(text1, ..., parameter_name="semantic_projection")
text2 = paddle.layer.data(...)
semantic2 = paddle.layer.fc(text2, ..., parameter_name="semantic_projection")
out = paddle.layer.cosine(semantic1, semantic2)
```

We can also share parameter matrices between layers in different
models. To do this, we need an additional parameter that refers to a
model:

```python
model1_input = paddle.layer.data(...)
model1_output = paddle.layer.softmax(model1_input, ...,
                                     parameter_name="a_parameter_matrix")
model1 = paddle.model.create(model1_output)

# Another model
model2_semantic = paddle.layer.fc(text2, ...,
                                  parameter_name="a_parameter_matrix",
                                  parameter_model=model1)
```

### Training

The recommended way to train a model is to call `paddle.train`, which
simply calls the `train` method of `paddle.optimizer.Default`, a
global variable of type `paddle.optimizer.SGD`. Equivalently, we can do

```python
opt = paddle.optimizer.SGD(...)
opt.train(model, reader=read, ...)
```
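
For illustration, `paddle.train` could then be little more than a thin wrapper over this default optimizer. A sketch under the assumption that `paddle.optimizer.Default` is a module-level `SGD` instance; this is not the actual source:

```python
# Assumed implementation sketch, not the real paddle.train.
def train(model, data_provider, **kwargs):
    return paddle.optimizer.Default.train(model, reader=data_provider, **kwargs)
```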

#### Distributed Training

If a user wants to do distributed training on a cluster, s/he should
call `paddle.dist_train` and provide access tokens to the cluster as
parameters.

For example, if the user has a TLS certificate that grants access to a
Kubernetes cluster, s/he should be able to call

```python
paddle.dist_train(model,
                  reader=read,
                  optimizer=paddle.optimizer.SGD(...),
                  k8s_user="yi",
                  k8s_token="kube_cluster_tls.pem",
                  k8s_job="hello",
                  num_parameter_servers=15)
```

The pseudo code of `paddle.dist_train` is as follows:

```python
def dist_train(model, reader, optimizer, k8s_user, k8s_token, k8s_job, num_parameter_servers):
    if os.getenv("KUBERNETES_SERVICE_HOST") is None:
        # Running on the user's personal computer: act as the launcher.
        image_name = k8s_user + '/' + k8s_job
        docker_build(image_name)
        docker_push()
        kube_ctrl_start_job(image_name, k8s_user, k8s_token)
    else:
        # Running on the cluster: derive this container's role from its rank.
        rank = kube_list_containers_in_job_and_return_current_containers_rank()
        if rank == 0:
            master()
        elif rank <= num_parameter_servers:
            parameter_server()
        else:
            optimizer.train(model, reader=reader)
```

Please be aware that if a process is running on the Kubernetes
cluster, it will have some environment variables pre-defined, such as
`KUBERNETES_SERVICE_HOST`.

If `dist_train` doesn't see these environment variables, it knows
that it's running on the user's personal computer, and it should work
as a *launcher*. Otherwise, it knows that it's running on the cluster
and needs to figure out its role as either the master, a trainer, or a
parameter server.