Commit 0e6a11a

Merge pull request #1297 from wangkuiyi/design_doc_new_api
New API Design Doc
2 parents d6292cc + 6de262c commit 0e6a11a

doc/design/api.md

Lines changed: 277 additions & 0 deletions

# PaddlePaddle Design Doc

## Ingredients

As the first step of our design, we list important concepts in deep
learning and try to figure out their relationships, as shown below:

```
Model = {topology, parameters}

Evaluator = {Model*, activations}
- forward
- test(cost, ...)

GradientMachine = {Evaluator*, gradients}
- backward

Optimizer = {GradientMachine*}
- train(cost, ...)
- update
- checkpoint
```

where the pair of curly braces `{` and `}` indicates *composition*, `*`
indicates a *reference*, and `-` marks a "class method".

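To make this notation concrete, the following is a minimal, hypothetical Python sketch of the relationships; the class and attribute names are illustrative assumptions, not the real PaddlePaddle classes:

```python
# Hypothetical sketch only: composition appears as owned attributes,
# a reference (*) as an attribute pointing at an object owned elsewhere.
class Model(object):
    def __init__(self, topology, parameters):
        self.topology = topology          # composed
        self.parameters = parameters      # composed

class Evaluator(object):
    def __init__(self, model):
        self.model = model                # Model*, a reference
        self.activations = {}             # composed, per-evaluator state

    def forward(self, minibatch):
        pass                              # would fill self.activations

class GradientMachine(object):
    def __init__(self, evaluator):
        self.evaluator = evaluator        # Evaluator*, a reference
        self.gradients = {}               # composed, per-machine state

    def backward(self):
        pass                              # would fill self.gradients

class Optimizer(object):
    def __init__(self, gradient_machine):
        self.gradient_machine = gradient_machine  # GradientMachine*

    def train(self, cost, reader):
        pass                              # the training loop: forward, backward, update
```
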
### Model

We used to think that parameters are part of the topology (or layers).
But that is not true because multiple layers could share the same
parameter matrix. An example is a network that compares two text
segments in a semantic space:

```
          semantic
text A -> projection ---\
          layer A        \
                          cosine
                          similarity -> output
                          layer
          semantic       /
text B -> projection ---/
          layer B
```

In this network, the two semantic projection layers (A and B) share
the same parameter matrix.

For more information about our API that specifies topology and
parameter sharing, please refer to [TODO: API].

### Evaluator

Suppose that we have a trained ranking model; we should be able to
use it in our search engine. The search engine's Web server is a
concurrent program, so as to serve many HTTP requests simultaneously.
It doesn't make sense for each of these threads to have its own copy
of the model, because that would duplicate topologies and parameters.
However, each thread should be able to record layer outputs, i.e.,
activations, computed from an input derived from the request. With
an *Evaluator* that saves activations, we can write the over-simplified
server program as:

```python
m = paddle.model.load("trained.model")

def handler(req):
    e = paddle.evaluator.create(m)
    e.forward(req)
    return e.activation(layer="output")  # returns activations of layer "output"

http.handle("/", handler)
```

### GradientMachine

Similar to evaluation, training needs to compute gradients so as to
update model parameters. Because an [optimizer](#optimizer) might
run multiple simultaneous threads to update the same model, gradients
should be separated from the model. Because gradients are used only
in training, not in serving, they should also be separate from the
Evaluator. Hence the `GradientMachine`.

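This separation can be sketched as follows; it is an illustrative toy, not the real `GradientMachine` API: several trainer threads share one model object, while each thread owns its own gradient buffers.

```python
# Toy sketch: one shared Model, one GradientMachine (and gradient buffer) per thread.
import threading

class Model(object):
    def __init__(self, parameters):
        self.parameters = parameters      # shared by all trainer threads
        self.lock = threading.Lock()

class GradientMachine(object):
    def __init__(self, model):
        self.model = model                # reference to the shared model
        self.gradients = [0.0] * len(model.parameters)  # per-thread storage

    def backward(self, minibatch):
        # placeholder gradient computation
        self.gradients = [sum(minibatch) * 0.01 for _ in self.model.parameters]

def trainer(model, minibatch):
    gm = GradientMachine(model)           # each thread creates its own
    gm.backward(minibatch)
    with model.lock:                      # apply gradients to the shared parameters
        for i, g in enumerate(gm.gradients):
            model.parameters[i] -= 0.1 * g

model = Model([1.0, 2.0])
threads = [threading.Thread(target=trainer, args=(model, [0.5, 1.5])) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(model.parameters)
```
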
### Optimizer

None of Model, Evaluator, or GradientMachine implements the training
loop, hence Optimizer. We can define a concurrent optimizer that runs
multiple simultaneous threads to train a model -- just let each
thread have its own GradientMachine object.

Most models should be trainable using `paddle.optimizer.SGD` by
calling its `train` method. Many customizations to the SGD algorithm
happen in the update equation, e.g., momentum and the Adam SGD
algorithm. We make `train` call `update` to do an update, so that we
can derive `paddle.optimizer.Adam` from `paddle.optimizer.SGD` by
overriding only the `update` method.

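As an illustration of this inheritance scheme (a sketch only; the real `paddle.optimizer` classes will differ), an Adam optimizer can reuse SGD's `train` loop and replace just `update`:

```python
class SGD(object):
    def __init__(self, learning_rate=0.1):
        self.learning_rate = learning_rate

    def train(self, params, grad_fn, steps=100):
        # the training loop lives here and is shared by all subclasses
        for _ in range(steps):
            self.update(params, grad_fn(params))
        return params

    def update(self, params, grads):
        for i, g in enumerate(grads):
            params[i] -= self.learning_rate * g

class Adam(SGD):
    def __init__(self, learning_rate=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
        super(Adam, self).__init__(learning_rate)
        self.beta1, self.beta2, self.eps = beta1, beta2, eps
        self.m, self.v, self.t = None, None, 0

    def update(self, params, grads):
        if self.m is None:
            self.m, self.v = [0.0] * len(params), [0.0] * len(params)
        self.t += 1
        for i, g in enumerate(grads):
            self.m[i] = self.beta1 * self.m[i] + (1 - self.beta1) * g
            self.v[i] = self.beta2 * self.v[i] + (1 - self.beta2) * g * g
            m_hat = self.m[i] / (1 - self.beta1 ** self.t)
            v_hat = self.v[i] / (1 - self.beta2 ** self.t)
            params[i] -= self.learning_rate * m_hat / (v_hat ** 0.5 + self.eps)

# both minimize f(x) = x^2, differing only in how update() applies gradients
grad_fn = lambda p: [2 * p[0]]
print(SGD().train([5.0], grad_fn))
print(Adam().train([5.0], grad_fn))
```
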
## Programming Interface

A fictive example of a PaddlePaddle program looks like the following:

```python
import paddle

def read(args):
    f = open_file(args["filename"])
    mb = read_a_minibatch(f)
    end_pass = eof(f)
    if end_pass:
        f = open_file(args["filename"]) # rewind for reading again
    yield mb, end_pass

input = paddle.layer.data(...)
intermediate = paddle.layer.fc(input)
output = paddle.layer.softmax(intermediate)

model = paddle.model.create(output)

paddle.train(model, data_provider=read)
```

This shows some important parts of a program:

1. Define how to read (and augment) data by defining a function, in
   this example `read`, that yields a minibatch and a boolean flag
   `end_pass`.

1. Define the topology: `input`, `intermediate`, and `output` in this
   example.

1. Create parameters from the topology, thus forming the model, by
   calling `paddle.model.create`.

1. Train the model by calling `paddle.train`.

### Reader

Not all programming frameworks allow users to define I/O functions.
An example is Google MapReduce, which can only read from text,
SSTable, and RecordIO files. Hadoop MapReduce allows users to define
readers and writers by deriving from base classes `Reader` and
`Writer`. The former is less flexible but also less error-prone. We
decided to provide users the flexibility to define their own readers.

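For example, a user-defined reader that follows the `(minibatch, end_pass)` convention of the program above could iterate over an in-memory dataset. The `make_reader` helper and its arguments are illustrative assumptions, not part of the proposed API:

```python
def make_reader(samples, batch_size):
    # Hypothetical user-defined reader: yields (minibatch, end_pass) pairs
    # over an in-memory list of samples, restarting after each pass.
    def read(args=None):
        while True:
            for start in range(0, len(samples), batch_size):
                minibatch = samples[start:start + batch_size]
                end_pass = start + batch_size >= len(samples)
                yield minibatch, end_pass
    return read

read = make_reader([(x, 2 * x) for x in range(10)], batch_size=4)
```
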
#### A Synthetic Data Reader

Sometimes we want to test a topology and/or a training algorithm using
synthetic data. We can do this by defining the reader as a synthesizer:

```python
def read(args):
    x = sample_from_uniform(0.0, 1.0)
    y = sample_from_gauss(2 * x, sigma)
    yield (x, y), False # no end-of-file so no end-of-pass
```

#### A Reader for Online Learning

Readers can also read an infinite data stream, e.g., a log stream from
a search engine, collected by Kafka:

```python
def read(args):
    log_stream = kafka.open_channel(args["kafka channel name"])
    yield log_stream.read(), False # no end-of-pass in online learning
```

### Topology

By default, layers don't have names. But if we want to refer to a
layer at some later time, for example, when we serve using the model
and want the activations/outputs of a layer, we should give it a name.

```python
input = paddle.layer.data(...)
intermediate = paddle.layer.fc(input, name="inter", ...)
output = paddle.layer.softmax(intermediate, name="output", ...)

m = paddle.model.create(output)
e = paddle.evaluator.create(m)
e.forward(read_an_input()) # compute activations of all layers.
print e.activations(layer="inter")  # retrieve the activations of layer "inter"
print e.activations(layer="output") # retrieve the activations of layer "output"
```

#### Sharing Parameters

In the [section above](#model), we showed a network whose two layers
share the same parameter matrix. To specify such cases, we give
"parameter names" to layers. If some layers have the same parameter
names, `paddle.model.create` creates a single parameter matrix for
these layers:

```python
text1 = paddle.layer.data(...)
semantic1 = paddle.layer.fc(text1, ..., parameter_name="semantic_projection")
text2 = paddle.layer.data(...)
semantic2 = paddle.layer.fc(text2, ..., parameter_name="semantic_projection")
out = paddle.layer.cosine(semantic1, semantic2)
```

We can also share parameter matrices between layers in different
models. To do this, we need an additional parameter that refers to a
model:

```python
model1_input = paddle.layer.data(...)
model1_output = paddle.layer.softmax(model1_input, ...,
                                     parameter_name="a_parameter_matrix")
model1 = paddle.model.create(model1_output)

# Another model
model2_semantic = paddle.layer.fc(text2, ...,
                                  parameter_name="a_parameter_matrix",
                                  parameter_model=model1)
```

### Training

The recommended way to train a model is to call `paddle.train`,
which simply delegates to `paddle.optimizer.Default`, a global
variable of type `paddle.optimizer.SGD`. Equivalently, we can do

```python
opt = paddle.optimizer.SGD(...)
opt.train(model, reader=read, ...)
```

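Under this design, `paddle.train` could be as thin as the following sketch; the `SGD` placeholder class and the keyword handling are assumptions for illustration, not the actual implementation:

```python
# Hypothetical sketch of paddle.train delegating to a module-level default optimizer.
class SGD(object):
    def train(self, model, reader, **kwargs):
        pass  # the real training loop would go here

Default = SGD()  # global default optimizer instance

def train(model, data_provider, **kwargs):
    # paddle.train(model, data_provider=read) is then equivalent to
    # paddle.optimizer.Default.train(model, reader=read)
    return Default.train(model, reader=data_provider, **kwargs)
```
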
#### Distributed Training

If a user wants to do distributed training on a cluster, s/he should
call `paddle.dist_train` and provide access tokens to the cluster as
a parameter.

For example, if the user has a TLS certificate that allows access to
a Kubernetes cluster, s/he should be able to call

```python
paddle.dist_train(model,
                  reader=read,
                  optimizer=paddle.optimizer.SGD(...),
                  k8s_user="yi",
                  k8s_token="kube_cluster_tls.pem",
                  k8s_job="hello",
                  num_parameter_servers=15)
```

The pseudo code of `paddle.dist_train` is as follows:

```python
def dist_train():
    if os.getenv("KUBERNETES_SERVICE_HOST") is None:
        image_name = k8s_user + '/' + k8s_job
        docker_build(image_name)
        docker_push()
        kube_ctrl_start_job(image_name, k8s_user, k8s_token)
    else:
        rank = kube_list_containers_in_job_and_return_current_containers_rank()
        if rank == 0:
            master()
        elif rank < 15:
            parameter_server()
        else:
            optimizer.train(model, reader=read)
```

Please be aware that if a process is running on the Kubernetes
cluster, it will have some environment variables pre-defined.

If `dist_train` doesn't see these environment variables, it knows
that it's running on the user's personal computer, and it should work
as a *launcher*. Otherwise, it knows that it's running on the cluster
and needs to figure out its role as either the master, a trainer, or a
parameter server.
