
Commit f18016b

Resolve conflicts.

Merge of 2 parents: 0358fd0 + 7081f21


56 files changed: +2167 -239 lines

doc/api/v2/fluid/layers.rst

Lines changed: 10 additions & 0 deletions
@@ -18,6 +18,11 @@ dynamic_lstm
 .. autofunction:: paddle.v2.fluid.layers.dynamic_lstm
     :noindex:

+dynamic_gru
+-----------
+.. autofunction:: paddle.v2.fluid.layers.dynamic_gru
+    :noindex:
+
 data
 ----
 .. autofunction:: paddle.v2.fluid.layers.data

@@ -500,6 +505,11 @@ swish
 .. autofunction:: paddle.v2.fluid.layers.swish
     :noindex:

+im2sequence
+-----------
+.. autofunction:: paddle.v2.fluid.layers.im2sequence
+    :noindex:
+
 edit_distance
 ---------------
 .. autofunction:: paddle.v2.fluid.layers.edit_distance_error

doc/design/csp.md

Lines changed: 96 additions & 0 deletions
@@ -0,0 +1,96 @@
# Design Doc: CSP in PaddlePaddle Fluid

## Motivation

Concurrent programming is important for deep learning. A few example applications are:

1. The main thread keeps reading the next mini-batch while another thread uses the GPU for computing.
2. The main thread performs the computation while another thread uploads the local gradients from each trainer to the parameter server.

Most DL systems, including TensorFlow, Caffe2, and MXNet, can asynchronously execute operators in a graph. However, Fluid doesn't have the concept of a graph at all, because the design goal of Fluid is to be a programming language.

## Concurrent Programming Models

There are many concurrent programming models, implemented in various forms:

| concurrent programming model | implementation |
|-----|-----|
| mutex | types and functions in standard libraries |
| semaphore | types and functions in standard libraries |
| communicating sequential processes (CSP) | Go programming language |
| actor model | Erlang programming language |
| message passing | MPI |
| bulk synchronous parallel (BSP) | Pregel distributed programming framework |

Since Fluid was designed to be a programming language, we would like to implement CSP in Fluid.

### CSP vs. the Actor Model

A well-known implementation of the Actor Model is the Erlang programming language. In the Actor Model, *processes* can send messages to other processes and receive messages from them, given the process IDs. We can find the same three ingredients in MPI too: processes with IDs, send, and recv. Indeed, we could rewrite Erlang programs in Python + MPI with possibly fewer lines of code. Our concern with the Actor Model is that it doesn't seem reasonable to implement process management in a programming language's runtime library; instead, managing processes should be the operating system's responsibility, with libraries like MPI handling send/recv.

## CSP in Fluid

Fluid has two fundamental control flows: *if-else* and *while*. If we are to implement CSP, we need the following:

1. a new data type, *channel*, and the operators *send* and *recv*,
1. *goroutines* or threads, and
1. a new control flow: *select*.

We also need Python wrappers for the above components.

The type *channel* is conceptually a blocking queue. In Go, it is implemented as a [blocking circular queue](https://github.com/golang/go/blob/68ce117cf17b8debf5754bfd476345779b5b6616/src/runtime/chan.go#L31-L50), which supports send and recv.
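
To make the blocking-queue analogy concrete, here is a minimal conceptual sketch of a channel in plain Python, built on the standard library's `queue.Queue`. It is illustration only, not Fluid's or Go's actual implementation:

```python
import queue

class Channel:
    """A conceptual channel: a bounded blocking queue with send/recv."""

    def __init__(self, capacity=0):
        # capacity == 0 approximates an unbuffered channel by allowing at
        # most one in-flight element; Go's rendezvous semantics are stricter,
        # but this is close enough for illustration.
        self._q = queue.Queue(maxsize=max(capacity, 1))

    def send(self, value):
        self._q.put(value)    # blocks while the buffer is full

    def recv(self):
        return self._q.get()  # blocks while the buffer is empty
```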

The `select` operation has been in OS kernels long before the Go language. All Unix kernels implement the system calls *poll* and *select*. They monitor multiple file descriptors to see if I/O is possible on any of them, which takes O(N) time. Since Linux 2.6, a new system call, *epoll*, can do the same in O(1) time. BSD systems have a similar system call, *kqueue*. Go's Linux implementation uses epoll.

It might be a good idea to implement Fluid's select using epoll too. In this design doc, we start from the O(N) way, so we can focus on the Python binding and the syntax.
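
As a sketch of the O(N) approach mentioned above, the following plain-Python helper polls a list of channels round-robin until one is ready. The `try_recv` method is a hypothetical non-blocking probe, not part of any existing Fluid API:

```python
import time

def select(channels, timeout=None):
    """Return (channel, value) for the first channel that has data ready.

    Scans all N channels each round, which is the O(N) strategy this doc
    starts from (as opposed to an epoll-style O(1) wakeup).
    """
    deadline = None if timeout is None else time.time() + timeout
    while True:
        for ch in channels:
            ready, value = ch.try_recv()  # hypothetical non-blocking recv
            if ready:
                return ch, value
        if deadline is not None and time.time() >= deadline:
            raise TimeoutError("select timed out")
        time.sleep(0.001)  # yield briefly to avoid a busy spin
```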

### Type Channel

Fluid supports many data types:

1. Tensor,
1. Row-sparse Tensor,
1. LoD Tensor,
1. Tensor array, etc.

Each data type is registered in [`framework.proto`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/framework.proto#L117-L127) as an enum value. To add the new type *channel*, we need to add a new enum value.

To expose a C++ type to Python, we need to edit the [`pybind.cc`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/pybind/pybind.cc) file. [Here](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/pybind/pybind.cc#L120-L164) is an example of how we expose the C++ class LoDTensor.

## Syntax Design

### Create Channel

In Go, we create a channel by specifying the element type and buffer size:

```go
ch := make(chan int)       // a channel without buffer
ch1 := make(chan int, 100) // a channel that can buffer 100 ints.
```

In Fluid, we should be able to do the same:

```python
ch = fluid.make_chan(dtype=INT)
ch1 = fluid.make_chan(dtype=INT, capacity=100)  # keyword name for the buffer size is a placeholder
```

In addition to that, we want channels that can hold more complex element types, e.g., Tensors of float16:

```python
ch = fluid.make_chan(dtype=Tensor, etype=float16)
```

or Tensors of Tensors of float16, etc.

The point here is that we need a consistent way to compose types, like in C++ we can have `Tensor<Tensor<...<float16>...> >`.
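
Purely as a hypothetical illustration of such composition (not a syntax proposed by this doc), element types could be expressed as nested descriptors on the Python side:

```python
class DType:
    """A hypothetical composable element-type descriptor."""

    def __init__(self, name, elem=None):
        self.name = name
        self.elem = elem  # nested element type, if any

    def __repr__(self):
        return self.name if self.elem is None else f"{self.name}<{self.elem!r}>"

float16 = DType("float16")

def Tensor(elem):
    return DType("Tensor", elem)

# Tensor<Tensor<float16>> expressed as a nested descriptor:
print(Tensor(Tensor(float16)))  # prints: Tensor<Tensor<float16>>
```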

### Send and Recv

### Select

## Example Programs

### 1. RPC between Trainers and Parameter Servers

### 2. Concurrent Minibatch Loading

doc/design/dist_refactor/parameter_server.md

Lines changed: 20 additions & 20 deletions
@@ -9,16 +9,16 @@ different purposes.
 
 ## Background
 
-The previous implementations of the parameter server does not run a
+The previous implementations of the parameter server do not run a
 fluid sub-program. Parameter initialization, optimizer computation, network
 communication and checkpointing are implemented twice on both the
-trainer and the parameter server.
+trainer as well as the parameter server.
 
-It would be great if we can write code once and use them on both the
-trainer and the parameter server: reduces code duplication and
-improves extensibility. Given that after the current refactor, we are
-representing everything as a computing graph on the
-trainer. Representing everything as a computing graph on the parameter
+It would be great if we can write code once and use them on both: the
+trainer and the parameter server, since this reduces code duplication and
+improves extensibility. Given that after the current refactoring, we are
+representing everything as a computation graph on the
+trainer. Representing everything as a computation graph on the parameter
 server becomes a natural extension.
 
 ## Design

@@ -30,9 +30,9 @@ into sub-programs to be scheduled on different nodes with the following
 steps:
 
 1. OP placement: the OPs will be placed on different nodes according
-   to heuristic that minimizes estimated total computation
+   to a heuristic that minimizes the estimated total computation
    time. Currently we will use a simple heuristic that puts parameter
-   varable on parameter server workers and everything else on trainer
+   variable on parameter server workers and everything else on trainer
    workers.
 1. Add communication OPs to enable the communication between nodes.
 
@@ -47,22 +47,22 @@ After converting:
 
 <img src="src/dist-graph.png" width="700"/>
 
-1. The parameter variable W and it's optimizer program are placed on the parameter server.
+1. The parameter variable W and its optimizer program are placed on the parameter server.
 1. Operators are added to the program.
    - *Send* sends data to the connected *Recv* operator. The
      scheduler on the receive node will only schedule *Recv* operator
     to run when the *Send* operator has ran (the *Send* OP will mark
     the *Recv* OP runnable automatically).
-   - *Enueue* enqueues the input variable, it can block until space
+   - *Enqueue* enqueues the input variable, it can block until space
     become available in the queue.
    - *Dequeue* outputs configurable numbers of tensors from the
-     queue. It will block until the queue have the required number of
+     queue. It will block until the queue has the required number of
     tensors.
 
 
 ### Benefits
 
-- Model parallelism become easier to implement: it's an extension to
+- Model parallelism becomes easier to implement: it is an extension to
   the trainer - parameter server approach. We can have several "Transpilers"
   to achieve different goals.
 - User-defined optimizer is easier to add - user can now express it as

@@ -72,22 +72,22 @@ After converting:
 
 ### Challenges
 
-- It's important to balance the parameter shards of on multiple
-  parameter server. If a single parameter is very big (some
+- It is important to balance the parameter shards on multiple
+  parameter servers. If a single parameter is very big (for example: some
   word-embedding, fully connected, softmax layer), we need to
   automatically partition the single parameter onto different
   parameter servers when possible (only element-wise optimizer depends
   on the parameter variable).
-- In the "Aync SGD" figure, the "W" variable on the parameter server
-  could be read and wrote concurrently. See
+- In the "Async SGD" figure, the "W" variable on the parameter server
+  could be read and written concurrently. See
   [here](https://github.com/PaddlePaddle/Paddle/pull/6394) for more
-  details about concurrent program in fluid.
+  details about concurrent program in Fluid.
 
 ### Discussion
 
 - Can the Enqueue OP be implemented under our current tensor design
-  (puts the input tensor into the queue tensor)?
-- *Dequeue* OP will have variable numbers of output (depends on the
+  (put the input tensor into the queue tensor)?
+- *Dequeue* OP will have variable numbers of output (depending on the
   `min_count` attribute), does our current design support it? (similar
   question for the *Add* OP)
 

doc/design/ops/sequence_decoder.md

Lines changed: 2 additions & 2 deletions
@@ -22,7 +22,7 @@ The current `LoDTensor` is designed to store levels of variable-length sequences
 The integers in each level represent the begin and end (not inclusive) offset of a sequence **in the underlying tensor**,
 let's call this format the **absolute-offset LoD** for clarity.
 
-The relative-offset LoD can retrieve any sequence very quickly but fails to represent empty sequences, for example, a two-level LoD is as follows
+The absolute-offset LoD can retrieve any sequence very quickly but fails to represent empty sequences, for example, a two-level LoD is as follows
 ```python
 [[0, 3, 9]
  [0, 2, 3, 3, 3, 9]]

@@ -119,7 +119,7 @@ def generate():
         encoder_ctx_expanded = pd.lod_expand(encoder_ctx, target_word)
         decoder_input = pd.fc(
             act=pd.activation.Linear(),
-            input=[target_word, encoder_ctx],
+            input=[target_word, encoder_ctx_expanded],
             size=3 * decoder_dim)
         gru_out, cur_mem = pd.gru_step(
             decoder_input, mem=decoder_mem, size=decoder_dim)

doc/getstarted/build_and_install/docker_install_cn.rst

Lines changed: 3 additions & 3 deletions
@@ -25,14 +25,14 @@
 
 .. code-block:: bash
 
-    docker pull docker.paddlepaddle.org/paddle
+    docker pull docker.paddlepaddlehub.com/paddle
 
 Download the GPU version (cuda8.0_cudnn5_avx_mkl) of the Docker image:
 
 .. code-block:: bash
 
     docker pull paddlepaddle/paddle:latest-gpu
-    docker pull docker.paddlepaddle.org/paddle:latest-gpu
+    docker pull docker.paddlepaddlehub.com/paddle:latest-gpu
 
 Choose Docker images built with different BLAS libraries to download:
 
@@ -49,7 +49,7 @@
 
     docker pull paddlepaddle/paddle:[tag]
     # e.g.:
-    docker pull docker.paddlepaddle.org/paddle:0.10.0-gpu
+    docker pull docker.paddlepaddlehub.com/paddle:0.11.0-gpu
 
 .. _docker_run:

doc/getstarted/build_and_install/docker_install_en.rst

Lines changed: 3 additions & 3 deletions
@@ -26,14 +26,14 @@ For users in China, we provide a faster mirror:
 
 .. code-block:: bash
 
-    docker pull docker.paddlepaddle.org/paddle
+    docker pull docker.paddlepaddlehub.com/paddle
 
 Download GPU version (cuda8.0_cudnn5_avx_mkl) images:
 
 .. code-block:: bash
 
     docker pull paddlepaddle/paddle:latest-gpu
-    docker pull docker.paddlepaddle.org/paddle:latest-gpu
+    docker pull docker.paddlepaddlehub.com/paddle:latest-gpu
 
 Choose between different BLAS version:
 
@@ -53,7 +53,7 @@ and run:
 
     docker pull paddlepaddle/paddle:[tag]
    # i.e.
-    docker pull docker.paddlepaddle.org/paddle:0.10.0-gpu
+    docker pull docker.paddlepaddlehub.com/paddle:0.11.0-gpu
 
 .. _docker_run:
5959

doc/howto/optimization/cpu_profiling.md

Lines changed: 1 addition & 2 deletions
@@ -60,8 +60,7 @@ each column is as follows:
 | column | meaning |
 | --- | --- |
 | ncalls | the number of calls into a function |
-| tottime | the total execution time of the function, not including the
-execution time of other functions called by the function |
+| tottime | the total execution time of the function, not including the execution time of other functions called by the function |
 | percall | tottime divided by ncalls |
 | cumtime | the total execution time of the function, including the execution time of other functions being called |
 | percall | cumtime divided by ncalls |
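
These column names match the output of Python's built-in cProfile module. As a generic illustration (not taken from the PaddlePaddle docs), a minimal way to produce such a table for a piece of Python code is:

```python
import cProfile
import pstats

def work():
    # placeholder workload; replace with the code you want to profile
    return sum(i * i for i in range(100000))

cProfile.run("work()", "profile.out")            # write raw stats to a file
stats = pstats.Stats("profile.out")
stats.sort_stats("cumulative").print_stats(10)   # show the 10 most expensive entries
```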

paddle/framework/attribute.cc

Lines changed: 3 additions & 0 deletions
@@ -61,6 +61,9 @@ Attribute GetAttrValue(const proto::OpDesc::Attr& attr_desc) {
       }
       return val;
     }
+    case proto::AttrType::LONG: {
+      return attr_desc.l();
+    }
     default:
       PADDLE_THROW("Unsupport attr type %d", attr_desc.type());
   }

paddle/framework/attribute.h

Lines changed: 26 additions & 0 deletions
@@ -168,6 +168,32 @@ struct ExtractAttribute<bool> {
   const std::string& attr_name_;
 };
 
+// Extractor that promotes int or float attribute values to int64_t
+// before returning a pointer to the stored int64_t.
+template <>
+struct ExtractAttribute<int64_t> {
+  explicit ExtractAttribute(const std::string& attr_name)
+      : attr_name_(attr_name) {}
+
+  int64_t* operator()(Attribute& attr) const {
+    if (attr.type() == typeid(int)) {  // NOLINT
+      int val = boost::get<int>(attr);
+      attr = static_cast<int64_t>(val);
+    } else if (attr.type() == typeid(float)) {  // NOLINT
+      float val = boost::get<float>(attr);
+      attr = static_cast<int64_t>(val);
+    }
+    int64_t* attr_value = nullptr;
+    try {
+      attr_value = &boost::get<int64_t>(attr);
+    } catch (boost::bad_get& bad_get) {
+      PADDLE_THROW("Cannot get attribute %s by type int64_t, its type is %s",
+                   attr_name_, attr.type().name());
+    }
+    return attr_value;
+  }
+
+  const std::string& attr_name_;
+};
+
 // check whether a certain attribute fit its limits
 // an attribute can have more than one limits
 template <typename T>

paddle/framework/block_desc.cc

Lines changed: 4 additions & 4 deletions
@@ -75,7 +75,7 @@ std::vector<VarDesc *> BlockDesc::AllVars() const {
 
 OpDesc *BlockDesc::AppendOp() {
   need_update_ = true;
-  ops_.emplace_back(new OpDesc());
+  ops_.emplace_back(new OpDesc(this));
   return ops_.back().get();
 }
 
@@ -86,7 +86,7 @@ void BlockDesc::AppendAllocatedOp(std::unique_ptr<OpDesc> &&op_desc) {
 
 OpDesc *BlockDesc::PrependOp() {
   need_update_ = true;
-  ops_.emplace_front(new OpDesc());
+  ops_.emplace_front(new OpDesc(this));
   return ops_.front().get();
 }
 
@@ -153,7 +153,7 @@ BlockDesc::BlockDesc(ProgramDesc *prog, proto::BlockDesc *desc)
     vars_[var_desc.name()].reset(new VarDesc(var_desc));
   }
   for (const proto::OpDesc &op_desc : desc_->ops()) {
-    ops_.emplace_back(new OpDesc(op_desc, prog));
+    ops_.emplace_back(new OpDesc(op_desc, prog, this));
   }
 }
 
@@ -162,7 +162,7 @@ BlockDesc::BlockDesc(const BlockDesc &other, proto::BlockDesc *desc,
     : prog_(prog), desc_(desc) {
   need_update_ = true;
   for (auto &op : other.ops_) {
-    ops_.emplace_back(new OpDesc(*op));
+    ops_.emplace_back(new OpDesc(*op, this));
   }
 
   for (auto &it : other.vars_) {
