Skip to content

Commit e2d5683

Browse files
committed
Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into split_byref_op
2 parents 69188e5 + ebbc28e commit e2d5683

36 files changed

+532
-312
lines changed
File renamed without changes.

doc/fluid/api/initializer.rst

Lines changed: 42 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -33,3 +33,45 @@ Xavier
3333
:members:
3434
:noindex:
3535

36+
MSRA
37+
------
38+
39+
.. autoclass:: paddle.fluid.initializer.MSRA
40+
:members:
41+
:noindex:
42+
43+
ConstantInitializer
44+
-------------------
45+
46+
.. autoclass:: paddle.fluid.initializer.ConstantInitializer
47+
:members:
48+
:noindex:
49+
50+
UniformInitializer
51+
------------------
52+
53+
.. autoclass:: paddle.fluid.initializer.UniformInitializer
54+
:members:
55+
:noindex:
56+
57+
NormalInitializer
58+
-----------------
59+
60+
.. autoclass:: paddle.fluid.initializer.NormalInitializer
61+
:members:
62+
:noindex:
63+
64+
XavierInitializer
65+
-----------------
66+
67+
.. autoclass:: paddle.fluid.initializer.XavierInitializer
68+
:members:
69+
:noindex:
70+
MSRA
71+
------
72+
73+
MSRAInitializer
74+
-----------------
75+
.. autoclass:: paddle.fluid.initializer.MSRAInitializer
76+
:members:
77+
:noindex:
Lines changed: 46 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,46 @@
1+
# MPI-enabled PaddlePaddle Design doc
2+
3+
# Background
4+
When we do distribute multi GPU training, the communication overhead between servers become the major bottleneck, because of the following reasons:
5+
1. Must copy at least once from GPU to CPU memory so that the data can be ready to transfer. And for the pserver side, copy data from CPU to GPU introduce more overhead.
6+
2. GPU->CPU data transfer is 10 times slower than data transfer between GPUs or between PCIe devices.
7+
3. TCP connections can not make full use of RDMA 100Gb devices.
8+
9+
We will use OpenMPI API to PaddlePaddle, which can bring two benefits to PaddlePaddle:
10+
1. Enable RDMA with PaddlePaddle, which bring high-performance low latency networks.
11+
2. Enable GPUDriect with PaddlePaddle, which bring the highest throughput and lowest latency GPU read and write.
12+
13+
# Change list
14+
* Compile args: Need add compile args to enable MPI support.
15+
* Execute args: Need add execute args to assign when and how to use MPI operations.
16+
* New ops: Need new op ```mpi_send_op``` and ```mpi_listenandserve_op``` to support MPI send and receive.
17+
* Transpiler optimized: Which can add ```mpi_send_op``` and ```mpi_listenandserve_op``` to the running graph.
18+
* MPI utils package: Need MPI utils package as the low-level API supported.
19+
20+
## Compile args
21+
Because MPI or CUDA need hardware supported, so we will add compile args to enable MPI support and control compiling.Add ```WITH_MPI``` compile args to control MPI to use or not. If the ```WITH_MPI``` is ```ON```, compile system will find openMPI codes in configuration. We should prepare openMPI environment before compiling.
22+
23+
## Execute args
24+
Launch the script using the ```mpirun``` launcher, For example: ```mpirun -np 3 -hosts node1,node2,node3 python train.py```. By doing this, We can number the actors (trainer/pserver/master) with o .. (n-1). The node's number is the Rank of the calling process in a group of comm (integer), The MPI processes identify each other using a Rank ID. We have to create a mapping between PaddlePaddle's nodes and their Rank ID so that we can communicate with the correct destinations when using MPI operations.
25+
26+
## New ops
27+
We won't replace all the gRPC requests to MPI requests, the standard gRPC library is used for all administrative operations and the MPI API will be used to transfer tensor or selectRows to Pservers. The base of this idea, we create two new operators to handle requests and receives, the two operators are ```mpi_send_op``` and ```mpi_listenandserve_op```. They are a little similar to [send_op](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/send_op.cc) and [listen_and_serv_op](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/listen_and_serv_op.cc), also, We will build a new module to package MPI send and receive process.
28+
29+
### mpi_send_op
30+
Very similar with ```send_op```, we will replace gRPC code which used to send gradient with ```mpi_module```, at the same time, we will wrap it with ```framework::Async```.
31+
32+
### mpi_listenandserve_op
33+
Very similar with ```listen_and_serv_op```, we will replace gRPC code which used to receive gradient with ```mpi_module```, at the same time, we will wrap it with ```framework::Async```.
34+
35+
## Transpiler optimized
36+
**We can get env ```OMPI_COMM_WORLD_SIZE``` and ```OMPI_COMM_WORLD_RANK``` to distinguish use MPI or not, If we use openMPI, the variable in env must exist.**
37+
if confirm to use MPI, we will modify ```send_op``` to ```mpi_send_op``` in distribute_transpiler, and modify ```listenandserve_op``` to ```mpi_listenandserve_op``` also.
38+
39+
## MPI utils package
40+
In this package, We will write openMPI low-level API to use MPI.
41+
The API included in this package are:
42+
* MPI send and receive module, We will build a new module to package MPI send and receive process. MPI send and receive are different to gRPC, the MPI [recvice](https://www.open-mpi.org/doc/v1.8/man3/MPI_Irecv.3.php) must know receive buffer size and receive buffer element. For this reason, We have to make communications twice, the first one is to send metadata about gradient through gRPC, the second one is the real communication through MPI which send gradient data to mpi_listenandserve_op.
43+
The detailed flow is below:
44+
![](https://github.com/seiriosPlus/Paddle/blob/mpi_enabled/doc/fluid/design/dist_train/src/mpi_module.png)
45+
* MPI global configurations, which store the Rank ID and the mapping in global variables, for example:
46+
gRPC client : MPI nodes :``` 127.0.0.1:32004 : 3 ```
104 KB
Loading

paddle/fluid/framework/details/broadcast_op_handle_test.cc

Lines changed: 5 additions & 14 deletions
Original file line numberDiff line numberDiff line change
@@ -77,14 +77,9 @@ struct TestBroadcastOpHandle {
7777
local_scopes_[input_scope_idx]->Var("input");
7878

7979
op_handle_.reset(new BroadcastOpHandle(local_scopes_, gpu_list_));
80-
81-
vars_.emplace_back(new VarHandle());
82-
VarHandle* in_var_handle = static_cast<VarHandle*>(vars_.back().get());
83-
in_var_handle->place_ = gpu_list_[input_scope_idx];
84-
in_var_handle->name_ = "input";
85-
in_var_handle->version_ = 1;
86-
in_var_handle->scope_idx_ = input_scope_idx;
87-
in_var_handle->generated_op_ = nullptr;
80+
auto* in_var_handle =
81+
new VarHandle(1, input_scope_idx, "input", gpu_list_[input_scope_idx]);
82+
vars_.emplace_back(in_var_handle);
8883
op_handle_->AddInput(in_var_handle);
8984

9085
// add dummy var
@@ -96,12 +91,8 @@ struct TestBroadcastOpHandle {
9691

9792
for (size_t j = 0; j < gpu_list_.size(); ++j) {
9893
op_handle_->dev_ctxes_[gpu_list_[j]] = ctxs_[j].get();
99-
vars_.emplace_back(new VarHandle());
100-
VarHandle* out_var_handle = static_cast<VarHandle*>(vars_.back().get());
101-
out_var_handle->place_ = gpu_list_[j];
102-
out_var_handle->name_ = "out";
103-
out_var_handle->version_ = 2;
104-
out_var_handle->scope_idx_ = j;
94+
VarHandle* out_var_handle = new VarHandle(2, j, "out", gpu_list_[j]);
95+
vars_.emplace_back(out_var_handle);
10596
op_handle_->AddOutput(out_var_handle);
10697
}
10798

paddle/fluid/framework/details/gather_op_handle_test.cc

Lines changed: 5 additions & 13 deletions
Original file line numberDiff line numberDiff line change
@@ -79,13 +79,8 @@ struct TestGatherOpHandle {
7979
// add input
8080
for (size_t j = 0; j < gpu_list_.size(); ++j) {
8181
op_handle_->dev_ctxes_[gpu_list_[j]] = ctxs_[j].get();
82-
vars_.emplace_back(new VarHandle());
83-
VarHandle* in_var_handle = static_cast<VarHandle*>(vars_.back().get());
84-
in_var_handle->place_ = gpu_list_[j];
85-
in_var_handle->name_ = "input";
86-
in_var_handle->version_ = 1;
87-
in_var_handle->scope_idx_ = j;
88-
in_var_handle->generated_op_ = nullptr;
82+
auto* in_var_handle = new VarHandle(1, j, "input", gpu_list_[j]);
83+
vars_.emplace_back(in_var_handle);
8984
op_handle_->AddInput(in_var_handle);
9085
}
9186

@@ -97,12 +92,9 @@ struct TestGatherOpHandle {
9792
op_handle_->AddInput(in_dummy_var_handle);
9893

9994
// add output
100-
vars_.emplace_back(new VarHandle());
101-
VarHandle* out_var_handle = static_cast<VarHandle*>(vars_.back().get());
102-
out_var_handle->place_ = gpu_list_[input_scope_idx];
103-
out_var_handle->name_ = "out";
104-
out_var_handle->version_ = 2;
105-
out_var_handle->scope_idx_ = input_scope_idx;
95+
auto* out_var_handle =
96+
new VarHandle(2, input_scope_idx, "out", gpu_list_[input_scope_idx]);
97+
vars_.emplace_back(out_var_handle);
10698
op_handle_->AddOutput(out_var_handle);
10799

108100
// add dummy var

paddle/fluid/framework/details/multi_devices_graph_builder.cc

Lines changed: 3 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -177,13 +177,9 @@ std::unique_ptr<SSAGraph> MultiDevSSAGraphBuilder::Build(
177177
auto &prev_grad = vars[vars.size() - 1];
178178
op_handle->AddInput(prev_grad.get());
179179

180-
vars.emplace_back(new VarHandle);
181-
auto &var = vars.back();
182-
var->place_ = p;
183-
var->name_ = og;
184-
var->version_ = vars.size() - 1;
185-
186-
op_handle->AddOutput(var.get());
180+
auto var = new VarHandle(vars.size() - 1, i, og, p);
181+
vars.emplace_back(var);
182+
op_handle->AddOutput(var);
187183
}
188184
#else
189185
PADDLE_ENFORCE("Not implemented");

paddle/fluid/framework/details/ssa_graph_builder.cc

Lines changed: 5 additions & 13 deletions
Original file line numberDiff line numberDiff line change
@@ -54,13 +54,8 @@ VarHandle *SSAGraphBuilder::CreateOrGetLatestVarHandle(
5454
auto &var_holder = var_holders[each_var_name];
5555
VarHandle *var = nullptr;
5656
if (var_holder.empty()) {
57-
var_holder.emplace_back(new VarHandle);
58-
auto &init_var = var_holder[0];
59-
init_var->place_ = place;
60-
init_var->name_ = each_var_name;
61-
init_var->generated_op_ = nullptr;
62-
init_var->version_ = 0;
63-
var = init_var.get();
57+
var = new VarHandle(0, place_offset, each_var_name, place);
58+
var_holder.emplace_back(var);
6459
} else {
6560
var = var_holder.rbegin()->get();
6661
}
@@ -73,12 +68,9 @@ void SSAGraphBuilder::CreateOpOutput(SSAGraph *graph, OpHandleBase *op_handle,
7368
size_t place_offset) {
7469
auto &vars = graph->vars_[place_offset][each_var_name];
7570
size_t version = vars.size();
76-
vars.emplace_back(new VarHandle());
77-
auto &var = vars.back();
78-
var->version_ = version;
79-
var->name_ = each_var_name;
80-
var->place_ = place;
81-
op_handle->AddOutput(var.get());
71+
auto var = new VarHandle(version, place_offset, each_var_name, place);
72+
vars.emplace_back(var);
73+
op_handle->AddOutput(var);
8274
}
8375

8476
template <typename Callback>

paddle/fluid/framework/details/var_handle.h

Lines changed: 10 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -16,6 +16,7 @@
1616
#include <sstream>
1717
#include <string>
1818
#include <unordered_set>
19+
#include <utility>
1920

2021
#include "paddle/fluid/platform/place.h"
2122

@@ -33,10 +34,10 @@ struct VarHandleBase {
3334

3435
// The operator who generate this variable. nullptr if the variable
3536
// is a root node.
36-
OpHandleBase *generated_op_;
37+
OpHandleBase* generated_op_{nullptr};
3738

3839
// Operators which depend on this variable ready.
39-
std::unordered_set<OpHandleBase *> pending_ops_;
40+
std::unordered_set<OpHandleBase*> pending_ops_;
4041
};
4142

4243
// VarHandle is actually a single version of Runtime Variable.
@@ -47,6 +48,13 @@ struct VarHandleBase {
4748
struct VarHandle : public VarHandleBase {
4849
std::string DebugString() const override;
4950

51+
VarHandle(size_t version, size_t scope_index, std::string name,
52+
platform::Place place)
53+
: version_(version),
54+
scope_idx_(scope_index),
55+
name_(std::move(name)),
56+
place_(std::move(place)) {}
57+
5058
// version field currently is not used, however, just store the version to
5159
// debug easily.
5260
size_t version_;

paddle/fluid/operators/beam_search_decode_op.cc

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -13,6 +13,7 @@ See the License for the specific language governing permissions and
1313
limitations under the License. */
1414

1515
#include "paddle/fluid/operators/beam_search_decode_op.h"
16+
#include <string>
1617
#include "paddle/fluid/platform/device_context.h"
1718

1819
namespace paddle {

0 commit comments

Comments
 (0)