
Commit b61cf7a

Merge branch 'develop' into expand

2 parents: 83f4eda + 836e1e0


41 files changed (+1763, -210 lines)

CMakeLists.txt

Lines changed: 0 additions & 6 deletions
@@ -138,12 +138,6 @@ else()
   set(THIRD_PARTY_BUILD_TYPE Release)
 endif()

-if(WITH_MKL)
-  option(MKL_SPLIT_GEMM "PaddlePaddle MKL gemm would split to small ones" OFF)
-  if (MKL_SPLIT_GEMM)
-    add_definitions(-DPADDLE_MKL_SPLIT_GEMM)
-  endif()
-endif()
 set(WITH_MKLML ${WITH_MKL})
 if (NOT DEFINED WITH_MKLDNN)
   if (WITH_MKL AND AVX2_FOUND)

cmake/external/mkldnn.cmake

Lines changed: 1 addition & 1 deletion
@@ -54,7 +54,7 @@ ExternalProject_Add(
     ${EXTERNAL_PROJECT_LOG_ARGS}
     DEPENDS ${MKLDNN_DEPENDS}
     GIT_REPOSITORY "https://github.com/01org/mkl-dnn.git"
-    GIT_TAG "a29d8487a63afca3d5b8c5bbdbb473cf8ccc6e51"
+    GIT_TAG "64e03a1939e0d526aa8e9f2e3f7dc0ad8d372944"
     PREFIX ${MKLDNN_SOURCES_DIR}
     UPDATE_COMMAND ""
     CMAKE_ARGS -DCMAKE_CXX_COMPILER=${CMAKE_CXX_COMPILER}
Lines changed: 6 additions & 6 deletions
@@ -1,22 +1,22 @@
 # Distributed Training with NCCL2

 We design a pattern that can enable training with `ParallelExecutor` and
-using [NCCL2](https://developer.nvidia.com/nccl) as it's collective
+use [NCCL2](https://developer.nvidia.com/nccl) as it's collective
 communication library.

 In `ParallelExecutor` we can use `AllReduce` or `Reduce` and `Broadcast`
 to do multi GPU training. And if we initialize NCCL2 communicators as
 ranks in a distributed environment, we can simply run the `ParallelExecutor`
 as a distributed program! The only thing that may be different than in
 the single node version is that we need to broadcast the NCCL unique ID
-to all the nodes, and initialize communicators using that ID, so NCCL2
-will know each other as ranks.
+to all the nodes and initialize communicators using that ID, so NCCL2
+can know each other as ranks.

 To achieve this feature, we introduce a new operator: `gen_nccl_id` op,
 so we are ***not*** "bind to" running NCCL2 with MPI, we can run it in
-what ever platform you like.
+whatever platform you like.

-It have two running modes:
+It has two running modes:

 1. Generate and broadcast mode, which should be used on trainer 0;
 1. Listen and fetch mode, which should be used on trainers other than 0.
@@ -29,7 +29,7 @@ initialize NCCL communicator objects.
 <img src="src/ncc2_design.png">

 The above figure indicates the general process when training with NCCL2
-distributed. Each trainer have the number of communicators equal to the
+distributed. Each trainer has the number of communicators equal to the
 number of GPUs, but the ranks should match the global ranks number: here
 we have total 8 GPUs, so `nranks==8`, for each trainer, the ranks should
 be from 0 ~ 3 on trainer 0 and 4 ~ 7 on trainer 1.
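
To make the rank layout described in the last hunk concrete (2 trainers, 4 GPUs each, `nranks==8`), here is a small standalone C++ sketch; it is illustrative only, not part of the commit, and simply derives the global NCCL rank from a trainer id and a local GPU index:

#include <cstdio>

int main() {
  const int kTrainers = 2;
  const int kGPUsPerTrainer = 4;
  const int nranks = kTrainers * kGPUsPerTrainer;  // == 8, as in the doc above
  for (int trainer_id = 0; trainer_id < kTrainers; ++trainer_id) {
    for (int gpu = 0; gpu < kGPUsPerTrainer; ++gpu) {
      // Trainer 0 owns global ranks 0~3, trainer 1 owns global ranks 4~7.
      int global_rank = trainer_id * kGPUsPerTrainer + gpu;
      std::printf("trainer %d, local GPU %d -> global rank %d of %d\n",
                  trainer_id, gpu, global_rank, nranks);
    }
  }
  return 0;
}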

doc/fluid/dev/new_op_cn.md

Lines changed: 5 additions & 5 deletions
@@ -36,19 +36,19 @@
 <tbody>
 <tr>
 <td>OpProtoMake定义 </td>
-<td>`.cc`文件,Backward Op不需要定义OpProtoMake </td>
+<td>.cc 文件,Backward Op不需要定义OpProtoMake </td>
 </tr>
 <tr>
 <td>Op定义 </td>
-<td> `.cc`文件</td>
+<td> .cc 文件</td>
 </tr>
 <tr>
 <td>Kernel实现 </td>
-<td> CPU、CUDA共享Kernel实现在`.h`文件中,否则,CPU 实现在`.cc`文件中,CUDA 实现在`.cu`文件中。</td>
+<td> CPU、CUDA共享Kernel实现在.h 文件中,否则,CPU 实现在.cc 文件中,CUDA 实现在.cu 文件中。</td>
 </tr>
 <tr>
 <td>注册Op </td>
-<td> Op注册实现在`.cc`文件;Kernel注册CPU实现在`.cc`文件中,CUDA实现在`.cu`文件中</td>
+<td> Op注册实现在.cc 文件;Kernel注册CPU实现在.cc 文件中,CUDA实现在.cu 文件中</td>
 </tr>
 </tbody>
 </table>
@@ -391,7 +391,7 @@ PADDLE_ENFORCE(ctx->HasInput("X"), "");
 ```
 问题示例2 :提示信息过于简单
 ```
-PADDLE_ENFORCE(i != nullptr, "I must be set"); // I是什么
+PADDLE_ENFORCE(i != nullptr, "i must be set"); // i是什么
 ```

 2. 在报错信息中使用开发人员定义的变量缩写,不易理解!
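
The guideline in the hunk above is that an enforce message should say which variable failed the check and why it is required, rather than a bare abbreviation. A minimal C++ sketch of the contrast, assuming only the PADDLE_ENFORCE macro already shown in this diff (the operator and input names are made-up placeholders, not from the commit):

#include "paddle/fluid/platform/enforce.h"  // provides PADDLE_ENFORCE

void CheckIndexInput(const int* i) {
  // Too terse: a reader cannot tell what "i" refers to.
  // PADDLE_ENFORCE(i != nullptr, "i must be set");

  // More descriptive (placeholder wording): name the input and the operator.
  PADDLE_ENFORCE(i != nullptr,
                 "Input 'i' (index buffer) of the hypothetical SomeOp must be "
                 "set before Run() is called");
}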

doc/fluid/howto/cluster/nccl2_rdma_training.md

Lines changed: 10 additions & 10 deletions
@@ -1,12 +1,12 @@
 # Distributed Training with NCCL2 and RDMA

-When doing distributed multi-GPU training, network bandwith often becomes the
-bottle neck. We introduce a way to use NCCL2 to do such training job to
-achieve best performace.
+When doing distributed multi-GPU training, network bandwidth often becomes the
+bottleneck. We introduce a way to use NCCL2 to do such training job to
+achieve best performance.

-## Prepare Hardwares with RDMA and Multiple GPUs
+## Prepare Hardware with RDMA and Multiple GPUs

-I'm using two Linux servers each of them is installed with 8 GPUs and
+I'm using two Linux servers each of them installed with 8 GPUs and
 one 100Gb RDMA card.
 Base environment is:

@@ -25,15 +25,15 @@ In general, the steps including:
 1. Use docker to run tests and make sure GPUs and RDMA can work inside
    the container.

-I'll ommit section "Install GPU drivers" because we can find it easily
+I'll omit the section "Install GPU drivers" because we can find it easily
 somewhere else.

 ### Install RDMA drivers

 For my case, I've got two machines with device
 "Mellanox Technologies MT27700 Family [ConnectX-4]" installed. The OS was
 "CentOS 7.4" and I updated the kernel to version 4.4 so that docker can
-work with latest overlay2 filesystem.
+work with the latest overlay2 filesystem.

 ***NOTE: before you start, make sure you have a way to get a console
 of the server other than ssh because we may need to re-configure the
@@ -45,22 +45,22 @@ network device.***
 1. Run `./mlnxofedinstall --add-kernel-support` in the software package.
 1. Run `/etc/init.d/openibd restart` to make everything work, note that
    this operation may cause the network goes down if you are using this
-   RDMA device as default network device and use ssh to login the server.
+   RDMA device as default network device and use ssh to log in the server.
 1. Re-configure the network interface, for example:
    `ifconfig eth2 192.168.16.30/20 up`, then add routes if needed:
    `ip route add default via 192.168.16.1 dev eth2`.
 1. Do the same thing on the other node.
 1. Use `ping` to test if the two nodes have typical ICMP connection.
 1. Use either `udaddy` or `ib_write_bw` to test the network connection is
-   ready and have the desired bandwith.
+   ready and have the desired bandwidth.

 ### Prepare Docker Image to Run RDMA Programs

 1. Build a docker image using cuda base image like: `nvidia/cuda:8.0-cudnn5-devel-ubuntu16.04` and install paddlepaddle whl
    package in it.
 1. Start a docker container and mount GPU driver libs into it (you can
    skip this step if you are using nvidia-docker).
-1. Mount RDMA dirvers and libs into the docker image (see below section),
+1. Mount RDMA drivers and libs into the docker image (see below section),
    also `udaddy` and `ib_write_bw` if needed.
 1. Mount GPU devices and RDMA devices into the container using `--device`
    or just use privileged mode `--privileged`.

paddle/fluid/API.spec

Lines changed: 1 addition & 0 deletions
@@ -162,6 +162,7 @@ paddle.fluid.layers.crop ArgSpec(args=['x', 'shape', 'offsets', 'name'], varargs
 paddle.fluid.layers.rank_loss ArgSpec(args=['label', 'left', 'right', 'name'], varargs=None, keywords=None, defaults=(None,))
 paddle.fluid.layers.prelu ArgSpec(args=['x', 'mode', 'param_attr', 'name'], varargs=None, keywords=None, defaults=(None, None))
 paddle.fluid.layers.flatten ArgSpec(args=['x', 'axis', 'name'], varargs=None, keywords=None, defaults=(1, None))
+paddle.fluid.layers.stack ArgSpec(args=['x', 'axis'], varargs=None, keywords=None, defaults=(0,))
 paddle.fluid.layers.data ArgSpec(args=['name', 'shape', 'append_batch_size', 'dtype', 'lod_level', 'type', 'stop_gradient'], varargs=None, keywords=None, defaults=(True, 'float32', 0, VarType.LOD_TENSOR, True))
 paddle.fluid.layers.open_recordio_file ArgSpec(args=['filename', 'shapes', 'lod_levels', 'dtypes', 'pass_num', 'for_parallel'], varargs=None, keywords=None, defaults=(1, True))
 paddle.fluid.layers.open_files ArgSpec(args=['filenames', 'shapes', 'lod_levels', 'dtypes', 'thread_num', 'buffer_size', 'pass_num', 'is_test'], varargs=None, keywords=None, defaults=(None, None, 1, None))

paddle/fluid/framework/array.h

Lines changed: 48 additions & 0 deletions
@@ -0,0 +1,48 @@
+// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#pragma once
+
+#include <cstdint>
+#include "paddle/fluid/platform/hostdevice.h"
+
+namespace paddle {
+namespace framework {
+template <typename T, size_t N>
+class Array {
+  static_assert(N > 0, "The size of array must be larger than 0");
+
+ public:
+  HOSTDEVICE Array() {}
+
+  HOSTDEVICE explicit Array(const T &val) {
+    for (size_t i = 0; i < N; ++i) data_[i] = val;
+  }
+
+  HOSTDEVICE const T *Get() const { return data_; }
+
+  HOSTDEVICE T *GetMutable() { return data_; }
+
+  HOSTDEVICE T &operator[](size_t index) { return data_[index]; }
+
+  HOSTDEVICE const T &operator[](size_t index) const { return data_[index]; }
+
+  HOSTDEVICE constexpr size_t size() const { return N; }
+
+ private:
+  T data_[N];
+};
+
+}  // namespace framework
+}  // namespace paddle
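
As a quick orientation for the new header above, here is a minimal host-side usage sketch of framework::Array. The sketch is not part of the commit and relies only on the members shown in the diff:

#include <cstdio>
#include "paddle/fluid/framework/array.h"

int main() {
  // A fixed-size array of 4 ints, every element initialized to 1.
  // HOSTDEVICE means the same methods are also usable inside CUDA kernels.
  paddle::framework::Array<int, 4> arr(1);
  for (size_t i = 0; i < arr.size(); ++i) {
    arr[i] = static_cast<int>(i * i);  // operator[] gives mutable access
  }
  std::printf("last element: %d, size: %zu\n", arr[3], arr.size());
  return 0;
}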

paddle/fluid/framework/op_proto_maker.cc

Lines changed: 0 additions & 4 deletions
@@ -129,10 +129,6 @@ void OpProtoAndCheckerMaker::operator()(proto::OpProto* proto,
                    "Optimized for variable")
       .SetDefault({});

-  AddAttr<std::vector<std::string>>(OpCreationCallstackAttrName(),
-                                    "Callstack for Op Creatation.")
-      .SetDefault({});
-
   Validate();
 }

paddle/fluid/framework/op_proto_maker.h

Lines changed: 0 additions & 1 deletion
@@ -39,7 +39,6 @@ class OpProtoAndCheckerMaker {
  public:
   static const char *OpRoleAttrName() { return "op_role"; }
   static const char *OpRoleVarAttrName() { return "op_role_var"; }
-  static const char *OpCreationCallstackAttrName() { return "op_callstack"; }

   void operator()(proto::OpProto *proto, OpAttrChecker *attr_checker);

paddle/fluid/framework/operator.cc

Lines changed: 15 additions & 46 deletions
@@ -11,17 +11,15 @@ distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License. */
-#include "paddle/fluid/framework/operator.h"
+#include <gflags/gflags.h>
+#include <glog/logging.h>
+
 #include <algorithm>
-#include <sstream>
-#include <string>
-#include <vector>
-#include "gflags/gflags.h"
-#include "glog/logging.h"
+
 #include "paddle/fluid/framework/data_transform.h"
 #include "paddle/fluid/framework/executor.h"
 #include "paddle/fluid/framework/lod_tensor.h"
-#include "paddle/fluid/framework/op_proto_maker.h"
+#include "paddle/fluid/framework/operator.h"
 #include "paddle/fluid/framework/shape_inference.h"
 #include "paddle/fluid/framework/var_type.h"
 #include "paddle/fluid/platform/profiler.h"
@@ -129,48 +127,19 @@ static LoD GetLoD(const Scope& scope, const std::string& name) {
 }

 void OperatorBase::Run(const Scope& scope, const platform::Place& place) {
-  try {
-    if (VLOG_IS_ON(4)) {
-      VLOG(4) << place << " " << DebugStringEx(&scope);
-    }
-    if (platform::is_gpu_place(place)) {
+  VLOG(4) << place << " " << DebugStringEx(&scope);
+  if (platform::is_gpu_place(place)) {
 #ifndef PADDLE_WITH_CUDA
-      PADDLE_THROW("Cannot run operator on place %s", place);
+    PADDLE_THROW("Cannot run operator on place %s", place);
 #else
-      auto dev_id = boost::get<platform::CUDAPlace>(place).device;
-      platform::SetDeviceId(dev_id);
+    auto dev_id = boost::get<platform::CUDAPlace>(place).device;
+    platform::SetDeviceId(dev_id);
 #endif
-    }
-    platform::DeviceContextPool& pool = platform::DeviceContextPool::Instance();
-    platform::RecordEvent record_event(Type(), pool.Get(place));
-    RunImpl(scope, place);
-    if (VLOG_IS_ON(3)) {
-      VLOG(3) << place << " " << DebugStringEx(&scope);
-    }
-  } catch (platform::EnforceNotMet exception) {
-    if (Attrs().count("sub_block") != 0) {
-      throw exception;
-    }
-
-    auto& callstack = Attr<std::vector<std::string>>(
-        OpProtoAndCheckerMaker::OpCreationCallstackAttrName());
-
-    if (callstack.empty()) {
-      throw exception;
-    }
-    std::ostringstream sout;
-    sout << "Invoke operator " << Type() << " error.\n";
-    sout << "Python Callstacks: \n";
-    for (auto& line : callstack) {
-      sout << line;
-    }
-    sout << "C++ Callstacks: \n";
-    sout << exception.err_str_;
-    exception.err_str_ = sout.str();
-    throw exception;
-  } catch (...) {
-    std::rethrow_exception(std::current_exception());
   }
+  platform::DeviceContextPool& pool = platform::DeviceContextPool::Instance();
+  platform::RecordEvent record_event(Type(), pool.Get(place));
+  RunImpl(scope, place);
+  VLOG(3) << place << " " << DebugStringEx(&scope);
 }

 bool OperatorBase::HasInputs(const std::string& name) const {
@@ -198,7 +167,7 @@ const std::vector<std::string>& OperatorBase::Inputs(
 }

 bool OperatorBase::HasOutputs(const std::string& name) const {
-  if (outputs_.end() != outputs_.find(name)) {
+  if (outputs_.find(name) != outputs_.end()) {
     return true;
   } else {
     return false;
