
Commit 24d51de

Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into dev_op_tensor_support

2 parents: 27df3a9 + b2435a3

36 files changed: +2131 -88 lines

CMakeLists.txt

Lines changed: 1 addition & 0 deletions
@@ -65,6 +65,7 @@ option(REPLACE_ENFORCE_GLOG "Replace PADDLE_ENFORCE with glog/CHECK for better d
 option(WITH_ANAKIN "Compile with Anakin library" OFF)
 option(WITH_GRPC "Use grpc as the default rpc framework" ${WITH_DISTRIBUTE})
 option(WITH_BRPC_RDMA "Use brpc rdma as the rpc protocal" OFF)
+option(WITH_INFERENCE "Compile fluid inference library" ON)
 option(WITH_SYSTEM_BLAS "Use system blas library" OFF)
 option(PY_VERSION "Compile PaddlePaddle with python3 support" ${PY_VERSION})

cmake/generic.cmake

Lines changed: 5 additions & 0 deletions
@@ -264,6 +264,8 @@ function(cc_test TARGET_NAME)
            WORKING_DIRECTORY ${CMAKE_CURRENT_BINARY_DIR})
   if (${cc_test_SERIAL})
     set_property(TEST ${TARGET_NAME} PROPERTY RUN_SERIAL 1)
+
+    set_property(TEST ${TARGET_NAME} PROPERTY ENVIRONMENT FLAGS_cpu_deterministic=true)
     set_property(TEST ${TARGET_NAME} PROPERTY ENVIRONMENT FLAGS_init_allocated_mem=true)
     set_property(TEST ${TARGET_NAME} PROPERTY ENVIRONMENT FLAGS_cudnn_deterministic=true)
   endif()
@@ -330,6 +332,8 @@ function(nv_test TARGET_NAME)
   add_test(${TARGET_NAME} ${TARGET_NAME})
   if (nv_test_SERIAL)
     set_property(TEST ${TARGET_NAME} PROPERTY RUN_SERIAL 1)
+
+    set_property(TEST ${TARGET_NAME} PROPERTY ENVIRONMENT FLAGS_cpu_deterministic=true)
     set_property(TEST ${TARGET_NAME} PROPERTY ENVIRONMENT FLAGS_init_allocated_mem=true)
     set_property(TEST ${TARGET_NAME} PROPERTY ENVIRONMENT FLAGS_cudnn_deterministic=true)
   endif()
@@ -580,6 +584,7 @@ function(py_test TARGET_NAME)
   cmake_parse_arguments(py_test "${options}" "${oneValueArgs}" "${multiValueArgs}" ${ARGN})
   add_test(NAME ${TARGET_NAME}
            COMMAND env FLAGS_init_allocated_mem=true FLAGS_cudnn_deterministic=true
+           FLAGS_cpu_deterministic=true
            PYTHONPATH=${PADDLE_BINARY_DIR}/python ${py_test_ENVS}
            ${PYTHON_EXECUTABLE} -u ${py_test_SRCS} ${py_test_ARGS}
            WORKING_DIRECTORY ${CMAKE_CURRENT_BINARY_DIR})

doc/survey/op_fusion_design.md

Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
+# Operator fusion
+Fusing multiple operators together is an important way to optimize program execution, particularly on GPUs and other specialized accelerators. An obvious benefit is avoiding the overhead of writing intermediate results back to global memory.
+
+There are generally two ways to fuse operators: fusing directly connected operators and fusing operators that are not directly connected. The first approach is mainly used by [NNVM Compiler](https://github.com/dmlc/tvm/) and [XLA](https://www.tensorflow.org/performance/xla/). The second approach is mainly used by DyNet and TensorFlow Fold for auto-batching. The principle of operator fusion is to combine multiple operations into one according to a set of rules; for example, `Y = X * W` and `Z = Y + B` can be fused into `Z = X * W + B`, and `Y1 = X1 * W` and `Y2 = X2 * W` can be fused into `[Y1;Y2] = [X1;X2] * W`. To get a short-term benefit, we decided to specify these rules manually.
+
+## Challenge
+The challenges of fusing operators are:
+- how to make the rules.
+- how to implement these rules efficiently.
+
+### How to make the rules?
+
+Determining the best placement for a fused operator is an NP-hard combinatorial problem. After analyzing the operators of typical DL models, we found two groups of operators that can be fused explicitly: one is simple, adjacent operations, for example `tmp = x + y` followed by `z = Relu(tmp)`; the other is operators that perform the same function, for example a series of `SGD` or `Momentum` operators. Both groups usually appear in a model in large numbers, so we should first think about how to fuse each group separately.
+
+### How to implement these rules efficiently?
+#### How to fuse the adjacent operations efficiently?
+Here we use a template function to represent the fused operations. The pros of a template function are that it is simple and efficient; the cons are that it is not easy to extend and can only express simple operations. Taking our current needs into account, a template function is the more appropriate choice.
+
+#### How to fuse the operators that have the same function efficiently?
+Take the `SGD` operator as an example: a model may have hundreds of parameters and therefore the same number of `SGD` operators. The expression of these operators is identical (`w = w - lr*w_g`), so during training the executor evaluates it hundreds of times on the CPU or another specialized accelerator. If we can fuse these operators and make the addresses of all the `w` and all the `w_g` contiguous, we only need to execute the expression once. On some accelerators the kernel-launch time is not negligible, so launching and executing a kernel hundreds of times may cost much more than launching and executing it once. There are usually many operators similar to `SGD` in a DL model, such as `AllReduce` and `FC`.
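
The "adjacent operations" case described in the survey above can be illustrated with a small standalone sketch. This is not code from this commit; the functor and helper names are hypothetical, but it shows how a template function fuses `tmp = x + y` and `z = Relu(tmp)` into a single pass so the intermediate never goes back to memory:

#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical sketch, not the Paddle implementation: a template functor
// expressing the fused "tmp = x + y; z = Relu(tmp)" pattern.
template <typename T>
struct AddReluFunctor {
  T operator()(T x, T y) const { return std::max(x + y, static_cast<T>(0)); }
};

template <typename T, typename Functor>
void FusedElementwise(const std::vector<T> &x, const std::vector<T> &y,
                      std::vector<T> *z, Functor func) {
  z->resize(x.size());
  for (std::size_t i = 0; i < x.size(); ++i) {
    (*z)[i] = func(x[i], y[i]);  // add and ReLU applied in a single pass
  }
}

int main() {
  std::vector<float> x = {1.f, -2.f, 3.f}, y = {0.5f, 1.f, -4.f}, z;
  FusedElementwise(x, y, &z, AddReluFunctor<float>());  // z = {1.5, 0, 0}
  return 0;
}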

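For the "same function" case, a hypothetical sketch (again not the code from this commit) of what fusing hundreds of `SGD` updates amounts to once every `w` and `w_g` is packed into a contiguous buffer: a single loop, i.e. a single kernel launch, instead of one per parameter:

#include <cstddef>
#include <vector>

// Hypothetical sketch, not the Paddle implementation: with all parameters
// and gradients packed contiguously, the per-parameter updates
// `w = w - lr * w_g` collapse into one loop over the packed buffers.
void FusedSgdUpdate(std::vector<float> *packed_w,
                    const std::vector<float> &packed_w_g, float lr) {
  for (std::size_t i = 0; i < packed_w->size(); ++i) {
    (*packed_w)[i] -= lr * packed_w_g[i];
  }
}
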
paddle/fluid/API.spec

Lines changed: 1 addition & 0 deletions
@@ -336,6 +336,7 @@ paddle.fluid.contrib.BeamSearchDecoder.decode ArgSpec(args=['self'], varargs=Non
 paddle.fluid.contrib.BeamSearchDecoder.early_stop ArgSpec(args=['self'], varargs=None, keywords=None, defaults=None)
 paddle.fluid.contrib.BeamSearchDecoder.read_array ArgSpec(args=['self', 'init', 'is_ids', 'is_scores'], varargs=None, keywords=None, defaults=(False, False))
 paddle.fluid.contrib.BeamSearchDecoder.update_array ArgSpec(args=['self', 'array', 'value'], varargs=None, keywords=None, defaults=None)
+paddle.fluid.contrib.memory_usage ArgSpec(args=['program', 'batch_size'], varargs=None, keywords=None, defaults=None)
 paddle.fluid.transpiler.DistributeTranspiler.__init__ ArgSpec(args=['self', 'config'], varargs=None, keywords=None, defaults=(None,))
 paddle.fluid.transpiler.DistributeTranspiler.create_splited_vars ArgSpec(args=['self', 'source_var', 'block', 'tag'], varargs=None, keywords=None, defaults=None)
 paddle.fluid.transpiler.DistributeTranspiler.get_pserver_program ArgSpec(args=['self', 'endpoint'], varargs=None, keywords=None, defaults=None)

paddle/fluid/CMakeLists.txt

Lines changed: 4 additions & 2 deletions
@@ -5,5 +5,7 @@ add_subdirectory(operators)
 add_subdirectory(pybind)
 add_subdirectory(string)
 add_subdirectory(recordio)
-# NOTE: please add subdirectory inference at last.
-add_subdirectory(inference)
+if(WITH_INFERENCE)
+  # NOTE: please add subdirectory inference at last.
+  add_subdirectory(inference)
+endif()

paddle/fluid/framework/details/build_strategy.h

Lines changed: 20 additions & 0 deletions
@@ -21,6 +21,26 @@ namespace framework {
 namespace details {

 struct BuildStrategy {
+  // ParallelExecutor supports two ReduceStrategy modes, kAllReduce and
+  // kReduce, for both CPU and GPU. With kAllReduce, each thread optimizes
+  // its parameters separately; with kReduce, the optimization of the
+  // parameters is distributed across the threads.
+  // For example, if a model has 100 parameters and runs with four threads,
+  // then with kAllReduce every thread optimizes all 100 parameters
+  // separately, while with kReduce every thread optimizes only 25
+  // parameters.
+  // Of particular note: if you use kReduce for CPU training, all the
+  // parameters are shared between the threads, and this feature
+  // saves memory.
+  // FIXME(zcd): The results of the two modes (kAllReduce and kReduce) may
+  // not be equal on GPU, because summing in a different order can give a
+  // different result; for example, the result of `a+b+c+d` may differ from
+  // the result of `c+a+b+d`.
+  // On GPU, both kAllReduce and kReduce are implemented with NCCL,
+  // so their results may not be equal.
+  // On CPU, if you want to fix the summing order so that the results
+  // of kAllReduce and kReduce do not differ, set
+  // `FLAGS_cpu_deterministic=true` in the environment.
   enum class ReduceStrategy { kAllReduce = 0, kReduce = 1 };

   enum class GradientScaleStrategy {
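
The FIXME above (that `a+b+c+d` may differ from `c+a+b+d`) is simply the non-associativity of floating-point addition. A minimal standalone C++ demonstration, independent of Paddle:

#include <cstdio>

int main() {
  // Floating-point addition is not associative, so the order in which a
  // reduction sums its inputs can change the result.
  double a = 1e20, b = 1.0, c = -1e20;
  double order1 = (a + b) + c;  // b is absorbed by the huge a: result is 0.0
  double order2 = (a + c) + b;  // cancellation happens first: result is 1.0
  std::printf("order1 = %g, order2 = %g\n", order1, order2);
  return 0;
}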

paddle/fluid/framework/details/multi_devices_graph_builder.cc

Lines changed: 8 additions & 6 deletions
@@ -275,7 +275,8 @@ std::unique_ptr<ir::Graph> MultiDevSSAGraphBuilder::ApplyImpl(
     if (strategy_.gradient_scale_ !=
         BuildStrategy::GradientScaleStrategy::kCustomized) {
       // TODO(paddle-dev): Why is there no input for this op_handle?
-      CreateScaleLossGradOp(&result);
+      auto loss_grad_name = node->Op()->OutputArgumentNames()[0];
+      CreateScaleLossGradOp(&result, loss_grad_name);
     }
     // This assumes the backward generating code will ensure IsScaleLossOp
     // is true only for the op that scale the final scalar loss.
@@ -535,7 +536,8 @@ int MultiDevSSAGraphBuilder::GetVarDeviceID(const ir::Graph &graph,
   return got == sharded_var_device.end() ? -1 : got->second;
 }

-void MultiDevSSAGraphBuilder::CreateScaleLossGradOp(ir::Graph *result) const {
+void MultiDevSSAGraphBuilder::CreateScaleLossGradOp(
+    ir::Graph *result, const std::string &loss_grad_name) const {
   for (size_t i = 0; i < places_.size(); ++i) {
     // Insert ScaleCost OpHandle
 #ifdef PADDLE_WITH_CUDA
@@ -558,10 +560,10 @@ void MultiDevSSAGraphBuilder::CreateScaleLossGradOp(ir::Graph *result) const {
     // loss->pending_ops_.emplace_back(op_handle);
     // op_handle->inputs_.emplace_back(loss);

-    CreateOpOutput(result, op_handle,
-                   result->CreateEmptyNode(GradVarName(loss_var_name_),
-                                           ir::Node::Type::kVariable),
-                   places_[i], i);
+    CreateOpOutput(
+        result, op_handle,
+        result->CreateEmptyNode(loss_grad_name, ir::Node::Type::kVariable),
+        places_[i], i);
   }
 }

paddle/fluid/framework/details/multi_devices_graph_builder.h

Lines changed: 3 additions & 1 deletion
@@ -75,7 +75,9 @@ class MultiDevSSAGraphBuilder : public SSAGraphBuilder {
   void CreateComputationalOps(ir::Graph *result, ir::Node *node,
                               size_t num_places) const;

-  void CreateScaleLossGradOp(ir::Graph *result) const;
+  void CreateScaleLossGradOp(ir::Graph *result,
+                             const std::string &loss_grad_name) const;
+
   VarHandle *CreateReduceOp(ir::Graph *result, const std::string &og,
                             int dst_dev_id) const;
   void CreateComputationalOp(ir::Graph *result, ir::Node *node,

paddle/fluid/framework/details/reduce_op_handle.cc

Lines changed: 29 additions & 3 deletions
@@ -18,6 +18,10 @@
 #include "paddle/fluid/framework/details/variable_visitor.h"
 #include "paddle/fluid/platform/profiler.h"

+DEFINE_bool(
+    cpu_deterministic, false,
+    "Whether to make the result of computation deterministic in CPU side.");
+
 namespace paddle {
 namespace framework {
 namespace details {
@@ -91,11 +95,33 @@ void ReduceOpHandle::RunImpl() {
   } else {
     std::vector<const LoDTensor *> lod_tensors =
         GetInputValues<LoDTensor>(in_var_handles, var_scopes);
+
     if (paddle::platform::is_cpu_place(lod_tensors[0]->place())) {
       this->RunAndRecordEvent([&] {
-        ReduceLoDTensor func(lod_tensors,
-                             out_var->GetMutable<framework::LoDTensor>());
-        VisitDataType(ToDataType(lod_tensors[0]->type()), func);
+        // FIXME(zcd): The order of summing is important,
+        // especially when the type of the data is float or double.
+        // For example, the result of `a+b+c+d` may differ from the result
+        // of `c+a+b+d`, so the summing order should be fixed.
+        if (!FLAGS_cpu_deterministic) {
+          ReduceLoDTensor func(lod_tensors,
+                               out_var->GetMutable<framework::LoDTensor>());
+          VisitDataType(ToDataType(lod_tensors[0]->type()), func);
+        } else {
+          // We sum lod_tensors into reduce_sum_trg, which is in local_scopes_[0]
+          // here, but that does not mean reduce_sum_trg must be in local_scopes_[0].
+          auto &reduce_sum_trg = *this->local_scopes_[0]
+                                      ->FindVar(kLocalExecScopeName)
+                                      ->Get<Scope *>()
+                                      ->FindVar(out_var_handle->name_)
+                                      ->GetMutable<framework::LoDTensor>();
+          ReduceLoDTensor func(lod_tensors, &reduce_sum_trg);
+          VisitDataType(ToDataType(lod_tensors[0]->type()), func);
+
+          auto trg = out_var->GetMutable<framework::LoDTensor>();
+          if (reduce_sum_trg.data<void>() != trg->data<void>()) {
+            TensorCopy(reduce_sum_trg, platform::CPUPlace(), trg);
+          }
+        }
       });
     } else if (paddle::platform::is_gpu_place(lod_tensors[0]->place())) {
 #ifdef PADDLE_WITH_CUDA
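
The gist of the `FLAGS_cpu_deterministic` branch above is to always accumulate the input tensors into a single target in one fixed order. A minimal sketch of that idea, illustrative only (it is not the Paddle code and ignores scopes and LoDTensor details):

#include <cstddef>
#include <vector>

// Illustrative only: reduce several per-device buffers into one target by
// accumulating them in a fixed order (device 0, 1, 2, ...). Because the
// order never changes, the floating-point result is identical across runs,
// unlike a reduction that sums in whatever order threads happen to finish.
void DeterministicReduce(const std::vector<std::vector<float>> &inputs,
                         std::vector<float> *target) {
  *target = inputs[0];
  for (std::size_t dev = 1; dev < inputs.size(); ++dev) {
    for (std::size_t i = 0; i < target->size(); ++i) {
      (*target)[i] += inputs[dev][i];
    }
  }
}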

paddle/fluid/framework/operator.cc

Lines changed: 4 additions & 2 deletions
@@ -778,6 +778,7 @@ proto::VarType::Type OperatorWithKernel::IndicateDataType(
     const ExecutionContext& ctx) const {
   auto& scope = ctx.scope();
   int data_type = -1;
+  std::string last_input_name;
   for (auto& input : this->inputs_) {
     for (auto& ipt_name : input.second) {
       auto* var = scope.FindVar(ipt_name);
@@ -794,9 +795,10 @@ proto::VarType::Type OperatorWithKernel::IndicateDataType(
         int tmp = static_cast<int>(ToDataType(t->type()));
         PADDLE_ENFORCE(
             tmp == data_type || data_type == -1,
-            "DataType of Paddle Op %s must be the same. Get %d != %d", Type(),
-            data_type, tmp);
+            "DataType of Paddle Op %s must be the same. Get %s(%d) != %s(%d)",
+            Type(), last_input_name, data_type, ipt_name, tmp);
         data_type = tmp;
+        last_input_name = ipt_name;
       }
     }
   }
