
Commit 192c00a

Merge branch 'develop' of https://github.com/PaddlePaddle/paddle into enhance-include-pool
2 parents: fe6af6b + 1238706

38 files changed: +1348 −412 lines

cmake/external/grpc.cmake (3 additions, 3 deletions)

```diff
@@ -24,9 +24,9 @@ SET(GRPC_INSTALL_DIR ${THIRD_PARTY_PATH}/install/grpc)
 SET(GRPC_INCLUDE_DIR "${GRPC_INSTALL_DIR}/include/" CACHE PATH "grpc include directory." FORCE)
 SET(GRPC_CPP_PLUGIN "${GRPC_INSTALL_DIR}/bin/grpc_cpp_plugin" CACHE FILEPATH "GRPC_CPP_PLUGIN" FORCE)
 IF(APPLE)
-  SET(BUILD_CMD make -n | sed "s/-Werror//g" | sh)
+  SET(BUILD_CMD make -n HAS_SYSTEM_PROTOBUF=false -s -j8 static grpc_cpp_plugin | sed "s/-Werror//g" | sh)
 ELSE()
-  SET(BUILD_CMD make)
+  SET(BUILD_CMD make HAS_SYSTEM_PROTOBUF=false -s -j8 static grpc_cpp_plugin)
 ENDIF()

 ExternalProject_Add(
@@ -42,7 +42,7 @@ ExternalProject_Add(
   # Disable -Werror, otherwise the compile will fail in MacOS.
   # It seems that we cannot configure that by make command.
   # Just dry run make command and remove `-Werror`, then use a shell to run make commands
-  BUILD_COMMAND ${BUILD_CMD} HAS_SYSTEM_PROTOBUF=false -s -j8 static grpc_cpp_plugin
+  BUILD_COMMAND ${BUILD_CMD}
   INSTALL_COMMAND make prefix=${GRPC_INSTALL_DIR} install
 )
```

cmake/generic.cmake (4 additions, 4 deletions)

```diff
@@ -227,8 +227,8 @@ function(cc_test TARGET_NAME)
     set(multiValueArgs SRCS DEPS)
     cmake_parse_arguments(cc_test "${options}" "${oneValueArgs}" "${multiValueArgs}" ${ARGN})
     add_executable(${TARGET_NAME} ${cc_test_SRCS})
-    target_link_libraries(${TARGET_NAME} ${cc_test_DEPS} gtest gtest_main)
-    add_dependencies(${TARGET_NAME} ${cc_test_DEPS} gtest gtest_main)
+    target_link_libraries(${TARGET_NAME} ${cc_test_DEPS} paddle_gtest_main paddle_memory gtest gflags)
+    add_dependencies(${TARGET_NAME} ${cc_test_DEPS} paddle_gtest_main paddle_memory gtest gflags)
     add_test(NAME ${TARGET_NAME} COMMAND ${TARGET_NAME} WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR})
   endif()
 endfunction(cc_test)
@@ -288,8 +288,8 @@ function(nv_test TARGET_NAME)
     set(multiValueArgs SRCS DEPS)
     cmake_parse_arguments(nv_test "${options}" "${oneValueArgs}" "${multiValueArgs}" ${ARGN})
     cuda_add_executable(${TARGET_NAME} ${nv_test_SRCS})
-    target_link_libraries(${TARGET_NAME} ${nv_test_DEPS} gtest gtest_main)
-    add_dependencies(${TARGET_NAME} ${nv_test_DEPS} gtest gtest_main)
+    target_link_libraries(${TARGET_NAME} ${nv_test_DEPS} paddle_gtest_main paddle_memory gtest gflags)
+    add_dependencies(${TARGET_NAME} ${nv_test_DEPS} paddle_gtest_main paddle_memory gtest gflags)
     add_test(${TARGET_NAME} ${TARGET_NAME})
   endif()
 endfunction(nv_test)
```

doc/design/float16.md (45 additions, 0 deletions)

This diff adds a "CUDA version issue" section between the hardware-support notes and the "## Implementation" section:

- [Eigen](https://github.com/RLovelett/eigen) >= 3.3 supports float16 calculation on both GPU and CPU using the `Eigen::half` class. It is mostly useful for Nvidia GPUs because of its overloaded arithmetic operators implemented with CUDA intrinsics. On CPU it falls back to software emulation, and there is no special treatment for ARM processors.
- [ARM compute library](https://github.com/ARM-software/ComputeLibrary) >= 17.02.01 supports NEON FP16 kernels (requires an ARMv8.2-A CPU).

### CUDA version issue

There are currently three versions of CUDA that support the `__half` data type: CUDA 7.5, 8.0, and 9.0.

CUDA 7.5 and 8.0 define `__half` as a simple struct holding a `uint16_t` (see [`cuda_fp16.h`](https://github.com/ptillet/isaac/blob/9212ab5a3ddbe48f30ef373f9c1fb546804c7a8c/include/isaac/external/CUDA/cuda_fp16.h)):

```cpp
typedef struct __align__(2) {
  unsigned short x;
} __half;

typedef __half half;
```

This struct does not define any overloaded arithmetic operators, so you have to call `__hadd` directly instead of using `+` to add two half values:

```cpp
__global__ void Add() {
  half a, b, c;
  c = __hadd(a, b); // correct
  c = a + b;        // compiler error: no operator "+" matches these operands
}
```

CUDA 9.0 provides a major update to the half data type. The related code can be found in the updated [`cuda_fp16.h`](https://github.com/ptillet/isaac/blob/master/include/isaac/external/CUDA/cuda_fp16.h) and the newly added [`cuda_fp16.hpp`](https://github.com/ptillet/isaac/blob/master/include/isaac/external/CUDA/cuda_fp16.hpp).

Essentially, CUDA 9.0 renames the original `__half` type of 7.5 and 8.0 to `__half_raw`, and defines a new `__half` class type that has constructors, conversion operators, and overloaded arithmetic operators:

```cpp
typedef struct __CUDA_ALIGN__(2) {
  unsigned short x;
} __half_raw;

struct __CUDA_ALIGN__(2) __half {
 protected:
  unsigned short __x;
 public:
  // constructors and conversion operators from/to
  // __half_raw and other built-in data types
};

typedef __half half;

__device__ __forceinline__
__half operator+(const __half &lh, const __half &rh) {
  return __hadd(lh, rh);
}

// Other overloaded operators
```

This new design makes `c = a + b` work correctly for the CUDA half data type.

## Implementation

The float16 class holds a 16-bit `uint16_t` data internally.
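As a quick aside (not part of the design doc), the 16-bit layout that `__half` wraps in its `unsigned short` can be inspected from the Python standard library, since `struct` supports the same IEEE 754 binary16 format:

```python
import struct

# IEEE 754 binary16: 1 sign bit, 5 exponent bits, 10 mantissa bits.
# struct's 'e' format packs a Python float into the same 16-bit layout
# that __half stores in its `unsigned short` member.
for value in (1.0, -2.0, 0.5):
    (bits,) = struct.unpack("<H", struct.pack("<e", value))
    print(f"float16 {value:+} -> 0x{bits:04x}")
# +1.0 -> 0x3c00, -2.0 -> 0xc000, +0.5 -> 0x3800
```

For example, 1.0 has sign 0, biased exponent 01111, and zero mantissa, giving the bit pattern `0x3c00`.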
Lines changed: 100 additions & 66 deletions
This diff rewrites the document from Chinese into English. The resulting English version reads:

This tutorial introduces techniques we use to profile and tune the CPU performance of PaddlePaddle, using the Python packages `cProfile` and `yep`, and Google `perftools`.

Profiling is the process that reveals performance bottlenecks, which can be very different from what the developers had in mind. Performance tuning fixes the bottlenecks, and performance optimization repeats the steps of profiling and tuning alternately.

PaddlePaddle users program AI applications by calling the Python API, which calls into `libpaddle.so`, written in C++. In this tutorial we focus on profiling and tuning

1. the Python code and
1. the mixture of Python and C++ code.

## Profiling the Python Code

### Generate the Performance Profiling File

We can use the Python standard-library package [`cProfile`](https://docs.python.org/2/library/profile.html) to generate a profiling file. For example:

```bash
python -m cProfile -o profile.out main.py
```

where `main.py` is the program to profile and `-o` specifies the output file. Without `-o`, `cProfile` prints statistics to standard output, which is inconvenient for later processing (`sort`, `split`, `cut`, and so on).
### Look into the Profiling File

`cProfile` generates `profile.out` after `main.py` completes. We can use [`cprofilev`](https://github.com/ymichael/cprofilev), a third-party Python package that serves the profiling results over HTTP, to look into the details:

```bash
cprofilev -a 0.0.0.0 -p 3214 -f profile.out main.py
```

where `-a` specifies the IP to bind the HTTP service to (`0.0.0.0` allows external access), `-p` specifies the port, `-f` specifies the profiling file, and `main.py` is the source file being profiled.

Pointing a Web browser at that IP and port shows output like the following:

39-
```text
49+
```
4050
ncalls tottime percall cumtime percall filename:lineno(function)
4151
1 0.284 0.284 29.514 29.514 main.py:1(<module>)
4252
4696 0.128 0.000 15.748 0.003 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/v2/fluid/executor.py:20(run)
4353
4696 12.040 0.003 12.040 0.003 {built-in method run}
4454
1 0.144 0.144 6.534 6.534 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/v2/__init__.py:14(<module>)
4555
```
4656

47-
每一列的含义是:
57+
where each line corresponds to Python function, and the meaning of
58+
each column is as follows:
4859

49-
| 列名 | 含义 |
60+
| column | meaning |
5061
| --- | --- |
51-
| ncalls | 函数的调用次数 |
52-
| tottime | 函数实际使用的总时间。该时间去除掉本函数调用其他函数的时间 |
53-
| percall | tottime的每次调用平均时间 |
54-
| cumtime | 函数总时间。包含这个函数调用其他函数的时间 |
55-
| percall | cumtime的每次调用平均时间 |
56-
| filename:lineno(function) | 文件名, 行号,函数名 |
62+
| ncalls | the number of calls into a function |
63+
| tottime | the total execution time of the function, not including the
64+
execution time of other functions called by the function |
65+
| percall | tottime divided by ncalls |
66+
| cumtime | the total execution time of the function, including the execution time of other functions being called |
67+
| percall | cumtime divided by ncalls |
68+
| filename:lineno(function) | where the function is defined |
5769

70+
### Identify Performance Bottlenecks
5871

59-
### 寻找性能瓶颈
60-
61-
通常`tottime``cumtime`是寻找瓶颈的关键指标。这两个指标代表了某一个函数真实的运行时间。
62-
63-
将性能分析结果按照tottime排序,效果如下:
72+
Usually, `tottime` and the related `percall` time is what we want to
73+
focus on. We can sort above profiling file by tottime:
6474

6575
```text
6676
4696 12.040 0.003 12.040 0.003 {built-in method run}
6777
300005 0.874 0.000 1.681 0.000 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/v2/dataset/mnist.py:38(reader)
6878
107991 0.676 0.000 1.519 0.000 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/v2/fluid/framework.py:219(__init__)
6979
4697 0.626 0.000 2.291 0.000 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/v2/fluid/framework.py:428(sync_with_cpp)
7080
1 0.618 0.618 0.618 0.618 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/v2/fluid/__init__.py:1(<module>)
71-
7281
```
7382

74-
可以看到最耗时的函数是C++端的`run`函数。这需要联合我们第二节`Python与C++混合代码的性能分析`来进行调优。而`sync_with_cpp`函数的总共耗时很长,每次调用的耗时也很长。于是我们可以点击`sync_with_cpp`的详细信息,了解其调用关系。
83+
We can see that the most time-consuming function is the `built-in
84+
method run`, which is a C++ function in `libpaddle.so`. We will
85+
explain how to profile C++ code in the next section. At the right
86+
moment, let's look into the third function `sync_with_cpp`, which is a
87+
Python function. We can click it to understand more about it:
7588

76-
```text
89+
```
7790
Called By:
7891
7992
Ordered by: internal time
@@ -92,72 +105,93 @@ Called:
92105
List reduced from 4497 to 2 due to restriction <'sync_with_cpp'>
93106
```
94107

95-
通常观察热点函数间的调用关系,和对应行的代码,就可以了解到问题代码在哪里。当我们做出性能修正后,再次进行性能分析(profiling)即可检查我们调优后的修正是否能够改善程序的性能。
108+
The lists of the callers of `sync_with_cpp` might help us understand
109+
how to improve the function definition.
96110
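The same "Called By" view can be produced without a browser, using `pstats` directly (a sketch with a hypothetical call chain standing in for `sync_with_cpp` and its callers):

```python
import cProfile
import io
import pstats

def sync():
    # Hypothetical hot function, standing in for sync_with_cpp.
    return sum(range(50_000))

def step():
    # Hypothetical caller of the hot function.
    return sync()

profiler = cProfile.Profile()
profiler.enable()
for _ in range(10):
    step()
profiler.disable()

# print_callers restricts the output to functions matching the pattern
# and lists who called them, like cprofilev's "Called By" section.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.print_callers("sync")
print(stream.getvalue())
```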

## Profiling Python and C++ Code

### Generate the Profiling File

To profile a mixture of Python and C++ code, we can use the Python package `yep`, which works with Google's `perftools`, a commonly used profiler for C/C++ code. Debugging a dynamic library loaded into Python is considerably harder than debugging a plain binary, and `yep` hides that complexity.

On Ubuntu systems, we can install `yep` and `perftools` by running:

```bash
apt update
apt install libgoogle-perftools-dev
pip install yep
```

Then we can run

```bash
python -m yep -v main.py
```

to generate the profiling file. The default filename is `main.py.prof`.

Note the `-v` command-line option, which prints the analysis results to the terminal after generating the profiling file. A glance at this output tells us whether debug information was stripped from `libpaddle.so` at build time. The following hints help ensure that the analysis results are readable:

1. Use the GCC command-line option `-g` when building `libpaddle.so` so as to include debug information. The standard build system of PaddlePaddle is CMake, so you may want to set `CMAKE_BUILD_TYPE=RelWithDebInfo`.

1. Use the GCC command-line option `-O2` or `-O3` to generate optimized binary code. It doesn't make sense to profile `libpaddle.so` without optimization, because an unoptimized `Debug` build performs very differently from an `-O2` or `-O3` build.

1. Profile the single-threaded binary before the multi-threaded (and then multi-machine) version, because the latter often produces tangled profiling results. You may want to set the environment variable `OMP_NUM_THREADS=1` to prevent OpenMP from automatically starting multiple threads.
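Hint 3 applies to plain `cProfile` as well: the profiler records only the thread it was enabled in, so work done on other threads silently disappears from the profile (a sketch with a hypothetical workload):

```python
import cProfile
import io
import pstats
import threading

def hidden_workload():
    # Hypothetical work done on a background thread.
    return sum(i * i for i in range(200_000))

profiler = cProfile.Profile()
profiler.enable()
t = threading.Thread(target=hidden_workload)
t.start()
t.join()
profiler.disable()

stream = io.StringIO()
pstats.Stats(profiler, stream=stream).print_stats()
# hidden_workload ran on another thread, so it is absent from the stats;
# only the main thread's calls (Thread.start, Thread.join, ...) appear.
print("hidden_workload profiled:", "hidden_workload" in stream.getvalue())
```

This is one reason tangled multi-threaded profiles are hard to read: each profiler sees only a slice of the work.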

### Look into the Profiling File

The tool we use to look into the profiling file generated by `perftools` is [`pprof`](https://github.com/google/pprof), which provides a Web-based GUI like `cprofilev`. Note that this is the `pprof` rewritten in Go, which has a Web service interface and better visualizations.

We can rely on the standard Go toolchain to retrieve the source code of `pprof` and build it:

```bash
go get github.com/google/pprof
```

Then we can use it to view `main.py.prof`, generated in the previous section:

```bash
pprof -http=0.0.0.0:3213 `which python` ./main.py.prof
```

where `-http` specifies the IP and port of the HTTP service, `` `which python` `` expands to the full path of the Python binary being profiled, and `./main.py.prof` is the profiling file. Directing a Web browser to the service shows something like the following:

![result](./pprof_1.png)
### Identify the Performance Bottlenecks

Similar to how we work with `cprofilev`, we focus on `tottime` and `cumtime`. The call graph rendered by `pprof` also helps reveal performance problems. For example:

![kernel_perf](./pprof_2.png)

In this training run, multiplication and the computation of its gradient take about 2% to 4% of the total running time, while `MomentumOp` takes about 17%. Clearly, `MomentumOp` has a performance problem and is worth optimizing.

`pprof` marks the performance-critical paths of the program in red. Checking the critical path first, and the rest afterwards, keeps the optimization work orderly.