Commit 605b3e4

wangkuiyi authored and abhinavarora committed

Translate the CPU profiling document (#6073)

* Translate the CPU profiling document
* Paragraphing

1 parent ac596a3 commit 605b3e4

2 files changed: +255 −66 lines changed

Lines changed: 100 additions & 66 deletions

This tutorial introduces techniques we used to profile and tune the CPU performance of PaddlePaddle. We will use the Python packages `cProfile` and `yep`, and Google's `perftools`.

Profiling is the process that reveals performance bottlenecks, which can be very different from what developers expect. Performance tuning fixes these bottlenecks, and performance optimization repeats the profiling and tuning steps alternately.

PaddlePaddle users program AI by calling the Python API, which calls into `libpaddle.so`, written in C++. In this tutorial, we focus on the profiling and tuning of

1. the Python code and
1. the mixture of Python and C++ code.

## Profiling the Python Code

### Generate the Performance Profiling File

We can use the Python standard library package [`cProfile`](https://docs.python.org/2/library/profile.html) to generate a Python profiling file. For example:

```bash
python -m cProfile -o profile.out main.py
```

where `main.py` is the program we are going to profile and `-o` specifies the output file. Without `-o`, `cProfile` writes the statistics to standard output, which is not convenient for post-processing with tools like `sort`, `split`, and `cut`.
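
Besides the command line form above, the same standard-library package can be driven from inside the program, which is handy when we only care about one region, e.g. the training loop. Below is a minimal sketch; `train()` and the region it profiles are hypothetical placeholders, not part of PaddlePaddle:

```python
import cProfile

def train():
    # hypothetical stand-in for the code region we want to profile
    sum(i * i for i in range(10 ** 6))

profiler = cProfile.Profile()
profiler.enable()                    # start collecting profiling data
train()
profiler.disable()                   # stop collecting
profiler.dump_stats("profile.out")   # same file format as `-o profile.out`
```

The resulting `profile.out` can then be inspected with `cprofilev` exactly as described below.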

### Look into the Profiling File

`cProfile` generates `profile.out` after `main.py` completes. We can use [`cprofilev`](https://github.com/ymichael/cprofilev), a third-party Python package installable with `pip install cprofilev`, to look into the details. It starts an HTTP service that renders the profiling result as a Web page:

```bash
cprofilev -a 0.0.0.0 -p 3214 -f profile.out main.py
```

where `-a` specifies the IP address to which the HTTP service binds (`0.0.0.0` makes it reachable from other machines), `-p` specifies the port, `-f` specifies the profiling file, and `main.py` is the profiled source file.

Open a Web browser and point it to the specified IP and port; we will see output like the following:

```text
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.284    0.284   29.514   29.514 main.py:1(<module>)
     4696    0.128    0.000   15.748    0.003 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/v2/fluid/executor.py:20(run)
     4696   12.040    0.003   12.040    0.003 {built-in method run}
        1    0.144    0.144    6.534    6.534 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/v2/__init__.py:14(<module>)
```

where each line corresponds to a Python function, and the meaning of each column is as follows:

| column | meaning |
| --- | --- |
| ncalls | the number of calls into a function |
| tottime | the total execution time of the function, not including the execution time of other functions called by the function |
| percall | tottime divided by ncalls |
| cumtime | the total execution time of the function, including the execution time of other functions being called |
| percall | cumtime divided by ncalls |
| filename:lineno(function) | where the function is defined |
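
To make the difference between `tottime` and `cumtime` concrete, here is a small, self-contained toy example unrelated to PaddlePaddle: `outer` spends almost all of its time waiting for `inner`, so `outer` shows a tiny `tottime` but a `cumtime` of roughly half a second.

```python
import cProfile
import time

def inner():
    time.sleep(0.5)   # the actual work (here, just sleeping)

def outer():
    inner()           # outer itself does almost nothing

# In the report, outer has tottime close to 0 but cumtime close to 0.5;
# the 0.5 seconds of tottime is attributed to time.sleep instead.
cProfile.run("outer()")
```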

### Identify Performance Bottlenecks

Usually, `tottime` and the related `percall` time are what we want to focus on, because they reflect the time a function really spends. We can sort the above profiling result by `tottime`:

```text
     4696   12.040    0.003   12.040    0.003 {built-in method run}
   300005    0.874    0.000    1.681    0.000 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/v2/dataset/mnist.py:38(reader)
   107991    0.676    0.000    1.519    0.000 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/v2/fluid/framework.py:219(__init__)
     4697    0.626    0.000    2.291    0.000 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/v2/fluid/framework.py:428(sync_with_cpp)
        1    0.618    0.618    0.618    0.618 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/v2/fluid/__init__.py:1(<module>)
```
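
If we prefer the terminal to the `cprofilev` Web page, the standard `pstats` module can read the same `profile.out` and produce this ordering; a minimal sketch:

```python
import pstats

# Load the file produced by `python -m cProfile -o profile.out main.py`,
# sort by internal time (tottime), and print the 10 hottest functions.
stats = pstats.Stats("profile.out")
stats.sort_stats("tottime").print_stats(10)
```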

We can see that the most time-consuming function is the `built-in method run`, which is a C++ function in `libpaddle.so`. We will explain how to profile C++ code in the next section. For the moment, let's look into `sync_with_cpp`, a Python function whose total time and per-call time are both considerable. In `cprofilev` we can click it to see its call relationships:

```text
Called By:

   Ordered by: internal time

   ...

Called:

   List reduced from 4497 to 2 due to restriction <'sync_with_cpp'>
```

The list of callers of `sync_with_cpp`, together with the corresponding source lines, might help us understand how to improve the function. After we fix the code, we can profile again to check whether the change actually improves performance.
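
The same caller information is also available from `pstats`, in case we want to script this step instead of clicking through the Web UI; a minimal sketch:

```python
import pstats

stats = pstats.Stats("profile.out")
# Restrict the report to functions whose name matches 'sync_with_cpp'
# and list the functions that call them, like the "Called By" view above.
stats.print_callers("sync_with_cpp")
```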

## Profiling Python and C++ Code

### Generate the Profiling File

There are many C/C++ profilers, such as `gprof`, `valgrind`, and `google-perftools`, but profiling a dynamic library loaded from Python is more involved than profiling a standalone binary. To profile a mixture of Python and C++ code, we can use the Python package `yep`, which provides a convenient bridge to Google's `perftools`, a commonly used profiler for C/C++ code.

On Ubuntu systems, we can install `yep` and `perftools` by running the following commands:

```bash
apt update
apt install libgoogle-perftools-dev
pip install yep
```

Then we can run the following command

```bash
python -m yep -v main.py
```

to generate the profiling file. The default filename is `main.py.prof`.
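
If we want to profile only part of the program rather than the whole run, `yep` also documents a pair of module-level functions, `start()` and `stop()`, in its README; a minimal sketch under that assumption, where `train_one_pass()` is a hypothetical placeholder:

```python
import yep

def train_one_pass():
    # hypothetical stand-in for the code region we want to profile
    pass

# start()/stop() are taken from yep's README; the output is a
# perftools-compatible file that pprof can read, just like main.py.prof.
yep.start("train.prof")
train_one_pass()
yep.stop()
```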

Please be aware of the `-v` command line option, which prints the analysis results after the profiling file is generated. A glance at this output tells us whether debug information was stripped from `libpaddle.so` at build time. The following hints help make sure that the analysis results are readable:

1. Use the GCC command line option `-g` when building `libpaddle.so` so as to include debug information. The standard build system of PaddlePaddle is CMake, so you might want to set `CMAKE_BUILD_TYPE=RelWithDebInfo`.

1. Use the GCC command line option `-O2` or `-O3` to generate optimized binary code. It doesn't make sense to profile a `libpaddle.so` built without optimization, because it would run slowly anyway.

1. Profile the single-threaded binary before the multi-threaded version, because the latter often produces tangled profiling results. You might want to set the environment variable `OMP_NUM_THREADS=1` to prevent OpenMP from automatically starting multiple threads (see the sketch after this list).
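
As mentioned in the last hint, the environment variable can also be set from inside the script, under the assumption that it happens before the C++ library initializes OpenMP; setting it on the shell command line works just as well. A minimal sketch:

```python
import os

# Assumption: the OpenMP runtime in libpaddle.so reads OMP_NUM_THREADS
# when it starts its first parallel region, so set it before importing.
os.environ["OMP_NUM_THREADS"] = "1"

import paddle.v2 as paddle  # noqa: E402  (import after setting the variable)
```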

### Look into the Profiling File

The tool we use to look into the profiling file generated by `perftools` is [`pprof`](https://github.com/google/pprof), the Go rewrite, which provides a Web-based GUI like `cprofilev`.

We can rely on the standard Go toolchain to retrieve the source code of `pprof` and build it:

```bash
go get github.com/google/pprof
```

Then we can use it to start an HTTP service that displays `main.py.prof`, generated in the previous section:

```bash
pprof -http=0.0.0.0:3213 `which python` ./main.py.prof
```

where `-http` specifies the IP and port of the HTTP service, the `which python` substitution expands to the full path of the current Python executable, and `./main.py.prof` is the profiling result. Directing our Web browser to the service, we will see something like the following:

![result](./pprof_1.png)

### Identify Performance Bottlenecks

Similar to how we work with `cprofilev`, we'd focus on `tottime` and `cumtime`. The call graph rendered by `pprof` can also help us spot problems, as in the following figure:

![kernel_perf](./pprof_2.png)

We can see that, in one training iteration, the multiplication and the computation of its gradient take about 2% to 4% of the total running time, while `MomentumOp` takes about 17%. Obviously, we'd want to optimize `MomentumOp`.

`pprof` marks the performance-critical paths of the program in red. It's a good idea to check those paths first, and then move on to the rest.
