This tutorial introduces techniques we used to profile and tune the CPU performance of PaddlePaddle. We will use the Python packages `cProfile` and `yep`, and Google `perftools`.
Profiling is the process that reveals performance bottlenecks, which could be very different from what's in the developers' minds. Performance tuning fixes these bottlenecks. Performance optimization repeats the two steps of profiling and tuning alternately.
PaddlePaddle users program AI applications by calling the Python API, which calls into `libpaddle.so`, written in C++. In this tutorial, we focus on the profiling and tuning of

1. the Python code and
1. the mixture of Python and C++ code.
## Profiling the Python Code

### Generate the Performance Profiling File
We can use the Python standard library package [`cProfile`](https://docs.python.org/2/library/profile.html) to generate the profiling file. For example:
```bash
python -m cProfile -o profile.out main.py
```
where `main.py` is the program we are going to profile, and `-o` specifies the output file. Without `-o`, `cProfile` would print the statistics to standard output, which is inconvenient for post-processing (e.g., with `sort`, `split`, or `cut`).
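The same kind of profiling file can also be produced and inspected programmatically with the standard `cProfile` and `pstats` modules. The following sketch profiles a hypothetical `work()` function standing in for `main.py` (the function and its workload are made up for illustration):

```python
import cProfile
import pstats

def work():
    # Hypothetical workload standing in for main.py.
    return sum(i * i for i in range(100000))

prof = cProfile.Profile()
prof.enable()
work()
prof.disable()

# dump_stats writes the same on-disk format as `python -m cProfile -o`.
prof.dump_stats("profile.out")

# Load the file back and print the five most expensive entries by tottime.
stats = pstats.Stats("profile.out")
stats.sort_stats("tottime").print_stats(5)
```

`pstats` can sort and filter the same data that the viewer tools present, which is handy for scripting or quick checks on a headless machine.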
### Look into the Profiling File
`cProfile` generates `profile.out` after `main.py` completes. We can use [`cprofilev`](https://github.com/ymichael/cprofilev), a third-party Python package installable via `pip install cprofilev`, to look into the details. It starts an HTTP service that presents the profiling results as a web page:
```bash
cprofilev -a 0.0.0.0 -p 3214 -f profile.out main.py
```
where `-a` specifies the IP address the HTTP service binds to (`0.0.0.0` allows access from other machines), `-p` specifies the port, `-f` specifies the profiling file, and `main.py` is the profiled source file.

Point a Web browser at the specified IP and port, and we will see output like the following:
```text
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.284    0.284   29.514   29.514 main.py:1(<module>)
     4696    0.128    0.000   15.748    0.003 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/v2/fluid/executor.py:20(run)
     4696   12.040    0.003   12.040    0.003 {built-in method run}
        1    0.144    0.144    6.534    6.534 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/v2/__init__.py:14(<module>)
```
where each line corresponds to a Python function, and the meaning of each column is as follows:

| column | meaning |
| --- | --- |
| ncalls | the number of calls into a function |
| tottime | the total execution time of the function, not including the execution time of other functions called by the function |
| percall | tottime divided by ncalls |
| cumtime | the total execution time of the function, including the execution time of other functions being called |
| percall | cumtime divided by ncalls |
| filename: lineno (function) | where the function is defined |
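To make the difference between `tottime` and `cumtime` concrete, here is a small self-contained sketch (the function names are made up for illustration): `outer`'s `cumtime` includes the 50 ms spent sleeping inside `inner`, while its `tottime` does not.

```python
import cProfile
import io
import pstats
import time

def inner():
    time.sleep(0.05)           # all of this time belongs to inner's tottime

def outer():
    inner()                    # adds to outer's cumtime, but not its tottime
    for _ in range(10000):     # work in outer's own frame adds to its tottime
        pass

prof = cProfile.Profile()
prof.enable()
outer()
prof.disable()

buf = io.StringIO()
# Restrict the report to the two functions of interest.
pstats.Stats(prof, stream=buf).print_stats("inner|outer")
print(buf.getvalue())
```

In the printed report, `outer` shows a `cumtime` of at least 0.05 seconds but a much smaller `tottime`.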
### Identify Performance Bottlenecks

Usually, `tottime` and the related `percall` time are what we want to focus on, because they measure how much time a function itself really takes. We can sort the above profiling result by `tottime`:
```text
     4696   12.040    0.003   12.040    0.003 {built-in method run}
   300005    0.874    0.000    1.681    0.000 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/v2/dataset/mnist.py:38(reader)
   107991    0.676    0.000    1.519    0.000 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/v2/fluid/framework.py:219(__init__)
     4697    0.626    0.000    2.291    0.000 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/v2/fluid/framework.py:428(sync_with_cpp)
        1    0.618    0.618    0.618    0.618 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/v2/fluid/__init__.py:1(<module>)
```
We can see that the most time-consuming function is the `{built-in method run}`, which is a C++ function in `libpaddle.so`. We will explain how to profile C++ code in the next section. For now, let's look into `sync_with_cpp`, a Python function whose total time and per-call time are both long. We can click it in `cprofilev` to learn more about its call relationships:
```text
Called By:

   Ordered by: internal time

...

List reduced from 4497 to 2 due to restriction <'sync_with_cpp'>
```
The list of the callers of `sync_with_cpp`, together with the corresponding source lines, helps us locate the problem code. After making a fix, we can profile again to check whether the change actually improves performance.
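A similar caller view is available outside the browser through `pstats.Stats.print_callers`. Here is a minimal sketch with hypothetical functions (`hot`, `caller_a`, `caller_b` are made up for illustration):

```python
import cProfile
import io
import pstats

def hot():
    return sum(i * i for i in range(10000))

def caller_a():
    return hot()

def caller_b():
    return hot() + hot()

prof = cProfile.Profile()
prof.enable()
caller_a()
caller_b()
prof.disable()

buf = io.StringIO()
# print_callers shows, for each function matching the pattern, which
# functions called it and how often -- the same view as "Called By:".
pstats.Stats(prof, stream=buf).print_callers("hot")
print(buf.getvalue())
```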
## Profiling Python and C++ Code

### Generate the Profiling File
To profile a mixture of Python and C++ code, we can use the Python package `yep`, which works with Google `perftools`, a commonly-used profiler for C/C++ code.
On Ubuntu systems, we can install `yep` and `perftools` by running the following commands:
```bash
apt update
apt install libgoogle-perftools-dev
pip install yep
```
Then we can run the following command

```bash
python -m yep -v main.py
```
to generate the profiling file. The default filename is `main.py.prof`.

Please be aware of the `-v` command line option, which prints the analysis results after generating the profiling file. By taking a glance at the printed result, we'd know whether debug information was stripped from `libpaddle.so` at build time. The following hints help make sure that the analysis results are readable:

1. Use the GCC command line option `-g` when building `libpaddle.so` so as to include the debug information. The standard build system of PaddlePaddle is CMake, so you might want to set `CMAKE_BUILD_TYPE=RelWithDebInfo`.

1. Use the GCC command line option `-O2` or `-O3` to generate optimized binary code. It doesn't make sense to profile `libpaddle.so` without optimization, because a plain `Debug` build performs very differently from an optimized one.

1. Profile the single-threaded binary before the multi-threaded version, because the latter often generates tangled profiling results. You might want to set the environment variable `OMP_NUM_THREADS=1` to prevent OpenMP from automatically starting multiple threads.
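The first and third hints can be combined as in the following sketch. The out-of-source build layout is an assumption; only the `CMAKE_BUILD_TYPE` and `OMP_NUM_THREADS` settings come from the hints above:

```bash
# Hypothetical out-of-source build: RelWithDebInfo keeps debug info (-g)
# in an optimized (-O2) binary, covering the first two hints.
mkdir -p build && cd build
cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo ..
make

# Third hint: profile single-threaded first, so OpenMP does not start
# extra threads that tangle the analysis result.
OMP_NUM_THREADS=1 python -m yep -v main.py
```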
### Look into the Profiling File
The tool we use to look into the profiling file generated by `perftools` is [`pprof`](https://github.com/google/pprof), which provides a Web-based GUI like `cprofilev`. Note that this is the `pprof` rewritten in Go, whose Web interface presents the results better than the original.

We can rely on the standard Go toolchain to retrieve the source code of `pprof` and build it:
```bash
go get github.com/google/pprof
```
Then we can use it to serve `main.py.prof` generated in the previous section over HTTP:
```bash
pprof -http=0.0.0.0:3213 `which python` ./main.py.prof
```
where `-http` specifies the IP and port of the HTTP service, `` `which python` `` expands to the full path of the current Python executable, and `./main.py.prof` is the profiling file. Directing our Web browser to the service, we would see something like the following:
![result](./pprof_1.png)

### Identify the Performance Bottlenecks
Similar to how we work with `cprofilev`, we'd focus on `tottime` and `cumtime`. The call graph drawn by `pprof` also helps us spot performance problems. For example, in the following figure:
![kernel_perf](./pprof_2.png)
we can see that the multiplication and the computation of its gradient take only 2% to 4% of the total running time of a training iteration, while `MomentumOp` takes about 17%. Obviously, `MomentumOp` is the one to optimize.

`pprof` marks the performance-critical path of the program in red. Examining the critical path first and the other parts afterwards keeps the optimization work in order.