A common bad property of many different JIT compilers is that of a "performance
cliff": a seemingly reasonable code change leads to massively reduced
performance due to hitting some weird property of the JIT compiler that's not
easy to understand for the programmer (e.g. here's a blog post about the fix of
a performance cliff when running [React on
V8](https://v8.dev/blog/react-cliff)). Hitting a performance cliff as a
programmer can be intensely frustrating and turn people off from using PyPy
altogether. Recently we've been working on trying to remove some of PyPy's
performance cliffs, and this post describes one such effort.

The problem showed up in an [issue](https://foss.heptapod.net/pypy/pypy/-/issues/3402)
where somebody found the performance of their website using
[Tornado](https://www.tornadoweb.org/en/stable/) a lot worse than what various
benchmarks suggested. It took some careful digging to figure out what caused
the problem: the slow performance was caused by the huge functions that the
Tornado templating engine creates. These functions lead the JIT to behave in
unproductive ways. This blog post will be about how we fixed this problem.

# Problem

After quite a bit of debugging we narrowed down the problem to the following
reproducer: if you render a big HTML template
([example](https://gist.github.com/cfbolz/4a346d104fee41affc860a7b928b7291#file-index-html))
using the Tornado templating engine, the template rendering is really not any
faster than CPython. A small template doesn't show this behavior, and other
parts of Tornado seem to perform well. So we looked into how the templating
engine works, and it turns out that the templates are compiled into Python
functions. This means that a big template can turn into a really enormous Python
function ([Python version of the
example](https://gist.github.com/cfbolz/4a346d104fee41affc860a7b928b7291#file-zz_autogenerated-py)).
For some reason really enormous Python functions aren't handled particularly
well by the JIT, and in the next section I'll explain some of the background
that's necessary to understand why this happens.
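
To see this code generation in action, here's a small sketch (the tiny template
is made up, and the `code` attribute holding the generated source is an
implementation detail of `tornado.template.Template`, not documented API):

```python
from tornado.template import Template

# a tiny made-up template; the templates from the bug report are much bigger
t = Template("<ul>{% for item in items %}<li>{{ item }}</li>{% end %}</ul>")
print(t.generate(items=["a", "b"]).decode("utf-8"))

# the whole template body is compiled into a single generated Python
# function; for a big template that function becomes enormous
print(t.code)
```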

# Trace Limits and Inlining

PyPy's JIT is a tracing JIT: it records the operations executed by hot code
paths into traces, optimizes them, and turns them into machine code. Traces
aren't allowed to grow without bound, however; one reason for that trace limit
is a limitation in the compact encoding of traces in
the JIT. Another reason is that we don't want to generate arbitrarily large chunks
of machine code. Usually, when we hit the trace limit, it is due to *inlining*.
While tracing, the JIT will inline many of the functions called from the
outermost one. This is usually good and improves performance greatly; however,
inlining can also lead to the trace being too long. If that happens, we
will mark a called function as uninlinable. The next time we trace the outer
function we won't inline it, leading to a shorter trace, which hopefully fits
the trace limit.
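
As an aside, the trace limit is an ordinary parameter of PyPy's JIT, so its
effect can be explored by tuning it. A tiny sketch (PyPy only; the concrete
value is purely illustrative, not a recommendation):

```python
import pypyjit  # built-in module, only available when running on PyPy

# raise the maximum allowed trace length (illustrative value)
pypyjit.set_param("trace_limit=20000")
```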

As an example, suppose the JIT traces a function `f` that calls a function `g`,
and inlining `g` makes the trace hit the trace limit. That aborts the tracing and
disables inlining of `g`. The next time we try to trace `f` the trace will
contain a *call* to `g` instead of inlining it. The trace ends up being not too
long, so we can turn it into machine code when tracing finishes.
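
Here's a minimal sketch of that situation (`f` and `g` and their bodies are
made up for illustration; imagine `g` containing enough code that inlining it
overflows the trace limit):

```python
def g(x):
    # imagine a lot more code here: if inlining g's operations into the
    # trace of f's hot loop makes the trace exceed the trace limit, the
    # JIT marks g as uninlinable
    return x * x + 1

def f(n):
    total = 0
    for i in range(n):
        # while tracing this loop the JIT first tries to inline the call;
        # once g is marked uninlinable, the trace contains a plain call to g
        total += g(i)
    return total

print(f(100000))
```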

Now we know enough to understand what the problem with automatically generated
code is: sometimes, the outermost function itself doesn't fit the trace limit,
without any inlining going on at all. This is usually not the case for normal,
hand-written Python functions. However, it can happen for automatically
generated Python code, such as the code that the Tornado templating engine
produces.

So, what happens when the JIT hits such a huge function? The function is traced
until the trace is too long. Then the trace limit stops further tracing. Since
nothing was inlined, we cannot make the trace shorter the next time by disabling
inlining: the next time we trace the function we run into exactly the same
problem, again and again. The net effect is that the function is even slowed
down: we spend time tracing it, then stop tracing and throw the trace away. That
effort is never useful, so the resulting execution can be slower than not using
the JIT at all!

# Solution

To get out of the endless cycle of useless retracing we first had the idea of
simply disabling all code generation for such huge functions that produce
too-long traces even when there is no inlining at all. However, that led to
disappointing performance in the example Tornado program, because important
parts of the code always remained interpreted.

Instead, our solution is now as follows: after we have hit the trace limit and
no inlining has happened so far, we mark the outermost function as a source of
huge traces. The next time we trace such a function, we do so in a special mode.
In that mode, hitting the trace limit behaves differently: instead of stopping
the tracer and throwing away the trace produced so far, we will use the
unfinished trace and compile it to machine code anyway. The resulting machine
code covers the beginning of the function, but stops at a basically arbitrary
point in the middle of the function.

The question is what should happen when execution reaches the end of this
unfinished trace. We want to be able to cover more of the function with machine
code and therefore need to extend the trace from that point on. But we don't
want to do that too eagerly, to prevent lots and lots of machine code being
generated. To achieve this behaviour we add a guard to the end of the unfinished
trace, which will always fail. This has the right behaviour: a failing guard
will transfer control
back to the interpreter, and the JIT counts for every guard how often it has
failed. When the always-failing guard at the end of an unfinished trace fails
often enough, we trace the next chunk of the function starting from that guard
and attach the new machine code to the old. The new chunk again ends in an
always-failing guard, so this process repeats until the traces taken together
cover the whole function and execution leaves the machine code regularly (either
by closing the loop and jumping back to trace 1, or by returning from `f`).
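
To summarize the mechanism, here is a tiny toy model of this chunking logic
(purely illustrative made-up code, not PyPy's actual implementation; the real
JIT traces operations, while this sketch just slices a list of opaque
"operations"):

```python
# Toy model of chunked trace compilation (illustrative only).
TRACE_LIMIT = 4   # maximum number of operations per trace chunk
EAGERNESS = 3     # failures of the final guard before we compile a new chunk

def run_once(ops, chunks):
    """Run one invocation of a 'function' (modeled as a list of operation
    names), executing the compiled chunks first and extending them lazily."""
    log, pos = [], 0
    for chunk in chunks:
        log.append("machine code: %s" % chunk["ops"])
        pos = chunk["end"]
    if pos < len(ops):
        # the always-failing guard at the end of the last chunk triggers:
        # fall back to the interpreter for the rest of the function
        log.append("interpreted: %s" % ops[pos:])
        chunks[-1]["failures"] += 1
        if chunks[-1]["failures"] >= EAGERNESS:
            # the failing guard got hot enough: compile the next chunk
            end = min(pos + TRACE_LIMIT, len(ops))
            chunks.append({"ops": ops[pos:end], "end": end, "failures": 0})
    return log

ops = ["op%d" % i for i in range(10)]
chunks = [{"ops": ops[:TRACE_LIMIT], "end": TRACE_LIMIT, "failures": 0}]
for i in range(8):
    print(i, run_once(ops, chunks))
```

After a few invocations the output converges to pure machine code execution,
mirroring how the real JIT gradually covers the whole huge function.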
# Evaluation

Since this is a performance cliff that we didn't observe in any of our
[benchmarks](http://speed.pypy.org/) ourselves, it's pointless to look at the
effect that this improvement has on existing benchmarks – there shouldn't be
and indeed there isn't any.

Instead, we are going to look at a micro-benchmark that came out of the original
bug report, one that simply renders a big artificial Tornado template 200 times.
The code of the micro-benchmark can be found
[here](https://gist.github.com/cfbolz/4a346d104fee41affc860a7b928b7291).
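
The timings below were collected with a small runner script along the following
lines (a hypothetical sketch: `bench.py` stands in for the micro-benchmark
script from the gist, and `pypy3` for whichever interpreter is being measured):

```python
import statistics
import subprocess
import time

# run the benchmark in a fresh process each time, ten times in total
times = []
for _ in range(10):
    start = time.time()
    subprocess.run(["pypy3", "bench.py"], check=True)
    times.append(time.time() - start)

print("mean: %.2fs, stddev: %.2fs"
      % (statistics.mean(times), statistics.stdev(times)))
```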

All benchmarks were run 10 times in new processes; we report the means and
standard deviations of the runs. We also counted how often the JIT started a
trace, and for how many traces we produced actual machine code:

| | traces started | traces with machine code | time spent tracing |
| --- | --- | --- | --- |
| PyPy3 JIT new | 30 | 25 | 0.06s |

Here we can clearly see the problem: the old JIT would try tracing the
auto-generated templating code again and again, but would never actually
produce any machine code, wasting lots of time in the process. The new JIT
still traces a few times uselessly, but then eventually converges and stops,
having emitted machine code for all the paths through the auto-generated
Python code.

Other JIT compilers have such mechanisms as well.

# Conclusion
In this post we've described a performance cliff in PyPy's JIT, that of really
big auto-generated functions which hit the trace limit without inlining, but
that we still want to generate machine code for. We achieve this by chunking up
the trace into several smaller traces, which we compile piece by piece. The
work described in this post is still a tiny bit experimental, but we will
release it as part of the upcoming 3.8 beta release, to get some more
experience with it. Please grab a [3.8 release
candidate](https://mail.python.org/pipermail/pypy-dev/2021-September/016214.html),
try it out and let us know your observations, good and bad!