Commit e06912a

committed
do a round of editing
1 parent 5639710 commit e06912a

File tree

1 file changed: +60 −28 lines

posts/2021/09/open-ended-traces.md

Lines changed: 60 additions & 28 deletions
@@ -15,16 +15,37 @@
 A common bad property of many different JIT compilers is that of a "performance
 cliff": A seemingly reasonable code change, leading to massively reduced
 performance due to hitting some weird property of the JIT compiler that's not
-easy to understand for the programmer. Hitting a performance cliff as a
+easy to understand for the programmer (e.g. here's a blog post about the fix of
+a performance cliff when running [React on
+V8](https://v8.dev/blog/react-cliff)). Hitting a performance cliff as a
 programmer can be intensely frustrating and turn people off from using PyPy
 altogether. Recently we've been working on trying to remove some of PyPy's
 performance cliffs, and this post describes one such effort.
 
 The problem showed up in an [issue](https://foss.heptapod.net/pypy/pypy/-/issues/3402)
-where somebody described found the performance
-of their website using Tornado a lot worse than what various benchmarks
-suggested. It took some careful digging down into the problem to figure out what
-caused the problem, this blog post will be about how we solved it.
+where somebody found the performance
+of their website using [Tornado](https://www.tornadoweb.org/en/stable/) a lot
+worse than what various benchmarks suggested. It took some careful digging to
+figure out what caused the problem: the slow performance was caused by the huge
+functions that the Tornado templating engine creates. These functions lead the
+JIT to behave in unproductive ways. This blog post will be about how we fixed
+this problem.
+
+# Problem
+
+After quite a bit of debugging we narrowed down the problem to the following
+reproducer: If you render a big HTML template
+([example](https://gist.github.com/cfbolz/4a346d104fee41affc860a7b928b7291#file-index-html))
+using the Tornado templating engine, the template rendering is really not any
+faster than on CPython. A small template doesn't show this behavior, and other
+parts of Tornado seem to perform well. So we looked into how the templating
+engine works, and it turns out that the templates are compiled into Python
+functions. This means that a big template can turn into a really enormous Python
+function ([Python version of the
+example](https://gist.github.com/cfbolz/4a346d104fee41affc860a7b928b7291#file-zz_autogenerated-py)).
+For some reason really enormous Python functions aren't handled particularly
+well by the JIT, and in the next section I'll explain some of the background
+that's necessary to understand why this happens.
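To make the "templates become Python functions" idea concrete, here is a tiny invented sketch of how a template compiler of this style can work. This is illustration code only, not Tornado's actual code generator (which is linked above); all names here are made up:

```python
def compile_template(source):
    """Compile a template with {{ expr }} substitutions into a single
    Python function by generating source code and exec-ing it."""
    lines = ["def render(ns):", "    _out = []"]
    for part in source.split("{{"):
        if "}}" in part:
            expr, literal = part.split("}}", 1)
            # every substitution becomes one statement in the function body
            lines.append("    _out.append(str(eval(%r, ns)))" % expr.strip())
        else:
            literal = part
        if literal:
            lines.append("    _out.append(%r)" % literal)
    lines.append("    return ''.join(_out)")
    namespace = {}
    exec("\n".join(lines), namespace)
    return namespace["render"]

render = compile_template("Hello {{ name }}! You have {{ n }} messages.")
print(render({"name": "World", "n": 3}))
# Hello World! You have 3 messages.
```

Because every literal chunk and every substitution turns into a statement in the function body, a template with thousands of substitutions compiles to a function with thousands of statements — exactly the kind of enormous auto-generated function this post is about.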
 
 # Trace Limits and Inlining
 
@@ -34,8 +55,9 @@ in, the reason for that is some limitation in the compact encoding of traces in
 the JIT. Another reason is that we don't want to generate arbitrarily large chunks
 of machine code. Usually, when we hit the trace limit, it is due to *inlining*.
 While tracing, the JIT will inline many of the functions called from the
-outermost one. This can lead to the trace being too long. If that happens, we
-will mark a called function as uninlinable and the next time we trace the outer
+outermost one. This is usually good and improves performance greatly; however,
+inlining can also lead to the trace being too long. If that happens, we
+will mark a called function as uninlinable. The next time we trace the outer
 function we won't inline it, leading to a shorter trace, which hopefully fits
 the trace limit.
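The inline-then-back-off policy described in this hunk can be modelled with a small sketch. All names and numbers here are invented for illustration — the real logic lives inside the RPython JIT, not in Python code like this:

```python
from collections import namedtuple

# Invented model of a function: a name plus how many trace operations
# it contributes (not PyPy's real data structures).
Fn = namedtuple("Fn", ["name", "op_count"])

TRACE_LIMIT = 6000  # illustrative number, not PyPy's actual trace limit


class TraceState:
    def __init__(self):
        self.dont_inline = set()  # callees marked uninlinable

    def trace(self, outer, callees):
        """Trace `outer`, inlining callees that aren't blacklisted.
        Returns the trace length on success, or None if the trace hit
        the limit and was thrown away."""
        ops = outer.op_count
        inlined = []
        for fn in callees:
            if fn in self.dont_inline:
                ops += 1  # residual call instruction instead of inlining
            else:
                ops += fn.op_count
                inlined.append(fn)
        if ops > TRACE_LIMIT:
            if inlined:
                # blame the largest inlined callee; the next attempt
                # will produce a shorter trace
                self.dont_inline.add(max(inlined, key=lambda fn: fn.op_count))
            return None  # trace aborted and discarded
        return ops
```

Note the failure mode the following sections describe: if `outer.op_count` alone exceeds the limit, `inlined` is empty, nothing gets blacklisted, and every retracing attempt fails in exactly the same way.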
 
@@ -47,28 +69,33 @@ disables inlining of `g`. The next time we try to trace `f` the trace will
 contain a *call* to `g` instead of inlining it. The trace ends up being not too
 long, so we can turn it into machine code when tracing finishes.
 
-This is where the problem occurs: sometimes, the outermost function itself
+Now we know enough to understand what the problem with automatically generated
+code is: sometimes, the outermost function itself
 doesn't fit the trace limit, without any inlining going on at all. This is
 usually not the case for normal, hand-written Python functions. However, it can
 happen for automatically generated Python code, such as the code that the
 Tornado templating engine produces.
 
-This is what used to happen in such a situation: the function is traced until
-the trace is too long. Then the trace limits stops further tracing. This happens
-again and again. The effect is that the function is even slowed down: we spend
-time tracing it, but that effort is never useful, so the resulting execution
-can be slower than not using the JIT at all!
+So, what happens when the JIT hits such a huge function? The function is traced
+until the trace is too long. Then the trace limit stops further tracing. Since
+nothing was inlined, we cannot make the trace shorter the next time by disabling
+inlining. Therefore this happens again and again: the next time we trace the
+function we run into exactly the same problem. The net effect is that the
+function is even slowed down: we spend time tracing it, then stop tracing and
+throw the trace away. That effort is never useful, so the resulting
+execution can be slower than not using the JIT at all!
 
 
 # Solution
 
 To get out of the endless cycle of useless retracing we first had the idea of
-simply disabling all code generation for such functions, that produce too long
+simply disabling all code generation for such huge functions, which produce too-long
 traces even if there is no inlining at all. However, that led to disappointing
-performance, because important parts of the code were always interpreted.
+performance in the example Tornado program, because important parts of the code
+always remain interpreted.
 
 Instead, our solution is now as follows: After we have hit the trace limit and
-no inlining has happened so far, we mark that function as a source of huge
+no inlining has happened so far, we mark the outermost function as a source of huge
 traces. The next time we trace such a function, we do so in a special mode. In
 that mode, hitting the trace limit behaves differently: Instead of stopping the
 tracer and throwing away the trace produced so far, we will use the unfinished
@@ -77,8 +104,9 @@ function, but stops at a basically arbitrary point in the middle of the
 function.
 
 The question is what should happen when execution
-reaches the end of this unfinished trace. We want to be able to extend the trace
-from that point and add another piece of machine code, but not do that too
+reaches the end of this unfinished trace. We want to be able to cover more of
+the function with machine code and therefore need to extend the trace
+from that point on. But we don't want to do that too
 eagerly to prevent lots and lots of machine code being generated. To achieve
 this behaviour we add a guard to the end of the unfinished trace, which will
 always fail. This has the right behaviour: a failing guard will transfer control
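The chunking scheme can be sketched roughly as follows. This is invented illustration code: in the real JIT the always-failing guard triggers the tracer lazily at run time, whereas this sketch just precomputes all the chunks up front:

```python
TRACE_LIMIT = 4  # tiny limit so the example stays readable


def trace_in_chunks(ops):
    """Split a too-long list of trace operations into chunks. Every chunk
    except the last ends in a guard that always fails; failing that guard
    is what later triggers tracing and compiling the next chunk."""
    traces = []
    pos = 0
    while pos < len(ops):
        chunk = list(ops[pos:pos + TRACE_LIMIT])
        pos += len(chunk)
        if pos < len(ops):
            # control transfers here once execution runs off this chunk
            chunk.append("guard_always_fails(resume_at=%d)" % pos)
        traces.append(chunk)
    return traces
```

Because each guard fires only when execution actually reaches the end of its chunk, machine code is produced lazily, one piece at a time, instead of all at once.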
@@ -105,13 +133,14 @@ closing the loop and jumping back to trace 1, or by returning from `f`).
 
 # Evaluation
 
-Since this is a performance cliff that we didn't observe in any of our own
-benchmarks ourselves, it's pointless to look at its effect on existing
-benchmarks – there shouldn't and indeed there isn't any.
+Since this is a performance cliff that we didn't observe in any of our
+[benchmarks](http://speed.pypy.org/) ourselves, it's pointless to look at the
+effect that this improvement has on existing benchmarks – there shouldn't and
+indeed there isn't any.
 
 Instead, we are going to look at a micro-benchmark that came out of the
-original bug report, one that simply renders a big artificial Tornado template.
-The code of the micro-benchmark can be found
+original bug report, one that simply renders a big artificial Tornado template
+200 times. The code of the micro-benchmark can be found
 [here](https://gist.github.com/cfbolz/4a346d104fee41affc860a7b928b7291).
 
 All benchmarks were run 10 times in new processes. The means and standard
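The measurement setup (repeated runs, mean and standard deviation) can be sketched like this. This is a hypothetical harness, not the actual micro-benchmark from the gist; in particular, the real setup starts a fresh process per run, so that each sample includes JIT warmup, which is exactly what this performance cliff affects:

```python
import time


def bench(workload, runs=10):
    """Time `workload` `runs` times; return mean and (population)
    standard deviation in seconds. Unlike the post's setup, all runs
    share one process here, so JIT warmup is only paid once."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        workload()
        samples.append(time.perf_counter() - start)
    mean = sum(samples) / len(samples)
    sd = (sum((s - mean) ** 2 for s in samples) / len(samples)) ** 0.5
    return mean, sd
```

Running each sample in a new process matters for a warmup-dominated workload: an in-process loop like this one would hide most of the tracing overhead the post is measuring.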
@@ -137,8 +166,8 @@ and for how many traces we produced actual machine code:
 | PyPy3 JIT new | 30 | 25 | 0.06s |
 
 Here we can clearly see the problem: The old JIT would try tracing the
-auto-generated code by the template again and again, but would never produce a
-useful trace, wasting lots of time in the process. The new JIT still traces a
+auto-generated templating code again and again, but would never actually produce
+any machine code, wasting lots of time in the process. The new JIT still traces a
 few times uselessly, but then eventually converges and stops, emitting machine
 code for all the paths through the auto-generated Python code.

@@ -235,6 +264,9 @@ have such mechanisms as well).
 In this post we've described a performance cliff in PyPy's JIT, that of really
 big auto-generated functions which hit the trace limit without inlining, that we
 still want to generate machine code for. We achieve this by chunking up the
-trace into several smaller cases, which we added piece by piece. The work is a
-tiny bit experimental still, but we will release it as part of the upcoming 3.8
-beta release, to get some more experience with it.
+trace into several smaller traces, which we compile piece by piece. The work
+described in this post is still a tiny bit experimental, but we will release it as
+part of the upcoming 3.8 beta release, to get some more experience with it.
+Please grab a [3.8 release
+candidate](https://mail.python.org/pipermail/pypy-dev/2021-September/016214.html),
+try it out and let us know your observations, good and bad!
