Commit cdc2f07

a blog post draft on testing PyPy
1 parent 8e09d8c commit cdc2f07

posts/2022/03/how-is-pypy-tested.rst

.. title: How is PyPy Tested?
.. slug: how-is-pypy-tested
.. date: 2022-03-02 12:00:00 UTC
.. tags:
.. category:
.. link:
.. description:
.. type: rest
.. author: Carl Friedrich Bolz-Tereick

===================
How is PyPy Tested?
===================

In this post I want to give an overview of how the PyPy project does and thinks
about testing. PyPy takes testing quite seriously and has done so from the
start of the project. Here I want to present the different styles of tests that
PyPy has, when we use them, and how I think about them.


Background
============

To make the blog post self-contained, I am going to start with a short overview
of PyPy's architecture. If you already know what PyPy is and how it works,
you can skip this section.

PyPy means "Python in Python". It is an alternative implementation of the Python
language. Usually, when we speak of "Python", we can mean two different things.
On the one hand it means "Python as an abstract programming language". On the
other hand, the main implementation of that language is also often called
"Python". To more clearly distinguish the two, the implementation is often also
called "CPython", because it is an interpreter implemented in C code.

Now we can make the statement "PyPy is Python in Python" more precise: PyPy is
an interpreter for Python 3.9, implemented in RPython. RPython ("Restricted
Python") is a subset of Python 2, which is statically typed and can be compiled
to C code. That means we can take our Python 3.9 interpreter, and compile it
into a C binary that can run Python 3.9 code. The final binary behaves pretty
similarly to CPython.

The main thing that makes PyPy interesting is that during the translation of our
interpreter to C, a number of components are automatically inserted into the
final binary. One component is a reasonably good garbage collector.

The more exciting component that is inserted into the binary is a just-in-time
compiler. The insertion of this component is not fully automatic; instead it is
guided by a small number of annotations in the source code of the interpreter.
The effect of inserting this JIT compiler into the binary is that the resulting
binary can run Python code significantly faster than CPython, in many cases.
How this works is not important for the rest of the post. If you want to see a
concrete example of doing that to a small interpreter, you can look at this
video_.

.. _video: https://www.youtube.com/watch?v=fZj3uljJl_k


PyPy Testing History
=====================

A few historical notes on the PyPy project and its relationship to testing: The
PyPy project `was started in 2004`_. At the time when the project was started,
Extreme Programming and Agile Software Development were up and coming. On the
methodology side, PyPy was heavily influenced by these, and started using
Test-Driven Development and pair programming right from the start.

.. _`was started in 2004`: https://www.pypy.org/posts/2018/09/the-first-15-years-of-pypy-3412615975376972020.html

On the technology side, PyPy has also been influential on testing in the Python
world. Originally, PyPy used the ``unittest`` testing framework, but pretty soon
the developers got frustrated with it. `Holger Krekel`_, one of the original
developers who started PyPy, started the pytest_ testing framework soon
afterwards.

.. _`Holger Krekel`: https://holgerkrekel.net/
.. _`pytest`: https://pytest.org/


Interpreter-Level Tests
=========================

So, how are tests for PyPy written, concretely? The tests for the interpreter
are split into two different kinds, which we call "interpreter-level tests" and
"application-level tests". The former are tests that can be used to test the
objects and functions that are used in the implementation of the Python
interpreter. Since the interpreter is written in RPython, a subset of Python 2,
those tests are also written in Python 2, using pytest. They tend to be more on
the unit test side of things. They are in files with the pattern ``test_*.py``.

Here is an example that tests the implementation of integers (very slightly
simplified)::

    class TestW_IntObject:
        ...

        def test_hash(self):
            w_x = W_IntObject(42)
            w_result = w_x.descr_hash(self.space)
            assert isinstance(w_result, W_IntObject)
            assert w_result.intval == 42


This test checks that if you take an object that represents integers in the
Python language (using the class ``W_IntObject``, a "wrapped integer object")
with the value 42, computing the hash of that object returns another instance of
the same class, also with the value 42.

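To make the shape of such a test concrete without a PyPy checkout, here is a
minimal, self-contained sketch. The ``W_IntObject`` below is a toy stand-in and
``FakeSpace`` is a hypothetical placeholder for the object space; PyPy's real
classes are much richer, so treat this as an illustration of the testing style
only.

```python
# Toy stand-ins for illustration only; these are NOT PyPy's real classes.


class FakeSpace(object):
    """Hypothetical placeholder for the object space handed to methods."""


class W_IntObject(object):
    """Simplified 'wrapped integer': keeps the raw value in ``intval``."""

    def __init__(self, intval):
        self.intval = intval

    def descr_hash(self, space):
        # Assumption for this sketch: an integer simply hashes to itself,
        # and the result is again a wrapped integer.
        return W_IntObject(self.intval)


class TestW_IntObject(object):
    space = FakeSpace()

    def test_hash(self):
        w_x = W_IntObject(42)
        w_result = w_x.descr_hash(self.space)
        # The test inspects implementation-level objects directly.
        assert isinstance(w_result, W_IntObject)
        assert w_result.intval == 42


# pytest would collect the class on its own; calling it by hand works too.
TestW_IntObject().test_hash()
```

The point of the style is visible even in the toy version: the assertions look
at the wrapper class and its ``intval`` field, which ordinary Python code
running inside the interpreter could not observe.
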
These tests can be run on top of any Python 2 implementation, either CPython or
PyPy. We can then test and debug the internals of the PyPy interpreter using
familiar tools such as pytest and the Python debuggers.

In CPython, these tests don't really have an equivalent. They would correspond
to tests that are written in C and that can test the logic of all the C
functions of CPython that execute certain functionality, accessing the internals
of C structs in the process.


Application-Level Tests
=========================

There is also a second class of tests for the interpreter. Those are tests that
don't run on the level of the implementation. Instead, they are executed *by*
the PyPy Python interpreter, thus running on the level of the applications run
by PyPy. Since the interpreter is running Python 3, the tests are also written
in Python 3. They are stored in files with the pattern ``apptest_*.py`` [#]_ and
look like "regular" Python 3 tests.

.. [#] There is also a different, deprecated way to write these tests: putting
   them in the ``test_*.py`` files that interpreter-level tests are using and
   then having a test class with the pattern ``class AppTest*``. We haven't
   converted all of them to the new style yet, even though the old style is
   quite weird: since the ``test_*.py`` files are themselves parsed by
   Python 2, the test methods in ``AppTest*`` classes need to be written in the
   subset of Python 3 that is also valid Python 2 syntax, leading to a lot of
   confusion.

Here's an example of how you could write a test equivalent to the one above::

    def test_hash():
        assert hash(42) == 42

This style of test looks more "natural" and is the preferred one in cases where
the test does not need to access the internals of the logic or the objects of
the interpreter.

Application level tests can be run in two different ways. On the one hand, we
can simply run them on CPython 3. This is very useful! Since we want PyPy to
behave like CPython, running the tests that we write on CPython is useful to
make sure that the tests themselves aren't wrong.

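As a sketch of what checking a test against CPython 3 can look like, here is a
hypothetical file in the ``apptest_*.py`` style. The file name
``apptest_inthash.py`` and the individual test names are made up for this
example. The tests only rely on behaviour that Python defines for numeric
hashes, so they should pass on any conforming implementation.

```python
# Hypothetical contents of a file such as apptest_inthash.py.
# Plain functions and plain asserts; nothing here is PyPy-specific.


def test_hash_small_int():
    assert hash(42) == 42


def test_hash_equal_numbers():
    # Numbers that compare equal must hash equally, whatever their type.
    assert hash(42) == hash(42.0) == hash(complex(42, 0))


def test_hash_bool():
    # bool is a subclass of int, so True hashes like 1.
    assert hash(True) == hash(1)


# pytest would discover these functions; calling them by hand works too.
test_hash_small_int()
test_hash_equal_numbers()
test_hash_bool()
```

Running ``pytest apptest_inthash.py`` with a CPython 3 interpreter is enough to
validate the tests themselves. How the same file is executed on top of PyPy
depends on PyPy's own test tooling and is not shown here.
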
On the other hand, the main way to run these tests is on top of PyPy, itself
running on top of a Python 2 implementation. This makes it possible to run the
test without first bootstrapping PyPy to C. Since bootstrapping to C is a
relatively slow operation (it can take up to an hour), it is crucially important
to be able to run tests without bootstrapping first. It also again makes it
possible to debug crashes in the interpreter using the regular Python 2
debugger. Of course, running tests in this way is unfortunately itself not super
fast, given that they run on a stack of two different interpreters.

Application-level tests correspond quite closely to CPython's test suite (which
is using the unittest framework). Of course in CPython it is not possible to run
the test suite without building the CPython binary using a C compiler. [#]_

.. [#] Nit-picky side-note: `C interpreters`_ `are a thing`_! They are just not
   that widely used in practice, or only in very specific situations.

.. _`C interpreters`: https://root.cern.ch/root/html534/guides/users-guide/CINT.html
.. _`are a thing`: https://www.youtube.com/watch?v=yyDD_KRdQQU

So when do we write application-level tests, and when interpreter-level tests?
Interpreter-level tests are necessary to test internal data structures that
touch data and logic that is not directly exposed to the Python language. If
that is not necessary, we try to write application-level tests. App-level tests
are however by their nature always more on the integration test side of things.
To be able to run the ``test_hash`` function above, many parts of PyPy need to
work correctly: the parser, the bytecode compiler, the bytecode interpreter, the
``hash`` builtin, calling the ``__hash__`` special method, and so on.

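The stages listed above can be made visible with standard-library tools. The
following sketch walks a one-line test through them by hand on whatever Python 3
runs it; the ``Answer`` class and the ``"<apptest>"`` file name are assumptions
made for this illustration, and the code uses the host interpreter's pipeline,
not PyPy's internals.

```python
import ast
import dis

source = "assert hash(42) == 42"

# Stage 1: the parser turns source text into a syntax tree.
tree = ast.parse(source, filename="<apptest>")

# Stage 2: the bytecode compiler turns the tree into a code object.
code = compile(tree, "<apptest>", "exec")

# The names of the bytecode instructions the interpreter will execute.
opnames = [instr.opname for instr in dis.get_instructions(code)]

# Stage 3: the bytecode interpreter runs the code, which looks up and
# calls the ``hash`` builtin along the way.
exec(code, {})


class Answer(object):
    """Hypothetical class showing the ``__hash__`` special-method path."""

    def __hash__(self):
        return 42


# Stage 4: ``hash`` dispatches to the ``__hash__`` special method.
assert hash(Answer()) == 42
```

A bug in any one of these stages would make the one-line test fail, which is
why such tests sit on the integration side.
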
This observation is also true for CPython! One could argue that CPython has no
unit tests at all, because in order to be able to even run the tests, most of
Python needs to be in working order already, so all the tests are really
implicitly integration tests.


The CPython Test Suite
========================

We also use the CPython test suite as a final check to see whether our
interpreter correctly implements all the features of the Python language. In
that sense it acts as some kind of compliance test suite that checks whether we
implement the language correctly. The test suite is not perfect for this
purpose. Since it is written for CPython's purposes during its development, a
lot of the tests check really specific CPython implementation details. Examples
of these are tests that check that ``__del__`` is called immediately after
objects go out of scope (which only happens if you use reference counting as a
garbage collection strategy, which PyPy doesn't do). Other examples are tests
that check exception error messages very explicitly. However, the CPython test
suite has gotten a lot better in these regards over time, by adding
``support.gc_collect()`` calls to fix the former problem, and by marking some
very specific tests with the ``@impl_detail`` decorator. Thanks to all the
CPython developers who have worked on this!

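The reference-counting issue can be illustrated with a small sketch. It uses the
standard ``gc.collect()`` function as a stand-in for the
``support.gc_collect()`` helper mentioned above, and the ``Resource`` class and
test name are invented for this example.

```python
import gc
import weakref


class Resource(object):
    """Hypothetical object whose lifetime the test observes."""


def test_object_is_eventually_freed():
    obj = Resource()
    ref = weakref.ref(obj)
    del obj
    # With reference counting the object is gone right after ``del``.
    # With a tracing collector it may linger, so the test asks for a
    # collection explicitly before looking at the weak reference.
    gc.collect()
    assert ref() is None


test_object_is_eventually_freed()
```

Without the explicit collection, the final assertion would encode a CPython
implementation detail rather than language behaviour.
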
In the process of re-implementing CPython's functionality and running CPython's
test suite, PyPy can often also be a good way to find bugs in CPython. While we
think about the corner cases of some Python feature we occasionally find
situations where CPython didn't get everything completely correct either, which
we then report back.


Testing Performance
=====================

All the tests we described so far are checking *behaviour*. But one of PyPy's
important goals is to be a *fast* implementation, not "just" a correct one. Some
aspects of performance can be tested by regular unit tests, either application-
or interpreter-level. In order to check whether some performance shortcut is
taken in the interpreter, we sometimes can write tests that monkeypatch the slow
default implementation to always error. Then, if the fast path is taken
properly, that slow default implementation is never reached.

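Here is a minimal sketch of that monkeypatching idea, using
``unittest.mock.patch.object`` from the standard library. The ``Arith`` class
with its ``add`` and ``generic_add`` methods is entirely hypothetical and only
mimics the fast-path/slow-path split; it is not taken from PyPy's source.

```python
from unittest import mock


class Arith(object):
    """Hypothetical interpreter helper with a fast path and a slow path."""

    def add(self, a, b):
        # Fast path: both operands are plain ints.
        if type(a) is int and type(b) is int:
            return a + b
        return self.generic_add(a, b)

    def generic_add(self, a, b):
        # Stands in for an expensive, fully general implementation.
        return a + b


def test_int_add_takes_fast_path():
    # Replace the slow path with something that always errors. If the
    # fast path works, the replacement is never called.
    with mock.patch.object(
        Arith, "generic_add", side_effect=AssertionError("slow path reached")
    ):
        assert Arith().add(1, 2) == 3


def test_float_add_uses_slow_path():
    # Sanity check: without the patch, other types still work.
    assert Arith().add(1.5, 2) == 3.5


test_int_add_takes_fast_path()
test_float_add_uses_slow_path()
```

The test says nothing about how much faster the shortcut is; it only pins down
that the shortcut is actually used.
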
But we also have additional tests that test the correct interaction with the JIT
explicitly. For that, we have a special style of test that checks that the JIT
will produce the correct machine code for a small snippet of Python code. To
make this kind of test somewhat more robust, we don't check the machine code
directly, but instead the architecture-independent
`intermediate representation`_ from which the JIT produces machine code.

.. _`intermediate representation`: https://www.pypy.org/posts/2018/09/the-first-15-years-of-pypy-3412615975376972020.html

As an example, here is a small test that checks that loading the attribute of a
constant global instance can be completely constant folded away::

    def test_load_attr(self):
        src = '''
            class A(object):
                pass
            a = A()
            a.x = 1
            def main(n):
                i = 0
                while i < n:
                    i = i + a.x
                return i
        '''
        log = self.run(src, [1000])
        assert log.result == 1000
        loop, = log.loops_by_filename(self.filepath)
        assert loop.match("""
            i9 = int_lt(i5, i6)
            guard_true(i9, descr=...)
            guard_not_invalidated(descr=...)
            i10 = int_add(i5, 1)
            --TICK--
            jump(..., descr=...)
        """)

The string passed to the ``loop.match`` function is a string representation of
the intermediate representation code that is generated for the ``while`` loop in
the ``main`` function given in the source. The important part of that
intermediate representation is that the ``i = i + a.x`` addition is optimized
into an ``int_add(i5, 1)`` operation. The second argument for the addition is
the constant ``1``, because the JIT noted that the global ``a`` is a constant,
and the attribute ``x`` of that instance is always ``1``. The test thus checks
that this optimization still works.

Those tests are again more on the unit test side of things (and can thus
unfortunately be a bit brittle sometimes and break). The integration test
equivalent for performance is the `PyPy Speed Center`_ which tracks the
performance of micro- and macro-benchmarks over time and lets us see when big
performance regressions are happening. The speed center is not really an
automatic test and does not produce pass/fail outcomes. Instead, it requires
human judgement and intervention in order to interpret the performance changes.
Having a real pass/fail mechanism is something that would be `great to have`_
but is probably `quite tricky in practice`_.

.. _`great to have`: https://twitter.com/glyph/status/1495122754286198790
.. _`quite tricky in practice`: https://arxiv.org/abs/1602.00602

.. _`PyPy Speed Center`: https://speed.pypy.org/


Conclusion
===========

This concludes my overview of some of the different styles of tests that we use
to develop the PyPy Python interpreter.

There is a whole other set of tests for the development of the RPython language,
the garbage collectors it provides, and the code that does the automatic
JIT insertion. Maybe I'll cover these in a future post.
