Commit 69789d5

Merge pull request #55 from pypy/pypy-testing
a blog post draft on testing PyPy
2 parents 6fc0935 + b4f3100 commit 69789d5

1 file changed

+320
-0
lines changed

posts/2022/04/how-is-pypy-tested.rst

.. title: How is PyPy Tested?
.. slug: how-is-pypy-tested
.. date: 2022-04-02 15:00:00 UTC
.. tags:
.. category:
.. link:
.. description:
.. type: rest
.. author: Carl Friedrich Bolz-Tereick

===================
How is PyPy Tested?
===================

In this post I want to give an overview of how the PyPy project approaches and
thinks about testing. PyPy takes testing quite seriously and has done so from the
start of the project. Here I want to present the different styles of
tests that PyPy has, when we use them and how I think about them.


Background
============

To make the blog post self-contained, I am going to start with a small overview
of PyPy's architecture. If you already know what PyPy is and how it works,
you can skip this section.

PyPy means "Python in Python". It is an alternative implementation of the Python
language. Usually, when we speak of "Python", we can mean two different things.
On the one hand it means "Python as an abstract programming language". On the
other hand, the main implementation of that language is also often called
"Python". To distinguish the two more clearly, the implementation is often also
called "CPython", because it is an interpreter implemented in C code.

Now we can make the statement "PyPy is Python in Python" more precise: PyPy is
an interpreter for Python 3.9, implemented in RPython. RPython ("Restricted
Python") is a subset of Python 2, which is statically typed (using type
inference, not type annotations) and can be compiled
to C code. That means we can take our Python 3.9 interpreter and compile it
into a C binary that can run Python 3.9 code. The final binary behaves pretty
similarly to CPython.

The main thing that makes PyPy interesting is that during the translation of our
interpreter to C, a number of components are automatically inserted into the
final binary. One component is a reasonably good garbage collector.

The more exciting component that is inserted into the binary is a just-in-time
compiler. The insertion of this component is not fully automatic; instead, it is
guided by a small number of annotations in the source code of the interpreter.
The effect of inserting this JIT compiler into the binary is that the resulting
binary can run Python code significantly faster than CPython, in many cases.
How this works is not important for the rest of the post. If you want to see an
example of concretely doing that to a small interpreter, you can look at this
video_.

.. _video: https://www.youtube.com/watch?v=fZj3uljJl_k


PyPy Testing History
=====================

A few historical notes on the PyPy project and its relationship to testing: The
PyPy project `was started in 2004`_. At the time when the project was started,
Extreme Programming and Agile Software Development were up and coming. On the
methodology side, PyPy was heavily influenced by these, and started using
Test-Driven Development and pair programming right from the start.

.. _`was started in 2004`: https://www.pypy.org/posts/2018/09/the-first-15-years-of-pypy-3412615975376972020.html

Also technologically, PyPy has been influential on testing in the Python world.
Originally, PyPy used the ``unittest`` testing framework, but pretty soon
the developers got frustrated with it. `Holger Krekel`_, one of the original
developers who started PyPy, started the pytest_ testing framework soon
afterwards.

.. _`Holger Krekel`: https://holgerkrekel.net/
.. _`pytest`: https://pytest.org/


Interpreter-Level Tests
=========================

So, how are tests for PyPy written, concretely? The tests for the interpreter
are split into two different kinds, which we call "interpreter-level tests" and
"application-level tests". The former are tests that can be used to test the
objects and functions that are used in the implementation of the Python
interpreter. Since the interpreter is written in RPython, a subset of Python 2,
those tests are also written in Python 2, using pytest. They tend to be more on
the unit test side of things. They are in files with the pattern ``test_*.py``.

Here is an example that tests the implementation of integers (very slightly
simplified):

.. code:: python

    class TestW_IntObject:
        ...

        def test_hash(self):
            w_x = W_IntObject(42)
            w_result = w_x.descr_hash(self.space)
            assert isinstance(w_result, W_IntObject)
            assert w_result.intval == 42


This test checks that if you take an object that represents integers in the
Python language (using the class ``W_IntObject``, a "wrapped integer object")
with the value 42, computing the hash of that object returns another instance of
the same class, also with the value 42.

These tests can be run on top of any Python 2 implementation, either CPython or
PyPy. We can then test and debug the internals of the PyPy interpreter using
familiar tools such as pytest and the Python debuggers. This works
because all the involved code, like the tests and the class ``W_IntObject``, is
just completely regular Python 2 code that behaves in the regular way when
run on top of a Python interpreter.

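To make the idea concrete, here is a minimal, self-contained sketch of an interpreter-level test. It is not PyPy's actual code: the ``W_IntObject`` below is a toy stand-in, and ``FakeSpace`` with its ``newint`` method is a hypothetical placeholder for the object space that the real tests receive as ``self.space``. The point is only that the wrapped object is an ordinary Python class that a test can instantiate and inspect directly.

```python
class FakeSpace(object):
    """Hypothetical stand-in for the interpreter's object space."""

    def newint(self, value):
        # Wrap a plain integer into an interpreter-level object.
        return W_IntObject(value)


class W_IntObject(object):
    """Toy 'wrapped integer object'; an assumption, not PyPy's real class."""

    def __init__(self, intval):
        self.intval = intval

    def descr_hash(self, space):
        # Toy rule: an integer hashes to itself, and the result is wrapped again.
        return space.newint(self.intval)


class TestW_IntObject(object):
    space = FakeSpace()

    def test_hash(self):
        w_x = W_IntObject(42)
        w_result = w_x.descr_hash(self.space)
        # The test can look inside the wrapped result, which an
        # application-level program could never do.
        assert isinstance(w_result, W_IntObject)
        assert w_result.intval == 42
```

Because nothing here depends on a translated binary, a test like this can be collected by pytest or stepped through in a debugger like any other Python code.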
In CPython, these tests don't really have an equivalent. They would correspond
to tests that are written in C and that can test the logic of all the C
functions of CPython that execute certain functionality, accessing the internals
of C structs in the process. `¹`_


Application-Level Tests
=========================

There is also a second class of tests for the interpreter. Those are tests that
don't run on the level of the implementation. Instead, they are executed *by*
the PyPy Python interpreter, thus running on the level of the applications run
by PyPy. Since the interpreter is running Python 3, the tests are also written
in Python 3. They are stored in files with the pattern ``apptest_*.py`` and
look like "regular" Python 3 tests. `²`_

Here's an example of how you could write a test equivalent to the one above:

.. code:: python

    def test_hash():
        assert hash(42) == 42

This style of test looks more "natural" and is the preferred one in cases where
the test does not need to access the internals of the logic or the objects of
the interpreter.

Application-level tests can be run in two different ways. On the one hand, we
can simply run them on CPython 3. This is very useful! Since we want PyPy to
behave like CPython, running the tests that we write on CPython is useful to
make sure that the tests themselves aren't wrong.

On the other hand, the main way to run these tests is on top of PyPy, itself
running on top of a Python 2 implementation. This makes it possible to run the
test without first bootstrapping PyPy to C. Since bootstrapping to C is a
relatively slow operation (it can take up to an hour), it is crucially important
to be able to run tests without bootstrapping first. It also again makes it
possible to debug crashes in the interpreter using the regular Python 2
debugger. Of course, running tests in this way is unfortunately itself not super
fast, given that they run on a stack of two different interpreters.

Application-level tests correspond quite closely to CPython's test suite (which
uses the ``unittest`` framework). Of course, in CPython it is not possible to run
the test suite without building the CPython binary using a C compiler. `³`_

So when do we write application-level tests, and when interpreter-level tests?
Interpreter-level tests are necessary to test internal data structures that
touch data and logic that is not directly exposed to the Python language. If
that is not necessary, we try to write application-level tests. App-level tests
are however by their nature always more on the integration test side of things.
To be able to run the ``test_hash`` function above, many parts of PyPy need to
work correctly: the parser, the bytecode compiler, the bytecode interpreter, the
``hash`` builtin, calling the ``__hash__`` special method, etc.

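The chain of components behind a one-line test can be made visible by driving the stages by hand. The following is only an illustrative sketch using the standard ``ast.parse``, ``compile`` and ``exec`` functions of whichever Python 3 runs it; the ``Answer`` class and the ``"<example>"`` filename are made up for the example and have nothing to do with PyPy's internals.

```python
import ast

SOURCE = "result = hash(42)"

# Stage 1: the parser turns source text into a syntax tree.
tree = ast.parse(SOURCE)

# Stage 2: the bytecode compiler turns the tree into a code object.
code = compile(tree, "<example>", "exec")

# Stage 3: the bytecode interpreter executes the code object, which
# looks up the ``hash`` builtin and calls it.
namespace = {}
exec(code, namespace)
assert namespace["result"] == 42


# Stage 4: ``hash`` dispatches to the ``__hash__`` special method.
class Answer:
    def __hash__(self):
        return 42


assert hash(Answer()) == 42
```

If any one of these stages is broken, the application-level ``test_hash`` fails, even though the test only mentions ``hash``.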
This observation is also true for CPython! One could argue that CPython has no
unit tests at all, because in order to be able to even run the tests, most of
Python needs to be in working order already, so all the tests are really
implicitly integration tests.


The CPython Test Suite
========================

We also use the CPython test suite as a final check to see whether our
interpreter correctly implements all the features of the Python language. In
that sense it acts as some kind of compliance test suite that checks whether we
implement the language correctly. The test suite is not perfect for this.
Since it is written for CPython's purposes during its development, a
lot of the tests check really specific CPython implementation details. Examples
of these are tests that check that ``__del__`` is called immediately after
objects go out of scope (which only happens if you use reference counting as a
garbage collection strategy; PyPy uses a `different approach to garbage
collection`_). Other examples are checking
for exception error messages very explicitly. However, the CPython test suite
has gotten a lot better in these regards over time, by adding
``support.gc_collect()`` calls to fix the former problem, and by marking some
very specific tests with the ``@impl_detail`` decorator. Thanks to all the
CPython developers who have worked on this!

.. _`different approach to garbage collection`: https://www.pypy.org/posts/2013/10/incremental-garbage-collector-in-pypy-8956893523842234676.html

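As a rough sketch of the first kind of fix, here is a test that stays valid on implementations without reference counting. It uses ``gc_collect`` from CPython's ``test.support`` package when that package is installed; the fallback definition and the ``Resource`` class are assumptions added only so the example is self-contained.

```python
import gc
import weakref

try:
    # Helper from CPython's own test support package.
    from test.support import gc_collect
except ImportError:
    # Assumed fallback for installations that ship without the test package.
    def gc_collect():
        for _ in range(3):
            gc.collect()


class Resource:
    pass


def test_weakref_cleared_after_collection():
    obj = Resource()
    ref = weakref.ref(obj)
    del obj
    # With reference counting the object is already gone at this point.
    # With a tracing garbage collector it may still be alive, so the
    # test asks for a collection before checking.
    gc_collect()
    assert ref() is None
```

Without the ``gc_collect()`` call, the final assertion would encode a reference-counting implementation detail rather than language behaviour.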
In the process of re-implementing CPython's functionality and running CPython's
test suite, PyPy can often also be a good way to find bugs in CPython. While we
think about the corner cases of some Python feature we occasionally find
situations where CPython didn't get everything completely correct either, which
we then report back.


Testing for Performance Regressions
====================================

All the tests we described so far are checking *behaviour*. But one of PyPy's
important goals is to be a *fast* implementation, not "just" a correct one. Some
aspects of performance can be tested by regular unit tests, either application-
or interpreter-level. In order to check whether some performance shortcut is
taken in the interpreter, we sometimes can write tests that monkeypatch the slow
default implementation to always error. Then, if the fast path is taken
properly, that slow default implementation is never reached.

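The monkeypatching trick can be sketched in a few lines. Everything in this example is hypothetical: the ``Container`` class, its ``_contains_slow`` method and the fast-path condition are invented for illustration, and ``unittest.mock.patch.object`` stands in for whatever patching mechanism the real tests use.

```python
from unittest import mock


class Container:
    """Toy container with a fast path for integers (illustrative only)."""

    def __init__(self, items):
        self.items = list(items)

    def _contains_slow(self, value):
        # Generic fallback: compare against every element.
        for item in self.items:
            if item == value:
                return True
        return False

    def contains(self, value):
        if type(value) is int and all(type(i) is int for i in self.items):
            # Fast path: a set lookup instead of a linear scan.
            return value in set(self.items)
        return self._contains_slow(value)


def fail(*args, **kwargs):
    raise AssertionError("slow path must not be reached")


def test_int_lookup_uses_fast_path():
    c = Container([1, 2, 3])
    # Replace the slow implementation so that reaching it fails the test.
    with mock.patch.object(Container, "_contains_slow", fail):
        assert c.contains(2)
        assert not c.contains(7)
```

The test passes only as long as integer lookups never fall back to the slow method, so a change that silently disables the shortcut shows up as a failure.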
But we also have additional tests that test the correct interaction with the JIT
explicitly. For that, we have a special style of test that checks that the JIT
will produce the correct machine code for a small snippet of Python code. To
make this kind of test somewhat more robust, we don't check the machine code
directly, but instead the architecture-independent `intermediate
representation`_ from which the JIT produces machine code.

.. _`intermediate representation`: https://www.pypy.org/posts/2018/09/the-first-15-years-of-pypy-3412615975376972020.html

As an example, here is a small test checking that loading the attribute of a
constant global instance can be completely constant folded away:

.. code:: python

    def test_load_attr(self):
        src = '''
            class A(object):
                pass
            a = A()
            a.x = 1
            def main(n):
                i = 0
                while i < n:
                    i = i + a.x
                return i
        '''
        log = self.run(src, [1000])
        assert log.result == 1000
        loop, = log.loops_by_filename(self.filepath)
        assert loop.match("""
            i9 = int_lt(i5, i6)
            guard_true(i9, descr=...)
            guard_not_invalidated(descr=...)
            i10 = int_add(i5, 1)
            --TICK--
            jump(..., descr=...)
        """)

The string passed to the ``loop.match`` function is a string representation of
the intermediate representation code that is generated for the ``while`` loop in
the ``main`` function given in the source. The important part of that
intermediate representation is that the ``i = i + a.x`` addition is optimized
into an ``int_add(x, 1)`` operation. The second argument for the addition is the
constant ``1``, because the JIT noted that the global ``a`` is a constant, and
the attribute ``x`` of that instance is always ``1``. The test thus checks that
this optimization still works.

Those tests are again more on the unit test side of things (and can thus
unfortunately be a bit brittle sometimes and break). The integration test
equivalent for performance is the `PyPy Speed Center`_, which tracks the
performance of micro- and macro-benchmarks over time and lets us see when big
performance regressions are happening. The speed center is not really an
automatic test and does not produce pass/fail outcomes. Instead, it requires
human judgement and intervention in order to interpret the performance changes.
Having a real pass/fail mechanism is something that would be `great to have`_
but is probably `quite tricky in practice`_.

.. _`great to have`: https://twitter.com/glyph/status/1495122754286198790
.. _`quite tricky in practice`: https://arxiv.org/abs/1602.00602

.. _`PyPy Speed Center`: https://speed.pypy.org/


Conclusion
===========

This concludes my overview of some of the different styles of tests that we use
to develop the PyPy Python interpreter.

There is a whole other set of tests for the development of the RPython language,
the garbage collectors it provides as well as the code that does the automatic
JIT insertion. Maybe I'll cover these in a future post.


Footnotes
-----------

.. _`¹`:

¹ CPython has the ``_testcapimodule.c`` and related modules, which are used to
unit-test the C-API. However, these are still driven from Python tests using
the ``unittest`` framework and wouldn't run without the Python interpreter
already working.


.. _`²`:

² There is also a different, deprecated way to write these tests, by putting
them in the ``test_*.py`` files that interpreter-level tests are using and
then having a test class with the pattern ``class AppTest*``. We haven't
converted all of them to the new style yet, even though the old style is
quite weird: since the ``test_*.py`` files are themselves parsed by
Python 2, the test methods in ``AppTest*`` classes need to be written in the
subset of Python 3 syntax that is also valid Python 2 syntax, leading to a lot
of confusion.

.. _`³`:

³ Nit-picky side-note: `C interpreters`_ `are a thing`_! They are not that
widely used in practice, though, or only in very specific situations.

.. _`C interpreters`: https://root.cern.ch/root/html534/guides/users-guide/CINT.html
.. _`are a thing`: https://www.youtube.com/watch?v=yyDD_KRdQQU
