|
| 1 | +.. title: How is PyPy Tested? |
| 2 | +.. slug: how-is-pypy-tested |
| 3 | +.. date: 2022-03-02 12:00:00 UTC |
| 4 | +.. tags: |
| 5 | +.. category: |
| 6 | +.. link: |
| 7 | +.. description: |
| 8 | +.. type: rest |
| 9 | +.. author: Carl Friedrich Bolz-Tereick |
| 10 | +
|
| 11 | +=================== |
| 12 | +How is PyPy Tested? |
| 13 | +=================== |
| 14 | + |
| 15 | +In this post I want to give an overview of how the PyPy project does and thinks |
| 16 | +about testing. PyPy takes testing quite seriously and has done so from the |
| 17 | +start of the project. In this post I want to present the different styles of |
| 18 | +tests that PyPy has, when we use them, and how I think about them. |
| 19 | + |
| 20 | + |
| 21 | +Background |
| 22 | +============ |
| 23 | + |
| 24 | +To make the blog post self-contained, I am going to start with a small overview |
| 25 | +of PyPy's architecture. If you already know what PyPy is and how it works, |
| 26 | +you can skip this section. |
| 27 | + |
| 28 | +PyPy means "Python in Python". It is an alternative implementation of the Python |
| 29 | +language. Usually, when we speak of "Python", we can mean two different things. |
| 30 | +On the one hand it means "Python as an abstract programming language". On the |
| 31 | +other hand, the main implementation of that language is also often called |
| 32 | +"Python". To more clearly distinguish the two, the implementation is often also |
| 33 | +called "CPython", because it is an interpreter implemented in C code. |
| 34 | + |
| 35 | +Now we can make the statement "PyPy is Python in Python" more precise: PyPy is |
| 36 | +an interpreter for Python 3.9, implemented in RPython. RPython ("Restricted |
| 37 | +Python") is a subset of Python 2, which is statically typed and can be compiled |
| 38 | +to C code. That means we can take our Python 3.9 interpreter, and compile it |
| 39 | +into a C binary that can run Python 3.9 code. The final binary behaves pretty |
| 40 | +similarly to CPython. |
| 41 | + |
| 42 | +The main thing that makes PyPy interesting is that during the translation of our |
| 43 | +interpreter to C, a number of components are automatically inserted into the |
| 44 | +final binary. One component is a reasonably good garbage collector. |
| 45 | + |
| 46 | +The more exciting component that is inserted into the binary is a just-in-time |
| 47 | +compiler. The insertion of this component is not fully automatic; instead, it is |
| 48 | +guided by a small number of annotations in the source code of the interpreter. |
| 49 | +The effect of inserting this JIT compiler into the binary is that the resulting |
| 50 | +binary can run Python code significantly faster than CPython, in many cases. |
| 51 | +How this works is not important for the rest of the post. If you want to see a |
| 52 | +concrete example of doing that to a small interpreter, you can look at this |
| 53 | +video_. |
| 54 | + |
| 55 | +.. _video: https://www.youtube.com/watch?v=fZj3uljJl_k |
| 56 | + |
| 57 | + |
| 58 | +PyPy Testing History |
| 59 | +===================== |
| 60 | + |
| 61 | +A few historical notes on the PyPy project and its relationship to testing: The |
| 62 | +PyPy project `was started in 2004`_. At the time when the project was started, |
| 63 | +Extreme Programming and Agile Software Development were up and coming. On the |
| 64 | +methodology side, PyPy was heavily influenced by these, and started using |
| 65 | +Test-Driven Development and pair programming right from the start. |
| 66 | + |
| 67 | +.. _`was started in 2004`: https://www.pypy.org/posts/2018/09/the-first-15-years-of-pypy-3412615975376972020.html |
| 68 | + |
| 69 | +On the technological side, too, PyPy has been influential on testing in the |
| 70 | +Python world. Originally, PyPy used the ``unittest`` testing framework, but pretty soon |
| 71 | +the developers got frustrated with it. `Holger Krekel`_, one of the original |
| 72 | +developers who started PyPy, started the pytest_ testing framework soon |
| 73 | +afterwards. |
| 74 | + |
| 75 | +.. _`Holger Krekel`: https://holgerkrekel.net/ |
| 76 | +.. _`pytest`: https://pytest.org/ |
| 77 | + |
| 78 | + |
| 79 | +Interpreter-Level Tests |
| 80 | +========================= |
| 81 | + |
| 82 | +So, how are tests for PyPy written, concretely? The tests for the interpreter |
| 83 | +are split into two different kinds, which we call "interpreter level tests" and |
| 84 | +"application level tests". The former are tests that can be used to test the |
| 85 | +objects and functions that are used in the implementation of the Python |
| 86 | +interpreter. Since the interpreter is written in Python 2, those tests are also |
| 87 | +written in Python 2, using pytest. They tend to be more on the unit test side of |
| 88 | +things. They are in files with the pattern ``test_*.py``. |
| 89 | + |
| 90 | +Here is an example that tests the implementation of integers (very slightly |
| 91 | +simplified):: |
| 92 | + |
| 93 | + class TestW_IntObject: |
| 94 | + ... |
| 95 | + |
| 96 | + def test_hash(self): |
| 97 | + w_x = W_IntObject(42) |
| 98 | + w_result = w_x.descr_hash(self.space) |
| 99 | + assert isinstance(w_result, W_IntObject) |
| 100 | + assert w_result.intval == 42 |
| 101 | + |
| 102 | + |
| 103 | +This test checks that if you take an object that represents integers in the |
| 104 | +Python language (using the class ``W_IntObject``, a "wrapped integer object") |
| 105 | +with the value 42, computing the hash of that object returns another instance of |
| 106 | +the same class, also with the value 42. |
| 107 | + |
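To make the test above easier to follow in isolation, here is a toy model of the kind of object it exercises. This is only a sketch: ``ToySpace`` and this simplified ``W_IntObject`` are hypothetical stand-ins written for illustration, not PyPy's actual classes. Only the names ``W_IntObject``, ``intval`` and ``descr_hash`` are taken from the test above, and the model assumes that a small integer hashes to itself.

```python
class ToySpace(object):
    """Hypothetical stand-in for the object space passed to descr_* methods."""

    def newint(self, value):
        # Wrap a plain integer into an interpreter-level integer object.
        return W_IntObject(value)


class W_IntObject(object):
    """Toy 'wrapped integer object'; not PyPy's real implementation."""

    def __init__(self, intval):
        self.intval = intval

    def descr_hash(self, space):
        # Assumption for this sketch: a small integer hashes to itself.
        return space.newint(self.intval)


def check_hash():
    # Mirrors the structure of test_hash above, against the toy classes.
    space = ToySpace()
    w_x = W_IntObject(42)
    w_result = w_x.descr_hash(space)
    assert isinstance(w_result, W_IntObject)
    return w_result.intval
```

The point of the sketch is that the test talks to the implementation objects directly, without going through a parser or a bytecode interpreter.
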
| 108 | +These tests can be run on top of any Python 2 implementation, either CPython or |
| 109 | +PyPy. We can then test and debug the internals of the PyPy interpreter using |
| 110 | +familiar tools such as pytest and the Python debuggers. |
| 111 | + |
| 112 | +In CPython, these tests don't really have an equivalent. They would correspond |
| 113 | +to tests that are written in C and that can test the logic of all the C |
| 114 | +functions of CPython that execute certain functionality, accessing the internals |
| 115 | +of C structs in the process. |
| 116 | + |
| 117 | + |
| 118 | +Application-Level Tests |
| 119 | +========================= |
| 120 | + |
| 121 | +There is also a second class of tests for the interpreter. Those are tests that |
| 122 | +don't run on the level of the implementation. Instead, they are executed *by* |
| 123 | +the PyPy Python interpreter, thus running on the level of the applications run |
| 124 | +by PyPy. Since the interpreter is running Python 3, the tests are also written |
| 125 | +in Python 3. They are stored in files with the pattern ``apptest_*.py`` [#]_ and |
| 126 | +look like "regular" Python 3 tests. |
| 127 | + |
| 128 | +.. [#] There is also a deprecated different way to write these tests, by putting |
| 129 | + them in the ``test_*.py`` files that interpreter level tests are using and |
| 130 | + then having a test class with the pattern ``class AppTest*``. We haven't |
| 131 | + converted all of them to the new style yet, even though the old style is |
| 132 | + quite weird: since the ``test_*.py`` files are themselves parsed by |
| 133 | + Python 2, the test methods in ``AppTest*`` classes need to be written in the |
| 134 | + subset of Python 3 that is also valid Python 2 syntax, leading to a lot of |
| 135 | + confusion. |
| 136 | +
|
| 137 | +Here's an example of how you could write a test equivalent to the one above:: |
| 138 | + |
| 139 | + def test_hash(): |
| 140 | + assert hash(42) == 42 |
| 141 | + |
| 142 | +This style of test looks more "natural" and is the preferred one in cases where |
| 143 | +the test does not need to access the internals of the logic or the objects of |
| 144 | +the interpreter. |
| 145 | + |
| 146 | +Application level tests can be run in two different ways. On the one hand, we |
| 147 | +can simply run them on CPython 3. This is very useful! Since we want PyPy to |
| 148 | +behave like CPython, running the tests that we write on CPython is useful to |
| 149 | +make sure that the tests themselves aren't wrong. |
| 150 | + |
| 151 | +On the other hand, the main way to run these tests is on top of PyPy, itself |
| 152 | +running on top of a Python 2 implementation. This makes it possible to run the |
| 153 | +test without first bootstrapping PyPy to C. Since bootstrapping to C is a |
| 154 | +relatively slow operation (it can take up to an hour), it is crucially important to |
| 155 | +be able to run tests without bootstrapping first. It also again makes it |
| 156 | +possible to debug crashes in the interpreter using the regular Python 2 |
| 157 | +debugger. Of course running tests in this way is unfortunately itself not super |
| 158 | +fast, given that they run on a stack of two different interpreters. |
| 159 | + |
| 160 | +Application-level tests correspond quite closely to CPython's test suite (which |
| 161 | +uses the unittest framework). Of course in CPython it is not possible to run |
| 162 | +the test suite without building the CPython binary using a C compiler. [#]_ |
| 163 | + |
| 164 | +.. [#] Nit-picky side-note: `C interpreters`_ `are a thing`_! But not that |
| 165 | + widely used in practice, or only in very specific situations. |
| 166 | +
|
| 167 | +.. _`C interpreters`: https://root.cern.ch/root/html534/guides/users-guide/CINT.html |
| 168 | +.. _`are a thing`: https://www.youtube.com/watch?v=yyDD_KRdQQU |
| 169 | + |
| 170 | +So when do we write application-level tests, and when interpreter-level tests? |
| 171 | +Interpreter-level tests are necessary to test internal data structures that |
| 172 | +touch data and logic that is not directly exposed to the Python language. If |
| 173 | +that is not necessary, we try to write application-level tests. App-level tests |
| 174 | +are however by their nature always more on the integration test side of things. |
| 175 | +To be able to run the ``test_hash`` function above, many parts of PyPy need to |
| 176 | +work correctly: the parser, the bytecode compiler, the bytecode interpreter, the |
| 177 | +``hash`` builtin, calling the ``__hash__`` special method, etc. |
| 178 | + |
| 179 | +This observation is also true for CPython! One could argue that CPython has no |
| 180 | +unit tests at all, because in order to be able to even run the tests, most of |
| 181 | +Python needs to be in working order already, so all the tests are really |
| 182 | +implicitly integration tests. |
| 183 | + |
| 184 | + |
| 185 | +The CPython Test Suite |
| 186 | +======================== |
| 187 | + |
| 188 | +We also use the CPython test suite as a final check to see whether our |
| 189 | +interpreter correctly implements all the features of the Python language. In |
| 190 | +that sense it acts as some kind of compliance test suite that checks whether we |
| 191 | +implement the language correctly. The test suite is not perfect for this |
| 192 | +purpose. Since it is written for CPython's purposes during its development, a |
| 193 | +lot of the tests check really specific CPython implementation details. Examples |
| 194 | +of these are tests that check that ``__del__`` is called immediately after |
| 195 | +objects go out of scope (which only happens if you use reference counting as a |
| 196 | +garbage collection strategy, which PyPy doesn't do). Other examples are checking |
| 197 | +for exception error messages very explicitly. However, the CPython test suite |
| 198 | +has gotten a lot better in these regards over time, by adding |
| 199 | +``support.gc_collect()`` calls to fix the former problem, and by marking some |
| 200 | +very specific tests with the ``@impl_detail`` decorator. Thanks to all the |
| 201 | +CPython developers who have worked on this! |
| 202 | + |
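The ``__del__`` issue can be illustrated with a minimal sketch. The ``Tracked`` class and the helper function are hypothetical names made up for this example, and the sketch calls the standard ``gc.collect()`` directly where the CPython test suite would use its ``support.gc_collect()`` helper. Without the explicit collection, the check would rely on reference counting running the finalizer immediately, which an implementation with a different garbage collection strategy does not promise.

```python
import gc


class Tracked(object):
    """Hypothetical class that records which instances were finalized."""

    finalized = []

    def __init__(self, name):
        self.name = name

    def __del__(self):
        Tracked.finalized.append(self.name)


def finalized_after_collect():
    obj = Tracked("a")
    del obj
    # With reference counting, __del__ has already run at this point.
    # Forcing a collection makes the check less dependent on the
    # garbage collection strategy of the implementation.
    gc.collect()
    return list(Tracked.finalized)
```

A test written this way checks that the finalizer eventually runs, without encoding *when* a particular implementation chooses to run it.
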
| 203 | +In the process of re-implementing CPython's functionality and running CPython's |
| 204 | +test suite, PyPy can often also be a good way to find bugs in CPython. While we |
| 205 | +think about the corner cases of some Python feature we occasionally find |
| 206 | +situations where CPython didn't get everything completely correct either, which |
| 207 | +we then report back. |
| 208 | + |
| 209 | + |
| 210 | +Testing Performance |
| 211 | +===================== |
| 212 | + |
| 213 | +All the tests we described so far are checking *behaviour*. But one of PyPy's |
| 214 | +important goals is to be a *fast* implementation, not "just" a correct one. Some |
| 215 | +aspects of performance can be tested by regular unit tests, either application- |
| 216 | +or interpreter-level. In order to check whether some performance shortcut is |
| 217 | +taken in the interpreter, we sometimes can write tests that monkeypatch the slow |
| 218 | +default implementation to always error. Then, if the fast path is taken |
| 219 | +properly, that slow default implementation is never reached. |
| 220 | + |
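The monkeypatching idea can be sketched as follows. ``ToyArithmetic`` and its method names are hypothetical and only stand in for some interpreter component with a fast path and a generic fallback; they are not taken from PyPy's source.

```python
import operator


class ToyArithmetic(object):
    """Hypothetical component with a fast path and a generic slow path."""

    def add(self, a, b):
        # Fast path: both operands are plain integers.
        if type(a) is int and type(b) is int:
            return a + b
        return self.generic_add(a, b)

    def generic_add(self, a, b):
        # Slow default implementation that handles every other case.
        return operator.add(a, b)


def add_with_patched_slow_path(a, b):
    arith = ToyArithmetic()

    def slow_path_must_not_run(a, b):
        raise AssertionError("slow path was reached")

    # Replace the slow default implementation so that it always errors.
    arith.generic_add = slow_path_must_not_run
    return arith.add(a, b)
```

Calling ``add_with_patched_slow_path`` with two integers succeeds only if the fast path is taken, whereas passing a float falls through to the patched slow path and raises ``AssertionError``.
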
| 221 | +But we also have additional tests that test the correct interaction with the JIT |
| 222 | +explicitly. For that, we have a special style of test that checks that the JIT |
| 223 | +will produce the correct machine code for a small snippet of Python code. To |
| 224 | +make this kind of test somewhat more robust, we don't check the machine code |
| 225 | +directly, but instead the architecture-independent `intermediate |
| 226 | +representation`_ from which the JIT produces machine code. |
| 227 | + |
| 228 | +.. _`intermediate representation`: https://www.pypy.org/posts/2018/09/the-first-15-years-of-pypy-3412615975376972020.html |
| 229 | + |
| 230 | +As an example, here is a small test that checks that loading the attribute of a constant |
| 231 | +global instance can be completely constant folded away:: |
| 232 | + |
| 233 | + def test_load_attr(self): |
| 234 | + src = ''' |
| 235 | + class A(object): |
| 236 | + pass |
| 237 | + a = A() |
| 238 | + a.x = 1 |
| 239 | + def main(n): |
| 240 | + i = 0 |
| 241 | + while i < n: |
| 242 | + i = i + a.x |
| 243 | + return i |
| 244 | + ''' |
| 245 | + log = self.run(src, [1000]) |
| 246 | + assert log.result == 1000 |
| 247 | + loop, = log.loops_by_filename(self.filepath) |
| 248 | + assert loop.match(""" |
| 249 | + i9 = int_lt(i5, i6) |
| 250 | + guard_true(i9, descr=...) |
| 251 | + guard_not_invalidated(descr=...) |
| 252 | + i10 = int_add(i5, 1) |
| 253 | + --TICK-- |
| 254 | + jump(..., descr=...) |
| 255 | + """) |
| 256 | + |
| 257 | +The string passed to the ``loop.match`` function is a string representation of |
| 258 | +the intermediate representation code that is generated for the ``while`` loop in |
| 259 | +the ``main`` function given in the source. The important part of that |
| 260 | +intermediate representation is that the ``i = i + a.x`` addition is optimized |
| 261 | +into an ``int_add(i5, 1)`` operation. The second argument for the addition is the |
| 262 | +constant ``1``, because the JIT noted that the global ``a`` is a constant, and |
| 263 | +the attribute ``x`` of that instance is always ``1``. The test thus checks that |
| 264 | +this optimization still works. |
| 265 | + |
| 266 | +Those tests are again more on the unit test side of things (and can thus |
| 267 | +unfortunately be a bit brittle sometimes and break). The integration test |
| 268 | +equivalent for performance is the `PyPy Speed Center`_ which tracks the |
| 269 | +performance of micro- and macro-benchmarks over time and lets us see when big |
| 270 | +performance regressions are happening. The speed center is not really an |
| 271 | +automatic test and does not produce pass/fail outcomes. Instead, it requires |
| 272 | +human judgement and intervention in order to interpret the performance changes. |
| 273 | +Having a real pass/fail mechanism is something that would be `great to have`_ |
| 274 | +but is probably `quite tricky in practice`_. |
| 275 | + |
| 276 | +.. _`great to have`: https://twitter.com/glyph/status/1495122754286198790 |
| 277 | +.. _`quite tricky in practice`: https://arxiv.org/abs/1602.00602 |
| 278 | + |
| 279 | +.. _`PyPy Speed Center`: https://speed.pypy.org/ |
| 280 | + |
| 281 | + |
| 282 | +Conclusion |
| 283 | +=========== |
| 284 | + |
| 285 | +This concludes my overview of some of the different styles of tests that we use |
| 286 | +to develop the PyPy Python interpreter. |
| 287 | + |
| 288 | +There is a whole other set of tests for the development of the RPython language, |
| 289 | +the garbage collectors it provides as well as the code that does the automatic |
| 290 | +JIT insertion. Maybe I'll cover these in a future post. |