|
| 1 | +# Python Code Parsing and pyc Files |
| 2 | + |
| 3 | +This document elaborates on various things to consider about how Python files in |
| 4 | +the GraalPython implementation are parsed. |
| 5 | + |
| 6 | +## Parser Performance |
| 7 | + |
| 8 | +Creating the Truffle tree for a Python source has two phases. The first one |
| 9 | +creates a simple syntax tree (SST) and a scope tree. The second phase transforms |
| 10 | +the SST into the Truffle tree, and this transformation requires the scope |
| 11 | +tree. The scope tree contains scope locations for variable and function |
| 12 | +definitions and information about scopes. The simple syntax tree contains nodes |
| 13 | +mirroring the source. Comparing the SST and Truffle tree, the SST is much |
| 14 | +smaller. It contains just the nodes representing the source in a simple way. One |
| 15 | +SST node is usually translated to many Truffle nodes. |
| 16 | + |
| 17 | +The simple syntax tree can be created in two ways: with ANTLR parsing or |
| 18 | +deserialization from an appropriate `*.pyc` file. If there is no appropriate |
| 19 | +`.pyc` file for a source, then the source is parsed with ANTLR. If the Python |
| 20 | +standard import logic finds an appropriate `.pyc` file, it will just trigger |
| 21 | +deserialization of the SST and scope tree from it. The deserialization is much |
| 22 | +faster than source parsing with ANTLR and needs only roughly 30% of the time |
| 23 | +that ANTLR needs. Of course, the first import of a new file is a little |
| 24 | +slower: besides parsing with ANTLR, the Python standard library import logic |
| 25 | +serializes the resulting code object to a `.pyc` file, which in our case means |
| 26 | +the SST and scope tree are serialized to such a file. |
| 27 | + |
| 28 | +> *Summary*: Loading code from serialized `.pyc` files is faster than parsing |
| 29 | +> the `.py` file using ANTLR. |
| 30 | +
|
| 31 | +## Creating and Managing pyc Files |
| 32 | + |
| 33 | +When a Python source file (module) is imported for the first time during an |
| 34 | +execution, the appropriate `.pyc` file is created automatically. If the same |
| 35 | +module is imported again, then the already created `.pyc` file is used. That |
| 36 | +means that there are no `.pyc` files for source files that were not executed |
| 37 | +(imported) yet. The creation of `.pyc` files is done entirely through the |
| 38 | +Truffle FileSystem API, so that embedders can manage the file system access. |
| 39 | + |
| 40 | +Every subsequent execution of a script will reuse the existing `.pyc` |
| 41 | +files or generate new ones. A `.pyc` file is regenerated if the timestamp |
| 42 | +or hashcode of the original source file is changed. The hashcode is generated |
| 43 | +only based on the Python source by calling `source.hashCode()`, which is the JDK |
| 44 | +hash code over the array of source file bytes, calculated with |
| 45 | +`java.util.Arrays.hashCode(byte[])`. The `.pyc` files are also regenerated if a |
| 46 | +magic number in the GraalPython parser is changed. The magic number is |
| 47 | +hard-coded in the GraalPython source and cannot be changed by the user (unless |
| 48 | +that user has access to the bytecode of GraalPython, in which case they can do |
| 49 | +anything they want already). The developers of GraalPython change the magic |
| 50 | +number when the format of SST or scope tree binary data is altered. This is an |
| 51 | +implementation detail, so the magic number does not have to correspond to the |
| 52 | +version of GraalPython (just like in CPython). The magic number of a `.pyc` file is a |
| 53 | +function of the concrete GraalPython Java code that is running. |
| 54 | + |
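The regeneration rules above can be sketched in Python. This is only an illustration, not GraalPython's actual code: `PycStamp` and `pyc_needs_regeneration` are hypothetical names, and how the three values are really stored in a `.pyc` file is not specified here. The hash function mirrors the documented behavior of `java.util.Arrays.hashCode(byte[])` (start at 1, multiply by 31, add each signed byte, wrap to 32 bits).

```python
from typing import NamedTuple


def java_arrays_hashcode(data: bytes) -> int:
    """Emulate java.util.Arrays.hashCode(byte[]) with 32-bit signed wraparound."""
    result = 1
    for b in data:
        signed = b - 256 if b > 127 else b  # Java bytes are signed
        result = (31 * result + signed) & 0xFFFFFFFF
    # Convert the unsigned 32-bit value back to a signed Java int.
    return result - 0x100000000 if result >= 0x80000000 else result


class PycStamp(NamedTuple):
    """Hypothetical record of the values that decide whether a .pyc is valid."""
    magic: int        # parser magic number
    mtime: int        # last-modification timestamp of the source
    source_hash: int  # hash code of the source bytes


def pyc_needs_regeneration(cached: PycStamp, current: PycStamp) -> bool:
    """A .pyc is regenerated if the magic number, timestamp, or hash differs."""
    return (
        cached.magic != current.magic
        or cached.mtime != current.mtime
        or cached.source_hash != current.source_hash
    )
```

Because the hash is computed over the source bytes only, touching a file without changing its content still triggers regeneration through the timestamp check in this sketch.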
| 55 | +> *Summary*: `.pyc` files are created automatically by the GraalPython runtime |
| 56 | +> when no or an invalid `.pyc` file is found matching the desired `.py` file. |
| 57 | +
|
| 58 | +> **Important**: If you use `.pyc` files, you will need to grant write access to |
| 59 | +> the GraalPython runtime at least when switching versions or changing the |
| 60 | +> original source code. Otherwise, the regeneration of `.pyc` files will fail |
| 61 | +> and every import will have the overhead of accessing the old `.pyc` file, |
| 62 | +> parsing the code, serializing it, and trying (and failing) to write out a new |
| 63 | +> `.pyc` file. |
| 64 | +
|
| 65 | +A `*.pyc` file is never deleted by GraalPython, only regenerated. It is |
| 66 | +regenerated when the appropriate source file is changed (timestamp of last |
| 67 | +modification or hashcode of the content) or the magic number of the GraalPython |
| 68 | +parser changes. Magic number changes will be communicated in the release notes |
| 69 | +so that embedders or system administrators can delete old `.pyc` files when |
| 70 | +upgrading. |
| 71 | + |
| 72 | +The folder structure created for `.pyc` files looks like this: |
| 73 | + |
| 74 | +``` |
| 75 | +top_folder |
| 76 | + __pycache__ |
| 77 | + sourceA.graalpython.pyc |
| 78 | + sourceB.graalpython.pyc |
| 79 | + sourceA.py |
| 80 | + sourceB.py |
| 81 | + sub_folder |
| 82 | + __pycache__ |
| 83 | + sourceX.graalpython.pyc |
| 84 | + sourceX.py |
| 85 | +``` |
| 86 | + |
| 87 | +By default, the `__pycache__` directory is created at the same directory level as |
| 88 | +a source code file, and all `.pyc` files for sources in that directory are |
| 89 | +stored in it. This folder may store `.pyc` files created with different |
| 90 | +versions of Python (including e.g. CPython), so the user may see files ending in |
| 91 | +`*.cpython-36.pyc` for example. |
| 92 | + |
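To see where a `.pyc` file for a given source is expected to live, the standard `importlib.util.cache_from_source` function can be queried. The sketch below assumes GraalPython exposes this standard-library API the same way CPython does and that its cache tag produces the `graalpython` file names shown in the layout above. `expected_pyc_path` is a hypothetical helper name.

```python
import importlib.util
import os
import sys


def expected_pyc_path(source_path: str) -> str:
    """Return the path where the import system would cache the given source."""
    # Raises NotImplementedError if the implementation has no cache tag.
    return importlib.util.cache_from_source(source_path)


if __name__ == "__main__":
    # The cache tag is the implementation-specific part of the file name.
    print(sys.implementation.cache_tag)
    print(expected_pyc_path(os.path.join("top_folder", "sourceA.py")))
```

The printed path depends on the running implementation and on whether a cache prefix is configured, so no fixed output is shown here.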
| 93 | +The current implementation also includes a copy of the original source text in |
| 94 | +the `.pyc` file. This is a minor performance optimization so we can create a |
| 95 | +Truffle `Source` object with the path to the original source file, but we do not |
| 96 | +need to read the original `*.py` file, which speeds up the process of obtaining |
| 97 | +the Truffle tree (we read just one file). The structure of a `.graalpython.pyc` file |
| 98 | +is this: |
| 99 | + |
| 100 | +``` |
| 101 | +MAGIC_NUMBER |
| 102 | +source text |
| 103 | +binary data - scope tree |
| 104 | +binary data - simple syntax tree |
| 105 | +``` |
| 106 | + |
| 107 | +> **Important**: `.pyc` files are not an effective means to hide Python library |
| 108 | +> source code from guest code, since the original source can still be recovered |
| 109 | +> and even if the source were omitted, the syntax tree contains enough |
| 110 | +> information to decompile into source code easily. |
| 111 | +
|
| 112 | +The serialized SST and scope tree are stored in a Python `code` object as well, |
| 113 | +as the content of the attribute `co_code` (which contains bytecode on CPython). |
| 114 | + |
| 115 | +For example: |
| 116 | +``` |
| 117 | +>>> def add(x, y): |
| 118 | +... return x+y |
| 119 | +... |
| 120 | +>>> add.__code__.co_code |
| 121 | +b'\x01\x00\x00\x02[]K\xbf\xd1\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00 ...' |
| 122 | +``` |
| 123 | + |
| 124 | +The creation of `*.pyc` files can be controlled in the same ways as on CPython |
| 125 | +(cf. https://docs.python.org/3/using/cmdline.html): |
| 126 | + |
| 127 | + * The GraalPython launcher (`graalpython`) reads the `PYTHONDONTWRITEBYTECODE` |
| 128 | + environment variable - if this is set to a non-empty string, Python will not |
| 129 | + try to write `.pyc` files when importing modules. |
| 130 | + * The launcher command line option `-B`, if given, has the same effect as the |
| 131 | + above. |
| 132 | + * Guest language code can change the attribute `dont_write_bytecode` of the |
| 133 | + `sys` built-in module at runtime to change the behavior for subsequent |
| 134 | + imports. |
| 135 | + * The launcher reads the `PYTHONPYCACHEPREFIX` environment variable - if set, |
| 136 | + the `__pycache__` directory will be created at the path pointed to by the |
| 137 | + prefix, and a mirror of the directory structure of the source tree will be |
| 138 | + created on-demand to house the `.pyc` files. |
| 139 | + * Guest language code can change the attribute `pycache_prefix` of the `sys` |
| 140 | + module at runtime to change the location for subsequent imports. |
| 141 | + |
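The two runtime attributes mentioned in the list can be adjusted temporarily from guest code. The following is a minimal sketch: `bytecode_settings` is a hypothetical helper, not part of GraalPython or the standard library, and `sys.pycache_prefix` is assumed to be available (it exists on CPython 3.8 and later).

```python
import sys
from contextlib import contextmanager


@contextmanager
def bytecode_settings(write=True, prefix=None):
    """Temporarily change .pyc writing behavior for imports inside the block."""
    old_flag = sys.dont_write_bytecode
    old_prefix = getattr(sys, "pycache_prefix", None)
    sys.dont_write_bytecode = not write
    sys.pycache_prefix = prefix
    try:
        yield
    finally:
        # Restore the previous settings for subsequent imports.
        sys.dont_write_bytecode = old_flag
        sys.pycache_prefix = old_prefix
```

For example, `with bytecode_settings(write=False): import some_module` would import a module (the name is a placeholder) without attempting to write a `.pyc` file for it.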
| 142 | +Since an embedder cannot use environment variables or CPython options to |
| 143 | +communicate these settings to GraalPython, they are also available as the |
| 144 | +following language options: |
| 145 | + |
| 146 | + * `python.DontWriteBytecodeFlag` - equivalent to `-B` or `PYTHONDONTWRITEBYTECODE` |
| 147 | + * `python.PyCachePrefix` - equivalent to `PYTHONPYCACHEPREFIX` |
| 148 | + |
| 149 | +> *Summary*: `.pyc` files are largely managed automatically by the runtime in a |
| 150 | +> manner compatible with CPython. As on CPython, there are options to specify |
| 151 | +> their location and whether they should be written at all, and both of these options |
| 152 | +> can be changed by guest code. |
| 153 | +
|
| 154 | +> **Important**: By default a GraalPython context will not enable writing `.pyc` |
| 155 | +> files. The `graalpython` launcher enables it by default, but if this is |
| 156 | +> desired in the embedding use case, care should be taken to ensure that the |
| 157 | +> `__pycache__` location is properly managed and the files in that location are |
| 158 | +> secured against manipulation just like the source `.py` files they were |
| 159 | +> derived from. |
| 160 | +
|
| 161 | +> **Important**: When upgrading application sources or GraalPython, old `.pyc` |
| 162 | +> files must be removed by the embedder as required. |
| 163 | +
|
| 164 | +## Security Considerations |
| 165 | + |
| 166 | +The serialization of the SST and scope tree is handwritten, and during |
| 167 | +deserialization it is not possible to load classes other than SSTNodes. We do not |
| 168 | +use Java serialization or other frameworks to serialize Java objects. The main |
| 169 | +reason was performance, but this has the effect that no class loading can be |
| 170 | +forced by a maliciously crafted `.pyc` file. |
| 171 | + |
| 172 | +All file operations (obtaining the data, timestamps, and writing `.pyc` files) |
| 173 | +are done through the Truffle FileSystem API. Embedders can modify all of these |
| 174 | +operations by means of custom (e.g. read-only) FileSystem implementations. The |
| 175 | +embedder can also effectively disable the creation of `.pyc` files by disabling |
| 176 | +I/O permissions for GraalPython. |
| 177 | + |
| 178 | +If the `.pyc` files are not readable, their location is not writable, or the |
| 179 | +`.pyc` files' serialization data or magic numbers are corrupted in any way, the |
| 180 | +deserialization fails and we just parse the `.py` file again. This comes with a |
| 181 | +minor performance hit *only* for the parsing of modules, which should not be |
| 182 | +significant for most applications (provided the application does actual work |
| 183 | +besides loading Python code). |
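The fallback behavior described above can be sketched as follows. This is an illustration under stated assumptions, not the actual GraalPython logic: `load_syntax_tree` and its three callable parameters are hypothetical, and the exception types used to signal an unreadable or corrupted `.pyc` file are assumptions.

```python
from typing import Any, Callable, Tuple


def load_syntax_tree(
    read_pyc: Callable[[], bytes],
    deserialize: Callable[[bytes], Any],
    parse_source: Callable[[], Any],
) -> Tuple[Any, str]:
    """Prefer the cached tree; fall back to parsing the source on any failure."""
    try:
        data = read_pyc()  # may raise OSError if the file is missing or unreadable
        return deserialize(data), "pyc"  # may raise ValueError on bad magic or data
    except (OSError, ValueError):
        # A broken cache is never fatal: it only costs one parse.
        return parse_source(), "parsed"
```

The second element of the result records which path was taken, which can help when checking whether cached files are actually being used.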