Commit a7b639d

committed
update parser docs based on discussions
1 parent ef9afca commit a7b639d

File tree

1 file changed

+149
-50
lines changed

docs/contributor/PARSER_DETAILS.md

Lines changed: 149 additions & 50 deletions
@@ -1,27 +1,75 @@
1-
This document is about parsing python files in GraalPython implementation.
2-
It describes way how we obtain Truffle tree from a source.
3-
4-
Creating Truffle tree for a python source has two phases. The first one creates
5-
simple syntax tree (SST) and scope tree, the second phase transforms the SST to
6-
the Truffle tree and for the transformation we need scope tree. The scope tree
7-
contains scope locations for variable and function definitions and information
8-
about scopes. The simple syntax tree contains nodes mirroring the source.
9-
Comparing SST and Truffle tree, the SST is much smaller. It contains just the nodes
10-
representing the source in a simple way. One SST node is usually translated
11-
to many Truffle nodes.
12-
13-
The simple syntax tree can be created in two ways. With ANTLR parsing
14-
or deserialization from appropriate `*.pyc` file. In both cases together with
15-
scope tree. If there is no appropriate `.pyc` file for a source, then the source
16-
is parsed with ANTLR and result SST and scope tree is serialized to the `.pyc` file.
17-
The next time, we don't have to use ANTLR parser, because the result is already
18-
serialized in the `.pyc` file. So instead of parsing source file with ANTLR,
19-
we just deserialized SST and scope tree from the `.pyc` file. The deserialization
20-
is much faster then source parsing with ANTLR. The deserialization needs ruffly
21-
just 30% of the time that needs ANTLR parser. Of course the first run is little
22-
bit slower (we need to SST and scope tree save to the `.pyc` file).
23-
24-
In the folder structure it looks like this:
1+
# Python Code Parsing and pyc Files
2+
3+
This document elaborates on various things to consider about how Python files in
4+
the GraalPython implementation are parsed.
5+
6+
## Parser Performance
7+
8+
Creating the Truffle tree for a Python source has two phases. The first one
9+
creates a simple syntax tree (SST) and a scope tree, the second phase transforms
10+
the SST to the Truffle tree and for the transformation we need the scope
11+
tree. The scope tree contains scope locations for variable and function
12+
definitions and information about scopes. The simple syntax tree contains nodes
13+
mirroring the source. Comparing the SST and Truffle tree, the SST is much
14+
smaller. It contains just the nodes representing the source in a simple way. One
15+
SST node is usually translated to many Truffle nodes.
16+
17+
The simple syntax tree can be created in two ways: with ANTLR parsing or
18+
deserialization from an appropriate `*.pyc` file. If there is no appropriate
19+
`.pyc` file for a source, then the source is parsed with ANTLR. If the Python
20+
standard import logic finds an appropriate `.pyc` file, it will just trigger
21+
deserialization of the SST and scope tree from it. The deserialization is much
22+
faster than source parsing with ANTLR and needs only roughly 30% of the time
23+
that ANTLR needs. Of course the first import of a new file is a little bit
24+
slower - besides parsing with ANTLR, the Python standard library import logic
25+
serializes the resulting code object to a `.pyc` file, which in our case means
26+
the SST and scope tree are serialized to such a file.
27+
28+
> *Summary*: Loading code from serialized `.pyc` files is faster than parsing
29+
> the `.py` file using ANTLR.
30+
31+
## Creating and Managing pyc Files
32+
33+
When a Python source file (module) is imported during an execution for the first
34+
time, then the appropriate `.pyc` file is created automatically. If the same
35+
module is imported again, then the already created `.pyc` file is used. That
36+
means that there are no `.pyc` files for source files that were not executed
37+
(imported) yet. The creation of `.pyc` files is done entirely through the
38+
Truffle FileSystem API, so that embedders can manage the file system access.
39+
40+
Every subsequent execution of a script will reuse the already existing `.pyc`
41+
files or will generate a new one. A `.pyc` file is regenerated if the timestamp
42+
or hashcode of the original source file is changed. The hashcode is generated
43+
only based on the Python source by calling `source.hashCode()`, which is the JDK
44+
hash code over the array of source file bytes, calculated with
45+
`java.util.Arrays.hashCode(byte[])`. The `.pyc` files are also regenerated if a
46+
magic number in the GraalPython parser is changed. The magic number is
47+
hard-coded in the GraalPython source and cannot be changed by the user (unless
48+
that user has access to the bytecode of GraalPython, in which case they can do
49+
anything they want already). The developers of GraalPython change the magic
50+
number when the format of SST or scope tree binary data is altered. This is an
51+
implementation detail, so the magic number does not have to correspond to the
52+
version of GraalPython (just like in CPython). The magic number of pyc is a
53+
function of the concrete GraalPython Java code that is running.
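
To make the content hash described above more concrete, here is a minimal Python sketch that mirrors the documented contract of `java.util.Arrays.hashCode(byte[])`: start at 1, multiply by 31, add each byte as a signed value, and wrap to a signed 32-bit integer. The function name `java_arrays_hash_code` is hypothetical and this is not GraalPython's own code, only an illustration of the arithmetic.

```python
def java_arrays_hash_code(data: bytes) -> int:
    """Approximate java.util.Arrays.hashCode(byte[]) for a bytes object.

    Hypothetical helper for illustration only; not part of GraalPython.
    """
    result = 1
    for b in data:
        # Java bytes are signed (-128..127), Python bytes are 0..255.
        signed = b - 256 if b > 127 else b
        # Keep the running value within 32 bits, as a Java int would.
        result = (31 * result + signed) & 0xFFFFFFFF
    # Reinterpret the unsigned 32-bit value as a signed Java int.
    return result - (1 << 32) if result >= (1 << 31) else result
```

Because the hash depends only on the source bytes, two files with identical content yield the same value regardless of their path or timestamp.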
54+
55+
> *Summary*: `.pyc` files are created automatically by the GraalPython runtime
56+
> when no or an invalid `.pyc` file is found matching the desired `.py` file.
57+
58+
> **Important**: If you use `.pyc` files, you will need to allow write-access to
59+
> the GraalPython runtime at least when switching versions or changing the
60+
> original source code - otherwise, the regeneration of `.pyc` files will fail
61+
> and every import will have the overhead of accessing the old `.pyc` file,
62+
> parsing the code, serializing it, and trying (and failing) to write out a new
63+
> `.pyc` file.
64+
65+
A `*.pyc` file is never deleted by GraalPython, only regenerated. It is
66+
regenerated when the appropriate source file is changed (timestamp of last
67+
modification or hashcode of the content) or the magic number of the GraalPython
68+
parser changes. Magic number changes will be communicated in the release notes
69+
so that embedders or system administrators can delete old `.pyc` files when
70+
upgrading.
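
The regeneration rule above can be sketched as a simple comparison. The `CacheStamp` type and its field names are assumptions made for this example; they do not describe the actual `.pyc` header layout, which is only partially shown in this document.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CacheStamp:
    """Hypothetical summary of what a cached .pyc was built from."""
    magic: int        # parser magic number
    mtime: int        # last-modification timestamp of the source
    source_hash: int  # hash code of the source content


def needs_regeneration(cached: Optional[CacheStamp], current: CacheStamp) -> bool:
    """Return True if a .pyc must be (re)written for the current source."""
    if cached is None:
        # No .pyc exists yet: the first import always creates one.
        return True
    return (
        cached.magic != current.magic
        or cached.mtime != current.mtime
        or cached.source_hash != current.source_hash
    )
```

Note that nothing in this check deletes a stale file; consistent with the text, a stale `.pyc` is only ever overwritten.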
71+
72+
The folder structure created for `.pyc` files looks like this:
2573

2674
```
2775
top_folder
@@ -36,17 +84,18 @@ top_folder
3684
sourceX.py
3785
```
3886

39-
On the same directory level of a source code file, the `__pycache__` directory
40-
is created and in this directory are stored all `.*pyc` files from the same
41-
directory. There can be also files created with CPython, so user can see there
42-
also files with extension `*.cpython3-6.pyc` for example.
43-
44-
The current implementation includes also copy of the original text into `.pyc' file.
45-
The reason is that we create from this Truffle Source object with path to the
46-
original source file, but we do not need to read the original `*.py` file, which
47-
speed up the process obtaining Truffle tree (we read just one file).
87+
By default the `__pycache__` directory is created on the same directory level of
88+
a source code file and in this directory all `.pyc` files from the same
89+
directory are stored. This folder may store `.pyc` files created with different
90+
versions of Python (including e.g. CPython), so the user may see files ending in
91+
`*.cpython3-6.pyc` for example.
4892

49-
The structure of a `.graalpython.pyc` file is this:
93+
The current implementation also includes a copy of the original source text in
94+
the `.pyc` file. This is a minor performance optimization so we can create a
95+
Truffle `Source` object with the path to the original source file, but we do not
96+
need to read the original `*.py` file, which speeds up the process of obtaining the
97+
Truffle tree (we read just one file). The structure of a `.graalpython.pyc` file
98+
is this:
5099

51100
```
52101
MAGIC_NUMBER
@@ -55,30 +104,80 @@ binary data - scope tree
55104
binary data - simple syntax tree
56105
```
57106

58-
The serialized SST and scope tree is stored in Code object as well, attribute `code`
107+
> **Important**: `.pyc` files are not an effective means to hide Python library
108+
> source code from guest code, since the original source can still be recovered
109+
> and even if the source were omitted, the syntax tree contains enough
110+
> information to decompile into source code easily.
111+
112+
The serialized SST and scope tree are stored in a Python `code` object as well,
113+
as the content of the attribute `co_code` (which contains bytecode on CPython).
59114

60115
For example:
61116
```
62117
>>> def add(x, y):
63-
... print('Running x+y')
64-
... return x+y
118+
... return x+y
65119
...
66-
>>> co = add.__code__
67-
>>> co.co_code
120+
>>> add.__code__.co_code
68121
b'\x01\x00\x00\x02[]K\xbf\xd1\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00 ...'
69122
```
70123

71-
The creating `*.pyc` files can be canceled / allowed in the same ways like in CPython:
124+
The creation of `*.pyc` files can be controlled in the same ways as on CPython
125+
(cf. https://docs.python.org/3/using/cmdline.html):
126+
127+
* The GraalPython launcher (`graalpython`) reads the `PYTHONDONTWRITEBYTECODE`
128+
environment variable - if this is set to a non-empty string, Python will not
129+
try to write `.pyc` files when importing modules.
130+
* The launcher command line option `-B`, if given, has the same effect as the
131+
above.
132+
* Guest language code can change the attribute `dont_write_bytecode` of the
133+
`sys` built-in module at runtime to change the behavior for subsequent
134+
imports.
135+
* The launcher reads the `PYTHONPYCACHEPREFIX` environment variable - if set,
136+
the `__pycache__` directory will be created at the path pointed to by the
137+
prefix, and a mirror of the directory structure of the source tree will be
138+
created on-demand to house the `.pyc` files.
139+
* Guest language code can change the attribute `pycache_prefix` of the `sys`
140+
module at runtime to change the location for subsequent imports.
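
The two runtime switches listed above can be exercised from guest code roughly as follows. This sketch relies only on the standard `sys.dont_write_bytecode` and `sys.pycache_prefix` attributes and on `importlib.util.cache_from_source` as documented for CPython 3.8 and later; that GraalPython resolves cache paths identically is an assumption based on the compatibility claim in the text. Both helper names are hypothetical.

```python
import importlib.util
import sys


def cache_path_with_prefix(source_path, prefix):
    """Return where the .pyc for source_path would live under the given prefix."""
    previous = sys.pycache_prefix
    sys.pycache_prefix = prefix
    try:
        # Only computes the path; nothing is written to disk here.
        return importlib.util.cache_from_source(source_path)
    finally:
        # Restore the previous setting so later imports are unaffected.
        sys.pycache_prefix = previous


def import_without_writing_pyc(module_name):
    """Import a module while suppressing .pyc creation for this import."""
    previous = sys.dont_write_bytecode
    sys.dont_write_bytecode = True
    try:
        return importlib.import_module(module_name)
    finally:
        sys.dont_write_bytecode = previous
```

Restoring the previous values in `finally` keeps the change scoped to a single call, which matters because both attributes affect all subsequent imports in the process.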
141+
142+
Since the embedder cannot use environment variables or CPython options to
143+
communicate these options to GraalPython, we make these options available as
144+
these language options:
145+
146+
* `python.DontWriteBytecodeFlag` - equivalent to `-B` or `PYTHONDONTWRITEBYTECODE`
147+
* `python.PyCachePrefix` - equivalent to `PYTHONPYCACHEPREFIX`
148+
149+
> *Summary*: `.pyc` files are largely managed automatically by the runtime in a
150+
> manner compatible with CPython. As on CPython, there are options to specify
151+
> their location and if they should be written at all and both of these options
152+
> can be changed by guest code.
153+
154+
> **Important**: By default a GraalPython context will not enable writing `.pyc`
155+
> files. The `graalpython` launcher enables it by default, but if this is
156+
> desired in the embedding use case, care should be taken to ensure that the
157+
> `__pycache__` location is properly managed and the files in that location are
158+
> secured against manipulation just like the source `.py` files they were
159+
> derived from.
160+
161+
> **Important**: When upgrading application sources or GraalPython, old `.pyc`
162+
> files must be removed by the embedder as required.
163+
164+
## Security Considerations
72165

73-
* evironment variable: PYTHONDONTWRITEBYTECODE - If this is set to a non-empty string,
74-
Python won’t try to write .pyc files on the import of source modules.
75-
* command line option: -B, If given, Python won’t try to write .pyc files on
76-
the import of source modules.
77-
* in a code: setting attribute `dont_write_bytecode` of `sys` built in module
166+
The serialization of SST and scope tree is hand written and during
167+
deserialization it is not possible to load classes other than SSTNodes. We do not
168+
use Java serialization or other frameworks to serialize Java objects. The main
169+
reason was performance, but this has the effect that no class loading can be
170+
forced by a maliciously crafted `.pyc` file.
78171

172+
All file operations (obtaining the data, timestamps, and writing `pyc` files)
173+
are done through the Truffle FileSystem API. Embedders can modify all of these
174+
operations by means of custom (e.g. read-only) FileSystem implementations. The
175+
embedder can also effectively disable the creation of `.pyc` files by disabling
176+
I/O permissions for GraalPython.
79177

80-
## Security
81-
The serialization of SST and scope tree is hand written and during deserialization
82-
is not possible to load other classes then SSTNodes. It doesn't use Java serialization
83-
or other framework to serialize Java object. The main reason was performance.
84-
The performance can be maximize in this way. The next reason was the security.
178+
If the `.pyc` files are not readable, their location is not writable, or the
179+
`.pyc` files' serialization data or magic numbers are corrupted in any way, the
180+
deserialization fails and we just parse the `.py` file again. This comes with a
181+
minor performance hit *only* for the parsing of modules, which should not be
182+
significant for most applications (provided the application does actual work
183+
besides loading Python code).
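
The fallback behavior described above follows a common pattern, sketched here in Python. The callables `deserialize_cached` and `parse_source` are hypothetical stand-ins for the real deserializer and the ANTLR-based parser, and the choice of `OSError` and `ValueError` as failure signals is an assumption made for this example.

```python
def load_with_fallback(deserialize_cached, parse_source):
    """Try the cached .pyc first; on any read or format failure, parse the source.

    Returns a tuple of (result, origin), where origin is "pyc" or "parsed".
    """
    try:
        return deserialize_cached(), "pyc"
    except (OSError, ValueError):
        # Unreadable file, bad magic number, or corrupted data:
        # fall back to parsing the .py file again.
        return parse_source(), "parsed"
```

The only cost of a failed cache read in this pattern is the extra parse, which matches the minor performance hit mentioned in the text.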
