1
- This document is about parsing python files in GraalPython implementation.
2
- It describes way how we obtain Truffle tree from a source.
3
-
4
- Creating Truffle tree for a python source has two phases. The first one creates
5
- simple syntax tree (SST) and scope tree, the second phase transforms the SST to
6
- the Truffle tree and for the transformation we need scope tree. The scope tree
7
- contains scope locations for variable and function definitions and information
8
- about scopes. The simple syntax tree contains nodes mirroring the source.
9
- Comparing SST and Truffle tree, the SST is much smaller. It contains just the nodes
10
- representing the source in a simple way. One SST node is usually translated
11
- to many Truffle nodes.
12
-
13
- The simple syntax tree can be created in two ways. With ANTLR parsing
14
- or deserialization from appropriate ` *.pyc ` file. In both cases together with
15
- scope tree. If there is no appropriate ` .pyc ` file for a source, then the source
16
- is parsed with ANTLR and result SST and scope tree is serialized to the ` .pyc ` file.
17
- The next time, we don't have to use ANTLR parser, because the result is already
18
- serialized in the ` .pyc ` file. So instead of parsing source file with ANTLR,
19
- we just deserialized SST and scope tree from the ` .pyc ` file. The deserialization
20
- is much faster then source parsing with ANTLR. The deserialization needs ruffly
21
- just 30% of the time that needs ANTLR parser. Of course the first run is little
22
- bit slower (we need to SST and scope tree save to the ` .pyc ` file).
23
-
24
- In the folder structure it looks like this:
1
+ # Python Code Parsing and pyc Files
2
+
3
+ This document elaborates on various things to consider about how Python files in
4
+ the GraalPython implementation are parsed.
5
+
6
+ ## Parser Performance
7
+
8
+ Creating the Truffle tree for a Python source has two phases. The first one
9
+ creates a simple syntax tree (SST) and a scope tree; the second phase transforms
10
+ the SST to the Truffle tree and for the transformation we need the scope
11
+ tree. The scope tree contains scope locations for variable and function
12
+ definitions and information about scopes. The simple syntax tree contains nodes
13
+ mirroring the source. Comparing the SST and Truffle tree, the SST is much
14
+ smaller. It contains just the nodes representing the source in a simple way. One
15
+ SST node is usually translated to many Truffle nodes.
16
+
17
+ The simple syntax tree can be created in two ways: with ANTLR parsing or
18
+ deserialization from an appropriate ` *.pyc ` file. If there is no appropriate
19
+ ` .pyc ` file for a source, then the source is parsed with ANTLR. If the Python
20
+ standard import logic finds an appropriate ` .pyc ` file, it will just trigger
21
+ deserialization of the SST and scope tree from it. The deserialization is much
22
+ faster than source parsing with ANTLR and needs only roughly 30% of the time
23
+ that ANTLR needs. Of course the first import of a new file is a little bit
24
+ slower - besides parsing with ANTLR, the Python standard library import logic
25
+ serializes the resulting code object to a ` .pyc ` file, which in our case means
26
+ the SST and scope tree are serialized to such a file.
27
+
28
+ > *Summary*: Loading code from serialized ` .pyc ` files is faster than parsing
29
+ > the ` .py ` file using ANTLR.
30
+
31
+ ## Creating and Managing pyc Files
32
+
33
+ When a Python source file (module) is imported during an execution for the first
34
+ time, then the appropriate ` .pyc ` file is created automatically. If the same
35
+ module is imported again, then the already created ` .pyc ` file is used. That
36
+ means that there are no ` .pyc ` files for source files that were not executed
37
+ (imported) yet. The creation of ` .pyc ` files is done entirely through the
38
+ Truffle FileSystem API, so that embedders can manage the file system access.
39
+
40
+ Every subsequent execution of a script will reuse the already existing ` .pyc `
41
+ files or will generate a new one. A ` .pyc ` file is regenerated if the timestamp
42
+ or hashcode of the original source file is changed. The hashcode is generated
43
+ only based on the Python source by calling ` source.hashCode() ` , which is the JDK
44
+ hash code over the array of source file bytes, calculated with
45
+ ` java.util.Arrays.hashCode(byte[]) ` . The ` .pyc ` files are also regenerated if a
46
+ magic number in the GraalPython parser is changed. The magic number is
47
+ hard-coded in the GraalPython source and cannot be changed by the user (unless
48
+ that user has access to the bytecode of GraalPython, in which case they can do
49
+ anything they want already). The developers of GraalPython change the magic
50
+ number when the format of SST or scope tree binary data is altered. This is an
51
+ implementation detail, so the magic number does not have to correspond to the
52
+ version of GraalPython (just like in CPython). The magic number of pyc is a
53
+ function of the concrete GraalPython Java code that is running.
54
+
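The regeneration conditions described above can be sketched in Python. This is only an illustration, not GraalPython's implementation: `CacheHeader` and `needs_regeneration` are hypothetical names, and the header fields are an assumption about what such a check would need. `java_bytes_hash` emulates `java.util.Arrays.hashCode(byte[])`, the hash the text names.

```python
from dataclasses import dataclass
from typing import Optional


def java_bytes_hash(data: bytes) -> int:
    """Emulate java.util.Arrays.hashCode(byte[]) for a bytes object."""
    result = 1
    for b in data:
        signed = b - 256 if b > 127 else b  # Java bytes are signed
        result = (31 * result + signed) & 0xFFFFFFFF  # wrap to 32 bits
    # Reinterpret the unsigned 32-bit value as a signed Java int.
    return result - (1 << 32) if result >= (1 << 31) else result


@dataclass(frozen=True)
class CacheHeader:
    """Hypothetical metadata recorded alongside a cached tree."""
    magic: int
    source_mtime: int
    source_hash: int


def needs_regeneration(header: Optional[CacheHeader], current_magic: int,
                       source_mtime: int, source: bytes) -> bool:
    """Return True if the cached file is missing or no longer matches."""
    if header is None:
        return True  # no cached file yet
    if header.magic != current_magic:
        return True  # serialization format changed
    return (header.source_mtime != source_mtime
            or header.source_hash != java_bytes_hash(source))
```

A cached file whose header matches all three values would be reused; any mismatch leads to parsing the source again and writing a fresh file.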
55
+ > *Summary*: ` .pyc ` files are created automatically by the GraalPython runtime
56
+ > when no or an invalid ` .pyc ` file is found matching the desired ` .py ` file.
57
+
58
+ > **Important**: If you use ` .pyc ` files, you will need to allow write-access to
59
+ > the GraalPython runtime at least when switching versions or changing the
60
+ > original source code - otherwise, the regeneration of ` .pyc ` files will fail
61
+ > and every import will have the overhead of accessing the old ` .pyc ` file,
62
+ > parsing the code, serializing it, and trying (and failing) to write out a new
63
+ > ` .pyc ` file.
64
+
65
+ A ` *.pyc ` file is never deleted by GraalPython, only regenerated. It is
66
+ regenerated when the appropriate source file is changed (timestamp of last
67
+ modification or hashcode of the content) or the magic number of the GraalPython
68
+ parser changes. Magic number changes will be communicated in the release notes
69
+ so that embedders or system administrators can delete old ` .pyc ` files when
70
+ upgrading.
71
+
72
+ The folder structure created for ` .pyc ` files looks like this:
25
73
26
74
```
27
75
top_folder
@@ -36,17 +84,18 @@ top_folder
36
84
sourceX.py
37
85
```
38
86
39
- On the same directory level of a source code file, the ` __pycache__ ` directory
40
- is created and in this directory are stored all ` .*pyc ` files from the same
41
- directory. There can be also files created with CPython, so user can see there
42
- also files with extension ` *.cpython3-6.pyc ` for example.
43
-
44
- The current implementation includes also copy of the original text into `.pyc' file.
45
- The reason is that we create from this Truffle Source object with path to the
46
- original source file, but we do not need to read the original ` *.py ` file, which
47
- speed up the process obtaining Truffle tree (we read just one file).
87
+ By default, the ` __pycache__ ` directory is created on the same directory level as
88
+ a source code file and in this directory all ` .pyc ` files from the same
89
+ directory are stored. This folder may store ` .pyc ` files created with different
90
+ versions of Python (including e.g. CPython), so the user may see files ending in
91
+ ` *.cpython-36.pyc ` for example.
48
92
49
- The structure of a ` .graalpython.pyc ` file is this:
93
+ The current implementation also includes a copy of the original source text in
94
+ the ` .pyc ` file. This is a minor performance optimization so we can create a
95
+ Truffle ` Source ` object with the path to the original source file, but we do not
96
+ need to read the original ` *.py ` file, which speeds up the process of obtaining the
97
+ Truffle tree (we read just one file). The structure of a ` .graalpython.pyc ` file
98
+ is this:
50
99
51
100
```
52
101
MAGIC_NUMBER
@@ -55,30 +104,80 @@ binary data - scope tree
55
104
binary data - simple syntax tree
56
105
```
57
106
58
- The serialized SST and scope tree is stored in Code object as well, attribute ` code `
107
+ > **Important**: ` .pyc ` files are not an effective means to hide Python library
108
+ > source code from guest code, since the original source can still be recovered
109
+ > and even if the source were omitted, the syntax tree contains enough
110
+ > information to decompile into source code easily.
111
+
112
+ The serialized SST and scope tree are stored in a Python ` code ` object as well,
113
+ as the content of the attribute ` co_code ` (which contains bytecode on CPython).
59
114
60
115
For example:
61
116
```
62
117
>>> def add(x, y):
63
- ... print('Running x+y')
64
- ... return x+y
118
+ ... return x+y
65
119
...
66
- >>> co = add.__code__
67
- >>> co.co_code
120
+ >>> add.__code__.co_code
68
121
b'\x01\x00\x00\x02[]K\xbf\xd1\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00 ...'
69
122
```
70
123
71
- The creating ` *.pyc ` files can be canceled / allowed in the same ways like in CPython:
124
+ The creation of ` *.pyc ` files can be controlled in the same ways as on CPython
125
+ (cf. https://docs.python.org/3/using/cmdline.html):
126
+
127
+ * The GraalPython launcher (` graalpython ` ) reads the ` PYTHONDONTWRITEBYTECODE `
128
+ environment variable - if this is set to a non-empty string, Python will not
129
+ try to write ` .pyc ` files when importing modules.
130
+ * The launcher command line option ` -B ` , if given, has the same effect as the
131
+ above.
132
+ * Guest language code can change the attribute ` dont_write_bytecode ` of the
133
+ ` sys ` built-in module at runtime to change the behavior for subsequent
134
+ imports.
135
+ * The launcher reads the ` PYTHONPYCACHEPREFIX ` environment variable - if set,
136
+ the ` __pycache__ ` directory will be created at the path pointed to by the
137
+ prefix, and a mirror of the directory structure of the source tree will be
138
+ created on-demand to house the ` .pyc ` files.
139
+ * Guest language code can change the attribute ` pycache_prefix ` of the ` sys `
140
+ module at runtime to change the location for subsequent imports.
141
+
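The two `sys` attributes from the list above can be inspected and changed from guest code. The sketch below uses only the standard `sys` and `importlib.util` modules and assumes the runtime follows CPython's `importlib` behavior (CPython 3.8 or later for `pycache_prefix`); the prefix directory and source path are made up for illustration.

```python
import importlib.util
import os
import sys
import tempfile

# Hypothetical paths, used for illustration only.
source_path = os.path.join(os.path.abspath(os.sep), "top_folder", "sourceA.py")
prefix = os.path.join(tempfile.gettempdir(), "pycache_demo")

saved_prefix = sys.pycache_prefix
saved_flag = sys.dont_write_bytecode
try:
    # Default layout: a __pycache__ directory next to the source file.
    sys.pycache_prefix = None
    default_location = importlib.util.cache_from_source(source_path)

    # Redirected layout: the source tree is mirrored below the prefix.
    sys.pycache_prefix = prefix
    redirected_location = importlib.util.cache_from_source(source_path)

    # Subsequent imports will not try to write .pyc files.
    sys.dont_write_bytecode = True
    writes_disabled = sys.dont_write_bytecode
finally:
    # Restore the previous settings so later imports are unaffected.
    sys.pycache_prefix = saved_prefix
    sys.dont_write_bytecode = saved_flag

print(default_location)
print(redirected_location)
```

The exact file name suffix depends on the implementation's cache tag, so the printed paths differ between CPython and GraalPython.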
142
+ Since the embedder cannot use environment variables or CPython options to
143
+ communicate these options to GraalPython, we make these options available as
144
+ language options:
145
+
146
+ * ` python.DontWriteBytecodeFlag ` - equivalent to ` -B ` or ` PYTHONDONTWRITEBYTECODE `
147
+ * ` python.PyCachePrefix ` - equivalent to ` PYTHONPYCACHEPREFIX `
148
+
149
+ > *Summary*: ` .pyc ` files are largely managed automatically by the runtime in a
150
+ > manner compatible with CPython. Like on CPython there are options to specify
151
+ > their location and if they should be written at all and both of these options
152
+ > can be changed by guest code.
153
+
154
+ > **Important**: By default a GraalPython context will not enable writing ` .pyc `
155
+ > files. The ` graalpython ` launcher enables it by default, but if this is
156
+ > desired in the embedding use case, care should be taken to ensure that the
157
+ > ` __pycache__ ` location is properly managed and the files in that location are
158
+ > secured against manipulation just like the source ` .py ` files they were
159
+ > derived from.
160
+
161
+ > **Important**: When upgrading application sources or GraalPython, old ` .pyc `
162
+ > files must be removed by the embedder as required.
163
+
164
+ ## Security Considerations
72
165
73
- * evironment variable: PYTHONDONTWRITEBYTECODE - If this is set to a non-empty string,
74
- Python won’t try to write .pyc files on the import of source modules.
75
- * command line option: -B, If given, Python won’t try to write .pyc files on
76
- the import of source modules.
77
- * in a code: setting attribute ` dont_write_bytecode ` of ` sys ` built in module
166
+ The serialization of SST and scope tree is hand written and during
167
+ deserialization it is not possible to load classes other than SSTNodes. We do not
168
+ use Java serialization or other frameworks to serialize Java objects. The main
169
+ reason was performance, but this has the effect that no class loading can be
170
+ forced by a maliciously crafted ` .pyc ` file.
78
171
172
+ All file operations (obtaining the data, timestamps, and writing ` .pyc ` files)
173
+ are done through the Truffle FileSystem API. Embedders can modify all of these
174
+ operations by means of custom (e.g. read-only) FileSystem implementations. The
175
+ embedder can also effectively disable the creation of ` .pyc ` files by disabling
176
+ I/O permissions for GraalPython.
79
177
80
- ## Security
81
- The serialization of SST and scope tree is hand written and during deserialization
82
- is not possible to load other classes then SSTNodes. It doesn't use Java serialization
83
- or other framework to serialize Java object. The main reason was performance.
84
- The performance can be maximize in this way. The next reason was the security.
178
+ If the ` .pyc ` files are not readable, their location is not writable, or the
179
+ ` .pyc ` files' serialization data or magic numbers are corrupted in any way, the
180
+ deserialization fails and we just parse the ` .py ` file again. This comes with a
181
+ minor performance hit * only* for the parsing of modules, which should not be
182
+ significant for most applications (provided the application does actual work
183
+ besides loading Python code).
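
The fallback behavior described above can be sketched as follows. This is a simplified illustration rather than GraalPython's code: `load_tree` and its three callables are hypothetical stand-ins for reading the cached file, deserializing it, and parsing the source.

```python
from typing import Callable, Optional, TypeVar

Tree = TypeVar("Tree")


def load_tree(read_cache: Callable[[], Optional[bytes]],
              deserialize: Callable[[bytes], Tree],
              parse_source: Callable[[], Tree]) -> Tree:
    """Prefer the cached tree; fall back to parsing on any cache problem."""
    try:
        data = read_cache()  # may raise OSError if the file is unreadable
        if data is not None:
            return deserialize(data)  # may raise ValueError if corrupted
    except (OSError, ValueError):
        pass  # an unusable cache is not an error, only a slower path
    return parse_source()
```

Under this design a missing, unreadable, or corrupted cache file only costs the extra parsing time; it never prevents the module from loading.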