Commit fa42168

[GR-24160] Do not write bytecode by default when embedded
PullRequest: graalpython/1028
2 parents 151042d + 962b439 commit fa42168

4 files changed, with 189 additions and 90 deletions

docs/contributor/PARSER_DETAILS.md

Lines changed: 0 additions & 84 deletions
This file was deleted.

docs/user/PARSER_DETAILS.md

Lines changed: 183 additions & 0 deletions
@@ -0,0 +1,183 @@
1+
# Python Code Parsing and pyc Files
2+
3+
This document elaborates on various things to consider about how Python files in
4+
the GraalPython implementation are parsed.
5+
6+
## Parser Performance
7+
8+
Creating the Truffle tree for a Python source has two phases. The first one
9+
creates a simple syntax tree (SST) and a scope tree, the second phase transforms
10+
the SST to the Truffle tree and for the transformation we need the scope
11+
tree. The scope tree contains scope locations for variable and function
12+
definitions and information about scopes. The simple syntax tree contains nodes
13+
mirroring the source. Comparing the SST and Truffle tree, the SST is much
14+
smaller. It contains just the nodes representing the source in a simple way. One
15+
SST node is usually translated to many Truffle nodes.
16+
17+
The simple syntax tree can be created in two ways: with ANTLR parsing or
18+
deserialization from an appropriate `*.pyc` file. If there is no appropriate
19+
`.pyc` file for a source, then the source is parsed with ANTLR. If the Python
20+
standard import logic finds an appropriate `.pyc` file, it will just trigger
21+
deserialization of the SST and scope tree from it. The deserialization is much
22+
faster than source parsing with ANTLR and needs only roughly 30% of the time
23+
that ANTLR needs. Of course the first import of a new file is a little bit
24+
slower - besides parsing with ANTLR, the Python standard library import logic
25+
serializes the resulting code object to a `.pyc` file, which in our case means
26+
the SST and scope tree are serialized to such a file.
27+
28+
> *Summary*: Loading code from serialized `.pyc` files is faster than parsing
29+
> the `.py` file using ANTLR.
30+
31+
## Creating and Managing pyc Files
32+
33+
When a Python source file (module) is imported during an execution for the first
34+
time, then the appropriate `.pyc` file is created automatically. If the same
35+
module is imported again, then the already created `.pyc` file is used. That
36+
means that there are no `.pyc` files for source files that were not executed
37+
(imported) yet. The creation of `.pyc` files is done entirely through the
38+
Truffle FileSystem API, so that embedders can manage the file system access.
39+
40+
Every subsequent execution of a script will reuse the already existing `.pyc`
41+
files or will generate a new one. A `.pyc` file is regenerated if the timestamp
42+
or hashcode of the original source file is changed. The hashcode is generated
43+
only based on the Python source by calling `source.hashCode()`, which is the JDK
44+
hash code over the array of source file bytes, calculated with
45+
`java.util.Arrays.hashCode(byte[])`. The `.pyc` files are also regenerated if a
46+
magic number in the GraalPython parser is changed. The magic number is
47+
hard-coded in the GraalPython source and cannot be changed by the user (unless
48+
that user has access to the bytecode of GraalPython, in which case they can do
49+
anything they want already). The developers of GraalPython change the magic
50+
number when the format of SST or scope tree binary data is altered. This is an
51+
implementation detail, so the magic number does not have to correspond to the
52+
version of GraalPython (just like in CPython). The magic number of pyc is a
53+
function of the concrete GraalPython Java code that is running.
54+
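The regeneration rule described above can be sketched in Python. This is only an illustration under stated assumptions, not GraalPython's actual implementation: `CacheHeader` and `needs_regeneration` are hypothetical names, and how the timestamp and hash are really stored is not specified in this document. The `java_bytes_hash` helper mirrors the documented behaviour of `java.util.Arrays.hashCode(byte[])` (start at 1, multiply by 31, add each signed byte, wrap to a 32-bit int).

```python
from dataclasses import dataclass


def java_bytes_hash(data: bytes) -> int:
    """Mirror java.util.Arrays.hashCode(byte[]) for a non-null array."""
    result = 1
    for b in data:
        signed = b - 256 if b > 127 else b  # Java bytes are signed
        result = (31 * result + signed) & 0xFFFFFFFF  # wrap like a Java int
    # Reinterpret the unsigned 32-bit value as a signed Java int.
    return result - (1 << 32) if result >= (1 << 31) else result


@dataclass
class CacheHeader:
    """Hypothetical record of what a cached file remembers about its source."""
    magic: int
    source_mtime: int
    source_hash: int


def needs_regeneration(header: CacheHeader, current_magic: int,
                       source_mtime: int, source_bytes: bytes) -> bool:
    """Return True if any of the three triggers described above fires."""
    return (header.magic != current_magic
            or header.source_mtime != source_mtime
            or header.source_hash != java_bytes_hash(source_bytes))
```

A cached file whose header still matches the running parser's magic number and the source's timestamp and hash would be reused; a change to any one of the three would cause a rewrite.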
55+
> *Summary*: `.pyc` files are created automatically by the GraalPython runtime
56+
> when no or an invalid `.pyc` file is found matching the desired `.py` file.
57+
58+
> **Important**: If you use `.pyc` files, you will need to allow write-access to
59+
> the GraalPython runtime at least when switching versions or changing the
60+
> original source code - otherwise, the regeneration of `.pyc` files will fail
61+
> and every import will have the overhead of accessing the old `.pyc` file,
62+
> parsing the code, serializing it, and trying (and failing) to write out a new
63+
> `.pyc` file.
64+
65+
A `*.pyc` file is never deleted by GraalPython, only regenerated. It is
66+
regenerated when the appropriate source file is changed (timestamp of last
67+
modification or hashcode of the content) or the magic number of the GraalPython
68+
parser changes. Magic number changes will be communicated in the release notes
69+
so that embedders or system administrators can delete old `.pyc` files when
70+
upgrading.
71+
72+
The folder structure created for `.pyc` files looks like this:
73+
74+
```
75+
top_folder
76+
__pycache__
77+
sourceA.graalpython.pyc
78+
sourceB.graalpython.pyc
79+
sourceA.py
80+
sourceB.py
81+
sub_folder
82+
__pycache__
83+
sourceX.graalpython.pyc
84+
sourceX.py
85+
```
86+
87+
By default the `__pycache__` directory is created on the same directory level of
88+
a source code file and in this directory all `.pyc` files from the same
89+
directory are stored. This folder may store `.pyc` files created with different
90+
versions of Python (including e.g. CPython), so the user may see files ending in
91+
`*.cpython-36.pyc` for example.
92+
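To see where the standard import machinery would place a cached file, the standard library function `importlib.util.cache_from_source` can be queried. This is a small sketch: the source path is hypothetical, and the tag embedded in the result comes from `sys.implementation.cache_tag`, so it differs between interpreters (the layout above suggests `graalpython` on GraalPython, while CPython uses tags such as `cpython-36`).

```python
import importlib.util
import os
import sys

# Hypothetical source file; it does not need to exist for this query.
source_path = os.path.join("top_folder", "sourceA.py")

# Ask the import machinery for the matching cache file location.
cache_path = importlib.util.cache_from_source(source_path)

print(sys.implementation.cache_tag)  # interpreter-specific tag
print(cache_path)                    # location of the cached file
```

If `sys.pycache_prefix` is set, the returned path is expected to lie under that prefix instead of a `__pycache__` directory next to the source.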
93+
The current implementation also includes a copy of the original source text in
94+
the `.pyc` file. This is a minor performance optimization so we can create a
95+
Truffle `Source` object with the path to the original source file, but we do not
96+
need to read the original `*.py` file, which speeds up the process of obtaining the
97+
Truffle tree (we read just one file). The structure of a `.graalpython.pyc` file
98+
is this:
99+
100+
```
101+
MAGIC_NUMBER
102+
source text
103+
binary data - scope tree
104+
binary data - simple syntax tree
105+
```
106+
107+
> **Important**: `.pyc` files are not an effective means to hide Python library
108+
> source code from guest code, since the original source can still be recovered
109+
> and even if the source were omitted, the syntax tree contains enough
110+
> information to decompile into source code easily.
111+
112+
The serialized SST and scope tree are stored in a Python `code` object as well,
113+
as the content of the attribute `co_code` (which contains bytecode on CPython).
114+
115+
For example:
116+
```
117+
>>> def add(x, y):
118+
... return x+y
119+
...
120+
>>> add.__code__.co_code
121+
b'\x01\x00\x00\x02[]K\xbf\xd1\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00 ...'
122+
```
123+
124+
The creation of `*.pyc` files can be controlled in the same ways as on CPython
125+
(cf. https://docs.python.org/3/using/cmdline.html):
126+
127+
* The GraalPython launcher (`graalpython`) reads the `PYTHONDONTWRITEBYTECODE`
128+
environment variable - if this is set to a non-empty string, Python will not
129+
try to write `.pyc` files when importing modules.
130+
* The launcher command line option `-B`, if given, has the same effect as the
131+
above.
132+
* Guest language code can change the attribute `dont_write_bytecode` of the
133+
`sys` built-in module at runtime to change the behavior for subsequent
134+
imports.
135+
* The launcher reads the `PYTHONPYCACHEPREFIX` environment variable - if set,
136+
the `__pycache__` directory will be created at the path pointed to by the
137+
prefix, and a mirror of the directory structure of the source tree will be
138+
created on-demand to house the `.pyc` files.
139+
* Guest language code can change the attribute `pycache_prefix` of the `sys`
140+
module at runtime to change the location for subsequent imports.
141+
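The guest-code controls listed above can be exercised with a few lines of Python. The sketch below uses only the standard `sys` attributes named in the list; the helper name `bytecode_writing_disabled` is hypothetical and exists only for illustration.

```python
import contextlib
import sys


@contextlib.contextmanager
def bytecode_writing_disabled():
    """Temporarily set sys.dont_write_bytecode for imports in the block."""
    previous = sys.dont_write_bytecode
    sys.dont_write_bytecode = True
    try:
        yield
    finally:
        sys.dont_write_bytecode = previous  # restore the earlier setting


with bytecode_writing_disabled():
    # Imports performed here should not write new .pyc files.
    import colorsys

# None means __pycache__ directories next to the sources are used.
print(sys.pycache_prefix)
```

Restoring the previous value in a `finally` clause keeps the change local, so later imports behave as configured by the launcher or embedder.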
142+
Since the embedder cannot use environment variables or CPython command-line
143+
options to pass these settings to GraalPython, they are also exposed as the
144+
following language options:
145+
146+
* `python.DontWriteBytecodeFlag` - equivalent to `-B` or `PYTHONDONTWRITEBYTECODE`
147+
* `python.PyCachePrefix` - equivalent to `PYTHONPYCACHEPREFIX`
148+
149+
> *Summary*: `.pyc` files are largely managed automatically by the runtime in a
150+
> manner compatible with CPython. Like on CPython there are options to specify
151+
> their location and if they should be written at all and both of these options
152+
> can be changed by guest code.
153+
154+
> **Important**: By default a GraalPython context will not enable writing `.pyc`
155+
> files. The `graalpython` launcher enables it by default, but if this is
156+
> desired in the embedding use case, care should be taken to ensure that the
157+
> `__pycache__` location is properly managed and the files in that location are
158+
> secured against manipulation just like the source `.py` files they were
159+
> derived from.
160+
161+
> **Important**: When upgrading application sources or GraalPython, old `.pyc`
162+
> files must be removed by the embedder as required.
163+
164+
## Security Considerations
165+
166+
The serialization of SST and scope tree is hand written and during
167+
deserialization it is not possible to load other classes than SSTNodes. We do not
168+
use Java serialization or other frameworks to serialize Java objects. The main
169+
reason was performance, but this has the effect that no class loading can be
170+
forced by a maliciously crafted `.pyc` file.
171+
172+
All file operations (obtaining the data, timestamps, and writing `pyc` files)
173+
are done through the Truffle FileSystem API. Embedders can modify all of these
174+
operations by means of custom (e.g. read-only) FileSystem implementations. The
175+
embedder can also effectively disable the creation of `.pyc` files by disabling
176+
I/O permissions for GraalPython.
177+
178+
If the `.pyc` files are not readable, their location is not writable, or the
179+
`.pyc` files' serialization data or magic numbers are corrupted in any way, the
180+
deserialization fails and we just parse the `.py` file again. This comes with a
181+
minor performance hit *only* for the parsing of modules, which should not be
182+
significant for most applications (provided the application does actual work
183+
besides loading Python code).
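
The fallback described above (use the cached data if it is valid, otherwise parse the source again) can be sketched as follows. This is an analogy rather than GraalPython's code: CPython's `marshal` module stands in for the hand-written SST serialization, `MAGIC` is a made-up marker, and `serialize` and `load_or_parse` are hypothetical names.

```python
import marshal
import types

MAGIC = b"DEMO"  # made-up marker, not GraalPython's real magic number


def serialize(code):
    """Prefix the marshalled code object with the magic marker."""
    return MAGIC + marshal.dumps(code)


def load_or_parse(source_text, cached):
    """Return (code, origin), where origin is "cache" or "parsed"."""
    if cached is not None and cached.startswith(MAGIC):
        try:
            code = marshal.loads(cached[len(MAGIC):])
            if isinstance(code, types.CodeType):
                return code, "cache"
        except (EOFError, ValueError, TypeError):
            pass  # corrupted payload: fall through to parsing
    # Missing, unreadable, or invalid cache: parse the source again.
    return compile(source_text, "<demo>", "exec"), "parsed"


source = "x = 1 + 2"
cached = serialize(compile(source, "<demo>", "exec"))
print(load_or_parse(source, cached)[1])
print(load_or_parse(source, b"junk")[1])
```

Either path yields an executable code object, so a bad cache costs only the extra parse.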

graalpython/com.oracle.graal.python/src/com/oracle/graal/python/builtins/modules/SysModuleBuiltins.java

Lines changed: 5 additions & 5 deletions
@@ -51,9 +51,6 @@
 import java.util.Date;
 import java.util.List;

-import com.oracle.graal.python.util.OverflowException;
-import org.graalvm.nativeimage.ImageInfo;
-
 import com.oracle.graal.python.PythonLanguage;
 import com.oracle.graal.python.builtins.Builtin;
 import com.oracle.graal.python.builtins.CoreFunctions;
@@ -85,6 +82,7 @@
 import com.oracle.graal.python.runtime.PythonOptions;
 import com.oracle.graal.python.runtime.exception.PException;
 import com.oracle.graal.python.runtime.object.PythonObjectFactory;
+import com.oracle.graal.python.util.OverflowException;
 import com.oracle.graal.python.util.PythonUtils;
 import com.oracle.truffle.api.CompilerDirectives.TruffleBoundary;
 import com.oracle.truffle.api.Truffle;
@@ -99,6 +97,8 @@
 import com.oracle.truffle.api.library.CachedLibrary;
 import com.oracle.truffle.api.profiles.ConditionProfile;

+import org.graalvm.nativeimage.ImageInfo;
+
 @CoreFunctions(defineModule = "sys")
 public class SysModuleBuiltins extends PythonBuiltins {
     private static final String LICENSE = "Copyright (c) Oracle and/or its affiliates. Licensed under the Universal Permissive License v 1.0 as shown at http://oss.oracle.com/licenses/upl.";
@@ -193,13 +193,13 @@ public void postInitialize(PythonCore core) {
             sys.setAttribute("executable", context.getOption(PythonOptions.Executable));
             sys.setAttribute("_base_executable", context.getOption(PythonOptions.Executable));
         }
-        sys.setAttribute("dont_write_bytecode", ImageInfo.inImageBuildtimeCode() || context.getOption(PythonOptions.DontWriteBytecodeFlag));
+        sys.setAttribute("dont_write_bytecode", context.getOption(PythonOptions.DontWriteBytecodeFlag));
         String pycachePrefix = context.getOption(PythonOptions.PyCachePrefix);
         sys.setAttribute("pycache_prefix", pycachePrefix.isEmpty() ? PNone.NONE : pycachePrefix);
         sys.setAttribute("__flags__", core.factory().createTuple(new Object[]{
                         false, // bytes_warning
                         !context.getOption(PythonOptions.PythonOptimizeFlag), // debug
-                        ImageInfo.inImageBuildtimeCode() || context.getOption(PythonOptions.DontWriteBytecodeFlag), // dont_write_bytecode
+                        context.getOption(PythonOptions.DontWriteBytecodeFlag), // dont_write_bytecode
                         false, // hash_randomization
                         context.getOption(PythonOptions.IgnoreEnvironmentFlag), // ignore_environment
                         context.getOption(PythonOptions.InspectFlag), // inspect

graalpython/com.oracle.graal.python/src/com/oracle/graal/python/runtime/PythonOptions.java

Lines changed: 1 addition & 1 deletion
@@ -107,7 +107,7 @@ private PythonOptions() {
     public static final OptionKey<Boolean> IsolateFlag = new OptionKey<>(false);

     @Option(category = OptionCategory.USER, help = "Equivalent to the Python -B flag. Don't write bytecode files.", stability = OptionStability.STABLE) //
-    public static final OptionKey<Boolean> DontWriteBytecodeFlag = new OptionKey<>(false);
+    public static final OptionKey<Boolean> DontWriteBytecodeFlag = new OptionKey<>(true);

     @Option(category = OptionCategory.USER, help = "If this is set, GraalPython will write .pyc files in a mirror directory tree at this path, " +
                     "instead of in __pycache__ directories within the source tree. " +
