| 1 | +This document describes how Python source files are parsed in the GraalPython implementation, |
| 2 | +that is, how we obtain a Truffle tree from a source. |
| 3 | + |
| 4 | +Creating a Truffle tree for a Python source has two phases. The first phase creates |
| 5 | +a simple syntax tree (SST) and a scope tree. The second phase transforms the SST into |
| 6 | +the Truffle tree, and this transformation needs the scope tree. The scope tree |
| 7 | +contains scope locations for variable and function definitions and information |
| 8 | +about scopes. The simple syntax tree contains nodes mirroring the source. |
| 9 | +Compared with the Truffle tree, the SST is much smaller. It contains just the nodes |
| 10 | +representing the source in a simple way. One SST node is usually translated |
| 11 | +to many Truffle nodes. |
| 12 | + |
| 13 | +The simple syntax tree can be created in two ways: by parsing with ANTLR |
| 14 | +or by deserialization from the appropriate `*.pyc` file. In both cases it is created together with |
| 15 | +the scope tree. If there is no appropriate `.pyc` file for a source, the source |
| 16 | +is parsed with ANTLR and the resulting SST and scope tree are serialized to the `.pyc` file. |
| 17 | +The next time, we don't have to use the ANTLR parser, because the result is already |
| 18 | +serialized in the `.pyc` file. So instead of parsing the source file with ANTLR, |
| 19 | +we just deserialize the SST and scope tree from the `.pyc` file. Deserialization |
| 20 | +is much faster than parsing the source with ANTLR: it needs roughly |
| 21 | +30% of the time that the ANTLR parser needs. Of course, the first run is a little |
| 22 | +slower, because we need to save the SST and scope tree to the `.pyc` file. |
| 23 | + |
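The parse-or-deserialize flow described above can be sketched in a few lines of Python. This is only an illustration, not the GraalPython code (which is written in Java): `load_trees` and `toy_parse` are hypothetical names, `toy_parse` stands in for the ANTLR parser, and JSON stands in for the hand-written binary serialization. The sketch also assumes that any existing cache file is valid and does no staleness check.

```python
import json
import tempfile
from pathlib import Path


def load_trees(source, parse):
    """Return (sst, scope_tree, from_cache) for a source file.

    `parse` is a hypothetical stand-in for the ANTLR-based parser; JSON is a
    stand-in for the hand-written binary serialization.
    """
    source = Path(source)
    cache = source.parent / "__pycache__" / (source.stem + ".graalpython.pyc")
    if cache.exists():
        # Fast path: deserialize both trees, no parsing needed.
        data = json.loads(cache.read_text(encoding="utf-8"))
        return data["sst"], data["scope_tree"], True
    # Slow path: parse, then serialize the result for the next run.
    sst, scope_tree = parse(source.read_text(encoding="utf-8"))
    cache.parent.mkdir(exist_ok=True)
    cache.write_text(json.dumps({"sst": sst, "scope_tree": scope_tree}),
                     encoding="utf-8")
    return sst, scope_tree, False


def toy_parse(text):
    # Toy "parser": one SST entry per non-empty line and an empty module scope.
    return [line for line in text.splitlines() if line.strip()], {"module": []}


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "sourceA.py"
    src.write_text("x = 1\n", encoding="utf-8")
    first = load_trees(src, toy_parse)   # parses and writes the cache file
    second = load_trees(src, toy_parse)  # served from the cache file
```

The first call pays for both parsing and writing the cache; the second call only reads the cache file.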
| 24 | +In the folder structure it looks like this: |
| 25 | + |
| 26 | +``` |
| 27 | +top_folder |
| 28 | +    __pycache__ |
| 29 | +        sourceA.graalpython.pyc |
| 30 | +        sourceB.graalpython.pyc |
| 31 | +    sourceA.py |
| 32 | +    sourceB.py |
| 33 | +    sub_folder |
| 34 | +        __pycache__ |
| 35 | +            sourceX.graalpython.pyc |
| 36 | +        sourceX.py |
| 37 | +``` |
| 38 | + |
| 39 | +The `__pycache__` directory is created on the same directory level as the source code file, |
| 40 | +and it stores all `*.pyc` files for the sources in that |
| 41 | +directory. There can also be files created by CPython, so the user may also see |
| 42 | +files with an extension such as `*.cpython-36.pyc`. |
| 43 | + |
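As a small illustration of this naming scheme, the helper below maps a source path to its cache path. `cache_path` is a hypothetical helper written for this example, and the `graalpython` tag is taken from the file names shown in the listing. In the standard library, `importlib.util.cache_from_source` performs this mapping for the running interpreter, using `sys.implementation.cache_tag`.

```python
import sys
from pathlib import Path


def cache_path(source, tag):
    """Map dir/name.py to dir/__pycache__/name.<tag>.pyc."""
    source = Path(source)
    return source.parent / "__pycache__" / f"{source.stem}.{tag}.pyc"


graal = cache_path("top_folder/sourceA.py", "graalpython")
nested = cache_path("top_folder/sub_folder/sourceX.py", "graalpython")

# The running interpreter reports its own tag here (it may be None).
current_tag = sys.implementation.cache_tag
```

Because the tag is part of the file name, cache files from different Python implementations can share one `__pycache__` directory without overwriting each other.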
| 44 | +The current implementation also includes a copy of the original source text in the `.pyc` file. |
| 45 | +The reason is that we create the Truffle Source object from this copy, with the path to the |
| 46 | +original source file, so we do not need to read the original `*.py` file. This |
| 47 | +speeds up obtaining the Truffle tree (we read just one file). |
| 48 | + |
| 49 | +The structure of a `.graalpython.pyc` file is this: |
| 50 | + |
| 51 | +``` |
| 52 | +MAGIC_NUMBER |
| 53 | +source text |
| 54 | +binary data - scope tree |
| 55 | +binary data - simple syntax tree |
| 56 | +``` |
| 57 | + |
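The exact byte layout is not specified here, so the following is only a sketch of how a container with these four parts could be written and read back. The `MAGIC` value, the 4-byte big-endian length prefixes, and the `pack`/`unpack` names are all assumptions made for illustration, not the real format.

```python
import struct

MAGIC = b"MAGC"  # placeholder value, not the real magic number


def pack(source_text, scope_bytes, sst_bytes):
    """Write the magic number, then three length-prefixed sections in file order."""
    out = bytearray(MAGIC)
    for section in (source_text.encode("utf-8"), scope_bytes, sst_bytes):
        out += struct.pack(">I", len(section)) + section
    return bytes(out)


def unpack(data):
    """Inverse of pack(); rejects data that does not start with MAGIC."""
    if data[:len(MAGIC)] != MAGIC:
        raise ValueError("bad magic number")
    pos, sections = len(MAGIC), []
    for _ in range(3):
        (size,) = struct.unpack_from(">I", data, pos)
        pos += 4
        sections.append(data[pos:pos + size])
        pos += size
    source, scope_bytes, sst_bytes = sections
    return source.decode("utf-8"), scope_bytes, sst_bytes


blob = pack("x = 1\n", b"scope", b"sst")
restored = unpack(blob)
```

Checking the magic number first lets a reader reject files written by another tool or another format version before it touches the rest of the data.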
| 58 | +The serialized SST and scope tree are also stored in the code object, in its `co_code` attribute. |
| 59 | + |
| 60 | +For example: |
| 61 | +``` |
| 62 | +>>> def add(x, y): |
| 63 | +... print('Running x+y') |
| 64 | +... return x+y |
| 65 | +... |
| 66 | +>>> co = add.__code__ |
| 67 | +>>> co.co_code |
| 68 | +b'\x01\x00\x00\x02[]K\xbf\xd1\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00 ...' |
| 69 | +``` |
| 70 | + |
| 71 | +Creating `*.pyc` files can be disabled or enabled in the same ways as in CPython: |
| 72 | + |
| 73 | + * environment variable: `PYTHONDONTWRITEBYTECODE` - If this is set to a non-empty string, |
| 74 | +Python won’t try to write `.pyc` files on the import of source modules. |
| 75 | + * command line option: `-B` - If given, Python won’t try to write `.pyc` files on |
| 76 | +the import of source modules. |
| 77 | + * in code: setting the `dont_write_bytecode` attribute of the built-in `sys` module |
| 78 | + |
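The three switches listed above can be checked with a short script. It starts the current interpreter through `sys.executable` and prints `sys.dont_write_bytecode`, which both `-B` and the environment variable set. The `PROBE` string and the variable names are only for this example.

```python
import os
import subprocess
import sys

PROBE = "import sys; print(sys.dont_write_bytecode)"

# Command line option -B.
with_flag = subprocess.run([sys.executable, "-B", "-c", PROBE],
                           capture_output=True, text=True).stdout.strip()

# Environment variable set to a non-empty string.
env = dict(os.environ, PYTHONDONTWRITEBYTECODE="1")
with_env = subprocess.run([sys.executable, "-c", PROBE],
                          capture_output=True, text=True, env=env).stdout.strip()

# Setting the attribute from code affects imports that happen afterwards.
sys.dont_write_bytecode = True
```

Note that the attribute only affects modules imported after it is set; `.pyc` files written earlier in the same process stay on disk.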
| 79 | + |
| 80 | +## Security |
| 81 | +The serialization of the SST and scope tree is hand-written, and during deserialization |
| 82 | +it is not possible to load classes other than SSTNodes. It doesn't use Java serialization |
| 83 | +or any other framework to serialize Java objects. The main reason was performance, |
| 84 | +which can be maximized this way. The other reason was security. |
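
The idea of a hand-written format that can only produce known node types can be illustrated with a toy reader and writer. This is a Python sketch of the principle, not the Java implementation: `NumberNode`, `AddNode`, and the tag values are hypothetical. The reader dispatches on a fixed set of tags, so the input data can never name an arbitrary class to instantiate.

```python
import struct


# Hypothetical node classes; the real SSTNode hierarchy is much larger.
class NumberNode:
    def __init__(self, value):
        self.value = value


class AddNode:
    def __init__(self, left, right):
        self.left, self.right = left, right


TAG_NUMBER, TAG_ADD = 1, 2


def write_node(node, out):
    """Append the binary form of `node` to the bytearray `out`."""
    if isinstance(node, NumberNode):
        out += struct.pack(">Bi", TAG_NUMBER, node.value)
    elif isinstance(node, AddNode):
        out.append(TAG_ADD)
        write_node(node.left, out)
        write_node(node.right, out)
    else:
        raise TypeError(f"not a serializable node: {type(node).__name__}")


def read_node(data, pos=0):
    """Return (node, next_pos); only the tags listed here can be decoded."""
    tag = data[pos]
    if tag == TAG_NUMBER:
        (value,) = struct.unpack_from(">i", data, pos + 1)
        return NumberNode(value), pos + 5
    if tag == TAG_ADD:
        left, pos = read_node(data, pos + 1)
        right, pos = read_node(data, pos)
        return AddNode(left, right), pos
    raise ValueError(f"unknown node tag {tag}")


buf = bytearray()
write_node(AddNode(NumberNode(1), NumberNode(2)), buf)
tree, end = read_node(bytes(buf))
```

A general-purpose object serializer, by contrast, usually stores class names in the data, which is what makes loading untrusted input risky.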