Replies: 5 comments 5 replies
-
Just a quick question -- is the issue here mostly the limited amount of IRAM? i.e. do you really need a JIT to solve your problem, rather than just a more effective way to use the IRAM you already have?

For example, the native emitter could continue to work as it does right now (i.e. during the compile phase, not at runtime/JIT), but store the generated code in regular RAM and copy it to IRAM as the function is executed (perhaps based on some sort of LRU policy). This is more complicated than it sounds because you'll likely need to do address relocations etc. (classic linker/loader stuff), but compared to implementing a JIT it's a lot simpler. It also wouldn't require making huge modifications to the VM and compiler.

The other thing to consider is that the ESP32 can execute code from flash. So making the native emitter able to emit to an XIP flash region might get you what you want too, without needing to worry about relocations (although you'd want to be more careful about how this is updated w.r.t. wear levelling etc.).

See also #8381 -- this doesn't support native code, but I suspect making native code work with it might be easier than implementing a JIT!
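Roughly, the LRU bookkeeping could look something like this (purely a sketch: none of these names exist in MicroPython, the real thing would live in C inside the VM, and `copy_and_relocate()` stands in for the linker/loader work):

```python
# Hypothetical sketch of LRU bookkeeping for native code blobs cached in IRAM.
class IRAMCodeCache:
    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.entries = []  # list of [fn_id, iram_addr, size], most recently used last

    def lookup(self, fn_id):
        for entry in self.entries:
            if entry[0] == fn_id:
                # Hit: move to the back so it is evicted last.
                self.entries.remove(entry)
                self.entries.append(entry)
                return entry[1]
        return None  # miss: caller must insert()

    def insert(self, fn_id, size, copy_and_relocate):
        # Evict least recently used blobs until the new one fits.
        while self.entries and self.used + size > self.capacity:
            _, _, old_size = self.entries.pop(0)
            self.used -= old_size
        iram_addr = copy_and_relocate(fn_id, size)  # assumed helper doing the memcpy + relocations
        self.entries.append([fn_id, iram_addr, size])
        self.used += size
        return iram_addr
```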
-
Hi @jimmo. Yes, lack of IRAM was a pretty big inspiration for this, but not the only one. I can't quite squeeze as much performance out of MicroPython as I wanted, and a JIT would give pretty large performance benefits, at the cost of significant RAM use (I have PSRAM). I have compared the speed of an ESP32 revision 1 without PSRAM and an ESP32 rev 3 with PSRAM, and with specialized builds the difference in executed code speed is not significant.

I could allocate a buffer in IRAM and add a trap for the InstrFetchProhibited exception, check if the accessed address is currently cached in DRAM/PSRAM, and then copy the text there if it exists. This could potentially be abused with intentional instruction fetches at DRAM addresses, so it would be possible to directly execute a payload if it was formatted correctly.

Or I could keep a "page table" with a list of current functions and which ones are currently swapped in to IRAM (and their offset addresses). This would remove the need for an exception handler and for resuming execution with the previous register context.

As for allocating flash pages for execution: this would be the best bet for ensuring the most code fits into an executable region, but there are design reasons why I'm trying to avoid excessive flash writes.

I probably should have explained this before, but I chose MicroPython for my console due to its beginner friendliness, its VFS capability, and its ability to execute bytecode from RAM. I froze the modules that are required for booting into the MicroPython binary, and the rest of the libraries that can be updated in the field are stored in the VFS partition of flash. All files are stored pre-compiled on the filesystem and most of them reside in encrypted virtual disks (like most of my system libraries are stored in a

Also, (if I decided to JIT all of the code while the system was running) if I were to inject a

I think for now I'll write a code analysis tool that picks up long for loops, potentially relocates them, and marks them as native or viper code depending on how hard they are to optimise (and get type information for). Then I'll test it out with the code swapping and see how much of a performance gain I get.
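As a rough illustration of that last idea (purely a sketch; the ast-based heuristic and the threshold below are made up, and a real tool would also need to gather type information for viper):

```python
# Host-side sketch: walk a module's AST and flag long loops as candidates
# for @micropython.native / @micropython.viper.
import ast
import sys

LOOP_WEIGHT_THRESHOLD = 8  # assumed cut-off, counted in AST nodes

def find_hot_loops(source, filename="<module>"):
    tree = ast.parse(source, filename)
    candidates = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.For, ast.While)):
            weight = sum(1 for _ in ast.walk(node)) - 1  # crude size metric
            if weight >= LOOP_WEIGHT_THRESHOLD:
                candidates.append((node.lineno, weight))
    return candidates

if __name__ == "__main__":
    with open(sys.argv[1]) as f:
        for lineno, weight in find_hot_loops(f.read(), sys.argv[1]):
            print("%s:%d loop weight %d -> consider native/viper" % (sys.argv[1], lineno, weight))
```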
-
Before you attempt to JIT I would recommend looking at the existing AOT (ahead-of-time) compiler (native/viper) and trying to measure whether it would give you enough of a performance boost. Also definitely look at #8381 (essentially dynamic freezing of code); that might be enough to alleviate your IRAM issues.
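For example, the same inner loop can be compared directly across the three emitters (the decorators and the `time.ticks_*` functions are existing MicroPython APIs; the loop body is just a stand-in for real hot code):

```python
# Compare the bytecode, native, and viper emitters on the same hot loop.
import micropython
import time

def sum_bytecode(n):
    s = 0
    for i in range(n):
        s += i
    return s

@micropython.native
def sum_native(n):
    s = 0
    for i in range(n):
        s += i
    return s

@micropython.viper
def sum_viper(n: int) -> int:
    s = 0
    for i in range(n):
        s += i
    return s

def bench(fn, n=10000):
    t0 = time.ticks_us()
    fn(n)
    return time.ticks_diff(time.ticks_us(), t0)

for fn in (sum_bytecode, sum_native, sum_viper):
    print(fn, bench(fn), "us")
```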
-
FWIW, would you consider using an ESP32-S2 or S3 in your design? Those allow putting executable sections into PSRAM.
-
I was also thinking about the gaming abilities of MicroPython. Putting a bit more thought into the algorithms lets you write things like fast triangle routines and the like. The following demo runs on stock MicroPython, meaning that the video driver and the 3D routines are written in MicroPython, and the result is not that bad: https://youtube.com/shorts/EcZD9xHFwBc

However, you see some stuttering every few seconds due to garbage collection. Calling garbage collection every frame, for example, makes things smoother but also much slower. GC is IMHO the major problem with gaming.

If performance is an issue in general you can of course always move stuff to the native side. For a gaming platform you could, for example, include video and audio drivers and things like sprite engines natively. This may result in a pretty fast gaming setup, leaving only the code of the specific game on the Python side.
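For instance, the per-frame collection pattern mentioned above looks roughly like this (`render_frame()` is just a placeholder for the game's own drawing code, and the 33 ms frame budget is an assumption):

```python
# Pay the GC cost once per frame instead of in unpredictable bursts.
import gc
import time

TARGET_FRAME_MS = 33  # assumed ~30 fps budget

def render_frame():
    # Placeholder for the game's actual update/draw work.
    pass

def main_loop():
    while True:
        t0 = time.ticks_ms()
        render_frame()
        gc.collect()  # short, predictable pause every frame
        dt = time.ticks_diff(time.ticks_ms(), t0)
        if dt < TARGET_FRAME_MS:
            time.sleep_ms(TARGET_FRAME_MS - dt)
```

If collecting every frame turns out to be too slow, `gc.threshold()` is another existing knob for tuning when automatic collections are triggered.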
-
I am aware that a JIT was previously mentioned in #4085, but I'm willing to work on this, so please hear me out.
I'm in the process of building a game console that utilizes MicroPython. I started over a year ago and I'm getting close to done, but I have a few issues. I've been puzzling for a while over how to make MicroPython faster. I know I can write C modules for certain functionality and APIs, but game developers only have access to natmod and the viper and native decorators. I could be wrong, but I noticed that the code compiled by the viper and native decorators sticks around long after it's needed. Also, I can only fit small amounts of code into IRAM, so when using large amounts of code that needs to run natively at the same time, you run out of RAM.
So, after a few weeks of thought (and a lot of time spent playing on emulated consoles), I finally got an idea: a JIT compiler. It will definitely increase memory usage and code size, but it would result in significant speed boosts in certain cases.
So, if I'm remembering correctly, a JIT consists of the following components:
This looks like a pretty complex system (and I believe the current compiler compiles a whole input file rather than a single function, though I could be wrong), and it would take a lot of time and effort to make a decent JIT.
I'm willing to do most if not all of the work on this. If you have any suggestions, @dpgeorge, please let me know. And if this JIT turns out to work pretty well, then I'll submit a PR to merge it into MicroPython as an optional component, because I'm sure most people don't need a JIT and it will take a lot of memory (especially IRAM).