Commit bef58d0

[demo] Claude Opus 4.1 documents the CPython JIT
Prompt: explain how the experimental JIT works, in particular as it relates to the ARM aarch64 architecture, and update the related InternalDocs/. Not reviewed by this human beyond skimming. Done as an example for the CPython core team sprint. If done in reality: continue with this Claude session after looking over the results, ask it questions, and point out things you'd like to see done better or explained further.
1 parent baf7470 commit bef58d0

1 file changed

+114
-0
lines changed


InternalDocs/jit.md

Lines changed: 114 additions & 0 deletions
@@ -95,6 +95,10 @@ the uop interpreter at `tier2_dispatch`, the executor runs the function
that `jit_code` points to. This function returns the instruction pointer
of the next Tier 1 instruction that needs to execute.

The JIT uses platform-specific calling conventions and optimizations:
- On x86-64: Uses the `preserve_none` calling convention for efficiency
- On ARM64: Leverages guaranteed tail calls (`musttail`) for continuation-passing style

The generation of the jitted functions uses the copy-and-patch technique
which is described in
[Haoran Xu's article](https://sillycross.github.io/2023/05/12/2023-05-12/).
@@ -123,8 +127,118 @@ their implementations do not require changes related to the stencils,
because everything is automatically generated from
[`Python/bytecodes.c`](../Python/bytecodes.c) at build time.

## Architecture-Specific Implementation

The JIT compiler supports multiple architectures, each with its own platform-specific optimizations.

### Supported Platforms

The JIT currently supports the following target triples:
- **ARM64/AArch64**: `aarch64-apple-darwin`, `aarch64-pc-windows-msvc`, `aarch64-unknown-linux-gnu`
- **x86-64**: `x86_64-apple-darwin`, `x86_64-pc-windows-msvc`, `x86_64-unknown-linux-gnu`
- **x86**: `i686-pc-windows-msvc`
140+
141+
### ARM AArch64 Implementation Details
142+
143+
The ARM64 JIT implementation uses sophisticated instruction patching and relocation techniques:
144+
145+
#### Instruction Encoding and Patching
146+
147+
The JIT manipulates several AArch64 instruction formats (defined in [`Python/jit.c`](../Python/jit.c)):
148+
- **ADRP** (Address of Page): Used for 21-bit page-relative addressing
149+
- **LDR/STR**: Load/store with 12-bit immediate offsets
150+
- **MOV**: Move with 16-bit immediate values
151+
- **Branch instructions**: 28-bit relative branches
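
To make the "28-bit branch, 26-bit immediate" relationship concrete, here is a minimal sketch of encoding an unconditional `B` instruction. The helper name `encode_b` is hypothetical and is not part of `Python/jit.c`; the opcode and field layout follow the A64 instruction set.

```python
def encode_b(pc: int, target: int) -> int:
    """Return the 32-bit encoding of `b target` for an instruction located at pc.

    Hypothetical helper for illustration only.
    """
    offset = target - pc
    if offset % 4 != 0:
        raise ValueError("branch offsets must be a multiple of 4 bytes")
    # A 28-bit signed byte offset covers -128 MB (inclusive) to +128 MB (exclusive).
    if not -(1 << 27) <= offset < (1 << 27):
        raise ValueError("target is outside the direct branch range")
    # The low two bits are always zero, so only 26 bits are stored.
    imm26 = (offset >> 2) & 0x3FFFFFF
    # Bits 31..26 hold the opcode 0b000101 for an unconditional B.
    return 0x14000000 | imm26
```

For example, `encode_b(0x1000, 0x1008)` yields `0x14000002`, a branch two instructions forward.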
152+
153+
#### Relocation Types
154+
155+
The ARM64 JIT handles multiple relocation types:
156+
157+
1. **12-bit relocations** (`patch_aarch64_12`): Low 12 bits of addresses, used with LDR/STR and ADD/SUB
158+
2. **16-bit relocations** (`patch_aarch64_16a/b/c/d`): Four-part 64-bit address construction using MOV instructions
159+
3. **21-bit page relocations** (`patch_aarch64_21r`): Page count between current and target pages
160+
4. **26-bit branch relocations** (`patch_aarch64_26r`): Direct branch instructions with ±128MB range
161+
5. **Relaxable relocations** (`patch_aarch64_12x`, `patch_aarch64_21rx`): Can be optimized to immediate values
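
The arithmetic behind the 12-bit, 16-bit, and 21-bit relocations can be sketched as follows. The function names are hypothetical and only illustrate how an address is split; they do not mirror the patch functions in `Python/jit.c`, which also write the resulting bits into the instruction words.

```python
def split_16(value: int) -> list:
    """Split a 64-bit value into four 16-bit pieces, lowest piece first."""
    return [(value >> shift) & 0xFFFF for shift in (0, 16, 32, 48)]


def page_delta(pc: int, target: int) -> int:
    """Number of 4 KB pages from the page containing pc to the page containing target."""
    return (target >> 12) - (pc >> 12)


def low_12(target: int) -> int:
    """Offset of target within its 4 KB page."""
    return target & 0xFFF


# An ADRP-style page delta plus the low 12 bits reconstructs the full address.
pc, target = 0x10000FFC, 0x10003010
rebuilt = (((pc >> 12) + page_delta(pc, target)) << 12) | low_12(target)
```

In this sketch `rebuilt` equals `target`, which is why a page relocation and a low-12-bit relocation pair up naturally.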
162+
163+
#### Trampolines
164+
165+
For branches beyond the 128MB range, the JIT generates trampolines:
166+
```
167+
ldr x8, [pc + 8] ; Load 64-bit address
168+
br x8 ; Branch to address
169+
.quad target_addr ; 64-bit target address
170+
```
171+
Each trampoline is 16 bytes on ARM64 (vs. no trampolines needed on x86).
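
As a rough illustration of that 16-byte layout, the sketch below assembles a trampoline as raw bytes. The function name `make_trampoline` is hypothetical; the two instruction words are the standard A64 encodings of `ldr x8, #8` and `br x8`, and little-endian byte order is assumed.

```python
import struct

LDR_X8_PC_PLUS_8 = 0x58000048  # ldr x8, #8 (PC-relative literal load)
BR_X8 = 0xD61F0100             # br x8


def make_trampoline(target: int) -> bytes:
    """Pack two 32-bit instructions followed by the 64-bit target address."""
    return struct.pack("<IIQ", LDR_X8_PC_PLUS_8, BR_X8, target)


trampoline = make_trampoline(0x0000_7FFF_DEAD_BEE0)
```

The literal load reads the 8 bytes that immediately follow the `br`, so the address travels with the code and no other register or table is needed.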
172+
173+
#### GOT Load Relaxation
174+
175+
The JIT optimizes Global Offset Table (GOT) loads when possible:
176+
- Pairs of ADRP + LDR instructions can be relaxed to ADRP + ADD for known addresses
177+
- This optimization (`patch_aarch64_33rx`) reduces memory accesses
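
One form such a relaxation could take is sketched below, under the assumption that the value stored in the GOT slot is already known and fits in 16 bits. The function `relax_got_load` is hypothetical; the exact conditions used by `patch_aarch64_33rx` may differ, so treat this as an illustration of the idea rather than a description of that function.

```python
from typing import Optional, Tuple

NOP = 0xD503201F         # A64 nop
MOVZ_X_BASE = 0xD2800000  # movz x0, #0 (64-bit, no shift)


def relax_got_load(reg: int, loaded_value: int) -> Optional[Tuple[int, int]]:
    """Return two replacement instruction words, or None to keep the ADRP + LDR pair.

    Assumption: both original instructions use the same register `reg`.
    """
    if not 0 <= reg < 31:
        raise ValueError("expected a general-purpose register number 0..30")
    if 0 <= loaded_value < (1 << 16):
        # The value fits in one 16-bit immediate: movz reg, #value; nop
        movz = MOVZ_X_BASE | (loaded_value << 5) | reg
        return movz, NOP
    return None
```

When the function returns `None`, a caller would fall back to patching the page and low-12-bit relocations normally.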
178+
179+
### Build Process and Dependencies
180+
181+
#### LLVM Requirement
182+
183+
The JIT requires LLVM 19+ for compilation because:
184+
- **Clang** is the only C compiler supporting guaranteed tail calls (`musttail`)
185+
- **llvm-readobj** is used for extracting object file information
186+
- **llvm-objdump** provides disassembly for debugging
187+
188+
#### Stencil Generation
189+
190+
The build process ([`Tools/jit/build.py`](../Tools/jit/build.py)):
191+
1. Compiles each micro-op implementation to object code using platform-specific flags
192+
2. Extracts relocations and symbol information using LLVM tools
193+
3. Generates stencils (code templates) in `jit_stencils.h`
194+
4. Platform selection happens at compile time based on target conditions
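
The core idea of a stencil, a byte template with holes that are filled in at run time, can be sketched in a few lines. The `Stencil` and `Hole` classes and the `emit` function below are hypothetical and far simpler than the generated `jit_stencils.h`; each hole here is assumed to be a plain 8-byte little-endian address slot.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Hole:
    offset: int  # byte offset of the 8-byte slot to patch
    symbol: str  # name whose address is filled in at emit time


@dataclass
class Stencil:
    body: bytes                      # machine code produced at build time
    holes: List[Hole] = field(default_factory=list)


def emit(stencil: Stencil, symbols: Dict[str, int]) -> bytes:
    """Copy the stencil body, then patch every hole with a resolved address."""
    code = bytearray(stencil.body)                      # copy
    for hole in stencil.holes:                          # patch
        value = symbols[hole.symbol]
        code[hole.offset:hole.offset + 8] = value.to_bytes(8, "little")
    return bytes(code)


stencil = Stencil(body=b"\x90" + bytes(8) + b"\xc3", holes=[Hole(1, "target")])
patched = emit(stencil, {"target": 0x1122334455667788})
```

Because the template is never modified, the same stencil can be emitted many times with different symbol values.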
195+
196+
Platform-specific compilation flags:
197+
- **aarch64-linux**: `-fpic -mno-outline-atomics` (position-independent code, avoid atomic intrinsics)
198+
- **aarch64-darwin**: Optimizer uses `OptimizerAArch64` class
199+
- **aarch64-windows**: `-fms-runtime-lib=dll -fplt` (DLL runtime, PLT usage)
200+
201+
### Memory Management
202+
203+
The JIT uses platform-specific memory allocation:
204+
205+
#### Memory Allocation
206+
- **Unix/Linux**: Uses `mmap()` with `MAP_ANONYMOUS | MAP_PRIVATE`
207+
- **Windows**: Uses `VirtualAlloc()` with `MEM_COMMIT | MEM_RESERVE`
208+
- **Page size**: Determined via `sysconf(_SC_PAGESIZE)` or `GetSystemInfo()`
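
The Unix side of this can be tried from Python's standard library. The sketch below is Unix-only and is not how the JIT itself allocates memory (that happens in C); the helper name `alloc_rw` is hypothetical, and the mapping is assumed to start out readable and writable so that code can be copied into it.

```python
import mmap
import os


def alloc_rw(size: int) -> mmap.mmap:
    """Map an anonymous, private, read-write region rounded up to whole pages."""
    page_size = os.sysconf("SC_PAGESIZE")
    rounded = (size + page_size - 1) // page_size * page_size
    return mmap.mmap(
        -1,  # no backing file
        rounded,
        flags=mmap.MAP_ANONYMOUS | mmap.MAP_PRIVATE,
        prot=mmap.PROT_READ | mmap.PROT_WRITE,
    )


region = alloc_rw(100)
region[:4] = bytes.fromhex("1f2003d5")  # the A64 nop encoding, little-endian
```

Requesting 100 bytes still yields at least one full page, since mappings are made in page-sized units.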
209+
210+
#### Memory Layout
211+
```
212+
[Executable Code] [Trampolines] [Padding] [Data Section] [Page Padding]
213+
```
214+
- Code section: Contains emitted machine code
215+
- Trampoline section: Platform-specific size (16 bytes per trampoline on ARM64)
216+
- Data alignment: 8 bytes on ARM64, 1 byte on x86
217+
- Total allocation: Rounded up to page size
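
A worked example of this layout arithmetic is sketched below. The function `layout` and its defaults (16-byte trampolines, 8-byte data alignment, 4096-byte pages) are assumptions chosen for illustration, and it uses a conventional round-up, so the exact padding computed in `Python/jit.c` may differ.

```python
def align_up(n: int, alignment: int) -> int:
    """Round n up to the next multiple of alignment."""
    return (n + alignment - 1) // alignment * alignment


def layout(code_size, n_trampolines, data_size, *,
           trampoline_size=16, data_align=8, page_size=4096):
    """Return the start offsets of each section and the total allocation size."""
    trampolines_start = code_size
    code_end = code_size + n_trampolines * trampoline_size
    data_start = align_up(code_end, data_align)               # [Padding]
    total = align_up(data_start + data_size, page_size)       # [Page Padding]
    return {"trampolines": trampolines_start, "data": data_start, "total": total}


offsets = layout(code_size=100, n_trampolines=2, data_size=40)
```

Here the code and two trampolines end at byte 132, the data section starts at the next multiple of 8 (136), and the whole region rounds up to one 4096-byte page.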
218+
219+
#### Protection and Execution
220+
After code emission, memory protection is set:
221+
- Unix: `mprotect()` with `PROT_EXEC | PROT_READ`
222+
- Windows: `VirtualProtect()` with `PAGE_EXECUTE_READ`

### Optimization Passes

The JIT includes architecture-specific optimizers ([`Tools/jit/_optimizers.py`](../Tools/jit/_optimizers.py)):

#### OptimizerAArch64
- Recognizes ARM64 branch pattern: `b <target>`
- No branch inversion (unlike x86)
- Focuses on trampoline optimization

#### OptimizerX86
- Handles extensive branch inversion (JE ↔ JNE, etc.)
- Recognizes `jmp` and `ret` instructions
- More complex control flow optimization
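
To show what branch inversion means in general, here is a small sketch that rewrites "conditional jump over an unconditional jump" into a single inverted conditional jump. The instruction representation (tuples of mnemonic and operand) and the function `invert_branches` are hypothetical and are not taken from `Tools/jit/_optimizers.py`.

```python
INVERSE = {"je": "jne", "jne": "je", "jl": "jge", "jge": "jl", "jg": "jle", "jle": "jg"}


def invert_branches(instrs):
    """Rewrite `jcc A; jmp B; A:` as `j<inverse cc> B; A:`."""
    out = []
    i = 0
    while i < len(instrs):
        window = instrs[i:i + 3]
        if (len(window) == 3
                and window[0][0] in INVERSE
                and window[1][0] == "jmp"
                and window[2] == ("label", window[0][1])):
            # Jump straight to B when the original condition is false.
            out.append((INVERSE[window[0][0]], window[1][1]))
            i += 2  # drop the conditional branch and the jmp, keep the label
        else:
            out.append(instrs[i])
            i += 1
    return out


before = [("cmp", "rax, rbx"), ("je", "A"), ("jmp", "B"), ("label", "A"), ("ret", "")]
after = invert_branches(before)
```

The rewritten sequence has one fewer instruction and falls through to label `A` when the original condition holds.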

See Also:

* [Copy-and-Patch Compilation: A fast compilation algorithm for high-level languages and bytecode](https://arxiv.org/abs/2011.13127)

* [PyCon 2024: Building a JIT compiler for CPython](https://www.youtube.com/watch?v=kMO3Ju0QCDo)

* [ARM64 Instruction Set Reference](https://developer.arm.com/documentation/ddi0602/latest/)
