brandtbucher
Member

@brandtbucher brandtbucher commented Feb 12, 2025

@pitrou pointed out that the JIT's stencils are bloated with zeroed bytes. Since we request fresh pages of memory for JIT code, it's guaranteed to be zeroed anyways, so we can save space in the file and operations at runtime by eliding the writes where appropriate.

Here's a before-and-after for one of our most common uops, _CHECK_VALIDITY (the first listing is before this change, the second is after):

void
emit__CHECK_VALIDITY(
    unsigned char *code, unsigned char *data, _PyExecutorObject *executor,
    const _PyUOpInstruction *instruction, jit_state *state)
{
    // 
    // _CHECK_VALIDITY.o:     file format elf64-x86-64
    // 
    // Disassembly of section .text:
    // 
    // 0000000000000000 <_JIT_ENTRY>:
    // 0: 48 8b 05 00 00 00 00          movq    (%rip), %rax            # 0x7 <_JIT_ENTRY+0x7>
    // 0000000000000003:  R_X86_64_REX_GOTPCRELX       _JIT_EXECUTOR-0x4
    // 7: f6 40 22 01                   testb   $0x1, 0x22(%rax)
    // b: 75 06                         jne     0x13 <_JIT_ENTRY+0x13>
    // d: ff 25 00 00 00 00             jmpq    *(%rip)                 # 0x13 <_JIT_ENTRY+0x13>
    // 000000000000000f:  R_X86_64_GOTPCRELX   _JIT_JUMP_TARGET-0x4
    // 13: ff 25 00 00 00 00             jmpq    *(%rip)                 # 0x19 <_JIT_ENTRY+0x19>
    // 0000000000000015:  R_X86_64_GOTPCRELX   _JIT_CONTINUE-0x4
    const unsigned char code_body[19] = {
        0x48, 0x8b, 0x05, 0x00, 0x00, 0x00, 0x00, 0xf6,
        0x40, 0x22, 0x01, 0x75, 0x06, 0xff, 0x25, 0x00,
        0x00, 0x00, 0x00,
    };
    // 0: EXECUTOR
    // 8: JUMP_TARGET
    const unsigned char data_body[16] = {
        0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
        0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
    };
    memcpy(data, data_body, sizeof(data_body));
    patch_64(data + 0x0, (uintptr_t)executor);
    patch_64(data + 0x8, state->instruction_starts[instruction->jump_target]);
    memcpy(code, code_body, sizeof(code_body));
    patch_x86_64_32rx(code + 0x3, (uintptr_t)data + -0x4);
    patch_x86_64_32rx(code + 0xf, (uintptr_t)data + 0x4);
}
void
emit__CHECK_VALIDITY(
    unsigned char *code, unsigned char *data, _PyExecutorObject *executor,
    const _PyUOpInstruction *instruction, jit_state *state)
{
    // 
    // _CHECK_VALIDITY.o:     file format elf64-x86-64
    // 
    // Disassembly of section .text:
    // 
    // 0000000000000000 <_JIT_ENTRY>:
    // 0: 48 8b 05 00 00 00 00          movq    (%rip), %rax            # 0x7 <_JIT_ENTRY+0x7>
    // 0000000000000003:  R_X86_64_REX_GOTPCRELX       _JIT_EXECUTOR-0x4
    // 7: f6 40 22 01                   testb   $0x1, 0x22(%rax)
    // b: 75 06                         jne     0x13 <_JIT_ENTRY+0x13>
    // d: ff 25 00 00 00 00             jmpq    *(%rip)                 # 0x13 <_JIT_ENTRY+0x13>
    // 000000000000000f:  R_X86_64_GOTPCRELX   _JIT_JUMP_TARGET-0x4
    // 13: ff 25 00 00 00 00             jmpq    *(%rip)                 # 0x19 <_JIT_ENTRY+0x19>
    // 0000000000000015:  R_X86_64_GOTPCRELX   _JIT_CONTINUE-0x4
    const unsigned char code_body[19] = {
        0x48, 0x8b, 0x05, 0x00, 0x00, 0x00, 0x00, 0xf6,
        0x40, 0x22, 0x01, 0x75, 0x06, 0xff, 0x25,
    };
    // 0: EXECUTOR
    // 8: JUMP_TARGET
    patch_64(data + 0x0, (uintptr_t)executor);
    patch_64(data + 0x8, state->instruction_starts[instruction->jump_target]);
    memcpy(code, code_body, sizeof(code_body));
    patch_x86_64_32rx(code + 0x3, (uintptr_t)data + -0x4);
    patch_x86_64_32rx(code + 0xf, (uintptr_t)data + 0x4);
}
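
To make the difference between the two listings concrete, here is a minimal sketch of the idea, not the PR's actual implementation. The helper names (`trimmed_length`, `emit_into_zeroed`, `elided_copy_matches_full_copy`) are hypothetical, and `calloc` stands in for the fresh, zero-filled pages that the JIT requests. It shows that dropping trailing zero bytes, and skipping the copy altogether for an all-zero body, leaves the destination with the same contents as a full copy would.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: length of `body` once trailing zero bytes are dropped. */
static size_t
trimmed_length(const unsigned char *body, size_t size)
{
    while (size > 0 && body[size - 1] == 0) {
        size--;
    }
    return size;
}

/* Hypothetical helper: copy a stencil body into memory that is already
   zero-filled, skipping the copy entirely when nothing non-zero remains.
   Returns the number of bytes actually written. */
static size_t
emit_into_zeroed(unsigned char *dest, const unsigned char *body, size_t size)
{
    size_t length = trimmed_length(body, size);
    if (length > 0) {
        memcpy(dest, body, length);
    }
    return length;
}

/* Returns 1 if eliding the zero writes yields the same bytes as a full copy. */
static int
elided_copy_matches_full_copy(const unsigned char *body, size_t size)
{
    assert(size > 0);
    /* Assumption: calloc stands in for a freshly mapped, zeroed page. */
    unsigned char *full = calloc(size, 1);
    unsigned char *elided = calloc(size, 1);
    if (full == NULL || elided == NULL) {
        free(full);
        free(elided);
        return 0;
    }
    memcpy(full, body, size);
    emit_into_zeroed(elided, body, size);
    int same = memcmp(full, elided, size) == 0;
    free(full);
    free(elided);
    return same;
}
```

Applied to the listings above, the 19-byte `code_body` trims to 15 meaningful bytes, and the all-zero `data_body` trims to nothing, which is why its `memcpy` can disappear.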

@brandtbucher brandtbucher added the labels skip news, interpreter-core (Objects, Python, Grammar, and Parser dirs), build (The build process and cross-build), and topic-JIT on Feb 12, 2025
@brandtbucher brandtbucher self-assigned this Feb 12, 2025
@pitrou
Member

pitrou commented Feb 12, 2025

> Since we request fresh pages of memory for JIT code, it's guaranteed to be zeroed anyways, so we can save space in the file and operations at runtime by eliding the writes where appropriate.

Note that even without that property, you could simply have issued a memset instead of copying from a statically-allocated area of zeros :)
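
As a rough illustration of that alternative (a sketch only; `zero_by_memcpy`, `zero_by_memset`, and `both_approaches_agree` are hypothetical names, not CPython functions), the two approaches below produce identical results, but only the first needs a block of zeros stored in the binary:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Zero `size` bytes by copying from a statically allocated block of zeros,
   which is roughly what the original stencil did with data_body. */
static void
zero_by_memcpy(unsigned char *dest, size_t size)
{
    static const unsigned char zeros[16] = {0};
    assert(size <= sizeof(zeros));
    memcpy(dest, zeros, size);
}

/* Zero `size` bytes directly; no block of zeros has to live in the binary. */
static void
zero_by_memset(unsigned char *dest, size_t size)
{
    memset(dest, 0, size);
}

/* Returns 1 if both approaches leave the buffers in the same state. */
static int
both_approaches_agree(void)
{
    unsigned char a[16];
    unsigned char b[16];
    /* Start from non-zero garbage so the zeroing is observable. */
    memset(a, 0xAA, sizeof(a));
    memset(b, 0xAA, sizeof(b));
    zero_by_memcpy(a, sizeof(a));
    zero_by_memset(b, sizeof(b));
    return memcmp(a, b, sizeof(a)) == 0;
}
```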

> Here's a before-and-after for one of our most common uops, _CHECK_VALIDITY:

It seems strange to have a dedicated µop doing just this :) Is there documentation for µops somewhere?

@brandtbucher
Member Author

brandtbucher commented Feb 12, 2025

> It seems strange to have a dedicated µop doing just this :)

Does it? The role of this uop is to quickly check a single bit of state to verify that our optimizer's assumptions still hold. This can happen in lots of different places (a single Py_DECREF can change the world), so it helps to have a small check for it that can be put anywhere.
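
In C terms, the check amounts to something like the sketch below. This is a simplified stand-in: `toy_executor`, `toy_outcome`, `toy_check_validity`, and `toy_invalidate` are hypothetical names, and the struct does not reflect the real executor layout. The bit test followed by a branch mirrors the `testb`/`jne` pair in the disassembly above, where a set bit continues to the next uop and a cleared bit jumps to the jump target.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the executor state, not CPython's real layout. */
typedef struct {
    uint8_t valid;  /* bit 0 is set while the optimizer's assumptions hold */
} toy_executor;

typedef enum {
    TOY_CONTINUE,     /* assumptions still hold: fall through to the next uop */
    TOY_JUMP_TARGET   /* assumptions broken: leave the optimized trace */
} toy_outcome;

/* Test a single bit and pick one of two exits. */
static toy_outcome
toy_check_validity(const toy_executor *executor)
{
    assert(executor != NULL);
    return (executor->valid & 1) ? TOY_CONTINUE : TOY_JUMP_TARGET;
}

/* Anything that breaks the optimizer's assumptions clears the bit. */
static void
toy_invalidate(toy_executor *executor)
{
    assert(executor != NULL);
    executor->valid &= (uint8_t)~1u;
}
```

Because the check is this small, it is cheap to place after any operation that might run arbitrary code.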

> Is there documentation for µops somewhere?

The general format and approach is documented in InternalDocs/jit.md. The individual uops aren't documented publicly, since they're a very unstable, low-level implementation detail of an experimental feature. If there's a real need to internally document each of the 296 uops we currently have, we can probably find the time to do it. But most of them are either simple enough to follow (like type or dictionary version checks), or are identical to a full bytecode instruction that's already documented.

Member

@savannahostrowski savannahostrowski left a comment

Smaller stencils 🎉

@TeamSpen210

In the stripped version, code_body is still set to have the original length. Looks like the format string wasn't updated?

@brandtbucher
Member Author

Yeah, that's expected. There are places where we use sizeof(code_body), and those are a bit more disruptive to change. I felt it wasn't worth it... the real wins come from saving space in the file, and removing entire memcpy calls.
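
The reason the shorter initializer is still correct is a standard C rule: when an array is declared with an explicit size and fewer initializers than elements, the remaining elements are zero-initialized. A small sketch using the bytes from the stripped listing (the `tail_is_zero` helper is hypothetical and exists only for illustration):

```c
#include <assert.h>
#include <stddef.h>

/* Declared with 19 elements but only 15 explicit initializers, as in the
   stripped listing; C zero-initializes the remaining four. */
static const unsigned char code_body[19] = {
    0x48, 0x8b, 0x05, 0x00, 0x00, 0x00, 0x00, 0xf6,
    0x40, 0x22, 0x01, 0x75, 0x06, 0xff, 0x25,
};

/* Hypothetical helper: returns 1 if every byte in [start, size) is zero. */
static int
tail_is_zero(const unsigned char *body, size_t start, size_t size)
{
    assert(start <= size);
    for (size_t i = start; i < size; i++) {
        if (body[i] != 0) {
            return 0;
        }
    }
    return 1;
}
```

Here `sizeof(code_body)` is still 19, so existing uses of `sizeof(code_body)` keep working unchanged while the source text gets shorter.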

@brandtbucher brandtbucher merged commit 05e89c3 into python:main Feb 13, 2025
66 checks passed