@arnaud-lb
Member

@arnaud-lb arnaud-lb commented Feb 18, 2025

This implements the technique described in https://blog.reverberate.org/2021/04/21/musttail-efficient-interpreters.html, which addresses the issues described in http://lua-users.org/lists/lua-l/2011-02/msg00742.html. Python recently implemented this, which resulted in a 9-15% performance improvement: https://blog.reverberate.org/2025/02/10/tail-call-updates.html.

It turns out that @dstogov already addressed these issues with a different technique, enabled when compiling with GCC, so this will not improve performance with that compiler, but it makes PHP on Clang as fast as on GCC.

Benchmarks

Zend/bench.php:

Benchmark 1: /tmp/gcc-base/sapi/cli/php -n -d zend_extension=opcache.so -d opcache.enable_cli=1 --repeat 10 Zend/bench.php
  Time (mean ± σ):      1.006 s ±  0.002 s    [User: 0.984 s, System: 0.020 s]
  Range (min … max):    1.003 s …  1.008 s    10 runs
 
Benchmark 2: /tmp/clang-base/sapi/cli/php -n -d zend_extension=opcache.so -d opcache.enable_cli=1 --repeat 10 Zend/bench.php
  Time (mean ± σ):      1.783 s ±  0.009 s    [User: 1.761 s, System: 0.019 s]
  Range (min … max):    1.771 s …  1.801 s    10 runs
 
Benchmark 3: /tmp/clang-tail/sapi/cli/php -n -d zend_extension=opcache.so -d opcache.enable_cli=1 --repeat 10 Zend/bench.php
  Time (mean ± σ):      1.017 s ±  0.003 s    [User: 0.998 s, System: 0.018 s]
  Range (min … max):    1.014 s …  1.023 s    10 runs
 
Summary
  /tmp/gcc-base/sapi/cli/php -n -d zend_extension=opcache.so -d opcache.enable_cli=1 --repeat 10 Zend/bench.php ran
    1.01 ± 0.00 times faster than /tmp/clang-tail/sapi/cli/php -n -d zend_extension=opcache.so -d opcache.enable_cli=1 --repeat 10 Zend/bench.php
    1.77 ± 0.01 times faster than /tmp/clang-base/sapi/cli/php -n -d zend_extension=opcache.so -d opcache.enable_cli=1 --repeat 10 Zend/bench.php

PHP/Clang was 77% slower than PHP/GCC in this benchmark; with this change it is only 1% slower.

Symfony Demo:

gcc-base:    mean:  0.5064;  stddev:  0.0008;  diff:  +0.00%
clang-base:  mean:  0.5344;  stddev:  0.0006;  diff:  +5.53%
clang-tail:  mean:  0.5017;  stddev:  0.0008;  diff:  -0.94%

PHP/Clang was 5% slower than PHP/GCC in this benchmark; with this change it is about 1% faster.

Current interpreter

The interpreter is generated by Zend/zend_vm_gen.php. The generator knows several modes, but the default (and only supported) one is the hybrid mode, which generates both a call-based interpreter and a GCC-specific interpreter. Which one is actually compiled depends on the compiler being used.

In the call-based interpreter, opcode handlers are separate functions, the next opline to execute is stored in execute_data, and execute_data is passed as an argument to the handlers:

void execute_ex() {
    while (1) {
        int ret = execute_data->opline->handler(execute_data);
        if (ret != 0) {
            // leave interpreter
        }
    }
}

// example op handler
int ZEND_INIT_FCALL_SPEC_CONST_HANDLER(zend_execute_data *execute_data) {
    // load opline
    const zend_op *opline = execute_data->opline;

    // instruction execution

    // dispatch
    // ZEND_VM_NEXT_OPCODE():
    execute_data->opline++;
    return 0; // ZEND_VM_CONTINUE()
}

Handlers typically load execute_data->opline, execute the operation, update execute_data->opline, and return.
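
To make the shape of this loop concrete, here is a minimal, self-contained sketch of a call-based interpreter. It is not PHP's code: toy_op, toy_frame, toy_add and toy_halt are hypothetical stand-ins for zend_op, zend_execute_data and real opcode handlers, and the "program" just adds constants to an accumulator.

```c
/* Toy stand-ins for zend_op / zend_execute_data (hypothetical, not PHP's types). */
typedef struct toy_frame toy_frame;
typedef int (*toy_handler)(toy_frame *frame);

typedef struct toy_op {
    toy_handler handler; /* function implementing this instruction */
    int operand;
} toy_op;

struct toy_frame {
    const toy_op *opline; /* next instruction, kept in memory */
    int acc;              /* accumulator the toy program works on */
};

/* Each handler loads opline from the frame, works, stores it back, and returns. */
static int toy_add(toy_frame *frame) {
    const toy_op *opline = frame->opline; /* load opline */
    frame->acc += opline->operand;        /* instruction execution */
    frame->opline = opline + 1;           /* store the next opline */
    return 0;                             /* continue */
}

static int toy_halt(toy_frame *frame) {
    (void) frame;
    return -1; /* leave the interpreter */
}

/* Dispatch loop: one indirect call and one return per instruction. */
int toy_execute(const toy_op *program) {
    toy_frame frame = { program, 0 };
    while (frame.opline->handler(&frame) == 0) {
        /* keep going until a handler asks to leave */
    }
    return frame.acc;
}

/* Runs ADD 2; ADD 40; HALT. */
int toy_run_demo(void) {
    const toy_op program[] = {
        { toy_add, 2 }, { toy_add, 40 }, { toy_halt, 0 },
    };
    return toy_execute(program);
}
```

Calling toy_run_demo() from a main() runs the three-instruction program. The load and store of frame->opline in every handler is the memory traffic referred to above.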

This has quite a lot of overhead: the call instruction pushes a return address on the stack, the function saves/spills registers, etc. For example, the code of ZEND_INIT_FCALL_SPEC_CONST_HANDLER() starts with:

push   %rbp
push   %r15
push   %r14
push   %rbx
push   %rax

Also, opline needs to be loaded/stored from/to memory.

The GCC interpreter manages to eliminate the overhead. opline->handler is a computed-goto target, which calls the actual handler. Hot handlers are inlined, FP/IP (execute_data/opline) are register variables, handlers take no arguments and have no return value:

void execute_ex() {
    goto opline->handler;
    ZEND_INIT_FCALL_SPEC_CONST_LABEL:
        ZEND_INIT_FCALL_SPEC_CONST_HANDLER(); // inlined
        goto opline->handler;
    ... (other handlers)
    ZEND_RETURN:
        // leave interpreter
}

void always_inline ZEND_INIT_FCALL_SPEC_CONST_HANDLER(void) {
    // opline is already in a register
    
    // instruction execution

    // dispatch
    // ZEND_VM_NEXT_OPCODE():
    opline++;
    return;
}
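
For readers unfamiliar with computed goto, the following is a small sketch of the dispatch style, under simplifying assumptions: the opcodes and toy_op are hypothetical, the label addresses live in a table indexed by opcode rather than in opline->handler, and opline is a plain local variable instead of a global register variable. It relies on the "labels as values" extension, which both GCC and Clang accept.

```c
/* Toy opcodes (hypothetical, not PHP's). */
enum { TOY_ADD, TOY_MUL, TOY_HALT };

typedef struct toy_op {
    int opcode;
    int operand;
} toy_op;

int toy_execute_goto(const toy_op *opline) {
    /* Labels as values: a GCC extension that Clang also accepts. */
    static const void *const labels[] = { &&do_add, &&do_mul, &&do_halt };
    int acc = 0;

#define TOY_DISPATCH() goto *labels[opline->opcode]

    TOY_DISPATCH();
do_add:
    acc += opline->operand; /* handler body, inlined into the interpreter */
    opline++;
    TOY_DISPATCH();
do_mul:
    acc *= opline->operand;
    opline++;
    TOY_DISPATCH();
do_halt:
    return acc; /* leave the interpreter */

#undef TOY_DISPATCH
}

/* Runs ADD 5; MUL 8; ADD 2; HALT. */
int toy_run_goto_demo(void) {
    const toy_op program[] = {
        { TOY_ADD, 5 }, { TOY_MUL, 8 }, { TOY_ADD, 2 }, { TOY_HALT, 0 },
    };
    return toy_execute_goto(program);
}
```

There is no call or return between instructions here: each handler body ends with an indirect jump to the next one, and acc and opline can stay in registers for the whole run.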

Changes

Here I add a variation of the call-based interpreter, enabled when using clang-19:

  • execute_data and opline are passed as op handler arguments, so they are always in registers unless they are spilled on the stack
  • handlers tail call the next opline handler: function call overhead is eliminated
  • handlers use the preserve_none calling convention: this reduces register saves/spills

void execute_ex() {
    execute_data->opline->handler(execute_data, execute_data->opline);
    // leave interpreter
}

__attribute__((preserve_none))
int ZEND_INIT_FCALL_SPEC_CONST_HANDLER(zend_execute_data *execute_data, const zend_op *opline) {
    // opline is already loaded

    // instruction execution

    // dispatch
    // ZEND_VM_NEXT_OPCODE():
    opline++;
    __attribute__((musttail)) return opline->handler(execute_data, opline);
}

The musttail attribute is used to force tail calling.
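
The following is a reduced sketch of tail-call dispatch, not the code generated by zend_vm_gen.php. All toy_* and TOY_* names are hypothetical. The attributes are only enabled when Clang reports support for them; with other compilers the macros expand to nothing, so the sketch still runs but the tail call is no longer guaranteed.

```c
/* Use the attributes only where Clang reports them; otherwise fall back to
 * plain calls (still correct here, but without a tail-call guarantee). */
#if defined(__clang__) && defined(__has_attribute)
# if __has_attribute(musttail)
#  define TOY_MUSTTAIL __attribute__((musttail))
# endif
# if __has_attribute(preserve_none)
#  define TOY_CC __attribute__((preserve_none))
# endif
#endif
#ifndef TOY_MUSTTAIL
# define TOY_MUSTTAIL
#endif
#ifndef TOY_CC
# define TOY_CC
#endif

typedef struct toy_op toy_op;
typedef struct toy_frame { int acc; } toy_frame;

/* Every handler shares this exact signature, as musttail requires. */
TOY_CC typedef int (*toy_handler)(toy_frame *frame, const toy_op *opline);

struct toy_op {
    toy_handler handler;
    int operand;
};

TOY_CC static int toy_add(toy_frame *frame, const toy_op *opline) {
    frame->acc += opline->operand; /* instruction execution */
    opline++;                      /* advance to the next opline */
    TOY_MUSTTAIL return opline->handler(frame, opline);
}

TOY_CC static int toy_halt(toy_frame *frame, const toy_op *opline) {
    (void) opline;
    return frame->acc; /* the only real return: goes back to toy_execute_tail */
}

/* Entry point: a single ordinary call starts the chain of tail calls. */
int toy_execute_tail(const toy_op *program) {
    toy_frame frame = { 0 };
    return program->handler(&frame, program);
}

/* Runs ADD 40; ADD 2; HALT. */
int toy_run_tail_demo(void) {
    const toy_op program[] = {
        { toy_add, 40 }, { toy_add, 2 }, { toy_halt, 0 },
    };
    return toy_execute_tail(program);
}
```

Because frame and opline are ordinary arguments, they travel from handler to handler in argument registers, and when the tail call is honored each dispatch is a jump rather than a call/return pair.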

Unfortunately, musttail rejects calls to functions whose signature is not compatible with the caller's, so it's not possible to tail call VM helpers that have extra parameters. Instead, we use a trampoline when calling these: the helper returns a struct{opline,handler} (in two registers), and the caller then tail calls the returned handler. Since helpers always return (unless they call other helpers), the stack doesn't grow indefinitely:

    // ZEND_VM_DISPATCH_TO_HELPER(zend_cannot_pass_by_ref_helper, _arg_num, arg_num, _arg, arg)
    zend_vm_trampoline t = zend_cannot_pass_by_ref_helper(arg_num, arg, execute_data, opline);
    __attribute__((musttail)) return t.handler(execute_data, t.opline);
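
As an illustration of the trampoline idea, here is a toy version in which a helper takes an extra parameter and hands the continuation back to its caller. The names (toy_trampoline, toy_scale_helper, toy_double) are hypothetical, the helper uses the default calling convention for simplicity, and the attributes are again only enabled under Clang.

```c
#if defined(__clang__) && defined(__has_attribute)
# if __has_attribute(musttail)
#  define TOY_MUSTTAIL __attribute__((musttail))
# endif
# if __has_attribute(preserve_none)
#  define TOY_CC __attribute__((preserve_none))
# endif
#endif
#ifndef TOY_MUSTTAIL
# define TOY_MUSTTAIL
#endif
#ifndef TOY_CC
# define TOY_CC
#endif

typedef struct toy_op toy_op;
typedef struct toy_frame { int acc; } toy_frame;
TOY_CC typedef int (*toy_handler)(toy_frame *frame, const toy_op *opline);
struct toy_op { toy_handler handler; int operand; };

/* What a helper hands back instead of dispatching itself. */
typedef struct toy_trampoline {
    const toy_op *opline;
    toy_handler handler;
} toy_trampoline;

/* Helper with an extra parameter: its signature differs from toy_handler,
 * so handlers cannot musttail into it. It returns the continuation instead. */
static toy_trampoline toy_scale_helper(int factor, toy_frame *frame, const toy_op *opline) {
    frame->acc *= factor;
    opline++;
    toy_trampoline t = { opline, opline->handler };
    return t;
}

TOY_CC static int toy_add(toy_frame *frame, const toy_op *opline) {
    frame->acc += opline->operand;
    opline++;
    TOY_MUSTTAIL return opline->handler(frame, opline);
}

TOY_CC static int toy_double(toy_frame *frame, const toy_op *opline) {
    toy_trampoline t = toy_scale_helper(2, frame, opline); /* ordinary call, returns */
    TOY_MUSTTAIL return t.handler(frame, t.opline);        /* tail call the continuation */
}

TOY_CC static int toy_halt(toy_frame *frame, const toy_op *opline) {
    (void) opline;
    return frame->acc;
}

/* Runs ADD 21; DOUBLE; HALT. */
int toy_run_trampoline_demo(void) {
    const toy_op program[] = {
        { toy_add, 21 }, { toy_double, 0 }, { toy_halt, 0 },
    };
    toy_frame frame = { 0 };
    return program->handler(&frame, program);
}
```

The helper's stack frame is gone by the time toy_double dispatches, so the only cost of the mismatch in signatures is one ordinary call and return.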

I introduce a ZEND_VM_DISPATCH() macro that is used by ZEND_VM_NEXT_OPCODE() and related macros. This macro tail calls the next opline's handler by default. In VM helpers with extra parameters, ZEND_VM_DISPATCH() is redefined to return the trampoline value instead:

#undef  ZEND_VM_DISPATCH
#define ZEND_VM_DISPATCH ZEND_VM_DISPATCH_NOTAIL
zend_vm_trampoline zend_cannot_pass_by_ref_helper(arg_num, arg, execute_data, opline) {
   ...
}
#undef  ZEND_VM_DISPATCH
#define ZEND_VM_DISPATCH ZEND_VM_DISPATCH_DEFAULT
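
The redefinition works because a macro nested inside another macro is expanded at the point of use, with whatever definition is current there. A toy demonstration, with hypothetical TOY_* names and plain calls in place of musttail to keep it portable:

```c
typedef struct toy_op toy_op;
typedef struct toy_frame { int acc; } toy_frame;
typedef int (*toy_handler)(toy_frame *frame, const toy_op *opline);
struct toy_op { toy_handler handler; int operand; };
typedef struct toy_trampoline { const toy_op *opline; toy_handler handler; } toy_trampoline;

/* Two expansions of the same dispatch macro. */
#define TOY_DISPATCH_DEFAULT() return opline->handler(frame, opline)
#define TOY_DISPATCH_NOTAIL()  return (toy_trampoline){ opline, opline->handler }
#define TOY_DISPATCH           TOY_DISPATCH_DEFAULT

/* Shared by handlers and helpers; picks up the current TOY_DISPATCH. */
#define TOY_NEXT_OPCODE() do { opline++; TOY_DISPATCH(); } while (0)

static int toy_add(toy_frame *frame, const toy_op *opline) {
    frame->acc += opline->operand;
    TOY_NEXT_OPCODE(); /* here: calls the next handler */
}

static int toy_halt(toy_frame *frame, const toy_op *opline) {
    (void) opline;
    return frame->acc;
}

#undef  TOY_DISPATCH
#define TOY_DISPATCH TOY_DISPATCH_NOTAIL
static toy_trampoline toy_scale_helper(int factor, toy_frame *frame, const toy_op *opline) {
    frame->acc *= factor;
    TOY_NEXT_OPCODE(); /* same macro, now returns a trampoline */
}
#undef  TOY_DISPATCH
#define TOY_DISPATCH TOY_DISPATCH_DEFAULT

static int toy_triple(toy_frame *frame, const toy_op *opline) {
    toy_trampoline t = toy_scale_helper(3, frame, opline);
    return t.handler(frame, t.opline);
}

/* Runs ADD 14; TRIPLE; HALT. */
int toy_run_macro_demo(void) {
    const toy_op program[] = {
        { toy_add, 14 }, { toy_triple, 0 }, { toy_halt, 0 },
    };
    toy_frame frame = { 0 };
    return program->handler(&frame, program);
}
```

The helper body is written with the same TOY_NEXT_OPCODE() as the handlers; only the bracketing #undef/#define pair decides which form of dispatch it gets.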

Caveats

  • The ABI of __attribute__((preserve_none)) is not stable, so we may not be able to use it in exported functions. This has implications for JIT and user opcode handlers. We might need to generate wrappers with a stable convention.
  • There are now 3 interpreters to test. It may be possible to enable some of the changes by default (e.g. passing opline as argument and __attribute__((preserve_none))) to reduce the differences between the call-based interpreter and the clang one.

TODO

  • JIT support
  • Measure the impact of passing opline as argument, without other changes. Maybe do that by default? (Pass opline as argument to opcode handlers in CALL VM #17952)
  • Measure the impact of __attribute__((preserve_none)), without other changes
  • Measure/test on aarch64, x86 (not sure it's supported)

Future scope:

  • Tweak preserve_none / preserve_most / slow paths

PRs

I'm splitting this into smaller PRs:

@dstogov
Member

dstogov commented Feb 18, 2025

Interesting work! I suppose this will require special support for JIT.

@arnaud-lb
Member Author

Yes, this does require some changes to the JIT to accommodate the new opcode handler signature and the way FP/IP are passed around. I plan to implement them unless there are major issues with the current approach.

The fact that preserve_none is an unstable ABI will complicate things a bit. Some possible solutions I have in mind are:

  • Generate wrappers with a stable ABI for each opcode handler, and use that in opcache/jit
  • Enforce that opcache must be compiled with the same clang version as the php binary
  • Move opcache to Zend/ and embed it in the php binary

The second one seems reasonable to me.

@cmb69
Member

cmb69 commented Feb 18, 2025

FWIW: feature request to support guaranteed tail calls for MSVC.

@dstogov
Member

dstogov commented Feb 19, 2025

> Generate wrappers with a stable ABI for each opcode handler, and use that in opcache/jit

The HYBRID VM generates two handlers for each opcode (a C function with the standard ABI, plus a non-standard GOTO label). The JIT uses one or the other as appropriate. Technically, a tail call does the same thing as the GOTO, so the same approach might work.

Clang doesn't support global register variables. LLVM can achieve something similar by using a custom calling convention that pins arguments to registers (this technique is used for Haskell, Erlang, HHVM ...). Unfortunately, I didn't find a way to introduce a new calling convention without patching LLVM (cool OOP style). Using them in Clang was also problematic. That was a long time ago, and maybe something has changed.

@dstogov
Member

dstogov commented Feb 19, 2025

BTW, LLVM/Clang should support local register variables, so maybe the GOTO and HYBRID VMs could be adapted.

arnaud-lb added a commit that referenced this pull request Apr 15, 2025
This changes the signature of opcode handlers in the CALL VM so that the opline
is passed directly via arguments. This reduces the number of memory operations
on EX(opline), and makes the CALL VM considerably faster.

Additionally, this unifies the CALL and HYBRID VMs a bit, as EX(opline) is now
handled in the same way in both VMs.

This is a part of GH-17849.

Currently we have two VMs:

 * HYBRID: Used when compiling with GCC. execute_data and opline are global
   register variables
 * CALL: Used when compiling with something else. execute_data is passed as
   opcode handler arg, but opline is passed via execute_data->opline
   (EX(opline)).

The CALL VM looks like this:

    while (1) {
        ret = execute_data->opline->handler(execute_data);
        if (UNEXPECTED(ret != 0)) {
            if (ret > 0) { // returned by ZEND_VM_ENTER() / ZEND_VM_LEAVE()
                execute_data = EG(current_execute_data);
            } else {       // returned by ZEND_VM_RETURN()
                return;
            }
        }
    }

    // example op handler
    int ZEND_INIT_FCALL_SPEC_CONST_HANDLER(zend_execute_data *execute_data) {
        // load opline
        const zend_op *opline = execute_data->opline;

        // instruction execution

        // dispatch
        // ZEND_VM_NEXT_OPCODE():
        execute_data->opline++;
        return 0; // ZEND_VM_CONTINUE()
    }

Opcode handlers return a positive value to signal that the loop must load a
new execute_data from EG(current_execute_data), typically when entering
or leaving a function.

Here I make the following changes:

 * Pass opline as opcode handler argument
 * Return next opline from opcode handlers
 * ZEND_VM_ENTER / ZEND_VM_LEAVE return opline|(1<<0) to signal that
   execute_data must be reloaded from EG(current_execute_data)

This gives us:

    while (1) {
        opline = opline->handler(execute_data, opline);
        if (UNEXPECTED((uintptr_t) opline & ZEND_VM_ENTER_BIT)) {
            opline = (const zend_op *) ((uintptr_t) opline & ~ZEND_VM_ENTER_BIT);
            if (opline != 0) { // ZEND_VM_ENTER() / ZEND_VM_LEAVE()
                execute_data = EG(current_execute_data);
            } else {           // ZEND_VM_RETURN()
                return;
            }
        }
    }

    // example op handler
    const zend_op * ZEND_INIT_FCALL_SPEC_CONST_HANDLER(zend_execute_data *execute_data, const zend_op *opline) {
        // opline already loaded

        // instruction execution

        // dispatch
        // ZEND_VM_NEXT_OPCODE():
        return ++opline;
    }
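
The tagged-pointer protocol can be tried out in isolation with the sketch below. It is a toy model under stated assumptions, not the code from this commit: toy_op, toy_frame, TOY_ENTER_BIT and the handlers are hypothetical, toy_current_frame stands in for EG(current_execute_data), and the "enter" handler simply switches to a second frame. It relies on toy_op being aligned to more than one byte, so bit 0 of a valid pointer is always clear.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_ENTER_BIT ((uintptr_t) 1)

typedef struct toy_op toy_op;
typedef struct toy_frame { int acc; } toy_frame;
typedef const toy_op *(*toy_handler)(toy_frame *frame, const toy_op *opline);
struct toy_op { toy_handler handler; int operand; };

static toy_frame toy_frames[2];
static toy_frame *toy_current_frame; /* stand-in for EG(current_execute_data) */

static const toy_op *toy_add(toy_frame *frame, const toy_op *opline) {
    frame->acc += opline->operand;
    return ++opline; /* untagged: plain continue */
}

/* Switches frames, then tags the next opline so the loop reloads the frame. */
static const toy_op *toy_enter(toy_frame *frame, const toy_op *opline) {
    (void) frame;
    toy_current_frame = &toy_frames[1];
    return (const toy_op *) ((uintptr_t) (opline + 1) | TOY_ENTER_BIT);
}

/* Tagged null pointer: leave the interpreter. */
static const toy_op *toy_return(toy_frame *frame, const toy_op *opline) {
    (void) frame;
    (void) opline;
    return (const toy_op *) TOY_ENTER_BIT;
}

void toy_execute(const toy_op *opline) {
    toy_frame *frame = toy_current_frame;
    while (1) {
        opline = opline->handler(frame, opline);
        if ((uintptr_t) opline & TOY_ENTER_BIT) {
            opline = (const toy_op *) ((uintptr_t) opline & ~TOY_ENTER_BIT);
            if (opline != NULL) {
                frame = toy_current_frame; /* enter/leave: reload the frame */
            } else {
                return;                    /* return: leave the interpreter */
            }
        }
    }
}

/* Runs ADD 40 in frame 0, ENTER, ADD 2 in frame 1, RETURN. */
int toy_run_tagged_demo(void) {
    static const toy_op program[] = {
        { toy_add, 40 }, { toy_enter, 0 }, { toy_add, 2 }, { toy_return, 0 },
    };
    toy_frames[0].acc = 0;
    toy_frames[1].acc = 0;
    toy_current_frame = &toy_frames[0];
    toy_execute(program);
    return toy_frames[0].acc * 100 + toy_frames[1].acc; /* encodes both accumulators */
}
```

In the common case the loop only tests one bit of the returned pointer, and neither the handler nor the loop touches an opline field in memory.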

bench.php is 23% faster on Linux / x86_64, 18% faster on macOS / M1.

Symfony Demo is 2.8% faster.

When using the HYBRID VM, JIT'ed code stores execute_data/opline in two fixed
callee-saved registers and rarely touches EX(opline), just like the VM.

Since the registers are callee-saved, the JIT'ed code doesn't have to
save them before calling other functions, and can assume they always
contain execute_data/opline. The code also avoids saving/restoring them in
prologue/epilogue, as execute_ex takes care of that (JIT'ed code is called
exclusively from there).

The CALL VM can now use a fixed register for execute_data/opline as well, but
we can't rely on execute_ex to save the registers for us as it may use these
registers itself. So we have to save/restore the two registers in JIT'ed code
prologue/epilogue.

Closes GH-17952
@arnaud-lb arnaud-lb mentioned this pull request May 31, 2025
@arnaud-lb arnaud-lb closed this in 73b98a3 Aug 22, 2025