Coercion of arguments into a solid block of memory #11

sanko · 2025-10-11T20:09:38Z

sanko
Oct 11, 2025
Maintainer

Shower thought time... Internally, would it be faster to coerce args that would end up on the stack into a single block of malloc'd memory? Currently, our trampolines must perform a double-indirection to put data on the stack:

1. Load the pointer into a scratch register (something like mov r15, [r14 + i*8]).
2. Dereference that pointer to get the actual data (mov rax, [r15]).
3. Move the actual data to the destination register/stack slot (mov rdi, rax).
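
For readers who think in C rather than assembly, the three steps above can be sketched roughly as follows. This is only an illustration, not the trampoline generator itself: load_int_arg is a made-up name, and the sketch assumes args is an array holding one pointer per argument.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of the double indirection: args[i] is a pointer to the
 * caller's value, not the value itself. */
static int64_t load_int_arg(void *const *args, size_t i) {
    const void *p = args[i];     /* step 1: load the pointer (mov r15, [r14 + i*8]) */
    int32_t v;
    memcpy(&v, p, sizeof v);     /* step 2: dereference it to get the data */
    return (int64_t)v;           /* step 3: widened value, ready for a register or stack slot */
}
```

Calling load_int_arg(args, 1) with args = { &a, &b } returns the value of b, at the cost of two dependent loads per argument.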

With cache misses, longer arg lists (> 16?), etc., this is a choke point for speed (honestly, microseconds over millions of calls). Alternatively, we could let actual compiler designers deal with the gathering: allocate a single block for the data that'll end up on the stack with something like sub rsp, #size, then loop through all N args in C and call something like memcpy(block + offset, args[i], size); for each one. That would mean more work being done in C but...
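
A minimal sketch of that gather loop, under stated assumptions: arg_slot and gather_args are hypothetical names, and the offsets and sizes are assumed to have been computed ahead of time from the signature. The bounds check is an addition for safety, not something the idea above requires.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical descriptor: where one argument lands in the block and how big it is. */
typedef struct {
    size_t offset;
    size_t size;
} arg_slot;

/* Copy every argument into one contiguous block.
 * Returns 0 on success, -1 if any slot would fall outside the block. */
static int gather_args(unsigned char *block, size_t block_size,
                       void *const *args, const arg_slot *slots, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (slots[i].offset > block_size ||
            slots[i].size > block_size - slots[i].offset)
            return -1;                                   /* slot does not fit */
        memcpy(block + slots[i].offset, args[i], slots[i].size);
    }
    return 0;
}
```

The trampoline would then read each value from block at a fixed offset, which is one load per argument once the copy is done. The copy itself still follows every args[i] pointer, so the total number of memory operations goes up rather than down.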

I'll have to benchmark our current data gathering step in the JIT path against having it done by actual compiler designers... For a 'typical' system this would be faster in theory, but would it be enough faster to make it worth the work of breaking all the JIT generation code...
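
A rough sketch of the kind of timing loop such a comparison needs. This is not the project's benchmark: ns_per_call, add_ints, and int_binop are names invented here, and clock() is used only because it is standard C. A real harness would likely want a higher-resolution timer.

```c
#include <stdint.h>
#include <time.h>

typedef int (*int_binop)(int, int);

/* Stand-in target with the same int(int, int) shape as a simple test function. */
static int add_ints(int a, int b) { return a + b; }

/* Time `iterations` calls through a function pointer and return ns per call.
 * The accumulated result is written to *sink so the loop cannot be optimized away. */
static double ns_per_call(int_binop fn, uint64_t iterations, unsigned *sink) {
    volatile int_binop f = fn;   /* volatile forces a real indirect call each time */
    unsigned acc = 0;            /* unsigned, so wraparound is well defined */
    clock_t start = clock();
    for (uint64_t i = 0; i < iterations; i++)
        acc += (unsigned)f((int)(i & 0xff), 1);
    clock_t end = clock();
    *sink = acc;
    return (double)(end - start) * 1e9 / (double)CLOCKS_PER_SEC / (double)iterations;
}
```

Running the same loop once with the direct function and once with the trampoline, then subtracting the two per-call figures, gives a per-call overhead estimate.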

sanko · 2025-10-12T00:08:04Z

sanko
Oct 12, 2025
Maintainer Author

Nope. I was wrong.

Stable (double deref):

TAP version 13
1..1
# INFIX_DEBUG: Allocated JIT memory. RW at 0x79188a40d000, RX at 0x79188a3d3000
# INFIX_DEBUG: Memory at 0x79188a3d3000 is ready for execution.
# Forward Trampoline Machine Code (size: 58 bytes)
#   0x0000: 55 48 89 e5 41 54 41 55  41 56 41 57 49 89 fc 49 | UH..ATAUAVAWI..I
#   0x0010: 89 f5 49 89 d6 4d 8b 3e  49 63 3f 4d 8b 7e 08 49 | ..I..M.>Ic?M.~.I
#   0x0020: 63 37 4d 85 e4 75 02 0f  0b 49 ff d4 41 89 45 00 | c7M..u...I..A.E.
#   0x0030: 41 5f 41 5e 41 5d 41 5c  5d c3                   | A_A^A]A\].
# End of Forward Trampoline Machine Code
# infix Call Overhead Benchmark
# Iterations: 10000000
# Target function: int(int, int)
# Direct Call Time: 0.0020 s (0.20 ns/call)
# infix (Unbound):    0.0332 s (3.32 ns/call) -> Overhead: ~3.12 ns
# INFIX_DEBUG: Allocated JIT memory. RW at 0x79188a3d2000, RX at 0x79188a3d1000
# INFIX_DEBUG: Memory at 0x79188a3d1000 is ready for execution.
# Forward Trampoline Machine Code (size: 65 bytes)
#   0x0000: 55 48 89 e5 41 54 41 55  41 56 41 57 49 89 fd 49 | UH..ATAUAVAWI..I
#   0x0010: 89 f6 4d 8b 3e 49 63 3f  4d 8b 7e 08 49 63 37 49 | ..M.>Ic?M.~.Ic7I
#   0x0020: bc 80 36 0e dc 69 57 00  00 4d 85 e4 75 02 0f 0b | ..6..iW..M..u...
#   0x0030: 49 ff d4 41 89 45 00 41  5f 41 5e 41 5d 41 5c 5d | I..A.E.A_A^A]A\]
#   0x0040: c3                                               | .
# End of Forward Trampoline Machine Code
# infix (Bound):      0.0367 s (3.67 ns/call) -> Overhead: ~3.46 ns
# dyncall benchmarking was not enabled.
ok 1 - Benchmark completed (final accumulator value: 829341696)

Dev (gather and marshall):

TAP version 13
1..1
# INFIX_DEBUG: Allocated JIT memory. RW at 0x7e5eb5e8f000, RX at 0x7e5eb5e55000
# INFIX_DEBUG: Memory at 0x7e5eb5e55000 is ready for execution.
# Forward Trampoline Machine Code (size: 90 bytes)
#   0x0000: 55 48 89 e5 41 54 41 55  41 56 41 57 49 89 fc 49 | UH..ATAUAVAWI..I
#   0x0010: 89 f5 49 89 d6 48 81 ec  20 00 00 00 4d 8b 3e 49 | ..I..H.. ...M.>I
#   0x0020: 8b 07 48 89 04 24 4d 8b  7e 08 49 8b 07 48 89 44 | ..H..$M.~.I..H.D
#   0x0030: 24 10 48 63 3c 24 48 63  74 24 10 4d 85 e4 75 02 | $.Hc<$Hct$.M..u.
#   0x0040: 0f 0b 49 ff d4 41 89 45  00 48 81 c4 20 00 00 00 | ..I..A.E.H.. ...
#   0x0050: 41 5f 41 5e 41 5d 41 5c  5d c3                   | A_A^A]A\].
# End of Forward Trampoline Machine Code
# infix Call Overhead Benchmark
# Iterations: 10000000
# Target function: int(int, int)
# Direct Call Time: 0.0042 s (0.42 ns/call)
# infix (Unbound):    0.0725 s (7.25 ns/call) -> Overhead: ~6.83 ns
# INFIX_DEBUG: Allocated JIT memory. RW at 0x7e5eb5e54000, RX at 0x7e5eb5e53000
# INFIX_DEBUG: Memory at 0x7e5eb5e53000 is ready for execution.
# Forward Trampoline Machine Code (size: 97 bytes)
#   0x0000: 55 48 89 e5 41 54 41 55  41 56 41 57 49 89 fd 49 | UH..ATAUAVAWI..I
#   0x0010: 89 f6 48 81 ec 20 00 00  00 4d 8b 3e 49 8b 07 48 | ..H.. ...M.>I..H
#   0x0020: 89 04 24 4d 8b 7e 08 49  8b 07 48 89 44 24 10 48 | ..$M.~.I..H.D$.H
#   0x0030: 63 3c 24 48 63 74 24 10  49 bc 80 a6 ce 69 56 56 | c<$Hct$.I....iVV
#   0x0040: 00 00 4d 85 e4 75 02 0f  0b 49 ff d4 41 89 45 00 | ..M..u...I..A.E.
#   0x0050: 48 81 c4 20 00 00 00 41  5f 41 5e 41 5d 41 5c 5d | H.. ...A_A^A]A\]
#   0x0060: c3                                               | .
# End of Forward Trampoline Machine Code
# infix (Bound):      0.0706 s (7.06 ns/call) -> Overhead: ~6.65 ns
# dyncall benchmarking was not enabled.
ok 1 - Benchmark completed (final accumulator value: 829341696)

Ran it several times back and forth and the results are consistent from run to run (~3ns of overhead on stable vs. ~7ns on dev). I'm not even going to bother with Win64 or the pain of ARM.

I understand I could switch to this bulk move only when fields are expected to really make good use of the stack, but brief trampolines like our test shouldn't be this slow. Premature optimization strikes again...
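
If the bulk move were ever revisited as an opt-in path, the choice could be made once at trampoline generation time. A minimal sketch, assuming a caller-supplied threshold: marshal_strategy, pick_strategy, and bulk_threshold are hypothetical, and nothing in the numbers above says what a sensible threshold would be.

```c
#include <stddef.h>

typedef enum { MARSHAL_DIRECT, MARSHAL_BULK } marshal_strategy;

/* Pick the bulk copy only when enough arguments spill to the stack.
 * A threshold of 0 disables the bulk path entirely. */
static marshal_strategy pick_strategy(size_t stack_args, size_t bulk_threshold) {
    if (bulk_threshold > 0 && stack_args >= bulk_threshold)
        return MARSHAL_BULK;
    return MARSHAL_DIRECT;
}
```

With this shape, a two-int signature like the benchmark's would always stay on the direct path, since none of its arguments go on the stack.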
