Replies: 1 comment
Nope. I was wrong. Stable (double deref): Dev (gather and marshal): Ran it several times back and forth and they both benchmark the same (~3ns vs. ~7ns). I'm not even going to bother with Win64 or the pain of ARM. I understand I could switch to this bulk move only when fields are expected to really make good use of the stack but
Shower thought time... Internally, would it be faster to coerce args that would end up on the stack into a single block of malloc'd memory? Currently, our trampolines must perform a double-indirection to put data on the stack:
1. `mov r15, [r14 + i*8]` to use a scratch register.
2. `mov rax, [r15]`
3. `mov rdi, rax`

With cache misses, longer arg lists (> 16?), etc., this is a choke point for speed (honestly, microseconds with millions of calls). Alternatively, we could let actual compiler designers deal with those and just allocate a single block of data that'll end up on the stack with something like `sub rsp, #size`, then loop through all N args and call something like `memcpy(block + offset, args[N], size);`, which would be more work being done in C but...

I'll have to benchmark moving our current data gathering step from the JIT path vs. having it done by actual compiler designers... For a 'typical' system this would be faster in theory, but would it be faster enough to make it worth the work of breaking all the JIT generation code...
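To make the comparison concrete, here is a minimal C sketch of both ideas: the per-argument double dereference, and a gather step that packs every stack-bound argument into one contiguous block. All names here (`SLOT_SIZE`, `slot_round_up`, `gather_size`, `gather_args`, `load_arg_double_deref`) are hypothetical and not taken from the project, and the 8-byte slot layout is an assumption modelled on typical 64-bit stack argument slots rather than any particular ABI.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: each stack-bound argument occupies whole 8-byte slots. */
#define SLOT_SIZE 8u

/* Round a size up to the next multiple of SLOT_SIZE. */
size_t slot_round_up(size_t size) {
    return (size + (SLOT_SIZE - 1)) & ~(size_t)(SLOT_SIZE - 1);
}

/* Current approach in C terms: two dependent loads per argument.
   Only valid when args[i] points at (at least) 8 readable bytes. */
uint64_t load_arg_double_deref(void *const *args, size_t i) {
    const void *p = args[i];   /* like: mov r15, [r14 + i*8] */
    uint64_t v;
    memcpy(&v, p, sizeof v);   /* like: mov rax, [r15]       */
    return v;
}

/* Total bytes the packed block needs (the "#size" for sub rsp). */
size_t gather_size(const size_t *sizes, size_t n) {
    size_t total = 0;
    for (size_t i = 0; i < n; i++) {
        total += slot_round_up(sizes[i]);
    }
    return total;
}

/* Gather step: copy every pointed-to argument into one block,
   one slot-aligned entry after another. Returns bytes written. */
size_t gather_args(unsigned char *block, size_t capacity,
                   void *const *args, const size_t *sizes, size_t n) {
    size_t offset = 0;
    for (size_t i = 0; i < n; i++) {
        size_t padded = slot_round_up(sizes[i]);
        assert(offset + padded <= capacity);
        memset(block + offset, 0, padded);          /* zero the padding  */
        memcpy(block + offset, args[i], sizes[i]);  /* copy the payload  */
        offset += padded;
    }
    return offset;
}
```

With the block prepared this way, the generated code would only need one bulk copy from `block` into the area reserved by `sub rsp`, instead of a pointer load plus a value load per argument. Whether that wins depends on measurement, since the gather loop still touches every argument once.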