workaround cdt's small const memcpy host function calls in EOS VM OC #1016
spoonincode merged 5 commits into main
Conversation
const intrinsic_map_t& get_intrinsic_map();

static constexpr unsigned minimum_const_memcpy_intrinsic_to_optimize = 1;
static constexpr unsigned maximum_const_memcpy_intrinsic_to_optimize = 128;
128 is a bit of a dice roll. I don't have data to know exactly where the threshold lies between this local copy being faster versus slower than the full call off to memcpy. I wouldn't want to go higher than this anyway, though, since it bloats the code a little more.
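For clarity, a minimal standalone sketch of the gate these constants feed (the constants are copied from the diff; the helper function name is hypothetical, not Spring's actual API):

```cpp
#include <cassert>
#include <cstdint>

static constexpr unsigned minimum_const_memcpy_intrinsic_to_optimize = 1;
static constexpr unsigned maximum_const_memcpy_intrinsic_to_optimize = 128;

// hypothetical helper: only memcpy calls whose size is a compile-time
// constant in [1, 128] are rewritten into inline loads/stores; everything
// else still goes through the memcpy host function call
bool should_inline_const_memcpy(uint64_t const_size) {
    return const_size >= minimum_const_memcpy_intrinsic_to_optimize &&
           const_size <= maximum_const_memcpy_intrinsic_to_optimize;
}
```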
static constexpr unsigned minimum_const_memcpy_intrinsic_to_optimize = 1;
static constexpr unsigned maximum_const_memcpy_intrinsic_to_optimize = 128;
This is a weird place for this, but header files for OC are a little weird because some parts are still compiled with C++17 -- these need to be placed in a header file that is consumed by the C++17 code.
const_memcpy_sz->getZExtValue() >= minimum_const_memcpy_intrinsic_to_optimize &&
const_memcpy_sz->getZExtValue() <= maximum_const_memcpy_intrinsic_to_optimize) {
   const unsigned sz_value = const_memcpy_sz->getZExtValue();
   llvm::IntegerType* type_of_memcpy_width = llvm::Type::getIntNTy(context, sz_value*8);
The use of a non-power-of-2 type size here is rather unorthodox, though one would have to imagine something like _BitInt from C23 would do exactly this. Unit tests exhaustively explore every size (1 through 128). An oddball size generates mostly the expected machine code (the loads and stores are maybe not in the "natural" order one would expect, but that doesn't matter, of course); for example, for size 13:
pushq %rax
decl %gs:-18888
je 83
movl %gs:208, %eax ;;load 4 bytes (208-211)
movq %gs:200, %rcx ;;load 8 bytes (200-207)
movb %gs:212, %dl ;;load 1 byte (212)
movb %dl, %gs:16 ;;store 1 byte (16)
movq %rcx, %gs:4 ;;store 8 bytes (4-11)
movl %eax, %gs:12 ;;store 4 bytes (12-15)
movl $4, %edi
movl $200, %esi
movl $13, %edx
callq *%gs:-21056
incl %gs:-18888
popq %rax
retq
callq *%gs:-18992

const unsigned ulength = length;

//this must remain the same behavior as the memcpy host function
if((size_t)(std::abs((ptrdiff_t)udest - (ptrdiff_t)usrc)) >= ulength)
This check compiles down to about 11 instructions but can still consume a substantial amount of CPU (beyond 1% in some cases). Perhaps something more clever can shave off another 0.5%.
wonder if
if ((std::max(udest, usrc) - std::min(udest, usrc)) >= ulength)
might be faster?
> This check compiles down to about 11 instructions but can still consume a substantial amount of CPU (beyond 1% in some cases). Perhaps something more clever can shave off another 0.5%.
Did you try plain if comparisons without using std::abs?
Both of these are interesting suggestions and worth trying. Though, I'd suggest we keep the existing impl initially, as it's closest to what already exists (and has already been performance tested at the moment).
I'm also curious what would happen if the check was moved completely inline. The obviously bad aspect of this is that it's going to put a lot of pressure on the branch prediction cache. But there is a non-trivial amount of work just to get the values into the right registers to pass off to a standard ABI function; then again, maybe these get pipelined well. Plenty to experiment with here.
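To make the comparison between the two candidate checks concrete, here is a standalone sketch (function names are mine; the inputs model 32-bit wasm linear-memory offsets and the copy length). Both reject the copy when the ranges overlap, and they should agree on every input:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// current impl: signed difference, absolute value, compare against length
bool no_overlap_abs(uint32_t udest, uint32_t usrc, uint32_t ulength) {
    return (size_t)(std::abs((ptrdiff_t)udest - (ptrdiff_t)usrc)) >= ulength;
}

// suggested alternative: unsigned distance via max/min, avoiding signed abs
bool no_overlap_minmax(uint32_t udest, uint32_t usrc, uint32_t ulength) {
    return (std::max(udest, usrc) - std::min(udest, usrc)) >= ulength;
}
```

Since the offsets are 32-bit values widened to a 64-bit ptrdiff_t, the subtraction cannot overflow, so the absolute difference and the max/min distance are the same quantity; any speed difference would come purely from the emitted instruction sequence.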
Note: exhaustive LLVM run here, chicken dance pending
};

-static constexpr uint8_t current_codegen_version = 1;
+static constexpr uint8_t current_codegen_version = 2;
This small change will effectively purge a pre-1.1 cache since, on load, the check in spring/libraries/chain/webassembly/runtimes/eos-vm-oc/code_cache.cpp (lines 345 to 354 in 3984428) will find a mismatched codegen version.
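The effect described above can be sketched as a simple version gate on cache load (the struct and function names here are illustrative, not Spring's actual API):

```cpp
#include <cassert>
#include <cstdint>

// bumped from 1 to 2 in this PR, so previously cached code is invalidated
static constexpr uint8_t current_codegen_version = 2;

// hypothetical stand-in for a cached code entry's metadata
struct cached_code_entry {
    uint8_t codegen_version;  // version stamped when the code was generated
};

// a mismatch (e.g. a pre-1.1 cache stamped with version 1) is rejected,
// forcing the code to be regenerated with the new memcpy optimization
bool cache_entry_usable(const cached_code_entry& e) {
    return e.codegen_version == current_codegen_version;
}
```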
Replay test completed successfully
AntelopeIO/cdt#102 is a long-time thorn, and while it would obviously be ideal to fix cdt to not generate these host function calls in the first place, it's not clear when that will happen; regardless, there will be many contracts that may never be recompiled and years of historical blocks that will have these calls forever.
This implements a workaround in EOS VM OC that identifies small constant-size memcpy host function calls during code generation and replaces them with simple memory loads and stores, plus a small call to a native function that verifies the memcpy ranges do not overlap, as required by protocol rules.
As a simple example, a contract such as,
Will generate machine code (annotated by me)
(in practice the source and destination addresses are typically not constants)
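As a conceptual model of the rewrite (this is plain C++ standing in for the generated machine code; the function names are mine and plain byte buffers stand in for the wasm linear memory): the constant-size copy becomes a wide load followed by a wide store, while the overlap check remains a separate small native call.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// illustrative sketch only, not the actual codegen: for a compile-time
// constant size N in the supported 1..128 range, read all N bytes first
// ("load" half), then write them ("store" half); compilers fold the
// fixed-size memcpys into wide register loads/stores
template <unsigned N>
void lowered_const_memcpy(unsigned char* dest, const unsigned char* src) {
    static_assert(N >= 1 && N <= 128, "only small constant sizes are lowered");
    unsigned char tmp[N];
    std::memcpy(tmp, src, N);   // "load" half
    std::memcpy(dest, tmp, N);  // "store" half
}

// stand-in for the native range check the generated code still calls into:
// protocol rules require the source and destination ranges not to overlap
bool ranges_disjoint(uint32_t dest, uint32_t src, uint32_t len) {
    return (dest > src ? dest - src : src - dest) >= len;
}
```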
For some tested block ranges starting at 346025446, 348083350, 371435719, and 396859576, replay performance increases by 3.5% to 6%. This is a bit lower than I expected -- I was expecting consistently north of 5% -- because there are still many memcpy host function calls that are not optimized out, and the overlap check can still consume a significant amount of CPU.
A run of the LLVM exhaustive workflow is at,
https://github.com/AntelopeIO/spring/actions/runs/11711214896