Conversation
Also move from rep movsd to rep movsb, which should be faster on modern hardware.
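A minimal sketch of what that looks like, assuming GCC/Clang-style x86-64 inline assembly (the function name here is illustrative, not the patch's actual code): on CPUs that advertise ERMSB (Enhanced REP MOVSB), `rep movsb` is generally at least as fast as the dword variant.

```c
#include <stddef.h>

/* Hypothetical sketch of a rep-movsb-based copy.
 * RDI = destination, RSI = source, RCX = byte count;
 * "rep movsb" copies RCX bytes and advances all three. */
static void* memcpy_ermsb(void* dst, const void* src, size_t n)
{
    void* ret = dst;
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
    return ret;
}
```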
|
Also for the record: Apparently during the testing the current libc This test program exposes it: Which produces: I'm not sure if this is why when I stub-out the |
|
Based on my reference memcpy implementation included in the push, I shaved off ~1-2 minutes of a Which is a 10% reduction(ish), which is nice ;). |
|
pretty cool. Thank you! |
|
Are the AppVeyor failures related? If not: 10% saved and tests otherwise passing seems like a no-brainer to get the new memcpy in. Just a related note: I've found that with GCC, performance is often bad if the number of bytes is not known to the compiler; for small amounts (I think it was less than 256 bytes) the general implementation seems to perform worse than a simple loop that the optimizer improves - to the point that I've added an explicit switch between the two, depending on the size, in libcob. |
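The switch described above can be sketched like this (the threshold value and the helper name are illustrative, not libcob's actual code): below a cutoff, a plain byte loop that the optimizer can unroll and vectorize; above it, the general memcpy.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical size-based dispatch between a simple loop and memcpy.
 * The 256-byte cutoff is the ballpark figure mentioned above. */
#define SMALL_COPY_LIMIT 256

static void copy_dispatch(void* dst, const void* src, size_t n)
{
    if (n < SMALL_COPY_LIMIT) {
        /* small copy: let the optimizer turn this into unrolled code */
        char* d = (char*)dst;
        const char* s = (const char*)src;
        while (n--)
            *d++ = *s++;
    } else {
        /* large copy: the library implementation wins here */
        memcpy(dst, src, n);
    }
}
```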
|
The current AppVeyor failures are because I made changes that I didn't quite vet properly. I'm trying to get it worked out but keep running into one problem after another. Soon, I guess.... Another aspect of what I checked in is that it may be a little slower overall. Part of my current vetting is also to see if I can address that... |
|
@GitMensch So I think it may be partly me (I have no idea why this would break it; it fixes issues, if anything)? |
|
https://www.microsoft.com/en-us/msrc/blog/2021/01/building-faster-amd64-memset-routines The goal here should be to take any commonly used, expensive libc function and make it at least slightly faster. |
|
Hmmm.... I found a crash in the following:

```cpp
inline void* memmove_backwards(void* s1, const void* s2, size_t sz)
{
    char* dest_char = (char*)s1;
    const char* src_char = (const char*)s2;
    /* number of 16-byte blocks, rounded down to a multiple of 4;
       num_loop is used below as an index in 16-byte (__m128i) units,
       so the original sz / 64 under-counted the blocks */
    size_t num_loop = (sz / 16) & ~(size_t)3;
    /*
    while (num_loop > 0)
    {
        num_loop -= 4;
        __m128i loaded_val  = _mm_loadu_si128((const __m128i*)src_char + num_loop);
        __m128i loaded_val1 = _mm_loadu_si128((const __m128i*)src_char + num_loop + 1);
        __m128i loaded_val2 = _mm_loadu_si128((const __m128i*)src_char + num_loop + 2);
        __m128i loaded_val3 = _mm_loadu_si128((const __m128i*)src_char + num_loop + 3);
        _mm_storeu_si128((__m128i*)dest_char + num_loop + 3, loaded_val3);
        _mm_storeu_si128((__m128i*)dest_char + num_loop + 2, loaded_val2);
        _mm_storeu_si128((__m128i*)dest_char + num_loop + 1, loaded_val1);
        _mm_storeu_si128((__m128i*)dest_char + num_loop,     loaded_val);
    }
    */
    sz -= num_loop * 16;
    while (num_loop > 0)
    {
        num_loop -= 4;
        /* reduced repro: this inline-assembly block is what crashes the compiler */
        __asm {
            lea ecx, num_loop
            movups xmm0, [src_char + ecx]
            movups [dest_char + ecx], xmm0
        }
    }
    /* scalar tail, copied back-to-front */
    while (sz > 0)
    {
        sz--;
        ((char*)s1)[sz] = ((const char*)s2)[sz];
    }
    return s1;
}
```

This crashes the compiler when dealing with everything. You'll notice that I'm trying to work around an issue where the following block of code doesn't work: As my assumption there is that This is compounded by the ability of MSVC to happily accept this block of code. I'm not sure if this is an "I don't know enough about the NASM syntax" issue, however. |
|
Oh, I just realized what is probably wrong with the 'lea' instructions. NASM syntax does not really understand things like:

mov eax, my_variable

I hadn't thought about that aspect of MASM in a very long time.... Instead, I think you have to do:

mov eax, [my_variable]

This is actually something that is probably easy to adjust in the inline assembly parser.... it just needs to notice what is happening and add the brackets internally. If you think it would be good for compatibility, I can address it. |
|
I think we could tackle this in two ways, both acceptable: Accept I think that's the best solution. |
The hot loop is now ~6 instructions as opposed to the old ~8, while doing 4x the number of compares per iteration. Small strings may be slightly slower, since the cold loop is there to reach alignment before the fast path kicks in, so strings < 15 bytes long may be slower. Anything > 32 bytes long should be faster than before, however.
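The actual routine isn't shown here, but the shape being described - a scalar "cold" loop that walks to alignment, then a wide "hot" loop testing 16 bytes per iteration - can be sketched like this (hypothetical SSE2 code, not the patch itself), using a strlen-style scan as the example:

```c
#include <emmintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: scalar cold loop to 16-byte alignment,
 * then an SSE2 hot loop that checks 16 bytes per iteration.
 * Reading a full aligned 16-byte block never crosses a page
 * boundary, so the wide loads are safe. */
static size_t strlen_sse2(const char* s)
{
    const char* p = s;
    /* cold loop: advance byte-by-byte until p is 16-byte aligned */
    while (((uintptr_t)p & 15) != 0) {
        if (*p == '\0')
            return (size_t)(p - s);
        p++;
    }
    /* hot loop: compare 16 bytes at a time against zero */
    const __m128i zero = _mm_setzero_si128();
    for (;;) {
        __m128i chunk = _mm_load_si128((const __m128i*)p); /* aligned load */
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, zero));
        if (mask != 0) {
            /* locate the first zero byte inside the chunk */
#if defined(__GNUC__)
            return (size_t)(p - s) + (size_t)__builtin_ctz((unsigned)mask);
#else
            size_t i = 0;
            while (!(mask & 1)) { mask >>= 1; i++; }
            return (size_t)(p - s) + i;
#endif
        }
        p += 16;
    }
}
```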
|
I'm somewhat stalling out on this at the moment, but I might get back to this next week. I am still unhappy with my Past that, I think I might aim at Outside of that, I am interested in trying to learn LLVM's loop transformations, such as matching patterns that will rewrite a basic block into a function call (in particular, something like optimizing a basic memcpy loop into a call to the library memcpy if the size of the memcpy is unknown or greater than a certain value). |
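For reference, the kind of pattern being described is what LLVM's loop-idiom recognition handles: a plain byte-copy loop in this shape, with a trip count that isn't a small compile-time constant, is a candidate to be rewritten into a single memcpy call.

```c
#include <stddef.h>

/* A byte-copy loop in the shape a loop-idiom pass can recognize:
 * unit-stride loads and stores over non-overlapping buffers,
 * with the count n unknown at compile time. */
void copy_bytes(char* dst, const char* src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}
```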
|
So I think it is valuable that you are doing this 😄 |
|
I'm kinda-sorta stalling out on this while working on my benchmark lib because On that note, for adding little utilities that don't quite fit with the Utils/ folder, should I add a |
This is my incubation area for optimization; so far I'm including my gen.cpp changes while I try to figure out what in the universe is going on with my memcpy implementation.

I'll be pushing my memcpy implementation later, but for now this should be a good reference point to compare the AppVeyor build times with, as I suspect this should increase the speed of the compiler slightly, but not majorly.