
Work on optimizations#1112

Open
chuggafan wants to merge 8 commits into LADSoft:master from chuggafan:master

Conversation

@chuggafan
Contributor

This is my incubation area for optimization work. So far it includes my gen.cpp changes while I try to figure out what in the universe is going on with my memcpy implementation.

I'll push my memcpy implementation later, but for now this should be a good reference point for comparing AppVeyor build times, as I suspect these changes should speed up the compiler slightly, but not majorly.

Also move from rep movsd to rep movsb, which should be faster on modern hardware.
@chuggafan
Contributor Author

chuggafan commented Jan 18, 2026

Also, for the record: during testing I found that the current libc memcpy implementation apparently doesn't return the dest address.

This test program exposes it:

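#include <string.h>
#include <stdio.h>
#include <inttypes.h>
int main()
{
    char srcdata[12] = {0};
    char destdata[12];
    // PRIxPTR expects a uintptr_t argument, so the pointers are cast explicitly.
    printf("Before dest addr: %" PRIxPTR ". Before src addr: %" PRIxPTR "\n", (uintptr_t)destdata, (uintptr_t)srcdata);
    // memcpy is required to return its first argument (destdata here).
    char* retaddr = memcpy(destdata, srcdata, 12);
    printf("Before dest addr: %" PRIxPTR ". Before src addr: %" PRIxPTR ". Returned: %" PRIxPTR "\n", (uintptr_t)destdata, (uintptr_t)srcdata, (uintptr_t)retaddr);
    return 0;
}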

Which produces:

C:\OrangeC\memcpy_tests>occ test.c
occ (OrangeC) Version 7.0.1
Copyright (C) LADSoft 2006-2025

C:\OrangeC\memcpy_tests>test.exe
Before dest addr: 19ff10. Before src addr: 19ff1c
Before dest addr: 19ff10. Before src addr: 19ff1c. Returned: 401066

I'm not sure whether this is why the compiles start crashing on my end when I stub out the memcpy....
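
For reference, the C standard specifies that memcpy returns the value of its first argument, so a conforming libc should pass a check like the one below. This is only a minimal sketch: the helper name `memcpy_returns_dest` is made up for illustration and is not part of the PR.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not from the PR): returns 1 if memcpy hands back
   its destination pointer, as the C standard requires, and 0 otherwise. */
static int memcpy_returns_dest(size_t n)
{
    char src[64] = {0};
    char dst[64];

    if (n > sizeof dst)
        n = sizeof dst;

    /* Call through a volatile function pointer so the compiler is less
       likely to replace the call with a builtin and fold the result. */
    void* (*volatile copy)(void*, const void*, size_t) = memcpy;
    void* ret = copy(dst, src, n);

    return ret == (void*)dst;
}
```

Calling `memcpy_returns_dest(12)` mirrors the 12-byte copy in the test program above; a result of 0 would correspond to the bad "Returned:" value in the output.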

@chuggafan
Contributor Author

Based on my reference memcpy implementation included in the push, I shaved off ~1-2 minutes of a NORMALcompiles.bat execution on my machine.

Which is a 10% reduction(ish), which is nice ;).

@LADSoft
Owner

LADSoft commented Jan 19, 2026

pretty cool. Thank you!

@GitMensch
Contributor

Are the AppVeyor failures related? If not: 10% saved and tests otherwise passing seems like a no-brainer to get the new memcpy in.

Just a related note: I've found that with GCC there's often bad performance if the number of bytes is not known to the compiler. For small amounts (I think it was less than 256 bytes) the general implementation seems to do worse than a simple loop that the optimizer can improve, to the point that I added an explicit switch between the two depending on the size in libcob.

@LADSoft
Owner

LADSoft commented Jan 19, 2026

@GitMensch

The current AppVeyor failures are because I made changes that I didn't quite vet properly. I'm trying to get it worked out but keep running into one problem after another. Soon, I guess....

Another aspect of what I checked in is it may be a little slower overall. Part of my current vetting is also to see if I can address that...

@chuggafan
Contributor Author

chuggafan commented Jan 19, 2026

@GitMensch So I think it may be partly me (I have no idea why this would break it; if anything it fixes issues)?
The build is semi-unstable on my machine for some reason.
But yeah, if you look at the refsrc, I actually just have a really basic optimized loop for sz < 256 and then move to rep movsb for everything above that, except in the backwards memmove case.
I need to sit down and make some actual attempts at an SSE2 memcpy, and also sit down and write a custom benchmark.
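
The size-based switch discussed here can be sketched roughly as follows. This is not the refsrc code from the push: the name `dispatch_copy` and the macro `SMALL_COPY_LIMIT` are assumptions for illustration, and the library memcpy stands in for the rep movsb path so the sketch stays portable.

```c
#include <stddef.h>
#include <string.h>

/* Assumed cutoff taken from the discussion above; the best value depends
   on the machine and compiler and should be measured. */
#define SMALL_COPY_LIMIT 256

/* Hypothetical name. Copies n bytes from src to dst (buffers must not
   overlap) and returns dst, mirroring memcpy's contract. */
static void* dispatch_copy(void* dst, const void* src, size_t n)
{
    unsigned char* d = dst;
    const unsigned char* s = src;

    if (n < SMALL_COPY_LIMIT)
    {
        /* Small sizes: a plain loop the optimizer can unroll or vectorize. */
        for (size_t i = 0; i < n; i++)
            d[i] = s[i];
        return dst;
    }

    /* Large sizes: defer to the bulk routine. The library memcpy is a
       stand-in here for the rep movsb path described in the thread. */
    return memcpy(dst, src, n);
}
```

Keeping the threshold in one macro makes it easy to sweep different cutoffs once a benchmark exists.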

@chuggafan
Contributor Author

https://www.microsoft.com/en-us/msrc/blog/2021/01/building-faster-amd64-memset-routines
Secondarily, I also want to take a look at this and see what can be done. Something of note is that even the unaligned instructions should do well.
And this is especially true for rep stosb, whose performance apparently degrades severely when the destination is not 32-byte (possibly 64-byte on some platforms) aligned.
So my order is: try to work movdqu and movdqa into the mix, measure the performance, then try to follow that blog post for memset.

The goal here should be to knock out any commonly used expensive libc function into being slightly faster.
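
As a rough idea of what the movdqu step could look like with intrinsics, here is a minimal sketch. The function name `sse2_copy` and the `HAVE_SSE2_COPY` macro are made up for illustration, it makes no attempt at alignment handling or unrolling, and it falls back to a byte loop when SSE2 is not available at compile time.

```c
#include <stddef.h>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define HAVE_SSE2_COPY 1   /* hypothetical macro, local to this sketch */
#endif

/* Hypothetical name. Copies n bytes (no overlap allowed) and returns dst. */
static void* sse2_copy(void* dst, const void* src, size_t n)
{
    unsigned char* d = dst;
    const unsigned char* s = src;

#ifdef HAVE_SSE2_COPY
    /* 16 bytes per iteration using unaligned loads and stores (movdqu). */
    while (n >= 16)
    {
        __m128i v = _mm_loadu_si128((const __m128i*)s);
        _mm_storeu_si128((__m128i*)d, v);
        s += 16;
        d += 16;
        n -= 16;
    }
#endif

    /* Tail bytes, or the whole copy when SSE2 is not compiled in. */
    while (n > 0)
    {
        *d++ = *s++;
        n--;
    }
    return dst;
}
```

An aligned variant would first copy bytes until `d` is 16-byte aligned and then use `_mm_store_si128` (movdqa) for the stores; whether that wins is something to measure rather than assume.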

@chuggafan
Contributor Author

Hmmm.... I found a crash in the occ InstructionParser.cpp:

inline void* memmove_backwards(void* s1, const void* s2, size_t sz)
{
    char* dest_char = s1;
    const char* src_char = s2;
    size_t num_loop = sz / 64;
    /*
    while (num_loop > 0)
    {
        num_loop -= 4;
        __m128i loaded_val = _mm_loadu_si128(((const __m128i*)(src_char) + num_loop));
        __m128i loaded_val1 = _mm_loadu_si128(((const __m128i*)(src_char) + num_loop + 1));
        __m128i loaded_val2 = _mm_loadu_si128(((const __m128i*)(src_char) + num_loop + 2));
        __m128i loaded_val3 = _mm_loadu_si128(((const __m128i*)(src_char) + num_loop + 3));
        _mm_storeu_si128(((__m128i*)(dest_char) + num_loop + 3), loaded_val3);
        _mm_storeu_si128(((__m128i*)(dest_char) + num_loop + 2), loaded_val2);
        _mm_storeu_si128(((__m128i*)(dest_char) + num_loop + 1), loaded_val1);
        _mm_storeu_si128(((__m128i*)(dest_char) + num_loop), loaded_val);
    }
    */
    sz -= num_loop * 64;
    while (num_loop > 0)
    {
        num_loop -= 4;
        __asm {
            lea ecx, num_loop
            movups xmm0, [src_char + ecx];
            movups [dest_char + ecx], xmm0;
        }
    }
    while (sz > 0)
    {
        sz--;
        ((char*)s1)[sz] = ((const char*)s2)[sz];
    }
    return s1;
}

This crashes the compiler when dealing with everything.
Here's the stacktrace:

C:\OrangeC\memcpy_tests>occ /S basic_memcpy.c
occ (OrangeC) Version 7.0.1    
Copyright (C) LADSoft 2006-2025
Error(212)    basic_memcpy.c(45):  Use LEA to take address of auto variable
version: 7.0.1
Command Line: "C:\OrangeC\bin\occparse" -! --architecture "x86;lssm:ueduudhrbminsdjnupfji" "/S" "basic_memcpy.c"

Access Violation:(C:\OrangeC\bin\occparse.exe)
CS:EIP 0023:0075D6D3  SS:ESP 002B:0019EE6C
EAX: 00000000  EBX: 0019F4F0  ECX: 01218C54  EDX: 00000100  flags: 00010246
EBP: 0019EF74  ESI: 00000000  EDI: 00401000
 DS:     002B   ES:     002B   FS:     0053   GS:     002B


CS:EIP  C6 40 08 00 E9 7C 02 00 00 8B 45 0C 8B 40 0C 89

Stack trace:
                        75d6d3: InstructionParser::GetInstruction(ocode*, shared_ptr<Instruction>&, list<Numeric*, allocator<Numeric*>>&) + 0x50d  module: instructionparser.cpp, line: 179
                        4ec82e: Parser::AssembleInstruction(ocode*) + 0xd3  module: inasm.cpp, line: 1127
                        4ecd53: Parser::inlineAsm(list<Parser::FunctionBlock*, allocator<Parser::FunctionBlock*>>&) + 0x45d  module: inasm.cpp, line: 1252
                        4c8d31: Parser::StatementGenerator::ParseAsm(list<Parser::FunctionBlock*, allocator<Parser::FunctionBlock*>>&) + 0x82  module: stmt.cpp, line: 3510
                        4c9873: Parser::StatementGenerator::SingleStatement(list<Parser::FunctionBlock*, allocator<Parser::FunctionBlock*>>&, bool) + 0x7fe  module: stmt.cpp, line: 3867
                        719cb3: Parser::StatementGenerator::Compound(list<Parser::FunctionBlock*, allocator<Parser::FunctionBlock*>>&, bool) + 0x657  module: stmt.cpp, line: 4150
                        4c9462: Parser::StatementGenerator::SingleStatement(list<Parser::FunctionBlock*, allocator<Parser::FunctionBlock*>>&, bool) + 0x3ed  module: stmt.cpp, line: 3766
                        719606: Parser::StatementGenerator::StatementWithoutNonconst(list<Parser::FunctionBlock*, allocator<Parser::FunctionBlock*>>&, bool) + 0x2e  module: stmt.cpp, line: 3974
                        4c7f08: Parser::StatementGenerator::ParseWhile(list<Parser::FunctionBlock*, allocator<Parser::FunctionBlock*>>&) + 0x2ee  module: stmt.cpp, line: 3203
                        4c9541: Parser::StatementGenerator::SingleStatement(list<Parser::FunctionBlock*, allocator<Parser::FunctionBlock*>>&, bool) + 0x4cc  module: stmt.cpp, line: 3790
                        719cb3: Parser::StatementGenerator::Compound(list<Parser::FunctionBlock*, allocator<Parser::FunctionBlock*>>&, bool) + 0x657  module: stmt.cpp, line: 4150
                        4cb22e: Parser::StatementGenerator::FunctionBody(bool) + 0x5aa  module: stmt.cpp, line: 4651
                        627f4f: Parser::declare(Parser::sym*, Parser::Type**, Parser::StorageClass, Parser::Linkage, list<Parser::FunctionBlock*, allocator<Parser::FunctionBlock*>>&, bool, int, bool, Parser::AccessLevel) + 0x5641  module: declare.cpp, line: 4573
                        401407: Parser::compile(bool) + 0x27e  module: occparse.cpp, line: 452
                        4024a4: main + 0xf48  module: occparse.cpp, line: 728
                        5521fa: __startup + 0x1c6

You'll notice that I'm trying to work around an issue where the following block of code doesn't work:

C:\OrangeC\memcpy_tests>occ /S basic_memcpy.c
occ (OrangeC) Version 7.0.1
Copyright (C) LADSoft 2006-2025
Error(212)    basic_memcpy.c(44):  Invalid index mode
Error(212)    basic_memcpy.c(45):  Invalid index mode
2 Errors

As my assumption there is that num_loop is in a register or could at least be moved into a register. This code also doesn't work if I decorate num_loop with the register keyword.

This is compounded by the ability of MSVC to happily accept this block of code. I'm not sure if this is a "I don't know enough about the NASM syntax" issue however.

@LADSoft
Owner

LADSoft commented Jan 24, 2026

Oh, I just realized what is probably wrong with the 'lea' instructions.

nasm syntax does not really understand things like:

mov eax, my_variable

I hadn't thought about that aspect of masm in a very long time....

instead I think you have to do

mov eax, [my_variable]
that would go for lea and other instructions as well...

This is actually something that is probably easy to adjust in the inline assembly parser... it just needs to notice what is happening and add the brackets internally. If you think it would be good for compatibility I can address it.

@chuggafan
Contributor Author

I think that we could tackle this two ways, both are acceptable:

1. Accept masm style, perhaps as a switch(?).
2. For nasm style, if we can detect masm-style assembly, emit an error/warning/suggestion about how to fix it (i.e. what to change to get it to nasm style).

I think that's the best solution.

The hot loop is now ~6 instructions as opposed to the old ~8, while doing 4x the amount of compares.
Small strings may be slightly slower as the cold loop is there to align the speeds up faster, so strings < 15 in length may be slower.
Anything > 32 in length should be faster compared to previously, however.
@chuggafan
Contributor Author

chuggafan commented Feb 12, 2026

I'm somewhat stalling out on this at the moment, but I might get back to this next week.
I'm going to make my own benchmark lib (quick and dirty) and start testing everything at various sizes.

I am still unhappy with my memcpy and memmove implementations, partly because I don't handle aligned/unaligned well, partly because I don't optimize enough for older CPUs, and partly because I'm not optimizing the "reverse case" in memmove where I copy bytes backwards. I need to sit down and work out from first principles what I want there, and look at all of the instructions available to me up to SSE2.
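
For readers following along, the "reverse case" comes from how memmove has to pick a copy direction when the buffers overlap. A minimal, unoptimized sketch of that decision is below; the name `overlap_safe_move` is hypothetical and this is not the PR's implementation.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name. Moves n bytes from src to dst, handling overlap,
   and returns dst like memmove does. */
static void* overlap_safe_move(void* dst, const void* src, size_t n)
{
    unsigned char* d = dst;
    const unsigned char* s = src;
    /* Compare addresses as integers; relational comparison of unrelated
       pointers is not defined by the C standard. */
    uintptr_t da = (uintptr_t)dst;
    uintptr_t sa = (uintptr_t)src;

    if (da <= sa || da - sa >= n)
    {
        /* dst starts at or before src, or the ranges do not overlap:
           a forward copy never overwrites bytes it still needs to read. */
        for (size_t i = 0; i < n; i++)
            d[i] = s[i];
    }
    else
    {
        /* dst starts inside the source range: copy from the end backwards
           so the unread source bytes are not clobbered. */
        while (n > 0)
        {
            n--;
            d[n] = s[n];
        }
    }
    return dst;
}
```

Only the backwards branch needs a dedicated reverse routine; every other case can share the forward (memcpy-style) fast path.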

Past that, I think I might aim at memcmp next. It's mostly used in sqlite3 in our codebase, and somewhat in the libcxx string library, but even if it's just sqlite3, that should be a minor win all things considered, and it's something that real programs do use.
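
One common way to speed up memcmp without any SIMD is to compare a machine word at a time for equality and only drop to bytes once a difference is found. The sketch below illustrates that idea under stated assumptions: `wide_memcmp` is a hypothetical name, and the fixed-size memcpy calls are just a portable way to express unaligned loads (a real libc routine would load directly).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical name. Same contract as memcmp: negative, zero, or positive
   depending on the first differing byte, compared as unsigned char. */
static int wide_memcmp(const void* a, const void* b, size_t n)
{
    const unsigned char* p = a;
    const unsigned char* q = b;

    /* Hot loop: test 8 bytes per iteration for equality only. */
    while (n >= sizeof(uint64_t))
    {
        uint64_t x, y;
        memcpy(&x, p, sizeof x);   /* portable unaligned load */
        memcpy(&y, q, sizeof y);
        if (x != y)
            break;                 /* the byte loop below finds the byte */
        p += sizeof x;
        q += sizeof y;
        n -= sizeof x;
    }

    /* Cold loop: the tail, or the 8-byte word that differed. */
    while (n > 0)
    {
        if (*p != *q)
            return (int)*p - (int)*q;
        p++;
        q++;
        n--;
    }
    return 0;
}
```

Falling back to the byte loop for the mismatching word avoids any endianness-dependent tricks for locating the first differing byte.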

Outside of that, I am interested in trying to learn llvm's loop transformations, such as matching patterns that will modify a basic block to a function call (in particular, something like optimizing a basic memcpy to a call to the memcpy lib if the size of the memcpy is unknown or greater than a certain value).
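
To make the loop-to-call idea concrete, this is the kind of before/after pair such a pass works with, written out by hand. Both function names are made up for illustration; the `restrict` qualifiers stand for the no-overlap guarantee a compiler would have to prove before it could substitute memcpy.

```c
#include <stddef.h>
#include <string.h>

/* Before: the byte-copy loop a pattern matcher would try to recognize.
   restrict promises the buffers do not overlap. */
static void copy_loop(unsigned char* restrict dst,
                      const unsigned char* restrict src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* After: the same effect expressed as a library call, which pays off when
   n is unknown at compile time or larger than some threshold. */
static void copy_call(unsigned char* restrict dst,
                      const unsigned char* restrict src, size_t n)
{
    memcpy(dst, src, n);
}
```

Without the no-overlap guarantee the loop has forward-copy semantics on overlapping buffers, so replacing it with memcpy would not be a safe transformation.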

@GitMensch
Contributor

@LADSoft
Owner

LADSoft commented Feb 14, 2026

so i think it is valuable that you are doing this 😄

@chuggafan
Contributor Author

I'm kinda-sorta stalling out on this while working on my benchmark lib, because picobench and gbenchmark are both not compiling.
I do actually have my little memcmp done. glibc doesn't bother with SSE2 at all for memcmp, so I'll probably investigate a similar methodology to theirs and compare speeds with this little benchmark library.
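
A quick-and-dirty timing harness of the sort described could look something like this. It is only a sketch under assumptions: `bench_copy` and `copy_fn` are hypothetical names, `clock()` is coarse, and the numbers it produces are only meaningful relative to each other on the same machine.

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Signature shared by memcpy-like routines under test. */
typedef void* (*copy_fn)(void*, const void*, size_t);

/* Hypothetical helper: average CPU seconds per call of fn for a copy of
   `size` bytes (clamped to the 4096-byte scratch buffers). */
static double bench_copy(copy_fn fn, size_t size, unsigned long iterations)
{
    static unsigned char src[4096], dst[4096];
    volatile unsigned char sink = 0;

    if (size > sizeof src)
        size = sizeof src;
    if (iterations == 0)
        return 0.0;

    clock_t start = clock();
    for (unsigned long i = 0; i < iterations; i++)
    {
        src[0] = (unsigned char)i;           /* vary the input each round */
        fn(dst, src, size);
        sink ^= dst[size ? size - 1 : 0];    /* keep the copy observable */
    }
    clock_t end = clock();

    (void)sink;
    return (double)(end - start) / CLOCKS_PER_SEC / (double)iterations;
}
```

Running it as `bench_copy(memcpy, 64, 1000000UL)` and again with a candidate routine at several sizes (for example 8, 64, 256 and 4096 bytes) gives a first-order comparison across the small and large cases discussed earlier.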

On that note, for adding little utilities that don't quite fit with the Utils/ folder, should I add a maint/ folder (top level), just as "things maintainers might want but not people just randomly checking out"?
This is pretty traditional in large projects just as random maintainer stuff.

3 participants