… that this is better, at least on AMD A10-5800)
|
Thanks for your contribution! The mask uses 9 bits, so after packing the vector elements to int8 you are losing one bit. You can fix this by calling _mm_packs_epi16 on the result of _mm_cmpeq_epi16. BTW, I have a few other optimizations waiting on my PC that are not pushed here yet. One of them changes the element type in squareA_MaskT to uint16_t, so one SSE vector can hold a whole row, like AVX2 does now. Looking at your changes, I realized that the code can be optimized further by using the packs instruction. Thanks again! |
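To illustrate the point about the lost bit, here is a minimal sketch, not the actual RakeSearch code. It assumes a row of eight uint16_t elements, each holding a 9-bit mask, and that the goal is a bitmask of which elements are zero. The helper names `zero_lanes_mask` and `lossy_zero_lanes_mask` are hypothetical, and the "lossy" variant is only one possible way of narrowing to int8 too early.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

// Hypothetical helper: compare first, pack afterwards.
// Returns an 8-bit mask with bit i set when row[i] == 0.
inline int zero_lanes_mask(const uint16_t* row) {
    const __m128i zero = _mm_setzero_si128();
    const __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(row));
    // Each 16-bit lane becomes 0xFFFF (equal to zero) or 0x0000.
    const __m128i eq = _mm_cmpeq_epi16(v, zero);
    // Signed saturation maps 0xFFFF (-1) to 0xFF and 0 to 0x00,
    // so the comparison result survives the narrowing intact.
    const __m128i packed = _mm_packs_epi16(eq, zero);
    return _mm_movemask_epi8(packed) & 0xFF;
}

// Hypothetical lossy variant: narrow to 8 bits first, compare afterwards.
// Bit 8 of every element is dropped, so 0x100 is wrongly reported as zero.
inline int lossy_zero_lanes_mask(const uint16_t* row) {
    const __m128i zero = _mm_setzero_si128();
    const __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(row));
    // Keep only the low byte of each element (this is where the 9th bit is lost).
    const __m128i low = _mm_and_si128(v, _mm_set1_epi16(0x00FF));
    const __m128i packed = _mm_packus_epi16(low, zero);
    const __m128i eq = _mm_cmpeq_epi8(packed, zero);
    return _mm_movemask_epi8(eq) & 0xFF;
}
```

With a row such as `{0, 0x100, 1, 0x1FF, 0, 0x0FF, 0x100, 0}`, the first function flags only elements 0, 4 and 7, while the lossy one also flags elements 1 and 6, whose only set bit is the ninth.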
|
It's a pity that you did not allow creating Issues in the repository, so I'll write here, sorry. It would be useful to use the WUs from https://github.com/sirzooro/RakeSearch/releases/download/v1.0/test.tgz as the default workunits, along with a checking script. Or ask @CrystalFrost about this. |
|
I have enabled issues; they were disabled by default (probably inherited when I forked this repo). I have also pushed all my new changes to the optimizations2 branch. Please take a look to avoid duplicate work. |
I tried to optimize the SSE/AVX version of MovePairSearch::MovePairSearch().
However, I'm not sure the result is correct (why does the test WU do some work but not find anything?).