Improve vectorized string::find_meow_of for small bitmap cases
#6080
+16
−9
🗺️ Where it is
`basic_string[_view]::find_{first|last}_[not_]of` use a bitmap of [0, 255] characters, if they fit. `__m256i` variable. It takes a few complex instructions per element to populate one bit.
This optimization targets populating small bitmaps.
🧠 Optimization
The new approach does not explicitly split the element value into low and high parts. Instead it relies on the fact that `_mm256_sllv_epi32` / `vpsllvd` zeroes the destination element for shift counts greater than or equal to the element bit count. We broadcast the source value to all elements of an AVX2 vector and XOR the high 3 bits with an incrementing pattern, so exactly one element becomes less than 32, and that element shifts a one into the corresponding bit position.
The new approach has approximately the same cost in vector instructions, but it saves all scalar steps, which are about four logical/shift instructions.
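To make the lane arithmetic concrete, here is a minimal scalar model of the trick. It is a sketch, not the code from this PR: the eight `uint32_t` array slots stand in for the eight 32-bit elements of an `__m256i`, `sllv_lane` imitates the "count of 32 or more gives zero" behavior of `vpsllvd`, and the names `sllv_lane`, `Bitmap256`, `make_one_bit` and `check_all_values` are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Scalar model of one 32-bit element of vpsllvd (_mm256_sllv_epi32):
// a shift count of 32 or more yields 0 instead of being undefined.
std::uint32_t sllv_lane(std::uint32_t value, std::uint32_t count) {
    return count < 32 ? value << count : 0u;
}

// Stands in for one __m256i viewed as eight 32-bit elements (assumption of this sketch).
using Bitmap256 = std::array<std::uint32_t, 8>;

// Returns a 256-bit bitmap with only bit `ch` set.
Bitmap256 make_one_bit(std::uint8_t ch) {
    Bitmap256 result{};
    for (std::uint32_t lane = 0; lane < 8; ++lane) {
        // "Broadcast" ch to every lane, then XOR its high 3 bits with the lane index.
        // Only the lane whose index equals ch >> 5 ends up with a count below 32.
        const std::uint32_t count = std::uint32_t{ch} ^ (lane << 5);
        result[lane] = sllv_lane(1u, count);
    }
    return result;
}

// Checks all 256 values: lane ch / 32 holds bit ch % 32, every other lane is zero.
bool check_all_values() {
    for (std::uint32_t ch = 0; ch < 256; ++ch) {
        const Bitmap256 bm = make_one_bit(static_cast<std::uint8_t>(ch));
        for (std::uint32_t lane = 0; lane < 8; ++lane) {
            const std::uint32_t expected = (lane == ch / 32) ? (1u << (ch % 32)) : 0u;
            if (bm[lane] != expected) {
                return false;
            }
        }
    }
    return true;
}
```

In a real AVX2 version the loop would collapse into a broadcast, one XOR with a constant vector, and one variable shift; no scalar split of the value is needed.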
Instead of using 32-bit elements and splitting the source value into 3 high bits and 5 low bits, an alternative with 64-bit elements and 2 high / 6 low bits is possible. This alternative has exactly the same performance properties, but is a bit more squirrelly to get working on 32-bit x86, so there's no point in doing that. (In contrast, the old approach used 64-bit elements and had a 32-bit alternative, and that alternative was also hard to get working on 32-bit x86.)
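For comparison, the 64-bit variant can be modeled the same way. This is again only a scalar sketch under assumptions: four `uint64_t` slots stand in for the 64-bit elements of an `__m256i`, `sllv_lane64` imitates `vpsllvq` (`_mm256_sllv_epi64`) returning zero for counts of 64 or more, and all names are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Scalar model of one 64-bit element of vpsllvq (_mm256_sllv_epi64):
// a shift count of 64 or more yields 0.
std::uint64_t sllv_lane64(std::uint64_t value, std::uint64_t count) {
    return count < 64 ? value << count : std::uint64_t{0};
}

// Stands in for one __m256i viewed as four 64-bit elements (assumption of this sketch).
using Bitmap256x64 = std::array<std::uint64_t, 4>;

// Returns a 256-bit bitmap with only bit `ch` set, using 2 high / 6 low bits.
Bitmap256x64 make_one_bit64(std::uint8_t ch) {
    Bitmap256x64 result{};
    for (std::uint64_t lane = 0; lane < 4; ++lane) {
        // XOR the high 2 bits with the lane index; only lane ch >> 6 gets a count below 64.
        const std::uint64_t count = std::uint64_t{ch} ^ (lane << 6);
        result[lane] = sllv_lane64(1u, count);
    }
    return result;
}

// Checks all 256 values: lane ch / 64 holds bit ch % 64, every other lane is zero.
bool check_all_values64() {
    for (std::uint32_t ch = 0; ch < 256; ++ch) {
        const Bitmap256x64 bm = make_one_bit64(static_cast<std::uint8_t>(ch));
        for (std::uint32_t lane = 0; lane < 4; ++lane) {
            const std::uint64_t expected =
                (lane == ch / 64) ? (std::uint64_t{1} << (ch % 64)) : std::uint64_t{0};
            if (bm[lane] != expected) {
                return false;
            }
        }
    }
    return true;
}
```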
✊ Force inline
There's a codegen issue where the compiler inserts non-VEX-prefixed SSE instructions on the AVX2 path in the function epilog. This turned the optimization into a major pessimization. The easiest way around it was to eliminate that epilog, along with the function, by forcing inlining. Apart from that, the performance impact of forced inlining also seems positive.
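The mechanics of forcing a helper inline can be sketched as follows. This only illustrates the annotation and does not reproduce the codegen issue: `SKETCH_FORCEINLINE`, `set_bit` and `needle_contains` are hypothetical names, the bitmap here is plain scalar code, and the `always_inline` fallback for non-MSVC compilers is an assumption added so the sketch builds elsewhere.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// MSVC spells forced inlining __forceinline; the GCC/Clang attribute is a stand-in.
#if defined(_MSC_VER)
#define SKETCH_FORCEINLINE __forceinline
#else
#define SKETCH_FORCEINLINE inline __attribute__((always_inline))
#endif

// Small helper that would otherwise be a separate function with its own prolog/epilog.
SKETCH_FORCEINLINE void set_bit(std::uint64_t (&bitmap)[4], unsigned char ch) {
    bitmap[ch >> 6] |= std::uint64_t{1} << (ch & 63);
}

// Caller: once set_bit is inlined, there is no call and no helper epilog left.
bool needle_contains(const unsigned char* needle, std::size_t count, unsigned char probe) {
    std::uint64_t bitmap[4] = {};
    for (std::size_t i = 0; i < count; ++i) {
        set_bit(bitmap, needle[i]);
    }
    return ((bitmap[probe >> 6] >> (probe & 63)) & 1u) != 0;
}
```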
⚖️ Balance change
As one of the algorithms became faster while the others stayed the same, we can adjust the thresholds to get the most out of it. This is more tedious than the optimization itself, though, and without tuning them things won't be worse, just a possible missed opportunity.
⏱️ Benchmark results
Featured results:
- `bm<AlgType::str_member_first, char>/1011/11`
- `bm<AlgType::str_member_first, wchar_t>/325/1`
- `bm<AlgType::str_member_first, wchar_t>/1011/11`
- `bm<AlgType::str_member_last, char>/1011/11`
- `bm<AlgType::str_member_last, wchar_t>/325/1`
- `bm<AlgType::str_member_first_not, char>/1011/11`
- `bm<AlgType::str_member_first_not, wchar_t>/325/1`
- `bm<AlgType::str_member_first_not, wchar_t>/1011/11`
- `bm<AlgType::str_member_last_not, char>/1011/11`
- `bm<AlgType::str_member_last_not, wchar_t>/325/1`
- `bm<AlgType::str_member_last_not, wchar_t>/1011/11`

All results (benchmark names only):
- `bm<AlgType::str_member_first, T>` for `char`, `wchar_t`, `wchar_t, L'\x03B1'`, `char32_t`, `char32_t, U'\x03B1'`
- `bm<AlgType::str_member_last, T>` for `char`, `wchar_t`, `wchar_t, L'\x03B1'`
- `bm<AlgType::str_member_first_not, T>` for `char`, `wchar_t`, `wchar_t, L'\x03B1'`
- `bm<AlgType::str_member_last_not, T>` for `char`, `wchar_t`, `wchar_t, L'\x03B1'`

Each of the above is run with the arguments `/2/3`, `/6/81`, `/7/4`, `/9/3`, `/22/5`, `/58/2`, `/75/85`, `/102/4`, `/200/46`, `/325/1`, `/400/50`, `/1011/11`, `/1280/46`, `/1502/23`, `/2203/54`, `/3056/7`.
🥇 Results interpretation
`__forceinline` effect.
`wchar_t/325/1` should have been forwarded into the usual `find`, but here we are, and the optimized bitmap improved them,