Improve performance for Vector64.ExtractMostSignificantBits #115288
Summary
The performance of Vector64.ExtractMostSignificantBits is very poor compared to the other vector classes. Despite having the [Intrinsic] attribute, there is no hardware support enabled for the method on i386 (just ARM). Also, while the existing implementation could serve as a fallback, there's room for improvement there too.
Current Implementation
Proposed Change
Benefits
Using intrinsics instead of 'base code in a loop' greatly improves the performance of this method. The 16-bit movemasks require a slightly more advanced instruction set (Ssse3). As this was introduced back in 2006, most modern computers should support it. This proposed change also improves the performance of the 'fallback' code. Comparing the existing net9.0 code with the updated net10.0 code gives us (times in seconds):
Potential Considerations
While my test code (which is what I used to produce the stats) exercises all the applicable permutations, I don't have a real-world test to plug this into to check the performance. It's possible that (somehow) this could produce worse results under certain circumstances. Running Vector64Tests.cs shows no changes, but that's hardly real-world code either.
Promoting a 64-bit value to 128 bits to calculate the move mask might seem counterintuitive. It's true there is an instruction intended to operate on 64-bit values (exposed by the library as ParallelBitExtract). However, it requires Bmi2, which is a much newer instruction set than Sse2, meaning fewer computers support it. What's more, it has known performance issues on Zen1/Zen2 computers. And lastly, my tests show that it's slower than the 128-bit alternatives (at least on my machine).
Conclusion
It's a small, self-contained change. While there are more "lines of code" here than the original implementation, they mostly drop out at JIT compile time, leaving behind fewer asm instructions. I expect it to be faster in all circumstances. I have other performance changes in mind for Vector64.cs, but let's see how this one is received first.
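For readers who want to see the shape of the approaches discussed above, here is a minimal sketch. It is not the code from the proposed change; the class and method names (MoveMaskSketch, FromBytes, FromUInt16, FromBytesPext) are hypothetical, and the SSSE3 shuffle used for the 16-bit case is just one plausible way to do it. Each method falls back to the built-in Vector64 method when the instruction set is unavailable.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Hypothetical helper class; illustrative only, not the proposed runtime change.
public static class MoveMaskSketch
{
    // 8-bit lanes: promote to 128 bits, then use the SSE2 byte movemask.
    public static uint FromBytes(Vector64<byte> v)
    {
        if (Sse2.IsSupported)
        {
            // ToVector128 zero-fills the upper 64 bits, so mask bits 8..15 are 0.
            return (uint)Sse2.MoveMask(v.ToVector128());
        }
        return v.ExtractMostSignificantBits(); // built-in path elsewhere
    }

    // 16-bit lanes: gather the high byte of each lane into bytes 0..3 with an
    // SSSE3 shuffle, then movemask. A shuffle index of 0x80 produces a zero byte.
    public static uint FromUInt16(Vector64<ushort> v)
    {
        if (Ssse3.IsSupported)
        {
            Vector128<byte> highBytes = Vector128.Create(
                (byte)1, 3, 5, 7, 0x80, 0x80, 0x80, 0x80,
                0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80);
            Vector128<byte> gathered = Ssse3.Shuffle(v.ToVector128().AsByte(), highBytes);
            return (uint)Sse2.MoveMask(gathered);
        }
        return v.ExtractMostSignificantBits();
    }

    // The 64-bit alternative: BMI2 parallel bit extract over the top bit of each byte.
    public static uint FromBytesPext(Vector64<byte> v)
    {
        if (Bmi2.X64.IsSupported)
        {
            ulong bits = v.AsUInt64().ToScalar();
            return (uint)Bmi2.X64.ParallelBitExtract(bits, 0x8080808080808080UL);
        }
        return v.ExtractMostSignificantBits();
    }
}
```

For a byte vector whose elements 0, 2 and 7 have their top bit set, all of the byte variants should return 0x85, matching Vector64.ExtractMostSignificantBits. Which variant is actually faster depends on the machine, as noted above.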
Replies: 1 comment 1 reply
It's applicable to most
This.
You should, in general, be checking Vector64.IsHardwareAccelerated (or the corresponding IsHardwareAccelerated for other vector sizes) prior to its use. If it returns false, then it is likely going to be executing a less efficient path for most, if not all, APIs.
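As a rough illustration of that advice, the sketch below checks Vector64.IsHardwareAccelerated before choosing a code path. The class and method names (HardwareCheckSketch, MsbMask) are hypothetical, and the scalar branch assumes a little-endian machine so that byte i of the packed value corresponds to vector element i.

```csharp
using System;
using System.Runtime.Intrinsics;

// Hypothetical helper; illustrative only.
public static class HardwareCheckSketch
{
    // Returns one bit per byte of 'packed', taken from each byte's top bit.
    public static uint MsbMask(ulong packed)
    {
        if (Vector64.IsHardwareAccelerated)
        {
            // Vector64 operations map to hardware instructions on this platform.
            return Vector64.Create(packed).AsByte().ExtractMostSignificantBits();
        }

        // Otherwise avoid Vector64 and use plain scalar code.
        // Assumes little-endian: byte i of 'packed' is element i.
        uint mask = 0;
        for (int i = 0; i < 8; i++)
        {
            mask |= (uint)((packed >> (8 * i + 7)) & 1) << i;
        }
        return mask;
    }
}
```

Because IsHardwareAccelerated is treated as a constant by the JIT, the branch not taken is normally removed from the generated code.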