Replies: 2 comments
-
I see that this is double-precision AVX; are you comparing against single-precision SSE? I'd expect not much difference, as they are both 4-wide, and AVX might even be slower since it uses more memory. The AVX version should be more accurate, if nothing else :) I agree that memory latency is the issue. Bullet just isn't very SoA-friendly the way it is currently written; to truly benefit from AVX (especially single-precision 8-wide, and AVX-512 at 16-wide), large parts of Bullet would need to be rewritten.
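To make the AoS-vs-SoA point concrete, here is a minimal sketch (hypothetical types, not Bullet code) of the two layouts. In the AoS form Bullet effectively uses, consecutive objects interleave x, y, z, so filling all SIMD lanes with the same component needs a gather; in the SoA form, each component is contiguous and maps directly onto wide loads.

```cpp
#include <cstddef>
#include <vector>

// AoS: one xyz tuple per object, the layout Bullet effectively uses today.
struct Vec3AoS { double x, y, z; };

// SoA: each component stored contiguously, the layout a 4-wide (double)
// or 8-wide (float) AVX kernel wants.
struct Vec3SoA {
    std::vector<double> x, y, z;
};

// Scalar stand-in for a SIMD kernel: sum of dot(a[i], b[i]).
// In AoS form the components of one object are adjacent, but the same
// component of consecutive objects is strided, so wide loads need gathers.
double dotSumAoS(const std::vector<Vec3AoS>& a, const std::vector<Vec3AoS>& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i)
        s += a[i].x * b[i].x + a[i].y * b[i].y + a[i].z * b[i].z;
    return s;
}

// In SoA form the same loop maps onto contiguous vector loads:
// a.x[i..i+3] would be a single 256-bit load in a double-precision AVX path.
double dotSumSoA(const Vec3SoA& a, const Vec3SoA& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.x.size(); ++i)
        s += a.x[i] * b.x[i] + a.y[i] * b.y[i] + a.z[i] * b.z[i];
    return s;
}
```

Both functions compute the same result; only the memory-access pattern differs, which is why the SoA rewrite (not the per-vector intrinsics) is where the real AVX win would come from.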
-
No, I compared against the double-precision scalar version (we always use the double-precision version, since we simulate on the whole Earth, so the coordinates are large). Thanks for your feedback; I think your analysis is correct (I came to the same conclusion). Having AVX used for a single matrix/vector transformation, or a vector normalization, is not really worth it.
-
Hi Erwin,
I've spent a few days implementing an AVX path (using the SSE one as a reference) for the double-precision version of Bullet.
Most of the functions that have an SSE implementation now also have an AVX code path, apart from two or three of them (search for "TODO_AVX" in the source code to locate them).
Before spending more time implementing an AVX path for the remaining big functions (maxdot/mindot), I decided to benchmark the current implementation (i.e. all methods/functions that have an SSE implementation now have an AVX path as well, apart from the few mentioned above). The gain was negligible in my test scenario.
So there is a reference AVX implementation available; it is correct and works well, but it doesn't provide any significant improvement (at least, not enough to justify breaking compatibility with older non-AVX CPUs), not in our typical scenario anyway.
I assume the memory latency overhead is huge compared to the few SIMD instructions performed.
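A back-of-envelope Amdahl's-law sketch supports this. The cycle counts below are ballpark assumptions, not measurements: a 3x3 matrix times vector transform is roughly 9 multiplies + 6 adds, while one cold load from DRAM can cost on the order of 100 cycles, so even an ideal 4x ALU speedup barely moves the total.

```cpp
// Hypothetical additive cost model: total time = ALU cycles + memory cycles.
double transformCycles(double aluCycles, double memCycles) {
    return aluCycles + memCycles;
}

// End-to-end speedup when SIMD divides only the ALU portion by the lane
// count and the memory stall is untouched (Amdahl's law).
double speedup(double scalarAlu, double simdLanes, double memCycles) {
    return transformCycles(scalarAlu, memCycles)
         / transformCycles(scalarAlu / simdLanes, memCycles);
}
```

With the assumed numbers (15 ALU cycles, 4 lanes, 100-cycle miss), `speedup(15.0, 4.0, 100.0)` comes out just under 1.11, i.e. less than a 10% end-to-end gain despite the ideal 4x ALU speedup, which is consistent with the negligible benchmark result. Only with the memory stall removed (the SoA rewrite) does the full 4x show through.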
The commit can be found in my fork on GitHub: gjaegy@f2e7c78
I'm happy to answer any questions if you want to have a look at this work. Maybe someone with deeper AVX expertise could take a look and find a way to speed it up.