btVector3 use of union with __m128 causes suboptimal code gen #3594
Replies: 3 comments
-
That's a very interesting finding in my opinion. Have you received any feedback from Erwin? I think your suggestion should be implemented, since it provides a significant performance win. (I'm also trying to speed up Bullet a bit through AVX; we use the double-precision version here.)
-
Hi gjaegy, I haven't heard anything about this topic. Bullet is probably mostly memory bound, so without a large rewrite to make the entire thing much more SoA-friendly, it probably doesn't make a huge difference anyway.
-
Yes, I think you are right. I came to the same conclusion after implementing that AVX code path :/
-
Here is a godbolt link demonstrating the issue: https://godbolt.org/z/gyyvcz
You can set USE_UNION to 0 to see how it improves code gen.
With MSVC you can already access this data directly via mVec128.m128_f32, but I believe this isn't available with GCC, so you probably need a generic SSE extract helper instead.
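As a rough idea of what such a generic extract could look like, here is a minimal sketch using only baseline SSE intrinsics. The name `sse_extract` is made up for illustration and is not part of Bullet; the lane index is a template parameter because `_mm_shuffle_ps` needs a compile-time immediate.

```cpp
#include <xmmintrin.h>  // SSE intrinsics (_mm_shuffle_ps, _mm_cvtss_f32)

// Hypothetical helper (not part of Bullet): returns lane `Lane` (0..3) of v
// without going through a union or the MSVC-only m128_f32 member.
template <int Lane>
inline float sse_extract(__m128 v) {
    static_assert(Lane >= 0 && Lane < 4, "lane must be in 0..3");
    // Broadcast the requested lane into every slot, then read the low float.
    return _mm_cvtss_f32(
        _mm_shuffle_ps(v, v, _MM_SHUFFLE(Lane, Lane, Lane, Lane)));
}
```

For lane 0 the shuffle is an identity, so a compiler should be able to reduce it to a plain register read.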
Replace all direct access to m_floats with the .x(), .y(), etc. member functions to make it more portable.
I'm not using ARM NEON, but it appears there is vgetq_lane_f32 for extraction and vsetq_lane_f32 for setting.
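To make the accessor idea concrete, here is a small sketch of a vector type that stores only the SIMD register and exposes lanes through member functions. `Vec3Sketch` and its members are hypothetical names, not Bullet's actual btVector3. The NEON branch is written from the intrinsic signatures only and has not been exercised here, and the SSE setter goes through a small aligned temporary just to keep the sketch short.

```cpp
#if defined(__ARM_NEON)
#include <arm_neon.h>
#else
#include <xmmintrin.h>
#endif

// Hypothetical stand-in for btVector3: holds only the SIMD register, no union.
struct Vec3Sketch {
#if defined(__ARM_NEON)
    float32x4_t v;
#else
    __m128 v;
#endif

    static Vec3Sketch make(float x, float y, float z) {
        Vec3Sketch r;
#if defined(__ARM_NEON)
        const float tmp[4] = {x, y, z, 0.0f};
        r.v = vld1q_f32(tmp);
#else
        r.v = _mm_set_ps(0.0f, z, y, x);  // arguments run from lane 3 down to lane 0
#endif
        return r;
    }

    // Read one lane; Lane must be a compile-time constant in 0..3.
    template <int Lane>
    float get() const {
#if defined(__ARM_NEON)
        return vgetq_lane_f32(v, Lane);
#else
        return _mm_cvtss_f32(
            _mm_shuffle_ps(v, v, _MM_SHUFFLE(Lane, Lane, Lane, Lane)));
#endif
    }

    // Write one lane, leaving the other three untouched.
    template <int Lane>
    void set(float f) {
#if defined(__ARM_NEON)
        v = vsetq_lane_f32(f, v, Lane);
#else
        alignas(16) float tmp[4];
        _mm_store_ps(tmp, v);
        tmp[Lane] = f;
        v = _mm_load_ps(tmp);
#endif
    }

    float x() const { return get<0>(); }
    float y() const { return get<1>(); }
    float z() const { return get<2>(); }
    void setX(float f) { set<0>(f); }
    void setY(float f) { set<1>(f); }
    void setZ(float f) { set<2>(f); }
};
```

Calling code then only touches `x()`, `setY()` and so on, so the storage can change per platform without affecting callers.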
I made this change to my version of Bullet, but it has other changes, so I wanted to see if this was something Erwin (or whoever makes these decisions) was okay with before submitting anything, since I'd have to redo it in Bullet's master branch.
This issue is more obvious when you have __vectorcall as the default calling convention, since it basically disables __vectorcall.
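For anyone who wants to try this locally rather than on godbolt, a stripped-down sketch of the comparison might look like the following. It is not the code behind the link above. `Vec4Sketch`, `addSketch` and `SKETCH_VECTORCALL` are made-up names, and the macro only expands to `__vectorcall` on MSVC x86/x64 so the snippet still builds elsewhere. Compile it once with USE_UNION set to 1 and once with 0, then compare the assembly for `addSketch`.

```cpp
#include <xmmintrin.h>

#ifndef USE_UNION
#define USE_UNION 1  // set to 0 to drop the float array from the type
#endif

// Assumption: only MSVC-style compilers on x86/x64 get __vectorcall here.
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
#define SKETCH_VECTORCALL __vectorcall
#else
#define SKETCH_VECTORCALL
#endif

// Hypothetical reduced version of the btVector3 storage layout.
struct Vec4Sketch {
#if USE_UNION
    union {
        __m128 mVec128;
        float m_floats[4];
    };
#else
    __m128 mVec128;
#endif
};

// Takes both arguments by value so the calling convention decides whether
// they travel in registers or through memory.
Vec4Sketch SKETCH_VECTORCALL addSketch(Vec4Sketch a, Vec4Sketch b) {
    Vec4Sketch r;
    r.mVec128 = _mm_add_ps(a.mVec128, b.mVec128);
    return r;
}
```

The result is the same either way; only the generated argument passing is expected to differ.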
btQuaternion has the same issue.