btVector3 use of union with __m128 causes suboptimal code gen #3594

TrianglesPCT · 2019-01-19T02:39:58Z

Here is a godbolt link demonstrating: https://godbolt.org/z/gyyvcz

You can set USE_UNION to 0 to see how it improves code gen

With MSVC you can directly access this data already via mVec128.m128_f32, but I believe this isn't available with GCC so you probably need something like this as a generic SSE extract.

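#include <xmmintrin.h> //SSE intrinsics

typedef __m128 v4float; //assumed: v4float is an alias for __m128

//Move element at Index into lane 0 of the result
template<int Index>
inline __m128 extract_v(v4float a) {
	return _mm_shuffle_ps(a, a, _MM_SHUFFLE(0, 0, 0, Index));
	//int float_as_int = _mm_extract_ps(a, Index); //SSE 4.1
}
//Get element at index of SIMD vector
template<int Index>
inline float extract(v4float a) {
	return _mm_cvtss_f32(extract_v<Index>(a));
}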

Then replace all direct accesses to m_floats with the .x(), .y(), etc. member functions, which also makes the code more portable.
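To make the idea concrete, here is a minimal sketch of a union-free vector type, assuming an x86 target with SSE. `Vec3Sketch`, `get<Index>()` and `sumComponents` are hypothetical names for illustration, not Bullet's actual btVector3 API. The struct holds only the `__m128`, and every lane read goes through a member function instead of a float array that shares storage with the register.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE: __m128, _mm_set_ps, _mm_shuffle_ps, _mm_cvtss_f32

// Hypothetical stand-in for btVector3: no union, only the SIMD register.
struct Vec3Sketch {
	__m128 mVec128;

	Vec3Sketch(float x, float y, float z)
		: mVec128(_mm_set_ps(0.0f, z, y, x)) {}  // _mm_set_ps takes lanes high to low

	// Compile-time lane read: shuffle lane Index into lane 0, then read lane 0.
	template <int Index>
	float get() const {
		static_assert(Index >= 0 && Index < 4, "lane index out of range");
		return _mm_cvtss_f32(
			_mm_shuffle_ps(mVec128, mVec128, _MM_SHUFFLE(0, 0, 0, Index)));
	}

	float x() const { return _mm_cvtss_f32(mVec128); }  // lane 0 needs no shuffle
	float y() const { return get<1>(); }
	float z() const { return get<2>(); }

	// Runtime-indexed read: spill to the stack only when the index is not constant.
	float operator[](int i) const {
		assert(i >= 0 && i < 4);
		alignas(16) float tmp[4];
		_mm_store_ps(tmp, mVec128);
		return tmp[i];
	}
};

// Example caller that would otherwise have indexed the float array directly.
inline float sumComponents(const Vec3Sketch& v) {
	return v.x() + v.y() + v.z();
}
```

Callers keep writing `v.x()` and `v[i]` as before; only the storage behind those accessors changes, which is why the switch can be mostly mechanical.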

I'm not using ARM NEON, but it appears there is vgetq_lane_f32 for extraction and vsetq_lane_f32 for setting.
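A rough sketch of how the extract helper might be made portable across both instruction sets. The `insert` and `make4` helpers, the scalar fallback, and the exact preprocessor checks are my own assumptions for illustration; only `vgetq_lane_f32`, `vsetq_lane_f32` and the SSE shuffle come from the discussion above. I have not measured the NEON path.

```cpp
#include <cassert>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
typedef float32x4_t v4float;

template <int Index>
inline float extract(v4float a) {
	return vgetq_lane_f32(a, Index);  // lane must be a compile-time constant
}
template <int Index>
inline v4float insert(v4float a, float value) {
	return vsetq_lane_f32(value, a, Index);
}
inline v4float make4(float x, float y, float z, float w) {
	const float lanes[4] = {x, y, z, w};
	return vld1q_f32(lanes);
}
#elif defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
typedef __m128 v4float;

template <int Index>
inline float extract(v4float a) {
	return _mm_cvtss_f32(_mm_shuffle_ps(a, a, _MM_SHUFFLE(0, 0, 0, Index)));
}
template <int Index>
inline v4float insert(v4float a, float value) {
	// Plain SSE has no single lane-insert instruction, so this sketch
	// round-trips through memory; SSE4.1 offers _mm_insert_ps instead.
	alignas(16) float lanes[4];
	_mm_store_ps(lanes, a);
	lanes[Index] = value;
	return _mm_load_ps(lanes);
}
inline v4float make4(float x, float y, float z, float w) {
	return _mm_set_ps(w, z, y, x);  // arguments run from high lane to low lane
}
#else
// Scalar fallback so the same call sites compile without SIMD support.
struct v4float { float lanes[4]; };

template <int Index>
inline float extract(v4float a) { return a.lanes[Index]; }
template <int Index>
inline v4float insert(v4float a, float value) { a.lanes[Index] = value; return a; }
inline v4float make4(float x, float y, float z, float w) {
	return v4float{{x, y, z, w}};
}
#endif

// Usage: overwrite lane 1, then read lanes back.
inline void roundTripExample() {
	v4float v = make4(1.0f, 2.0f, 3.0f, 4.0f);
	v = insert<1>(v, 9.0f);
	assert(extract<0>(v) == 1.0f);
	assert(extract<1>(v) == 9.0f);
}
```

Keeping the lane index as a template parameter matters on both platforms, since the underlying intrinsics need an immediate rather than a runtime value.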

I made this change in my own version of Bullet, but that version has other changes too, so I wanted to check whether Erwin (or whoever makes these decisions) is okay with this before submitting anything, since I'd have to redo it on Bullet's master branch.

This issue is more obvious when you have __vectorcall as the default calling convention, since it basically disables __vectorcall.

btQuaternion has the same issue.

gjaegy · 2019-07-04T08:22:33Z

That's a very interesting finding in my opinion.

Have you received any feedback from Erwin?

I think your suggestion should be implemented, since it provides a significant performance win. (I'm also trying to speed up Bullet a bit through AVX; we use the double-precision version here.)

TrianglesPCT (Author) · 2019-07-09T21:09:02Z

Hi gjaegy, I haven't heard anything about this topic.

Bullet is probably mostly memory bound, so without a large rewrite to make the entire thing much more SoA-friendly, this change probably doesn't make a huge difference anyway.
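For anyone unfamiliar with the term, here is a small sketch of what an SoA (structure of arrays) layout might look like, in contrast to one padded 4-float vector per element. `Vec3SoA` and `dotAll` are hypothetical names and this is not how Bullet stores anything today; it only illustrates the kind of layout a rewrite would be aiming for.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// SoA: one contiguous array per component, instead of an array of
// {x, y, z, pad} structs (AoS).
struct Vec3SoA {
	std::vector<float> x, y, z;
	explicit Vec3SoA(std::size_t n) : x(n), y(n), z(n) {}
	std::size_t size() const { return x.size(); }
};

// Dot product of each pair of vectors. Every array is read with unit stride
// and no padding lane is loaded, so compilers can usually vectorize the loop
// without per-element shuffles.
inline void dotAll(const Vec3SoA& a, const Vec3SoA& b, std::vector<float>& out) {
	assert(a.size() == b.size());
	out.resize(a.size());
	for (std::size_t i = 0; i < a.size(); ++i) {
		out[i] = a.x[i] * b.x[i] + a.y[i] * b.y[i] + a.z[i] * b.z[i];
	}
}
```

The catch is that every algorithm touching these vectors has to be written to process batches, which is why it amounts to a large rewrite rather than a local change.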

gjaegy · 2019-07-10T08:16:02Z

Yes, I think you are right. I came to the same conclusion after implementing that AVX code path :/
