@illwieckz
Member

@illwieckz illwieckz commented Sep 29, 2025

  • renderer: introduce R_TBNtoQtangentsFast() and floatToSnorm16_fast()
  • renderer: faster Tess_SurfaceIQM() and Tess_SurfaceMD5() with R_TBNtoQtangentsFast()

I noticed that the Tess_SurfaceIQM() function (used in the CPU model code) was spending significant time in R_TBNtoQtangents(), which in turn was spending significant time in floatToSnorm16(). I then discovered that floatToSnorm16() rounds every component of the vector to the closest int, calling lrintf() on each component one by one. But this is model rendering code: we may not need that much precision, and a basic truncate is likely good enough. Also, this CPU code is a fallback for low-end devices that can't process the current model on the GPU, so the player is already using a low preset with low LOD and already gets worse results than that in other rendered things. Using a basic truncate means the compiler can vectorize, and it does.

I applied the same trick to Tess_SurfaceMD5() as well.

Other model code (like the MD3 code) that uses R_TBNtoQtangents() at load time still uses the properly rounded, lrintf()-based variant. The same goes for the IQM/MD5 loading code for the GPU code path.
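
As a rough illustration of the two conversion strategies described above, here is a minimal sketch. The names toSnorm16Rounded(), toSnorm16Truncated() and maxSnorm16Difference() are hypothetical stand-ins, not the engine's floatToSnorm16() and floatToSnorm16_fast(), and the 32767 scale factor is an assumption about the snorm16 encoding.

```cpp
#include <cmath>
#include <cstdint>
#include <cstdlib>

// Hypothetical stand-in: rounds each scaled component to the nearest
// integer with lrintf(), which typically costs one library call per component.
static void toSnorm16Rounded( const float in[ 4 ], int16_t out[ 4 ] )
{
	for ( int i = 0; i < 4; i++ )
	{
		out[ i ] = static_cast<int16_t>( lrintf( in[ i ] * 32767.0f ) );
	}
}

// Hypothetical stand-in: truncates toward zero with a plain cast,
// which compilers can lower to inline conversion instructions.
static void toSnorm16Truncated( const float in[ 4 ], int16_t out[ 4 ] )
{
	for ( int i = 0; i < 4; i++ )
	{
		out[ i ] = static_cast<int16_t>( in[ i ] * 32767.0f );
	}
}

// Returns the largest per-component difference between both variants,
// expressed in snorm16 units. Inputs are assumed to be in [-1, 1].
static int maxSnorm16Difference( const float in[ 4 ] )
{
	int16_t rounded[ 4 ];
	int16_t truncated[ 4 ];

	toSnorm16Rounded( in, rounded );
	toSnorm16Truncated( in, truncated );

	int worst = 0;

	for ( int i = 0; i < 4; i++ )
	{
		int difference = std::abs( rounded[ i ] - truncated[ i ] );

		if ( difference > worst )
		{
			worst = difference;
		}
	}

	return worst;
}
```

For inputs in [-1, 1] the two variants should differ by at most one snorm16 unit per component, which is the trade-off accepted here for render-time code.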

Before, 97fps, with generated code:

		floatToSnorm16( q, resqtangent );
0x5ae46f35db61:	mulss        xmm0, dword ptr [rip + 0x2b6537]
0x5ae46f35db69:	movss        dword ptr [rsp + 8], xmm3
0x5ae46f35db6f:	movss        dword ptr [rsp + 4], xmm2
0x5ae46f35db75:	movss        dword ptr [rsp + 0xc], xmm1
0x5ae46f35db7b:	call         0x5ae46f1c3440 (???)
0x5ae46f35db80:	movss        xmm1, dword ptr [rsp + 0xc]
0x5ae46f35db86:	movss        xmm0, dword ptr [rip + 0x2b6512]
0x5ae46f35db8e:	mov          rbp, rax
0x5ae46f35db91:	mulss        xmm0, xmm1
0x5ae46f35db95:	call         0x5ae46f1c3440 (???)
0x5ae46f35db9a:	movss        xmm2, dword ptr [rsp + 4]
0x5ae46f35dba0:	movss        xmm0, dword ptr [rip + 0x2b64f8]
0x5ae46f35dba8:	mov          r12, rax
0x5ae46f35dbab:	mulss        xmm0, xmm2
0x5ae46f35dbaf:	call         0x5ae46f1c3440 (???)
0x5ae46f35dbb4:	movss        xmm0, dword ptr [rip + 0x2b64e4]
0x5ae46f35dbbc:	movss        xmm3, dword ptr [rsp + 8]
0x5ae46f35dbc2:	mov          r13, rax
0x5ae46f35dbc5:	mulss        xmm0, xmm3
0x5ae46f35dbc9:	call         0x5ae46f1c3440 (???)

After, 103fps, with generated code:

		floatToSnorm16_fast( q, resqtangent );
0x6132db310b42:	mulss        xmm0, dword ptr [rip + 0x2b6556]
	if ( fast )
0x6132db310b4a:	test         bpl, bpl
0x6132db310b4d:	je           0x6132db310ce0
0x6132db310b53:	mulss        xmm3, dword ptr [rip + 0x2b6545]
0x6132db310b5b:	cvttss2si    r14d, xmm0
0x6132db310b60:	mulss        xmm4, dword ptr [rip + 0x2b6538]
0x6132db310b68:	mulss        xmm5, dword ptr [rip + 0x2b6530]
0x6132db310b70:	cvttss2si    r13d, xmm3
0x6132db310b75:	cvttss2si    ebp, xmm4
0x6132db310b79:	cvttss2si    eax, xmm5

  • renderer: faster Tess_SurfacePolychain() with R_TBNtoQtangentsFast()

I don't know what Tess_SurfacePolychain() is used for, so I don't know if it's safe to use R_TBNtoQtangentsFast() there.
I noticed this is another function that is called at render time, so making it faster can reduce frametime as well.

  • renderer: faster Tess_SurfacePolychain() with VectorNormalizeFast()
  • renderer: faster Tess_SurfacePolychain() with VectorNormalizeFast() in R_CalcTangents()
  • renderer: faster other R_CalcTangents() with VectorNormalizeFast()

It is safe to use VectorNormalizeFast() whenever the return value of VectorNormalize() is unused (no difference in the final computation, just less branching).

I don't know what uses that other R_CalcTangents() variant, but we can blindly use VectorNormalizeFast() when the return value is unused, so let's do it there as well.
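
To make the distinction concrete, here is a minimal sketch of the two shapes of normalize function discussed above. normalizeWithLength() and normalizeNoReturn() are hypothetical stand-ins: the engine's actual VectorNormalize() and VectorNormalizeFast() may be implemented differently (a fast variant could, for instance, use an approximate reciprocal square root).

```cpp
#include <cmath>

// Hypothetical stand-in for a normalize that reports the original length
// and guards against zero-length input with a branch.
static float normalizeWithLength( float v[ 3 ] )
{
	float length = std::sqrt( v[ 0 ] * v[ 0 ] + v[ 1 ] * v[ 1 ] + v[ 2 ] * v[ 2 ] );

	if ( length != 0.0f )
	{
		float inverse = 1.0f / length;
		v[ 0 ] *= inverse;
		v[ 1 ] *= inverse;
		v[ 2 ] *= inverse;
	}

	return length;
}

// Hypothetical stand-in for a normalize with no return value and no
// zero-length branch. In this sketch the caller must not pass a
// zero-length vector, since that would divide by zero.
static void normalizeNoReturn( float v[ 3 ] )
{
	float inverse = 1.0f / std::sqrt( v[ 0 ] * v[ 0 ] + v[ 1 ] * v[ 1 ] + v[ 2 ] * v[ 2 ] );
	v[ 0 ] *= inverse;
	v[ 1 ] *= inverse;
	v[ 2 ] *= inverse;
}
```

When a call site discards the returned length and the input is known to be non-zero, both shapes leave the same normalized vector behind, so the second one only saves the branch and the return.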

@illwieckz
Member Author

I actually wonder if we really need that rounding with lrintf() via floatToSnorm16() in R_TBNtoQtangents(). If we don't need it, we can even skip the test. But then, I have no idea what the real drawback of truncating instead of rounding is. I just assume that for rendering a model it can't be that bad.
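
For a rough sense of the drawback, assuming components $x \in [-1, 1]$ scaled by 32767 (an assumption about the snorm16 conversion, and ignoring the rounding error of the float multiplication itself), the per-component quantization errors are bounded by:

$$
\left| \frac{\operatorname{round}(32767\,x)}{32767} - x \right| \le \frac{0.5}{32767} \approx 1.5 \times 10^{-5},
\qquad
\left| \frac{\operatorname{trunc}(32767\,x)}{32767} - x \right| < \frac{1}{32767} \approx 3.1 \times 10^{-5}
$$

So truncation at most doubles the worst-case error of one component, and it always errs toward zero instead of being spread evenly on both sides.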

@illwieckz illwieckz force-pushed the illwieckz/faster-tbntoqtangents branch from f023987 to ad8c15c Compare September 29, 2025 18:40
@slipher
Member

slipher commented Sep 30, 2025

LGTM

Screenshots with r_vboModels 0 generally came out pixel-for-pixel identical before/after these changes.

@illwieckz
Member Author

I looked at Tess_SurfacePolychain(): it is used to set up fog volumes, and it looks like the cgame trail code can also make use of it. So using faster truncating code here instead of precise rounding should be fine as well. It's not as if we were doing entity collisions or things like that.

@illwieckz illwieckz merged commit 4ff4427 into master Sep 30, 2025
9 checks passed
@illwieckz illwieckz deleted the illwieckz/faster-tbntoqtangents branch September 30, 2025 15:51