@adelejjeh adelejjeh commented Nov 17, 2025

This PR is intended for testing only before upstream PR gets opened.

Problem Summary

PyTorch's test_warp_softmax_64bit_indexing began failing after the latest mainline promotion. The failure manifested as a numerical precision error: log(1.1422761679) was computed with 54% higher error than expected (9.042e-09 vs 5.859e-09), causing gradient computations to exceed tolerance thresholds. The precision degradation was reproducible across all AMD GPU architectures tested (gfx1100, gfx1200, gfx90a, gfx950).

I tracked down the problem to the upstream commit 4703f8b (March 6, 2025) titled "clang/HIP: Use generic builtins for f32 exp and log (llvm#129638)". This commit changed HIP math headers to call __builtin_logf() directly instead of __ocml_log_f32():

- float logf(float __x) { return __FAST_OR_SLOW(__logf, __ocml_log_f32)(__x); }
+ float logf(float __x) { return __FAST_OR_SLOW(__logf, __builtin_logf)(__x); }

This change exposed a bug in how Clang handles the contract fast-math flag on log intrinsics for the AMDGCN target.

Key Findings

1. Contract flag propagation: When -ffp-contract=fast is enabled (default for HIP), Clang's CodeGen adds the contract flag to all CallInst instructions within the scope of CGFPOptionsRAII, including calls to LLVM intrinsics like llvm.log.f32.

2. Behavior change from OCML to builtin path:

  • Old path (via __ocml_log_f32): The preprocessed IR showed the call to the OCML library function had the contract flag, but the OCML implementation internally dropped the contract flag when calling the llvm.log.f32 intrinsic.
; Function Attrs: alwaysinline convergent mustprogress nounwind
define internal noundef float @_ZL4logff(float noundef %__x) #6 {
entry:
  %retval = alloca float, align 4, addrspace(5)
  %__x.addr = alloca float, align 4, addrspace(5)
  %retval.ascast = addrspacecast ptr addrspace(5) %retval to ptr
  %__x.addr.ascast = addrspacecast ptr addrspace(5) %__x.addr to ptr
  store float %__x, ptr %__x.addr.ascast, align 4, !tbaa !23
  %0 = load float, ptr %__x.addr.ascast, align 4, !tbaa !23
  %call = call contract float @__ocml_log_f32(float noundef %0) #23
  ret float %call
}

; Function Attrs: convergent mustprogress nofree norecurse nosync nounwind willreturn memory(none)
define internal noundef float @__ocml_log_f32(float noundef %0) #7 {
  %2 = tail call float @llvm.log.f32(float %0)
  ret float %2
}
  • New path (via __builtin_logf): The call goes directly to llvm.log.f32 intrinsic with the contract flag preserved, causing the backend to apply FMA contraction during polynomial expansion.
; Function Attrs: alwaysinline convergent mustprogress nounwind
define internal noundef float @_ZL4logff(float noundef %__x) #6 {
entry:
  %retval = alloca float, align 4, addrspace(5)
  %__x.addr = alloca float, align 4, addrspace(5)
  %retval.ascast = addrspacecast ptr addrspace(5) %retval to ptr
  %__x.addr.ascast = addrspacecast ptr addrspace(5) %__x.addr to ptr
  store float %__x, ptr %__x.addr.ascast, align 4, !tbaa !24
  %0 = load float, ptr %__x.addr.ascast, align 4, !tbaa !24
  %1 = call contract float @llvm.log.f32(float %0)
  ret float %1
}

3. Why contract breaks log: Our AMDGCN target backend implements the natural logarithm by taking the result of the hardware log instruction, multiplying it by ln(2), and applying a rounding-error correction to that multiplication. This results in something like:

r = y * c1;                         // y is the result of the v_log_f32 instruction, c1 = ln(2)
r = r + fma(y, c2, fma(y, c1, -r)); // c2 is another error-correcting constant
  v_log_f32_e32 v1, v1
  s_mov_b32 s2, 0x3f317217
  v_mul_f32_e32 v3, 0x3f317217, v1
  v_fma_f32 v4, v1, s2, -v3
  v_fmac_f32_e32 v4, 0x3377d1cf, v1
  v_add_f32_e32 v3, v3, v4

With the contract flag present, the backend fuses the final add (r plus the correction term) with the multiply that produced r, assuming the fusion is legal, and thereby eliminates the intermediate rounding. The error-compensation term, which was computed against the rounded product, is now added to the full-precision result of the FMA, yielding incorrect error correction and degraded accuracy. The contracted sequence becomes:

r = y * c1;
r = fma(y, c1, fma(y, c2, fma(y, c1, -r)));
  v_log_f32_e32 v1, v1
  s_mov_b32 s2, 0x3f317217
  v_mul_f32_e32 v3, 0x3f317217, v1
  v_fma_f32 v3, v1, s2, -v3
  v_fmac_f32_e32 v3, 0x3377d1cf, v1
  v_fmac_f32_e32 v3, 0x3f317217, v1

Solution and Proposed Fix

Given our implementation of log, it should be illegal to add the contract flag to the intrinsic call, because the implementation relies on error-compensated summation. A contract flag on a CallInst indicates that it is legal to propagate the flag into the body of the called function, but that is not true here: as described above, the error-compensated summation does not tolerate contraction.

My proposed fix involves adding logic to CGBuiltin.cpp to explicitly disable the contract flag on the CallInst for the llvm.log intrinsic when the target is AMDGCN/HIP.

This ensures these intrinsic calls never carry the contract flag, preventing incorrect FMA formation during polynomial expansion regardless of the -ffp-contract setting.

@carlobertolli carlobertolli marked this pull request as draft November 17, 2025 22:51
@adelejjeh adelejjeh marked this pull request as ready for review November 19, 2025 17:10
@adelejjeh adelejjeh removed the request for review from carlobertolli November 19, 2025 17:10
@adelejjeh
Author

Closing since CI passed

@adelejjeh adelejjeh closed this Nov 19, 2025
@adelejjeh
Author

Upstream PR opened: llvm#168770
