Conversation

@adelejjeh
Contributor

@adelejjeh adelejjeh commented Nov 19, 2025

Problem Summary

PyTorch's test_warp_softmax_64bit_indexing began failing after the latest mainline promotion. The failure manifested as a numerical precision error: log(1.1422761679) computed with 54% higher error than expected (9.042e-09 vs 5.859e-09), causing gradient computations to exceed tolerance thresholds. The precision degradation was reproducible across all AMD GPU architectures tested (gfx1100, gfx1200, gfx90a, gfx950).

I tracked down the problem to the upstream commit 4703f8b (March 6, 2025) titled "clang/HIP: Use generic builtins for f32 exp and log (#129638)". This commit changed HIP math headers to call __builtin_logf() directly instead of __ocml_log_f32():

- float logf(float __x) { return __FAST_OR_SLOW(__logf, __ocml_log_f32)(__x); }
+ float logf(float __x) { return __FAST_OR_SLOW(__logf, __builtin_logf)(__x); }

This change exposed a bug in how Clang handles the contract fast-math flag on log intrinsics for the AMDGCN target.

Key Findings

1. Contract flag propagation: When -ffp-contract=fast is enabled (default for HIP), Clang's CodeGen adds the contract flag to all CallInst instructions within the scope of CGFPOptionsRAII, including calls to LLVM intrinsics like llvm.log.f32.

2. Behavior change from OCML to builtin path:

  • Old path (via __ocml_log_f32): The preprocessed IR shows that the call to the OCML library function carried the contract flag, but the OCML implementation internally dropped it when calling the llvm.log.f32 intrinsic.
; Function Attrs: alwaysinline convergent mustprogress nounwind
define internal noundef float @_ZL4logff(float noundef %__x) #6 {
entry:
  %retval = alloca float, align 4, addrspace(5)
  %__x.addr = alloca float, align 4, addrspace(5)
  %retval.ascast = addrspacecast ptr addrspace(5) %retval to ptr
  %__x.addr.ascast = addrspacecast ptr addrspace(5) %__x.addr to ptr
  store float %__x, ptr %__x.addr.ascast, align 4, !tbaa !23
  %0 = load float, ptr %__x.addr.ascast, align 4, !tbaa !23
  %call = call contract float @__ocml_log_f32(float noundef %0) #23
  ret float %call
}

; Function Attrs: convergent mustprogress nofree norecurse nosync nounwind willreturn memory(none)
define internal noundef float @__ocml_log_f32(float noundef %0) #7 {
  %2 = tail call float @llvm.log.f32(float %0)
  ret float %2
}
  • New path (via __builtin_logf): The call goes directly to llvm.log.f32 intrinsic with the contract flag preserved, causing the backend to apply FMA contraction during polynomial expansion.
; Function Attrs: alwaysinline convergent mustprogress nounwind
define internal noundef float @_ZL4logff(float noundef %__x) #6 {
entry:
  %retval = alloca float, align 4, addrspace(5)
  %__x.addr = alloca float, align 4, addrspace(5)
  %retval.ascast = addrspacecast ptr addrspace(5) %retval to ptr
  %__x.addr.ascast = addrspacecast ptr addrspace(5) %__x.addr to ptr
  store float %__x, ptr %__x.addr.ascast, align 4, !tbaa !24
  %0 = load float, ptr %__x.addr.ascast, align 4, !tbaa !24
  %1 = call contract float @llvm.log.f32(float %0)
  ret float %1
}

3. Why contract breaks log: Our AMDGCN backend implements the natural logarithm by taking the result of the hardware v_log_f32 instruction, multiplying it by ln(2), and applying a rounding-error correction to that multiplication. This produces something like:

r = y * c1;                         // y is the result of the v_log_f32 instruction, c1 = ln(2)
r = r + fma(y, c2, fma(y, c1, -r)); // c2 is another error-correcting constant
  v_log_f32_e32 v1, v1
  s_mov_b32 s2, 0x3f317217
  v_mul_f32_e32 v3, 0x3f317217, v1
  v_fma_f32 v4, v1, s2, -v3
  v_fmac_f32_e32 v4, 0x3377d1cf, v1
  v_add_f32_e32 v3, v3, v4

With the contract flag present, the backend believes it is legal to fuse the final add with the original multiply, eliminating the intermediate rounding. The error-compensation term, which was computed against the rounded product, is now added to the full-precision result of the FMA, so the correction is applied to the wrong value and accuracy degrades. The contracted operations become the following:

r = y * c1;
r = fma(y, c1, fma(y, c2, fma(y, c1, -r)));
  v_log_f32_e32 v1, v1
  s_mov_b32 s2, 0x3f317217
  v_mul_f32_e32 v3, 0x3f317217, v1
  v_fma_f32 v3, v1, s2, -v3
  v_fmac_f32_e32 v3, 0x3377d1cf, v1
  v_fmac_f32_e32 v3, 0x3f317217, v1

Solution and Proposed Fix

Given our lowering of llvm.log, it should be illegal to attach the contract flag to the intrinsic call, because the expansion relies on error-compensating summation. contract on a CallInst indicates that the flag may be propagated into the body of the called function, but that is not true here: as described above, the error-compensating summation does not tolerate contraction.

My proposed fix adds logic to CGBuiltin.cpp that explicitly clears the contract flag on the CallInst for the llvm.log and llvm.log10 intrinsics when compiling HIP for AMDGCN (or SPIR-V).

@github-actions

Thank you for submitting a Pull Request (PR) to the LLVM Project!

This PR will be automatically labeled and the relevant teams will be notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page.

If this is not working for you, it is probably because you do not have write permissions for the repository; in that case, you can instead tag reviewers by name in a comment using @ followed by their GitHub username.

If you have received no comments on your PR for a week, you can request a review by "pinging" the PR with a comment saying "Ping". The common courtesy "ping" rate is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide.

You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.

@adelejjeh adelejjeh marked this pull request as ready for review November 19, 2025 20:44
@llvmbot llvmbot added the clang (Clang issues not falling into any other category) and clang:codegen (IR generation bugs: mangling, exceptions, etc.) labels Nov 19, 2025
@llvmbot
Member

llvmbot commented Nov 19, 2025

@llvm/pr-subscribers-clang

Author: Adel Ejjeh (adelejjeh)

Changes

Full diff: https://github.com/llvm/llvm-project/pull/168770.diff

2 Files Affected:

  • (modified) clang/lib/CodeGen/CGBuiltin.cpp (+25-1)
  • (modified) clang/test/Headers/__clang_hip_math.hip (+19-19)
diff --git a/clang/lib/CodeGen/CGBuiltin.cpp b/clang/lib/CodeGen/CGBuiltin.cpp
index 3079f8ab7229e..3cf9be8e70b57 100644
--- a/clang/lib/CodeGen/CGBuiltin.cpp
+++ b/clang/lib/CodeGen/CGBuiltin.cpp
@@ -582,6 +582,23 @@ static Value *EmitISOVolatileStore(CodeGenFunction &CGF, const CallExpr *E) {
   return Store;
 }
 
+// Check if an intrinsic is a transcendental function that is unsafe to contract.
+static bool isUnsafeToContract(unsigned IntrinsicID, CodeGenFunction &CGF) {
+  switch (IntrinsicID) {
+    // The implementation for log in the AMDGCN backend uses a refinement algorithm
+    // that requires intermediate rounding. The contract flag
+    // would allow FMA formation that recomputes products, breaking the
+    // refinement algorithm.
+  case Intrinsic::log:
+  case Intrinsic::log10:
+    if ((CGF.getTarget().getTriple().isAMDGCN() ||
+         CGF.getTarget().getTriple().isSPIRV()) &&
+         CGF.getLangOpts().HIP)
+      return true;
+  default:
+    return false;
+  }
+}
 // Emit a simple mangled intrinsic that has 1 argument and a return type
 // matching the argument type. Depending on mode, this may be a constrained
 // floating-point intrinsic.
@@ -596,7 +613,14 @@ Value *emitUnaryMaybeConstrainedFPBuiltin(CodeGenFunction &CGF,
     return CGF.Builder.CreateConstrainedFPCall(F, { Src0 });
   } else {
     Function *F = CGF.CGM.getIntrinsic(IntrinsicID, Src0->getType());
-    return CGF.Builder.CreateCall(F, Src0);
+    CallInst *Call = CGF.Builder.CreateCall(F, Src0);
+    
+    // Check if the intrinsic is unsafe to contract
+    if (isUnsafeToContract(IntrinsicID, CGF)) {
+      Call->setHasAllowContract(false);
+    }
+    
+    return Call;  
   }
 }
 
diff --git a/clang/test/Headers/__clang_hip_math.hip b/clang/test/Headers/__clang_hip_math.hip
index 7e2691633c215..aa97fc84f0904 100644
--- a/clang/test/Headers/__clang_hip_math.hip
+++ b/clang/test/Headers/__clang_hip_math.hip
@@ -3673,31 +3673,31 @@ extern "C" __device__ long long int test_llround(double x) {
 // DEFAULT-LABEL: define dso_local noundef float @test_log10f(
 // DEFAULT-SAME: float noundef [[X:%.*]]) local_unnamed_addr #[[ATTR3]] {
 // DEFAULT-NEXT:  [[ENTRY:.*:]]
-// DEFAULT-NEXT:    [[TMP0:%.*]] = tail call contract noundef float @llvm.log10.f32(float [[X]])
+// DEFAULT-NEXT:    [[TMP0:%.*]] = tail call noundef float @llvm.log10.f32(float [[X]])
 // DEFAULT-NEXT:    ret float [[TMP0]]
 //
 // FINITEONLY-LABEL: define dso_local noundef nofpclass(nan inf) float @test_log10f(
 // FINITEONLY-SAME: float noundef nofpclass(nan inf) [[X:%.*]]) local_unnamed_addr #[[ATTR3]] {
 // FINITEONLY-NEXT:  [[ENTRY:.*:]]
-// FINITEONLY-NEXT:    [[TMP0:%.*]] = tail call nnan ninf contract noundef float @llvm.log10.f32(float nofpclass(nan inf) [[X]])
+// FINITEONLY-NEXT:    [[TMP0:%.*]] = tail call nnan ninf noundef float @llvm.log10.f32(float nofpclass(nan inf) [[X]])
 // FINITEONLY-NEXT:    ret float [[TMP0]]
 //
 // APPROX-LABEL: define dso_local noundef float @test_log10f(
 // APPROX-SAME: float noundef [[X:%.*]]) local_unnamed_addr #[[ATTR3]] {
 // APPROX-NEXT:  [[ENTRY:.*:]]
-// APPROX-NEXT:    [[TMP0:%.*]] = tail call contract noundef float @llvm.log10.f32(float [[X]])
+// APPROX-NEXT:    [[TMP0:%.*]] = tail call noundef float @llvm.log10.f32(float [[X]])
 // APPROX-NEXT:    ret float [[TMP0]]
 //
 // NCRDIV-LABEL: define dso_local noundef float @test_log10f(
 // NCRDIV-SAME: float noundef [[X:%.*]]) local_unnamed_addr #[[ATTR3]] {
 // NCRDIV-NEXT:  [[ENTRY:.*:]]
-// NCRDIV-NEXT:    [[TMP0:%.*]] = tail call contract noundef float @llvm.log10.f32(float [[X]])
+// NCRDIV-NEXT:    [[TMP0:%.*]] = tail call noundef float @llvm.log10.f32(float [[X]])
 // NCRDIV-NEXT:    ret float [[TMP0]]
 //
 // AMDGCNSPIRV-LABEL: define spir_func noundef float @test_log10f(
 // AMDGCNSPIRV-SAME: float noundef [[X:%.*]]) local_unnamed_addr addrspace(4) #[[ATTR3]] {
 // AMDGCNSPIRV-NEXT:  [[ENTRY:.*:]]
-// AMDGCNSPIRV-NEXT:    [[TMP0:%.*]] = tail call contract noundef addrspace(4) float @llvm.log10.f32(float [[X]])
+// AMDGCNSPIRV-NEXT:    [[TMP0:%.*]] = tail call noundef addrspace(4) float @llvm.log10.f32(float [[X]])
 // AMDGCNSPIRV-NEXT:    ret float [[TMP0]]
 //
 extern "C" __device__ float test_log10f(float x) {
@@ -3945,25 +3945,25 @@ extern "C" __device__ double test_logb(double x) {
 // DEFAULT-LABEL: define dso_local noundef float @test_logf(
 // DEFAULT-SAME: float noundef [[X:%.*]]) local_unnamed_addr #[[ATTR3]] {
 // DEFAULT-NEXT:  [[ENTRY:.*:]]
-// DEFAULT-NEXT:    [[TMP0:%.*]] = tail call contract noundef float @llvm.log.f32(float [[X]])
+// DEFAULT-NEXT:    [[TMP0:%.*]] = tail call noundef float @llvm.log.f32(float [[X]])
 // DEFAULT-NEXT:    ret float [[TMP0]]
 //
 // FINITEONLY-LABEL: define dso_local noundef nofpclass(nan inf) float @test_logf(
 // FINITEONLY-SAME: float noundef nofpclass(nan inf) [[X:%.*]]) local_unnamed_addr #[[ATTR3]] {
 // FINITEONLY-NEXT:  [[ENTRY:.*:]]
-// FINITEONLY-NEXT:    [[TMP0:%.*]] = tail call nnan ninf contract noundef float @llvm.log.f32(float nofpclass(nan inf) [[X]])
+// FINITEONLY-NEXT:    [[TMP0:%.*]] = tail call nnan ninf noundef float @llvm.log.f32(float nofpclass(nan inf) [[X]])
 // FINITEONLY-NEXT:    ret float [[TMP0]]
 //
 // APPROX-LABEL: define dso_local noundef float @test_logf(
 // APPROX-SAME: float noundef [[X:%.*]]) local_unnamed_addr #[[ATTR3]] {
 // APPROX-NEXT:  [[ENTRY:.*:]]
-// APPROX-NEXT:    [[TMP0:%.*]] = tail call contract noundef float @llvm.log.f32(float [[X]])
+// APPROX-NEXT:    [[TMP0:%.*]] = tail call noundef float @llvm.log.f32(float [[X]])
 // APPROX-NEXT:    ret float [[TMP0]]
 //
 // NCRDIV-LABEL: define dso_local noundef float @test_logf(
 // NCRDIV-SAME: float noundef [[X:%.*]]) local_unnamed_addr #[[ATTR3]] {
 // NCRDIV-NEXT:  [[ENTRY:.*:]]
-// NCRDIV-NEXT:    [[TMP0:%.*]] = tail call contract noundef float @llvm.log.f32(float [[X]])
+// NCRDIV-NEXT:    [[TMP0:%.*]] = tail call noundef float @llvm.log.f32(float [[X]])
 // NCRDIV-NEXT:    ret float [[TMP0]]
 //
 // AMDGCNSPIRV-LABEL: define spir_func noundef float @test_logf(
@@ -8600,31 +8600,31 @@ extern "C" __device__ float test___fsub_rn(float x, float y) {
 // DEFAULT-LABEL: define dso_local noundef float @test___log10f(
 // DEFAULT-SAME: float noundef [[X:%.*]]) local_unnamed_addr #[[ATTR3]] {
 // DEFAULT-NEXT:  [[ENTRY:.*:]]
-// DEFAULT-NEXT:    [[TMP0:%.*]] = tail call contract noundef float @llvm.log10.f32(float [[X]])
+// DEFAULT-NEXT:    [[TMP0:%.*]] = tail call noundef float @llvm.log10.f32(float [[X]])
 // DEFAULT-NEXT:    ret float [[TMP0]]
 //
 // FINITEONLY-LABEL: define dso_local noundef nofpclass(nan inf) float @test___log10f(
 // FINITEONLY-SAME: float noundef nofpclass(nan inf) [[X:%.*]]) local_unnamed_addr #[[ATTR3]] {
 // FINITEONLY-NEXT:  [[ENTRY:.*:]]
-// FINITEONLY-NEXT:    [[TMP0:%.*]] = tail call nnan ninf contract noundef float @llvm.log10.f32(float nofpclass(nan inf) [[X]])
+// FINITEONLY-NEXT:    [[TMP0:%.*]] = tail call nnan ninf noundef float @llvm.log10.f32(float nofpclass(nan inf) [[X]])
 // FINITEONLY-NEXT:    ret float [[TMP0]]
 //
 // APPROX-LABEL: define dso_local noundef float @test___log10f(
 // APPROX-SAME: float noundef [[X:%.*]]) local_unnamed_addr #[[ATTR3]] {
 // APPROX-NEXT:  [[ENTRY:.*:]]
-// APPROX-NEXT:    [[TMP0:%.*]] = tail call contract noundef float @llvm.log10.f32(float [[X]])
+// APPROX-NEXT:    [[TMP0:%.*]] = tail call noundef float @llvm.log10.f32(float [[X]])
 // APPROX-NEXT:    ret float [[TMP0]]
 //
 // NCRDIV-LABEL: define dso_local noundef float @test___log10f(
 // NCRDIV-SAME: float noundef [[X:%.*]]) local_unnamed_addr #[[ATTR3]] {
 // NCRDIV-NEXT:  [[ENTRY:.*:]]
-// NCRDIV-NEXT:    [[TMP0:%.*]] = tail call contract noundef float @llvm.log10.f32(float [[X]])
+// NCRDIV-NEXT:    [[TMP0:%.*]] = tail call noundef float @llvm.log10.f32(float [[X]])
 // NCRDIV-NEXT:    ret float [[TMP0]]
 //
 // AMDGCNSPIRV-LABEL: define spir_func noundef float @test___log10f(
 // AMDGCNSPIRV-SAME: float noundef [[X:%.*]]) local_unnamed_addr addrspace(4) #[[ATTR3]] {
 // AMDGCNSPIRV-NEXT:  [[ENTRY:.*:]]
-// AMDGCNSPIRV-NEXT:    [[TMP0:%.*]] = tail call contract noundef addrspace(4) float @llvm.log10.f32(float [[X]])
+// AMDGCNSPIRV-NEXT:    [[TMP0:%.*]] = tail call noundef addrspace(4) float @llvm.log10.f32(float [[X]])
 // AMDGCNSPIRV-NEXT:    ret float [[TMP0]]
 //
 extern "C" __device__ float test___log10f(float x) {
@@ -8668,31 +8668,31 @@ extern "C" __device__ float test___log2f(float x) {
 // DEFAULT-LABEL: define dso_local noundef float @test___logf(
 // DEFAULT-SAME: float noundef [[X:%.*]]) local_unnamed_addr #[[ATTR3]] {
 // DEFAULT-NEXT:  [[ENTRY:.*:]]
-// DEFAULT-NEXT:    [[TMP0:%.*]] = tail call contract noundef float @llvm.log.f32(float [[X]])
+// DEFAULT-NEXT:    [[TMP0:%.*]] = tail call noundef float @llvm.log.f32(float [[X]])
 // DEFAULT-NEXT:    ret float [[TMP0]]
 //
 // FINITEONLY-LABEL: define dso_local noundef nofpclass(nan inf) float @test___logf(
 // FINITEONLY-SAME: float noundef nofpclass(nan inf) [[X:%.*]]) local_unnamed_addr #[[ATTR3]] {
 // FINITEONLY-NEXT:  [[ENTRY:.*:]]
-// FINITEONLY-NEXT:    [[TMP0:%.*]] = tail call nnan ninf contract noundef float @llvm.log.f32(float nofpclass(nan inf) [[X]])
+// FINITEONLY-NEXT:    [[TMP0:%.*]] = tail call nnan ninf noundef float @llvm.log.f32(float nofpclass(nan inf) [[X]])
 // FINITEONLY-NEXT:    ret float [[TMP0]]
 //
 // APPROX-LABEL: define dso_local noundef float @test___logf(
 // APPROX-SAME: float noundef [[X:%.*]]) local_unnamed_addr #[[ATTR3]] {
 // APPROX-NEXT:  [[ENTRY:.*:]]
-// APPROX-NEXT:    [[TMP0:%.*]] = tail call contract noundef float @llvm.log.f32(float [[X]])
+// APPROX-NEXT:    [[TMP0:%.*]] = tail call noundef float @llvm.log.f32(float [[X]])
 // APPROX-NEXT:    ret float [[TMP0]]
 //
 // NCRDIV-LABEL: define dso_local noundef float @test___logf(
 // NCRDIV-SAME: float noundef [[X:%.*]]) local_unnamed_addr #[[ATTR3]] {
 // NCRDIV-NEXT:  [[ENTRY:.*:]]
-// NCRDIV-NEXT:    [[TMP0:%.*]] = tail call contract noundef float @llvm.log.f32(float [[X]])
+// NCRDIV-NEXT:    [[TMP0:%.*]] = tail call noundef float @llvm.log.f32(float [[X]])
 // NCRDIV-NEXT:    ret float [[TMP0]]
 //
 // AMDGCNSPIRV-LABEL: define spir_func noundef float @test___logf(
 // AMDGCNSPIRV-SAME: float noundef [[X:%.*]]) local_unnamed_addr addrspace(4) #[[ATTR3]] {
 // AMDGCNSPIRV-NEXT:  [[ENTRY:.*:]]
-// AMDGCNSPIRV-NEXT:    [[TMP0:%.*]] = tail call contract noundef addrspace(4) float @llvm.log.f32(float [[X]])
+// AMDGCNSPIRV-NEXT:    [[TMP0:%.*]] = tail call noundef addrspace(4) float @llvm.log.f32(float [[X]])
 // AMDGCNSPIRV-NEXT:    ret float [[TMP0]]
 //
 extern "C" __device__ float test___logf(float x) {

@llvmbot
Member

llvmbot commented Nov 19, 2025

@llvm/pr-subscribers-clang-codegen

Author: Adel Ejjeh (adelejjeh)

 // DEFAULT-NEXT:    ret float [[TMP0]]
 //
 // FINITEONLY-LABEL: define dso_local noundef nofpclass(nan inf) float @test___log10f(
 // FINITEONLY-SAME: float noundef nofpclass(nan inf) [[X:%.*]]) local_unnamed_addr #[[ATTR3]] {
 // FINITEONLY-NEXT:  [[ENTRY:.*:]]
-// FINITEONLY-NEXT:    [[TMP0:%.*]] = tail call nnan ninf contract noundef float @llvm.log10.f32(float nofpclass(nan inf) [[X]])
+// FINITEONLY-NEXT:    [[TMP0:%.*]] = tail call nnan ninf noundef float @llvm.log10.f32(float nofpclass(nan inf) [[X]])
 // FINITEONLY-NEXT:    ret float [[TMP0]]
 //
 // APPROX-LABEL: define dso_local noundef float @test___log10f(
 // APPROX-SAME: float noundef [[X:%.*]]) local_unnamed_addr #[[ATTR3]] {
 // APPROX-NEXT:  [[ENTRY:.*:]]
-// APPROX-NEXT:    [[TMP0:%.*]] = tail call contract noundef float @llvm.log10.f32(float [[X]])
+// APPROX-NEXT:    [[TMP0:%.*]] = tail call noundef float @llvm.log10.f32(float [[X]])
 // APPROX-NEXT:    ret float [[TMP0]]
 //
 // NCRDIV-LABEL: define dso_local noundef float @test___log10f(
 // NCRDIV-SAME: float noundef [[X:%.*]]) local_unnamed_addr #[[ATTR3]] {
 // NCRDIV-NEXT:  [[ENTRY:.*:]]
-// NCRDIV-NEXT:    [[TMP0:%.*]] = tail call contract noundef float @llvm.log10.f32(float [[X]])
+// NCRDIV-NEXT:    [[TMP0:%.*]] = tail call noundef float @llvm.log10.f32(float [[X]])
 // NCRDIV-NEXT:    ret float [[TMP0]]
 //
 // AMDGCNSPIRV-LABEL: define spir_func noundef float @test___log10f(
 // AMDGCNSPIRV-SAME: float noundef [[X:%.*]]) local_unnamed_addr addrspace(4) #[[ATTR3]] {
 // AMDGCNSPIRV-NEXT:  [[ENTRY:.*:]]
-// AMDGCNSPIRV-NEXT:    [[TMP0:%.*]] = tail call contract noundef addrspace(4) float @llvm.log10.f32(float [[X]])
+// AMDGCNSPIRV-NEXT:    [[TMP0:%.*]] = tail call noundef addrspace(4) float @llvm.log10.f32(float [[X]])
 // AMDGCNSPIRV-NEXT:    ret float [[TMP0]]
 //
 extern "C" __device__ float test___log10f(float x) {
@@ -8668,31 +8668,31 @@ extern "C" __device__ float test___log2f(float x) {
 // DEFAULT-LABEL: define dso_local noundef float @test___logf(
 // DEFAULT-SAME: float noundef [[X:%.*]]) local_unnamed_addr #[[ATTR3]] {
 // DEFAULT-NEXT:  [[ENTRY:.*:]]
-// DEFAULT-NEXT:    [[TMP0:%.*]] = tail call contract noundef float @llvm.log.f32(float [[X]])
+// DEFAULT-NEXT:    [[TMP0:%.*]] = tail call noundef float @llvm.log.f32(float [[X]])
 // DEFAULT-NEXT:    ret float [[TMP0]]
 //
 // FINITEONLY-LABEL: define dso_local noundef nofpclass(nan inf) float @test___logf(
 // FINITEONLY-SAME: float noundef nofpclass(nan inf) [[X:%.*]]) local_unnamed_addr #[[ATTR3]] {
 // FINITEONLY-NEXT:  [[ENTRY:.*:]]
-// FINITEONLY-NEXT:    [[TMP0:%.*]] = tail call nnan ninf contract noundef float @llvm.log.f32(float nofpclass(nan inf) [[X]])
+// FINITEONLY-NEXT:    [[TMP0:%.*]] = tail call nnan ninf noundef float @llvm.log.f32(float nofpclass(nan inf) [[X]])
 // FINITEONLY-NEXT:    ret float [[TMP0]]
 //
 // APPROX-LABEL: define dso_local noundef float @test___logf(
 // APPROX-SAME: float noundef [[X:%.*]]) local_unnamed_addr #[[ATTR3]] {
 // APPROX-NEXT:  [[ENTRY:.*:]]
-// APPROX-NEXT:    [[TMP0:%.*]] = tail call contract noundef float @llvm.log.f32(float [[X]])
+// APPROX-NEXT:    [[TMP0:%.*]] = tail call noundef float @llvm.log.f32(float [[X]])
 // APPROX-NEXT:    ret float [[TMP0]]
 //
 // NCRDIV-LABEL: define dso_local noundef float @test___logf(
 // NCRDIV-SAME: float noundef [[X:%.*]]) local_unnamed_addr #[[ATTR3]] {
 // NCRDIV-NEXT:  [[ENTRY:.*:]]
-// NCRDIV-NEXT:    [[TMP0:%.*]] = tail call contract noundef float @llvm.log.f32(float [[X]])
+// NCRDIV-NEXT:    [[TMP0:%.*]] = tail call noundef float @llvm.log.f32(float [[X]])
 // NCRDIV-NEXT:    ret float [[TMP0]]
 //
 // AMDGCNSPIRV-LABEL: define spir_func noundef float @test___logf(
 // AMDGCNSPIRV-SAME: float noundef [[X:%.*]]) local_unnamed_addr addrspace(4) #[[ATTR3]] {
 // AMDGCNSPIRV-NEXT:  [[ENTRY:.*:]]
-// AMDGCNSPIRV-NEXT:    [[TMP0:%.*]] = tail call contract noundef addrspace(4) float @llvm.log.f32(float [[X]])
+// AMDGCNSPIRV-NEXT:    [[TMP0:%.*]] = tail call noundef addrspace(4) float @llvm.log.f32(float [[X]])
 // AMDGCNSPIRV-NEXT:    ret float [[TMP0]]
 //
 extern "C" __device__ float test___logf(float x) {

@github-actions

github-actions bot commented Nov 19, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

@efriedma-quic
Collaborator

I think I'd prefer to make llvm.log expansion ignore the "contract" flag, as opposed to suppressing the marking in the frontend. From your description, it sounds like the issue is specific to the exact way the backend is choosing to expand llvm.log.

@carlobertolli carlobertolli self-requested a review November 19, 2025 21:03
Member

@carlobertolli carlobertolli left a comment

LGTM, let's have others review as well.

@github-actions

github-actions bot commented Nov 19, 2025

🐧 Linux x64 Test Results

  • 111347 tests passed
  • 4420 tests skipped

@adelejjeh
Contributor Author

Force-pushed to enable the clang-format pre-commit hook

@adelejjeh
Contributor Author

I think I'd prefer to make llvm.log expansion ignore the "contract" flag, as opposed to suppressing the marking in the frontend. From your description, it sounds like the issue is specific to the exact way the backend is choosing to expand llvm.log.

@efriedma-quic As I understand it though, the presence of the `contract` flag on a `CallInst` indicates that it is legal to contract in the body of the called function. Since we know the AMDGCN backend implementation of `llvm.log` doesn't allow contraction, isn't it semantically illegal to have the flag on the intrinsic?

@efriedma-quic
Collaborator

As I understand it though, the presence of the contract flag on a callinst indicates that it is legal to contract in the body of the called function.

I don't think target-independent optimizations treat it that way. Like, the inliner doesn't propagate contract markings into the callee. The code that's propagating the contract marking for llvm.log is specifically in the amdgpu backend.

since we know the AMDGCN back-end implementation for llvm.log doesn't allow contraction, isn't it semantically illegal to have the flag on the intrinsic?

Having correlated behavior in different places introduces an ongoing maintenance burden: with this patch, every frontend needs to know that it can't attach `contract` to `llvm.log` specifically on amdgpu. If you fix the backend, every frontend gets the fix, and the check sits next to the code that actually does the expansion.

@adelejjeh
Contributor Author

Thanks for the feedback @efriedma-quic. I have created a new PR with the backend change (#168916) and will close this one.

@adelejjeh adelejjeh closed this Nov 20, 2025
arsenm pushed a commit that referenced this pull request Dec 4, 2025
…168916)

## Problem Summary

PyTorch's `test_warp_softmax_64bit_indexing` is failing with a numerical
precision error: `log(1.1422761679)` is computed with 54% higher error
than expected (9.042e-09 vs 5.859e-09), causing gradient computations to
exceed tolerance thresholds. This precision degradation was reproducible
across all AMD GPU architectures (gfx1100, gfx1200, gfx90a, gfx950). I
tracked down the problem to the commit **4703f8b6610a** (March 6, 2025)
which changed HIP math headers to call `__builtin_logf()` directly
instead of `__ocml_log_f32()`:

```diff
- float logf(float __x) { return __FAST_OR_SLOW(__logf, __ocml_log_f32)(__x); }
+ float logf(float __x) { return __FAST_OR_SLOW(__logf, __builtin_logf)(__x); }
```

This change exposed a problem in the AMDGCN backend, as described below:

## Key Findings

**1. Contract flag propagation:** When `-ffp-contract=fast` is enabled
(default for HIP), Clang's CodeGen adds the `contract` flag to all
`CallInst` instructions within the scope of `CGFPOptionsRAII`, including
calls to LLVM intrinsics like `llvm.log.f32`.

**2. Behavior change from OCML to builtin path:**
- **Old path** (via `__ocml_log_f32`): The preprocessed IR showed the
call to the OCML library function had the contract flag, but the OCML
implementation internally dropped the contract flag when calling the
`llvm.log.f32` intrinsic.
```llvm
; Function Attrs: alwaysinline convergent mustprogress nounwind
define internal noundef float @_ZL4logff(float noundef %__x) #6 {
entry:
  %retval = alloca float, align 4, addrspace(5)
  %__x.addr = alloca float, align 4, addrspace(5)
  %retval.ascast = addrspacecast ptr addrspace(5) %retval to ptr
  %__x.addr.ascast = addrspacecast ptr addrspace(5) %__x.addr to ptr
  store float %__x, ptr %__x.addr.ascast, align 4, !tbaa !23
  %0 = load float, ptr %__x.addr.ascast, align 4, !tbaa !23
  %call = call contract float @__ocml_log_f32(float noundef %0) #23
  ret float %call
}

; Function Attrs: convergent mustprogress nofree norecurse nosync nounwind willreturn memory(none)
define internal noundef float @__ocml_log_f32(float noundef %0) #7 {
  %2 = tail call float @llvm.log.f32(float %0)
  ret float %2
}
```
- **New path** (via `__builtin_logf`): The call goes directly to the
`llvm.log.f32` intrinsic with the contract flag preserved, causing the
backend to apply FMA contraction during the intrinsic's expansion.
```llvm
; Function Attrs: alwaysinline convergent mustprogress nounwind
define internal noundef float @_ZL4logff(float noundef %__x) #6 {
entry:
  %retval = alloca float, align 4, addrspace(5)
  %__x.addr = alloca float, align 4, addrspace(5)
  %retval.ascast = addrspacecast ptr addrspace(5) %retval to ptr
  %__x.addr.ascast = addrspacecast ptr addrspace(5) %__x.addr to ptr
  store float %__x, ptr %__x.addr.ascast, align 4, !tbaa !24
  %0 = load float, ptr %__x.addr.ascast, align 4, !tbaa !24
  %1 = call contract float @llvm.log.f32(float %0)
  ret float %1
}
```
**3. Why contract breaks log:** Our AMDGCN target backend implements
the natural logarithm by taking the result of the hardware log2
instruction (`v_log_f32`), multiplying it by `ln(2)`, and applying a
rounding-error correction to that multiplication. This results in
something like:
```c
r = y * c1;                         // y is the result of v_log_f32, c1 = ln(2)
r = r + fma(y, c2, fma(y, c1, -r)); // c2 is another error-correcting constant
```
```asm
  v_log_f32_e32 v1, v1
  s_mov_b32 s2, 0x3f317217
  v_mul_f32_e32 v3, 0x3f317217, v1
  v_fma_f32 v4, v1, s2, -v3
  v_fmac_f32_e32 v4, 0x3377d1cf, v1
  v_add_f32_e32 v3, v3, v4
```
With the `contract` flag present, the backend fuses the final add (`r +
Z`, where `Z` is the compensation term) with the multiply, believing
this to be legal, thus eliminating the intermediate rounding of the
product. The error-compensation term, which was computed from the
rounded product, is now added to the full-precision result of the FMA,
yielding incorrect error correction and degraded accuracy. The
contracted sequence becomes the following:
```c
r = y * c1;
r = fma(y, c1, fma(y, c2, fma(y, c1, -r)));
```
```asm
  v_log_f32_e32 v1, v1
  s_mov_b32 s2, 0x3f317217
  v_mul_f32_e32 v3, 0x3f317217, v1
  v_fma_f32 v3, v1, s2, -v3
  v_fmac_f32_e32 v3, 0x3377d1cf, v1
  v_fmac_f32_e32 v3, 0x3f317217, v1
```
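The numerical effect of this fusion can be sketched in a few lines of
Python: a compensated product splits `y * c1` into a rounded part `r`
and an exact residual `e`, and fusing the final add re-creates the
full-precision product, so `e` no longer cancels the rounding it was
meant to correct. The values below (`y = 3`, `c1 = float32(1/3)`) are
illustrative stand-ins chosen so the rounding is visible, not the real
`ln(2)` split constants:

```python
import struct

def f32(x: float) -> float:
    """Round a Python float (double) to the nearest float32 value."""
    return struct.unpack('f', struct.pack('f', x))[0]

y  = f32(3.0)
c1 = f32(1.0 / 3.0)          # float32(1/3) = 0.3333333432674408

r     = f32(y * c1)          # rounded float32 product -> 1.0
exact = y * c1               # exact in double: 24-bit x 24-bit fits in 53 bits
e     = exact - r            # rounding error that fma(y, c1, -r) recovers; 2**-25

# Compensated sequence (what the backend intends): r + e == the exact product
compensated = r + e
assert compensated == exact

# Under 'contract' the final add fuses into fma(y, c1, e): the intermediate
# rounding of r vanishes, so the correction e is effectively applied twice.
fused = exact + e            # what fma(y, c1, e) computes (single rounding)
assert fused != compensated  # the compensated sum's accuracy is lost
```

With the real `ln(2)` constants the discrepancy is at the last-ulp level,
which is exactly the 54% error growth the failing test observed.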

## Solution and Proposed Fix

Because our expansion of `llvm.log` and `llvm.log10` relies on
error-compensated summation, it is illegal for the backend to honor the
`contract` flag when it is present on the intrinsic call. My proposed
fix modifies the instruction-selection passes (both GlobalISel and
SelectionDAG) to drop the `contract` flag when lowering `llvm.log`, so
that instruction selection's contraction optimization will not fuse the
multiply and add.

Note: I had originally implemented this fix in the frontend by removing
the `contract` flag when lowering the `llvm.log` builtin (PR #168770). I
have since closed that PR.

Labels

clang:codegen IR generation bugs: mangling, exceptions, etc. clang Clang issues not falling into any other category
