Conversation

@adelejjeh
Contributor

@adelejjeh adelejjeh commented Nov 19, 2025

Problem Summary

PyTorch's test_warp_softmax_64bit_indexing began failing after the latest mainline promotion. The failure manifested as a numerical precision error: log(1.1422761679) computed with 54% higher error than expected (9.042e-09 vs 5.859e-09), causing gradient computations to exceed tolerance thresholds. The precision degradation was reproducible across all AMD GPU architectures tested (gfx1100, gfx1200, gfx90a, gfx950).

I tracked down the problem to the upstream commit 4703f8b (March 6, 2025) titled "clang/HIP: Use generic builtins for f32 exp and log (#129638)". This commit changed HIP math headers to call __builtin_logf() directly instead of __ocml_log_f32():

- float logf(float __x) { return __FAST_OR_SLOW(__logf, __ocml_log_f32)(__x); }
+ float logf(float __x) { return __FAST_OR_SLOW(__logf, __builtin_logf)(__x); }

This change exposed a bug in how Clang handles the contract fast-math flag on log intrinsics for the AMDGCN target.

Key Findings

1. Contract flag propagation: When -ffp-contract=fast is enabled (default for HIP), Clang's CodeGen adds the contract flag to all CallInst instructions within the scope of CGFPOptionsRAII, including calls to LLVM intrinsics like llvm.log.f32.

2. Behavior change from OCML to builtin path:

  • Old path (via __ocml_log_f32): The preprocessed IR shows that the call to the OCML library function carried the contract flag, but the OCML implementation internally dropped it when calling the llvm.log.f32 intrinsic.
; Function Attrs: alwaysinline convergent mustprogress nounwind
define internal noundef float @_ZL4logff(float noundef %__x) #6 {
entry:
  %retval = alloca float, align 4, addrspace(5)
  %__x.addr = alloca float, align 4, addrspace(5)
  %retval.ascast = addrspacecast ptr addrspace(5) %retval to ptr
  %__x.addr.ascast = addrspacecast ptr addrspace(5) %__x.addr to ptr
  store float %__x, ptr %__x.addr.ascast, align 4, !tbaa !23
  %0 = load float, ptr %__x.addr.ascast, align 4, !tbaa !23
  %call = call contract float @__ocml_log_f32(float noundef %0) #23
  ret float %call
}

; Function Attrs: convergent mustprogress nofree norecurse nosync nounwind willreturn memory(none)
define internal noundef float @__ocml_log_f32(float noundef %0) #7 {
  %2 = tail call float @llvm.log.f32(float %0)
  ret float %2
}
  • New path (via __builtin_logf): The call goes directly to llvm.log.f32 intrinsic with the contract flag preserved, causing the backend to apply FMA contraction during polynomial expansion.
; Function Attrs: alwaysinline convergent mustprogress nounwind
define internal noundef float @_ZL4logff(float noundef %__x) #6 {
entry:
  %retval = alloca float, align 4, addrspace(5)
  %__x.addr = alloca float, align 4, addrspace(5)
  %retval.ascast = addrspacecast ptr addrspace(5) %retval to ptr
  %__x.addr.ascast = addrspacecast ptr addrspace(5) %__x.addr to ptr
  store float %__x, ptr %__x.addr.ascast, align 4, !tbaa !24
  %0 = load float, ptr %__x.addr.ascast, align 4, !tbaa !24
  %1 = call contract float @llvm.log.f32(float %0)
  ret float %1
}

3. Why contract breaks log: Our AMDGCN backend implements the natural logarithm by taking the result of the hardware v_log_f32 instruction, multiplying it by ln(2), and applying a rounding-error correction to that multiplication. This produces something like:

r = y * c1;                         // y is the result of the v_log_f32 instruction, c1 = ln(2)
r = r + fma(y, c2, fma(y, c1, -r)); // c2 is another error-correcting constant
  v_log_f32_e32 v1, v1
  s_mov_b32 s2, 0x3f317217
  v_mul_f32_e32 v3, 0x3f317217, v1
  v_fma_f32 v4, v1, s2, -v3
  v_fmac_f32_e32 v4, 0x3377d1cf, v1
  v_add_f32_e32 v3, v3, v4

With the contract flag present, the backend believes it is legal to fuse the final add with the original multiply, eliminating the intermediate rounding. The error-compensation term, which was computed against the rounded product, is now added to the full-precision result of the FMA, so the correction is applied to the wrong value and accuracy degrades. The contracted operations become the following:

r = y * c1;
r = fma(y, c1, fma(y, c2, fma(y, c1, -r)));
  v_log_f32_e32 v1, v1
  s_mov_b32 s2, 0x3f317217
  v_mul_f32_e32 v3, 0x3f317217, v1
  v_fma_f32 v3, v1, s2, -v3
  v_fmac_f32_e32 v3, 0x3377d1cf, v1
  v_fmac_f32_e32 v3, 0x3f317217, v1

Solution and Proposed Fix

Given our lowering of llvm.log, it should be illegal to attach the contract flag to the intrinsic call, because the expansion relies on error-compensating summation. contract on a CallInst indicates that the flag may be propagated into the body of the called function, but that is not true here: as described above, the error-compensating summation does not tolerate contraction.

My proposed fix adds logic to CGBuiltin.cpp that explicitly clears the contract flag on the CallInst for the llvm.log and llvm.log10 intrinsics when compiling HIP for AMDGCN (or SPIR-V).

@github-actions

Thank you for submitting a Pull Request (PR) to the LLVM Project!

This PR will be automatically labeled and the relevant teams will be notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page.

If this is not working for you, it is probably because you do not have write permissions for the repository; in that case, you can instead tag reviewers by name in a comment using @ followed by their GitHub username.

If you have received no comments on your PR for a week, you can request a review by "pinging" the PR with a comment saying "Ping". The common courtesy "ping" rate is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide.

You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.

@adelejjeh adelejjeh marked this pull request as ready for review November 19, 2025 20:44
@llvmbot llvmbot added the clang (Clang issues not falling into any other category) and clang:codegen (IR generation bugs: mangling, exceptions, etc.) labels Nov 19, 2025
@llvmbot
Member

llvmbot commented Nov 19, 2025

@llvm/pr-subscribers-clang

Author: Adel Ejjeh (adelejjeh)

Changes

Full diff: https://github.com/llvm/llvm-project/pull/168770.diff

2 Files Affected:

  • (modified) clang/lib/CodeGen/CGBuiltin.cpp (+25-1)
  • (modified) clang/test/Headers/__clang_hip_math.hip (+19-19)
diff --git a/clang/lib/CodeGen/CGBuiltin.cpp b/clang/lib/CodeGen/CGBuiltin.cpp
index 3079f8ab7229e..3cf9be8e70b57 100644
--- a/clang/lib/CodeGen/CGBuiltin.cpp
+++ b/clang/lib/CodeGen/CGBuiltin.cpp
@@ -582,6 +582,23 @@ static Value *EmitISOVolatileStore(CodeGenFunction &CGF, const CallExpr *E) {
   return Store;
 }
 
+// Check if an intrinsic is a transcendental function that is unsafe to contract.
+static bool isUnsafeToContract(unsigned IntrinsicID, CodeGenFunction &CGF) {
+  switch (IntrinsicID) {
+    // The implementation for log in the AMDGCN backend uses a refinement algorithm
+    // that requires intermediate rounding. The contract flag
+    // would allow FMA formation that recomputes products, breaking the
+    // refinement algorithm.
+  case Intrinsic::log:
+  case Intrinsic::log10:
+    if ((CGF.getTarget().getTriple().isAMDGCN() ||
+         CGF.getTarget().getTriple().isSPIRV()) &&
+         CGF.getLangOpts().HIP)
+      return true;
+  default:
+    return false;
+  }
+}
 // Emit a simple mangled intrinsic that has 1 argument and a return type
 // matching the argument type. Depending on mode, this may be a constrained
 // floating-point intrinsic.
@@ -596,7 +613,14 @@ Value *emitUnaryMaybeConstrainedFPBuiltin(CodeGenFunction &CGF,
     return CGF.Builder.CreateConstrainedFPCall(F, { Src0 });
   } else {
     Function *F = CGF.CGM.getIntrinsic(IntrinsicID, Src0->getType());
-    return CGF.Builder.CreateCall(F, Src0);
+    CallInst *Call = CGF.Builder.CreateCall(F, Src0);
+    
+    // Check if the intrinsic is unsafe to contract
+    if (isUnsafeToContract(IntrinsicID, CGF)) {
+      Call->setHasAllowContract(false);
+    }
+    
+    return Call;  
   }
 }
 
diff --git a/clang/test/Headers/__clang_hip_math.hip b/clang/test/Headers/__clang_hip_math.hip
index 7e2691633c215..aa97fc84f0904 100644
--- a/clang/test/Headers/__clang_hip_math.hip
+++ b/clang/test/Headers/__clang_hip_math.hip
@@ -3673,31 +3673,31 @@ extern "C" __device__ long long int test_llround(double x) {
 // DEFAULT-LABEL: define dso_local noundef float @test_log10f(
 // DEFAULT-SAME: float noundef [[X:%.*]]) local_unnamed_addr #[[ATTR3]] {
 // DEFAULT-NEXT:  [[ENTRY:.*:]]
-// DEFAULT-NEXT:    [[TMP0:%.*]] = tail call contract noundef float @llvm.log10.f32(float [[X]])
+// DEFAULT-NEXT:    [[TMP0:%.*]] = tail call noundef float @llvm.log10.f32(float [[X]])
 // DEFAULT-NEXT:    ret float [[TMP0]]
 //
 // FINITEONLY-LABEL: define dso_local noundef nofpclass(nan inf) float @test_log10f(
 // FINITEONLY-SAME: float noundef nofpclass(nan inf) [[X:%.*]]) local_unnamed_addr #[[ATTR3]] {
 // FINITEONLY-NEXT:  [[ENTRY:.*:]]
-// FINITEONLY-NEXT:    [[TMP0:%.*]] = tail call nnan ninf contract noundef float @llvm.log10.f32(float nofpclass(nan inf) [[X]])
+// FINITEONLY-NEXT:    [[TMP0:%.*]] = tail call nnan ninf noundef float @llvm.log10.f32(float nofpclass(nan inf) [[X]])
 // FINITEONLY-NEXT:    ret float [[TMP0]]
 //
 // APPROX-LABEL: define dso_local noundef float @test_log10f(
 // APPROX-SAME: float noundef [[X:%.*]]) local_unnamed_addr #[[ATTR3]] {
 // APPROX-NEXT:  [[ENTRY:.*:]]
-// APPROX-NEXT:    [[TMP0:%.*]] = tail call contract noundef float @llvm.log10.f32(float [[X]])
+// APPROX-NEXT:    [[TMP0:%.*]] = tail call noundef float @llvm.log10.f32(float [[X]])
 // APPROX-NEXT:    ret float [[TMP0]]
 //
 // NCRDIV-LABEL: define dso_local noundef float @test_log10f(
 // NCRDIV-SAME: float noundef [[X:%.*]]) local_unnamed_addr #[[ATTR3]] {
 // NCRDIV-NEXT:  [[ENTRY:.*:]]
-// NCRDIV-NEXT:    [[TMP0:%.*]] = tail call contract noundef float @llvm.log10.f32(float [[X]])
+// NCRDIV-NEXT:    [[TMP0:%.*]] = tail call noundef float @llvm.log10.f32(float [[X]])
 // NCRDIV-NEXT:    ret float [[TMP0]]
 //
 // AMDGCNSPIRV-LABEL: define spir_func noundef float @test_log10f(
 // AMDGCNSPIRV-SAME: float noundef [[X:%.*]]) local_unnamed_addr addrspace(4) #[[ATTR3]] {
 // AMDGCNSPIRV-NEXT:  [[ENTRY:.*:]]
-// AMDGCNSPIRV-NEXT:    [[TMP0:%.*]] = tail call contract noundef addrspace(4) float @llvm.log10.f32(float [[X]])
+// AMDGCNSPIRV-NEXT:    [[TMP0:%.*]] = tail call noundef addrspace(4) float @llvm.log10.f32(float [[X]])
 // AMDGCNSPIRV-NEXT:    ret float [[TMP0]]
 //
 extern "C" __device__ float test_log10f(float x) {
@@ -3945,25 +3945,25 @@ extern "C" __device__ double test_logb(double x) {
 // DEFAULT-LABEL: define dso_local noundef float @test_logf(
 // DEFAULT-SAME: float noundef [[X:%.*]]) local_unnamed_addr #[[ATTR3]] {
 // DEFAULT-NEXT:  [[ENTRY:.*:]]
-// DEFAULT-NEXT:    [[TMP0:%.*]] = tail call contract noundef float @llvm.log.f32(float [[X]])
+// DEFAULT-NEXT:    [[TMP0:%.*]] = tail call noundef float @llvm.log.f32(float [[X]])
 // DEFAULT-NEXT:    ret float [[TMP0]]
 //
 // FINITEONLY-LABEL: define dso_local noundef nofpclass(nan inf) float @test_logf(
 // FINITEONLY-SAME: float noundef nofpclass(nan inf) [[X:%.*]]) local_unnamed_addr #[[ATTR3]] {
 // FINITEONLY-NEXT:  [[ENTRY:.*:]]
-// FINITEONLY-NEXT:    [[TMP0:%.*]] = tail call nnan ninf contract noundef float @llvm.log.f32(float nofpclass(nan inf) [[X]])
+// FINITEONLY-NEXT:    [[TMP0:%.*]] = tail call nnan ninf noundef float @llvm.log.f32(float nofpclass(nan inf) [[X]])
 // FINITEONLY-NEXT:    ret float [[TMP0]]
 //
 // APPROX-LABEL: define dso_local noundef float @test_logf(
 // APPROX-SAME: float noundef [[X:%.*]]) local_unnamed_addr #[[ATTR3]] {
 // APPROX-NEXT:  [[ENTRY:.*:]]
-// APPROX-NEXT:    [[TMP0:%.*]] = tail call contract noundef float @llvm.log.f32(float [[X]])
+// APPROX-NEXT:    [[TMP0:%.*]] = tail call noundef float @llvm.log.f32(float [[X]])
 // APPROX-NEXT:    ret float [[TMP0]]
 //
 // NCRDIV-LABEL: define dso_local noundef float @test_logf(
 // NCRDIV-SAME: float noundef [[X:%.*]]) local_unnamed_addr #[[ATTR3]] {
 // NCRDIV-NEXT:  [[ENTRY:.*:]]
-// NCRDIV-NEXT:    [[TMP0:%.*]] = tail call contract noundef float @llvm.log.f32(float [[X]])
+// NCRDIV-NEXT:    [[TMP0:%.*]] = tail call noundef float @llvm.log.f32(float [[X]])
 // NCRDIV-NEXT:    ret float [[TMP0]]
 //
 // AMDGCNSPIRV-LABEL: define spir_func noundef float @test_logf(
@@ -8600,31 +8600,31 @@ extern "C" __device__ float test___fsub_rn(float x, float y) {
 // DEFAULT-LABEL: define dso_local noundef float @test___log10f(
 // DEFAULT-SAME: float noundef [[X:%.*]]) local_unnamed_addr #[[ATTR3]] {
 // DEFAULT-NEXT:  [[ENTRY:.*:]]
-// DEFAULT-NEXT:    [[TMP0:%.*]] = tail call contract noundef float @llvm.log10.f32(float [[X]])
+// DEFAULT-NEXT:    [[TMP0:%.*]] = tail call noundef float @llvm.log10.f32(float [[X]])
 // DEFAULT-NEXT:    ret float [[TMP0]]
 //
 // FINITEONLY-LABEL: define dso_local noundef nofpclass(nan inf) float @test___log10f(
 // FINITEONLY-SAME: float noundef nofpclass(nan inf) [[X:%.*]]) local_unnamed_addr #[[ATTR3]] {
 // FINITEONLY-NEXT:  [[ENTRY:.*:]]
-// FINITEONLY-NEXT:    [[TMP0:%.*]] = tail call nnan ninf contract noundef float @llvm.log10.f32(float nofpclass(nan inf) [[X]])
+// FINITEONLY-NEXT:    [[TMP0:%.*]] = tail call nnan ninf noundef float @llvm.log10.f32(float nofpclass(nan inf) [[X]])
 // FINITEONLY-NEXT:    ret float [[TMP0]]
 //
 // APPROX-LABEL: define dso_local noundef float @test___log10f(
 // APPROX-SAME: float noundef [[X:%.*]]) local_unnamed_addr #[[ATTR3]] {
 // APPROX-NEXT:  [[ENTRY:.*:]]
-// APPROX-NEXT:    [[TMP0:%.*]] = tail call contract noundef float @llvm.log10.f32(float [[X]])
+// APPROX-NEXT:    [[TMP0:%.*]] = tail call noundef float @llvm.log10.f32(float [[X]])
 // APPROX-NEXT:    ret float [[TMP0]]
 //
 // NCRDIV-LABEL: define dso_local noundef float @test___log10f(
 // NCRDIV-SAME: float noundef [[X:%.*]]) local_unnamed_addr #[[ATTR3]] {
 // NCRDIV-NEXT:  [[ENTRY:.*:]]
-// NCRDIV-NEXT:    [[TMP0:%.*]] = tail call contract noundef float @llvm.log10.f32(float [[X]])
+// NCRDIV-NEXT:    [[TMP0:%.*]] = tail call noundef float @llvm.log10.f32(float [[X]])
 // NCRDIV-NEXT:    ret float [[TMP0]]
 //
 // AMDGCNSPIRV-LABEL: define spir_func noundef float @test___log10f(
 // AMDGCNSPIRV-SAME: float noundef [[X:%.*]]) local_unnamed_addr addrspace(4) #[[ATTR3]] {
 // AMDGCNSPIRV-NEXT:  [[ENTRY:.*:]]
-// AMDGCNSPIRV-NEXT:    [[TMP0:%.*]] = tail call contract noundef addrspace(4) float @llvm.log10.f32(float [[X]])
+// AMDGCNSPIRV-NEXT:    [[TMP0:%.*]] = tail call noundef addrspace(4) float @llvm.log10.f32(float [[X]])
 // AMDGCNSPIRV-NEXT:    ret float [[TMP0]]
 //
 extern "C" __device__ float test___log10f(float x) {
@@ -8668,31 +8668,31 @@ extern "C" __device__ float test___log2f(float x) {
 // DEFAULT-LABEL: define dso_local noundef float @test___logf(
 // DEFAULT-SAME: float noundef [[X:%.*]]) local_unnamed_addr #[[ATTR3]] {
 // DEFAULT-NEXT:  [[ENTRY:.*:]]
-// DEFAULT-NEXT:    [[TMP0:%.*]] = tail call contract noundef float @llvm.log.f32(float [[X]])
+// DEFAULT-NEXT:    [[TMP0:%.*]] = tail call noundef float @llvm.log.f32(float [[X]])
 // DEFAULT-NEXT:    ret float [[TMP0]]
 //
 // FINITEONLY-LABEL: define dso_local noundef nofpclass(nan inf) float @test___logf(
 // FINITEONLY-SAME: float noundef nofpclass(nan inf) [[X:%.*]]) local_unnamed_addr #[[ATTR3]] {
 // FINITEONLY-NEXT:  [[ENTRY:.*:]]
-// FINITEONLY-NEXT:    [[TMP0:%.*]] = tail call nnan ninf contract noundef float @llvm.log.f32(float nofpclass(nan inf) [[X]])
+// FINITEONLY-NEXT:    [[TMP0:%.*]] = tail call nnan ninf noundef float @llvm.log.f32(float nofpclass(nan inf) [[X]])
 // FINITEONLY-NEXT:    ret float [[TMP0]]
 //
 // APPROX-LABEL: define dso_local noundef float @test___logf(
 // APPROX-SAME: float noundef [[X:%.*]]) local_unnamed_addr #[[ATTR3]] {
 // APPROX-NEXT:  [[ENTRY:.*:]]
-// APPROX-NEXT:    [[TMP0:%.*]] = tail call contract noundef float @llvm.log.f32(float [[X]])
+// APPROX-NEXT:    [[TMP0:%.*]] = tail call noundef float @llvm.log.f32(float [[X]])
 // APPROX-NEXT:    ret float [[TMP0]]
 //
 // NCRDIV-LABEL: define dso_local noundef float @test___logf(
 // NCRDIV-SAME: float noundef [[X:%.*]]) local_unnamed_addr #[[ATTR3]] {
 // NCRDIV-NEXT:  [[ENTRY:.*:]]
-// NCRDIV-NEXT:    [[TMP0:%.*]] = tail call contract noundef float @llvm.log.f32(float [[X]])
+// NCRDIV-NEXT:    [[TMP0:%.*]] = tail call noundef float @llvm.log.f32(float [[X]])
 // NCRDIV-NEXT:    ret float [[TMP0]]
 //
 // AMDGCNSPIRV-LABEL: define spir_func noundef float @test___logf(
 // AMDGCNSPIRV-SAME: float noundef [[X:%.*]]) local_unnamed_addr addrspace(4) #[[ATTR3]] {
 // AMDGCNSPIRV-NEXT:  [[ENTRY:.*:]]
-// AMDGCNSPIRV-NEXT:    [[TMP0:%.*]] = tail call contract noundef addrspace(4) float @llvm.log.f32(float [[X]])
+// AMDGCNSPIRV-NEXT:    [[TMP0:%.*]] = tail call noundef addrspace(4) float @llvm.log.f32(float [[X]])
 // AMDGCNSPIRV-NEXT:    ret float [[TMP0]]
 //
 extern "C" __device__ float test___logf(float x) {

@llvmbot
Member

llvmbot commented Nov 19, 2025

@llvm/pr-subscribers-clang-codegen

Author: Adel Ejjeh (adelejjeh)

 // DEFAULT-NEXT:    ret float [[TMP0]]
 //
 // FINITEONLY-LABEL: define dso_local noundef nofpclass(nan inf) float @test___log10f(
 // FINITEONLY-SAME: float noundef nofpclass(nan inf) [[X:%.*]]) local_unnamed_addr #[[ATTR3]] {
 // FINITEONLY-NEXT:  [[ENTRY:.*:]]
-// FINITEONLY-NEXT:    [[TMP0:%.*]] = tail call nnan ninf contract noundef float @llvm.log10.f32(float nofpclass(nan inf) [[X]])
+// FINITEONLY-NEXT:    [[TMP0:%.*]] = tail call nnan ninf noundef float @llvm.log10.f32(float nofpclass(nan inf) [[X]])
 // FINITEONLY-NEXT:    ret float [[TMP0]]
 //
 // APPROX-LABEL: define dso_local noundef float @test___log10f(
 // APPROX-SAME: float noundef [[X:%.*]]) local_unnamed_addr #[[ATTR3]] {
 // APPROX-NEXT:  [[ENTRY:.*:]]
-// APPROX-NEXT:    [[TMP0:%.*]] = tail call contract noundef float @llvm.log10.f32(float [[X]])
+// APPROX-NEXT:    [[TMP0:%.*]] = tail call noundef float @llvm.log10.f32(float [[X]])
 // APPROX-NEXT:    ret float [[TMP0]]
 //
 // NCRDIV-LABEL: define dso_local noundef float @test___log10f(
 // NCRDIV-SAME: float noundef [[X:%.*]]) local_unnamed_addr #[[ATTR3]] {
 // NCRDIV-NEXT:  [[ENTRY:.*:]]
-// NCRDIV-NEXT:    [[TMP0:%.*]] = tail call contract noundef float @llvm.log10.f32(float [[X]])
+// NCRDIV-NEXT:    [[TMP0:%.*]] = tail call noundef float @llvm.log10.f32(float [[X]])
 // NCRDIV-NEXT:    ret float [[TMP0]]
 //
 // AMDGCNSPIRV-LABEL: define spir_func noundef float @test___log10f(
 // AMDGCNSPIRV-SAME: float noundef [[X:%.*]]) local_unnamed_addr addrspace(4) #[[ATTR3]] {
 // AMDGCNSPIRV-NEXT:  [[ENTRY:.*:]]
-// AMDGCNSPIRV-NEXT:    [[TMP0:%.*]] = tail call contract noundef addrspace(4) float @llvm.log10.f32(float [[X]])
+// AMDGCNSPIRV-NEXT:    [[TMP0:%.*]] = tail call noundef addrspace(4) float @llvm.log10.f32(float [[X]])
 // AMDGCNSPIRV-NEXT:    ret float [[TMP0]]
 //
 extern "C" __device__ float test___log10f(float x) {
@@ -8668,31 +8668,31 @@ extern "C" __device__ float test___log2f(float x) {
 // DEFAULT-LABEL: define dso_local noundef float @test___logf(
 // DEFAULT-SAME: float noundef [[X:%.*]]) local_unnamed_addr #[[ATTR3]] {
 // DEFAULT-NEXT:  [[ENTRY:.*:]]
-// DEFAULT-NEXT:    [[TMP0:%.*]] = tail call contract noundef float @llvm.log.f32(float [[X]])
+// DEFAULT-NEXT:    [[TMP0:%.*]] = tail call noundef float @llvm.log.f32(float [[X]])
 // DEFAULT-NEXT:    ret float [[TMP0]]
 //
 // FINITEONLY-LABEL: define dso_local noundef nofpclass(nan inf) float @test___logf(
 // FINITEONLY-SAME: float noundef nofpclass(nan inf) [[X:%.*]]) local_unnamed_addr #[[ATTR3]] {
 // FINITEONLY-NEXT:  [[ENTRY:.*:]]
-// FINITEONLY-NEXT:    [[TMP0:%.*]] = tail call nnan ninf contract noundef float @llvm.log.f32(float nofpclass(nan inf) [[X]])
+// FINITEONLY-NEXT:    [[TMP0:%.*]] = tail call nnan ninf noundef float @llvm.log.f32(float nofpclass(nan inf) [[X]])
 // FINITEONLY-NEXT:    ret float [[TMP0]]
 //
 // APPROX-LABEL: define dso_local noundef float @test___logf(
 // APPROX-SAME: float noundef [[X:%.*]]) local_unnamed_addr #[[ATTR3]] {
 // APPROX-NEXT:  [[ENTRY:.*:]]
-// APPROX-NEXT:    [[TMP0:%.*]] = tail call contract noundef float @llvm.log.f32(float [[X]])
+// APPROX-NEXT:    [[TMP0:%.*]] = tail call noundef float @llvm.log.f32(float [[X]])
 // APPROX-NEXT:    ret float [[TMP0]]
 //
 // NCRDIV-LABEL: define dso_local noundef float @test___logf(
 // NCRDIV-SAME: float noundef [[X:%.*]]) local_unnamed_addr #[[ATTR3]] {
 // NCRDIV-NEXT:  [[ENTRY:.*:]]
-// NCRDIV-NEXT:    [[TMP0:%.*]] = tail call contract noundef float @llvm.log.f32(float [[X]])
+// NCRDIV-NEXT:    [[TMP0:%.*]] = tail call noundef float @llvm.log.f32(float [[X]])
 // NCRDIV-NEXT:    ret float [[TMP0]]
 //
 // AMDGCNSPIRV-LABEL: define spir_func noundef float @test___logf(
 // AMDGCNSPIRV-SAME: float noundef [[X:%.*]]) local_unnamed_addr addrspace(4) #[[ATTR3]] {
 // AMDGCNSPIRV-NEXT:  [[ENTRY:.*:]]
-// AMDGCNSPIRV-NEXT:    [[TMP0:%.*]] = tail call contract noundef addrspace(4) float @llvm.log.f32(float [[X]])
+// AMDGCNSPIRV-NEXT:    [[TMP0:%.*]] = tail call noundef addrspace(4) float @llvm.log.f32(float [[X]])
 // AMDGCNSPIRV-NEXT:    ret float [[TMP0]]
 //
 extern "C" __device__ float test___logf(float x) {

@github-actions

github-actions bot commented Nov 19, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

@efriedma-quic
Collaborator

I think I'd prefer to make llvm.log expansion ignore the "contract" flag, as opposed to suppressing the marking in the frontend. From your description, it sounds like the issue is specific to the exact way the backend is choosing to expand llvm.log.

@carlobertolli carlobertolli self-requested a review November 19, 2025 21:03
Member

@carlobertolli carlobertolli left a comment

LGTM, let's have others review as well.

@github-actions

github-actions bot commented Nov 19, 2025

🐧 Linux x64 Test Results

  • 111347 tests passed
  • 4420 tests skipped

@adelejjeh
Contributor Author

Force-pushed to enable the clang-format pre-commit hook

@adelejjeh
Contributor Author

I think I'd prefer to make llvm.log expansion ignore the "contract" flag, as opposed to suppressing the marking in the frontend. From your description, it sounds like the issue is specific to the exact way the backend is choosing to expand llvm.log.

@efriedma-quic As I understand it though, the presence of the `contract` flag on a `CallInst` indicates that it is legal to contract in the body of the called function. Since we know the AMDGCN backend implementation of `llvm.log` doesn't allow contraction, isn't it semantically illegal to have the flag on the intrinsic?

@efriedma-quic
Collaborator

As I understand it though, the presence of the contract flag on a callinst indicates that it is legal to contract in the body of the called function.

I don't think target-independent optimizations treat it that way. Like, the inliner doesn't propagate contract markings into the callee. The code that's propagating the contract marking for llvm.log is specifically in the amdgpu backend.

since we know the AMDGCN back-end implementation for llvm.log doesn't allow contraction, isn't it semantically illegal to have the flag on the intrinsic?

Having correlated behavior in different places introduces an ongoing maintenance burden: with this patch, every frontend needs to know that it can't attach `contract` to `llvm.log` specifically on amdgpu. If you fix the backend, every frontend gets the fix, and the check sits next to the code that actually does the expansion.

@adelejjeh
Contributor Author

Thanks for the feedback @efriedma-quic. I have created a new PR with the backend change (#168916) and will close this one.

@adelejjeh adelejjeh closed this Nov 20, 2025
arsenm pushed a commit that referenced this pull request Dec 4, 2025
…168916)

## Problem Summary

PyTorch's `test_warp_softmax_64bit_indexing` is failing with a numerical
precision error: `log(1.1422761679)` is computed with 54% higher error
than expected (9.042e-09 vs 5.859e-09), causing gradient computations to
exceed tolerance thresholds. This precision degradation was reproducible
across all AMD GPU architectures (gfx1100, gfx1200, gfx90a, gfx950). I
tracked down the problem to the commit **4703f8b6610a** (March 6, 2025)
which changed HIP math headers to call `__builtin_logf()` directly
instead of `__ocml_log_f32()`:

```diff
- float logf(float __x) { return __FAST_OR_SLOW(__logf, __ocml_log_f32)(__x); }
+ float logf(float __x) { return __FAST_OR_SLOW(__logf, __builtin_logf)(__x); }
```

This change exposed a problem in the AMDGCN backend, as described below:

## Key Findings

**1. Contract flag propagation:** When `-ffp-contract=fast` is enabled
(default for HIP), Clang's CodeGen adds the `contract` flag to all
`CallInst` instructions within the scope of `CGFPOptionsRAII`, including
calls to LLVM intrinsics like `llvm.log.f32`.

**2. Behavior change from OCML to builtin path:**
- **Old path** (via `__ocml_log_f32`): The preprocessed IR showed the
call to the OCML library function had the contract flag, but the OCML
implementation internally dropped the contract flag when calling the
`llvm.log.f32` intrinsic.
```llvm
; Function Attrs: alwaysinline convergent mustprogress nounwind
define internal noundef float @_ZL4logff(float noundef %__x) #6 {
entry:
  %retval = alloca float, align 4, addrspace(5)
  %__x.addr = alloca float, align 4, addrspace(5)
  %retval.ascast = addrspacecast ptr addrspace(5) %retval to ptr
  %__x.addr.ascast = addrspacecast ptr addrspace(5) %__x.addr to ptr
  store float %__x, ptr %__x.addr.ascast, align 4, !tbaa !23
  %0 = load float, ptr %__x.addr.ascast, align 4, !tbaa !23
  %call = call contract float @__ocml_log_f32(float noundef %0) #23
  ret float %call
}

; Function Attrs: convergent mustprogress nofree norecurse nosync nounwind willreturn memory(none)
define internal noundef float @__ocml_log_f32(float noundef %0) #7 {
  %2 = tail call float @llvm.log.f32(float %0)
  ret float %2
}
```
- **New path** (via `__builtin_logf`): The call goes directly to the
`llvm.log.f32` intrinsic with the contract flag preserved, causing the
backend to apply FMA contraction during the intrinsic's expansion.
```llvm
; Function Attrs: alwaysinline convergent mustprogress nounwind
define internal noundef float @_ZL4logff(float noundef %__x) #6 {
entry:
  %retval = alloca float, align 4, addrspace(5)
  %__x.addr = alloca float, align 4, addrspace(5)
  %retval.ascast = addrspacecast ptr addrspace(5) %retval to ptr
  %__x.addr.ascast = addrspacecast ptr addrspace(5) %__x.addr to ptr
  store float %__x, ptr %__x.addr.ascast, align 4, !tbaa !24
  %0 = load float, ptr %__x.addr.ascast, align 4, !tbaa !24
  %1 = call contract float @llvm.log.f32(float %0)
  ret float %1
}
```
**3. Why contract breaks log:** Our AMDGCN target backend implements
the natural logarithm by taking the result of the hardware log2
instruction (`v_log_f32`), multiplying it by `ln(2)`, and applying a
rounding-error correction to that multiplication. This results in
something like:
```c
r = y * c1;                         // y is the result of v_log_f32, c1 = ln(2)
r = r + fma(y, c2, fma(y, c1, -r)); // c2 is another error-correcting constant
```
```asm
  v_log_f32_e32 v1, v1
  s_mov_b32 s2, 0x3f317217
  v_mul_f32_e32 v3, 0x3f317217, v1
  v_fma_f32 v4, v1, s2, -v3
  v_fmac_f32_e32 v4, 0x3377d1cf, v1
  v_add_f32_e32 v3, v3, v4
```
With the `contract` flag present, the backend fuses the final add (`r +
Z`, where `Z` is the compensation term) with the multiply, believing
this to be legal, thus eliminating the intermediate rounding of the
product. The error-compensation term, which was computed from the
rounded product, is now added to the full-precision result of the FMA,
yielding incorrect error correction and degraded accuracy. The
contracted sequence becomes the following:
```c
r = y * c1;
r = fma(y, c1, fma(y, c2, fma(y, c1, -r)));
```
```asm
  v_log_f32_e32 v1, v1
  s_mov_b32 s2, 0x3f317217
  v_mul_f32_e32 v3, 0x3f317217, v1
  v_fma_f32 v3, v1, s2, -v3
  v_fmac_f32_e32 v3, 0x3377d1cf, v1
  v_fmac_f32_e32 v3, 0x3f317217, v1
```
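The numerical effect of this fusion can be sketched in a few lines of
Python: a compensated product splits `y * c1` into a rounded part `r`
and an exact residual `e`, and fusing the final add re-creates the
full-precision product, so `e` no longer cancels the rounding it was
meant to correct. The values below (`y = 3`, `c1 = float32(1/3)`) are
illustrative stand-ins chosen so the rounding is visible, not the real
`ln(2)` split constants:

```python
import struct

def f32(x: float) -> float:
    """Round a Python float (double) to the nearest float32 value."""
    return struct.unpack('f', struct.pack('f', x))[0]

y  = f32(3.0)
c1 = f32(1.0 / 3.0)          # float32(1/3) = 0.3333333432674408

r     = f32(y * c1)          # rounded float32 product -> 1.0
exact = y * c1               # exact in double: 24-bit x 24-bit fits in 53 bits
e     = exact - r            # rounding error that fma(y, c1, -r) recovers; 2**-25

# Compensated sequence (what the backend intends): r + e == the exact product
compensated = r + e
assert compensated == exact

# Under 'contract' the final add fuses into fma(y, c1, e): the intermediate
# rounding of r vanishes, so the correction e is effectively applied twice.
fused = exact + e            # what fma(y, c1, e) computes (single rounding)
assert fused != compensated  # the compensated sum's accuracy is lost
```

With the real `ln(2)` constants the discrepancy is at the last-ulp level,
which is exactly the 54% error growth the failing test observed.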

## Solution and Proposed Fix

Because our expansion of `llvm.log` and `llvm.log10` relies on
error-compensated summation, it is illegal for the backend to honor the
`contract` flag when it is present on the intrinsic call. My proposed
fix modifies the instruction-selection passes (both GlobalISel and
SelectionDAG) to drop the `contract` flag when lowering `llvm.log`, so
that instruction selection's contraction optimization will not fuse the
multiply and add.

Note: I had originally implemented this fix in the frontend by removing
the `contract` flag when lowering the `llvm.log` builtin (PR #168770). I
have since closed that PR.

Labels

clang:codegen IR generation bugs: mangling, exceptions, etc. clang Clang issues not falling into any other category
