[AMDGPU] Update log lowering to remove contract for AMDGCN backend #168770
Conversation
Thank you for submitting a Pull Request (PR) to the LLVM Project! This PR will be automatically labeled and the relevant teams will be notified. If you wish to, you can add reviewers by using the "Reviewers" section on this page. If this is not working for you, it is probably because you do not have write permissions for the repository; in that case you can instead tag reviewers by name in a comment by using @ followed by their GitHub username.
If you have received no comments on your PR for a week, you can request a review by "ping"ing the PR by adding a comment "Ping". The common courtesy "ping" rate is once a week. Please remember that you are asking for valuable time from other developers.
If you have further questions, they may be answered by the LLVM GitHub User Guide. You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.
@llvm/pr-subscribers-clang

Author: Adel Ejjeh (adelejjeh)

Changes

Problem Summary
PyTorch's test_warp_softmax_64bit_indexing began failing after the latest mainline promotion. The test failure manifested as a numerical precision error where log(1.1422761679) computed with 54% higher error than expected (9.042e-09 vs 5.859e-09), causing gradient computations to exceed tolerance thresholds. This precision degradation was reproducible across all AMD GPU architectures (gfx1100, gfx1200, gfx90a, gfx950).
I tracked down the problem to the upstream commit 4703f8b (March 6, 2025) titled "clang/HIP: Use generic builtins for f32 exp and log (#129638)". This commit changed HIP math headers to call __builtin_logf() directly instead of __ocml_log_f32():
- float logf(float __x) { return __FAST_OR_SLOW(__logf, __ocml_log_f32)(__x); }
+ float logf(float __x) { return __FAST_OR_SLOW(__logf, __builtin_logf)(__x); }
This change exposed a bug with how Clang handles the contract fast-math flag on log intrinsics with the AMDGCN target.
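A minimal reproducer sketch follows (the file name, offload architecture, and compile flags are my assumptions, not taken from this PR): compiling a trivial HIP device function that calls __builtin_logf() under HIP's default -ffp-contract=fast should show the contract flag on the emitted llvm.log.f32 call, as in the IR quoted under Key Findings below.

```cpp
// repro.hip -- hypothetical reproducer; names, arch, and flags are illustrative.
// Suggested device-IR compile (assuming a gfx90a target):
//   clang++ -x hip --offload-arch=gfx90a --cuda-device-only -O2 -S -emit-llvm repro.hip
#include <hip/hip_runtime.h>

__global__ void log_kernel(float *out, const float *in) {
  // With -ffp-contract=fast (the HIP default), this is expected to lower to
  // something like:  %1 = call contract float @llvm.log.f32(float %0)
  out[threadIdx.x] = __builtin_logf(in[threadIdx.x]);
}
```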
Key Findings
1. Contract flag propagation: When -ffp-contract=fast is enabled (default for HIP), Clang's CodeGen adds the contract flag to all CallInst instructions within the scope of CGFPOptionsRAII, including calls to LLVM intrinsics like llvm.log.f32.
2. Behavior change from OCML to builtin path:
Old path (via __ocml_log_f32): The preprocessed IR showed the call to the OCML library function had the contract flag, but the OCML implementation internally dropped the contract flag when calling the llvm.log.f32 intrinsic.
; Function Attrs: alwaysinline convergent mustprogress nounwind
define internal noundef float @_ZL4logff(float noundef %__x) #6 {
entry:
%retval = alloca float, align 4, addrspace(5)
%__x.addr = alloca float, align 4, addrspace(5)
%retval.ascast = addrspacecast ptr addrspace(5) %retval to ptr
%__x.addr.ascast = addrspacecast ptr addrspace(5) %__x.addr to ptr
store float %__x, ptr %__x.addr.ascast, align 4, !tbaa !23
%0 = load float, ptr %__x.addr.ascast, align 4, !tbaa !23
%call = call contract float @__ocml_log_f32(float noundef %0) #23
ret float %call
}
; Function Attrs: convergent mustprogress nofree norecurse nosync nounwind willreturn memory(none)
define internal noundef float @__ocml_log_f32(float noundef %0) #7 {
%2 = tail call float @llvm.log.f32(float %0)
ret float %2
}
New path (via __builtin_logf): The call goes directly to the llvm.log.f32 intrinsic with the contract flag preserved, causing the backend to apply FMA contraction during polynomial expansion.
; Function Attrs: alwaysinline convergent mustprogress nounwind
define internal noundef float @_ZL4logff(float noundef %__x) #6 {
entry:
%retval = alloca float, align 4, addrspace(5)
%__x.addr = alloca float, align 4, addrspace(5)
%retval.ascast = addrspacecast ptr addrspace(5) %retval to ptr
%__x.addr.ascast = addrspacecast ptr addrspace(5) %__x.addr to ptr
store float %__x, ptr %__x.addr.ascast, align 4, !tbaa !24
%0 = load float, ptr %__x.addr.ascast, align 4, !tbaa !24
%1 = call contract float @llvm.log.f32(float %0)
ret float %1
}
3. Why contract breaks log: Our AMDGCN target back end implements the natural logarithm by taking the result of the hardware log, then multiplying that by ln(2), and applying some rounding error correction to that multiplication. This results in something like:
r = y * c1; // y is result of v_log_ instruction, c1 = ln(2)
r = r + fma(y, c2, fma(y, c1, -r)) // c2 is another error-correcting constant
v_log_f32_e32 v1, v1
s_mov_b32 s2, 0x3f317217
v_mul_f32_e32 v3, 0x3f317217, v1
v_fma_f32 v4, v1, s2, -v3
v_fmac_f32_e32 v4, 0x3377d1cf, v1
v_add_f32_e32 v3, v3, v4
With the presence of the contract flag, the back-end fuses the add (r + Z) with the multiply thinking that it is legal, thus eliminating the intermediate rounding. The error compensation term, which was calculated based on the rounded product, is now being added to the full-precision result from the FMA, leading to incorrect error correction and degraded accuracy. The corresponding contracted operations become the following:
r = y * c1;
r = fma(y, c1, fma(y, c2, fma(y, c1, -r)));
v_log_f32_e32 v1, v1
s_mov_b32 s2, 0x3f317217
v_mul_f32_e32 v3, 0x3f317217, v1
v_fma_f32 v3, v1, s2, -v3
v_fmac_f32_e32 v3, 0x3377d1cf, v1
v_fmac_f32_e32 v3, 0x3f317217, v1
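To make the accuracy argument concrete, here is a small host-side C++ sketch (an illustration only: the loop inputs are arbitrary stand-ins, and only the two ln(2) split constants are taken from the disassembly above). It computes the compensated product both ways, the separate r + e add that the expansion intends and the fma(y, c1, e) form that contraction produces, and counts how often the two float results disagree.

```cpp
// fused_vs_separate.cpp -- illustrative only; build with: g++ -O2 fused_vs_separate.cpp
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <cstring>

static float from_bits(uint32_t bits) {
  float f;
  std::memcpy(&f, &bits, sizeof f);
  return f;
}

int main() {
  // Bit patterns from the assembly above: ln(2) split into a high part c1
  // and a low error-correcting part c2.
  const float c1 = from_bits(0x3f317217u);
  const float c2 = from_bits(0x3377d1cfu);

  int mismatches = 0;
  const int n = 1000000;
  for (int i = 1; i <= n; ++i) {
    float y = static_cast<float>(i) / 4096.0f;        // stand-in for the v_log_f32 result
    float r = y * c1;                                 // product rounded to float
    float e = std::fmaf(y, c2, std::fmaf(y, c1, -r)); // compensation relative to the rounded r
    float separate = r + e;                           // what the expansion intends
    float fused = std::fmaf(y, c1, e);                // what `contract` allows instead
    if (separate != fused)
      ++mismatches;
  }
  std::printf("separate vs fused differ for %d of %d inputs\n", mismatches, n);
  return 0;
}
```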
Solution and Proposed Fix
Based on our implementation of llvm.log, it should be illegal to add the contract flag to the intrinsic call because it uses error-correcting summation. The presence of contract on a callinst indicates that it is legal to propagate the flag to the internals of the called function, but in this case that is not true since, as described above, the error-correcting summation we use doesn't allow for contraction.
My proposed fix involves adding logic to CGBuiltin.cpp to explicitly disable the contract flag on the CallInst for the llvm.log intrinsic when the target is AMDGCN/HIP.
Full diff: https://github.com/llvm/llvm-project/pull/168770.diff
2 Files Affected:
- clang/lib/CodeGen/CGBuiltin.cpp
- clang/test/Headers/__clang_hip_math.hip

diff --git a/clang/lib/CodeGen/CGBuiltin.cpp b/clang/lib/CodeGen/CGBuiltin.cpp
index 3079f8ab7229e..3cf9be8e70b57 100644
--- a/clang/lib/CodeGen/CGBuiltin.cpp
+++ b/clang/lib/CodeGen/CGBuiltin.cpp
@@ -582,6 +582,23 @@ static Value *EmitISOVolatileStore(CodeGenFunction &CGF, const CallExpr *E) {
return Store;
}
+// Check if an intrinsic is a transcendental function that is unsafe to contract.
+static bool isUnsafeToContract(unsigned IntrinsicID, CodeGenFunction &CGF) {
+ switch (IntrinsicID) {
+ // The implementation for log in the AMDGCN backend uses a refinement algorithm
+ // that requires intermediate rounding. The contract flag
+ // would allow FMA formation that recomputes products, breaking the
+ // refinement algorithm.
+ case Intrinsic::log:
+ case Intrinsic::log10:
+ if ((CGF.getTarget().getTriple().isAMDGCN() ||
+ CGF.getTarget().getTriple().isSPIRV()) &&
+ CGF.getLangOpts().HIP)
+ return true;
+ default:
+ return false;
+ }
+}
// Emit a simple mangled intrinsic that has 1 argument and a return type
// matching the argument type. Depending on mode, this may be a constrained
// floating-point intrinsic.
@@ -596,7 +613,14 @@ Value *emitUnaryMaybeConstrainedFPBuiltin(CodeGenFunction &CGF,
return CGF.Builder.CreateConstrainedFPCall(F, { Src0 });
} else {
Function *F = CGF.CGM.getIntrinsic(IntrinsicID, Src0->getType());
- return CGF.Builder.CreateCall(F, Src0);
+ CallInst *Call = CGF.Builder.CreateCall(F, Src0);
+
+ // Check if the intrinsic is unsafe to contract
+ if (isUnsafeToContract(IntrinsicID, CGF)) {
+ Call->setHasAllowContract(false);
+ }
+
+ return Call;
}
}
diff --git a/clang/test/Headers/__clang_hip_math.hip b/clang/test/Headers/__clang_hip_math.hip
index 7e2691633c215..aa97fc84f0904 100644
--- a/clang/test/Headers/__clang_hip_math.hip
+++ b/clang/test/Headers/__clang_hip_math.hip
@@ -3673,31 +3673,31 @@ extern "C" __device__ long long int test_llround(double x) {
// DEFAULT-LABEL: define dso_local noundef float @test_log10f(
// DEFAULT-SAME: float noundef [[X:%.*]]) local_unnamed_addr #[[ATTR3]] {
// DEFAULT-NEXT: [[ENTRY:.*:]]
-// DEFAULT-NEXT: [[TMP0:%.*]] = tail call contract noundef float @llvm.log10.f32(float [[X]])
+// DEFAULT-NEXT: [[TMP0:%.*]] = tail call noundef float @llvm.log10.f32(float [[X]])
// DEFAULT-NEXT: ret float [[TMP0]]
//
// FINITEONLY-LABEL: define dso_local noundef nofpclass(nan inf) float @test_log10f(
// FINITEONLY-SAME: float noundef nofpclass(nan inf) [[X:%.*]]) local_unnamed_addr #[[ATTR3]] {
// FINITEONLY-NEXT: [[ENTRY:.*:]]
-// FINITEONLY-NEXT: [[TMP0:%.*]] = tail call nnan ninf contract noundef float @llvm.log10.f32(float nofpclass(nan inf) [[X]])
+// FINITEONLY-NEXT: [[TMP0:%.*]] = tail call nnan ninf noundef float @llvm.log10.f32(float nofpclass(nan inf) [[X]])
// FINITEONLY-NEXT: ret float [[TMP0]]
//
// APPROX-LABEL: define dso_local noundef float @test_log10f(
// APPROX-SAME: float noundef [[X:%.*]]) local_unnamed_addr #[[ATTR3]] {
// APPROX-NEXT: [[ENTRY:.*:]]
-// APPROX-NEXT: [[TMP0:%.*]] = tail call contract noundef float @llvm.log10.f32(float [[X]])
+// APPROX-NEXT: [[TMP0:%.*]] = tail call noundef float @llvm.log10.f32(float [[X]])
// APPROX-NEXT: ret float [[TMP0]]
//
// NCRDIV-LABEL: define dso_local noundef float @test_log10f(
// NCRDIV-SAME: float noundef [[X:%.*]]) local_unnamed_addr #[[ATTR3]] {
// NCRDIV-NEXT: [[ENTRY:.*:]]
-// NCRDIV-NEXT: [[TMP0:%.*]] = tail call contract noundef float @llvm.log10.f32(float [[X]])
+// NCRDIV-NEXT: [[TMP0:%.*]] = tail call noundef float @llvm.log10.f32(float [[X]])
// NCRDIV-NEXT: ret float [[TMP0]]
//
// AMDGCNSPIRV-LABEL: define spir_func noundef float @test_log10f(
// AMDGCNSPIRV-SAME: float noundef [[X:%.*]]) local_unnamed_addr addrspace(4) #[[ATTR3]] {
// AMDGCNSPIRV-NEXT: [[ENTRY:.*:]]
-// AMDGCNSPIRV-NEXT: [[TMP0:%.*]] = tail call contract noundef addrspace(4) float @llvm.log10.f32(float [[X]])
+// AMDGCNSPIRV-NEXT: [[TMP0:%.*]] = tail call noundef addrspace(4) float @llvm.log10.f32(float [[X]])
// AMDGCNSPIRV-NEXT: ret float [[TMP0]]
//
extern "C" __device__ float test_log10f(float x) {
@@ -3945,25 +3945,25 @@ extern "C" __device__ double test_logb(double x) {
// DEFAULT-LABEL: define dso_local noundef float @test_logf(
// DEFAULT-SAME: float noundef [[X:%.*]]) local_unnamed_addr #[[ATTR3]] {
// DEFAULT-NEXT: [[ENTRY:.*:]]
-// DEFAULT-NEXT: [[TMP0:%.*]] = tail call contract noundef float @llvm.log.f32(float [[X]])
+// DEFAULT-NEXT: [[TMP0:%.*]] = tail call noundef float @llvm.log.f32(float [[X]])
// DEFAULT-NEXT: ret float [[TMP0]]
//
// FINITEONLY-LABEL: define dso_local noundef nofpclass(nan inf) float @test_logf(
// FINITEONLY-SAME: float noundef nofpclass(nan inf) [[X:%.*]]) local_unnamed_addr #[[ATTR3]] {
// FINITEONLY-NEXT: [[ENTRY:.*:]]
-// FINITEONLY-NEXT: [[TMP0:%.*]] = tail call nnan ninf contract noundef float @llvm.log.f32(float nofpclass(nan inf) [[X]])
+// FINITEONLY-NEXT: [[TMP0:%.*]] = tail call nnan ninf noundef float @llvm.log.f32(float nofpclass(nan inf) [[X]])
// FINITEONLY-NEXT: ret float [[TMP0]]
//
// APPROX-LABEL: define dso_local noundef float @test_logf(
// APPROX-SAME: float noundef [[X:%.*]]) local_unnamed_addr #[[ATTR3]] {
// APPROX-NEXT: [[ENTRY:.*:]]
-// APPROX-NEXT: [[TMP0:%.*]] = tail call contract noundef float @llvm.log.f32(float [[X]])
+// APPROX-NEXT: [[TMP0:%.*]] = tail call noundef float @llvm.log.f32(float [[X]])
// APPROX-NEXT: ret float [[TMP0]]
//
// NCRDIV-LABEL: define dso_local noundef float @test_logf(
// NCRDIV-SAME: float noundef [[X:%.*]]) local_unnamed_addr #[[ATTR3]] {
// NCRDIV-NEXT: [[ENTRY:.*:]]
-// NCRDIV-NEXT: [[TMP0:%.*]] = tail call contract noundef float @llvm.log.f32(float [[X]])
+// NCRDIV-NEXT: [[TMP0:%.*]] = tail call noundef float @llvm.log.f32(float [[X]])
// NCRDIV-NEXT: ret float [[TMP0]]
//
// AMDGCNSPIRV-LABEL: define spir_func noundef float @test_logf(
@@ -8600,31 +8600,31 @@ extern "C" __device__ float test___fsub_rn(float x, float y) {
// DEFAULT-LABEL: define dso_local noundef float @test___log10f(
// DEFAULT-SAME: float noundef [[X:%.*]]) local_unnamed_addr #[[ATTR3]] {
// DEFAULT-NEXT: [[ENTRY:.*:]]
-// DEFAULT-NEXT: [[TMP0:%.*]] = tail call contract noundef float @llvm.log10.f32(float [[X]])
+// DEFAULT-NEXT: [[TMP0:%.*]] = tail call noundef float @llvm.log10.f32(float [[X]])
// DEFAULT-NEXT: ret float [[TMP0]]
//
// FINITEONLY-LABEL: define dso_local noundef nofpclass(nan inf) float @test___log10f(
// FINITEONLY-SAME: float noundef nofpclass(nan inf) [[X:%.*]]) local_unnamed_addr #[[ATTR3]] {
// FINITEONLY-NEXT: [[ENTRY:.*:]]
-// FINITEONLY-NEXT: [[TMP0:%.*]] = tail call nnan ninf contract noundef float @llvm.log10.f32(float nofpclass(nan inf) [[X]])
+// FINITEONLY-NEXT: [[TMP0:%.*]] = tail call nnan ninf noundef float @llvm.log10.f32(float nofpclass(nan inf) [[X]])
// FINITEONLY-NEXT: ret float [[TMP0]]
//
// APPROX-LABEL: define dso_local noundef float @test___log10f(
// APPROX-SAME: float noundef [[X:%.*]]) local_unnamed_addr #[[ATTR3]] {
// APPROX-NEXT: [[ENTRY:.*:]]
-// APPROX-NEXT: [[TMP0:%.*]] = tail call contract noundef float @llvm.log10.f32(float [[X]])
+// APPROX-NEXT: [[TMP0:%.*]] = tail call noundef float @llvm.log10.f32(float [[X]])
// APPROX-NEXT: ret float [[TMP0]]
//
// NCRDIV-LABEL: define dso_local noundef float @test___log10f(
// NCRDIV-SAME: float noundef [[X:%.*]]) local_unnamed_addr #[[ATTR3]] {
// NCRDIV-NEXT: [[ENTRY:.*:]]
-// NCRDIV-NEXT: [[TMP0:%.*]] = tail call contract noundef float @llvm.log10.f32(float [[X]])
+// NCRDIV-NEXT: [[TMP0:%.*]] = tail call noundef float @llvm.log10.f32(float [[X]])
// NCRDIV-NEXT: ret float [[TMP0]]
//
// AMDGCNSPIRV-LABEL: define spir_func noundef float @test___log10f(
// AMDGCNSPIRV-SAME: float noundef [[X:%.*]]) local_unnamed_addr addrspace(4) #[[ATTR3]] {
// AMDGCNSPIRV-NEXT: [[ENTRY:.*:]]
-// AMDGCNSPIRV-NEXT: [[TMP0:%.*]] = tail call contract noundef addrspace(4) float @llvm.log10.f32(float [[X]])
+// AMDGCNSPIRV-NEXT: [[TMP0:%.*]] = tail call noundef addrspace(4) float @llvm.log10.f32(float [[X]])
// AMDGCNSPIRV-NEXT: ret float [[TMP0]]
//
extern "C" __device__ float test___log10f(float x) {
@@ -8668,31 +8668,31 @@ extern "C" __device__ float test___log2f(float x) {
// DEFAULT-LABEL: define dso_local noundef float @test___logf(
// DEFAULT-SAME: float noundef [[X:%.*]]) local_unnamed_addr #[[ATTR3]] {
// DEFAULT-NEXT: [[ENTRY:.*:]]
-// DEFAULT-NEXT: [[TMP0:%.*]] = tail call contract noundef float @llvm.log.f32(float [[X]])
+// DEFAULT-NEXT: [[TMP0:%.*]] = tail call noundef float @llvm.log.f32(float [[X]])
// DEFAULT-NEXT: ret float [[TMP0]]
//
// FINITEONLY-LABEL: define dso_local noundef nofpclass(nan inf) float @test___logf(
// FINITEONLY-SAME: float noundef nofpclass(nan inf) [[X:%.*]]) local_unnamed_addr #[[ATTR3]] {
// FINITEONLY-NEXT: [[ENTRY:.*:]]
-// FINITEONLY-NEXT: [[TMP0:%.*]] = tail call nnan ninf contract noundef float @llvm.log.f32(float nofpclass(nan inf) [[X]])
+// FINITEONLY-NEXT: [[TMP0:%.*]] = tail call nnan ninf noundef float @llvm.log.f32(float nofpclass(nan inf) [[X]])
// FINITEONLY-NEXT: ret float [[TMP0]]
//
// APPROX-LABEL: define dso_local noundef float @test___logf(
// APPROX-SAME: float noundef [[X:%.*]]) local_unnamed_addr #[[ATTR3]] {
// APPROX-NEXT: [[ENTRY:.*:]]
-// APPROX-NEXT: [[TMP0:%.*]] = tail call contract noundef float @llvm.log.f32(float [[X]])
+// APPROX-NEXT: [[TMP0:%.*]] = tail call noundef float @llvm.log.f32(float [[X]])
// APPROX-NEXT: ret float [[TMP0]]
//
// NCRDIV-LABEL: define dso_local noundef float @test___logf(
// NCRDIV-SAME: float noundef [[X:%.*]]) local_unnamed_addr #[[ATTR3]] {
// NCRDIV-NEXT: [[ENTRY:.*:]]
-// NCRDIV-NEXT: [[TMP0:%.*]] = tail call contract noundef float @llvm.log.f32(float [[X]])
+// NCRDIV-NEXT: [[TMP0:%.*]] = tail call noundef float @llvm.log.f32(float [[X]])
// NCRDIV-NEXT: ret float [[TMP0]]
//
// AMDGCNSPIRV-LABEL: define spir_func noundef float @test___logf(
// AMDGCNSPIRV-SAME: float noundef [[X:%.*]]) local_unnamed_addr addrspace(4) #[[ATTR3]] {
// AMDGCNSPIRV-NEXT: [[ENTRY:.*:]]
-// AMDGCNSPIRV-NEXT: [[TMP0:%.*]] = tail call contract noundef addrspace(4) float @llvm.log.f32(float [[X]])
+// AMDGCNSPIRV-NEXT: [[TMP0:%.*]] = tail call noundef addrspace(4) float @llvm.log.f32(float [[X]])
// AMDGCNSPIRV-NEXT: ret float [[TMP0]]
//
extern "C" __device__ float test___logf(float x) {
✅ With the latest revision this PR passed the C/C++ code formatter.
I think I'd prefer to make llvm.log expansion ignore the "contract" flag, as opposed to suppressing the marking in the frontend. From your description, it sounds like the issue is specific to the exact way the backend is choosing to expand llvm.log.
carlobertolli left a comment
LGTM, let's have others review as well.
Force-pushed ea795b9 to 9880236
Force-pushed to enable the clang-format pre-commit hook.
Force-pushed 9880236 to a727ac7
@efriedma-quic As I understand it though, the presence of the contract flag on a callinst indicates that it is legal to contract in the body of the called function. As such, since we know the AMDGCN back-end implementation for llvm.log doesn't allow contraction, isn't it semantically illegal to have the flag on the intrinsic?
Force-pushed a727ac7 to 925df13
I don't think target-independent optimizations treat it that way. Like, the inliner doesn't propagate contract markings into the callee. The code that's propagating the contract marking for llvm.log is specifically in the amdgpu backend.
Having correlated behavior in different places introduces a continuing maintenance burden: with this patch, every frontend needs to know that it can't attach contract to llvm.log specifically on amdgpu. If you fix the backend, every frontend gets the fix, and the check is next to the code that actually does the expansion.
Thanks for the feedback @efriedma-quic. I have created a new PR (#168916) with the backend change and will close this one.
…168916)

## Problem Summary

PyTorch's `test_warp_softmax_64bit_indexing` is failing with a numerical precision error where `log(1.1422761679)` computed with 54% higher error than expected (9.042e-09 vs 5.859e-09), causing gradient computations to exceed tolerance thresholds. This precision degradation was reproducible across all AMD GPU architectures (gfx1100, gfx1200, gfx90a, gfx950).

I tracked down the problem to the commit **4703f8b6610a** (March 6, 2025) which changed HIP math headers to call `__builtin_logf()` directly instead of `__ocml_log_f32()`:

```diff
- float logf(float __x) { return __FAST_OR_SLOW(__logf, __ocml_log_f32)(__x); }
+ float logf(float __x) { return __FAST_OR_SLOW(__logf, __builtin_logf)(__x); }
```

This change exposed a problem in the AMDGCN back-end as described below:

## Key Findings

**1. Contract flag propagation:** When `-ffp-contract=fast` is enabled (default for HIP), Clang's CodeGen adds the `contract` flag to all `CallInst` instructions within the scope of `CGFPOptionsRAII`, including calls to LLVM intrinsics like `llvm.log.f32`.

**2. Behavior change from OCML to builtin path:**

- **Old path** (via `__ocml_log_f32`): The preprocessed IR showed the call to the OCML library function had the contract flag, but the OCML implementation internally dropped the contract flag when calling the `llvm.log.f32` intrinsic.

```llvm
; Function Attrs: alwaysinline convergent mustprogress nounwind
define internal noundef float @_ZL4logff(float noundef %__x) #6 {
entry:
  %retval = alloca float, align 4, addrspace(5)
  %__x.addr = alloca float, align 4, addrspace(5)
  %retval.ascast = addrspacecast ptr addrspace(5) %retval to ptr
  %__x.addr.ascast = addrspacecast ptr addrspace(5) %__x.addr to ptr
  store float %__x, ptr %__x.addr.ascast, align 4, !tbaa !23
  %0 = load float, ptr %__x.addr.ascast, align 4, !tbaa !23
  %call = call contract float @__ocml_log_f32(float noundef %0) #23
  ret float %call
}

; Function Attrs: convergent mustprogress nofree norecurse nosync nounwind willreturn memory(none)
define internal noundef float @__ocml_log_f32(float noundef %0) #7 {
  %2 = tail call float @llvm.log.f32(float %0)
  ret float %2
}
```

- **New path** (via `__builtin_logf`): The call goes directly to the `llvm.log.f32` intrinsic with the contract flag preserved, causing the backend to apply FMA contraction during polynomial expansion.

```llvm
; Function Attrs: alwaysinline convergent mustprogress nounwind
define internal noundef float @_ZL4logff(float noundef %__x) #6 {
entry:
  %retval = alloca float, align 4, addrspace(5)
  %__x.addr = alloca float, align 4, addrspace(5)
  %retval.ascast = addrspacecast ptr addrspace(5) %retval to ptr
  %__x.addr.ascast = addrspacecast ptr addrspace(5) %__x.addr to ptr
  store float %__x, ptr %__x.addr.ascast, align 4, !tbaa !24
  %0 = load float, ptr %__x.addr.ascast, align 4, !tbaa !24
  %1 = call contract float @llvm.log.f32(float %0)
  ret float %1
}
```

**3. Why contract breaks log:** Our AMDGCN target back end implements the natural logarithm by taking the result of the hardware log, then multiplying that by `ln(2)`, and applying some rounding error correction to that multiplication. This results in something like:

```c
r = y * c1; // y is result of v_log_ instruction, c1 = ln(2)
r = r + fma(y, c2, fma(y, c1, -r)) // c2 is another error-correcting constant
```

```asm
v_log_f32_e32 v1, v1
s_mov_b32 s2, 0x3f317217
v_mul_f32_e32 v3, 0x3f317217, v1
v_fma_f32 v4, v1, s2, -v3
v_fmac_f32_e32 v4, 0x3377d1cf, v1
v_add_f32_e32 v3, v3, v4
```

With the presence of the `contract` flag, the back-end fuses the add (`r + Z`) with the multiply thinking that it is legal, thus eliminating the intermediate rounding. The error compensation term, which was calculated based on the rounded product, is now being added to the full-precision result from the FMA, leading to incorrect error correction and degraded accuracy. The corresponding contracted operations become the following:

```c
r = y * c1;
r = fma(y, c1, fma(y, c2, fma(y, c1, -r)));
```

```asm
v_log_f32_e32 v1, v1
s_mov_b32 s2, 0x3f317217
v_mul_f32_e32 v3, 0x3f317217, v1
v_fma_f32 v3, v1, s2, -v3
v_fmac_f32_e32 v3, 0x3377d1cf, v1
v_fmac_f32_e32 v3, 0x3f317217, v1
```

## Solution and Proposed Fix

Based on our implementation of `llvm.log` and `llvm.log10`, it should be illegal for the back-end to propagate the `contract` flag when it is present on the intrinsic call because it uses error-correcting summation. My proposed fix is to modify the instruction selection passes (both global-isel and sdag) to drop the `contract` flag when lowering llvm.log. That way, when the instruction selection performs the contraction optimization, it will not fuse the multiply and add.

Note: I had originally implemented this fix in the FE by removing the `contract` flag when lowering the llvm.log builtin (PR #168770). I have since closed that PR.
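For readers unfamiliar with the fast-math-flag API involved, the sketch below is an illustration only (an IR-level analogue I am assuming for clarity, not the actual #168916 patch, which drops the flag during instruction selection): it clears `contract` on every `llvm.log` / `llvm.log10` call in a module using the same `setHasAllowContract` interface that appears in the CGBuiltin.cpp diff above.

```cpp
// StripLogContract.cpp -- illustrative sketch only (not the actual fix).
#include "llvm/IR/InstIterator.h"
#include "llvm/IR/IntrinsicInst.h"
#include "llvm/IR/Module.h"

using namespace llvm;

// Clear the `contract` fast-math flag on every llvm.log / llvm.log10 call in M.
static void stripContractFromLogCalls(Module &M) {
  for (Function &F : M)
    for (Instruction &I : instructions(F))
      if (auto *II = dyn_cast<IntrinsicInst>(&I)) {
        Intrinsic::ID ID = II->getIntrinsicID();
        if (ID == Intrinsic::log || ID == Intrinsic::log10)
          II->setHasAllowContract(false);
      }
}
```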