Change built-in codegen to remove contract from transcendetals #610
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This PR is intended for testing only before upstream PR gets opened.
Problem Summary
PyTorch's
test_warp_softmax_64bit_indexingbegan failing after latest mainline promotion. The test failure manifested as a numerical precision error wherelog(1.1422761679)computed with 54% higher error than expected (9.042e-09 vs 5.859e-09), causing gradient computations to exceed tolerance thresholds. This precision degradation was reproducible across all AMD GPU architectures (gfx1100, gfx1200, gfx90a, gfx950).I tracked down the problem to the upstream commit 4703f8b (March 6, 2025) titled "clang/HIP: Use generic builtins for f32 exp and log (llvm#129638)". This commit changed HIP math headers to call
__builtin_logf()directly instead of__ocml_log_f32():This change exposed a bug with how Clang handles the
contractfast-math flag on log intrinsics with AMDGCN target.Key Findings
1. Contract flag propagation: When
-ffp-contract=fastis enabled (default for HIP), Clang's CodeGen adds thecontractflag to allCallInstinstructions within the scope ofCGFPOptionsRAII, including calls to LLVM intrinsics likellvm.log.f32.2. Behavior change from OCML to builtin path:
__ocml_log_f32): The preprocessed IR showed the call to the OCML library function had the contract flag, but the OCML implementation internally dropped the contract flag when calling thellvm.log.f32intrinsic.__builtin_logf): The call goes directly tollvm.log.f32intrinsic with the contract flag preserved, causing the backend to apply FMA contraction during polynomial expansion.3. Why contract breaks log: Our AMDGCM target back end implements the natural logarithm by taking the result of the hardware log, then multiply that by
ln(2), and apply some rounding error correction to that multiplication. This results in something like:With the presence of the
contractflag, the back-end fuses the add (r + Z) with the multiply thinking that it is legal, thus eliminating the intermediate rounding. The error compensation term, which was calculated based on the rounded product, is now being added to the full-precision result from the FMA, leading to incorrect error correction and degraded accuracy. The corresponding contracted operations become the following:Solution and Proposed Fix
Based on our implementation of log), it should be illegal to add the contract flag to the intrinsic call because it uses error-correcting summation.
contracton a callinst indicates that it is legal to propagate the flag to the internals of the called function, but in this case that is not true since as described above the error-correcting summation we use doesn't allow for contraction.My proposed fix involves adding logic to
CGBuiltin.cppto explicitly disable thecontractflag on theCallInstfor the llvm.log intrinsic when the target is AMDGCN/HIP.This ensures transcendental intrinsics never have the contract flag, preventing incorrect FMA formation during polynomial expansion regardless of
-ffp-contractsettings.