[XPU] Fix precision for paddle.Tensor.bmm#78951
Open
YqGe585 wants to merge 1 commit into
… for float32 accumulation

GPU uses CUBLAS_COMPUTE_32F (full fp32) for float32 bmm, while XPU defaults to FC_TF32 (tfloat32 with only 10 mantissa bits), causing precision discrepancies that scale with matrix dimensions. Override FC_TF32 to FC_FLOAT for float32 bmm in forward, backward, and batched FC paths to match GPU precision.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Your PR was submitted successfully. Thank you for contributing to this open-source project!
PR Category
Operator Mechanism
PR Types
Bug fixes
Description
On XRE5 hardware, the XPU bmm kernel defaults to FC_TF32 (tfloat32, with only 10 mantissa bits) for float32 inputs, while GPU uses CUBLAS_COMPUTE_32F (full fp32, with 23 mantissa bits). The resulting precision discrepancies scale with matrix dimensions: the original failing case had max_abs_diff=0.000183254 and max_rel_diff=0.394162.
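To illustrate why a 10-bit mantissa causes drift on long reductions, here is a minimal pure-Python sketch. It assumes TF32 can be approximated by zeroing the low 13 mantissa bits of each float32 input (real hardware may round instead of truncate), and the helper names `to_fp32`, `to_tf32`, `dot`, and `abs_diff` are hypothetical, not part of Paddle or the XPU runtime.

```python
import random
import struct


def to_fp32(x: float) -> float:
    """Round a Python float (a double) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_tf32(x: float) -> float:
    """Approximate TF32 (assumption): zero the low 13 of float32's 23 mantissa bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits &= 0xFFFFE000  # keep sign, 8 exponent bits, and the top 10 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def dot(a, b, cast):
    """Dot product whose inputs pass through `cast`; the running sum is rounded to float32."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = to_fp32(acc + cast(x) * cast(y))
    return acc


def abs_diff(k: int, seed: int = 0) -> float:
    """Absolute difference between the fp32-input and TF32-input dot products of length k."""
    rng = random.Random(seed)
    a = [to_fp32(rng.uniform(-1.0, 1.0)) for _ in range(k)]
    b = [to_fp32(rng.uniform(-1.0, 1.0)) for _ in range(k)]
    return abs(dot(a, b, to_fp32) - dot(a, b, to_tf32))


if __name__ == "__main__":
    # Longer reductions accumulate more per-element input error.
    for k in (16, 256, 4096):
        print(k, abs_diff(k))
```

Each truncated input carries a relative error of up to roughly 2^-10, so the error of a length-k reduction can grow with k, which is consistent with the discrepancy scaling with matrix dimensions.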
Fix
Override FCCalcType from FC_TF32 to FC_FLOAT when the input type is float32, in the bmm forward kernel, backward kernel, and batched FC utility path. This ensures full fp32 accumulation matching GPU behavior.
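The selection rule described above can be modeled in a few lines. This is a Python sketch of the logic only, not the actual C++ change: the `FCCalcType` enum here is a stand-in that models just the two members named in this PR, and `select_calc_type` is a hypothetical helper.

```python
from enum import Enum, auto


class FCCalcType(Enum):
    """Stand-in for the C++ enum named in the PR; only two members are modeled."""
    FC_TF32 = auto()
    FC_FLOAT = auto()


def select_calc_type(input_dtype: str, default: FCCalcType) -> FCCalcType:
    """Promote TF32 to full fp32 for float32 inputs; leave every other case unchanged."""
    if input_dtype == "float32" and default is FCCalcType.FC_TF32:
        return FCCalcType.FC_FLOAT
    return default


if __name__ == "__main__":
    print(select_calc_type("float32", FCCalcType.FC_TF32))
    print(select_calc_type("float16", FCCalcType.FC_TF32))
```

The same rule would apply in each of the three paths (forward, backward, and batched FC), so that no float32 bmm call falls back to the TF32 default.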
Modified files
- paddle/phi/kernels/xpu/bmm_kernel.cc: Forward kernel override
- paddle/phi/kernels/xpu/bmm_grad_kernel.cc: Backward kernel override
- paddle/phi/kernels/xpu/bmm_xpu_utils.h: Batched FC utility override
Verification
All 19 test cases from all_config.txt now pass, with max_abs_diff in the range 1.04e-07 to 2.98e-07 (well within the atol=1e-4 tolerance).
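For reference, the metrics quoted in this PR can be computed as in the sketch below. The function names are hypothetical and the relative-difference definition (absolute difference divided by the magnitude of the reference value, with a small floor to avoid division by zero) is an assumption; the actual test harness may define it differently.

```python
def max_abs_diff(actual, expected):
    """Largest element-wise absolute difference between two equal-length sequences."""
    return max(abs(a - e) for a, e in zip(actual, expected))


def max_rel_diff(actual, expected, eps=1e-12):
    """Largest element-wise relative difference, measured against `expected` (assumed definition)."""
    return max(abs(a - e) / max(abs(e), eps) for a, e in zip(actual, expected))


def within_atol(actual, expected, atol=1e-4):
    """True when every element differs from its reference by at most `atol`."""
    return max_abs_diff(actual, expected) <= atol


if __name__ == "__main__":
    reference = [1.0, 2.0, 3.0]
    after_fix = [1.0 + 2.98e-07, 2.0, 3.0]      # magnitude reported after the fix
    before_fix = [1.0 + 0.000183254, 2.0, 3.0]  # magnitude of the original failure
    print(within_atol(after_fix, reference))
    print(within_atol(before_fix, reference))
```

Under this check, a difference of 0.000183254 exceeds atol=1e-4, while differences on the order of 1e-07 pass with a wide margin.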
Does this PR introduce a precision change?
Yes. XPU float32 bmm now uses full fp32 accumulation instead of TF32 accumulation, which aligns its precision with GPU.