
Commit af4bf43

Replace vmlaq_f32 with vfmaq_f32 (fused multiply-add) (microsoft#25669)
### Description

The [vfmaq_f32](https://developer.arm.com/architectures/instruction-sets/intrinsics/vfmaq_f32) intrinsic compiles to the [FMLA](https://developer.arm.com/documentation/ddi0596/2021-03/SIMD-FP-Instructions/FMLA--vector---Floating-point-fused-Multiply-Add-to-accumulator--vector--?lang=en) instruction, which is more performant than the separate `fmul`+`fadd` instructions that [vmlaq_f32](https://developer.arm.com/architectures/instruction-sets/intrinsics/vmlaq_f32) compiles to on the latest GCC versions: https://godbolt.org/z/aYc9as5Wh

Note that this is not a breaking change: `vmlaq_f32` already compiles to FMLA instructions on the latest clang compilers, which are the default for macOS ORT builds.

### Motivation and Context

With this change, the NEON version of `MlasMultiplyAddFloat32x4` achieves parity with the x86 version, which uses `_mm_fmadd_ps`. It also achieves up to ~15% speedups compared to the current `vmlaq_f32` implementation when tested on top of microsoft#25580.
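For illustration, here is a minimal standalone sketch (not ORT code; the function names are hypothetical) showing the two intrinsics side by side. Both compute `v3 + v1 * v2` element-wise, but `vfmaq_f32` is guaranteed to lower to a single fused FMLA, while `vmlaq_f32` may be lowered to separate `fmul`+`fadd` by some GCC versions, which is the codegen difference the godbolt link above compares. Assumes an AArch64 target with `<arm_neon.h>` available.

```c
#include <arm_neon.h>

/* Non-fused variant: may compile to fmul + fadd on recent GCC. */
float32x4_t multiply_add_mla(float32x4_t v1, float32x4_t v2, float32x4_t v3)
{
    /* vmlaq_f32(a, b, c) computes a + b * c lane-wise. */
    return vmlaq_f32(v3, v1, v2);
}

/* Fused variant: compiles to a single FMLA instruction. */
float32x4_t multiply_add_fma(float32x4_t v1, float32x4_t v2, float32x4_t v3)
{
    /* vfmaq_f32(a, b, c) computes a + b * c lane-wise with a fused multiply-add. */
    return vfmaq_f32(v3, v1, v2);
}
```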
1 parent 5746ba9 commit af4bf43

File tree

1 file changed (+1, −1)

onnxruntime/core/mlas/lib/mlasi.h

Lines changed: 1 addition & 1 deletion
@@ -2280,7 +2280,7 @@ MLAS_FLOAT32X4
 MlasMultiplyAddFloat32x4(MLAS_FLOAT32X4 Vector1, MLAS_FLOAT32X4 Vector2, MLAS_FLOAT32X4 Vector3)
 {
 #if defined(MLAS_NEON_INTRINSICS)
-    return vmlaq_f32(Vector3, Vector1, Vector2);
+    return vfmaq_f32(Vector3, Vector1, Vector2);
 #elif defined(MLAS_FMA3_INTRINSICS)
     return _mm_fmadd_ps(Vector1, Vector2, Vector3);
 #elif defined(MLAS_SSE2_INTRINSICS)
