
Commit af4bf43

Replace vmlaq_f32 with vfmaq_f32 (fused multiply-add) (microsoft#25669)
### Description

The [vfmaq_f32](https://developer.arm.com/architectures/instruction-sets/intrinsics/vfmaq_f32) intrinsic compiles to the [FMLA](https://developer.arm.com/documentation/ddi0596/2021-03/SIMD-FP-Instructions/FMLA--vector---Floating-point-fused-Multiply-Add-to-accumulator--vector--?lang=en) instruction, which is more performant than the separate `fmul`+`fadd` instructions that [vmlaq_f32](https://developer.arm.com/architectures/instruction-sets/intrinsics/vmlaq_f32) compiles to on the latest GCC versions: https://godbolt.org/z/aYc9as5Wh

Note that this is not a breaking change: `vmlaq_f32` already compiles to FMLA instructions on the latest clang compilers, which are the default for macOS ORT builds.

### Motivation and Context

With this change, the NEON version of `MlasMultiplyAddFloat32x4` achieves parity with the x86 version, which uses `_mm_fmadd_ps`. It also achieves up to ~15% speedups compared to the current `vmlaq_f32` implementation when tested on top of microsoft#25580.
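For illustration, here is a minimal standalone sketch (not ORT code; the function names are hypothetical) showing the two intrinsics side by side. Both compute `v3 + v1 * v2` element-wise, but `vfmaq_f32` is guaranteed to lower to a single fused FMLA, while `vmlaq_f32` may be lowered to separate `fmul`+`fadd` by some GCC versions, which is the codegen difference the godbolt link above compares. Assumes an AArch64 target with `<arm_neon.h>` available.

```c
#include <arm_neon.h>

/* Non-fused variant: may compile to fmul + fadd on recent GCC. */
float32x4_t multiply_add_mla(float32x4_t v1, float32x4_t v2, float32x4_t v3)
{
    /* vmlaq_f32(a, b, c) computes a + b * c lane-wise. */
    return vmlaq_f32(v3, v1, v2);
}

/* Fused variant: compiles to a single FMLA instruction. */
float32x4_t multiply_add_fma(float32x4_t v1, float32x4_t v2, float32x4_t v3)
{
    /* vfmaq_f32(a, b, c) computes a + b * c lane-wise with a fused multiply-add. */
    return vfmaq_f32(v3, v1, v2);
}
```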
1 parent 5746ba9 commit af4bf43

File tree

1 file changed (+1, −1)

onnxruntime/core/mlas/lib/mlasi.h

Lines changed: 1 addition & 1 deletion
@@ -2280,7 +2280,7 @@ MLAS_FLOAT32X4
 MlasMultiplyAddFloat32x4(MLAS_FLOAT32X4 Vector1, MLAS_FLOAT32X4 Vector2, MLAS_FLOAT32X4 Vector3)
 {
 #if defined(MLAS_NEON_INTRINSICS)
-    return vmlaq_f32(Vector3, Vector1, Vector2);
+    return vfmaq_f32(Vector3, Vector1, Vector2);
 #elif defined(MLAS_FMA3_INTRINSICS)
     return _mm_fmadd_ps(Vector1, Vector2, Vector3);
 #elif defined(MLAS_SSE2_INTRINSICS)
