Add support for the Brain 16-bit floating-point vector multiplication intrinsics

amilendra · amilendra · commit cc8f7353426d · 2025-09-23T17:37:41.000+01:00
Adds intrinsic support for the Brain 16-bit floating-point vector multiplication
instructions introduced by the FEAT_SVE_BFSCALE feature in the 2024 dpISA.

BFSCALE: BFloat16 adjust exponent by vector (predicated)
BFSCALE (multiple and single vector): Multi-vector BFloat16 adjust exponent by vector
BFSCALE (multiple vectors): Multi-vector BFloat16 adjust exponent
BFMUL (multiple and single vector): Multi-vector BFloat16 floating-point multiply by vector
BFMUL (multiple vectors): Multi-vector BFloat16 floating-point multiply
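
As a rough illustration of what "adjust exponent" means for the BFSCALE forms listed above, the sketch below models the operation on a single BFloat16 element in portable C, alongside a compile-time check of `__ARM_FEATURE_SVE_BFSCALE`. The helper names (`have_sve_bfscale`, `bf16_to_float`, `float_to_bf16_trunc`, `bf16_scale_ref`) are hypothetical and not part of ACLE. The model assumes the operation multiplies each element by 2 raised to the signed integer operand, narrows by truncation, and ignores the rounding, subnormal, and special-value behaviour of the real instruction.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: 1 if the compiler advertises the BFSCALE intrinsics. */
static int have_sve_bfscale(void) {
#if defined(__ARM_FEATURE_SVE_BFSCALE) && __ARM_FEATURE_SVE_BFSCALE
  return 1;
#else
  return 0;
#endif
}

/* A bf16 value is the upper 16 bits of an IEEE binary32 value. */
static float bf16_to_float(uint16_t bits) {
  uint32_t wide = (uint32_t)bits << 16;
  float f;
  memcpy(&f, &wide, sizeof f);
  return f;
}

/* Assumption: narrow by truncation; the hardware rounding mode is not modelled. */
static uint16_t float_to_bf16_trunc(float f) {
  uint32_t wide;
  memcpy(&wide, &f, sizeof wide);
  return (uint16_t)(wide >> 16);
}

/* Scalar reference model of "adjust exponent": returns x * 2^n as bf16 bits. */
static uint16_t bf16_scale_ref(uint16_t x, int16_t n) {
  float f = bf16_to_float(x);
  int steps = n;
  for (; steps > 0; --steps) f *= 2.0f;  /* positive n: scale up   */
  for (; steps < 0; ++steps) f *= 0.5f;  /* negative n: scale down */
  return float_to_bf16_trunc(f);
}
```

For example, `bf16_scale_ref(0x3F80, 3)` takes 1.0 and yields 0x4100, the bf16 encoding of 8.0. Code that wants the real intrinsics would typically guard the call site on `have_sve_bfscale()` or the macro itself and fall back to a scalar path like this one otherwise.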
diff --git a/main/acle.md b/main/acle.md
@@ -468,6 +468,7 @@ Armv8.4-A [[ARMARMv84]](#ARMARMv84). Support is added for the Dot Product intrin
 * Added support for FEAT_FPRCVT intrinsics and `__ARM_FEATURE_FPRCVT`.
 * Added support for modal 8-bit floating point matrix multiply-add widening intrinsics.
 * Added support for 16-bit floating point matrix multiply-add widening intrinsics.
+* Added support for Brain 16-bit floating-point vector multiplication intrinsics.
 
 ### References
 
@@ -2003,6 +2004,7 @@ of SME has an associated preprocessor macro, given in the table below:
 | FEAT_SME    | __ARM_FEATURE_SME          |
 | FEAT_SME2   | __ARM_FEATURE_SME2         |
 | FEAT_SME2p1 | __ARM_FEATURE_SME2p1       |
+| FEAT_SME2p2 | __ARM_FEATURE_SME2p2       |
 
 Each macro is defined if there is hardware support for the associated
 architecture feature and if all of the [ACLE
@@ -2125,6 +2127,16 @@ are available.  Specifically, if this macro is defined to `1`, then:
 for the FEAT_SME_B16B16 instructions and if their associated intrinsics
 are available.
 
+#### Brain 16-bit floating-point vector multiplication support
+
+`__ARM_FEATURE_SVE_BFSCALE` is defined to `1` if there is hardware
+support for the SVE BF16 vector multiplication extensions
+(FEAT_SVE_BFSCALE) and if the associated ACLE intrinsics are available.
+
+See [Half-precision brain
+floating-point](#half-precision-brain-floating-point) for details
+of half-precision brain floating-point types.
+
 ### Cryptographic extensions
 
 #### “Crypto” extension
@@ -2666,6 +2678,7 @@ be found in [[BA]](#BA).
 | [`__ARM_FEATURE_SVE`](#scalable-vector-extension-sve)                                                                                                   | Scalable Vector Extension (FEAT_SVE)                                                               | 1           |
 | [`__ARM_FEATURE_SVE_B16B16`](#non-widening-brain-16-bit-floating-point-support)                                                                         | Non-widening brain 16-bit floating-point intrinsics (FEAT_SVE_B16B16)                              | 1           |
 | [`__ARM_FEATURE_SVE_BF16`](#brain-16-bit-floating-point-support)                                                                                        | SVE support for the 16-bit brain floating-point extension (FEAT_BF16)                              | 1           |
+| [`__ARM_FEATURE_SVE_BFSCALE`](#brain-16-bit-floating-point-vector-multiplication-support)                                                               | SVE support for the 16-bit brain floating-point vector multiplication extension (FEAT_SVE_BFSCALE) | 1           |
 | [`__ARM_FEATURE_SVE_BITS`](#scalable-vector-extension-sve)                                                                                              | The number of bits in an SVE vector, when known in advance                                         | 256         |
 | [`__ARM_FEATURE_SVE_MATMUL_FP32`](#multiplication-of-32-bit-floating-point-matrices)                                                                    | 32-bit floating-point matrix multiply extension (FEAT_F32MM)                                       | 1           |
 | [`__ARM_FEATURE_SVE_MATMUL_FP64`](#multiplication-of-64-bit-floating-point-matrices)                                                                    | 64-bit floating-point matrix multiply extension (FEAT_F64MM)                                       | 1           |
@@ -11699,7 +11712,7 @@ Multi-vector floating-point fused multiply-add/subtract
     __arm_streaming __arm_inout("za");
   ```
 
-#### BFMLA. BFMLS, FMLA, FMLS (indexed)
+#### BFMLA, BFMLS, FMLA, FMLS (indexed)
 
 Multi-vector floating-point fused multiply-add/subtract
 
@@ -12792,6 +12805,29 @@ element types.
   svint8x4_t svuzpq[_s8_x4](svint8x4_t zn) __arm_streaming;
   ```
 
+#### BFMUL
+
+Multi-vector BFloat16 floating-point multiply
+
+``` c
+  // Only if __ARM_FEATURE_SVE_BFSCALE != 0
+  svbfloat16x2_t svmul[_bf16_x2](svbfloat16x2_t zd, svbfloat16x2_t zm) __arm_streaming;
+  svbfloat16x2_t svmul[_single_bf16_x2](svbfloat16x2_t zd, svbfloat16_t zm) __arm_streaming;
+  svbfloat16x4_t svmul[_bf16_x4](svbfloat16x4_t zd, svbfloat16x4_t zm) __arm_streaming;
+  svbfloat16x4_t svmul[_single_bf16_x4](svbfloat16x4_t zd, svbfloat16_t zm) __arm_streaming;
+  ```
+
+#### BFSCALE
+Multi-vector BFloat16 adjust exponent
+
+``` c
+  // Only if __ARM_FEATURE_SVE_BFSCALE != 0
+  svbfloat16x2_t svscale[_bf16_x2](svbfloat16x2_t zdn, svint16x2_t zm);
+  svbfloat16x2_t svscale[_single_bf16_x2](svbfloat16x2_t zn, svint16_t zm);
+  svbfloat16x4_t svscale[_bf16_x4](svbfloat16x4_t zdn, svint16x4_t zm);
+  svbfloat16x4_t svscale[_single_bf16_x4](svbfloat16x4_t zn, svint16_t zm);
+  ```
+
 ### SME2.1 instruction intrinsics
 
 The specification for SME2.1 is in
@@ -12937,6 +12973,33 @@ Zero ZA vector groups
     __arm_streaming __arm_inout("za");
 ```
 
+### SME2.2 instruction intrinsics
+
+The intrinsics in this section are defined by the header file
+[`<arm_sme.h>`](#arm_sme.h) when `__ARM_FEATURE_SME2p2` is defined.
+
+#### FMUL
+
+Multi-vector floating-point multiply
+
+``` c
+  // Variants are also available for:
+  // [_single_f32_x2]
+  // [_single_f64_x2]
+  // [_single_f16_x4]
+  // [_single_f32_x4]
+  // [_single_f64_x4]
+  svfloat16x2_t svmul[_single_f16_x2](svfloat16x2_t zd, svfloat16_t zm) __arm_streaming;
+
+  // Variants are also available for:
+  // [_f32_x2]
+  // [_f64_x2]
+  // [_f16_x4]
+  // [_f32_x4]
+  // [_f64_x4]
+  svfloat16x2_t svmul[_f16_x2](svfloat16x2_t zd, svfloat16x2_t zm) __arm_streaming;
+```
+
 ### Streaming-compatible versions of standard routines
 
 ACLE provides the following streaming-compatible functions,