Commit 96f57d8

Merge branch 'main' into vmla_vmls_floats
2 parents 83c590f + 777536d commit 96f57d8

9 files changed: +82 -11 lines changed

cmse/cmse.md

Lines changed: 1 addition & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -5,7 +5,7 @@ date-of-issue: 21 June 2024
55
set-quote-highlight: true
66
# LaTeX specific variables
77
copyright-text: Copyright 2019, 2021-2024 Arm Limited and/or its affiliates <open-source-office@arm.com>.
8-
draftversion: true
8+
draftversion: false
99
# Jekyll specific variables
1010
header_counter: true
1111
toc: true

main/acle.md

Lines changed: 59 additions & 5 deletions
Original file line number | Diff line number | Diff line change
@@ -1,10 +1,10 @@
11
---
22
title: Arm C Language Extensions
3-
version: 2025Q2
4-
date-of-issue: 06 June 2025
3+
version: 2025Q3
4+
date-of-issue: 14 November 2025
55
# LaTeX specific variables
66
copyright-text: "Copyright: see section \\texorpdfstring{\\nameref{copyright}}{Copyright}."
7-
draftversion: true
7+
draftversion: false
88
# Jekyll specific variables
99
header_counter: true
1010
toc: true
@@ -181,6 +181,7 @@ unless a different support level is specified in the text.
181181
| 2024Q3 | 30 September 2024 | Arm | See [Changes between ACLE Q2 2024 and ACLE Q3 2024](#changes-between-acle-q2-2024-and-acle-q3-2024) |
182182
| 2024Q4 | 21 February 2025 | Arm | See [Changes between ACLE Q3 2024 and ACLE Q4 2024](#changes-between-acle-q3-2024-and-acle-q4-2024) |
183183
| 2025Q2 | 06 June 2025 | Arm | See [Changes between ACLE Q4 2024 and ACLE Q2 2025](#changes-between-acle-q4-2024-and-acle-q2-2025) |
184+
| 2025Q3 | 14 November 2025 | Arm | See [Changes between ACLE Q2 2025 and ACLE Q3 2025](#changes-between-acle-q2-2025-and-acle-q3-2025) |
184185

185186
#### Changes between ACLE Q2 2017 and ACLE Q2 2018
186187

@@ -446,7 +447,6 @@ Armv8.4-A [[ARMARMv84]](#ARMARMv84). Support is added for the Dot Product intrin
446447
* Added `svdot[_n_f16_mf8]_fpm` and `svdot[_n_f32_mf8]_fpm`.
447448
* Added Guarded Control Stack (GCS) at
448449
[**Beta**](#current-status-and-anticipated-changes) quality level.
449-
* Add Function Multi Versioning feature priority syntax.
450450

451451
#### Changes between ACLE Q4 2024 and ACLE Q2 2025
452452

@@ -461,10 +461,16 @@ Armv8.4-A [[ARMARMv84]](#ARMARMv84). Support is added for the Dot Product intrin
461461
* Upgrade to [**Beta**](#current-status-and-anticipated-changes)
462462
support for modal 8-bit floating point intrinsics.
463463

464-
#### Changes for next release
464+
#### Changes between ACLE Q2 2025 and ACLE Q3 2025
465465

466466
* Added feature test macro for FEAT_SSVE_FEXPA.
467467
* Added feature test macro for FEAT_CSSC.
468+
* Added Function Multi Versioning feature priority syntax.
469+
470+
#### Changes for next release
471+
472+
* Added support for modal 8-bit floating point matrix multiply-accumulate widening intrinsics.
473+
* Added support for 16-bit floating point matrix multiply-accumulate widening intrinsics.
468474
* Improve documentation for VMLA/VMLS intrinsics for floats.
469475

470476
### References
@@ -2347,6 +2353,26 @@ is hardware support for the SVE forms of these instructions and if the
23472353
associated ACLE intrinsics are available. This implies that
23482354
`__ARM_FEATURE_MATMUL_INT8` and `__ARM_FEATURE_SVE` are both nonzero.
23492355

2356+
##### Multiplication of modal 8-bit floating-point matrices
2357+
2358+
This section is in
2359+
[**Alpha** state](#current-status-and-anticipated-changes) and might change or be
2360+
extended in the future.
2361+
2362+
`__ARM_FEATURE_F8F16MM` is defined to `1` if there is hardware support
2363+
for the NEON and SVE modal 8-bit floating-point matrix multiply-accumulate to half-precision (FEAT_F8F16MM)
2364+
instructions and if the associated ACLE intrinsics are available.
2365+
2366+
`__ARM_FEATURE_F8F32MM` is defined to `1` if there is hardware support
2367+
for the NEON and SVE modal 8-bit floating-point matrix multiply-accumulate to single-precision (FEAT_F8F32MM)
2368+
instructions and if the associated ACLE intrinsics are available.
2369+
2370+
##### Multiplication of 16-bit floating-point matrices
2371+
2372+
`__ARM_FEATURE_SVE_F16F32MM` is defined to `1` if there is hardware support
2373+
for the SVE 16-bit floating-point to 32-bit floating-point matrix multiply and add
2374+
(FEAT_SVE_F16F32MM) instructions and if the associated ACLE intrinsics are available.
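
A minimal sketch of how user code might query the three feature test macros described above at compile time. The helper name `acle_mm_feature_mask` and the `MM_HAS_*` bit assignments are hypothetical and chosen only for illustration; the macro names themselves are taken from the text. On a toolchain or target that defines none of the macros, the function simply returns 0.

```c
#include <stdint.h>

/* Hypothetical bit assignments, for illustration only. */
#define MM_HAS_F8F16MM      (1u << 0)
#define MM_HAS_F8F32MM      (1u << 1)
#define MM_HAS_SVE_F16F32MM (1u << 2)

/* Returns a bitmask of the matrix multiply-accumulate extensions that
   the compiler reports as available through ACLE feature test macros. */
static inline uint32_t acle_mm_feature_mask(void) {
    uint32_t mask = 0;
#if defined(__ARM_FEATURE_F8F16MM) && __ARM_FEATURE_F8F16MM
    mask |= MM_HAS_F8F16MM;      /* FP8 -> FP16 (FEAT_F8F16MM) */
#endif
#if defined(__ARM_FEATURE_F8F32MM) && __ARM_FEATURE_F8F32MM
    mask |= MM_HAS_F8F32MM;      /* FP8 -> FP32 (FEAT_F8F32MM) */
#endif
#if defined(__ARM_FEATURE_SVE_F16F32MM) && __ARM_FEATURE_SVE_F16F32MM
    mask |= MM_HAS_SVE_F16F32MM; /* FP16 -> FP32 (FEAT_SVE_F16F32MM) */
#endif
    return mask;
}
```

Because the macros are evaluated by the preprocessor, the mask reflects what the translation unit was compiled for, not what the running CPU supports; runtime detection would need a separate mechanism.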
2375+
23502376
##### Multiplication of 32-bit floating-point matrices
23512377

23522378
`__ARM_FEATURE_SVE_MATMUL_FP32` is defined to `1` if there is hardware support
@@ -2638,6 +2664,9 @@ be found in [[BA]](#BA).
26382664
| [`__ARM_FEATURE_SVE_BITS`](#scalable-vector-extension-sve) | The number of bits in an SVE vector, when known in advance | 256 |
26392665
| [`__ARM_FEATURE_SVE_MATMUL_FP32`](#multiplication-of-32-bit-floating-point-matrices) | 32-bit floating-point matrix multiply extension (FEAT_F32MM) | 1 |
26402666
| [`__ARM_FEATURE_SVE_MATMUL_FP64`](#multiplication-of-64-bit-floating-point-matrices) | 64-bit floating-point matrix multiply extension (FEAT_F64MM) | 1 |
2667+
| [`__ARM_FEATURE_F8F16MM`](#multiplication-of-modal-8-bit-floating-point-matrices) | Modal 8-bit floating-point matrix multiply-accumulate to half-precision extension (FEAT_F8F16MM) | 1 |
2668+
| [`__ARM_FEATURE_F8F32MM`](#multiplication-of-modal-8-bit-floating-point-matrices) | Modal 8-bit floating-point matrix multiply-accumulate to single-precision extension (FEAT_F8F32MM) | 1 |
2669+
| [`__ARM_FEATURE_SVE_F16F32MM`](#multiplication-of-16-bit-floating-point-matrices) | 16-bit floating-point matrix multiply-accumulate to single-precision extension (FEAT_SVE_F16F32MM) | 1 |
26412670
| [`__ARM_FEATURE_SVE_MATMUL_INT8`](#multiplication-of-8-bit-integer-matrices) | SVE support for the integer matrix multiply extension (FEAT_I8MM) | 1 |
26422671
| [`__ARM_FEATURE_SVE_PREDICATE_OPERATORS`](#scalable-vector-extension-sve) | Level of support for C and C++ operators on SVE vector types | 1 |
26432672
| [`__ARM_FEATURE_SVE_VECTOR_OPERATORS`](#scalable-vector-extension-sve) | Level of support for C and C++ operators on SVE predicate types | 1 |
@@ -9375,6 +9404,31 @@ BFloat16 floating-point multiply vectors.
93759404
uint64_t imm_idx);
93769405
```
93779406

9407+
### SVE2 floating-point matrix multiply-accumulate instructions.
9408+
9409+
#### FMMLA (widening, FP8 to FP16)
9410+
9411+
Modal 8-bit floating-point matrix multiply-accumulate to half-precision.
9412+
```c
9413+
// Only if (__ARM_FEATURE_SVE2 && __ARM_FEATURE_F8F16MM)
9414+
svfloat16_t svmmla[_f16_mf8]_fpm(svfloat16_t zda, svmfloat8_t zn, svmfloat8_t zm, fpm_t fpm);
9415+
```
9416+
9417+
#### FMMLA (widening, FP8 to FP32)
9418+
9419+
Modal 8-bit floating-point matrix multiply-accumulate to single-precision.
9420+
```c
9421+
// Only if (__ARM_FEATURE_SVE2 && __ARM_FEATURE_F8F32MM)
9422+
svfloat32_t svmmla[_f32_mf8]_fpm(svfloat32_t zda, svmfloat8_t zn, svmfloat8_t zm, fpm_t fpm);
9423+
```
9424+
#### FMMLA (widening, FP16 to FP32)
9425+
9426+
16-bit floating-point matrix multiply-accumulate to single-precision.
9427+
```c
9428+
// Only if __ARM_FEATURE_SVE_F16F32MM
9429+
svfloat32_t svmmla[_f32_f16](svfloat32_t zda, svfloat16_t zn, svfloat16_t zm);
9430+
```
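
The prototypes above could be wrapped along the following lines. This is a sketch under stated assumptions: the wrapper names (`mmla_f32_f16`, `mmla_f16_mf8`), the `HAVE_*` macros, and `mmla_wrappers_available` are hypothetical; the full intrinsic names `svmmla_f32_f16` and `svmmla_f16_mf8_fpm` are derived from the bracketed forms in the prototypes; and the guarded bodies assume a compiler whose `<arm_sve.h>` already implements these intrinsics. On any other toolchain the guarded code is compiled out.

```c
/* FP16 -> FP32 widening matrix multiply-accumulate. */
#if defined(__ARM_FEATURE_SVE_F16F32MM) && __ARM_FEATURE_SVE_F16F32MM
#include <arm_sve.h>
#define HAVE_MMLA_F32_F16 1
static inline svfloat32_t mmla_f32_f16(svfloat32_t acc, svfloat16_t a,
                                       svfloat16_t b) {
    return svmmla_f32_f16(acc, a, b);
}
#else
#define HAVE_MMLA_F32_F16 0
#endif

/* FP8 -> FP16 widening matrix multiply-accumulate; fpm carries the
   FP8 mode, as in the other modal 8-bit floating-point intrinsics. */
#if defined(__ARM_FEATURE_SVE2) && defined(__ARM_FEATURE_F8F16MM)
#include <arm_sve.h>
#define HAVE_MMLA_F16_MF8 1
static inline svfloat16_t mmla_f16_mf8(svfloat16_t acc, svmfloat8_t a,
                                       svmfloat8_t b, fpm_t fpm) {
    return svmmla_f16_mf8_fpm(acc, a, b, fpm);
}
#else
#define HAVE_MMLA_F16_MF8 0
#endif

/* Number of wrappers compiled in for this target (0, 1 or 2). */
static inline int mmla_wrappers_available(void) {
    return HAVE_MMLA_F32_F16 + HAVE_MMLA_F16_MF8;
}
```

Callers would test `HAVE_MMLA_F32_F16` or `HAVE_MMLA_F16_MF8` before referencing the corresponding wrapper, and fall back to another code path when the value is 0.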
9431+
93789432
### SVE2.1 instruction intrinsics
93799433

93809434
The specification for SVE2.1 is in

morello/morello.md

Lines changed: 1 addition & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -4,7 +4,7 @@ version: 02alpha
44
date-of-issue: 11 January 2022
55
# LaTeX specific variables
66
copyright-text: Copyright 2020-2022 Arm Limited and/or its affiliates <open-source-office@arm.com>.
7-
draftversion: true
7+
draftversion: false
88
# Jekyll specific variables
99
header_counter: true
1010
toc: true

mve_intrinsics/mve.md

Lines changed: 1 addition & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -5,7 +5,7 @@ date-of-issue: 11 January 2022
55
# LaTeX specific variables
66
landscape: true
77
copyright-text: Copyright 2019-2022 Arm Limited and/or its affiliates <open-source-office@arm.com>.
8-
draftversion: true
8+
draftversion: false
99
# Jekyll specific variables
1010
header_counter: true
1111
toc: true

mve_intrinsics/mve.template.md

Lines changed: 1 addition & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -5,7 +5,7 @@ date-of-issue: 11 January 2022
55
# LaTeX specific variables
66
landscape: true
77
copyright-text: Copyright 2019-2022 Arm Limited and/or its affiliates <open-source-office@arm.com>.
8-
draftversion: true
8+
draftversion: false
99
# Jekyll specific variables
1010
header_counter: true
1111
toc: true

neon_intrinsics/advsimd.md

Lines changed: 12 additions & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -5,7 +5,7 @@ date-of-issue: 06 June 2025
55
# LaTeX specific variables
66
landscape: true
77
copyright-text: "Copyright: see section \\texorpdfstring{\\nameref{copyright}}{Copyright}."
8-
draftversion: true
8+
draftversion: false
99
# Jekyll specific variables
1010
header_counter: true
1111
toc: true
@@ -6175,3 +6175,14 @@ The intrinsics in this section are guarded by the macro ``__ARM_NEON``.
61756175
| <code>float32x4_t <a href="https://developer.arm.com/architectures/instruction-sets/intrinsics/vmlalltbq_laneq_f32_mf8_fpm" target="_blank">vmlalltbq_laneq_f32_mf8_fpm</a>(<br>&nbsp;&nbsp;&nbsp;&nbsp; float32x4_t vd,<br>&nbsp;&nbsp;&nbsp;&nbsp; mfloat8x16_t vn,<br>&nbsp;&nbsp;&nbsp;&nbsp; mfloat8x16_t vm,<br>&nbsp;&nbsp;&nbsp;&nbsp; const int lane,<br>&nbsp;&nbsp;&nbsp;&nbsp; fpm_t fpm)</code> | `vd -> Vd.4S`<br>`vm -> Vn.16B`<br>`vm -> Vm.B`<br>`0 <= lane <= 15` | `FMLALLBB Vd.4S, Vn.16B, Vm.B[lane]` | `Vd.4S -> result` | `A64` |
61766176
| <code>float32x4_t <a href="https://developer.arm.com/architectures/instruction-sets/intrinsics/vmlallttq_lane_f32_mf8_fpm" target="_blank">vmlallttq_lane_f32_mf8_fpm</a>(<br>&nbsp;&nbsp;&nbsp;&nbsp; float32x4_t vd,<br>&nbsp;&nbsp;&nbsp;&nbsp; mfloat8x16_t vn,<br>&nbsp;&nbsp;&nbsp;&nbsp; mfloat8x8_t vm,<br>&nbsp;&nbsp;&nbsp;&nbsp; const int lane,<br>&nbsp;&nbsp;&nbsp;&nbsp; fpm_t fpm)</code> | `vd -> Vd.4S`<br>`vm -> Vn.16B`<br>`vm -> Vm.B`<br>`0 <= lane <= 7` | `FMLALLBB Vd.4S, Vn.16B, Vm.B[lane]` | `Vd.4S -> result` | `A64` |
61776177
| <code>float32x4_t <a href="https://developer.arm.com/architectures/instruction-sets/intrinsics/vmlallttq_laneq_f32_mf8_fpm" target="_blank">vmlallttq_laneq_f32_mf8_fpm</a>(<br>&nbsp;&nbsp;&nbsp;&nbsp; float32x4_t vd,<br>&nbsp;&nbsp;&nbsp;&nbsp; mfloat8x16_t vn,<br>&nbsp;&nbsp;&nbsp;&nbsp; mfloat8x16_t vm,<br>&nbsp;&nbsp;&nbsp;&nbsp; const int lane,<br>&nbsp;&nbsp;&nbsp;&nbsp; fpm_t fpm)</code> | `vd -> Vd.4S`<br>`vm -> Vn.16B`<br>`vm -> Vm.B`<br>`0 <= lane <= 15` | `FMLALLBB Vd.4S, Vn.16B, Vm.B[lane]` | `Vd.4S -> result` | `A64` |
6178+
6179+
## Matrix multiplication intrinsics from Armv9.6-A
6180+
6181+
### Vector arithmetic
6182+
6183+
#### Matrix multiply
6184+
6185+
| Intrinsic | Argument preparation | AArch64 Instruction | Result | Supported architectures |
6186+
|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------|-------------------------------|-------------------|---------------------------|
6187+
| <code>float16x8_t <a href="https://developer.arm.com/architectures/instruction-sets/intrinsics/vmmlaq_f16_mf8" target="_blank">vmmlaq_f16_mf8</a>(<br>&nbsp;&nbsp;&nbsp;&nbsp; float16x8_t r,<br>&nbsp;&nbsp;&nbsp;&nbsp; mfloat8x16_t a,<br>&nbsp;&nbsp;&nbsp;&nbsp; mfloat8x16_t b,<br>&nbsp;&nbsp;&nbsp;&nbsp; fpm_t fpm)</code> | `r -> Vd.4H`<br>`a -> Vn.16B`<br>`b -> Vm.16B` | `FMMLA Vd.4H, Vn.16B, Vm.16B` | `Vd.4H -> result` | `A64` |
6188+
| <code>float32x4_t <a href="https://developer.arm.com/architectures/instruction-sets/intrinsics/vmmlaq_f32_mf8" target="_blank">vmmlaq_f32_mf8</a>(<br>&nbsp;&nbsp;&nbsp;&nbsp; float32x4_t r,<br>&nbsp;&nbsp;&nbsp;&nbsp; mfloat8x16_t a,<br>&nbsp;&nbsp;&nbsp;&nbsp; mfloat8x16_t b,<br>&nbsp;&nbsp;&nbsp;&nbsp; fpm_t fpm)</code> | `r -> Vd.4S`<br>`a -> Vn.16B`<br>`b -> Vm.16B` | `FMMLA Vd.4S, Vn.16B, Vm.16B` | `Vd.4S -> result` | `A64` |

neon_intrinsics/advsimd.template.md

Lines changed: 1 addition & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -5,7 +5,7 @@ date-of-issue: 06 June 2025
55
# LaTeX specific variables
66
landscape: true
77
copyright-text: "Copyright: see section \\texorpdfstring{{\\nameref{{copyright}}}}{{Copyright}}."
8-
draftversion: true
8+
draftversion: false
99
# Jekyll specific variables
1010
header_counter: true
1111
toc: true

tools/intrinsic_db/advsimd.csv

Lines changed: 4 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -4810,3 +4810,7 @@ float32x4_t vmlalltbq_lane_f32_mf8_fpm(float32x4_t vd, mfloat8x16_t vn, mfloat8x
48104810
float32x4_t vmlalltbq_laneq_f32_mf8_fpm(float32x4_t vd, mfloat8x16_t vn, mfloat8x16_t vm, __builtin_constant_p(lane), fpm_t fpm) vd -> Vd.4S;vm -> Vn.16B; vm -> Vm.B; 0 <= lane <= 15 FMLALLBB Vd.4S, Vn.16B, Vm.B[lane] Vd.4S -> result A64
48114811
float32x4_t vmlallttq_lane_f32_mf8_fpm(float32x4_t vd, mfloat8x16_t vn, mfloat8x8_t vm, __builtin_constant_p(lane), fpm_t fpm) vd -> Vd.4S;vm -> Vn.16B; vm -> Vm.B; 0 <= lane <= 7 FMLALLBB Vd.4S, Vn.16B, Vm.B[lane] Vd.4S -> result A64
48124812
float32x4_t vmlallttq_laneq_f32_mf8_fpm(float32x4_t vd, mfloat8x16_t vn, mfloat8x16_t vm, __builtin_constant_p(lane), fpm_t fpm) vd -> Vd.4S;vm -> Vn.16B; vm -> Vm.B; 0 <= lane <= 15 FMLALLBB Vd.4S, Vn.16B, Vm.B[lane] Vd.4S -> result A64
4813+
4814+
<SECTION> Matrix multiplication intrinsics from Armv9.6-A
4815+
float16x8_t vmmlaq_f16_mf8(float16x8_t r, mfloat8x16_t a, mfloat8x16_t b, fpm_t fpm) r -> Vd.4H;a -> Vn.16B;b -> Vm.16B FMMLA Vd.4H, Vn.16B, Vm.16B Vd.4H -> result A64
4816+
float32x4_t vmmlaq_f32_mf8(float32x4_t r, mfloat8x16_t a, mfloat8x16_t b, fpm_t fpm) r -> Vd.4S;a -> Vn.16B;b -> Vm.16B FMMLA Vd.4S, Vn.16B, Vm.16B Vd.4S -> result A64

tools/intrinsic_db/advsimd_classification.csv

Lines changed: 2 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -4697,3 +4697,5 @@ vmlalltbq_lane_f32_mf8_fpm Vector arithmetic|Multiply|Multiply-accumulate and wi
46974697
vmlalltbq_laneq_f32_mf8_fpm Vector arithmetic|Multiply|Multiply-accumulate and widen
46984698
vmlallttq_lane_f32_mf8_fpm Vector arithmetic|Multiply|Multiply-accumulate and widen
46994699
vmlallttq_laneq_f32_mf8_fpm Vector arithmetic|Multiply|Multiply-accumulate and widen
4700+
vmmlaq_f16_mf8 Vector arithmetic|Matrix multiply
4701+
vmmlaq_f32_mf8 Vector arithmetic|Matrix multiply
