I noticed that the Arm Neoverse scheduling models have a way too large decode bandwidth: https://godbolt.org/z/54hPqeqdK
I tested how many independent adds llvm-mca thinks the cores can decode per cycle and compared that with the actual decode width:
- CPU: llvm-mca vs Arm-Software-Optimization-Guide "4.1 Dispatch constraints"
- Neoverse-V1: 15 vs 8
- Neoverse-V2: 16 vs 8
- Neoverse-V3: 16 vs 10
- Neoverse-N1: 8 vs 4
- Neoverse-N2: 10 vs 5
- Neoverse-N3: 10 vs 5
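
To make the size of the mismatch easier to see, here is a small Python sketch that computes the ratio for each core. The numbers are copied from the list above; the names `widths` and `overestimate` are made up for illustration and are not part of LLVM or llvm-mca.

```python
# (adds per cycle according to llvm-mca, width from the optimization guide),
# copied from the list above.
widths = {
    "Neoverse-V1": (15, 8),
    "Neoverse-V2": (16, 8),
    "Neoverse-V3": (16, 10),
    "Neoverse-N1": (8, 4),
    "Neoverse-N2": (10, 5),
    "Neoverse-N3": (10, 5),
}


def overestimate(mca, documented):
    """Factor by which the modeled width exceeds the documented one."""
    return mca / documented


for cpu, (mca, documented) in widths.items():
    print(f"{cpu}: {mca} vs {documented} -> {overestimate(mca, documented):.2f}x")
```

For most of the cores the modeled width comes out at about twice the documented one.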
The decode/issue width currently used in the scheduling models seems to correspond to the number of uops that can be processed per cycle, not the number of MOPs that are decoded or read from the opcache.
Still, unless the cores are capable of fusing independent additions, they shouldn't be able to decode the instructions this quickly.
Here is a code snippet where the additional decode capabilities cause an impossible result: https://godbolt.org/z/GbGrKWxsq
Here the V1 executes a loop of 13 instructions at 13 IPC, even though it should only be able to decode up to 8 instructions per cycle.
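
As a rough sanity check, the bound implied by the decode width can be sketched in Python. This assumes the most optimistic case: every instruction is a single MOP, nothing is fused, and the front end can decode across loop iteration boundaries. The function names are hypothetical and only serve the illustration.

```python
def min_cycles_per_iteration(instructions, decode_width):
    """Lower bound on cycles per loop iteration if at most
    `decode_width` instructions can be decoded per cycle."""
    return instructions / decode_width


def exceeds_decode_bound(reported_ipc, decode_width):
    """True if a reported IPC is higher than the front end could sustain."""
    return reported_ipc > decode_width


# Neoverse-V1 example from above: 13 instructions, decode width 8.
print(min_cycles_per_iteration(13, 8))  # 1.625
print(exceeds_decode_bound(13, 8))      # True
```

Under these assumptions one iteration needs at least 1.625 cycles, so sustained IPC cannot exceed 8, and the 13 IPC result is not reachable.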