NEON kernels for NCHWc Convolution and Pooling (#25580)
### Description
This PR implements optimized Arm NEON kernels for NCHWc (channels-last
with channel blocking) convolution and pooling operations in MLAS,
significantly improving performance on Arm64 platforms.
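
For readers unfamiliar with the layout, the sketch below illustrates how an NCHWc tensor is indexed: channels are split into fixed-size blocks so that the innermost dimension maps onto a full NEON vector register. This is only an illustration, not the MLAS implementation; the block size of 4 floats is an assumed example (the actual block size is chosen by MLAS for the target platform), and `NchwcIndex` is a hypothetical helper.

```cpp
// Hypothetical illustration of NCHWc ("channels-last with channel blocking")
// indexing. Not the MLAS implementation.
#include <cstddef>

constexpr size_t BlockSize = 4;  // assumed example: 4 floats per 128-bit NEON register

// Index of element (n, c, h, w) in an NCHWc tensor laid out as
// [N][C / BlockSize][H][W][BlockSize]. The trailing BlockSize dimension is
// contiguous in memory, so a full channel block can be loaded with one
// vector load inside the convolution/pooling inner loop.
inline size_t NchwcIndex(size_t n, size_t c, size_t h, size_t w,
                         size_t C, size_t H, size_t W) {
    const size_t cb = c / BlockSize;        // which channel block
    const size_t ci = c % BlockSize;        // offset within the block
    const size_t CBlocks = C / BlockSize;   // assumes C is a multiple of BlockSize
    return (((n * CBlocks + cb) * H + h) * W + w) * BlockSize + ci;
}
```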
### Motivation and Context
Fixes #24790
The new NCHWc kernels improve performance by roughly 5-6x, depending on the
thread count, model, and other configuration details.
For example, here is the gain observed during MobileNet inference; note the
"Number of inferences per second" metric (93 inf/s ->
498 inf/s).
<details>
<summary>System configuration</summary>
```
Architecture: aarch64
CPU op-mode(s): 64-bit
Byte Order: Little Endian
CPU(s): 64
On-line CPU(s) list: 0-63
Vendor ID: ARM
Model name: Neoverse-V2
Model: 1
Thread(s) per core: 1
Core(s) per socket: 64
Socket(s): 1
Stepping: r0p1
BogoMIPS: 2000.00
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm ssbs sb paca pacg dcpodp
sve2 sveaes svepmull svebitperm svesha3 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
Caches (sum of all):
L1d: 4 MiB (64 instances)
L1i: 4 MiB (64 instances)
L2: 128 MiB (64 instances)
L3: 36 MiB (1 instance)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-63
Vulnerabilities:
Gather data sampling: Not affected
Itlb multihit: Not affected
L1tf: Not affected
Mds: Not affected
Meltdown: Not affected
Mmio stale data: Not affected
Reg file data sampling: Not affected
Retbleed: Not affected
Spec rstack overflow: Not affected
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Spectre v1: Mitigation; __user pointer sanitization
Spectre v2: Not affected
Srbds: Not affected
Tsx async abort: Not affected
```
</details>
<details>
<summary>Perf with current upstream kernels</summary>
```
./build/Linux/Release/onnxruntime_perf_test -x 32 -I -m times -r 1000 ~/scripts/mobilenet.onnx
Setting intra_op_num_threads to 32
Session creation time cost: 0.0238608 s
First inference time cost: 11 ms
Total inference time cost: 10.7458 s
Total inference requests: 1000
Average inference time cost: 10.7458 ms
Total inference run time: 10.7465 s
Number of inferences per second: 93.0534
Avg CPU usage: 50 %
Peak working set size: 70410240 bytes
Avg CPU usage:50
Peak working set size:70410240
Runs:1000
Min Latency: 0.0106707 s
Max Latency: 0.0113617 s
P50 Latency: 0.0107453 s
P90 Latency: 0.0107695 s
P95 Latency: 0.0107785 s
P99 Latency: 0.0107965 s
P999 Latency: 0.0113617 s
```
</details>
<details>
<summary>Perf with NCHWc kernels</summary>
```
./build/Linux/Release/onnxruntime_perf_test -x 32 -I -m times -r 1000 ~/scripts/mobilenet.onnx
Setting intra_op_num_threads to 32
Session creation time cost: 0.0358121 s
First inference time cost: 2 ms
Total inference time cost: 2.00561 s
Total inference requests: 1000
Average inference time cost: 2.00561 ms
Total inference run time: 2.00607 s
Number of inferences per second: 498.488
Avg CPU usage: 50 %
Peak working set size: 92467200 bytes
Avg CPU usage:50
Peak working set size:92467200
Runs:1000
Min Latency: 0.00198387 s
Max Latency: 0.00204784 s
P50 Latency: 0.00200537 s
P90 Latency: 0.0020155 s
P95 Latency: 0.00201822 s
P99 Latency: 0.0020251 s
P999 Latency: 0.00204784 s
```
</details>
Happy to run further performance tests as required.