
Conversation

@Anndrey24 (Contributor) commented Dec 9, 2025

Description

This commit introduces an f32 ASIMD softmax JIT implementation using the exp eltwise injector added in #4376. It also improves performance of the existing sve_* implementations, primarily by increasing the unrolling factor unroll_regs_ and by skipping the multiplication with the default dequantization/requantization factors src_scales / dst_scales. For jit:asimd and jit:sve_128, the exp function is also effectively inlined by setting preserve_vmm = false; jit:sve_256 did not benefit from that change.

Because the previous softmax implementation relied heavily on predicated instructions, jit_softmax_base_t was refactored to contain only the logic common to the SVE and non-SVE implementations. Two derived structs were added to handle the ISA-specific work: jit_softmax_sve_t and jit_softmax_asimd_t.

In addition, the JIT eltwise injector was changed to support storing/loading preserved vectors on non-SVE targets.

Performance improvements (f32)

c6g

| Shape | Threads | jit:asimd (ms) | acl (ms) | Speedup |
|---|---|---|---|---|
| 1539x387 | 1 | 1.21689 | 1.5615 | 1.28 |
| 1539x387 | 4 | 0.306583 | 0.394197 | 1.29 |
| 1539x387 | 16 | 0.078976 | 0.103172 | 1.31 |
| 1539x387 | 64 | 0.02816 | 0.04522 | 1.61 |
| 1024x4096 | 1 | 8.12552 | 10.4083 | 1.28 |
| 1024x4096 | 4 | 2.05314 | 2.62449 | 1.28 |
| 1024x4096 | 16 | 0.526042 | 0.678114 | 1.29 |
| 1024x4096 | 64 | 0.13881 | 0.182793 | 1.32 |
| 4096x4096 | 1 | 32.5925 | 41.3373 | 1.27 |
| 4096x4096 | 4 | 8.19186 | 10.3651 | 1.27 |
| 4096x4096 | 16 | 2.0928 | 2.66398 | 1.27 |
| 4096x4096 | 64 | 0.734764 | 0.937735 | 1.28 |

c7g

| Shape | Threads | jit:sve_256 after (ms) | jit:sve_256 before (ms) | Speedup |
|---|---|---|---|---|
| 1539x387 | 1 | 0.58083 | 0.748606 | 1.29 |
| 1539x387 | 4 | 0.152118 | 0.189787 | 1.25 |
| 1539x387 | 16 | 0.039454 | 0.049228 | 1.25 |
| 1539x387 | 64 | 0.018498 | 0.021218 | 1.15 |
| 1024x4096 | 1 | 3.7424 | 5.12185 | 1.37 |
| 1024x4096 | 4 | 0.939754 | 1.30929 | 1.39 |
| 1024x4096 | 16 | 0.233352 | 0.329952 | 1.41 |
| 1024x4096 | 64 | 0.081774 | 0.108232 | 1.32 |
| 4096x4096 | 1 | 15.4883 | 20.4236 | 1.32 |
| 4096x4096 | 4 | 3.95416 | 5.56156 | 1.41 |
| 4096x4096 | 16 | 1.0644 | 1.43602 | 1.35 |
| 4096x4096 | 64 | 0.364805 | 0.432615 | 1.19 |

c8g

| Shape | Threads | jit:sve_128 after (ms) | jit:sve_128 before (ms) | Speedup |
|---|---|---|---|---|
| 1539x387 | 1 | 0.628566 | 0.863312 | 1.37 |
| 1539x387 | 4 | 0.158893 | 0.217245 | 1.37 |
| 1539x387 | 16 | 0.041718 | 0.055711 | 1.34 |
| 1539x387 | 64 | 0.018232 | 0.023519 | 1.29 |
| 1024x4096 | 1 | 4.39546 | 6.07039 | 1.38 |
| 1024x4096 | 4 | 1.09701 | 1.50691 | 1.37 |
| 1024x4096 | 16 | 0.280239 | 0.367653 | 1.31 |
| 1024x4096 | 64 | 0.089297 | 0.130347 | 1.46 |
| 4096x4096 | 1 | 18.8842 | 24.4886 | 1.30 |
| 4096x4096 | 4 | 4.79858 | 6.25783 | 1.30 |
| 4096x4096 | 16 | 1.24212 | 1.58102 | 1.27 |
| 4096x4096 | 64 | 0.323831 | 0.478697 | 1.48 |

@Anndrey24 Anndrey24 requested review from a team as code owners December 9, 2025 13:31
@github-actions github-actions bot added platform:cpu-aarch64 Codeowner: @oneapi-src/onednn-cpu-aarch64 component:common labels Dec 9, 2025
@michalowski-arm (Contributor) commented:
As this change is pretty big, do you think it would be possible to neatly split it into two commits: one for the sve optimizations and one for the asimd impl? The sve changes should maybe even be a separate PR.

This commit moves all SVE-specific code into a new construct `jit_softmax_sve_t`.
This commit introduces an f32 ASIMD `softmax` JIT implementation.
@Anndrey24 (Contributor, Author) commented Dec 9, 2025

I've now split up the changes into 3 separate commits:

  1. cpu: aarch64: refactor jit_uni_softmax:
    • keeps ISA-agnostic logic in jit_softmax_base_t, while all SVE-specific code is moved into a new construct jit_softmax_sve_t.
    • most of the changes are due to indentation differences.
  2. cpu: aarch64: add ASIMD softmax JIT implementation:
    • adds ASIMD kernel, but also improves SVE kernels as the unroll factor change is done directly in the common base struct jit_softmax_base_t.
  3. cpu: aarch64: improve SVE JIT softmax performance:
    • adapts some of the ASIMD performance gains for the SVE kernels too, in particular jit:sve_128, as it shares the same vector length as ASIMD.

I will move the final commit to a follow-up PR if you think that's best. I've only left all 3 together for now as the c7g/c8g speedups would be less noticeable at a glance with the SVE improvements in commits 2 and 3 split up, compared to being altogether in a single table like this.

This commit adapts some of the ASIMD softmax changes for the SVE kernels.

In particular, the `jit:sve_128` logic more closely resembles `jit:asimd` (e.g. its `exp` eltwise injector is inlined and uses `compute_vector_range()` instead of `compute_vector()`).
@jondea (Contributor) left a comment:

This looks really good, thank you! A couple of comments; I will have another look over and will probably have more.

```cpp
const auto &t4 = VReg4S(vmm_aux3.getIdx());
const auto &t_tmp = VReg4S(vmm_tmp.getIdx());

const float special_bound_input = 126.5f * logf(2.0f);
```
Can this variable have a more specific name?

```cpp
h->fmov(h->X_TMP_0, DReg(t_tmp.getIdx()));
h->cbnz(h->X_TMP_0, L_special);
if (need_special_case) {
// Check if any lane needs special-case handling
```
Does this include NaN and Inf? I think a couple of comments explaining the flow for them would be useful

