
Conversation

@Anndrey24 (Contributor) commented Dec 9, 2025

Description

This commit introduces an f32 ASIMD softmax JIT implementation using the exp eltwise injector added in #4376. It also improves performance of the existing sve_* implementations, primarily by increasing the unrolling factor unroll_regs_ and by skipping the multiplication with the default dequantization/requantization factors src_scales / dst_scales. For jit:asimd and jit:sve_128, the exp function is also effectively inlined by setting preserve_vmm = false; jit:sve_256 did not benefit from that change.

Because the previous softmax implementation relied heavily on predicated instructions, jit_softmax_base_t was refactored to contain only the logic common to the SVE and non-SVE implementations. Two derived structs were added to handle the ISA-specific work: jit_softmax_sve_t and jit_softmax_asimd_t.

In addition, the JIT eltwise injector was changed to support storing/loading preserved vectors on non-SVE targets.

Performance improvements (f32)

c6g

| Shape | Threads | jit:asimd (ms) | acl (ms) | Speedup |
|---|---|---|---|---|
| 1539x387 | 1 | 1.21689 | 1.5615 | 1.28 |
| 1539x387 | 4 | 0.306583 | 0.394197 | 1.29 |
| 1539x387 | 16 | 0.078976 | 0.103172 | 1.31 |
| 1539x387 | 64 | 0.02816 | 0.04522 | 1.61 |
| 1024x4096 | 1 | 8.12552 | 10.4083 | 1.28 |
| 1024x4096 | 4 | 2.05314 | 2.62449 | 1.28 |
| 1024x4096 | 16 | 0.526042 | 0.678114 | 1.29 |
| 1024x4096 | 64 | 0.13881 | 0.182793 | 1.32 |
| 4096x4096 | 1 | 32.5925 | 41.3373 | 1.27 |
| 4096x4096 | 4 | 8.19186 | 10.3651 | 1.27 |
| 4096x4096 | 16 | 2.0928 | 2.66398 | 1.27 |
| 4096x4096 | 64 | 0.734764 | 0.937735 | 1.28 |

c7g

| Shape | Threads | jit:sve_256 after (ms) | jit:sve_256 before (ms) | Speedup |
|---|---|---|---|---|
| 1539x387 | 1 | 0.58083 | 0.748606 | 1.29 |
| 1539x387 | 4 | 0.152118 | 0.189787 | 1.25 |
| 1539x387 | 16 | 0.039454 | 0.049228 | 1.25 |
| 1539x387 | 64 | 0.018498 | 0.021218 | 1.15 |
| 1024x4096 | 1 | 3.7424 | 5.12185 | 1.37 |
| 1024x4096 | 4 | 0.939754 | 1.30929 | 1.39 |
| 1024x4096 | 16 | 0.233352 | 0.329952 | 1.41 |
| 1024x4096 | 64 | 0.081774 | 0.108232 | 1.32 |
| 4096x4096 | 1 | 15.4883 | 20.4236 | 1.32 |
| 4096x4096 | 4 | 3.95416 | 5.56156 | 1.41 |
| 4096x4096 | 16 | 1.0644 | 1.43602 | 1.35 |
| 4096x4096 | 64 | 0.364805 | 0.432615 | 1.19 |

c8g

| Shape | Threads | jit:sve_128 after (ms) | jit:sve_128 before (ms) | Speedup |
|---|---|---|---|---|
| 1539x387 | 1 | 0.628566 | 0.863312 | 1.37 |
| 1539x387 | 4 | 0.158893 | 0.217245 | 1.37 |
| 1539x387 | 16 | 0.041718 | 0.055711 | 1.34 |
| 1539x387 | 64 | 0.018232 | 0.023519 | 1.29 |
| 1024x4096 | 1 | 4.39546 | 6.07039 | 1.38 |
| 1024x4096 | 4 | 1.09701 | 1.50691 | 1.37 |
| 1024x4096 | 16 | 0.280239 | 0.367653 | 1.31 |
| 1024x4096 | 64 | 0.089297 | 0.130347 | 1.46 |
| 4096x4096 | 1 | 18.8842 | 24.4886 | 1.30 |
| 4096x4096 | 4 | 4.79858 | 6.25783 | 1.30 |
| 4096x4096 | 16 | 1.24212 | 1.58102 | 1.27 |
| 4096x4096 | 64 | 0.323831 | 0.478697 | 1.48 |

@Anndrey24 Anndrey24 requested review from a team as code owners December 9, 2025 13:31
@github-actions github-actions bot added platform:cpu-aarch64 Codeowner: @oneapi-src/onednn-cpu-aarch64 component:common labels Dec 9, 2025
@michalowski-arm (Contributor) commented:
As this change is pretty big, do you think it would be possible to neatly split it into two commits: one for the sve optimizations and one for the asimd impl? The sve changes should maybe even be a separate PR.

This commit moves all SVE-specific code into a new construct `jit_softmax_sve_t`.
This commit introduces an f32 ASIMD `softmax` JIT implementation.
@Anndrey24 (Contributor, Author) commented Dec 9, 2025

I've now split up the changes into 3 separate commits:

  1. cpu: aarch64: refactor jit_uni_softmax:
    • keeps ISA-agnostic logic in jit_softmax_base_t, while all SVE-specific code is moved into a new construct jit_softmax_sve_t.
    • most of the changes are due to indentation differences.
  2. cpu: aarch64: add ASIMD softmax JIT implementation:
    • adds ASIMD kernel, but also improves SVE kernels as the unroll factor change is done directly in the common base struct jit_softmax_base_t.
  3. cpu: aarch64: improve SVE JIT softmax performance:
    • adapts some of the ASIMD performance gains for the SVE kernels too, in particular jit:sve_128, as it shares the same vector length as ASIMD.

I will move the final commit to a follow-up PR if you think that's best. I've only left all 3 together for now as the c7g/c8g speedups would be less noticeable at a glance with the SVE improvements in commits 2 and 3 split up, compared to being altogether in a single table like this.

This commit adapts some of the ASIMD softmax changes for the SVE kernels.

In particular, the `jit:sve_128` logic more closely resembles `jit:asimd` (e.g. its `exp` eltwise injector is inlined and uses `compute_vector_range()` instead of `compute_vector()`).
@jondea (Contributor) left a comment:

This looks really good, thank you! A couple of comments; I will have another look over and will probably have more.

```cpp
const auto &t4 = VReg4S(vmm_aux3.getIdx());
const auto &t_tmp = VReg4S(vmm_tmp.getIdx());

const float special_bound_input = 126.5f * logf(2.0f);
```
Can this variable have a more specific name?

```cpp
h->fmov(h->X_TMP_0, DReg(t_tmp.getIdx()));
h->cbnz(h->X_TMP_0, L_special);
if (need_special_case) {
// Check if any lane needs special-case handling
```
Does this include NaN and Inf? I think a couple of comments explaining the flow for them would be useful

