
Add SIMDE support for portable SIMD on non-x86 platforms #124

Open
wszqkzqk wants to merge 1 commit into mborgerding:master from wszqkzqk:simde-alias

@wszqkzqk wszqkzqk commented Feb 4, 2026

This PR adds optional support for SIMDE (SIMD Everywhere). This allows the existing SSE-optimized codebase to compile and run with hardware acceleration on non-x86 architectures (such as ARM64/NEON, RISC-V, POWER, and WASM) without rewriting the core algorithm.

Features

  • The default behavior is unchanged. Native xmmintrin.h usage is preserved unless KISSFFT_USE_SIMDE is explicitly enabled.
  • The x86-specific _mm_malloc is replaced with portable aligned allocation strategies (C11 aligned_alloc, POSIX posix_memalign, or MSVC _aligned_malloc) only when building in SIMDE mode.
  • Follows SIMDE best practices by avoiding SIMDE_ENABLE_NATIVE_ALIASES. Instead, it uses simde_ prefixed intrinsics to prevent conflicts with system headers.
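The aligned-allocation fallback described in the list above could look roughly like the following. This is a minimal sketch, not the code in this PR: the names example_aligned_malloc, example_aligned_free, and EXAMPLE_SIMD_ALIGN are hypothetical, and the 16-byte alignment is an assumption based on 128-bit SSE-style vectors.

```c
/* Ask for posix_memalign when a strict -std= mode would otherwise hide it. */
#if !defined(_MSC_VER) && !defined(_POSIX_C_SOURCE)
#define _POSIX_C_SOURCE 200112L
#endif

#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#if defined(_MSC_VER)
#include <malloc.h>   /* _aligned_malloc, _aligned_free */
#endif

/* Hypothetical constant: 16 bytes matches a 128-bit vector. */
#define EXAMPLE_SIMD_ALIGN 16

/* Hypothetical helper: returns nbytes of storage aligned to
 * EXAMPLE_SIMD_ALIGN, or NULL on failure. */
static void *example_aligned_malloc(size_t nbytes)
{
#if defined(_MSC_VER)
    return _aligned_malloc(nbytes, EXAMPLE_SIMD_ALIGN);
#elif defined(__STDC_VERSION__) && __STDC_VERSION__ >= 201112L
    /* C11 aligned_alloc: round the size up to a multiple of the
     * alignment, which older revisions of the standard require. */
    size_t rounded = (nbytes + EXAMPLE_SIMD_ALIGN - 1)
                     / EXAMPLE_SIMD_ALIGN * EXAMPLE_SIMD_ALIGN;
    return aligned_alloc(EXAMPLE_SIMD_ALIGN, rounded);
#else
    /* POSIX fallback: posix_memalign reports failure via its return value. */
    void *p = NULL;
    if (posix_memalign(&p, EXAMPLE_SIMD_ALIGN, nbytes) != 0)
        return NULL;
    return p;
#endif
}

/* Hypothetical helper: releases memory from example_aligned_malloc. */
static void example_aligned_free(void *p)
{
#if defined(_MSC_VER)
    _aligned_free(p);   /* MSVC aligned blocks must not go through free() */
#else
    free(p);            /* aligned_alloc and posix_memalign pair with free() */
#endif
}
```

The one point worth noting in this sketch is that the free routine has to match the allocator: memory from _aligned_malloc must be released with _aligned_free, while the C11 and POSIX paths use plain free.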

Build

CMake:

cmake -DKISSFFT_DATATYPE=simd -DKISSFFT_USE_SIMDE=ON ..

Makefile:

make KISSFFT_DATATYPE=simd KISSFFT_USE_SIMDE=1

Testing

I've verified that standard x86 builds (KISSFFT_DATATYPE=simd) compile with -msse and behave identically to master, and that SIMDE builds compile successfully and pass test/tsimd.

I've also verified it on LoongArch64 (Arch Linux for Loong64, Loongson 3C5000L).

Benchmark Results (nfft=1800)

  • Scalar (float): 1.820s
  • SIMD (SIMDe): 0.510s

~3.57x faster on Loongson 3C5000L.

With ctest:

    Start 1: bm_kiss
1/9 Test #1: bm_kiss ..........................   Passed    0.06 sec
    Start 2: bm_fftw
2/9 Test #2: bm_fftw ..........................   Passed    0.00 sec
    Start 3: st
3/9 Test #3: st ...............................   Passed    0.01 sec
    Start 4: tkfc
4/9 Test #4: tkfc .............................   Passed    0.00 sec
    Start 5: ffr
5/9 Test #5: ffr ..............................   Passed    0.00 sec
    Start 6: tr
6/9 Test #6: tr ...............................   Passed    0.14 sec
    Start 7: testcpp
7/9 Test #7: testcpp ..........................   Passed  301.38 sec
    Start 8: tsimd
8/9 Test #8: tsimd ............................   Passed    0.02 sec
    Start 9: testkiss.py
9/9 Test #9: testkiss.py ......................   Passed    0.21 sec

100% tests passed, 0 tests failed out of 9

Total Test time (real) = 301.84 sec

With Makefile:

since SIMD implementation does 4 ffts at a time, numffts is being reduced to 2500
KISS    nfft=1800,      numffts=2500
COMMAND         MAJFLT MINFLT   RSS   DRS PAGEIN    SZ  TRS    VSZ
bm-kiss-simd         0    159  2240  3672      0   230    7   3680
        cputime=0.510
Datatype not available in FFTW
testkiss.py does not yet test simd
g++ -o testcpp-simd -O3 -ffast-math -fomit-frame-pointer  -I.. -W -Wall -march=native -mtune=native testcpp.cc -L.. -lkissfft-simd -lm
LD_LIBRARY_PATH=":.." ./testcpp-simd
type:f nfft:32 RMSE:9.35194e-08  MSPS:89.9729
type:d nfft:32 RMSE:1.26246e-16  MSPS:91.6864
type:e nfft:32 RMSE:3.8172e-33   MSPS:0.900086
type:f nfft:1024 RMSE:1.81694e-07        MSPS:60.6544
type:d nfft:1024 RMSE:3.30505e-16        MSPS:60.6483
type:e nfft:1024 RMSE:1.38033e-31        MSPS:0.386512
type:f nfft:840 RMSE:5.01374e-07         MSPS:33.8333
type:d nfft:840 RMSE:3.22384e-16         MSPS:30.0022
type:e nfft:840 RMSE:1.09615e-31         MSPS:0.201911

This commit adds optional SIMDE (SIMD Everywhere) support to enable
SIMD vectorization on non-x86 platforms such as ARM, RISC-V, LoongArch,
WebAssembly, and others.

Signed-off-by: Zhou Qiankang <wszqkzqk@qq.com>