Array ops not as optimized as it could be #4991

@rikkimax

Following https://salykova.github.io/gemm-cpu, I decided to experiment with array ops, specifically for this kernel:

void kernel4(ref float[16] o, ref const float[16] a, ref const float[16] b, ref const float[16] c) {
    o[] = b[] * a[];
    o[] += c[];
}

All of these examples need -mattr=+avx512vl.

Right now the codegen for this is abysmal: nearly 90 lines, full of vmovss and vmulss.
Compare it to explicit SIMD types:

void kernel1(ref float16 o, ref const float16 a, ref const float16 b, ref const float16 c) {
    o = (b * a) + c;
}

This compiles to:

_D7example7kernel1FKNhG16fKxNhQiKxQgKxQkZv:
.Lfunc_begin6:
        .cfi_startproc
        .loc    1 34 5 prologue_end
        vmovaps (%rdx), %ymm0
        vmovaps 32(%rdx), %ymm1
        vmovaps (%rsi), %ymm2
        vmovaps 32(%rsi), %ymm3
        vfmadd213ps     32(%rcx), %ymm1, %ymm3
        vfmadd213ps     (%rcx), %ymm0, %ymm2
        vmovaps %ymm2, (%rdi)
        vmovaps %ymm3, 32(%rdi)
        .loc    1 35 1
        vzeroupper
        retq

Quite different.

With some modifications to core.internal.array.operations we can get the codegen for kernel4 to be:

0000000000000000 <_D4main7kernel4FKG16fKxG16fKxQgKxQkZv>:
   0:   62 d1 7c 48 28 00       vmovaps (%r8),%zmm0
   6:   62 f1 7c 48 59 02       vmulps (%rdx),%zmm0,%zmm0
   c:   62 f1 7c 48 29 01       vmovaps %zmm0,(%rcx)
  12:   62 d1 7c 48 58 01       vaddps (%r9),%zmm0,%zmm0
  18:   62 f1 7c 48 29 01       vmovaps %zmm0,(%rcx)
  1e:   c5 f8 77                vzeroupper
  21:   c3                      ret

The modifications required are:

Turn on the SIMD helper code (currently gated by version DigitalMars).
Switch regsz from 16 to 64.
Replace the store and load helpers with more generic pointer casts: *(cast(vec*)p) = val; and return *cast(vec*)p;
Keep dmd's vectorizeable behavior.
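
To illustrate the pointer-cast load/store mentioned above, here is a minimal standalone sketch. It is not the actual druntime code: the names Vec4, loadVec and storeVec are hypothetical, and it uses a 4-wide float vector (rather than regsz = 64) so that it runs on any x86_64 target without AVX-512. It also assumes the pointers are suitably aligned for the vector type.

```d
import core.simd;

// Hypothetical names for illustration; the real helpers in
// core.internal.array.operations may look different.
alias Vec4 = __vector(float[4]);

/// Load four floats through a pointer cast.
/// Assumption: p is 16-byte aligned.
Vec4 loadVec(const(float)* p)
{
    return *cast(Vec4*) p;
}

/// Store four floats through a pointer cast.
/// Assumption: p is 16-byte aligned.
void storeVec(float* p, Vec4 val)
{
    *cast(Vec4*) p = val;
}

void main()
{
    align(16) float[4] a = [1, 2, 3, 4];
    align(16) float[4] b = [10, 20, 30, 40];
    align(16) float[4] o;

    // Element-wise multiply on whole vectors, then write the result back.
    storeVec(o.ptr, loadVec(a.ptr) * loadVec(b.ptr));
    assert(o == [10f, 40f, 90f, 160f]);
}
```

Because the load and store are plain dereferences of a vector pointer, they do not depend on any compiler-specific intrinsic, which is presumably why they are the "more generic" option. The trade-off is the alignment assumption noted in the comments.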

With an additional forced inline of arrayOp:

void kernel3(ref float[16] o, ref const float[16] a, ref const float[16] b, ref const float[16] c) {
    o[] = (b[] * a[]) + c[];
}

We then get:

0000000000000000 <_D4main7kernel3FKG16fKxG16fKxQgKxQkZv>:
   0:   62 d1 7c 48 28 00       vmovaps (%r8),%zmm0
   6:   62 f1 7c 48 59 02       vmulps (%rdx),%zmm0,%zmm0
   c:   62 d1 7c 48 58 01       vaddps (%r9),%zmm0,%zmm0
  12:   62 f1 7c 48 29 01       vmovaps %zmm0,(%rcx)
  18:   c5 f8 77                vzeroupper
  1b:   c3                      ret

It looks like LLVM cannot combine the multiply and add into a single FMA here (unlike kernel1 above, which gets vfmadd213ps), but this is still significantly better than the current codegen.

Thoughts?
