Array ops not as optimized as they could be

Following https://salykova.github.io/gemm-cpu, I decided to experiment with array ops, specifically for the kernel:

```d
void kernel4(ref float[16] o, ref const float[16] a, ref const float[16] b, ref const float[16] c) {
    o[] = b[] * a[];
    o[] += c[];
}
```

All of these examples need to be compiled with ``-mattr=+avx512vl``.

Right now the codegen for this is abysmal: nearly 90 lines of scalar ``vmovss`` and ``vmulss`` instructions.
Compare it to explicit SIMD types:

```d
import core.simd; // provides float16

void kernel1(ref float16 o, ref const float16 a, ref const float16 b, ref const float16 c) {
    o = (b * a) + c;
}
```

```
_D7example7kernel1FKNhG16fKxNhQiKxQgKxQkZv:
.Lfunc_begin6:
        .cfi_startproc
        .loc    1 34 5 prologue_end
        vmovaps (%rdx), %ymm0
        vmovaps 32(%rdx), %ymm1
        vmovaps (%rsi), %ymm2
        vmovaps 32(%rsi), %ymm3
        vfmadd213ps     32(%rcx), %ymm1, %ymm3
        vfmadd213ps     (%rcx), %ymm0, %ymm2
        vmovaps %ymm2, (%rdi)
        vmovaps %ymm3, 32(%rdi)
        .loc    1 35 1
        vzeroupper
        retq
```

Quite different.

With some modifications to ``core.internal.array.operations``, we can get the codegen for kernel4 down to:

```
0000000000000000 <_D4main7kernel4FKG16fKxG16fKxQgKxQkZv>:
   0:   62 d1 7c 48 28 00       vmovaps (%r8),%zmm0
   6:   62 f1 7c 48 59 02       vmulps (%rdx),%zmm0,%zmm0
   c:   62 f1 7c 48 29 01       vmovaps %zmm0,(%rcx)
  12:   62 d1 7c 48 58 01       vaddps (%r9),%zmm0,%zmm0
  18:   62 f1 7c 48 29 01       vmovaps %zmm0,(%rcx)
  1e:   c5 f8 77                vzeroupper
  21:   c3                      ret
```

The modifications required are:

- Turn on the SIMD helper code (currently gated by ``DigitalMars``).
- Switch ``regsz`` from 16 to 64.
- Swap out the store and load helpers for more generic pointer casts: ``*(cast(vec*)p) = val;`` and ``return *cast(vec*)p;``.
- Keep dmd's ``vectorizeable`` behavior.
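To make the pointer-cast change concrete, here is a minimal sketch of what such helpers could look like. The names ``load``, ``store``, ``vec`` and ``kernelSketch`` are stand-ins for illustration only; the real helpers in ``core.internal.array.operations`` are templates with different signatures. It assumes LDC on x86_64 (so that ``float16`` is available) and that every pointer is 64-byte aligned, since a plain vector dereference may be lowered to an aligned move.

```d
import core.simd;

// Hypothetical stand-in for the vector type picked when regsz = 64:
// 16 floats = 64 bytes.
alias vec = float16;

// Load through a plain pointer cast instead of a dmd-specific intrinsic.
// Assumption: p is 64-byte aligned.
vec load(const(float)* p)
{
    return *cast(const(vec)*) p;
}

// Store through a plain pointer cast.
// Assumption: p is 64-byte aligned.
void store(float* p, vec val)
{
    *cast(vec*) p = val;
}

// The same computation as the kernel, o = (b * a) + c,
// written against the two helpers above.
void kernelSketch(float* o, const(float)* a, const(float)* b, const(float)* c)
{
    store(o, load(b) * load(a) + load(c));
}
```

The point of the casts is that they only rely on the generic ``__vector`` support, so nothing in them is specific to one compiler backend. If the alignment assumption cannot be guaranteed, an unaligned load/store would be needed instead.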

With an additional forced inline of ``arrayOp`` and the kernel written as a single expression:

```d
void kernel3(ref float[16] o, ref const float[16] a, ref const float[16] b, ref const float[16] c) {
    o[] = (b[] * a[]) + c[];
}
```

We then get:

```
0000000000000000 <_D4main7kernel3FKG16fKxG16fKxQgKxQkZv>:
   0:   62 d1 7c 48 28 00       vmovaps (%r8),%zmm0
   6:   62 f1 7c 48 59 02       vmulps (%rdx),%zmm0,%zmm0
   c:   62 d1 7c 48 58 01       vaddps (%r9),%zmm0,%zmm0
  12:   62 f1 7c 48 29 01       vmovaps %zmm0,(%rcx)
  18:   c5 f8 77                vzeroupper
  1b:   c3                      ret
```

It looks like LLVM does not combine the multiply and add into a single FMA here (unlike the ``vfmadd213ps`` in kernel1), but this is significantly better than the current codegen.

Thoughts?

Array ops not as optimized as it could be #4991

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

Array ops not as optimized as it could be #4991

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions