Following: https://salykova.github.io/gemm-cpu
I decided to have a play with array ops, specifically with this kernel:
void kernel4(ref float[16] o, ref const float[16] a, ref const float[16] b, ref const float[16] c) {
o[] = b[] * a[];
o[] += c[];
}

For all of these examples you need -mattr=+avx512vl.
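For reference, an invocation along these lines should reproduce this kind of disassembly (a sketch: ldc2 is assumed, main.d is a placeholder file name, and only -mattr=+avx512vl comes from the original):

ldc2 -O3 -mattr=+avx512vl -c main.d
objdump -d main.o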
Right now the codegen for this is abysmal: nearly 90 lines full of vmovss and vmulss.
Compare it to explicit SIMD types:
void kernel1(ref float16 o, ref const float16 a, ref const float16 b, ref const float16 c) {
o = (b * a) + c;
}

_D7example7kernel1FKNhG16fKxNhQiKxQgKxQkZv:
.Lfunc_begin6:
.cfi_startproc
.loc 1 34 5 prologue_end
vmovaps (%rdx), %ymm0
vmovaps 32(%rdx), %ymm1
vmovaps (%rsi), %ymm2
vmovaps 32(%rsi), %ymm3
vfmadd213ps 32(%rcx), %ymm1, %ymm3
vfmadd213ps (%rcx), %ymm0, %ymm2
vmovaps %ymm2, (%rdi)
vmovaps %ymm3, 32(%rdi)
.loc 1 35 1
vzeroupper
retq
Quite different.
With some modifications to core.internal.array.operations we can get the codegen for kernel4 to be:
0000000000000000 <_D4main7kernel4FKG16fKxG16fKxQgKxQkZv>:
0: 62 d1 7c 48 28 00 vmovaps (%r8),%zmm0
6: 62 f1 7c 48 59 02 vmulps (%rdx),%zmm0,%zmm0
c: 62 f1 7c 48 29 01 vmovaps %zmm0,(%rcx)
12: 62 d1 7c 48 58 01 vaddps (%r9),%zmm0,%zmm0
18: 62 f1 7c 48 29 01 vmovaps %zmm0,(%rcx)
1e: c5 f8 77 vzeroupper
21: c3 ret
The modifications required are:
Turn on the SIMD helper code (currently gated behind version (DigitalMars)).
Switch regsz from 16 to 64.
Swap out store+load for the more generic pointer casts: *(cast(vec*)p) = val; and return *cast(vec*)p; (see the sketch after this list).
Keep dmd's vectorizable behavior.
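As a rough illustration of the pointer-cast point (names and signatures here are illustrative, not the exact helpers in core.internal.array.operations):

// Sketch only: illustrative shapes, not the exact druntime helpers.
// vec is a core.simd vector type (for example float16 on an AVX-512 target).

// Load via a pointer cast rather than a dedicated load intrinsic,
// so LLVM is free to fold the load into a memory operand.
vec load(vec, T)(const T* p)
{
    return *cast(const(vec)*) p;
}

// Store the same way.
void store(vec, T)(T* p, const vec val)
{
    *cast(vec*) p = val;
}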
With an additional forced inline of arrayOp:
void kernel3(ref float[16] o, ref const float[16] a, ref const float[16] b, ref const float[16] c) {
o[] = (b[] * a[]) + c[];
}

We then get:
0000000000000000 <_D4main7kernel3FKG16fKxG16fKxQgKxQkZv>:
0: 62 d1 7c 48 28 00 vmovaps (%r8),%zmm0
6: 62 f1 7c 48 59 02 vmulps (%rdx),%zmm0,%zmm0
c: 62 d1 7c 48 58 01 vaddps (%r9),%zmm0,%zmm0
12: 62 f1 7c 48 29 01 vmovaps %zmm0,(%rcx)
18: c5 f8 77 vzeroupper
1b: c3 ret
It looks like LLVM cannot combine the multiply and add into an FMA here, but this is still significantly better than the current codegen.
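For clarity, the forced inline mentioned above is just an annotation on the arrayOp entry point; a minimal sketch of the shape of that change (the real signature in core.internal.array.operations is more involved and is only stubbed here):

// Sketch only: placeholder signature, showing where the annotation goes.
pragma(inline, true)
T[] arrayOp(T, Args...)(T[] res, Args args)
{
    // ... existing vectorized element-wise implementation ...
    return res;
}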
Thoughts?