Commit d62070d

x86 asm: move the rest of SIMD from x86-assembly-cheat
1 parent dcd8662 commit d62070d

7 files changed: +154 -8 lines changed


README.adoc

Lines changed: 61 additions & 6 deletions
@@ -11927,7 +11927,7 @@ Let's start as usual with floating point addition + register file:
 Much like ADD for non-SIMD, start learning SIMD instructions by looking at the integer and floating point SIMD ADD instructions of each ISA:

 * x86
-** <<x86-addpd-instruction>>
+** <<x86-sse-packed-arithmetic-instructions,ADDPD>>
 ** <<x86-paddq-instruction>>
 * arm
 ** <<arm-vadd-instruction>>
@@ -11959,6 +11959,28 @@ as mentioned at:

 Bibliography: https://stackoverflow.com/questions/1389712/getting-started-with-intel-x86-sse-simd-instructions/56409539#56409539

+==== FMA instruction
+
+Fused multiply add:
+
+* x86: <<x86-fma>>
+
+Bibliography:
+
+* https://en.wikipedia.org/wiki/Multiply–accumulate_operation
+* https://en.wikipedia.org/wiki/FMA_instruction_set
+
+A particularly important instruction for numerical analysis, used in particular for:
+
+* Dot product
+* Matrix multiplication
+
+FMA is so important that IEEE 754 specifies it with a single rounding step, so precision is lost only once, whereas a separate multiply followed by an add rounds twice!
+
+Micro-op fun: http://stackoverflow.com/questions/28630864/how-is-fma-implemented
+
+Historically, FMA instructions have been added relatively late to instruction sets.
+
 === User vs system assembly

 By "userland assembly", we mean "the parts of the ISA which can be freely used from userland".
@@ -12858,6 +12880,8 @@ In GCC, you can choose between them with `-mfpmath=`.

 === x86 SIMD

+Parent section: <<simd-assembly>>
+
 History:

 * link:https://en.wikipedia.org/wiki/MMX_(instruction_set)[MMX]: MultiMedia eXtension (unofficial name). 1997. MM0-MM7 64-bit registers.
@@ -12869,22 +12893,51 @@ History:
 * AVX2:2013
 * AVX-512: 2016. 512-bit ZMM registers. Extension of YMM.

-==== x86 SSE2 instructions
+==== x86 SSE instructions

-<<intel-manual-1>> 5.6 "SSE2 INSTRUCTIONS"
+<<intel-manual-1>> 5.5 "SSE INSTRUCTIONS"

-===== x86 ADDPD instruction
+===== x86 SSE data transfer instructions

-link:userland/arch/x86_64/addpd.S[]: ADDPS, ADDPD
+<<intel-manual-1>> 5.5.1.1 "SSE Data Transfer Instructions"

-Good first instruction to learn SIMD: <<simd-assembly>>
+* link:userland/arch/x86_64/movaps.S[]: MOVAPS: move 4 x 32-bits between two XMM registers, or between an XMM register and 16-byte aligned memory
+* link:userland/arch/x86_64/movups.S[]: MOVUPS: like MOVAPS but also works for unaligned memory
+* link:userland/arch/x86_64/movss.S[]: MOVSS: move 32-bits between two XMM registers, or between an XMM register and memory
+
+===== x86 SSE packed arithmetic instructions
+
+<<intel-manual-1>> 5.5.1.2 "SSE Packed Arithmetic Instructions"
+
+* link:userland/arch/x86_64/addpd.S[]: ADDPS, ADDPD: good first instruction to learn SIMD: <<simd-assembly>>
+
+===== x86 SSE conversion instructions
+
+<<intel-manual-1>> 5.5.1.6 "SSE Conversion Instructions"
+
+* link:userland/arch/x86_64/cvttss2si.S[]: CVTTSS2SI: convert 32-bit floating point to 32-bit integer, store the result in a general purpose register. Round towards 0.
+
+==== x86 SSE2 instructions
+
+<<intel-manual-1>> 5.6 "SSE2 INSTRUCTIONS"

 ===== x86 PADDQ instruction

 link:userland/arch/x86_64/paddq.S[]: PADDQ, PADDL, PADDW, PADDB

 Good first instruction to learn SIMD: <<simd-assembly>>

+[[x86-fma]]
+==== x86 fused multiply add (FMA)
+
+<<intel-manual-1>> 5.15 "FUSED-MULTIPLY-ADD (FMA)"
+
+* link:userland/arch/x86_64/vfmadd132pd.S[]: VFMADD132PD: "Multiply packed double-precision floating-point values from xmm1 and xmm3/mem, add to xmm2 and put result in xmm1." The manual uses Intel syntax, where the destination is the first operand, while GNU GAS AT&T syntax reverses the operand order and puts the destination last. This matches what is observed experimentally on a <<p51>> Ubuntu 19.04 host: in `vfmadd132pd %xmm0, %xmm1, %xmm2` the result is stored in XMM2.
+
+These instructions are not part of any SSEn set: they have their own dedicated CPUID flag, which appears in `/proc/cpuinfo` as `fma`. EVEX-encoded versions of them were however later included in AVX512F.
+
+They are also unusual for x86 instructions in that they take 3 operands, as you would intuitively expect from the definition of FMA.
+
 === x86 system instructions

 <<intel-manual-1>> 5.20 "SYSTEM INSTRUCTIONS"
@@ -13630,6 +13683,8 @@ Why GNU GAS 2.29 does not have a mnemonic for it in A64 because it is very recen

 === ARM SIMD

+Parent section: <<simd-assembly>>
+
 ==== ARM VFP

 The name for the ARMv7 and AArch32 floating point and SIMD instructions / registers.

userland/arch/x86_64/addpd.S

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,6 @@
1-
/* https://github.com/cirosantilli/linux-kernel-module-cheat#x86-addpd-instruction
1+
/* https://github.com/cirosantilli/linux-kernel-module-cheat#x86-sse-packed-arithmetic-instructions
22
*
3-
* Add a few floating point numbers in one go (P == packaged)
3+
* Add a few floating point numbers in one go (P == packaged).
44
*/
55

66
#include <lkmc.h>
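
The "packed" idea behind ADDPS and ADDPD can be sketched outside of assembly. The following is a minimal Python model, not the content of `addpd.S`: the helper names `to_f32`, `addps` and `addpd` are hypothetical, and an XMM register is assumed to be represented as a plain list of lanes (four 32-bit floats or two 64-bit doubles).

```python
import struct


def to_f32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def addps(a, b):
    """Hypothetical model of ADDPS: add four single-precision lanes independently."""
    assert len(a) == len(b) == 4
    return [to_f32(to_f32(x) + to_f32(y)) for x, y in zip(a, b)]


def addpd(a, b):
    """Hypothetical model of ADDPD: add two double-precision lanes independently."""
    assert len(a) == len(b) == 2
    return [x + y for x, y in zip(a, b)]


if __name__ == "__main__":
    print(addps([1.5, 2.5, 3.5, 4.5], [4.0, 3.0, 2.0, 1.0]))
    print(addpd([1.5, 2.5], [4.0, 3.0]))
```

Each lane is computed without any interaction with its neighbours, which is what lets the hardware do all of the additions in one instruction.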

userland/arch/x86_64/cvttss2si.S

Lines changed: 20 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,20 @@
1+
/* https://github.com/cirosantilli/linux-kernel-module-cheat#x86-sse-packed-arithmetic-instructions */
2+
3+
#include <lkmc.h>
4+
5+
LKMC_PROLOGUE
6+
.data
7+
.align 16
8+
input_2_5: .float 2.5
9+
input_minus_2_5: .float -2.5
10+
.text
11+
/* Positive input. */
12+
movss input_2_5, %xmm0
13+
cvttss2si %xmm0, %eax
14+
LKMC_ASSERT_EQ_32(%eax, $2)
15+
16+
/* Negative input. */
17+
movss input_minus_2_5, %xmm0
18+
cvttss2si %xmm0, %eax
19+
LKMC_ASSERT_EQ_32(%eax, $-2)
20+
LKMC_EPILOGUE
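
The two assertions above rely on truncation rounding towards zero rather than towards negative infinity. As a rough illustration, here is a small Python sketch; the function name `cvttss2si` is reused only as a hypothetical model, and it deliberately ignores NaN and out-of-range inputs, which the real instruction handles specially.

```python
import math
import struct


def to_f32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def cvttss2si(x):
    """Hypothetical model of CVTTSS2SI for in-range inputs: truncate toward zero."""
    return math.trunc(to_f32(x))


if __name__ == "__main__":
    for value in (2.5, -2.5, 2.75):
        # Compare truncation with floor and with Python's round-half-to-even.
        print(value, cvttss2si(value), math.floor(value), round(value))
```

The negative input is where the difference shows: truncation gives -2 while floor gives -3.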

userland/arch/x86_64/movaps.S

Lines changed: 18 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,18 @@
1+
/* https://github.com/cirosantilli/linux-kernel-module-cheat#x86-sse-packed-arithmetic-instructions */
2+
3+
#include <lkmc.h>
4+
5+
LKMC_PROLOGUE
6+
.data
7+
/* Ensure that the memory is 16-byte aligned. */
8+
.align 16
9+
input: .float 1.5, 2.5, 3.5, 4.5
10+
.bss
11+
.align 16
12+
output: .skip 16
13+
.text
14+
movaps input, %xmm0
15+
movaps %xmm0, %xmm1
16+
movaps %xmm1, output
17+
LKMC_ASSERT_MEMCMP(input, output, $16)
18+
LKMC_EPILOGUE

userland/arch/x86_64/movss.S

Lines changed: 14 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,14 @@
1+
/* https://github.com/cirosantilli/linux-kernel-module-cheat#x86-sse-data-transfer-instructions */
2+
3+
#include <lkmc.h>
4+
5+
.data
6+
input: .float 1.5
7+
.bss
8+
output: .skip 4
9+
LKMC_PROLOGUE
10+
movss input, %xmm0
11+
movss %xmm0, %xmm1
12+
movss %xmm1, output
13+
LKMC_ASSERT_MEMCMP(input, output, $4)
14+
LKMC_EPILOGUE

userland/arch/x86_64/movups.S

Lines changed: 16 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,16 @@
1+
/* https://github.com/cirosantilli/linux-kernel-module-cheat#x86-sse-packed-arithmetic-instructions */
2+
3+
#include <lkmc.h>
4+
5+
LKMC_PROLOGUE
6+
.data
7+
/* Unlike MOVAPS, we don't need to align memory here. */
8+
input: .float 1.5, 2.5, 3.5, 4.5
9+
.bss
10+
output: .skip 16
11+
.text
12+
movups input, %xmm0
13+
movups %xmm0, %xmm1
14+
movups %xmm1, output
15+
LKMC_ASSERT_MEMCMP(input, output, $16)
16+
LKMC_EPILOGUE

userland/arch/x86_64/vfmadd132pd.S

Lines changed: 23 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,23 @@
1+
/* https://github.com/cirosantilli/linux-kernel-module-cheat#x86-fma */
2+
3+
#include <lkmc.h>
4+
5+
LKMC_PROLOGUE
6+
.data
7+
.align 16
8+
input0: .double 1.5, 2.5
9+
input1: .double 2.0, 4.0
10+
input2: .double 2.5, 3.5
11+
expect: .double 6.5, 16.5
12+
.bss
13+
.align 16
14+
output: .skip 16
15+
.text
16+
movaps input1, %xmm0
17+
movaps input0, %xmm1
18+
movaps input2, %xmm2
19+
/* xmm2 = xmm1 + (xmm0 * xmm2) */
20+
vfmadd132pd %xmm0, %xmm1, %xmm2
21+
movaps %xmm2, output
22+
LKMC_ASSERT_MEMCMP(output, expect, $0x10)
23+
LKMC_EPILOGUE
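
To double check the expected values and to see why a fused operation differs from a separate multiply and add, the computation can be modelled in Python. This is only a sketch under assumptions: `fma` and `vfmadd132pd` are hypothetical helpers, exact rational arithmetic from `fractions` stands in for the hardware's internal precision, and the argument order follows the AT&T operand order used in the file above.

```python
from fractions import Fraction


def fma(a, b, c):
    """Compute a * b + c exactly, then round once to the nearest double."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


def vfmadd132pd(src3, src2, dst):
    """Hypothetical lane-by-lane model in AT&T operand order:
    dst = dst * src3 + src2."""
    return [fma(d, s3, s2) for s3, s2, d in zip(src3, src2, dst)]


if __name__ == "__main__":
    xmm0 = [2.0, 4.0]  # input1
    xmm1 = [1.5, 2.5]  # input0
    xmm2 = [2.5, 3.5]  # input2
    print(vfmadd132pd(xmm0, xmm1, xmm2))

    # Inputs chosen so that the product a * a is not exactly representable.
    a = 1.0 + 2.0 ** -30
    c = -(1.0 + 2.0 ** -29)
    print(a * a + c)     # separate multiply and add: two roundings
    print(fma(a, a, c))  # fused: a single rounding at the end
```

With these inputs the separate multiply and add loses the low-order term of the product and yields zero, while the fused version keeps it.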
