Commit 7cf1879

[HGEMM] Pack sliced_k f16x4/fp16x8 HGEMM (#54)
* [HGEMM] pack f16x8 with bcf
* [HGEMM] pack f16x4 with bcf
* Update README.md
* Update hgemm.cu
* Update README.md
* Update hgemm.py
* Update README.md
* Update sgemm.cu
* Update hgemm.cu
* Update sgemm.cu
* Update hgemm.cu
* Update hgemm.cu
* Update README.md
* Update hgemm.py
* Update README.md
* Update README.md
* Update README.md
* Update README.md
* Update README.md
* Update README.md
* Update README.md
* Update hgemm.cu
* Update hgemm.py
* Update README.md
* Update README.md
1 parent 0c9166d commit 7cf1879

11 files changed: +1944 -712 lines changed

README.md

Lines changed: 4 additions & 0 deletions
@@ -99,8 +99,12 @@
 | ✔️ [sgemm_t_8x8_sliced_k_f32x4](./sgemm/sgemm.cu)|f32|f32|[link](./sgemm/)|⭐️⭐️⭐️|
 | ✔️ [sgemm_t_8x8_sliced_k_..._bcf](./sgemm/sgemm.cu)|f32|f32|[link](./sgemm/)|⭐️⭐️⭐️|
 | ✔️ [sgemm_t_8x8_sliced_k_..._dbuf](./sgemm/sgemm.cu)|f32|f32|[link](./sgemm/)|⭐️⭐️⭐️|
+| ✔️ [hgemm_naive_f16](./hgemm/hgemm.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
 | ✔️ [hgemm_sliced_k_f16](./hgemm/hgemm.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
 | ✔️ [hgemm_t_8x8_sliced_k_f16x4](./hgemm/hgemm.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
+| ✔️ [hgemm_t_8x8_sliced_k_f16x4_pack](./hgemm/hgemm.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
+| ✔️ [hgemm_t_8x8_sliced_k_f16x8_pack](./hgemm/hgemm.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
+| ✔️ [hgemm_t_8x8_sliced_k_..._bcf](./hgemm/hgemm.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
 | ✔️ [sgemv_k32_f32](./sgemv/sgemv.cu)|f32|f32|[link](./sgemv/)|⭐️⭐️⭐️|
 | ✔️ [sgemv_k128_f32x4](./sgemv/sgemv.cu)|f32|f32|[link](./sgemv/)|⭐️⭐️⭐️|
 | ✔️ [sgemv_k16_f32](./sgemv/sgemv.cu)|f32|f32|[link](./sgemv/)|⭐️⭐️⭐️|

dot-product/README.md

Lines changed: 64 additions & 64 deletions
@@ -24,101 +24,101 @@ python3 dot_product.py
 ```bash
 --------------------------------------------------------------------------------
 S=1024, K=1024
-out_f32f32: -670.21264648 , time:0.08947158ms
-out_f32x4f32: -670.21435547 , time:0.02821302ms
-out_f32f32_th: -670.21374512 , time:0.09709382ms
+out_f32f32: -332.80715942 , time:0.01124835ms
+out_f32x4f32: -332.80645752 , time:0.01134133ms
+out_f32f32_th: -332.80691528 , time:0.01127815ms
 --------------------------------------------------------------------------------
-out_f16f32: -670.32208252 , time:0.04000235ms
-out_f16x2f32: -670.15814209 , time:0.05491829ms
-out_f16x8packf32: -669.90997314 , time:0.01669478ms
-out_f16f16_th: -670.50000000 , time:0.02021313ms
+out_f16f32: -333.19879150 , time:0.01110196ms
+out_f16x2f32: -333.44345093 , time:0.01122665ms
+out_f16x8packf32: -333.64193726 , time:0.01099825ms
+out_f16f16_th: -332.75000000 , time:0.01118803ms
 --------------------------------------------------------------------------------
 --------------------------------------------------------------------------------
 S=1024, K=2048
-out_f32f32: 1040.51086426 , time:0.04557490ms
-out_f32x4f32: 1040.50720215 , time:0.06275582ms
-out_f32f32_th: 1040.50842285 , time:0.04762864ms
+out_f32f32: -142.86260986 , time:0.01630998ms
+out_f32x4f32: -142.86064148 , time:0.01116729ms
+out_f32f32_th: -142.86035156 , time:0.01143432ms
 --------------------------------------------------------------------------------
-out_f16f32: 1041.44299316 , time:0.03214121ms
-out_f16x2f32: 1041.79589844 , time:0.03448486ms
-out_f16x8packf32: 1042.22717285 , time:0.02689457ms
-out_f16f16_th: 1041.00000000 , time:0.02859521ms
+out_f16f32: -143.31562805 , time:0.01554394ms
+out_f16x2f32: -142.84217834 , time:0.01099968ms
+out_f16x8packf32: -143.60864258 , time:0.01112890ms
+out_f16f16_th: -143.00000000 , time:0.01136470ms
 --------------------------------------------------------------------------------
 --------------------------------------------------------------------------------
 S=1024, K=4096
-out_f32f32: -1859.81457520 , time:0.08664179ms
-out_f32x4f32: -1859.81628418 , time:0.08621526ms
-out_f32f32_th: -1859.81933594 , time:0.08647323ms
+out_f32f32: -3116.77270508 , time:0.02791572ms
+out_f32x4f32: -3116.77929688 , time:0.01236105ms
+out_f32f32_th: -3116.77709961 , time:0.01418424ms
 --------------------------------------------------------------------------------
-out_f16f32: -1860.23291016 , time:0.05826116ms
-out_f16x2f32: -1860.91186523 , time:0.04677963ms
-out_f16x8packf32: -1860.25988770 , time:0.04591107ms
-out_f16f16_th: -1861.00000000 , time:0.04904127ms
+out_f16f32: -3118.24951172 , time:0.02777576ms
+out_f16x2f32: -3118.13208008 , time:0.01556611ms
+out_f16x8packf32: -3118.15527344 , time:0.01114249ms
+out_f16f16_th: -3118.00000000 , time:0.01161337ms
 --------------------------------------------------------------------------------
 --------------------------------------------------------------------------------
 S=2048, K=1024
-out_f32f32: 858.98229980 , time:0.04499865ms
-out_f32x4f32: 858.98461914 , time:0.04623890ms
-out_f32f32_th: 858.98376465 , time:0.06848693ms
+out_f32f32: -1549.67492676 , time:0.01551032ms
+out_f32x4f32: -1549.67419434 , time:0.01115298ms
+out_f32f32_th: -1549.67382812 , time:0.01146293ms
 --------------------------------------------------------------------------------
-out_f16f32: 858.85339355 , time:0.03274632ms
-out_f16x2f32: 858.94274902 , time:0.02831578ms
-out_f16x8packf32: 859.46844482 , time:0.02884459ms
-out_f16f16_th: 859.00000000 , time:0.03692698ms
+out_f16f32: -1549.45434570 , time:0.01545978ms
+out_f16x2f32: -1549.04064941 , time:0.01100898ms
+out_f16x8packf32: -1549.04748535 , time:0.01111746ms
+out_f16f16_th: -1550.00000000 , time:0.01136041ms
 --------------------------------------------------------------------------------
 --------------------------------------------------------------------------------
 S=2048, K=2048
-out_f32f32: -1205.77990723 , time:0.08356524ms
-out_f32x4f32: -1205.77624512 , time:0.08583307ms
-out_f32f32_th: -1205.77807617 , time:0.08613133ms
+out_f32f32: -4219.10205078 , time:0.02766943ms
+out_f32x4f32: -4219.10009766 , time:0.01223850ms
+out_f32f32_th: -4219.10693359 , time:0.01404524ms
 --------------------------------------------------------------------------------
-out_f16f32: -1205.40588379 , time:0.06001544ms
-out_f16x2f32: -1205.29028320 , time:0.04738235ms
-out_f16x8packf32: -1205.72924805 , time:0.04624581ms
-out_f16f16_th: -1205.00000000 , time:0.04907203ms
+out_f16f32: -4218.69335938 , time:0.02764416ms
+out_f16x2f32: -4219.42822266 , time:0.01547956ms
+out_f16x8packf32: -4219.27929688 , time:0.01113629ms
+out_f16f16_th: -4220.00000000 , time:0.01157045ms
 --------------------------------------------------------------------------------
 --------------------------------------------------------------------------------
 S=2048, K=4096
-out_f32f32: -893.49169922 , time:0.16136765ms
-out_f32x4f32: -893.48596191 , time:0.16174912ms
-out_f32f32_th: -893.48901367 , time:0.16518927ms
+out_f32f32: -2869.79296875 , time:0.05231595ms
+out_f32x4f32: -2869.78149414 , time:0.02043509ms
+out_f32f32_th: -2869.78759766 , time:0.02305937ms
 --------------------------------------------------------------------------------
-out_f16f32: -894.42169189 , time:0.11468077ms
-out_f16x2f32: -894.61779785 , time:0.08950567ms
-out_f16x8packf32: -895.26538086 , time:0.08448958ms
-out_f16f16_th: -894.00000000 , time:0.09156108ms
+out_f16f32: -2870.39965820 , time:0.05218816ms
+out_f16x2f32: -2871.60571289 , time:0.02775407ms
+out_f16x8packf32: -2870.28857422 , time:0.01228762ms
+out_f16f16_th: -2870.00000000 , time:0.01509762ms
 --------------------------------------------------------------------------------
 --------------------------------------------------------------------------------
 S=4096, K=1024
-out_f32f32: 141.78890991 , time:0.08385873ms
-out_f32x4f32: 141.78639221 , time:0.08500123ms
-out_f32f32_th: 141.78683472 , time:0.08647728ms
+out_f32f32: -1801.87890625 , time:0.02767515ms
+out_f32x4f32: -1801.88061523 , time:0.01203156ms
+out_f32f32_th: -1801.88317871 , time:0.01396847ms
 --------------------------------------------------------------------------------
-out_f16f32: 141.80113220 , time:0.05876780ms
-out_f16x2f32: 141.62113953 , time:0.04708385ms
-out_f16x8packf32: 141.15240479 , time:0.04586506ms
-out_f16f16_th: 141.50000000 , time:0.04933500ms
+out_f16f32: -1801.71777344 , time:0.02766609ms
+out_f16x2f32: -1801.05224609 , time:0.01547670ms
+out_f16x8packf32: -1799.91137695 , time:0.01112270ms
+out_f16f16_th: -1801.00000000 , time:0.01154137ms
 --------------------------------------------------------------------------------
 --------------------------------------------------------------------------------
 S=4096, K=2048
-out_f32f32: -1238.80456543 , time:0.16236329ms
-out_f32x4f32: -1238.80737305 , time:0.16246724ms
-out_f32f32_th: -1238.80859375 , time:0.16496468ms
+out_f32f32: 643.72991943 , time:0.05231857ms
+out_f32x4f32: 643.72863770 , time:0.02044320ms
+out_f32f32_th: 643.73022461 , time:0.02305865ms
 --------------------------------------------------------------------------------
-out_f16f32: -1238.78466797 , time:0.11416745ms
-out_f16x2f32: -1239.28540039 , time:0.08488607ms
-out_f16x8packf32: -1238.85302734 , time:0.08867455ms
-out_f16f16_th: -1239.00000000 , time:0.09029007ms
+out_f16f32: 644.73352051 , time:0.05214262ms
+out_f16x2f32: 644.69067383 , time:0.02766657ms
+out_f16x8packf32: 644.65740967 , time:0.01228309ms
+out_f16f16_th: 644.00000000 , time:0.01508307ms
 --------------------------------------------------------------------------------
 --------------------------------------------------------------------------------
 S=4096, K=4096
-out_f32f32: 556.32690430 , time:0.31692672ms
-out_f32x4f32: 556.33087158 , time:0.31752276ms
-out_f32f32_th: 556.32879639 , time:0.32040811ms
---------------------------------------------------------------------------------
-out_f16f32: 554.45031738 , time:0.23417449ms
-out_f16x2f32: 553.61444092 , time:0.16469955ms
-out_f16x8packf32: 554.04040527 , time:0.16465998ms
-out_f16f16_th: 554.50000000 , time:0.17046404ms
+out_f32f32: 7372.59375000 , time:0.17362595ms
+out_f32x4f32: 7372.59960938 , time:0.18044138ms
+out_f32f32_th: 7372.58251953 , time:0.18282819ms
+--------------------------------------------------------------------------------
+out_f16f32: 7371.09033203 , time:0.10100150ms
+out_f16x2f32: 7371.48632812 , time:0.05214143ms
+out_f16x8packf32: 7369.69873047 , time:0.02043009ms
+out_f16f16_th: 7372.00000000 , time:0.02451396ms
 --------------------------------------------------------------------------------
 ```
