Commit 64f6c2d
Much faster prompt processing for k-quants (ARM_NEON) (#552)
* iq2_xxs
55.8 -> 167.5 t/s. iq2_xxs is at 93.7 t/s
* iq2_xs
46.4 -> 166.6 t/s. iq2_xs_r4 is at 72.3 t/s.
* iq2_s
42.8 t/s -> 166.8 t/s. iq2_s_r4 is at 71.1 t/s.
* iq3_xxs
51.8 t/s -> 165.6 t/s. iq3_xxs_r4 is at 84.6 t/s.
* iq3_s
46.0 t/s -> 162.0 t/s. iq3_s_r4 is at 79.4 t/s
* q2_k
85.7 t/s -> 168.1 t/s. q2_k_r4 is at 111.2 t/s.
* q3_K
45.7 t/s -> 170.8 t/s. q3_k_r4 is at 110.3 t/s.
* q6_k
47.7 t/s -> 124 t/s. q6_k_r4 is at 112.7 t/s.
* q4_k
58.2 t/s -> 114.8 t/s. iq4_k_r4 is at 130.9 t/s.
As I had to add a new implementation for q8_1-quantized
activations, TG became slightly faster too
(25.1 -> 25.9 t/s).
* q5_k
54.9 -> 114.9 t/s. q5_k_r4 is at 116.2 t/s.
* iq4_xs
71.2 -> 167.8 t/s. iq4_xs_r4 is at 138.6 t/s.
---------
Co-authored-by: Iwan Kawrakow <[email protected]>1 parent ddda4d9 commit 64f6c2d
3 files changed
+725
-18
lines changed| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
979 | 979 | | |
980 | 980 | | |
981 | 981 | | |
982 | | - | |
| 982 | + | |
983 | 983 | | |
984 | 984 | | |
985 | 985 | | |
| |||
1009 | 1009 | | |
1010 | 1010 | | |
1011 | 1011 | | |
1012 | | - | |
| 1012 | + | |
1013 | 1013 | | |
1014 | 1014 | | |
1015 | 1015 | | |
| |||
1039 | 1039 | | |
1040 | 1040 | | |
1041 | 1041 | | |
1042 | | - | |
| 1042 | + | |
1043 | 1043 | | |
1044 | 1044 | | |
1045 | 1045 | | |
| |||
0 commit comments