You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Adding IQ1_KT - 1.75 bpw SOTA quants (ikawrakow#616)
* iq1_kt: basics
* iq1_kt: CUDA dequantize
Testing with LlaMA-3.1-8B-Instruct, we get almost the same PPL
as iq2_xxs, so about 0.2 bpw fewer bits for the same quality.
* iq1_kt: CUDA MMQ
* iq1_kt: CUDA MMVQ
* iq1_kt: AVX2 GEMM/GEMV
* iq1_kt: convert/repack to q8_0_r8 (AVX2)
* iq1_kt: slightly faster GEMV
18.6 t/s -> 19.4 t/s
* iq1_kt: NEON GEMM/GEMV
Pathetic as usual
* iq1_kt: slightly faster NEON - still pathetic
* iq1_kt: tiny bit better GEMV on NEON
* iq1_kt: convert/repack to q8_0_r8 (NEON)
* iq1_kt: very slightly faster convert/repack to q8_0_r8 on NEON
* Adding frgotten file
* iq1_kt: add to constants.py
---------
Co-authored-by: Iwan Kawrakow <[email protected]>
0 commit comments