CUDA: faster IQ2_K, IQ2_KS, IQ2_K_R4 #716

ikawrakow · 2025-08-21T16:17:16Z

This PR is a follow-up to #713 and #714. It applies a similar trick to the 2-bit quants that need a table lookup (IQ2_K, IQ2_KS, IQ2_K_R4).
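The PR text does not spell out the trick. The commit messages call it "bperm", which presumably refers to a byte-permute instruction such as CUDA's `__byte_perm`; that reading is an assumption, not something stated here. Under that assumption, a 4-entry int8 lookup table fits in one 32-bit register, and a single byte permute can resolve four 2-bit indices at once. The sketch below emulates this on the host in plain C++. The helper names (`byte_perm`, `pack4`, `spread_2bit`, `lookup4`) and the table values used in the usage note are hypothetical and not taken from the PR.

```cpp
#include <cstdint>

// Host-side emulation of CUDA's __byte_perm(x, y, s) in its default mode:
// each of the four low selector nibbles picks one byte (via its low 3 bits)
// out of the 8 bytes formed by x (bytes 0..3) and y (bytes 4..7).
static uint32_t byte_perm(uint32_t x, uint32_t y, uint32_t s) {
    const uint64_t bytes = (uint64_t(y) << 32) | x;
    uint32_t result = 0;
    for (int i = 0; i < 4; ++i) {
        const uint32_t sel = (s >> (4 * i)) & 0x7;
        result |= uint32_t((bytes >> (8 * sel)) & 0xff) << (8 * i);
    }
    return result;
}

// Pack four int8 table entries into one 32-bit word, entry 0 in the low byte.
static uint32_t pack4(int8_t a, int8_t b, int8_t c, int8_t d) {
    return uint32_t(uint8_t(a)) | (uint32_t(uint8_t(b)) << 8) |
           (uint32_t(uint8_t(c)) << 16) | (uint32_t(uint8_t(d)) << 24);
}

// Spread four 2-bit indices packed in one byte into four selector nibbles.
static uint32_t spread_2bit(uint8_t q) {
    return (q & 0x03u) | ((q & 0x0cu) << 2) | ((q & 0x30u) << 4) | ((q & 0xc0u) << 6);
}

// Four table lookups with a single byte permute (hypothetical helper).
static uint32_t lookup4(uint32_t table, uint8_t q) {
    return byte_perm(table, 0, spread_2bit(q));
}

// Scalar reference: one lookup per 2-bit index, for comparison.
static uint32_t lookup4_scalar(const int8_t table[4], uint8_t q) {
    return pack4(table[q & 3], table[(q >> 2) & 3], table[(q >> 4) & 3], table[(q >> 6) & 3]);
}
```

With an illustrative table such as `pack4(-3, -1, 1, 3)`, `lookup4(table, q)` returns the same packed bytes as `lookup4_scalar` for every `q`. In a real kernel the packed result would typically feed an int8 dot-product instruction directly; how the PR actually wires this up is not described here.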

model	test	t/s (main)	t/s (PR)	Speedup
llama 8B IQ2_KS	pp512	8673.51 ± 56.38	9289.24 ± 64.59	1.071
llama 8B IQ2_K	pp512	7230.06 ± 37.36	7569.58 ± 64.24	1.047
llama 8B IQ2_K_R4	pp512	7414.71 ± 47.02	7611.86 ± 41.09	1.027
llama 8B IQ2_KS	tg128	178.04 ± 0.16	190.74 ± 0.25	1.071
llama 8B IQ2_K	tg128	183.20 ± 0.24	188.78 ± 0.11	1.030
llama 8B IQ2_K_R4	tg128	172.98 ± 0.21	184.66 ± 0.08	1.068

IQ2_KS is now the prompt processing speed champion (the previous one was IQ2_KT).

Iwan Kawrakow added 6 commits August 21, 2025 19:10

Use bperm trick for iq2_ks gemm -> 7% gain

eb488f9

Use bperm trick for iq2_k gemm -> ~5% gain

1d91c16

Use bperm trick for iq2_k_r4 gemm -> ~3% gain

693e9d1

Use bperm trick for iq2_ks gemv -> ~7% gain

353e9ab

Use bperm trick for iq2_k gemv -> ~3% gain

9cf9172

Use bperm trick for iq2_k_r4 gemv -> ~7% gain

01eee24

ikawrakow merged commit dfa6e2b into main Aug 22, 2025
