1.77003_b3934 - IK new KV quants edition.

Nexesenex · Nexesenex · commit 36874c01c8b1 · 2024-10-22T21:56:34.000+02:00
Milestone version:

- Iwan Kawrakow's amazing work on quantization gives us two more K and V cache quants: iq4_nl and a brand-new q6_0.
- IQ4_NL lowers perplexity (PPL) by more than 1% compared to Q4_0, at the same bitrate.
- Q6_0 comes close to Q8_0 in quality, at 2 BPW less.
Both are offered on their own and mixed with other quants (K being much more sensitive to quantization than V), in order to get the best inference quality.
I recommend q6_0/q5_0 for quality and q5_iq4_nl for savings, with q6_iq4_nl as a compromise.
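
To put the recommended mixes in perspective, here is a rough sketch of the KV cache cost per token for a few K/V pairs. It assumes every listed quant packs 32 values per block with one fp16 scale (so q6_0 is taken as 6.5 BPW, consistent with "2 BPW less" than Q8_0 at 8.5 BPW), that a mix name reads as K type then V type, and that the model dimensions are purely hypothetical. The function name is illustrative and is not part of koboldcpp.py.

```python
# Assumed bits per weight: 32-value blocks plus one fp16 scale per block.
BPW = {
    "f16": 16.0,
    "q8_0": 8.5,
    "q6_0": 6.5,   # assumption: 8.5 BPW minus the 2 BPW quoted above
    "q5_0": 5.5,
    "q4_0": 4.5,
    "iq4_nl": 4.5,  # same bitrate as q4_0
}


def kv_bytes_per_token(k_type, v_type, n_layers, n_kv_heads, head_dim):
    """Approximate KV cache bytes needed per token (hypothetical helper)."""
    elems = n_layers * n_kv_heads * head_dim  # values per token in K, and again in V
    return elems * (BPW[k_type] + BPW[v_type]) / 8


if __name__ == "__main__":
    # Hypothetical model: 32 layers, 8 KV heads, head size 128.
    dims = dict(n_layers=32, n_kv_heads=8, head_dim=128)
    for k, v in [("f16", "f16"), ("q8_0", "q8_0"), ("q6_0", "q5_0"),
                 ("q6_0", "iq4_nl"), ("q5_0", "iq4_nl")]:
        print(f"K={k:7s} V={v:7s} {kv_bytes_per_token(k, v, **dims):9.0f} bytes/token")
```

With these assumed dimensions, q6_0/q5_0 comes to 49152 bytes per token against 131072 for f16/f16.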

More legacy quants might be dropped in the future if IK extends his quants to support head sizes other than 128.

Aside from that, several other commits by IK made their way into Croco.cpp, mostly focused on CUDA performance, of course!

All credit for the benefits goes to Ikawrakow; I'm just the laborious mailman!
diff --git a/koboldcpp.py b/koboldcpp.py
@@ -44,10 +44,10 @@
 modelbusy = threading.Lock()
 requestsinqueue = 0
 defaultport = 5001
-KcppVersion = "1.77002"
+KcppVersion = "1.77003"
 LcppVersion = "b3934"
 CudaSpecifics = "CuCML_ArCML_SMC2_DmmvX32Y1"
-ReleaseDate = "2024/10/19"
+ReleaseDate = "2024/10/21"
 showdebug = True
 guimode = False
 showsamplerwarning = True