Commit cbf6115 (1 parent 592b9fb): Ongoing readme update.


README.md (53 additions, 45 deletions)

# Croco.Cpp (CCPP) - Readme to be updated

<details>
<summary>Unroll DISCLAIMER:</summary>

The name change is due to my boredom with the Frankenstein marker I myself initiated.

As usual, the Croco.Cpp builds are NOT supported by the KoboldCPP (KCPP) team, Github, or Discord channel.
They are for greedy testing and amusement only.
Any potential support for them is a courtesy, not an obligation.
My CCPP version number bumps as soon as the version number in the official experimental branch bumps, in the following way x.xxx (ex : 1.80.1) : (KCPP)x.xxx.(CCPP)xx.
They are not "upgrades" over the official version, and they might be bugged at times: only the official KCPP releases are to be considered correctly numbered, reliable and "fixed".
The LlamaCPP version + the additional PRs integrated follow my CCPP versioning in the title, so everybody knows what version they deal with.
Important : new models sometimes integrated in my builds (like recently Mistral Nemo, which posed problems for several users) are for personal testing only, and CAN'T be fixed if they fail, because their support comes from third-party LlamaCPP PRs merged "savagely" into my builds, sometimes before they are even merged on LlamaCPP master.

Presentation :

Croco.Cpp (CCPP) is a fork of the experimental branch of KoboldCPP (KCPP), mainly aimed at NVidia Cuda users (I'm using Ampere GPUs myself; it MIGHT support the other backends as well, everything is compiled except Hipblas/ROCm, but it's not tested), with a few modifications according to my own needs :
- A more cluttered GUI that I had to enlarge to put all my mess.
- More context steps in GUI, as well as more Blas Batch Size (supports MMVQ 1-8 for example).
- Physical Blas Batch Size exposed and configurable.
- 22 or so different modes of quantization for the context cache (F16, around 15 KV modes with Flash Attention, 7 quantum legacy K cache modes without Flash Attention for models like Gemma).
- KV cache supports IQ4_NL and Q6_0 (except for Gemma), thanks to Ikawrakow.
- Supports inference for BF16 models in Cuda (thanks Ikawrakow).
- Supports inference for new quants made by Ikawrakow (Q6_0 legacy for irregularly shaped tensors ; IQ_2K, 3K, 4K, 5K, 6K (first gen) ; IQ2_KS, 4_KSS, 4_KS (second gen, working with IK's reworked MMVQ template) ; IQ2_KT, 3_KT, 4_KT (Trellis, working with a restored DMMV kernel)).
- A dozen or so commits taken from Ikawrakow's IK_Llama.CPP for performance (notably on Gemma). That includes a few more GGML ops.
- A slightly different benchmark (one flag per column instead of a single flag space).
- 10 story slots instead of 6 in the web-interface (KLite).
- Often some PRs unsupported/not yet supported in KCPP (I look especially at Cuda and KV cache related PRs).
- More info displayed in the CLI, without activating debug mode.
- Smartcontext instead of contextshift by default in GUI for compatibility with Gemma.
- Supports editing the NORM_EPS_RMS value.
- More logging outside of debug mode.
- Supports EmphasisFSM by Yoshku to handle the "" and ** formatting in KCPP and SillyTavern (mostly useful if you have trouble with chat formatting (thoughts, actions, dialogues) and anti-slop doesn't cut it for your needs).
- Since 1.71010, an enhanced model layers autoloader on GPU (which is less and less cluttered and bugged lol), based on Concedo's code and Pyroserenus formulas, but different from Henky's subsequent commit on KCPP-official. It's compatible with KV_Quants, accounts for FA, MMQ, LowVram, works in single and multi-GPU (up to 16?), is accessible in CLI and GUI modes, and can be configured easily in tandem with tensor split for an entirely customized loading according to one's rig and needs. (A rough illustrative sketch of the idea follows this list.)
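
To give an idea of what such an autoloader does, here is a minimal sketch of the general principle, assuming a naive per-layer VRAM estimate. Every name and constant below is hypothetical and is NOT the actual Concedo/Pyroserenus code, which also accounts for FA, MMQ, LowVram and multi-GPU tensor split:

```
# Minimal, hypothetical sketch of a GPU layer autoloader: estimate how many
# layers fit in free VRAM while reserving headroom for the KV cache.
# None of these names/constants come from CCPP; they only illustrate the idea.

def estimate_gpu_layers(model_size_gb: float, n_layers: int,
                        free_vram_gb: float, kv_headroom_gb: float = 1.5) -> int:
    per_layer_gb = model_size_gb / n_layers               # naive per-layer weight cost
    usable_gb = max(free_vram_gb - kv_headroom_gb, 0.0)   # keep room for context/KV
    return max(0, min(n_layers, int(usable_gb / per_layer_gb)))

# Example: a 13 GB model with 40 layers on a GPU with 12 GB free
# -> 32 layers offloaded, the remaining 8 stay on CPU.
print(estimate_gpu_layers(13.0, 40, 12.0))
```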

Recommended settings for Command Line Interface / GUI :
```
--flashattention (except for Gemma?)
--blasbatchsize 128 (256 for Gemma)
--usecublas mmq (for NVidia users, MMQ mode is faster)
```
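
For instance, a complete launch command combining those settings could look like the line below; the model path and the --gpulayers/--contextsize values are placeholders to adapt to your own rig:

```
python koboldcpp.py --model ./models/my-model.Q5_K_M.gguf --usecublas mmq --flashattention --blasbatchsize 128 --gpulayers 33 --contextsize 8192
```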

Check the help section (koboldcpp.exe --help or python koboldcpp.py --help).

With Flash Attention :
- F16 -> Foolproof (the usual KV quant since the beginning of LCPP/KCPP)
- BF16 (experimental)
- K F16 with : V Q8_0, Q6_0 (experimental), Q5_1, Q5_0, iq4_nl
- K Q8_0 with : V Q8_0 (stable, part of the LCPP/KCPP main triplet), Q6_0 (experimental), Q5_1 (maybe unstable), Q5_0 (maybe unstable), iq4_nl (maybe stable), Q4_0 (maybe stable)
- K Q6_0 with : V Q6_0, Q5_0, iq4_nl
- K Q5_1 with : V Q5_0, iq4_nl
- K Q5_0 with : V iq4_nl
- KV Q4_0 (quite stable, if we consider that it's part of the LCPP/KCPP main triplet)
- KV iq4_nl (with -1% perplexity compared to Q4_0)

These work in command line, normally also via the GUI, and normally save in .KCPPS config files.
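
Note that the BPW figures quoted for these combos elsewhere in this readme are simply the mean of the K and V quant sizes, using the usual GGML cache sizes (F16 = 16, Q8_0 = 8.5, Q6_0 = 6.5, Q5_1 = 6, Q5_0 = 5.5, Q4_1 = 5, Q4_0 = 4.5, IQ4_NL = 4.5 BPW). A quick sanity check of that arithmetic:

```
# KV cache BPW of a K/V combo = average of the two quants' BPW values.
BPW = {"F16": 16.0, "Q8_0": 8.5, "Q6_0": 6.5, "Q5_1": 6.0,
       "Q5_0": 5.5, "Q4_1": 5.0, "Q4_0": 4.5, "IQ4_NL": 4.5}

def kv_bpw(k: str, v: str) -> float:
    return (BPW[k] + BPW[v]) / 2

print(kv_bpw("F16", "Q8_0"))    # 12.25, as listed for K F16 / V q8_0
print(kv_bpw("Q8_0", "Q6_0"))   # 7.5
print(kv_bpw("Q6_0", "Q5_0"))   # 6.0
```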

Without Flash Attention nor MMQ (for models like Gemma) :
- V F16 with K Q8_0, Q5_1, Q5_0, Q4_1, and Q4_0.
- K Q6_0 and IQ4_NL to be tested, might not work.
</details>

<details>
<summary>Unroll the options to set KV Quants (obsolete)</summary>

KCPP official KV quantized modes (modes 1 and 2 require Flash Attention) :

CCPP unofficial KV quantized modes (modes 1-14 require Flash Attention, modes 15-22 work without it) :

```
"1 - q8_0 - (8.5BPW) - FA",
"2 - q4_0 - (4.5BPW) - FA - possibly faulty on some models",
"3* - K F16 - V q8_0 (12.25BPW) - FA",
"4* - K F16 - V q6_0 (11.25BPW) - FA. Doesn't work on Gemma 2 FA.",
"5 - K q8_0 - V q6_0 (7.5BPW) - FA. Doesn't work on Gemma 2 FA.",
"6* - K q8_0 - V q5_0 (7BPW) - FA",
"7 - K q8_0 - V iq4_nl (6.5BPW) - FA. Doesn't work on Gemma 2 FA.",
"8* - K q6_0 - V q6_0 (6.5BPW) - FA. Doesn't work on Gemma 2 FA.",
"9 - K q6_0 - V q5_0 (6BPW) - FA, best game in FA town. Doesn't work on Gemma 2 FA.",
"10* - K q6_0 - V iq4_nl (5.5BPW) - FA - faulty on some models (Gemma 2 FA, Qwen 2.5 1.5b?)",
"11 - K q5_1 - V q5_0 (5.75BPW) - FA - possibly faulty on some models (Qwen 2.5 1.5b?)",
"12* - K q5_1 - V iq4_nl (5.25BPW) - FA",
"13 - K q5_0 - V iq4_nl (5BPW) - FA - possibly faulty on some models (Qwen 2.5 1.5b?)",
"14 - K iq4_nl - V iq4_nl (4.5BPW) - FA",
"15 - BF16 (16BPW) - no FA, experimental for Cuda, not tested on other backends.",
"16 - K q8_0 - V F16 (12.25BPW) - NO FA, slower",
"17 - K q6_0 - V F16 (11.25BPW) - NO FA, slower, best game in non-FA town.",
"18 - K q5_1 - V F16 (11BPW) - NO FA, slower - possibly faulty on some models (Qwen 2.5 1.5b?)",
"19 - K q5_0 - V F16 (10.75BPW) - NO FA, slower - possibly faulty on some models (Qwen 2.5 1.5b?)",
"20 - K q4_1 - V F16 (10.5BPW) - NO FA, slower - possibly faulty on some models (Qwen 2.5 1.5b?)",
"21 - K q4_0 - V F16 (10.25BPW) - NO FA, slower - possibly faulty on some models (Qwen 2.5 1.5b?)",
"22 - K iq4_nl - V F16 (10.25BPW) - NO FA, slower"]

choices=[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22], default=0)
```
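
These strings are the `choices` of an argparse flag. Assuming it is wired like the official KCPP `--quantkv` option (which officially only takes 0=F16, 1=Q8_0, 2=Q4_0), a mode would be selected by passing its index, e.g.:

```
python koboldcpp.py --model ./models/my-model.gguf --flashattention --quantkv 9
```

(Mode 9, K q6_0 / V q5_0, being the list's "best game in FA town".)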
</details>