
Commit 4f94a61

Quantize: specify each major tensor quant in CLI for common LLMs
This PR replicates the per-tensor custom quantization CLI feature introduced by Ikawrakow for the token embeddings and output tensors in ggml-org#6239 to the following tensors:

- attn_q.weight
- attn_k.weight
- attn_v.weight
- attn_qkv.weight
- attn_output.weight
- ffn_gate
- ffn_down
- ffn_up

The goal is to let LlamaCPP users easily tailor their chosen quant strategy to their needs, but also to let GPU users easily requantize a quant that is "a bit too big" for their VRAM. For example, a nice Miqu 70b Q5_K_M (for which no FP16 weights are available beyond dequants of Q5_K_M) falls just short of fitting on a pair of 3090s. And I am French, so Miqu is one of my main local models. Requanting that Q5_K_M into... Q5_K_M, but with all the ffn_down and attn_v.weight tensors specified as Q5_K and attn_q.weight specified as Q4_K_M, can save approximately 1.5GB without degrading quality too much. That means 1.3-1.4GB of additional context (yummy with FA and KV cache) and, say, 100-200MB of additional compute cache with a reasonable BLAS batch size in MMQ.

There is another benefit: the unspecified tensors are not requantized, because LlamaCPP simply copies a tensor rather than requantizing it when the type chosen for that tensor by the selected strategy is the same as the source. So one keeps the original Miqu quant of those tensors rather than a dequant/requant.

And that's just one example; I think many LCPP users could use this feature for their own needs, even if it remains quite basic: this PR does not support hybrid quantization of a tensor (for example, a fraction of the layers in the higher quant starting from layer 0, or the "more_bits" calculus devised by Ikawrakow to create intervals of different quants, e.g. one layer in every three quantized with the higher type).

CLI example: `llama-quantize --allow-requantize --imatrix Q:\iMatrix\Sheared\princeton-nlp_Sheared-LLaMA-2.7B-AR-b1924-Q8_0.iMatrix_Wiki_c32_ch500.dat --output-tensor-type q4_0 --token-embedding-type q4_0 --attn-q-type q4_0 --attn-k-type q4_0 --attn-v-type q4_0 --attn-output-type q4_0 --ffn-gate-type q4_0 --ffn-down-type q4_0 --ffn-up-type q4_0 D:\text-generation-webui\models\Q8_0\princeton-nlp_Sheared-LLaMA-2.7B-AR-b1924-Q8_0.gguf D:\text-generation-webui\models\princeton-nlp_Sheared-LLaMA-2.7B-AR-b228N.iMatrix_Wiki_c32_ch500-Q5_K_M.gguf Q5_K_M`

This produces a full q4_0 quant, equivalent to a pure quant but specified tensor by tensor.
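A sketch of what that Miqu requant might look like with the new flags (the file names here are placeholders, and the per-tensor K-quant types are written in the qN_K notation accepted by the CLI, so the Q4_K_M mentioned above becomes the q4_K tensor type):

`llama-quantize --allow-requantize --ffn-down-type q5_K --attn-v-type q5_K --attn-q-type q4_K miqu-1-70b.Q5_K_M.gguf miqu-1-70b.Q5_K_M-slim.gguf Q5_K_M`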
1 parent e849544 commit 4f94a61

File tree

3 files changed: +149 -14 lines

examples/quantize/quantize.cpp

Lines changed: 71 additions & 6 deletions
@@ -58,6 +58,7 @@ static const std::vector<struct quant_option> QUANT_OPTIONS = {
     { "F16",  LLAMA_FTYPE_MOSTLY_F16,  "14.00G, +0.0020 ppl @ Mistral-7B", },
     { "BF16", LLAMA_FTYPE_MOSTLY_BF16, "14.00G, -0.0050 ppl @ Mistral-7B", },
     { "F32",  LLAMA_FTYPE_ALL_F32,     "26.00G @ 7B", },
+    { "CQS",  LLAMA_FTYPE_CQS,         "Custom Quantization Scheme", },
     // Note: Ensure COPY comes after F32 to avoid ftype 0 from matching.
     { "COPY", LLAMA_FTYPE_ALL_F32,     "only copy tensors, no quantizing", },
 };
@@ -101,19 +102,35 @@ static bool try_parse_ftype(const std::string & ftype_str_in, llama_ftype & ftyp
 //
 [[noreturn]]
 static void usage(const char * executable) {
-    printf("usage: %s [--help] [--allow-requantize] [--leave-output-tensor] [--pure] [--imatrix] [--include-weights] [--exclude-weights] [--output-tensor-type] [--token-embedding-type] [--override-kv] model-f32.gguf [model-quant.gguf] type [nthreads]\n\n", executable);
+    printf("usage: %s [--help] [--allow-requantize] [--leave-output-tensor] [--pure] [--imatrix] [--include-weights] [--exclude-weights] [--output-tensor-type] [--token-embedding-type] [--attn-q-type] [--attn-k-type] [--attn-v-type] [--attn-qkv-type] [--attn-output-type] [--ffn-gate-type] [--ffn-down-type] [--ffn-up-type] [--keep-split] [--override-kv] model-f32.gguf [model-quant.gguf] type [nthreads]\n\n", executable);
     printf(" --allow-requantize: Allows requantizing tensors that have already been quantized. Warning: This can severely reduce quality compared to quantizing from 16bit or 32bit\n");
     printf(" --leave-output-tensor: Will leave output.weight un(re)quantized. Increases model size but may also increase quality, especially when requantizing\n");
     printf(" --pure: Disable k-quant mixtures and quantize all tensors to the same type\n");
     printf(" --imatrix file_name: use data in file_name as importance matrix for quant optimizations\n");
     printf(" --include-weights tensor_name: use importance matrix for this/these tensor(s)\n");
-    printf(" --exclude-weights tensor_name: use importance matrix for this/these tensor(s)\n");
-    printf(" --output-tensor-type ggml_type: use this ggml_type for the output.weight tensor\n");
-    printf(" --token-embedding-type ggml_type: use this ggml_type for the token embeddings tensor\n");
-    printf(" --keep-split: will generate quatized model in the same shards as input");
+    printf(" --exclude-weights tensor_name: use importance matrix for this/these tensor(s)\n\n");
+    printf(" --output-tensor-type ggml_type: use this ggml_type for the output.weight tensor.\n");
+    printf(" --token-embedding-type ggml_type: use this ggml_type for the token_embd.weight tensor.\n\n");
+    printf("Additional specific tensor quantization types used in the custom quant scheme 'CQS' (default is Q2_K):\n");
+    printf(" --attn-q-type ggml_type: use this ggml_type for the attn_q.weight tensor.\n");
+    printf(" --attn-k-type ggml_type: use this ggml_type for the attn_k.weight tensor.\n");
+    printf(" --attn-v-type ggml_type: use this ggml_type for the attn_v.weight tensor.\n");
+    printf(" --attn-qkv-type ggml_type: use this ggml_type for the attn_qkv.weight tensor.\n");
+    printf(" --attn-output-type ggml_type: use this ggml_type for the attn_output.weight tensor.\n");
+    printf(" --ffn-gate-type ggml_type: use this ggml_type for the ffn_gate tensor.\n");
+    printf(" --ffn-down-type ggml_type: use this ggml_type for the ffn_down tensor.\n");
+    printf(" --ffn-up-type ggml_type: use this ggml_type for the ffn_up tensor.\n\n");
+    printf(" --keep-split: will generate quantized model in the same shards as input\n");
     printf(" --override-kv KEY=TYPE:VALUE\n");
-    printf("     Advanced option to override model metadata by key in the quantized model. May be specified multiple times.\n");
+    printf("     Advanced option to override model metadata by key in the quantized model. May be specified multiple times.\n\n");
     printf("Note: --include-weights and --exclude-weights cannot be used together\n");
+    printf("Note: The token embeddings tensor is loaded in system RAM, even in case of full GPU/VRAM offload.\n");
+    printf("Note: The recommended type for the output tensor is q6_K for the ffn types > iq3_xxs and < q8_0.\n\n");
+    printf("Note for the Custom Quant Scheme FTYPE:\n");
+    printf(" Write the specific tensor legacy quants as qN_N, the K-Quants as qN_K, the IQ-Quants as iqN_xx.\n");
+    printf(" Usually, attn-q-type can be one type below the chosen ffn type, and attn-v-type should be one type above.\n");
+    printf(" attn-qkv-type replaces the types attn-q, attn-k, and attn-v on some models.\n");
+    //TODO: - eventually - harmonize the CAPS writing of the FTYPEs, and non CAPS writing of the GGML_TYPEs.
     printf("\nAllowed quantization types:\n");
     for (auto & it : QUANT_OPTIONS) {
         if (it.name != "COPY") {
@@ -267,6 +284,54 @@ int main(int argc, char ** argv) {
         } else {
             usage(argv[0]);
         }
+    } else if (strcmp(argv[arg_idx], "--attn-q-type") == 0) {
+        if (arg_idx < argc-1) {
+            params.attn_q_type = parse_ggml_type(argv[++arg_idx]);
+        } else {
+            usage(argv[0]);
+        }
+    } else if (strcmp(argv[arg_idx], "--attn-k-type") == 0) {
+        if (arg_idx < argc-1) {
+            params.attn_k_type = parse_ggml_type(argv[++arg_idx]);
+        } else {
+            usage(argv[0]);
+        }
+    } else if (strcmp(argv[arg_idx], "--attn-v-type") == 0) {
+        if (arg_idx < argc-1) {
+            params.attn_v_type = parse_ggml_type(argv[++arg_idx]);
+        } else {
+            usage(argv[0]);
+        }
+    } else if (strcmp(argv[arg_idx], "--attn-qkv-type") == 0) {
+        if (arg_idx < argc-1) {
+            params.attn_qkv_type = parse_ggml_type(argv[++arg_idx]);
+        } else {
+            usage(argv[0]);
+        }
+    } else if (strcmp(argv[arg_idx], "--attn-output-type") == 0) {
+        if (arg_idx < argc-1) {
+            params.attn_output_type = parse_ggml_type(argv[++arg_idx]);
+        } else {
+            usage(argv[0]);
+        }
+    } else if (strcmp(argv[arg_idx], "--ffn-gate-type") == 0) {
+        if (arg_idx < argc-1) {
+            params.ffn_gate_type = parse_ggml_type(argv[++arg_idx]);
+        } else {
+            usage(argv[0]);
+        }
+    } else if (strcmp(argv[arg_idx], "--ffn-down-type") == 0) {
+        if (arg_idx < argc-1) {
+            params.ffn_down_type = parse_ggml_type(argv[++arg_idx]);
+        } else {
+            usage(argv[0]);
+        }
+    } else if (strcmp(argv[arg_idx], "--ffn-up-type") == 0) {
+        if (arg_idx < argc-1) {
+            params.ffn_up_type = parse_ggml_type(argv[++arg_idx]);
+        } else {
+            usage(argv[0]);
+        }
     } else if (strcmp(argv[arg_idx], "--override-kv") == 0) {
         if (arg_idx == argc-1 || !string_parse_kv_override(argv[++arg_idx], kv_overrides)) {
             usage(argv[0]);
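Tying the new options to the usage notes above, a hypothetical invocation of the CQS ftype (file names are placeholders) that keeps attn-q one type below the ffn type, attn-v one type above, and the output tensor at q6_K could look like:

`llama-quantize --imatrix imatrix.dat --token-embedding-type q4_K --output-tensor-type q6_K --attn-q-type q3_K --attn-k-type q4_K --attn-v-type q5_K --attn-output-type q4_K --ffn-gate-type q4_K --ffn-down-type q4_K --ffn-up-type q4_K model-f16.gguf model-CQS.gguf CQS`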

include/llama.h

Lines changed: 10 additions & 0 deletions
@@ -172,6 +172,8 @@ extern "C" {
     LLAMA_FTYPE_MOSTLY_IQ1_XS  = 39, // except 1d tensors
     LLAMA_FTYPE_MOSTLY_IQ1_XL  = 40, // except 1d tensors
     LLAMA_FTYPE_MOSTLY_IQ4_XSR = 41, // except 1d tensors
+    LLAMA_FTYPE_CQS            = 99, // except 1d tensors
+
     LLAMA_FTYPE_GUESSED = 1024, // not specified in the model file
 };
 

@@ -351,6 +353,14 @@ extern "C" {
         enum llama_ftype ftype;              // quantize to this llama_ftype
         enum ggml_type output_tensor_type;   // output tensor type
         enum ggml_type token_embedding_type; // token embeddings tensor type
+        enum ggml_type attn_q_type;          // attention query tensor type
+        enum ggml_type attn_k_type;          // attention key tensor type
+        enum ggml_type attn_v_type;          // attention value tensor type
+        enum ggml_type attn_qkv_type;        // attention query-key-value tensor type
+        enum ggml_type attn_output_type;     // attention output tensor type
+        enum ggml_type ffn_gate_type;        // feedforward network gate type
+        enum ggml_type ffn_down_type;        // feedforward network down type
+        enum ggml_type ffn_up_type;          // feedforward network up type
         bool allow_requantize;               // allow quantizing non-f32/f16 tensors
         bool quantize_output_tensor;         // quantize output.weight
         bool only_copy;                      // only copy tensors - ftype, allow_requantize and quantize_output_tensor are ignored
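For callers going through the C API rather than the CLI, the new fields are set the same way as the existing output/token-embedding overrides. A minimal sketch, not part of this commit; the file names and the chosen types are illustrative:

#include "llama.h"

int main(void) {
    llama_backend_init();

    llama_model_quantize_params params = llama_model_quantize_default_params();
    params.ftype            = LLAMA_FTYPE_CQS;  // custom scheme; unspecified tensors default to Q2_K
    params.attn_q_type      = GGML_TYPE_Q4_K;   // new per-tensor overrides added by this PR
    params.attn_v_type      = GGML_TYPE_Q5_K;
    params.ffn_down_type    = GGML_TYPE_Q5_K;
    params.allow_requantize = true;

    // placeholder file names
    const uint32_t rc = llama_model_quantize("input.gguf", "output-CQS.gguf", &params);

    llama_backend_free();
    return rc == 0 ? 0 : 1;
}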

src/llama.cpp

Lines changed: 68 additions & 8 deletions
@@ -4529,6 +4529,7 @@ static std::string llama_model_ftype_name(llama_ftype ftype) {
         case LLAMA_FTYPE_MOSTLY_Q4_0_4_4: return "Q4_0_4_4";
         case LLAMA_FTYPE_MOSTLY_Q4_0_4_8: return "Q4_0_4_8";
         case LLAMA_FTYPE_MOSTLY_Q4_0_8_8: return "Q4_0_8_8";
+        case LLAMA_FTYPE_CQS:             return "Custom Quantization Scheme";
 
         default: return "unknown, may not work";
     }
@@ -15906,7 +15907,10 @@ static ggml_type llama_tensor_get_type(quantize_state_internal & qs, ggml_type n
             else if (ftype == LLAMA_FTYPE_MOSTLY_IQ4_XSR) new_type = GGML_TYPE_IQ4_XS;
         }
     } else if (name.find("attn_v.weight") != std::string::npos) {
-        if (qs.model.hparams.n_expert >= 4) {
+        if (ftype == LLAMA_FTYPE_CQS && qs.params->attn_v_type < GGML_TYPE_COUNT) {
+            new_type = qs.params->attn_v_type;
+        }
+        else if (qs.model.hparams.n_expert >= 4) {
             // for the 8-expert model, bumping this to Q8_0 trades just ~128MB
             // TODO: explore better strategies
             if (ftype == LLAMA_FTYPE_MOSTLY_Q2_K || ftype == LLAMA_FTYPE_MOSTLY_Q2_K_S || ftype == LLAMA_FTYPE_MOSTLY_IQ2_XXS ||
@@ -15969,7 +15973,10 @@ static ggml_type llama_tensor_get_type(quantize_state_internal & qs, ggml_type n
         }
         ++qs.i_attention_wv;
     } else if (name.find("attn_k.weight") != std::string::npos) {
-        if (qs.model.hparams.n_expert >= 4) {
+        if (ftype == LLAMA_FTYPE_CQS && qs.params->attn_k_type < GGML_TYPE_COUNT) {
+            new_type = qs.params->attn_k_type;
+        }
+        else if (qs.model.hparams.n_expert >= 4) {
             // for the 8-expert model, bumping this to Q8_0 trades just ~128MB
             // TODO: explore better strategies
             if (ftype == LLAMA_FTYPE_MOSTLY_Q2_K || ftype == LLAMA_FTYPE_MOSTLY_Q2_K_S || ftype == LLAMA_FTYPE_MOSTLY_IQ2_XXS ||
@@ -16022,7 +16029,10 @@ static ggml_type llama_tensor_get_type(quantize_state_internal & qs, ggml_type n
         }
         ++qs.i_attention_wk;
     } else if (name.find("attn_q.weight") != std::string::npos) {
-        if (ftype == LLAMA_FTYPE_MOSTLY_IQ3_XXS) new_type = GGML_TYPE_IQ2_S;
+        if (ftype == LLAMA_FTYPE_CQS && qs.params->attn_q_type < GGML_TYPE_COUNT) {
+            new_type = qs.params->attn_q_type;
+        }
+        else if (ftype == LLAMA_FTYPE_MOSTLY_IQ3_XXS) new_type = GGML_TYPE_IQ2_S;
         else if (ftype == LLAMA_FTYPE_MOSTLY_IQ3_S) new_type = GGML_TYPE_IQ3_XXS;
         else if (ftype == LLAMA_FTYPE_MOSTLY_IQ2_XXS || ftype == LLAMA_FTYPE_MOSTLY_IQ2_XS || ftype == LLAMA_FTYPE_MOSTLY_IQ1_S ||
                  ftype == LLAMA_FTYPE_MOSTLY_IQ2_S || ftype == LLAMA_FTYPE_MOSTLY_IQ2_M || ftype == LLAMA_FTYPE_MOSTLY_IQ1_M ||
@@ -16044,7 +16054,10 @@ static ggml_type llama_tensor_get_type(quantize_state_internal & qs, ggml_type n
     } else if (name.find("ffn_down") != std::string::npos) {
         auto info = layer_info(qs.i_ffn_down, qs.n_ffn_down, name.c_str());
         int i_layer = info.first, n_layer = info.second;
-        if (ftype == LLAMA_FTYPE_MOSTLY_Q2_K || ftype == LLAMA_FTYPE_MOSTLY_Q2_K_L) new_type = GGML_TYPE_Q3_K;
+        if (ftype == LLAMA_FTYPE_CQS && qs.params->ffn_down_type < GGML_TYPE_COUNT) {
+            new_type = qs.params->ffn_down_type;
+        }
+        else if (ftype == LLAMA_FTYPE_MOSTLY_Q2_K || ftype == LLAMA_FTYPE_MOSTLY_Q2_K_L) new_type = GGML_TYPE_Q3_K;
         else if (ftype == LLAMA_FTYPE_MOSTLY_Q2_K_S && (use_more_bits(i_layer, n_layer))) new_type = GGML_TYPE_Q3_K;
         else if (ftype == LLAMA_FTYPE_MOSTLY_IQ1_XS && (i_layer < n_layer/8)) new_type = GGML_TYPE_IQ2_XXS;
         else if (ftype == LLAMA_FTYPE_MOSTLY_IQ1_S || ftype == LLAMA_FTYPE_MOSTLY_IQ1_M) {
@@ -16105,7 +16118,10 @@ static ggml_type llama_tensor_get_type(quantize_state_internal & qs, ggml_type n
         }
         ++qs.i_ffn_down;
     } else if (name.find("attn_output.weight") != std::string::npos) {
-        if (arch != LLM_ARCH_FALCON) {
+        if (ftype == LLAMA_FTYPE_CQS && qs.params->attn_output_type < GGML_TYPE_COUNT) {
+            new_type = qs.params->attn_output_type;
+        }
+        else if (arch != LLM_ARCH_FALCON) {
            if (qs.model.hparams.n_expert >= 4) {
                if (ftype == LLAMA_FTYPE_MOSTLY_Q2_K   || ftype == LLAMA_FTYPE_MOSTLY_IQ3_XS || ftype == LLAMA_FTYPE_MOSTLY_IQ3_XXS ||
                    ftype == LLAMA_FTYPE_MOSTLY_Q2_K_L || ftype == LLAMA_FTYPE_MOSTLY_IQ2_XL || ftype == LLAMA_FTYPE_MOSTLY_IQ3_XL ||
@@ -16143,7 +16159,10 @@ static ggml_type llama_tensor_get_type(quantize_state_internal & qs, ggml_type n
         ++qs.i_attention_wo;
     }
     else if (name.find("attn_qkv.weight") != std::string::npos) {
-        if (ftype == LLAMA_FTYPE_MOSTLY_Q3_K_M || ftype == LLAMA_FTYPE_MOSTLY_Q3_K_L) {
+        if (ftype == LLAMA_FTYPE_CQS && qs.params->attn_qkv_type < GGML_TYPE_COUNT) {
+            new_type = qs.params->attn_qkv_type;
+        }
+        else if (ftype == LLAMA_FTYPE_MOSTLY_Q3_K_M || ftype == LLAMA_FTYPE_MOSTLY_Q3_K_L) {
            new_type = GGML_TYPE_Q4_K;
        }
        else if (ftype == LLAMA_FTYPE_MOSTLY_Q2_K || ftype == LLAMA_FTYPE_MOSTLY_Q2_K_L) new_type = GGML_TYPE_Q3_K;
@@ -16168,7 +16187,10 @@ static ggml_type llama_tensor_get_type(quantize_state_internal & qs, ggml_type n
     else if (name.find("ffn_gate") != std::string::npos) {
         auto info = layer_info(qs.i_ffn_gate, qs.n_ffn_gate, name.c_str());
         int i_layer = info.first, n_layer = info.second;
-        if (ftype == LLAMA_FTYPE_MOSTLY_Q2_K_L && (use_more_bits(i_layer, n_layer))) new_type = GGML_TYPE_Q3_K;
+        if (ftype == LLAMA_FTYPE_CQS && qs.params->ffn_gate_type < GGML_TYPE_COUNT) {
+            new_type = qs.params->ffn_gate_type;
+        }
+        else if (ftype == LLAMA_FTYPE_MOSTLY_Q2_K_L && (use_more_bits(i_layer, n_layer))) new_type = GGML_TYPE_Q3_K;
         else if (ftype == LLAMA_FTYPE_MOSTLY_IQ1_S && (i_layer < n_layer/8)) new_type = GGML_TYPE_IQ2_XXS;
         else if (ftype == LLAMA_FTYPE_MOSTLY_IQ1_M && (i_layer < n_layer/8)) new_type = GGML_TYPE_IQ2_XXS;
         else if (ftype == LLAMA_FTYPE_MOSTLY_IQ1_XL && (use_more_bits(i_layer, n_layer))) new_type = GGML_TYPE_IQ2_XXS;
@@ -16183,7 +16205,10 @@ static ggml_type llama_tensor_get_type(quantize_state_internal & qs, ggml_type n
     else if (name.find("ffn_up") != std::string::npos) {
         auto info = layer_info(qs.i_ffn_up, qs.n_ffn_up, name.c_str());
         int i_layer = info.first, n_layer = info.second;
-        if (ftype == LLAMA_FTYPE_MOSTLY_Q2_K_L && (use_more_bits(i_layer, n_layer))) new_type = GGML_TYPE_Q3_K;
+        if (ftype == LLAMA_FTYPE_CQS && qs.params->ffn_up_type < GGML_TYPE_COUNT) {
+            new_type = qs.params->ffn_up_type;
+        }
+        else if (ftype == LLAMA_FTYPE_MOSTLY_Q2_K_L && (use_more_bits(i_layer, n_layer))) new_type = GGML_TYPE_Q3_K;
         else if (ftype == LLAMA_FTYPE_MOSTLY_IQ1_S && (i_layer < n_layer/8)) new_type = GGML_TYPE_IQ2_XXS;
         else if (ftype == LLAMA_FTYPE_MOSTLY_IQ1_M && (i_layer < n_layer/8)) new_type = GGML_TYPE_IQ2_XXS;
         else if (ftype == LLAMA_FTYPE_MOSTLY_IQ1_XL && (use_more_bits(i_layer, n_layer))) new_type = GGML_TYPE_IQ2_XXS;
@@ -16347,6 +16372,9 @@ static void llama_model_quantize_internal(const std::string & fname_inp, const s
         case LLAMA_FTYPE_MOSTLY_Q4_0_4_8: default_type = GGML_TYPE_Q4_0_4_8; break;
         case LLAMA_FTYPE_MOSTLY_Q4_0_8_8: default_type = GGML_TYPE_Q4_0_8_8; break;
 
+        // Custom Quantization Scheme
+        case LLAMA_FTYPE_CQS: default_type = GGML_TYPE_Q2_K; break;
+
         default: throw std::runtime_error(format("invalid output file type %d\n", ftype));
     }
 
@@ -16605,6 +16633,30 @@ static void llama_model_quantize_internal(const std::string & fname_inp, const s
             if (params->output_tensor_type < GGML_TYPE_COUNT && strcmp(tensor->name, "output.weight") == 0) {
                 new_type = params->output_tensor_type;
             }
+            if (params->attn_q_type < GGML_TYPE_COUNT && strcmp(tensor->name, "attn_q.weight") == 0) {
+                new_type = params->attn_q_type;
+            }
+            if (params->attn_k_type < GGML_TYPE_COUNT && strcmp(tensor->name, "attn_k.weight") == 0) {
+                new_type = params->attn_k_type;
+            }
+            if (params->attn_v_type < GGML_TYPE_COUNT && strcmp(tensor->name, "attn_v.weight") == 0) {
+                new_type = params->attn_v_type;
+            }
+            if (params->attn_qkv_type < GGML_TYPE_COUNT && strcmp(tensor->name, "attn_qkv.weight") == 0) {
+                new_type = params->attn_qkv_type;
+            }
+            if (params->attn_output_type < GGML_TYPE_COUNT && strcmp(tensor->name, "attn_output.weight") == 0) {
+                new_type = params->attn_output_type;
+            }
+            if (params->ffn_gate_type < GGML_TYPE_COUNT && strcmp(tensor->name, "ffn_gate") == 0) {
+                new_type = params->ffn_gate_type;
+            }
+            if (params->ffn_down_type < GGML_TYPE_COUNT && strcmp(tensor->name, "ffn_down") == 0) {
+                new_type = params->ffn_down_type;
+            }
+            if (params->ffn_up_type < GGML_TYPE_COUNT && strcmp(tensor->name, "ffn_up") == 0) {
+                new_type = params->ffn_up_type;
+            }
 
             // If we've decided to quantize to the same type the tensor is already
             // in then there's nothing to do.
@@ -17007,6 +17059,14 @@ struct llama_model_quantize_params llama_model_quantize_default_params() {
        /*.ftype                =*/ LLAMA_FTYPE_MOSTLY_Q5_1,
        /*.output_tensor_type   =*/ GGML_TYPE_COUNT,
        /*.token_embedding_type =*/ GGML_TYPE_COUNT,
+       /*.attn_q_type          =*/ GGML_TYPE_COUNT,
+       /*.attn_k_type          =*/ GGML_TYPE_COUNT,
+       /*.attn_v_type          =*/ GGML_TYPE_COUNT,
+       /*.attn_qkv_type        =*/ GGML_TYPE_COUNT,
+       /*.attn_output_type     =*/ GGML_TYPE_COUNT,
+       /*.ffn_gate_type        =*/ GGML_TYPE_COUNT,
+       /*.ffn_down_type        =*/ GGML_TYPE_COUNT,
+       /*.ffn_up_type          =*/ GGML_TYPE_COUNT,
        /*.allow_requantize     =*/ false,
        /*.quantize_output_tensor =*/ true,
        /*.only_copy            =*/ false,
