Commit 614c6e6

fix: Use Q8_0 for all embedding quantizations for granite and granitemoe
At lower precision levels, the models can manifest numerical instability, especially with batch size > 1. This shows up as nondeterministic stopping when index 0 (the EOG token) has a seemingly uninitialized large value in the logits.

Branch: GraniteEmbedQuant

Signed-off-by: Gabe Goodhart <[email protected]>
1 parent 3edfa7d commit 614c6e6

File tree: 1 file changed (+5 −2 lines)

src/llama-quant.cpp

```diff
@@ -155,7 +155,7 @@ static ggml_type llama_tensor_get_type(quantize_state_impl & qs, ggml_type new_t
     const int64_t nx = tensor->ne[0];
     const int64_t qk_k = ggml_blck_size(new_type);

-    if (arch == LLM_ARCH_FALCON || nx % qk_k != 0) {
+    if (arch == LLM_ARCH_FALCON || arch == LLM_ARCH_GRANITE || arch == LLM_ARCH_GRANITE_MOE || nx % qk_k != 0) {
         new_type = GGML_TYPE_Q8_0;
     }
     else if (ftype == LLAMA_FTYPE_MOSTLY_IQ2_XXS || ftype == LLAMA_FTYPE_MOSTLY_IQ2_XS || ftype == LLAMA_FTYPE_MOSTLY_IQ3_XXS ||
@@ -171,7 +171,10 @@ static ggml_type llama_tensor_get_type(quantize_state_impl & qs, ggml_type new_t
     if (qs.params->token_embedding_type < GGML_TYPE_COUNT) {
         new_type = qs.params->token_embedding_type;
     } else {
-        if (ftype == LLAMA_FTYPE_MOSTLY_IQ2_XXS || ftype == LLAMA_FTYPE_MOSTLY_IQ2_XS ||
+        if (arch == LLM_ARCH_GRANITE || arch == LLM_ARCH_GRANITE_MOE) {
+            new_type = GGML_TYPE_Q8_0;
+        }
+        else if (ftype == LLAMA_FTYPE_MOSTLY_IQ2_XXS || ftype == LLAMA_FTYPE_MOSTLY_IQ2_XS ||
             ftype == LLAMA_FTYPE_MOSTLY_IQ1_S || ftype == LLAMA_FTYPE_MOSTLY_IQ1_M) {
             new_type = GGML_TYPE_Q2_K;
         }
```

0 commit comments

Comments
 (0)