gguf-py : Numpy dequantization for most types #8939
This also adds quantization for Q4_0, Q4_1, Q5_0, and Q5_1. By doing this I've noticed that Q4_0 and Q5_0 (but not the others) have platform-dependent rounding in the reference C version, which depends on whether ggml was compiled with fused-multiply-add or not. The Numpy version does the equivalent of using FMA, but on all platforms. I think the rounding method of these types should be changed eventually.
Would something like this work?
diff --git a/ggml/src/ggml-quants.c b/ggml/src/ggml-quants.c
index d5b91c2d..4c0dd3c8 100644
--- a/ggml/src/ggml-quants.c
+++ b/ggml/src/ggml-quants.c
@@ -683,11 +683,11 @@ void quantize_row_q4_0_ref(const float * restrict x, block_q4_0 * restrict y, in
         y[i].d = GGML_FP32_TO_FP16(d);
 
         for (int j = 0; j < qk/2; ++j) {
-            const float x0 = x[i*qk + 0    + j]*id;
-            const float x1 = x[i*qk + qk/2 + j]*id;
+            const float x0 = x[i*qk + 0    + j];
+            const float x1 = x[i*qk + qk/2 + j];
 
-            const uint8_t xi0 = MIN(15, (int8_t)(x0 + 8.5f));
-            const uint8_t xi1 = MIN(15, (int8_t)(x1 + 8.5f));
+            const uint8_t xi0 = MIN(15, (int8_t)(fmaf(x0, id, 8.5f)));
+            const uint8_t xi1 = MIN(15, (int8_t)(fmaf(x1, id, 8.5f)));
 
             y[i].qs[j]  = xi0;
             y[i].qs[j] |= xi1 << 4;
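The two rounding paths in the diff above can be compared from Python. The following is only a rough sketch, not the code from this PR: `q4_0_codes` and `emulate_fma` are hypothetical names, and the FMA path is approximated with a `float64` intermediate (the product of two `float32` values is exact in `float64`, but the later conversion back to `float32` rounds a second time, so this may not match a hardware `fmaf` in every edge case).

```python
import numpy as np

QK4_0 = 32  # number of values in one Q4_0 block


def q4_0_codes(block: np.ndarray, emulate_fma: bool) -> np.ndarray:
    """Return the 32 unpacked 4-bit codes of one Q4_0 block (hypothetical helper)."""
    x = np.asarray(block, dtype=np.float32)
    assert x.shape == (QK4_0,)

    # The scale comes from the signed value with the largest magnitude.
    amax_idx = np.abs(x).argmax()
    d = np.float32(x[amax_idx] / np.float32(-8))
    inv_d = np.float32(0) if d == 0 else np.float32(1) / d

    if emulate_fma:
        # Assumption: a float64 intermediate stands in for a fused multiply-add.
        t = (x.astype(np.float64) * np.float64(inv_d) + 8.5).astype(np.float32)
    else:
        # Separate float32 multiply and add, each rounded on its own.
        t = x * inv_d + np.float32(8.5)

    # Truncate toward zero and clamp to 15, like MIN(15, (int8_t)(...)) in C.
    return np.minimum(np.trunc(t), 15).astype(np.uint8)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    blk = rng.standard_normal(QK4_0).astype(np.float32)
    fused = q4_0_codes(blk, emulate_fma=True)
    split = q4_0_codes(blk, emulate_fma=False)
    print("codes that differ:", int(np.count_nonzero(fused != split)))
```

Any mismatch between the two paths can only show up when a value lands almost exactly on an integer boundary after adding 8.5, which is why the difference is rare and at most one code step.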
Yes, using […] Maybe something like […] But this makes […]

@ggerganov Since this problem affects rounding in the reference quantization for […]
Let's fix it in a separate PR
Related to the FMA rounding of […] And that's not all, the scale selection logic in […] I was working on quantizing […]

Now I'm wondering if quantization should really be the same on all platforms or not, since FMA does help with reducing some rounding errors (although not much), and it's usually also good for performance when the CPU supports it. Explicitly using FMA everywhere might also work, although a cumulative FMA sum seems very hard to do efficiently in Numpy. And I'm not sure how to disable FMA only for the reference quantization functions. Maybe by putting them in their own file and using […]

Is platform-independent reproducible quantization worth it? I don't know. It's more complicated than I thought.
Maybe we should disable the auto-FMA contractions altogether in the CPU code (see GCC's -ffp-contract option: https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#index-ffp-contract)
* gguf-py : Numpy dequantization for most types
* gguf-py : Numpy dequantization for grid-based i-quants
This implements dequantization in Python (using Numpy) for Q4_0, Q4_1, Q5_0, Q5_1, Q2_K, Q3_K, Q4_K, Q5_K, Q6_K, IQ2_XXS, IQ2_XS, IQ2_S, IQ3_XXS, IQ3_S, IQ1_S, IQ1_M, IQ4_NL, and IQ4_XS, resulting in the same float32 values as the reference C implementations.

This should be useful for #8831

The only types for which dequantization is not implemented are the grouped Q4_0 and Q8_0 variants added in #5780 (because I did not find their reference dequantization functions).

This also adds quantization for Q4_0, Q4_1, Q5_0, and Q5_1. By doing this I've noticed that Q4_0 and Q5_0 (but not the others) have platform-dependent rounding in the reference C version, which depends on whether ggml was compiled with fused-multiply-add or not. The Numpy version does the equivalent of using FMA, but on all platforms. I think the rounding method of these types should be changed eventually.

I've verified that all added quantization and dequantization functions result in the same bits as the reference C implementations, by using gguf-py/tests/test_quants.py which I've added for this purpose. It requires building ggml with cmake and BUILD_SHARED_LIBS.
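For readers unfamiliar with the block formats, here is a minimal sketch of what Numpy dequantization looks like for the simplest type, Q4_0. It is not the code added by this PR: `dequantize_q4_0` is a hypothetical name, and the layout assumed here is the usual 18-byte block (a little-endian float16 scale followed by 16 bytes holding two 4-bit codes each, low nibbles for the first half of the block and high nibbles for the second half).

```python
import numpy as np

QK4_0 = 32                     # values per block
BLOCK_BYTES = 2 + QK4_0 // 2   # float16 scale + 16 packed bytes (assumed layout)


def dequantize_q4_0(raw: np.ndarray) -> np.ndarray:
    """Turn raw Q4_0 bytes into float32 values (hypothetical helper)."""
    blocks = np.asarray(raw, dtype=np.uint8).reshape(-1, BLOCK_BYTES)

    # Per-block scale: reinterpret the first two bytes as a float16.
    d = blocks[:, :2].copy().view("<f2").astype(np.float32)  # shape (n, 1)

    # Low nibbles are the first 16 values, high nibbles the last 16.
    qs = blocks[:, 2:]
    lo = (qs & 0x0F).astype(np.int8) - np.int8(8)
    hi = (qs >> 4).astype(np.int8) - np.int8(8)

    q = np.concatenate([lo, hi], axis=1).astype(np.float32)
    return (q * d).reshape(-1)


if __name__ == "__main__":
    scale = np.array([0.5], dtype="<f2").view(np.uint8)
    packed = np.full(QK4_0 // 2, 0x98, dtype=np.uint8)  # low nibble 8, high nibble 9
    print(dequantize_q4_0(np.concatenate([scale, packed])))
```

The whole tensor is handled with array operations on a `(n_blocks, 18)` view rather than a Python loop over blocks, which is what makes a pure-Numpy implementation practical.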