Can someone enlighten me on how exactly the Matmul 4bit quantization works? #865
han-minhee started this conversation in General
I got an int4-quantized Phi-3-Mini using builder.py from the onnxruntime-genai scripts. However, I guess there's something I'm missing right now. When I tried to unpack the quantized tensor (model.layers.0.attn.qkv_proj.MatMul.weight_Q4 from the int4 model), the unpacked values didn't match the float32 ones (model.layers.0.attn.qkv_proj.MatMul.weight from the fp32 ONNX model). My goal is to learn how to unpack the values.
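For reference, this is roughly how I read the tensors out of the two models. The file paths and the weight_scales tensor name are guesses on my part; only the two tensor names above come from the actual models:

```python
import onnx
from onnx import numpy_helper

def get_initializer(model: onnx.ModelProto, name: str):
    """Return the named graph initializer as a numpy array."""
    for tensor in model.graph.initializer:
        if tensor.name == name:
            return numpy_helper.to_array(tensor)
    raise KeyError(name)

# Placeholder paths for wherever builder.py wrote the two models.
q_model = onnx.load("phi3-mini-int4/model.onnx")
f_model = onnx.load("phi3-mini-fp32/model.onnx")

packed = get_initializer(q_model, "model.layers.0.attn.qkv_proj.MatMul.weight_Q4")
scales = get_initializer(q_model, "model.layers.0.attn.qkv_proj.MatMul.weight_scales")
ref    = get_initializer(f_model, "model.layers.0.attn.qkv_proj.MatMul.weight")
```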
For K = 3072, N = 9216, bits = 4, block_size = 32, and an original matrix B, my understanding is that:

- B originally has shape [3072, 9216], i.e. [K, N].
- B is transposed to [9216, 3072], i.e. [N, K].
- Each row of the transposed matrix (each original column of B) is grouped into blocks of block_size, giving shape [9216, 96, 32].
- The 32 elements inside one block are scaled by a single scale value, so there are 9216 * 96 scale values.
- Two consecutive quantized int4 values are then packed into one uint8, as in the sketch below.
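To make the steps concrete, here is a minimal sketch of that packing scheme. Symmetric signed quantization (scale = max|x| / 7) and low-nibble-first packing are both assumptions on my part, not something I verified against onnxruntime:

```python
import numpy as np

def pack_4bit(B: np.ndarray, block_size: int = 32):
    """Block-quantize a [K, N] float matrix to int4 pairs, per the steps above."""
    K, N = B.shape                                     # e.g. 3072, 9216
    Bt = B.T.reshape(N, K // block_size, block_size)   # [9216, 96, 32]
    # One scale per block, mapping the block's max magnitude onto [-7, 7].
    scales = np.abs(Bt).max(axis=-1) / 7.0             # [9216, 96]
    scales = np.where(scales == 0, 1.0, scales)        # avoid divide-by-zero
    q = np.clip(np.round(Bt / scales[..., None]), -8, 7).astype(np.int8)
    # Pack two consecutive int4 values into one uint8: even index -> low nibble.
    lo = (q[..., 0::2] & 0x0F).astype(np.uint8)
    hi = (q[..., 1::2] & 0x0F).astype(np.uint8)
    packed = lo | (hi << 4)                            # [9216, 96, 16] uint8
    return packed, scales
```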
Here comes the first question. Based on this understanding, I tried unpacking the values with my own function, sketched below (implemented independently of onnxruntime, as I wanted to see what's going on). But the unpacked values are totally different from the original values. What am I missing?
Thank you in advance!
Assume row-major storage: for a matrix A with M rows and N columns, the [i][j]-th element is stored at index i * N + j.
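A minimal sketch of the kind of unpacking I tried, under the same assumptions as above (signed int4 nibbles, low nibble first):

```python
import numpy as np

def unpack_4bit(packed: np.ndarray, scales: np.ndarray,
                K: int = 3072, N: int = 9216, block_size: int = 32) -> np.ndarray:
    """Reverse of the packing sketched above, returning a [K, N] float matrix."""
    packed = packed.reshape(N, K // block_size, block_size // 2)
    scales = scales.reshape(N, K // block_size).astype(np.float32)
    lo = (packed & 0x0F).astype(np.int8)
    hi = ((packed >> 4) & 0x0F).astype(np.int8)
    # Sign-extend the 4-bit two's-complement nibbles.
    lo = np.where(lo > 7, lo - 16, lo)
    hi = np.where(hi > 7, hi - 16, hi)
    # Interleave back: even elements came from the low nibble.
    q = np.empty((N, K // block_size, block_size), dtype=np.int8)
    q[..., 0::2] = lo
    q[..., 1::2] = hi
    # Dequantize per block and undo the transpose.
    return (q * scales[..., None]).reshape(N, K).T
```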
Replies: 1 comment

han-minhee:
To make my question clearer, I made a simple Python function assuming that there are two signed int4 values packed into one uint8.
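Something along these lines; the low-nibble-first ordering is still an assumption on my part:

```python
def split_uint8(byte: int) -> tuple[int, int]:
    """Split one uint8 into two signed int4 values (low nibble first)."""
    lo, hi = byte & 0x0F, (byte >> 4) & 0x0F
    sign = lambda v: v - 16 if v > 7 else v   # 4-bit two's complement
    return sign(lo), sign(hi)

print(split_uint8(0xF1))  # -> (1, -1)
```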