Fix missed block_q8_x2 bf16 -> i16 change #540
Merged
See #538
The story behind this bug:
Many years ago, the committee designing the AVX instruction set decided to use the most unhelpful instruction for performing dot products between `int8` SIMD vectors: the left operand in the instruction had to be an unsigned integer. That decision propagated into AVX2 and AVX512. When using this in the context of quantized LLMs, where quantized model weights are signed integers, we have two options to deal with this situation:

1. Apply the signs of the weights to both operands, so the left operand becomes non-negative.
2. Offset the weights into the unsigned range, use the instruction directly, and afterwards subtract the offset times the sum of the other operand's quants (which therefore needs to be precomputed and stored).

Option 2 is faster, but cannot be used on AVX2 when the quants span the full `int8_t` range, as the dot product produces a SIMD vector with `int16_t` values containing the sum of pairs, and that can overflow (e.g., `255*127 + 255*127 = 64770` does not fit into an `int16_t`). But on AVX512 the dot product sums 4 products into an `int32_t`, avoiding overflow in intermediate results, so we use the faster option 2.
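For illustration, here is a minimal sketch of the two paths (not the actual ik_llama.cpp kernels, and option 1 is assumed to be the usual sign-flip trick): on AVX2, `_mm256_maddubs_epi16` needs a non-negative left operand and sums pairs into `int16_t`, while on AVX512-VNNI `_mm512_dpbusd_epi32` accumulates groups of 4 products straight into `int32_t`.

```c++
#include <immintrin.h>

// Option 1 (AVX2): apply the sign of x to both operands so the left operand
// of _mm256_maddubs_epi16 is non-negative. Pairs are summed into int16_t,
// but |x|*|y| <= 128*127, so the pair sum cannot overflow.
static inline __m256i dot_avx2_sign_trick(__m256i x, __m256i y) {
    __m256i ax  = _mm256_sign_epi8(x, x);          // |x|, usable as unsigned
    __m256i sy  = _mm256_sign_epi8(y, x);          // y with the sign of x
    __m256i p16 = _mm256_maddubs_epi16(ax, sy);    // pairwise products, int16 sums
    return _mm256_madd_epi16(p16, _mm256_set1_epi16(1)); // widen to int32
}

#if defined(__AVX512VNNI__)
// Option 2 (AVX512-VNNI): feed the offset-to-unsigned weights directly.
// Groups of 4 products are accumulated into int32_t, so no int16_t overflow
// is possible; the +offset applied to the weights is corrected afterwards
// using the precomputed sum of the y quants.
static inline __m512i dot_avx512_vnni(__m512i acc, __m512i x_unsigned, __m512i y) {
    return _mm512_dpbusd_epi32(acc, x_unsigned, y);
}
#endif
```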
For option 2 we have the `Q8_1` type, which contains the block scale and the sum of the quants in the block times the block scale as `fp16`. This worked fine until DeepSeek came along, and we started getting NaNs because the sum was occasionally overflowing the `fp16` range.
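A quick sanity check on why that overflow is possible, using the 32 quants per block of this dot product and the fact that the largest finite `fp16` value is 65504:

$$
\Bigl|\, d \sum_{j=1}^{32} q_j \,\Bigr| \;\le\; d \cdot 32 \cdot 128 \;=\; 4096\, d ,
$$

so a block scale $d$ of roughly 16 or more is already enough for the stored product to exceed 65504 when the quants are large and mostly share a sign.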
We then switched to using `Q8_2`, which is the same as `Q8_1`, except that block scale and sum are stored as `bf16`, which resolved the NaNs with DeepSeek. But when working on PR #534, I noticed that PPL for `Q4_0` became significantly higher, and that was due to not enough precision in the `bf16` block sum. So, I changed again to have the block sum stored as `int16_t` (which is exact), and then converted to `fp32` at run time.
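To make the storage changes concrete, here is a sketch of the block layouts (the field names, the 32-quant block size, and keeping the scale as `bf16` are my assumptions; this is not copied from the ik_llama.cpp headers):

```c++
#include <stdint.h>
#include <string.h>

#define QK8 32   // quants per block, as used in this dot product

// Q8_1-style block: both the scale d and d * sum(qs) stored as fp16 bits.
// The pre-multiplied sum is the member that overflowed fp16 with DeepSeek.
// (Q8_2 initially kept the same layout, just with bf16 bits instead of fp16.)
typedef struct {
    uint16_t d;        // block scale (fp16 bits)
    uint16_t ds;       // d * sum(qs) (fp16 bits)
    int8_t   qs[QK8];
} block_q8_1_sketch;

// After the bf16 -> int16 change: the scale stays a 16-bit float, while the
// quant sum is stored exactly as int16_t (|sum| <= 32 * 128 = 4096, so it fits).
typedef struct {
    uint16_t d;        // block scale (bf16 bits)
    int16_t  s;        // sum(qs), exact
    int8_t   qs[QK8];
} block_q8_2_sketch;

// bf16 bits -> fp32 is exact: bf16 is the top half of an IEEE fp32.
static inline float bf16_to_f32(uint16_t h) {
    uint32_t u = (uint32_t)h << 16;
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}

// The correction term is now computed at run time as scale * (float)sum,
// instead of being read as a pre-multiplied 16-bit float from the block.
static inline float correction_term(const block_q8_2_sketch *b) {
    return bf16_to_f32(b->d) * (float)b->s;
}
```

The integer sum is exact; any precision loss now happens only in the final `fp32` multiply.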
I thought I had adapted all places where `Q8_2` or `Q8_2_X4` is used, but no, I missed one place in the tail of the `Q8_0_R8 x Q8_2_X4` dot product. In that product we go over groups of 4 blocks of 32 quants, and then have a tail handling the leftover blocks. In the vast majority of cases there are no leftovers, but in the DeepSeek FlashMLA we run into this forgotten corner. The PR fixes that.
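Here is a schematic of where the bug sat (scalar C++, not the actual `Q8_0_R8 x Q8_2_X4` SIMD kernel; `Q8Meta` and the loop shape are illustrative): the main loop over groups of 4 blocks already read the sum as `int16_t`, while the leftover-block tail still interpreted the same field as `bf16` bits.

```c++
#include <stdint.h>
#include <string.h>

// Illustrative per-block metadata for the Q8_2-style activations after the
// bf16 -> int16 change (names and layout are assumptions, not the real structs).
struct Q8Meta {
    uint16_t d;   // block scale, bf16 bits
    int16_t  s;   // exact sum of the 32 quants in the block
};

static inline float bf16_to_f32(uint16_t h) {
    uint32_t u = (uint32_t)h << 16;
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}

// Loop structure of the dot product: groups of 4 blocks in the main loop,
// then a tail for the 1..3 leftover blocks. Only the correction term that
// uses the block sum is shown; the SIMD work on the quants is omitted.
float offset_corrections(const Q8Meta *meta, int nblk) {
    float acc = 0.f;
    int i = 0;
    for (; i + 4 <= nblk; i += 4) {                 // main loop: updated for int16 sums
        for (int j = 0; j < 4; ++j)
            acc += bf16_to_f32(meta[i + j].d) * (float)meta[i + j].s;
    }
    for (; i < nblk; ++i) {                         // tail: the forgotten corner
        // Before this PR the tail still treated meta[i].s as bf16 bits
        // (roughly bf16_to_f32((uint16_t)meta[i].s)), yielding a bogus
        // correction term. DeepSeek FlashMLA is one of the rare cases whose
        // block count actually reaches this tail.
        acc += bf16_to_f32(meta[i].d) * (float)meta[i].s;   // fixed
    }
    return acc;
}
```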