Fix missed block_q8_x2 bf16 -> i16 change #540
Merged
See #538
The story behind this bug:
Many years ago, the committee designing the `AVX` instruction set decided to use the most unhelpful instruction for performing dot products between `int8` SIMD vectors: the left operand of the instruction has to be an unsigned integer. That decision propagated into `AVX2` and `AVX512`. When using this in the context of quantized LLMs, where the quantized model weights are signed integers, we have two options to deal with the situation: (1) flip the signs on the fly so that the left operand (the weights) becomes non-negative and apply the same sign flip to the right operand (the activations), or (2) treat the weights as unsigned integers and correct the result afterwards using the precomputed sum of the activation quants.
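For illustration, here is a minimal sketch of the two approaches on `AVX2`, assuming the usual `_mm256_maddubs_epi16`-based kernel shape. The helper names (`mul_i8_pairs_option1`, `mul_u8_pairs_option2`) are made up for this sketch and are not the actual kernels in the repo.

```c
// Hedged sketch of the two options on AVX2 (made-up helper names).
// _mm256_maddubs_epi16 multiplies unsigned bytes from its first operand with
// signed bytes from its second and sums adjacent pairs into int16.
#include <immintrin.h>

// Option 1: move the sign of the weights onto the activations, so the left
// operand is |x| (a valid unsigned byte) and each product is unchanged.
static inline __m256i mul_i8_pairs_option1(__m256i x, __m256i y) {
    const __m256i ax = _mm256_sign_epi8(x, x);  // |x|
    const __m256i sy = _mm256_sign_epi8(y, x);  // y with the sign of x applied
    return _mm256_maddubs_epi16(ax, sy);        // pair sums as int16
}

// Option 2: feed the weights directly as unsigned bytes (e.g. stored with an
// offset) and let the caller subtract the offset's contribution, which is
// offset * sum(y) -- the term the Q8_1/Q8_2 block sum provides.
static inline __m256i mul_u8_pairs_option2(__m256i x_as_u8, __m256i y) {
    return _mm256_maddubs_epi16(x_as_u8, y);    // pair sums as int16
}
```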
Option 2 is faster, but cannot be used on `AVX2` when the quants span the full `int8_t` range: the dot product instruction produces a SIMD vector of `int16_t` values containing the sums of pairs of products, and those can overflow (e.g., `255*127 + 255*127`). On `AVX512`, however, the dot product sums 4 products into an `int32_t`, avoiding overflow in intermediate results, so there we use the faster option 2. For this we have the `Q8_1` type, which contains the block scale and the sum of the quants in the block times the block scale as `fp16`.

This worked fine until DeepSeek came along and we started getting NaNs, because the sum was occasionally overflowing the `fp16` range. We then switched to `Q8_2`, which is the same as `Q8_1` except that the block scale and sum are stored as `bf16`, and that resolved the NaNs with DeepSeek. But while working on PR #534 I noticed that PPL for `Q4_0` became significantly higher, and that was due to insufficient precision in the `bf16` block sum. So I changed things once more to store the block sum as `int16_t` (which is exact) and convert it to `fp32` at run time.

I thought I had adapted all the places where `Q8_2` or `Q8_2_X4` is used, but no, I missed one spot in the tail of the `Q8_0_R8 x Q8_2_X4` dot product. In that product we go over groups of 4 blocks of 32 quants, and then have a tail that handles the leftovers. In the vast majority of cases there are no leftovers, but in the DeepSeek FlashMLA we do run into this forgotten corner. This PR fixes that.
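For reference, a minimal sketch of what a `Q8_2`-style block with an `int16_t` sum looks like. The struct and helper names (`block_q8_2_sketch`, `QK8_2`, `bf16_to_f32`, `q8_2_scaled_sum`) are assumed for illustration and are not copied from the actual code.

```c
// Hedged sketch of a Q8_2-style block with the sum stored as int16_t
// (names assumed for illustration).
#include <stdint.h>

#define QK8_2 32

typedef struct {
    uint16_t d;          // block scale as bf16 bits (upper 16 bits of an fp32)
    int16_t  s;          // sum of the 32 int8 quants: |s| <= 32*127 = 4064, exact
    int8_t   qs[QK8_2];  // the quants
} block_q8_2_sketch;

// bf16 -> fp32: shift the 16 stored bits into the top of an IEEE fp32.
static inline float bf16_to_f32(uint16_t h) {
    union { uint32_t u; float f; } v = { (uint32_t)h << 16 };
    return v.f;
}

// Per-block correction term: scale * sum(quants), computed in fp32 at run time.
// With the sum kept as an exact integer, the only rounding left comes from the
// bf16 scale itself, which is what removes the Q4_0 PPL regression caused by a
// bf16-rounded sum.
static inline float q8_2_scaled_sum(const block_q8_2_sketch *b) {
    return bf16_to_f32(b->d) * (float)b->s;
}
```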