CUDA: faster prompt processing for 4-bit quants #713
Merged
There is this PR in mainline llama.cpp. It uses the `__byte_perm` Nvidia intrinsic to more efficiently assemble a 32-bit integer when each byte requires a lookup in a table of 16 values. These are the newly added `MXFP4`, along with `IQ4_NL`, `IQ4_XS`, `IQ4_KS`, `IQ4_KS_R4`, and `IQ4_KSS`. I had noticed the `__byte_perm` instruction when searching for other SIMD intrinsics in the CUDA manual and had made a mental note to investigate using it for handling lookup tables, but mainline PR 15451 already did it, so I could just take it from there.
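To make the trick concrete, here is a minimal sketch of the idea. It is my own illustration, not the code from this PR or from mainline PR 15451; the helper names and the packing of the 16 table bytes into `t0..t3` are assumptions.

```cpp
#include <cstdint>

// Illustrative sketch only (not the actual kernel code).
// Assumption: the 16 table bytes are packed little-endian into four 32-bit
// registers t0..t3 (t0 holds entries 0..3, t1 entries 4..7, etc.), and `idx`
// holds four 4-bit indices, one per byte, each in 0..15.
__device__ __forceinline__ uint32_t lookup16_bytes(uint32_t idx,
        uint32_t t0, uint32_t t1, uint32_t t2, uint32_t t3) {
    // __byte_perm(x, y, s) picks one of the 8 bytes of {y:x} per selector
    // nibble. Only the low 3 bits of each nibble select a byte; bit 3 is
    // masked off so it cannot trigger the sign-replication mode of the
    // underlying prmt instruction.
    const uint32_t sel = idx & 0x07070707u;
    const uint32_t lo  = __byte_perm(t0, t1, sel);  // lookup among entries 0..7
    const uint32_t hi  = __byte_perm(t2, t3, sel);  // lookup among entries 8..15
    // Per-byte mask that is 0xFF wherever the index is >= 8, then blend.
    const uint32_t mask = ((idx >> 3) & 0x01010101u) * 0xFFu;
    return (lo & ~mask) | (hi & mask);
}

// Usage sketch for a 32-bit word q4 packing eight 4-bit quants: the low
// nibbles yield one 32-bit result, the high nibbles another.
__device__ __forceinline__ void unpack8(uint32_t q4,
        uint32_t t0, uint32_t t1, uint32_t t2, uint32_t t3,
        uint32_t & v0, uint32_t & v1) {
    v0 = lookup16_bytes((q4 >> 0) & 0x0F0F0F0Fu, t0, t1, t2, t3);
    v1 = lookup16_bytes((q4 >> 4) & 0x0F0F0F0Fu, t0, t1, t2, t3);
}
```

The point is that four table bytes are assembled with two `__byte_perm` instructions plus a cheap per-byte blend, instead of four separate scalar table loads and byte insertions.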
Because `IQ4_K` (and `IQ4_K_R4`) use blocks of 16, so that the first 2 and the second 2 bytes require different lookup tables, the trick is not directly applicable to these quantization types.

Here is a quick performance comparison between the main branch and this PR on RTX-4080:
We see noticeable gains for PP. TG is severely memory-bandwidth limited on the 4080, so the impact there is much smaller (but I wouldn't be surprised if the gain is somewhat larger on a 4090 or 5090).
As I had already implemented an optimized table lookup for the IQK quants, their PP performance gains are somewhat lower than those for `MXFP4` and `IQ4_NL`.