CUDA: Accelerate MXFP4 table lookup using __byte_perm
#15451
Conversation
While HIP also supports …
@IMbackK
On CUDA, __byte_perm(x, y, s) maps directly to the PRMT (byte permute) instruction on NVIDIA GPUs. That instruction can pick arbitrary bytes from two 32-bit registers and shuffle them into a single 32-bit result — in one cycle on modern NVIDIA SMs. So on NVIDIA, this is a native instruction.
HIP provides __byte_perm as part of its CUDA compatibility, but there’s no direct equivalent PRMT instruction on GCN/RDNA ISAs. The compiler has to emulate the operation: usually with a sequence of bitfield extract (BFE), shifts, masks, and ORs. That expands into multiple instructions (5–12+ depending on pattern). Throughput and latency are much worse than the single-cycle CUDA version.
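For concreteness, the selection the intrinsic performs can be written out as a plain reference function. This is a host-side sketch of the documented __byte_perm behaviour, not ggml code; per the CUDA docs only the low 16 bits of the selector are used, 3 bits per result byte.

#include <cstdint>

// Reference semantics of __byte_perm(x, y, s): the eight source bytes are
// x (bytes 0-3) and y (bytes 4-7); result byte i is the source byte picked
// by the low 3 bits of the i-th nibble of s.
static uint32_t byte_perm_ref(uint32_t x, uint32_t y, uint32_t s) {
    const uint64_t src = ((uint64_t) y << 32) | x;   // bytes 0..7
    uint32_t d = 0;
    for (int i = 0; i < 4; ++i) {
        const uint32_t sel = (s >> (4*i)) & 0x7;     // only 3 bits per nibble are used
        d |= ((uint32_t) ((src >> (8*sel)) & 0xff)) << (8*i);
    }
    return d;
}

On NVIDIA GPUs the real intrinsic does all of this in a single PRMT instruction; the loop above is only a model of what that instruction computes.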
ggml/src/ggml-cuda/vecdotq.cuh
Outdated
}

static __device__ __forceinline__ int2 get_int_from_table_16(const int & q4, const int8_t * table) {
#if __CUDA_ARCH__ >= GGML_CUDA_CC_PASCAL
Why is the code only enabled for Pascal and newer? According to the PTX documentation the instruction should be available on all NVIDIA GPUs supported by llama.cpp/ggml.
I think you're right.
ggml/src/ggml-cuda/vecdotq.cuh
Outdated
    mask = (0x32103210 | ((q4 & 0x88888888) >> 1));
    v1 = __byte_perm(values[0], values[1], q4);
    v2 = __byte_perm(values[2], values[3], q4);
    v3 = __byte_perm(v1, v2, mask);
    v1 = __byte_perm(values[0], values[1], q4 >> 16);
    v2 = __byte_perm(values[2], values[3], q4 >> 16);
    v4 = __byte_perm(v1, v2, mask >> 16);

    return make_int2(__byte_perm(v3, v4, 0x6420), __byte_perm(v3, v4, 0x7531));
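As an aside for readers of this page, the hunk above can be read as follows. This is an editorial sketch, not the comments that were eventually added to the PR; it assumes `values` is the 16-entry int8 table reinterpreted as four 32-bit words, and it relies on __byte_perm using only the low 3 bits of each selector nibble, as documented.

static __device__ __forceinline__ int2 get_int_from_table_16_sketch(const int & q4, const int8_t * table) {
    // Assumption: the 16 int8 table entries are suitably aligned to be read as four
    // 32-bit words: values[0] = table[0..3], values[1] = table[4..7], and so on.
    const uint32_t * values = (const uint32_t *) table;

    // q4 packs eight 4-bit indices. Each index n is handled as (n & 7), used as a PRMT
    // byte selector, plus its bit 3 (n >= 8), used afterwards to pick the table half.
    // mask keeps bytes in place (0x3210) and sets selector bit 2 exactly where the index is >= 8.
    const uint32_t mask = 0x32103210 | ((q4 & 0x88888888) >> 1);

    // Low four nibbles of q4: look each index up in both halves of the table ...
    uint32_t v1 = __byte_perm(values[0], values[1], q4);   // byte i = table[(nibble i) & 7]
    uint32_t v2 = __byte_perm(values[2], values[3], q4);   // byte i = table[8 + ((nibble i) & 7)]
    // ... then pick the right half per byte: v3 byte i = table[nibble i].
    const uint32_t v3 = __byte_perm(v1, v2, mask);

    // Same for the high four nibbles of q4: v4 byte i = table[nibble 4 + i].
    v1 = __byte_perm(values[0], values[1], q4 >> 16);
    v2 = __byte_perm(values[2], values[3], q4 >> 16);
    const uint32_t v4 = __byte_perm(v1, v2, mask >> 16);

    // De-interleave: 0x6420 gathers the lookups of the even nibbles (the low nibble of each
    // byte of q4) into .x, 0x7531 gathers the odd nibbles (the high nibbles) into .y.
    return make_int2(__byte_perm(v3, v4, 0x6420), __byte_perm(v3, v4, 0x7531));
}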
Please add comments explaining how this works.
Please don't with the ChatGPT explanations. But in this case it happens to be correct, yes.
Actually, this PR mostly works because the CUDA compiler is doing pretty poorly here.
5090:
Using v_perm_b32 directly appears to provide some performance improvement on AMD GPUs as well.
Tested on an RX 9070 in WSL.
pr:
master:
That's exactly what I tried, and the perf impact was hugely negative on CDNA, so we need to gate this v_perm_b32 path.
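One subtlety when reaching for v_perm_b32 directly: the AMD instruction reads byte-wide selectors (one byte of the control word per result byte), whereas PRMT packs four 4-bit selectors into the low 16 bits, so the selector constants have to be rewritten. A minimal sketch of a gated path for the constant-selector de-interleave step from the hunk above, assuming clang's __builtin_amdgcn_perm lowers to v_perm_b32 (with selector bytes 0-3 picking from the second source and 4-7 from the first) and that GGML_USE_HIP marks the HIP build; the names are illustrative, not the PR's final code.

// Illustration only: result = { v3.byte0, v3.byte2, v4.byte0, v4.byte2 } on both backends.
static __device__ __forceinline__ uint32_t gather_even_bytes(uint32_t v3, uint32_t v4) {
#if defined(GGML_USE_HIP)
    return __builtin_amdgcn_perm(v4, v3, 0x06040200); // v_perm_b32: byte-wide selectors
#else
    return __byte_perm(v3, v4, 0x6420);               // PRMT: nibble-wide selectors
#endif
}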
These are, I think, bad comments. They don't actually explain how __byte_perm works to replace the lookup using indices, the data layout after each step, and why e.g. mask is being constructed the way it is. Just remove the comments again and fix the AMD code path; then I'll test performance on my own hardware and write the comments myself.
Actually, I retract that: while what you did and what I tried are almost the same (they only differ slightly in order), what I tried was perf-negative but what you added to the PR is performance-neutral or very slightly positive on CDNA. Strange.
Master:
This branch:
Waiting for merge!
7720366 to 741934c
I pushed a version with revised comments and a rebase on master. Performance
Performance has improved a lot.

Non-rigorous Performance Comparison for gpt-oss 20B MXFP4 MoE
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no

Batch Size Comparison
Same talk Comparison
* CUDA: optimize get_int_from_table_16
* CUDA: use v_perm_b32 to replace byte_perm on AMD GPUs
* revise documentation

---------

Co-authored-by: xix <[email protected]>
Co-authored-by: Johannes Gäßler <[email protected]>
…-org#15451)" This reverts commit 74f52f7.
This PR accelerates MXFP4 vector dot product calculations by optimizing the get_int_from_table_16 function. The previous table lookup logic is replaced with the more efficient __byte_perm CUDA intrinsic. This results in a significant performance improvement for gpt-oss using the MXFP4 format on NVIDIA GPUs.
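For context, the lookup being replaced can be pictured as plain per-nibble table indexing. The sketch below is a rough paraphrase for illustration, not the previous source; it matches the output layout of the __byte_perm version above.

// Each of the eight 4-bit fields in q4 indexes a 16-entry int8 table; the low nibbles of
// q4's four bytes form the first output int, the high nibbles form the second.
static __device__ __forceinline__ int2 get_int_from_table_16_scalar(const int & q4, const int8_t * table) {
    int2 res;
    int8_t * lo = (int8_t *) &res.x;
    int8_t * hi = (int8_t *) &res.y;
    for (int i = 0; i < 4; ++i) {
        lo[i] = table[(q4 >> (8*i))     & 0x0F]; // low  nibble of byte i
        hi[i] = table[(q4 >> (8*i + 4)) & 0x0F]; // high nibble of byte i
    }
    return res;
}

Each iteration costs bit-field extracts plus a table access, which is the work the PRMT-based path collapses into a handful of byte-permute instructions.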
Benchmarks (RTX 3080 Ti, MXFP4 gpt-oss-20b):
Master Branch:
This PR: