
Conversation

@Qeeweew Qeeweew (Contributor) commented Aug 20, 2025

This PR accelerates MXFP4 vector dot product calculations by optimizing the get_int_from_table_16 function.

The previous table lookup logic is replaced with the more efficient __byte_perm CUDA intrinsic.

This results in a significant performance improvement for gpt-oss using the MXFP4 format on NVIDIA GPUs.
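
For context, the kind of per-nibble lookup being replaced looks roughly like the sketch below. This is a hypothetical reference version written for illustration only; the function name, loop structure, and return convention are assumptions of this write-up, not the code from master.

// Hypothetical reference version: resolve eight packed 4-bit indices one at a
// time. 'q4' packs eight indices, 'table' points at the 16 int8 table entries.
static __device__ __forceinline__ int2 get_int_from_table_16_ref(const int q4, const int8_t * table) {
    unsigned int vlo = 0;
    unsigned int vhi = 0;
    for (int i = 0; i < 4; ++i) {
        const unsigned int b = (q4 >> (8*i)) & 0xFF;
        vlo |= (table[b & 0x0F] & 0xFFu) << (8*i); // low  nibble of byte i
        vhi |= (table[b >>   4] & 0xFFu) << (8*i); // high nibble of byte i
    }
    return make_int2((int) vlo, (int) vhi);
}

Eight dependent table loads per call are what the __byte_perm version replaces with a handful of register-only byte permutes.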

Benchmarks (RTX 3080 Ti, MXFP4 gpt-oss-20b):

Master Branch:

CUDA_VISIBLE_DEVICES=1 ./build_origin/bin/llama-bench -m ~/workspace/models/gpt-oss-20b/gpt-oss-20b-mxfp4.gguf -fa 1 
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3080 Ti, compute capability 8.6, VMM: yes
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | CUDA       |  99 |  1 |           pp512 |      3719.22 ± 51.68 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | CUDA       |  99 |  1 |           tg128 |        153.88 ± 0.26 |

build: a094f3814 (6210)

This PR:

CUDA_VISIBLE_DEVICES=1 ./build/bin/llama-bench -m ~/workspace/models/gpt-oss-20b/gpt-oss-20b-mxfp4.gguf -fa 1 
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3080 Ti, compute capability 8.6, VMM: yes
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | CUDA       |  99 |  1 |           pp512 |      4127.83 ± 29.88 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | CUDA       |  99 |  1 |           tg128 |        176.38 ± 0.12 |

build: bd4b6cd23 (6211)

@github-actions github-actions bot added the Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels on Aug 20, 2025
@IMbackK IMbackK (Collaborator) commented Aug 20, 2025

While HIP also supports __byte_perm, the performance impact is negative there.

@lovedheart commented:

@IMbackK
ChatGPT's answer:

  1. What __byte_perm does

On CUDA, __byte_perm(x, y, s) maps directly to the PRMT (byte permute) instruction on NVIDIA GPUs.

That instruction can pick arbitrary bytes from two 32-bit registers and shuffle them into a single 32-bit result — in one cycle on modern NVIDIA SMs.

So on NVIDIA, this is a native instruction.

  2. On AMD / HIP

HIP provides __byte_perm as part of its CUDA compatibility, but there’s no direct equivalent PRMT instruction on GCN/RDNA ISAs.

The compiler has to emulate the operation: usually with a sequence of bitfield extract (BFE), shifts, masks, and ORs.

That expands into multiple instructions (5–12+ depending on pattern).

Throughput and latency are much worse than the single-cycle CUDA version.
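
To make the selector semantics above concrete, here is a minimal, self-contained CUDA sketch (illustrative byte values and kernel, not code from this PR). Per the CUDA documentation, __byte_perm reads only the low 3 bits of each selector nibble, so each nibble indexes one of the 8 bytes in {y, x}:

#include <cstdio>

__global__ void byte_perm_demo() {
    const unsigned int x = 0x33221100;
    const unsigned int y = 0x77665544;

    // Selector nibbles 1,3,5,7 pick bytes 1 and 3 of x and bytes 1 and 3 of y:
    // the result is 0x77553311.
    const unsigned int shuffled = __byte_perm(x, y, 0x7531);

    // The same instruction doubles as four parallel lookups into an 8-entry
    // byte table packed into two registers: each selector nibble (0-7) is a
    // table index, so four indices are resolved in a single instruction.
    const unsigned int table_lo = 0x03020100; // entries 0-3
    const unsigned int table_hi = 0x0C080604; // entries 4-7
    const unsigned int indices  = 0x00070201; // look up entries 1, 2, 7, 0
    const unsigned int gathered = __byte_perm(table_lo, table_hi, indices);

    printf("shuffled = %08x, gathered = %08x\n", shuffled, gathered); // 77553311, 000c0201
}

int main() {
    byte_perm_demo<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}

A 16-entry table needs two such 8-byte pools plus one more __byte_perm per output word to select between them, which is essentially what the code discussed below does.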

}

static __device__ __forceinline__ int2 get_int_from_table_16(const int & q4, const int8_t * table) {
#if __CUDA_ARCH__ >= GGML_CUDA_CC_PASCAL
Collaborator

Why is the code only enabled for Pascal and newer? According to the PTX documentation the instruction should be available on all NVIDIA GPUs supported by llama.cpp/ggml.

@Qeeweew Qeeweew (Contributor, Author) Aug 21, 2025

I think you're right.

Comment on lines 36 to 44
mask = (0x32103210 | ((q4 & 0x88888888) >> 1));
v1 = __byte_perm(values[0], values[1], q4);
v2 = __byte_perm(values[2], values[3], q4);
v3 = __byte_perm(v1, v2, mask);
v1 = __byte_perm(values[0], values[1], q4 >> 16);
v2 = __byte_perm(values[2], values[3], q4 >> 16);
v4 = __byte_perm(v1, v2, mask >> 16);

return make_int2(__byte_perm(v3, v4, 0x6420), __byte_perm(v3, v4, 0x7531));
Collaborator

Please add comments explaining how this works.
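
A hedged reading of the sequence above, based on the CUDA __byte_perm documentation. The annotations assume that values[0..3] pack the 16 int8 table entries four bytes per register and that q4 packs eight 4-bit indices; they are an interpretation added for this write-up, not the comments that eventually landed in the PR.

// Interpretation only. __byte_perm reads just the low 3 bits of each selector
// nibble, so q4 can be used as a selector directly. Bit 3 of each index
// ("entry is in the upper half of the table") is moved onto bit 2 of a 0..3
// base selector, so mask can later pick between the two half-table lookups.
mask = (0x32103210 | ((q4 & 0x88888888) >> 1));
// Look the four low indices of q4 up in both halves of the table ...
v1 = __byte_perm(values[0], values[1], q4);         // entries 0..7
v2 = __byte_perm(values[2], values[3], q4);         // entries 8..15
// ... and keep, per byte, whichever half the index actually points into.
v3 = __byte_perm(v1, v2, mask);
// Same again for the four high indices of q4.
v1 = __byte_perm(values[0], values[1], q4 >> 16);
v2 = __byte_perm(values[2], values[3], q4 >> 16);
v4 = __byte_perm(v1, v2, mask >> 16);

// v3 holds the bytes for indices 0..3, v4 those for indices 4..7; 0x6420
// gathers the even-numbered (low-nibble) results into .x and 0x7531 the
// odd-numbered (high-nibble) results into .y.
return make_int2(__byte_perm(v3, v4, 0x6420), __byte_perm(v3, v4, 0x7531));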

@IMbackK IMbackK (Collaborator) commented Aug 20, 2025

> ChatGPT's answer: On CUDA, __byte_perm(x, y, s) maps directly to the PRMT (byte permute) instruction [...] HIP provides __byte_perm as part of its CUDA compatibility, but there's no direct equivalent PRMT instruction on GCN/RDNA ISAs. [...]

Please don't with the ChatGPT explanations. But in this case it happens to be correct, yes.

@IMbackK IMbackK (Collaborator) commented Aug 20, 2025

Actually, while __byte_perm is indeed implemented as bit shifts in HIP, AMDGCN has a similar instruction that could be used: V_PERM_B32. The compiler is already using V_PERM_B32 for this function, and a quick implementation that uses V_PERM_B32 by hand, like this PR does for NVIDIA, is worse than what the compiler is doing.

This PR mostly works because the CUDA compiler is doing pretty poorly here.

@slaren slaren (Member) commented Aug 20, 2025

5090:

| Model                     | Test  | t/s master | t/s byte_perm_opt | Speedup |
| ------------------------- | ----- | ---------: | ----------------: | ------: |
| gpt-oss 20B MXFP4 MoE     | pp512 |    7903.90 |           8684.19 |    1.10 |
| gpt-oss 20B MXFP4 MoE     | tg128 |     252.77 |            265.05 |    1.05 |
| llama 7B IQ4_NL - 4.5 bpw | pp512 |   14013.73 |          15617.30 |    1.11 |
| llama 7B IQ4_NL - 4.5 bpw | tg128 |     246.33 |            250.93 |    1.02 |

@Qeeweew Qeeweew (Contributor, Author) commented Aug 21, 2025

Using v_perm_b32 directly appears to provide some performance improvement on AMD GPUs as well.

./test-backend-ops perf -o MUL_MAT

Tested on an RX 9070 in WSL.

PR:

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon RX 9070, gfx1201 (0x1201), VMM: no, Wave Size: 32
Testing 2 devices

Backend 1/2: ROCm0
  Device description: AMD Radeon RX 9070
  Device memory: 16304 MB (13121 MB free)
  
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              28968 runs -    35.34 us/run - 117.44 MFLOP/run -   3.32 TFLOPS
  MUL_MAT(type_a=mxfp4,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):             27264 runs -    36.85 us/run - 117.44 MFLOP/run -   3.19 TFLOPS

Master:

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon RX 9070, gfx1201 (0x1201), VMM: no, Wave Size: 32
Testing 2 devices

Backend 1/2: ROCm0
  Device description: AMD Radeon RX 9070
  Device memory: 16304 MB (13121 MB free)

  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              29820 runs -    34.26 us/run - 117.44 MFLOP/run -   3.43 TFLOPS
  MUL_MAT(type_a=mxfp4,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):             18744 runs -    55.62 us/run - 117.44 MFLOP/run -   2.11 TFLOPS

@IMbackK IMbackK (Collaborator) commented Aug 21, 2025


That's exactly what I tried and the perf impact was hugely negative on CDNA, so we need to gate this v_perm_b32 path.

@JohannesGaessler JohannesGaessler (Collaborator) left a comment

I think these are bad comments. They don't actually explain how __byte_perm works to replace the lookup using indices, the data layout after each step, or why e.g. mask is being constructed the way it is. Just remove the comments again and fix the AMD code path, then I'll test performance on my own hardware and write the comments myself.

@IMbackK IMbackK (Collaborator) commented Aug 21, 2025

> That's exactly what I tried and the perf impact was hugely negative on CDNA, so we need to gate this v_perm_b32 path.

Actually, I retract that: while what you did and what I tried are almost the same (they only differ slightly in order), what I tried was perf-negative but what you added to the PR is performance-neutral or very slightly positive on CDNA - strange.
Anyhow, no need to gate it after all.

@drrros drrros commented Aug 24, 2025

Master:

drros@tesla:~/llama.cpp$ ./build/bin/llama-bench --model /mnt/ds1nfs/codellamaweights/gpt-oss-20b-mxfp4.gguf -t 1 -fa 1 -b 4096 -ub 4096 -p 2048
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA RTX A5000, compute capability 8.6, VMM: yes
  Device 1: NVIDIA RTX A5000, compute capability 8.6, VMM: yes
| model                          |       size |     params | backend    | ngl | threads | n_batch | n_ubatch | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ------: | -------: | -: | --------------: | -------------------: |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | CUDA       |  99 |       1 |    4096 |     4096 |  1 |          pp2048 |       3830.22 ± 6.69 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | CUDA       |  99 |       1 |    4096 |     4096 |  1 |           tg128 |        138.25 ± 0.17 |

build: b730706a (6263)

This branch:

drros@tesla:~/llama.cpp$ ./build/bin/llama-bench --model /mnt/ds1nfs/codellamaweights/gpt-oss-20b-mxfp4.gguf -t 1 -fa 1 -b 4096 -ub 4096 -p 2048
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA RTX A5000, compute capability 8.6, VMM: yes
  Device 1: NVIDIA RTX A5000, compute capability 8.6, VMM: yes
| model                          |       size |     params | backend    | ngl | threads | n_batch | n_ubatch | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ------: | -------: | -: | --------------: | -------------------: |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | CUDA       |  99 |       1 |    4096 |     4096 |  1 |          pp2048 |       4083.86 ± 7.69 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | CUDA       |  99 |       1 |    4096 |     4096 |  1 |           tg128 |        158.43 ± 0.22 |

build: 7720366f (6212)

Waiting for merge!

@JohannesGaessler JohannesGaessler (Collaborator) commented:

I pushed a version with revised comments and a rebase on master.

Performance
| GPU         | Model                  | Microbatch size | Test  | t/s master | t/s 741934c | Speedup |
| ----------- | ---------------------- | --------------: | ----- | ---------: | ----------: | ------: |
| P40         | gpt-oss 20B MXFP4 MoE  |               1 | pp512 |      63.60 |       81.98 |    1.29 |
| P40         | gpt-oss 20B MXFP4 MoE  |               2 | pp512 |      73.88 |      100.61 |    1.36 |
| P40         | gpt-oss 20B MXFP4 MoE  |               4 | pp512 |     114.14 |      156.89 |    1.37 |
| P40         | gpt-oss 20B MXFP4 MoE  |               8 | pp512 |     173.84 |      238.82 |    1.37 |
| P40         | gpt-oss 20B MXFP4 MoE  |              16 | pp512 |     267.70 |      393.48 |    1.47 |
| P40         | gpt-oss 20B MXFP4 MoE  |              32 | pp512 |     385.46 |      482.63 |    1.25 |
| P40         | gpt-oss 20B MXFP4 MoE  |              64 | pp512 |     497.37 |      581.00 |    1.17 |
| P40         | gpt-oss 20B MXFP4 MoE  |             128 | pp512 |     736.88 |      849.49 |    1.15 |
| P40         | gpt-oss 20B MXFP4 MoE  |             256 | pp512 |    1024.58 |     1166.61 |    1.14 |
| P40         | gpt-oss 20B MXFP4 MoE  |             512 | pp512 |    1248.05 |     1398.47 |    1.12 |
| 3x P40      | gpt-oss 120B MXFP4 MoE |               1 | pp512 |      58.18 |       74.92 |    1.29 |
| 3x P40      | gpt-oss 120B MXFP4 MoE |               2 | pp512 |      59.81 |       79.02 |    1.32 |
| 3x P40      | gpt-oss 120B MXFP4 MoE |               4 | pp512 |      89.75 |      120.83 |    1.35 |
| 3x P40      | gpt-oss 120B MXFP4 MoE |               8 | pp512 |     129.05 |      176.00 |    1.36 |
| 3x P40      | gpt-oss 120B MXFP4 MoE |              16 | pp512 |     183.17 |      272.22 |    1.49 |
| 3x P40      | gpt-oss 120B MXFP4 MoE |              32 | pp512 |     232.30 |      293.13 |    1.26 |
| 3x P40      | gpt-oss 120B MXFP4 MoE |              64 | pp512 |     260.81 |      306.50 |    1.18 |
| 3x P40      | gpt-oss 120B MXFP4 MoE |             128 | pp512 |     347.60 |      405.86 |    1.17 |
| 3x P40      | gpt-oss 120B MXFP4 MoE |             256 | pp512 |     455.47 |      527.44 |    1.16 |
| 3x P40      | gpt-oss 120B MXFP4 MoE |             512 | pp512 |     515.93 |      590.76 |    1.15 |
| RTX 3090    | gpt-oss 20B MXFP4 MoE  |               1 | pp512 |     177.17 |      204.53 |    1.15 |
| RTX 3090    | gpt-oss 20B MXFP4 MoE  |               2 | pp512 |     164.40 |      192.74 |    1.17 |
| RTX 3090    | gpt-oss 20B MXFP4 MoE  |               4 | pp512 |     274.68 |      325.91 |    1.19 |
| RTX 3090    | gpt-oss 20B MXFP4 MoE  |               8 | pp512 |     447.18 |      529.76 |    1.18 |
| RTX 3090    | gpt-oss 20B MXFP4 MoE  |              16 | pp512 |     730.44 |      865.66 |    1.19 |
| RTX 3090    | gpt-oss 20B MXFP4 MoE  |              32 | pp512 |    1128.88 |     1347.95 |    1.19 |
| RTX 3090    | gpt-oss 20B MXFP4 MoE  |              64 | pp512 |    1570.72 |     1781.12 |    1.13 |
| RTX 3090    | gpt-oss 20B MXFP4 MoE  |             128 | pp512 |    2001.14 |     2175.27 |    1.09 |
| RTX 3090    | gpt-oss 20B MXFP4 MoE  |             256 | pp512 |    3027.16 |     3286.91 |    1.09 |
| RTX 3090    | gpt-oss 20B MXFP4 MoE  |             512 | pp512 |    3948.40 |     4260.13 |    1.08 |
| RTX 4090    | gpt-oss 20B MXFP4 MoE  |               1 | pp512 |     275.77 |      278.50 |    1.01 |
| RTX 4090    | gpt-oss 20B MXFP4 MoE  |               2 | pp512 |     253.80 |      282.86 |    1.11 |
| RTX 4090    | gpt-oss 20B MXFP4 MoE  |               4 | pp512 |     439.42 |      489.87 |    1.11 |
| RTX 4090    | gpt-oss 20B MXFP4 MoE  |               8 | pp512 |     737.88 |      816.85 |    1.11 |
| RTX 4090    | gpt-oss 20B MXFP4 MoE  |              16 | pp512 |    1231.27 |     1367.45 |    1.11 |
| RTX 4090    | gpt-oss 20B MXFP4 MoE  |              32 | pp512 |    2028.55 |     2315.94 |    1.14 |
| RTX 4090    | gpt-oss 20B MXFP4 MoE  |              64 | pp512 |    3032.18 |     3376.37 |    1.11 |
| RTX 4090    | gpt-oss 20B MXFP4 MoE  |             128 | pp512 |    4164.72 |     4513.75 |    1.08 |
| RTX 4090    | gpt-oss 20B MXFP4 MoE  |             256 | pp512 |    6452.21 |     7011.96 |    1.09 |
| RTX 4090    | gpt-oss 20B MXFP4 MoE  |             512 | pp512 |    8939.46 |     9737.98 |    1.09 |
| 3x RTX 4090 | gpt-oss 120B MXFP4 MoE |               1 | pp512 |     254.65 |      257.01 |    1.01 |
| 3x RTX 4090 | gpt-oss 120B MXFP4 MoE |               2 | pp512 |     216.32 |      242.79 |    1.12 |
| 3x RTX 4090 | gpt-oss 120B MXFP4 MoE |               4 | pp512 |     356.45 |      402.53 |    1.13 |
| 3x RTX 4090 | gpt-oss 120B MXFP4 MoE |               8 | pp512 |     560.90 |      630.59 |    1.12 |
| 3x RTX 4090 | gpt-oss 120B MXFP4 MoE |              16 | pp512 |     875.75 |      989.65 |    1.13 |
| 3x RTX 4090 | gpt-oss 120B MXFP4 MoE |              32 | pp512 |    1278.09 |     1490.90 |    1.17 |
| 3x RTX 4090 | gpt-oss 120B MXFP4 MoE |              64 | pp512 |    1663.85 |     1902.12 |    1.14 |
| 3x RTX 4090 | gpt-oss 120B MXFP4 MoE |             128 | pp512 |    1921.13 |     2121.58 |    1.10 |
| 3x RTX 4090 | gpt-oss 120B MXFP4 MoE |             256 | pp512 |    2808.07 |     3090.01 |    1.10 |
| 3x RTX 4090 | gpt-oss 120B MXFP4 MoE |             512 | pp512 |    3569.59 |     3928.72 |    1.10 |
| RX 6800     | gpt-oss 20B MXFP4 MoE  |               1 | pp512 |      75.25 |       78.37 |    1.04 |
| RX 6800     | gpt-oss 20B MXFP4 MoE  |               2 | pp512 |      71.04 |       78.84 |    1.11 |
| RX 6800     | gpt-oss 20B MXFP4 MoE  |               4 | pp512 |     109.57 |      119.37 |    1.09 |
| RX 6800     | gpt-oss 20B MXFP4 MoE  |               8 | pp512 |     157.45 |      169.47 |    1.08 |
| RX 6800     | gpt-oss 20B MXFP4 MoE  |              16 | pp512 |     232.47 |      262.01 |    1.13 |
| RX 6800     | gpt-oss 20B MXFP4 MoE  |              32 | pp512 |     312.19 |      335.07 |    1.07 |
| RX 6800     | gpt-oss 20B MXFP4 MoE  |              64 | pp512 |     374.56 |      389.75 |    1.04 |
| RX 6800     | gpt-oss 20B MXFP4 MoE  |             128 | pp512 |     546.24 |      566.68 |    1.04 |
| RX 6800     | gpt-oss 20B MXFP4 MoE  |             256 | pp512 |     714.39 |      737.59 |    1.03 |
| RX 6800     | gpt-oss 20B MXFP4 MoE  |             512 | pp512 |     786.63 |      807.11 |    1.03 |

@JohannesGaessler JohannesGaessler merged commit 74f52f7 into ggml-org:master Aug 25, 2025
44 checks passed
@mizuikk mizuikk commented Aug 26, 2025

Performance has improved a lot.

Non-rigorous Performance Comparison for gpt-oss 20B MXFP4 MoE

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: Tesla V100-SXM2-16GB, compute capability 7.0, VMM: yes
Device 1: Tesla V100-SXM2-16GB, compute capability 7.0, VMM: yes
load_backend: loaded CUDA backend from /app/libggml-cuda.so
load_backend: loaded CPU backend from /app/libggml-cpu-haswell.so

Batch Size Comparison

| n_batch | n_ubatch | Test    | t/s (Build e92734d (6250)) | t/s (Build 39842a7 (6281)) | Increment (%) |
| ------: | -------: | ------- | -------------------------: | -------------------------: | ------------: |
|    4096 |     2048 | pp2048  |           4033.44 ± 128.36 |           4080.68 ± 128.59 |          1.17 |
|    4096 |     2048 | pp8192  |             3572.80 ± 4.06 |             3608.89 ± 3.72 |          1.01 |
|    4096 |     2048 | pp16384 |             3043.20 ± 3.20 |             3078.45 ± 4.16 |          1.16 |
|    4096 |     2048 | pp32768 |             2352.16 ± 0.54 |             2382.57 ± 0.32 |          1.29 |
|    4096 |     2048 | tg128   |              125.71 ± 0.02 |              139.66 ± 0.02 |         11.10 |
|    4096 |     4096 | pp2048  |            4089.27 ± 11.14 |            4126.88 ± 10.73 |          0.92 |
|    4096 |     4096 | pp8192  |            3767.56 ± 10.57 |            3791.91 ± 10.15 |          0.65 |
|    4096 |     4096 | pp16384 |             3184.26 ± 2.14 |             3210.55 ± 1.21 |          0.83 |
|    4096 |     4096 | pp32768 |             2428.32 ± 1.06 |           2394.49 ± 107.44 |         -1.39 |
|    4096 |     4096 | tg128   |              125.55 ± 0.02 |              139.57 ± 0.05 |         11.16 |

Same talk Comparison

| t/s (Build e92734d (6250)) | t/s (Build 39842a7 (6281)) | Increment (%) |
| -------------------------: | -------------------------: | ------------: |
|                      66.76 |                      71.27 |          6.33 |

Minh141120 pushed a commit to menloresearch/llama.cpp that referenced this pull request Aug 27, 2025
* CUDA: optimize get_int_from_table_16

* CUDA: use v_perm_b32 to replace byte_perm on AMD GPUs

* revise documentation

---------

Co-authored-by: xix <[email protected]>
Co-authored-by: Johannes Gäßler <[email protected]>
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Oct 6, 2025