
Conversation

@Qeeweew Qeeweew (Contributor) commented Aug 20, 2025

This PR accelerates MXFP4 vector dot product calculations by optimizing the get_int_from_table_16 function.

The previous table lookup logic is replaced with the more efficient __byte_perm CUDA intrinsic.

This results in a significant performance improvement for gpt-oss using the MXFP4 format on NVIDIA GPUs.
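
For context, the kind of per-nibble lookup being replaced looks roughly like the sketch below. This is a hypothetical reference version written for illustration only; the function name, loop structure, and return convention are assumptions of this write-up, not the code from master.

// Hypothetical reference version: resolve eight packed 4-bit indices one at a
// time. 'q4' packs eight indices, 'table' points at the 16 int8 table entries.
static __device__ __forceinline__ int2 get_int_from_table_16_ref(const int q4, const int8_t * table) {
    unsigned int vlo = 0;
    unsigned int vhi = 0;
    for (int i = 0; i < 4; ++i) {
        const unsigned int b = (q4 >> (8*i)) & 0xFF;
        vlo |= (table[b & 0x0F] & 0xFFu) << (8*i); // low  nibble of byte i
        vhi |= (table[b >>   4] & 0xFFu) << (8*i); // high nibble of byte i
    }
    return make_int2((int) vlo, (int) vhi);
}

Eight dependent table loads per call are what the __byte_perm version replaces with a handful of register-only byte permutes.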

Benchmarks (RTX 3080 Ti, MXFP4 gpt-oss-20b):

Master Branch:

CUDA_VISIBLE_DEVICES=1 ./build_origin/bin/llama-bench -m ~/workspace/models/gpt-oss-20b/gpt-oss-20b-mxfp4.gguf -fa 1 
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3080 Ti, compute capability 8.6, VMM: yes
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | CUDA       |  99 |  1 |           pp512 |      3719.22 ± 51.68 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | CUDA       |  99 |  1 |           tg128 |        153.88 ± 0.26 |

build: a094f3814 (6210)

This PR:

CUDA_VISIBLE_DEVICES=1 ./build/bin/llama-bench -m ~/workspace/models/gpt-oss-20b/gpt-oss-20b-mxfp4.gguf -fa 1 
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3080 Ti, compute capability 8.6, VMM: yes
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | CUDA       |  99 |  1 |           pp512 |      4127.83 ± 29.88 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | CUDA       |  99 |  1 |           tg128 |        176.38 ± 0.12 |

build: bd4b6cd23 (6211)

@github-actions github-actions bot added the Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels on Aug 20, 2025
@IMbackK IMbackK (Collaborator) commented Aug 20, 2025

While HIP also supports __byte_perm, the performance impact is negative there.

@lovedheart commented:

@IMbackK
ChatGPT's answer:

  1. What __byte_perm does

On CUDA, __byte_perm(x, y, s) maps directly to the PRMT (byte permute) instruction on NVIDIA GPUs.

That instruction can pick arbitrary bytes from two 32-bit registers and shuffle them into a single 32-bit result — in one cycle on modern NVIDIA SMs.

So on NVIDIA, this is a native instruction.

  2. On AMD / HIP

HIP provides __byte_perm as part of its CUDA compatibility, but there’s no direct equivalent PRMT instruction on GCN/RDNA ISAs.

The compiler has to emulate the operation: usually with a sequence of bitfield extract (BFE), shifts, masks, and ORs.

That expands into multiple instructions (5–12+ depending on pattern).

Throughput and latency are much worse than the single-cycle CUDA version.
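
To make the selector semantics above concrete, here is a minimal, self-contained CUDA sketch (illustrative byte values and kernel, not code from this PR). Per the CUDA documentation, __byte_perm reads only the low 3 bits of each selector nibble, so each nibble indexes one of the 8 bytes in {y, x}:

#include <cstdio>

__global__ void byte_perm_demo() {
    const unsigned int x = 0x33221100;
    const unsigned int y = 0x77665544;

    // Selector nibbles 1,3,5,7 pick bytes 1 and 3 of x and bytes 1 and 3 of y:
    // the result is 0x77553311.
    const unsigned int shuffled = __byte_perm(x, y, 0x7531);

    // The same instruction doubles as four parallel lookups into an 8-entry
    // byte table packed into two registers: each selector nibble (0-7) is a
    // table index, so four indices are resolved in a single instruction.
    const unsigned int table_lo = 0x03020100; // entries 0-3
    const unsigned int table_hi = 0x0C080604; // entries 4-7
    const unsigned int indices  = 0x00070201; // look up entries 1, 2, 7, 0
    const unsigned int gathered = __byte_perm(table_lo, table_hi, indices);

    printf("shuffled = %08x, gathered = %08x\n", shuffled, gathered); // 77553311, 000c0201
}

int main() {
    byte_perm_demo<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}

A 16-entry table needs two such 8-byte pools plus one more __byte_perm per output word to select between them, which is essentially what the code discussed below does.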

}

static __device__ __forceinline__ int2 get_int_from_table_16(const int & q4, const int8_t * table) {
#if __CUDA_ARCH__ >= GGML_CUDA_CC_PASCAL
Collaborator

Why is the code only enabled for Pascal and newer? According to the PTX documentation the instruction should be available on all NVIDIA GPUs supported by llama.cpp/ggml.

@Qeeweew Qeeweew (Contributor, Author) Aug 21, 2025

I think you're right.

Comment on lines 36 to 44
mask = (0x32103210 | ((q4 & 0x88888888) >> 1));
v1 = __byte_perm(values[0], values[1], q4);
v2 = __byte_perm(values[2], values[3], q4);
v3 = __byte_perm(v1, v2, mask);
v1 = __byte_perm(values[0], values[1], q4 >> 16);
v2 = __byte_perm(values[2], values[3], q4 >> 16);
v4 = __byte_perm(v1, v2, mask >> 16);

return make_int2(__byte_perm(v3, v4, 0x6420), __byte_perm(v3, v4, 0x7531));
Collaborator

Please add comments explaining how this works.
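
A hedged reading of the sequence above, based on the CUDA __byte_perm documentation. The annotations assume that values[0..3] pack the 16 int8 table entries four bytes per register and that q4 packs eight 4-bit indices; they are an interpretation added for this write-up, not the comments that eventually landed in the PR.

// Interpretation only. __byte_perm reads just the low 3 bits of each selector
// nibble, so q4 can be used as a selector directly. Bit 3 of each index
// ("entry is in the upper half of the table") is moved onto bit 2 of a 0..3
// base selector, so mask can later pick between the two half-table lookups.
mask = (0x32103210 | ((q4 & 0x88888888) >> 1));
// Look the four low indices of q4 up in both halves of the table ...
v1 = __byte_perm(values[0], values[1], q4);         // entries 0..7
v2 = __byte_perm(values[2], values[3], q4);         // entries 8..15
// ... and keep, per byte, whichever half the index actually points into.
v3 = __byte_perm(v1, v2, mask);
// Same again for the four high indices of q4.
v1 = __byte_perm(values[0], values[1], q4 >> 16);
v2 = __byte_perm(values[2], values[3], q4 >> 16);
v4 = __byte_perm(v1, v2, mask >> 16);

// v3 holds the bytes for indices 0..3, v4 those for indices 4..7; 0x6420
// gathers the even-numbered (low-nibble) results into .x and 0x7531 the
// odd-numbered (high-nibble) results into .y.
return make_int2(__byte_perm(v3, v4, 0x6420), __byte_perm(v3, v4, 0x7531));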

@IMbackK IMbackK (Collaborator) commented Aug 20, 2025

> ChatGPT's answer: On CUDA, __byte_perm(x, y, s) maps directly to the PRMT (byte permute) instruction [...] HIP provides __byte_perm as part of its CUDA compatibility, but there's no direct equivalent PRMT instruction on GCN/RDNA ISAs. [...]

Please don't with the ChatGPT explanations. But in this case it happens to be correct, yes.

@IMbackK IMbackK (Collaborator) commented Aug 20, 2025

Actually, while __byte_perm is indeed implemented as bit shifts in HIP, AMDGCN has a similar instruction that could be used: V_PERM_B32. The compiler is already using V_PERM_B32 for this function, and a quick implementation that uses V_PERM_B32 by hand, like this PR does for NVIDIA, is worse than what the compiler is doing.

This PR mostly works because the CUDA compiler is doing pretty poorly here.

@slaren slaren (Member) commented Aug 20, 2025

5090:

| Model                     | Test  | t/s master | t/s byte_perm_opt | Speedup |
| ------------------------- | ----- | ---------: | ----------------: | ------: |
| gpt-oss 20B MXFP4 MoE     | pp512 |    7903.90 |           8684.19 |    1.10 |
| gpt-oss 20B MXFP4 MoE     | tg128 |     252.77 |            265.05 |    1.05 |
| llama 7B IQ4_NL - 4.5 bpw | pp512 |   14013.73 |          15617.30 |    1.11 |
| llama 7B IQ4_NL - 4.5 bpw | tg128 |     246.33 |            250.93 |    1.02 |

@Qeeweew Qeeweew (Contributor, Author) commented Aug 21, 2025

Using v_perm_b32 directly appears to provide some performance improvement on AMD GPUs as well.

./test-backend-ops perf -o MUL_MAT

Tested on an RX 9070 in WSL.

PR:

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon RX 9070, gfx1201 (0x1201), VMM: no, Wave Size: 32
Testing 2 devices

Backend 1/2: ROCm0
  Device description: AMD Radeon RX 9070
  Device memory: 16304 MB (13121 MB free)
  
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              28968 runs -    35.34 us/run - 117.44 MFLOP/run -   3.32 TFLOPS
  MUL_MAT(type_a=mxfp4,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):             27264 runs -    36.85 us/run - 117.44 MFLOP/run -   3.19 TFLOPS

Master:

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon RX 9070, gfx1201 (0x1201), VMM: no, Wave Size: 32
Testing 2 devices

Backend 1/2: ROCm0
  Device description: AMD Radeon RX 9070
  Device memory: 16304 MB (13121 MB free)

  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              29820 runs -    34.26 us/run - 117.44 MFLOP/run -   3.43 TFLOPS
  MUL_MAT(type_a=mxfp4,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):             18744 runs -    55.62 us/run - 117.44 MFLOP/run -   2.11 TFLOPS

@IMbackK IMbackK (Collaborator) commented Aug 21, 2025


That's exactly what I tried and the perf impact was hugely negative on CDNA, so we need to gate this v_perm_b32 path.

@JohannesGaessler JohannesGaessler (Collaborator) left a comment

I think these are bad comments. They don't actually explain how __byte_perm works to replace the lookup using indices, the data layout after each step, or why e.g. mask is being constructed the way it is. Just remove the comments again and fix the AMD code path, then I'll test performance on my own hardware and write the comments myself.

@IMbackK IMbackK (Collaborator) commented Aug 21, 2025

> That's exactly what I tried and the perf impact was hugely negative on CDNA, so we need to gate this v_perm_b32 path.

Actually, I retract that: while what you did and what I tried are almost the same (they only differ slightly in order), what I tried was perf-negative but what you added to the PR is performance-neutral or very slightly positive on CDNA - strange.
Anyhow, no need to gate it after all.

@drrros drrros commented Aug 24, 2025

Master:

drros@tesla:~/llama.cpp$ ./build/bin/llama-bench --model /mnt/ds1nfs/codellamaweights/gpt-oss-20b-mxfp4.gguf -t 1 -fa 1 -b 4096 -ub 4096 -p 2048
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA RTX A5000, compute capability 8.6, VMM: yes
  Device 1: NVIDIA RTX A5000, compute capability 8.6, VMM: yes
| model                          |       size |     params | backend    | ngl | threads | n_batch | n_ubatch | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ------: | -------: | -: | --------------: | -------------------: |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | CUDA       |  99 |       1 |    4096 |     4096 |  1 |          pp2048 |       3830.22 ± 6.69 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | CUDA       |  99 |       1 |    4096 |     4096 |  1 |           tg128 |        138.25 ± 0.17 |

build: b730706a (6263)

This branch:

drros@tesla:~/llama.cpp$ ./build/bin/llama-bench --model /mnt/ds1nfs/codellamaweights/gpt-oss-20b-mxfp4.gguf -t 1 -fa 1 -b 4096 -ub 4096 -p 2048
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA RTX A5000, compute capability 8.6, VMM: yes
  Device 1: NVIDIA RTX A5000, compute capability 8.6, VMM: yes
| model                          |       size |     params | backend    | ngl | threads | n_batch | n_ubatch | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ------: | -------: | -: | --------------: | -------------------: |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | CUDA       |  99 |       1 |    4096 |     4096 |  1 |          pp2048 |       4083.86 ± 7.69 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | CUDA       |  99 |       1 |    4096 |     4096 |  1 |           tg128 |        158.43 ± 0.22 |

build: 7720366f (6212)

Waiting for merge!

@JohannesGaessler JohannesGaessler (Collaborator) commented:

I pushed a version with revised comments and a rebase on master.

Performance
| GPU         | Model                  | Microbatch size | Test  | t/s master | t/s 741934c | Speedup |
| ----------- | ---------------------- | --------------: | ----- | ---------: | ----------: | ------: |
| P40         | gpt-oss 20B MXFP4 MoE  |               1 | pp512 |      63.60 |       81.98 |    1.29 |
| P40         | gpt-oss 20B MXFP4 MoE  |               2 | pp512 |      73.88 |      100.61 |    1.36 |
| P40         | gpt-oss 20B MXFP4 MoE  |               4 | pp512 |     114.14 |      156.89 |    1.37 |
| P40         | gpt-oss 20B MXFP4 MoE  |               8 | pp512 |     173.84 |      238.82 |    1.37 |
| P40         | gpt-oss 20B MXFP4 MoE  |              16 | pp512 |     267.70 |      393.48 |    1.47 |
| P40         | gpt-oss 20B MXFP4 MoE  |              32 | pp512 |     385.46 |      482.63 |    1.25 |
| P40         | gpt-oss 20B MXFP4 MoE  |              64 | pp512 |     497.37 |      581.00 |    1.17 |
| P40         | gpt-oss 20B MXFP4 MoE  |             128 | pp512 |     736.88 |      849.49 |    1.15 |
| P40         | gpt-oss 20B MXFP4 MoE  |             256 | pp512 |    1024.58 |     1166.61 |    1.14 |
| P40         | gpt-oss 20B MXFP4 MoE  |             512 | pp512 |    1248.05 |     1398.47 |    1.12 |
| 3x P40      | gpt-oss 120B MXFP4 MoE |               1 | pp512 |      58.18 |       74.92 |    1.29 |
| 3x P40      | gpt-oss 120B MXFP4 MoE |               2 | pp512 |      59.81 |       79.02 |    1.32 |
| 3x P40      | gpt-oss 120B MXFP4 MoE |               4 | pp512 |      89.75 |      120.83 |    1.35 |
| 3x P40      | gpt-oss 120B MXFP4 MoE |               8 | pp512 |     129.05 |      176.00 |    1.36 |
| 3x P40      | gpt-oss 120B MXFP4 MoE |              16 | pp512 |     183.17 |      272.22 |    1.49 |
| 3x P40      | gpt-oss 120B MXFP4 MoE |              32 | pp512 |     232.30 |      293.13 |    1.26 |
| 3x P40      | gpt-oss 120B MXFP4 MoE |              64 | pp512 |     260.81 |      306.50 |    1.18 |
| 3x P40      | gpt-oss 120B MXFP4 MoE |             128 | pp512 |     347.60 |      405.86 |    1.17 |
| 3x P40      | gpt-oss 120B MXFP4 MoE |             256 | pp512 |     455.47 |      527.44 |    1.16 |
| 3x P40      | gpt-oss 120B MXFP4 MoE |             512 | pp512 |     515.93 |      590.76 |    1.15 |
| RTX 3090    | gpt-oss 20B MXFP4 MoE  |               1 | pp512 |     177.17 |      204.53 |    1.15 |
| RTX 3090    | gpt-oss 20B MXFP4 MoE  |               2 | pp512 |     164.40 |      192.74 |    1.17 |
| RTX 3090    | gpt-oss 20B MXFP4 MoE  |               4 | pp512 |     274.68 |      325.91 |    1.19 |
| RTX 3090    | gpt-oss 20B MXFP4 MoE  |               8 | pp512 |     447.18 |      529.76 |    1.18 |
| RTX 3090    | gpt-oss 20B MXFP4 MoE  |              16 | pp512 |     730.44 |      865.66 |    1.19 |
| RTX 3090    | gpt-oss 20B MXFP4 MoE  |              32 | pp512 |    1128.88 |     1347.95 |    1.19 |
| RTX 3090    | gpt-oss 20B MXFP4 MoE  |              64 | pp512 |    1570.72 |     1781.12 |    1.13 |
| RTX 3090    | gpt-oss 20B MXFP4 MoE  |             128 | pp512 |    2001.14 |     2175.27 |    1.09 |
| RTX 3090    | gpt-oss 20B MXFP4 MoE  |             256 | pp512 |    3027.16 |     3286.91 |    1.09 |
| RTX 3090    | gpt-oss 20B MXFP4 MoE  |             512 | pp512 |    3948.40 |     4260.13 |    1.08 |
| RTX 4090    | gpt-oss 20B MXFP4 MoE  |               1 | pp512 |     275.77 |      278.50 |    1.01 |
| RTX 4090    | gpt-oss 20B MXFP4 MoE  |               2 | pp512 |     253.80 |      282.86 |    1.11 |
| RTX 4090    | gpt-oss 20B MXFP4 MoE  |               4 | pp512 |     439.42 |      489.87 |    1.11 |
| RTX 4090    | gpt-oss 20B MXFP4 MoE  |               8 | pp512 |     737.88 |      816.85 |    1.11 |
| RTX 4090    | gpt-oss 20B MXFP4 MoE  |              16 | pp512 |    1231.27 |     1367.45 |    1.11 |
| RTX 4090    | gpt-oss 20B MXFP4 MoE  |              32 | pp512 |    2028.55 |     2315.94 |    1.14 |
| RTX 4090    | gpt-oss 20B MXFP4 MoE  |              64 | pp512 |    3032.18 |     3376.37 |    1.11 |
| RTX 4090    | gpt-oss 20B MXFP4 MoE  |             128 | pp512 |    4164.72 |     4513.75 |    1.08 |
| RTX 4090    | gpt-oss 20B MXFP4 MoE  |             256 | pp512 |    6452.21 |     7011.96 |    1.09 |
| RTX 4090    | gpt-oss 20B MXFP4 MoE  |             512 | pp512 |    8939.46 |     9737.98 |    1.09 |
| 3x RTX 4090 | gpt-oss 120B MXFP4 MoE |               1 | pp512 |     254.65 |      257.01 |    1.01 |
| 3x RTX 4090 | gpt-oss 120B MXFP4 MoE |               2 | pp512 |     216.32 |      242.79 |    1.12 |
| 3x RTX 4090 | gpt-oss 120B MXFP4 MoE |               4 | pp512 |     356.45 |      402.53 |    1.13 |
| 3x RTX 4090 | gpt-oss 120B MXFP4 MoE |               8 | pp512 |     560.90 |      630.59 |    1.12 |
| 3x RTX 4090 | gpt-oss 120B MXFP4 MoE |              16 | pp512 |     875.75 |      989.65 |    1.13 |
| 3x RTX 4090 | gpt-oss 120B MXFP4 MoE |              32 | pp512 |    1278.09 |     1490.90 |    1.17 |
| 3x RTX 4090 | gpt-oss 120B MXFP4 MoE |              64 | pp512 |    1663.85 |     1902.12 |    1.14 |
| 3x RTX 4090 | gpt-oss 120B MXFP4 MoE |             128 | pp512 |    1921.13 |     2121.58 |    1.10 |
| 3x RTX 4090 | gpt-oss 120B MXFP4 MoE |             256 | pp512 |    2808.07 |     3090.01 |    1.10 |
| 3x RTX 4090 | gpt-oss 120B MXFP4 MoE |             512 | pp512 |    3569.59 |     3928.72 |    1.10 |
| RX 6800     | gpt-oss 20B MXFP4 MoE  |               1 | pp512 |      75.25 |       78.37 |    1.04 |
| RX 6800     | gpt-oss 20B MXFP4 MoE  |               2 | pp512 |      71.04 |       78.84 |    1.11 |
| RX 6800     | gpt-oss 20B MXFP4 MoE  |               4 | pp512 |     109.57 |      119.37 |    1.09 |
| RX 6800     | gpt-oss 20B MXFP4 MoE  |               8 | pp512 |     157.45 |      169.47 |    1.08 |
| RX 6800     | gpt-oss 20B MXFP4 MoE  |              16 | pp512 |     232.47 |      262.01 |    1.13 |
| RX 6800     | gpt-oss 20B MXFP4 MoE  |              32 | pp512 |     312.19 |      335.07 |    1.07 |
| RX 6800     | gpt-oss 20B MXFP4 MoE  |              64 | pp512 |     374.56 |      389.75 |    1.04 |
| RX 6800     | gpt-oss 20B MXFP4 MoE  |             128 | pp512 |     546.24 |      566.68 |    1.04 |
| RX 6800     | gpt-oss 20B MXFP4 MoE  |             256 | pp512 |     714.39 |      737.59 |    1.03 |
| RX 6800     | gpt-oss 20B MXFP4 MoE  |             512 | pp512 |     786.63 |      807.11 |    1.03 |

@JohannesGaessler JohannesGaessler merged commit 74f52f7 into ggml-org:master Aug 25, 2025
44 checks passed
@mizuikk mizuikk commented Aug 26, 2025

Performance has improved a lot.

Non-rigorous Performance Comparison for gpt-oss 20B MXFP4 MoE

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: Tesla V100-SXM2-16GB, compute capability 7.0, VMM: yes
Device 1: Tesla V100-SXM2-16GB, compute capability 7.0, VMM: yes
load_backend: loaded CUDA backend from /app/libggml-cuda.so
load_backend: loaded CPU backend from /app/libggml-cpu-haswell.so

Batch Size Comparison

| n_batch | n_ubatch | Test    | t/s (Build e92734d (6250)) | t/s (Build 39842a7 (6281)) | Increment (%) |
| ------: | -------: | ------- | -------------------------: | -------------------------: | ------------: |
|    4096 |     2048 | pp2048  |           4033.44 ± 128.36 |           4080.68 ± 128.59 |          1.17 |
|    4096 |     2048 | pp8192  |             3572.80 ± 4.06 |             3608.89 ± 3.72 |          1.01 |
|    4096 |     2048 | pp16384 |             3043.20 ± 3.20 |             3078.45 ± 4.16 |          1.16 |
|    4096 |     2048 | pp32768 |             2352.16 ± 0.54 |             2382.57 ± 0.32 |          1.29 |
|    4096 |     2048 | tg128   |              125.71 ± 0.02 |              139.66 ± 0.02 |         11.10 |
|    4096 |     4096 | pp2048  |            4089.27 ± 11.14 |            4126.88 ± 10.73 |          0.92 |
|    4096 |     4096 | pp8192  |            3767.56 ± 10.57 |            3791.91 ± 10.15 |          0.65 |
|    4096 |     4096 | pp16384 |             3184.26 ± 2.14 |             3210.55 ± 1.21 |          0.83 |
|    4096 |     4096 | pp32768 |             2428.32 ± 1.06 |           2394.49 ± 107.44 |         -1.39 |
|    4096 |     4096 | tg128   |              125.55 ± 0.02 |              139.57 ± 0.05 |         11.16 |

Same talk Comparison

| t/s (Build e92734d (6250)) | t/s (Build 39842a7 (6281)) | Increment (%) |
| -------------------------: | -------------------------: | ------------: |
|                      66.76 |                      71.27 |          6.33 |

Minh141120 pushed a commit to menloresearch/llama.cpp that referenced this pull request Aug 27, 2025
* CUDA: optimize get_int_from_table_16

* CUDA: use v_perm_b32 to replace byte_perm on AMD GPUs

* revise documentation

---------

Co-authored-by: xix <[email protected]>
Co-authored-by: Johannes Gäßler <[email protected]>
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Oct 6, 2025