guide : running gpt-oss with llama.cpp #15396
I can provide some numbers for the AMD part of the guide. My hardware is an RX 7900 XT (20GB VRAM) + Ryzen 9 5900X + 32GB of RAM, running on up-to-date Arch Linux with a locally built llama.cpp version 6194 (3007baf), built with ROCm 6.4.1-1 (from the official Arch repo). I pulled the gpt-oss-20b repository and converted it to GGUF.

The 7900 XT can load the full 20B model with full context without offloading MoE layers to the CPU (although barely, because it fills up the whole VRAM), by running:

```bash
llama-server -m ./gpt-oss-20b.auto.gguf --ctx-size 0 --jinja -b 4096 -ub 4096 -ngl 99 -fa
```

With that, I get generation speeds (as reported by the llama.cpp webui) of ~94 tokens/second, slowly going down as the context fills up. I've also tested whether setting K/V cache quantization would help with model size or performance, but the result was... bad: performance was halved and the CPU got involved. Is this because of the MXFP4 format of gpt-oss?

I'd also like to note that my PC likes to hang when I fill up my VRAM to the brim, so I've also checked how gpt-oss-20b behaves when I offload MoE layers to the CPU. When running with all MoE layers on the CPU, as below:

```bash
llama-server -m ./gpt-oss-20b.auto.gguf --ctx-size 0 --jinja -b 4096 -ub 4096 -ngl 99 -fa -cmoe
```

my GPU VRAM usage (as reported by btop) is around 10GB, and RAM usage went up only ~2GB. However, performance took a major 80% hit: generation speed is now ~20 tok/s, with the CPU taking most of the load. If you have a better CPU and faster RAM (I'm still running dual-channel DDR4 @ 3200MHz CL16, mind you), you will probably get better results. I wonder how X3Ds behave in that case...

I assume that gpt-oss-20b has 24 MoE layers, so let's see how it behaves when I keep only, let's say, 4 of them on the CPU:

```bash
llama-server -m ./gpt-oss-20b.auto.gguf --ctx-size 0 --jinja -b 4096 -ub 4096 -ngl 99 -fa -ncmoe 4
```

VRAM is at 18GB (previously it was at 19, as reported by btop, so there's a decrease), RAM usage went up by around 1.5GB, and generation speed is ~60 tok/s. Neat, this is usable. How about 8 layers?

```bash
llama-server -m ./gpt-oss-20b.auto.gguf --ctx-size 0 --jinja -b 4096 -ub 4096 -ngl 99 -fa -ncmoe 8
```

In that case, I get 16GB VRAM usage, a ~1.5GB RAM bump as previously, and generation speed went down to 38 tokens/s. Still pretty usable. How about 16 layers?

```bash
llama-server -m ./gpt-oss-20b.auto.gguf --ctx-size 0 --jinja -b 4096 -ub 4096 -ngl 99 -fa -ncmoe 16
```

VRAM: 13GB, RAM: as previously, not more than 2GB, generation speed: 25-27 tok/s - this is getting pretty bad.

As mentioned before, your results may vary. I'm not running current-gen top-tier hardware, and IIRC the largest performance bottleneck will be the RAM/PCIe link speed anyway. I'm pretty curious to see what the performance with this GPU is on a more recent platform, especially with an X3D CPU.
Note
This guide is a live document. Feedback and benchmark numbers are welcome - the guide will be updated accordingly.
Overview
This is a detailed guide for running the new `gpt-oss` models locally with the best performance using `llama.cpp`. The guide covers a very wide range of hardware configurations. The `gpt-oss` models are very lightweight, so you can run them efficiently even on surprisingly low-end configurations.
Obtaining `llama.cpp` binaries for your system
Make sure you are running the latest release of `llama.cpp`: https://github.com/ggml-org/llama.cpp/releases
Obtaining the `gpt-oss` model data (optional)
The commands used below in the guide will automatically download the model data and store it locally on your device. So this step is completely optional and provided for completeness.
The original models provided by OpenAI are here:
To use them with `llama.cpp`, you first need to manually convert them to GGUF format. For convenience, we host pre-converted models in ggml-org.
Pre-converted GGUF models:
Tip
Running the commands below will automatically download the latest version of the model and store it locally on your device for later usage. A WebUI chat and an OAI-compatible API will become available on localhost.
Note
This guide currently covers general chat use cases with `llama-server`. Tool-calling and agentic use cases are mostly supported via the OAI-compatible API of `llama-server`, but some minor issues are still being resolved, both on the `llama.cpp` side and in the reference upstream model implementations (chat template, harmony). As things get more polished in the next few days, you should be able to use the commands below in combination with your favorite 3rd party applications that support the OAI interface (such as chat interfaces, coding agents, etc.).
Minimum requirements
Here are some hard memory requirements for the 2 models. These numbers could vary a little bit by adjusting the CLI arguments, but should give a good reference point.
Note
It is not necessary to fit the entire model in VRAM to get good performance. Offloading just the attention tensors and the KV cache to VRAM and keeping the rest of the model in CPU RAM can provide decent performance as well. This is taken into account in the rest of the guide.
Relevant CLI arguments
Using the correct CLI arguments in your commands is crucial for getting the best performance for your hardware. Here is a summary of the important flags and their meaning:
| Flag | Description |
| --- | --- |
| `-hf` | Downloads the model data (via `curl`) from the respective model repository |
| `-c` | The `gpt-oss` models have a maximum context of 128k tokens. Use `-c 0` to set to the model's default |
| `-ub N -b N` | Sets the batch sizes to `N` during processing. A larger size increases the size of the compute buffers, but can improve the performance in some cases |
| `-fa` | Enables Flash Attention |
| `--n-gpu-layers N` | Number of model layers `N` to offload to the GPU. For this guide, keep this number at `-ngl 99` |
| `--n-cpu-moe N` | Number of MoE layers `N` to keep on the CPU. This is used in hardware configs that cannot fit the models fully on the GPU. The specific value depends on your memory resources, and finding the optimal value requires some experimentation |
| `--jinja` | Tells `llama.cpp` to use the Jinja chat template embedded in the GGUF model file |
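For example, a command that puts these flags together for the small model (the repository name assumes the pre-converted `ggml-org/gpt-oss-20b-GGUF` model) could look like this:

```bash
# gpt-oss-20b, model default (128k) context, fully offloaded to the GPU
llama-server -hf ggml-org/gpt-oss-20b-GGUF -c 0 -fa --jinja -ngl 99 -ub 2048 -b 2048
```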
Apple Silicon
Apple Silicon devices have unified memory that is seamlessly shared between the CPU and GPU. For optimal performance, it is recommended not to exceed 70% of your device's total memory.
Tip
Install the latest `llama.cpp` package from Homebrew with `brew install llama.cpp`.
Tip
To increase the amount of RAM available to the `llama-server` process, use the following command:

```bash
# on a 192GB machine, raise the limit from 154GB (default) to 180GB
sudo sysctl iogpu.wired_limit_mb=180000
```
✅ Devices with more than 96GB RAM
The M2 Max, M3 Max, M4 Max, M1 Ultra, M2 Ultra, M3 Ultra, etc. chips can run both models at full context:
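For reference, commands along these lines should work (the repository names assume the pre-converted ggml-org GGUFs):

```bash
# gpt-oss-20b, full context
llama-server -hf ggml-org/gpt-oss-20b-GGUF -c 0 -fa --jinja

# gpt-oss-120b, full context
llama-server -hf ggml-org/gpt-oss-120b-GGUF -c 0 -fa --jinja
```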
🟢 Benchmarks for `gpt-oss-20b`
build: 79c1160 (6123)
🟢 Benchmarks on M2 Ultra (192 GB) for `gpt-oss-120b`
build: 79c1160 (6123)
✅ Devices with less than 96GB RAM
The small `gpt-oss-20b` model can run efficiently on Macs with at least 16GB RAM:
🟢 Benchmarks on M4 Max (36GB) for `gpt-oss-20b`
build: 79c1160 (6123)
🟢 Benchmarks on M1 Pro (32GB) for `gpt-oss-20b`
build: 79c1160 (6123)
🟥 Devices with 8GB RAM
Unfortunately, you are out of luck. The `gpt-oss` models cannot be run on Macs with that small amount of memory.
NVIDIA
✅ Devices with more than 64GB VRAM
With more than 64GB of VRAM, you can run both models by offloading everything (both the model and the KV cache) to the GPU(s).
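For example, a sketch for the large model with everything offloaded and full context:

```bash
# gpt-oss-120b, full context, model and KV cache fully on the GPU(s)
llama-server -hf ggml-org/gpt-oss-120b-GGUF -c 0 -fa --jinja -ngl 99 -ub 2048 -b 2048
```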
🟡 TODO: benchmarks for `gpt-oss-20b`
TODO
🟡 TODO: benchmarks for `gpt-oss-120b`
TODO
✅ Devices with less than 64GB VRAM
In this case, you can fit the small `gpt-oss-20b` model fully in VRAM for optimal performance.
🟡 TODO: benchmarks for `gpt-oss-20b`
TODO
The large model has to be partially kept on the CPU.
🟡 TODO: add commands for `gpt-oss-120b`
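In the meantime, a reasonable starting point is to offload everything and move MoE layers to the CPU until the model fits; the value below is only illustrative and needs tuning for your GPU:

```bash
# gpt-oss-120b, full offload, some MoE layers kept on the CPU (tune --n-cpu-moe)
llama-server -hf ggml-org/gpt-oss-120b-GGUF -c 0 -fa --jinja -ngl 99 --n-cpu-moe 10
```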
✅ Devices with 16GB VRAM
For example: NVIDIA V100
This config is just at the edge of fitting the full context of `gpt-oss-20b` in VRAM, so we have to restrict the maximum context to 32k tokens (see the example command after the benchmark below).
🟢 Benchmarks on NVIDIA V100 (16GB) for `gpt-oss-20b`
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: Tesla V100-PCIE-16GB, compute capability 7.0, VMM: yes
build: 228f724 (6129)
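A command for this configuration might look like the following (a sketch; the batch sizes may need tuning):

```bash
# gpt-oss-20b, context limited to 32k tokens to fit in 16GB of VRAM
llama-server -hf ggml-org/gpt-oss-20b-GGUF -c 32768 -fa --jinja -ngl 99 -ub 2048 -b 2048
```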
Running the large `gpt-oss-120b` model with 16GB of VRAM requires keeping some of the layers on the CPU, since it does not fit completely in VRAM:
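A hedged starting point (the exact `--n-cpu-moe` value is illustrative and depends on how much VRAM is left after the KV cache):

```bash
# gpt-oss-120b, 32k context, most MoE layers kept on the CPU to fit in 16GB of VRAM
llama-server -hf ggml-org/gpt-oss-120b-GGUF -c 32768 -fa --jinja -ngl 99 -ub 2048 -b 2048 --n-cpu-moe 30
```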
✅ Devices with less than 16GB VRAM
For this config, it is recommended to tell `llama.cpp` to run the entire model on the GPU while keeping enough layers on the CPU. Here is a specific example with an RTX 2060 8GB machine:
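The following is a sketch for the small model on such a machine; the `--n-cpu-moe` value is illustrative and should be tuned until the model fits:

```bash
# gpt-oss-20b, 32k context, some MoE layers on the CPU to fit in 8GB of VRAM
llama-server -hf ggml-org/gpt-oss-20b-GGUF --ctx-size 32768 --jinja -ub 2048 -b 2048 -ngl 99 -fa --n-cpu-moe 12
```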
Note that even with just 8GB of VRAM, we can adjust the CPU layers so that we can run the large 120B model too:

```bash
# gpt-oss-120b, 32k context, 35 layers on the CPU
llama-server -hf ggml-org/gpt-oss-120b-GGUF --ctx-size 32768 --jinja -ub 2048 -b 2048 -ngl 99 -fa --n-cpu-moe 35
```
Tip
For more information about how to adjust the CPU layers, see the "Tips" section at the end of this guide.
AMD
Note
If you have AMD hardware, please provide feedback about running the `gpt-oss` models on it and the performance that you observe. See the sections above for the kinds of commands to try and adjust accordingly.

With AMD devices, you can use either the ROCm or the Vulkan backend. Depending on your specific hardware, the results can vary.
✅ RX 7900 XT (20GB VRAM) using ROCm backend
🟢 Benchmarks on RX 7900 XT (20GB) for `gpt-oss-20b`
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon RX 7900 XT, gfx1100 (0x1100), VMM: no, Wave Size: 32
build: 3007baf (6194)
More information: #15396 (comment)
Tips
Determining the optimal number of layers to keep on the CPU
Good general advice for most MoE models is to offload the entire model and use `--n-cpu-moe` to keep as many MoE layers as necessary on the CPU. The minimum amount of VRAM needed to do this with the 120B model is about 8GB; below that, you will need to start reducing the context size and the number of layers offloaded. For example, you can get about 30 t/s at zero context on a 5090 with `--n-cpu-moe 21`.

Caveat: on Windows it is possible to allocate more VRAM than is available, and the result will be slow swapping to RAM and very bad performance. Just because the model loads without errors doesn't mean you have enough VRAM for the settings you are using. A good way to avoid this is to look at "GPU Memory" in Task Manager and check that it does not exceed the GPU's VRAM.
Example on a 5090 (32GB):

good, `--n-cpu-moe 21`, GPU Memory < 32:

![image](https://private-user-images.githubusercontent.com/...)

bad, `--n-cpu-moe 20`, GPU Memory > 32:

![image](https://private-user-images.githubusercontent.com/...)

Using `gpt-oss` + `llama.cpp` with coding agents (such as Claude Code)
Set up the coding agent of your choice to look for a localhost OAI endpoint (see Tutorial: Offline Agentic coding with llama-server #14758)
Start `llama-server` like this:
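For example (a sketch; adjust the model, context size and port to your setup):

```bash
# serve gpt-oss-120b with the model's full context on an OAI-compatible endpoint
llama-server -hf ggml-org/gpt-oss-120b-GGUF -c 0 -fa --jinja --port 8080
```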
Configure the default sampling and reasoning settings
When starting a `llama-server` command, you can change the default sampling and reasoning settings like so:
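For example, to apply the recommended sampling defaults and request a higher reasoning effort (the `--chat-template-kwargs` payload is an assumption about the gpt-oss chat template; adjust or drop it if your build differs):

```bash
# recommended sampling defaults; reasoning effort is passed through the chat template
llama-server -hf ggml-org/gpt-oss-120b-GGUF -c 0 -fa --jinja \
  --temp 1.0 --top-p 1.0 \
  --chat-template-kwargs '{"reasoning_effort": "high"}'
```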
Note that these are just the default settings and they could be overridden by the client connecting to the `llama-server`.
Frequently asked questions
Q: Which quants to use?
Always use the original MXFP4 model files. The `gpt-oss` models are natively "quantized", i.e. they are trained in the MXFP4 format, which is roughly equivalent to `ggml`'s `Q4_0`. The main difference from `Q4_0` is that the MXFP4 models get to keep their full quality. This means that no quantization in the usual sense is necessary.
Q: What sampling parameters to use?
OpenAI recommends `temperature=1.0` and `top_p=1.0`.

Do not use repetition penalties! Some clients tend to enable repetition penalties by default - make sure to disable those.
Known issues
Some rough edges in the implementation are still being polished. Here is a list of issues to keep track of:

- `gpt-oss-120b` when using Vulkan #15274