guide : running gpt-oss with llama.cpp #15396
I can provide some numbers for the AMD part of the guide. My hardware is an RX 7900 XT (20GB VRAM) + Ryzen 9 5900X + 32GB of RAM, running on up-to-date Arch Linux with a locally built llama.cpp version 6194 (3007baf), built with ROCm 6.4.1-1 (from the official Arch repo). I pulled the gpt-oss-20b repository and converted it to GGUF.

The 7900 XT can load the full 20B model with full context without offloading MoE layers to the CPU (although barely, because it fills up the whole VRAM), by running:

```bash
llama-server -m ./gpt-oss-20b.auto.gguf --ctx-size 0 --jinja -b 4096 -ub 4096 -ngl 99 -fa
```

With that, I get generation speeds (as reported by the llama.cpp webui) of ~94 tokens/second, slowly going down as the context fills up. I've also tested whether setting K/V cache quantization would help with model size or performance, but the result was... bad: performance was halved and the CPU got involved. Is this because of the MXFP4 format of gpt-oss?

I'd also like to note that my PC likes to hang when I fill up my VRAM to the brim, so I've also checked how gpt-oss-20b behaves when I offload MoE layers to the CPU. When running with all MoE layers on the CPU, as below:

```bash
llama-server -m ./gpt-oss-20b.auto.gguf --ctx-size 0 --jinja -b 4096 -ub 4096 -ngl 99 -fa -cmoe
```

my GPU VRAM usage (as reported by btop) is around 10GB, and RAM usage went up only ~2GB. However, performance took a major 80% hit: generation speed is now ~20 tok/s, with the CPU taking most of the load. If you have a better CPU and faster RAM (I'm still running dual-channel DDR4 @ 3200MHz CL16, mind you), you will probably get better results. I wonder how X3Ds behave in that case...

I assume that gpt-oss-20b has 24 MoE layers, so let's see how it behaves when I keep only, let's say, 4 of them on the CPU:

```bash
llama-server -m ./gpt-oss-20b.auto.gguf --ctx-size 0 --jinja -b 4096 -ub 4096 -ngl 99 -fa -ncmoe 4
```

VRAM is at 18GB (previously it was at 19, as reported by btop, so there's a decrease), RAM usage went up by around 1.5GB, and generation speed is ~60 tok/s. Neat, this is usable. How about 8 layers?

```bash
llama-server -m ./gpt-oss-20b.auto.gguf --ctx-size 0 --jinja -b 4096 -ub 4096 -ngl 99 -fa -ncmoe 8
```

In that case, I get 16GB VRAM usage, a ~1.5GB RAM bump as previously, and generation speed went down to 38 tokens/s. Still pretty usable. How about 16 layers?

```bash
llama-server -m ./gpt-oss-20b.auto.gguf --ctx-size 0 --jinja -b 4096 -ub 4096 -ngl 99 -fa -ncmoe 16
```

VRAM: 13GB, RAM: as previously, not more than 2GB, generation speed: 25-27 tok/s - this is getting pretty bad.

As mentioned before, your results may vary. I'm not running current-gen top-tier hardware, and IIRC the largest performance bottleneck will be the RAM/PCIe link speed anyway. I'm pretty curious to see what the performance with this GPU is on a more recent platform, especially with an X3D CPU.
Note
This guide is a live document. Feedback and benchmark numbers are welcome - the guide will be updated accordingly.
Overview
This is a detailed guide for running the new `gpt-oss` models locally with the best performance using `llama.cpp`. The guide covers a very wide range of hardware configurations. The `gpt-oss` models are very lightweight, so you can run them efficiently even on surprisingly low-end configurations.
Obtaining `llama.cpp` binaries for your system
Make sure you are running the latest release of `llama.cpp`: https://github.com/ggml-org/llama.cpp/releases
Obtaining the `gpt-oss` model data (optional)
The commands used below in the guide will automatically download the model data and store it locally on your device. So this step is completely optional and provided for completeness.
The original models provided by OpenAI are here:
To use them with `llama.cpp`, you first need to manually convert them to GGUF format. For convenience, we host pre-converted models in ggml-org.
Pre-converted GGUF models:
Tip
Running the commands below will automatically download the latest version of the model and store it locally on your device for later usage. A WebUI chat and an OAI-compatible API will become available on localhost.
Note
This guide currently covers general chat use cases with `llama-server`. Tool-calling and agentic use cases are mostly supported via the OAI-compatible API of `llama-server`, but some minor issues are still being resolved, both on the `llama.cpp` side and in the reference upstream model implementations (chat template, harmony). As things get more polished in the next few days, you should be able to use the commands below in combination with your favorite 3rd party applications that support the OAI interface (such as chat interfaces, coding agents, etc.).
Minimum requirements
Here are some hard memory requirements for the 2 models. These numbers could vary a little bit by adjusting the CLI arguments, but should give a good reference point.
Note
It is not necessary to fit the entire model in VRAM to get good performance. Offloading just the attention tensors and the KV cache to VRAM and keeping the rest of the model in CPU RAM can provide decent performance as well. This is taken into account in the rest of the guide.
Relevant CLI arguments
Using the correct CLI arguments in your commands is crucial for getting the best performance for your hardware. Here is a summary of the important flags and their meaning:
| Flag | Description |
| --- | --- |
| `-hf` | Downloads the model data (via `curl`) from the respective model repository |
| `-c` | The `gpt-oss` models have a maximum context of 128k tokens. Use `-c 0` to set to the model's default |
| `-ub N -b N` | Sets the batch sizes to `N` during processing. A larger size increases the size of the compute buffers, but can improve the performance in some cases |
| `-fa` | Enables Flash Attention |
| `--n-gpu-layers N` | Number of model layers `N` to offload to the GPU. For this guide, keep this number at `-ngl 99` |
| `--n-cpu-moe N` | Number of MoE layers `N` to keep on the CPU. This is used in hardware configs that cannot fit the models fully on the GPU. The specific value depends on your memory resources, and finding the optimal value requires some experimentation |
| `--jinja` | Tells `llama.cpp` to use the Jinja chat template embedded in the GGUF model file |
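For example, a command that puts these flags together for the small model (the repository name assumes the pre-converted `ggml-org/gpt-oss-20b-GGUF` model) could look like this:

```bash
# gpt-oss-20b, model default (128k) context, fully offloaded to the GPU
llama-server -hf ggml-org/gpt-oss-20b-GGUF -c 0 -fa --jinja -ngl 99 -ub 2048 -b 2048
```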
Apple Silicon
Apple Silicon devices have unified memory that is seamlessly shared between the CPU and GPU. For optimal performance, it is recommended not to exceed 70% of your device's total memory.
Tip
Install the latest `llama.cpp` package from Homebrew with `brew install llama.cpp`.
Tip
To increase the amount of RAM available to the `llama-server` process, use the following command:

```bash
# on a 192GB machine, raise the limit from 154GB (default) to 180GB
sudo sysctl iogpu.wired_limit_mb=180000
```
✅ Devices with more than 96GB RAM
The M2 Max, M3 Max, M4 Max, M1 Ultra, M2 Ultra, M3 Ultra, etc. chips can run both models at full context:
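For reference, commands along these lines should work (the repository names assume the pre-converted ggml-org GGUFs):

```bash
# gpt-oss-20b, full context
llama-server -hf ggml-org/gpt-oss-20b-GGUF -c 0 -fa --jinja

# gpt-oss-120b, full context
llama-server -hf ggml-org/gpt-oss-120b-GGUF -c 0 -fa --jinja
```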
🟢 Benchmarks for `gpt-oss-20b`
build: 79c1160 (6123)
🟢 Benchmarks on M2 Ultra (192 GB) for `gpt-oss-120b`
build: 79c1160 (6123)
✅ Devices with less than 96GB RAM
The small `gpt-oss-20b` model can run efficiently on Macs with at least 16GB RAM:
🟢 Benchmarks on M4 Max (36GB) for `gpt-oss-20b`
build: 79c1160 (6123)
🟢 Benchmarks on M1 Pro (32GB) for `gpt-oss-20b`
build: 79c1160 (6123)
🟥 Devices with 8GB RAM
Unfortunately, you are out of luck. The `gpt-oss` models cannot be run on Macs with that small amount of memory.
NVIDIA
✅ Devices with more than 64GB VRAM
With more than 64GB of VRAM, you can run both models by offloading everything (both the model and the KV cache) to the GPU(s).
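For example, a sketch for the large model with everything offloaded and full context:

```bash
# gpt-oss-120b, full context, model and KV cache fully on the GPU(s)
llama-server -hf ggml-org/gpt-oss-120b-GGUF -c 0 -fa --jinja -ngl 99 -ub 2048 -b 2048
```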
🟡 TODO: benchmarks for `gpt-oss-20b`
TODO
🟡 TODO: benchmarks for `gpt-oss-120b`
TODO
✅ Devices with less than 64GB VRAM
In this case, you can fit the small `gpt-oss-20b` model fully in VRAM for optimal performance.
🟡 TODO: benchmarks for `gpt-oss-20b`
TODO
The large model has to be partially kept on the CPU.
🟡 TODO: add commands for `gpt-oss-120b`
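In the meantime, a reasonable starting point is to offload everything and move MoE layers to the CPU until the model fits; the value below is only illustrative and needs tuning for your GPU:

```bash
# gpt-oss-120b, full offload, some MoE layers kept on the CPU (tune --n-cpu-moe)
llama-server -hf ggml-org/gpt-oss-120b-GGUF -c 0 -fa --jinja -ngl 99 --n-cpu-moe 10
```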
✅ Devices with 16GB VRAM
For example: NVIDIA V100
This config is just at the edge of fitting the full context of `gpt-oss-20b` in VRAM, so we have to restrict the maximum context to 32k tokens (see the example command after the benchmark below).
🟢 Benchmarks on NVIDIA V100 (16GB) for `gpt-oss-20b`
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: Tesla V100-PCIE-16GB, compute capability 7.0, VMM: yes
build: 228f724 (6129)
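A command for this configuration might look like the following (a sketch; the batch sizes may need tuning):

```bash
# gpt-oss-20b, context limited to 32k tokens to fit in 16GB of VRAM
llama-server -hf ggml-org/gpt-oss-20b-GGUF -c 32768 -fa --jinja -ngl 99 -ub 2048 -b 2048
```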
Running the large `gpt-oss-120b` model with 16GB of VRAM requires keeping some of the layers on the CPU, since it does not fit completely in VRAM:
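A hedged starting point (the exact `--n-cpu-moe` value is illustrative and depends on how much VRAM is left after the KV cache):

```bash
# gpt-oss-120b, 32k context, most MoE layers kept on the CPU to fit in 16GB of VRAM
llama-server -hf ggml-org/gpt-oss-120b-GGUF -c 32768 -fa --jinja -ngl 99 -ub 2048 -b 2048 --n-cpu-moe 30
```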
✅ Devices with less than 16GB VRAM
For this config, it is recommended to tell `llama.cpp` to run the entire model on the GPU while keeping enough layers on the CPU. Here is a specific example with an RTX 2060 8GB machine:
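The following is a sketch for the small model on such a machine; the `--n-cpu-moe` value is illustrative and should be tuned until the model fits:

```bash
# gpt-oss-20b, 32k context, some MoE layers on the CPU to fit in 8GB of VRAM
llama-server -hf ggml-org/gpt-oss-20b-GGUF --ctx-size 32768 --jinja -ub 2048 -b 2048 -ngl 99 -fa --n-cpu-moe 12
```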
Note that even with just 8GB of VRAM, we can adjust the CPU layers so that we can run the large 120B model too:

```bash
# gpt-oss-120b, 32k context, 35 layers on the CPU
llama-server -hf ggml-org/gpt-oss-120b-GGUF --ctx-size 32768 --jinja -ub 2048 -b 2048 -ngl 99 -fa --n-cpu-moe 35
```
Tip
For more information about how to adjust the CPU layers, see the "Tips" section at the end of this guide.
AMD
Note
If you have AMD hardware, please provide feedback about running the `gpt-oss` models on it and the performance that you observe. See the sections above for the kinds of commands to try and adjust accordingly.

With AMD devices, you can use either the ROCm or the Vulkan backend. Depending on your specific hardware, the results can vary.
✅ RX 7900 XT (20GB VRAM) using ROCm backend
🟢 Benchmarks on RX 7900 XT (20GB) for `gpt-oss-20b`
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon RX 7900 XT, gfx1100 (0x1100), VMM: no, Wave Size: 32
build: 3007baf (6194)
More information: #15396 (comment)
Tips
Determining the optimal number of layers to keep on the CPU
Good general advice for most MoE models is to offload the entire model and use `--n-cpu-moe` to keep as many MoE layers as necessary on the CPU. The minimum amount of VRAM needed to do this with the 120B model is about 8GB; below that, you will need to start reducing the context size and the number of layers offloaded. For example, you can get about 30 t/s at zero context on a 5090 with `--n-cpu-moe 21`.

Caveat: on Windows it is possible to allocate more VRAM than is available, and the result will be slow swapping to RAM and very bad performance. Just because the model loads without errors doesn't mean you have enough VRAM for the settings you are using. A good way to avoid this is to look at "GPU Memory" in Task Manager and check that it does not exceed the GPU's VRAM.
Example on a 5090 (32GB):

good, `--n-cpu-moe 21`, GPU Memory < 32:

![image](https://private-user-images.githubusercontent.com/...)

bad, `--n-cpu-moe 20`, GPU Memory > 32:

![image](https://private-user-images.githubusercontent.com/...)

Using `gpt-oss` + `llama.cpp` with coding agents (such as Claude Code)
Set up the coding agent of your choice to look for a localhost OAI endpoint (see Tutorial: Offline Agentic coding with llama-server #14758)
Start `llama-server` like this:
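For example (a sketch; adjust the model, context size and port to your setup):

```bash
# serve gpt-oss-120b with the model's full context on an OAI-compatible endpoint
llama-server -hf ggml-org/gpt-oss-120b-GGUF -c 0 -fa --jinja --port 8080
```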
Configure the default sampling and reasoning settings
When starting a `llama-server` command, you can change the default sampling and reasoning settings like so:
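For example, to apply the recommended sampling defaults and request a higher reasoning effort (the `--chat-template-kwargs` payload is an assumption about the gpt-oss chat template; adjust or drop it if your build differs):

```bash
# recommended sampling defaults; reasoning effort is passed through the chat template
llama-server -hf ggml-org/gpt-oss-120b-GGUF -c 0 -fa --jinja \
  --temp 1.0 --top-p 1.0 \
  --chat-template-kwargs '{"reasoning_effort": "high"}'
```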
Note that these are just the default settings and they could be overridden by the client connecting to the `llama-server`.
Frequently asked questions
Q: Which quants to use?
Always use the original MXFP4 model files. The `gpt-oss` models are natively "quantized", i.e. they are trained in the MXFP4 format, which is roughly equivalent to `ggml`'s `Q4_0`. The main difference from `Q4_0` is that the MXFP4 models get to keep their full quality. This means that no quantization in the usual sense is necessary.
Q: What sampling parameters to use?
OpenAI recommends `temperature=1.0` and `top_p=1.0`.

Do not use repetition penalties! Some clients tend to enable repetition penalties by default - make sure to disable those.
Known issues
Some rough edges in the implementation are still being polished. Here is a list of issues to keep track of:

- `gpt-oss-120b` when using Vulkan #15274