llama.cpp supports the new gpt-oss model in native MXFP4 format #15095
Replies: 11 comments 1 reply
-
YESSSSSSS :D
-
It's just amazing to see how far the project has come. Thanks to everyone that makes this possible!
-
Great work!
-
What's the advantage of the MXFP4 quant? Does it have better performance than the Q4_K_M or IQ4_XS quants?
-
This is incredible :D! Thank you to all the developers involved.
-
Great achievement! Many thanks to all involved.
-
Incredibly awesome work!! Thanks to all the people making this possible. :)
-
Yessss! Awesome work 😍
-
Super, is it release b6101 (https://github.com/ggml-org/llama.cpp/releases/tag/b6101)?
-
Thanks for the great job
-
That is amazing! You are the best!
-
The new gpt-oss model is fully supported in native MXFP4 format across all major ggml backends, including CUDA, Vulkan, Metal and CPU, with exceptional performance. This brings the unprecedented quality of gpt-oss into the hands of everyone - from local AI enthusiasts to enterprises doing inference at the edge or in the cloud. The unique inference capabilities of ggml unlock a vast range of use cases across the entire spectrum of consumer-grade hardware available on the market today - use cases that are impossible to support with any other inference framework in existence. Today, gpt-oss, trained natively in the MXFP4 format, effectively "leaps" over the existing resource barriers and allows us to experience SOTA AI quality on our own personal devices. The era of natively trained 4-bit local models has officially begun, and ggml will continue to lead the way forward!
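As an illustration of what this enables in practice, here is a minimal sketch of loading an MXFP4 GGUF of gpt-oss and greedily generating a few tokens through the llama.cpp C API. The model filename, prompt and parameter values are placeholders, and the function names follow the current API, so they may differ between releases; it is a sketch rather than a definitive example.

```cpp
// Minimal sketch: load an MXFP4 GGUF and greedily generate a few tokens
// via the llama.cpp C API. The model path is a placeholder; function names
// follow the current API and may differ between releases.
#include "llama.h"

#include <cstdio>
#include <string>
#include <vector>

int main() {
    const std::string model_path = "gpt-oss-20b-mxfp4.gguf"; // hypothetical filename
    const std::string prompt     = "Explain MXFP4 in one sentence.";

    llama_backend_init();

    // load the model (offload all layers if a GPU backend is available)
    llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 99;
    llama_model * model = llama_model_load_from_file(model_path.c_str(), mparams);
    const llama_vocab * vocab = llama_model_get_vocab(model);

    // tokenize the prompt (the first call returns the required count as a negative value)
    const int n_prompt = -llama_tokenize(vocab, prompt.c_str(), prompt.size(), nullptr, 0, true, true);
    std::vector<llama_token> tokens(n_prompt);
    llama_tokenize(vocab, prompt.c_str(), prompt.size(), tokens.data(), tokens.size(), true, true);

    llama_context_params cparams = llama_context_default_params();
    cparams.n_ctx = 4096;
    llama_context * ctx = llama_init_from_model(model, cparams);

    // greedy sampler chain
    llama_sampler * smpl = llama_sampler_chain_init(llama_sampler_chain_default_params());
    llama_sampler_chain_add(smpl, llama_sampler_init_greedy());

    llama_batch batch = llama_batch_get_one(tokens.data(), tokens.size());
    llama_token tok;

    for (int i = 0; i < 128; i++) {
        llama_decode(ctx, batch);
        tok = llama_sampler_sample(smpl, ctx, -1);
        if (llama_vocab_is_eog(vocab, tok)) {
            break;
        }
        char buf[256];
        const int len = llama_token_to_piece(vocab, tok, buf, sizeof(buf), 0, true);
        printf("%.*s", len, buf);
        batch = llama_batch_get_one(&tok, 1);
    }
    printf("\n");

    llama_sampler_free(smpl);
    llama_free(ctx);
    llama_model_free(model);
    llama_backend_free();
    return 0;
}
```

The backend is chosen at build time (for example via the GGML_CUDA, GGML_VULKAN or GGML_METAL CMake options), and the same code runs unchanged on the CPU backend.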
Over the past 2 years the open-source developer community behind ggml has grown significantly. Together we built a scalable software infrastructure capable of supporting all the needs of modern, low-level ML inference. More and more engineers, product builders, hardware vendors and researchers continue to discover and adopt what we have created. All of this wouldn't have been possible without the dedicated open approach that this community has embraced from the very beginning of the project. The main difference today, compared to 2 years ago, is that we are past the "hacking" phase of development, and it is time to focus on architecting and maintaining the correct implementation that will become the foundation of most local AI applications and products in the near future.
The primary goal of ggml-org will continue to be helping the community grow and creating opportunities for everyone involved. We are more open than ever to support from the leaders in the AI field, and today's release is a prime example of what is possible with such coordinated and aligned efforts.
Special thanks to all maintainers, collaborators and contributors of ggml and related projects. Looking forward to many new developments together. Have fun!