Quant Cookers Basic Guide #434
ubergarm started this conversation in Show and tell
Quant Cooking Basic Guide
Example workflow for cooking custom quants with ik_llama.cpp that I used to generate ubergarm/Qwen3-14B-GGUF.
Goal
The goal is to provide a specific example of methodology that can be adapted for future LLMs and quant types in general.
In this guide we will download and quantize the dense model Qwen/Qwen3-14B on a gaming rig with a single 3090TI FE 24GB VRAM GPU.
We will use the latest ik_llama.cpp quants to target running this 14B model in GGUF format fully offloaded on <=16GB VRAM systems with 32k context.
This guide does not get into more complex topics like MLA methodology, e.g. converting fp8 to bf16 on older GPU hardware.
Dependencies
This is all run on a Linux rig, but feel free to use WSL for a similar experience if you're limited to a Windows-based OS.
Install any build essentials, git, etc. We will use uv for python virtual environment management to keep everything clean.
Convert bf16 safetensors to bf16 gguf
I generally use mainline llama.cpp or evshiron's fork for doing the conversion with the python script.
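Something like this should work. It is only a sketch: it assumes mainline llama.cpp's convert_hf_to_gguf.py script, and the safetensors directory path is a placeholder for wherever you downloaded Qwen/Qwen3-14B.

# set up a clean python environment inside your llama.cpp checkout using uv
cd llama.cpp
uv venv venv --python 3.11
source venv/bin/activate
uv pip install -r requirements.txt

# convert the downloaded bf16 safetensors into a single bf16 GGUF
python convert_hf_to_gguf.py \
    --outtype bf16 \
    --outfile /mnt/llms/ubergarm/Qwen3-14B-GGUF/Qwen3-14B-BF16.gguf \
    /mnt/llms/Qwen/Qwen3-14B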
Generate imatrix
Notes:
Uses -ngl 32 here, but do whatever you need to run inferencing, e.g. -ngl 99 -ot ... etc.

cd ik_llama.cpp
./build/bin/llama-imatrix \
    --verbosity 1 \
    -m /mnt/llms/ubergarm/Qwen3-14B-GGUF/Qwen3-14B-BF16.gguf \
    -f calibration_data_v5_rc.txt \
    -o ./Qwen3-14B-BF16-imatrix.dat \
    -ngl 32 \
    --layer-similarity \
    --ctx-size 512 \
    --threads 16

mv ./Qwen3-14B-BF16-imatrix.dat ../ubergarm/Qwen3-14B-GGUF/
Create Quant Recipe
I personally like to make a bash script for each quant recipe. You can explore different mixes using layer-similarity or other imatrix statistics tools. Keep log files around with ./blah 2>&1 | tee -a logs/version-blah.log. I often like to start off with a pure q8_0 for benchmarking and then tweak as desired for target VRAM breakpoints.
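As a starting point, a recipe script can be as simple as the sketch below. It assumes the usual llama-quantize invocation (--imatrix plus input GGUF, output GGUF, quant type, thread count) and uses IQ4_KS purely as an example target; check ./build/bin/llama-quantize --help on your build for the exact options and supported quant types.

#!/usr/bin/env bash
# example recipe: squash the bf16 GGUF down to IQ4_KS using the imatrix
model_dir=/mnt/llms/ubergarm/Qwen3-14B-GGUF

./build/bin/llama-quantize \
    --imatrix "$model_dir"/Qwen3-14B-BF16-imatrix.dat \
    "$model_dir"/Qwen3-14B-BF16.gguf \
    "$model_dir"/Qwen3-14B-IQ4_KS.gguf \
    IQ4_KS \
    16

Save one script per recipe and run it with logging as above, e.g. ./recipe-iq4ks.sh 2>&1 | tee -a logs/iq4ks-v1.log (the script and log names here are just examples).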
Perplexity
Run some benchmarks to compare your various quant recipes.
model=/mnt/llms/ubergarm/Qwen3-14B-GGUF/Qwen3-14B-Q8_0.gguf
./build/bin/llama-perplexity \
    -m "$model" \
    --ctx-size 512 \
    --ubatch-size 512 \
    -f wiki.test.raw \
    -fa \
    -ngl 99 \
    --seed 1337 \
    --threads 1
Final estimate: PPL = 9.0128 +/- 0.07114
Final estimate: PPL = 9.0281 +/- 0.07136
Final estimate: PPL = 9.0505 +/- 0.07133
Final estimate: PPL = 9.1034 +/- 0.07189
Final estimate: PPL = 9.1395 +/- 0.07236
KL-Divergence
You can run KLD if you want to measure how much smaller quants diverge from the unquantized model's outputs.
I have a custom ~1.6MiB ubergarm-kld-test-corpus.txt made from whisper-large-v3 transcriptions in plain text format from some recent episodes of the Buddha at the Gas Pump (BATGAP) YT channel.
Pass 1 Generate KLD Baseline File
The output kld base file can be quite large; in this case it is ~55GiB. If you can't run BF16, you could use Q8_0 as your baseline if necessary.
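A sketch of pass 1, assuming the stock --kl-divergence-base option of llama-perplexity; the baseline filename is just a placeholder:

# run the unquantized BF16 model over the corpus and save its logits as the baseline
model=/mnt/llms/ubergarm/Qwen3-14B-GGUF/Qwen3-14B-BF16.gguf
./build/bin/llama-perplexity \
    -m "$model" \
    -f ubergarm-kld-test-corpus.txt \
    --kl-divergence-base ./Qwen3-14B-BF16-kld-base.dat \
    --ctx-size 512 \
    -ngl 32 \
    --threads 16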
Pass 2 Measure KLD
This pass uses the above kld base file as the input baseline and will report perplexity on this corpus as well as various other statistics.
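A matching sketch of pass 2, again assuming the stock --kl-divergence flags; swap in whichever quant you want to measure:

# compare the quantized model's logits against the saved BF16 baseline
model=/mnt/llms/ubergarm/Qwen3-14B-GGUF/Qwen3-14B-IQ4_KS.gguf
./build/bin/llama-perplexity \
    -m "$model" \
    -f ubergarm-kld-test-corpus.txt \
    --kl-divergence-base ./Qwen3-14B-BF16-kld-base.dat \
    --kl-divergence \
    --ctx-size 512 \
    -ngl 99 \
    --threads 1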
Final estimate: PPL = 14.8587 +/- 0.09987
Mean PPL(Q) : 14.846724 ± 0.099745
Median KLD: 0.000834
99.0% KLD: 0.004789
RMS Δp: 0.920 ± 0.006 %
99.0% Δp: 2.761%
Mean PPL(Q) : 14.881428 ± 0.099779
Median KLD: 0.004756
99.0% KLD: 0.041509
RMS Δp: 2.267 ± 0.013 %
99.0% Δp: 6.493%
Mean PPL(Q) : 14.934694 ± 0.100320
Median KLD: 0.006275
99.0% KLD: 0.060005
RMS Δp: 2.545 ± 0.015 %
99.0% Δp: 7.203%
Mean PPL(Q) : 14.922353 ± 0.100054
Median KLD: 0.006195
99.0% KLD: 0.063428
RMS Δp: 2.581 ± 0.015 %
99.0% Δp: 7.155%
Speed Benchmarks
Run some llama-sweep-bench to see how fast your quants are over various context lengths.

model=/mnt/llms/ubergarm/Qwen3-14B-GGUF/Qwen3-14B-IQ4_KS.gguf
./build/bin/llama-sweep-bench \
    --model "$model" \
    -fa \
    -c 32768 \
    -ngl 99 \
    --warmup-batch \
    --threads 1
Vibe Check
Always remember to actually run your model to confirm it is working properly and generating valid responses.
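For example, you can serve the quant locally and chat with it through the web UI or any OpenAI-compatible client (the host and port here are arbitrary):

# serve the finished quant and poke at it interactively
model=/mnt/llms/ubergarm/Qwen3-14B-GGUF/Qwen3-14B-IQ4_KS.gguf
./build/bin/llama-server \
    --model "$model" \
    -fa \
    -c 32768 \
    -ngl 99 \
    --host 127.0.0.1 \
    --port 8080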