Hybrid quants #16322
Closed · espentveit started this conversation in General
I've been experimenting with hybrid quantization, and it seems like an interesting direction to explore further. The goal is to give smaller quantization levels a better chance of producing high-quality results, essentially adding some gradients in between the usual options.
Right now, it runs on either 100% GPU or 100% CPU (no mixed execution yet). The implementation is GPT-assisted and currently serves as a proof of concept.
Edit: After some research, it seems to be a crude variant of SpQR, so it doesn't bring anything new to the table.
## Hybrid quantization "helper overlays"

### What it is
Adds an optional "hybrid" path where small auxiliary per-weight tensors (`.helper`) capture the highest-error slices of quantized weights. At runtime we blend these helpers into the matmul output only for the selected rows, recovering quality with modest size/latency overhead.
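As a toy illustration of that blend (not the fork's code; names here are made up), the correction only touches the rows a helper covers:

```cpp
#include <vector>
#include <cstddef>

// Toy sketch: `out` starts as the low-bit matmul result for one token;
// `helper_out` holds the higher-precision result for the covered rows only.
// Adding delta = helper_out - base_out replaces just those rows and leaves
// the rest of the output untouched.
void apply_helper_overlay(std::vector<float> & out,
                          const std::vector<float> & helper_out,
                          size_t row_start) {
    for (size_t i = 0; i < helper_out.size(); ++i) {
        const float delta = helper_out[i] - out[row_start + i];
        out[row_start + i] += delta;
    }
}
```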
### User-facing flags & UX
At load / inference time:
- `--hyb-enable` – load/evaluate helper overlays if present
- `--hyb-disable` – ignore helpers (default behavior)

At quantization time:
- `--hyb-enable` – generate helpers (defaults: `q8_0`, fraction `0.2`, tile size `128`)
- `--hyb-helper-type <ggml_type>` – e.g. `q8_0`
- `--hyb-helper-fraction <0.0–1.0>` – portion of output-channel tiles to capture
- `--hyb-tile-size <N>` – tile along the output-channel dim (default 128)
- `--hyb-disable` – turn helper generation off

With helpers active, inference prints `*** STARTING INFERENCE WITH HYBRID QUANTIZATION ***` and a helper summary.
### What the helpers contain
- During quantization, output-channel tiles (of width `tile_size`) are scored using FP32 reconstruction error.
- The highest-error tiles, up to `hyb_helper_fraction` of coverage, are requantized to `helper_type` and stored as `<tensor>.helper`.

Per-helper GGUF metadata (read back as sketched below):
- `llama.hyb.<name>.tile_size` (u32)
- `llama.hyb.<name>.helper_dtype` (string)
- `llama.hyb.<name>.helper_blocks` (`[start_tile, end_tile]`)
- `llama.hyb.<name>.helper_fraction` (f32, actual coverage)
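For concreteness, here is a minimal sketch (not the fork's code) that dumps the scalar `llama.hyb.*` keys for one tensor using ggml's gguf API. The key names follow the list above; the location of `gguf.h` depends on the ggml version bundled with the build.

```cpp
#include "gguf.h"
#include <cstdio>
#include <string>

// Print the scalar hybrid-helper metadata for a single tensor, if present.
static void print_helper_meta(const gguf_context * ctx, const std::string & name) {
    const std::string prefix = "llama.hyb." + name + ".";

    const int64_t k_tile = gguf_find_key(ctx, (prefix + "tile_size").c_str());
    const int64_t k_type = gguf_find_key(ctx, (prefix + "helper_dtype").c_str());
    const int64_t k_frac = gguf_find_key(ctx, (prefix + "helper_fraction").c_str());

    if (k_tile < 0 || k_type < 0 || k_frac < 0) {
        printf("%s: no hybrid helper metadata\n", name.c_str());
        return;
    }

    printf("%s: tile_size=%u dtype=%s fraction=%.3f\n",
           name.c_str(),
           gguf_get_val_u32(ctx, k_tile),
           gguf_get_val_str(ctx, k_type),
           gguf_get_val_f32(ctx, k_frac));
}

int main(int argc, char ** argv) {
    if (argc < 3) {
        printf("usage: %s model.gguf tensor_name\n", argv[0]);
        return 1;
    }

    // Metadata only: no tensor data is allocated.
    gguf_init_params params = { /*.no_alloc =*/ true, /*.ctx =*/ nullptr };
    gguf_context * ctx = gguf_init_from_file(argv[1], params);
    if (!ctx) {
        return 1;
    }

    print_helper_meta(ctx, argv[2]);
    gguf_free(ctx);
    return 0;
}
```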
### Runtime application (graph path)
On matmul (`build_lora_mm` / `_mm_id`):
- Computes `delta = helper_out - base_out` for the rows covered by a helper.
- Adds `delta` into the corresponding rows of the result (see the graph sketch below).
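A rough ggml-graph sketch of that blend, assuming F32 matmul outputs and a single contiguous helper block of `n_rows` output rows starting at `row_start`; the fork's actual `build_lora_mm` changes may differ in detail.

```cpp
#include "ggml.h"

// w_base:   quantized weight                    [n_embd, n_out]
// w_helper: requantized copy of the helper rows [n_embd, n_rows]
// cur:      activations                         [n_embd, n_tokens]
static ggml_tensor * build_hybrid_mm(ggml_context * ctx,
                                     ggml_tensor  * w_base,
                                     ggml_tensor  * w_helper,
                                     ggml_tensor  * cur,
                                     int64_t row_start, int64_t n_rows) {
    ggml_tensor * base_out   = ggml_mul_mat(ctx, w_base,   cur); // [n_out,  n_tokens]
    ggml_tensor * helper_out = ggml_mul_mat(ctx, w_helper, cur); // [n_rows, n_tokens]

    // View of the covered rows of the base result, made contiguous for the sub.
    ggml_tensor * base_rows = ggml_cont(ctx, ggml_view_2d(ctx, base_out,
            n_rows, base_out->ne[1],
            base_out->nb[1],
            row_start * ggml_element_size(base_out)));

    // delta = helper_out - base_out, restricted to the covered rows.
    ggml_tensor * delta = ggml_sub(ctx, helper_out, base_rows);

    // Accumulate delta into the corresponding rows of the full result.
    return ggml_acc(ctx, base_out, delta,
            base_out->nb[1], base_out->nb[2], base_out->nb[3],
            row_start * ggml_element_size(base_out));
}
```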
### API / params surface
- `common/common.h`: `common_params_model::hyb_enable` (default false)
- `include/llama.h`: `llama_model_params::hyb_enable` (load-time toggle, see the sketch below)
- `llama_model_quantize_params`: `hyb_enable`, `hyb_helper_fraction`, `hyb_helper_type`, `hyb_tile_size`
- `llama_model` exposes `has_hybrid_helpers()` and `get_hybrid_helper(base_tensor)` → rows/start/tile_size/fraction/ggml tensor
- `print_info()` reports helper count, avg coverage, extra MiB
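A sketch of the load-time toggle through the C API. The `hyb_enable` field on `llama_model_params` is the fork's addition (it is not in upstream llama.cpp), so the field name here is taken from the list above.

```cpp
#include "llama.h"
#include <cstdio>

int main(int argc, char ** argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s model-with-helpers.gguf\n", argv[0]);
        return 1;
    }

    llama_model_params mparams = llama_model_default_params();
    mparams.hyb_enable = true; // fork-specific: evaluate .helper overlays if present

    llama_model * model = llama_model_load_from_file(argv[1], mparams);
    if (!model) {
        fprintf(stderr, "failed to load model\n");
        return 1;
    }

    // In the fork, print_info() reports helper count, coverage, and extra MiB at load.
    llama_model_free(model);
    return 0;
}
```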
### Loader / GGUF
Loads `.helper` tensors, parses `llama.hyb.*` metadata, and tracks helper tensor count, total bytes, and average fraction.
### Defaults & compatibility
If a model contains helper tensors but is run without `--hyb-enable`, they are ignored.
### Quick start
- Quantize with helpers (a programmatic sketch follows below)
- Run using helpers (`--hyb-enable`)
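Hypothetically, "quantize with helpers" could also be driven through the C API using the fork's extra `llama_model_quantize_params` fields listed above (field names assumed from this post; upstream llama.cpp does not have them), equivalent to passing the `--hyb-*` flags to the quantize tool.

```cpp
#include "llama.h"
#include <cstdio>

int main(int argc, char ** argv) {
    if (argc < 3) {
        fprintf(stderr, "usage: %s in-f16.gguf out-q2-hyb.gguf\n", argv[0]);
        return 1;
    }

    llama_model_quantize_params qparams = llama_model_quantize_default_params();
    qparams.ftype = LLAMA_FTYPE_MOSTLY_Q2_K;   // base quantization

    // Fork-specific helper settings (defaults per this post: q8_0, 0.2, 128).
    qparams.hyb_enable          = true;
    qparams.hyb_helper_type     = GGML_TYPE_Q8_0;
    qparams.hyb_helper_fraction = 0.2f;
    qparams.hyb_tile_size       = 128;

    if (llama_model_quantize(argv[1], argv[2], &qparams) != 0) {
        fprintf(stderr, "quantization failed\n");
        return 1;
    }
    return 0;
}
```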
### Bench table (Qwen3-8B, various helper configs)
A very small wiki text (39 kB) was used for quick perplexity tests.

| Config | Perplexity |
|---|---|
| Q2 (pure) | 16.2050 ± 0.7387 |
Experimental code:
https://github.com/espentveit/llama.cpp-hybridquant