Hybrid quants #16322
Closed · espentveit started this conversation in General
I've been experimenting with hybrid quantization, and it seems like an interesting direction to explore further. The goal is to give smaller quantization levels a better chance of producing high-quality results, essentially adding some gradients in between the usual options.
Right now, it runs on either 100% GPU or 100% CPU (no mixed execution yet). The implementation is GPT-assisted and currently serves as a proof of concept.
Edit: After some research, it seems to be a crude variant of SpQR, so it doesn't bring anything new to the table.
## Hybrid quantization "helper overlays"

### What it is
Adds an optional "hybrid" path where small auxiliary per-weight tensors (`.helper`) capture the highest-error slices of quantized weights. At runtime we blend these helpers into the matmul output only for the selected rows, recovering quality with modest size/latency overhead.
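As a toy illustration of that blend (not the fork's code; names here are made up), the correction only touches the rows a helper covers:

```cpp
#include <vector>
#include <cstddef>

// Toy sketch: `out` starts as the low-bit matmul result for one token;
// `helper_out` holds the higher-precision result for the covered rows only.
// Adding delta = helper_out - base_out replaces just those rows and leaves
// the rest of the output untouched.
void apply_helper_overlay(std::vector<float> & out,
                          const std::vector<float> & helper_out,
                          size_t row_start) {
    for (size_t i = 0; i < helper_out.size(); ++i) {
        const float delta = helper_out[i] - out[row_start + i];
        out[row_start + i] += delta;
    }
}
```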
### User-facing flags & UX
At load / inference time:
- `--hyb-enable` – load/evaluate helper overlays if present
- `--hyb-disable` – ignore helpers (default behavior)

At quantization time:
- `--hyb-enable` – generate helpers (defaults: `q8_0`, fraction `0.2`, tile size `128`)
- `--hyb-helper-type <ggml_type>` – e.g. `q8_0`
- `--hyb-helper-fraction <0.0–1.0>` – portion of output-channel tiles to capture
- `--hyb-tile-size <N>` – tile along the output-channel dim (default 128)
- `--hyb-disable` – turn helper generation off

With helpers active, inference prints `*** STARTING INFERENCE WITH HYBRID QUANTIZATION ***` and a helper summary.
### What the helpers contain
- During quantization, output-channel tiles (of width `tile_size`) are scored using FP32 reconstruction error.
- The highest-error tiles, up to `hyb_helper_fraction` of coverage, are requantized to `helper_type` and stored as `<tensor>.helper`.

Per-helper GGUF metadata (read back as sketched below):
- `llama.hyb.<name>.tile_size` (u32)
- `llama.hyb.<name>.helper_dtype` (string)
- `llama.hyb.<name>.helper_blocks` (`[start_tile, end_tile]`)
- `llama.hyb.<name>.helper_fraction` (f32, actual coverage)
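For concreteness, here is a minimal sketch (not the fork's code) that dumps the scalar `llama.hyb.*` keys for one tensor using ggml's gguf API. The key names follow the list above; the location of `gguf.h` depends on the ggml version bundled with the build.

```cpp
#include "gguf.h"
#include <cstdio>
#include <string>

// Print the scalar hybrid-helper metadata for a single tensor, if present.
static void print_helper_meta(const gguf_context * ctx, const std::string & name) {
    const std::string prefix = "llama.hyb." + name + ".";

    const int64_t k_tile = gguf_find_key(ctx, (prefix + "tile_size").c_str());
    const int64_t k_type = gguf_find_key(ctx, (prefix + "helper_dtype").c_str());
    const int64_t k_frac = gguf_find_key(ctx, (prefix + "helper_fraction").c_str());

    if (k_tile < 0 || k_type < 0 || k_frac < 0) {
        printf("%s: no hybrid helper metadata\n", name.c_str());
        return;
    }

    printf("%s: tile_size=%u dtype=%s fraction=%.3f\n",
           name.c_str(),
           gguf_get_val_u32(ctx, k_tile),
           gguf_get_val_str(ctx, k_type),
           gguf_get_val_f32(ctx, k_frac));
}

int main(int argc, char ** argv) {
    if (argc < 3) {
        printf("usage: %s model.gguf tensor_name\n", argv[0]);
        return 1;
    }

    // Metadata only: no tensor data is allocated.
    gguf_init_params params = { /*.no_alloc =*/ true, /*.ctx =*/ nullptr };
    gguf_context * ctx = gguf_init_from_file(argv[1], params);
    if (!ctx) {
        return 1;
    }

    print_helper_meta(ctx, argv[2]);
    gguf_free(ctx);
    return 0;
}
```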
### Runtime application (graph path)
On matmul (`build_lora_mm` / `_mm_id`):
- Computes `delta = helper_out - base_out` for the rows covered by a helper.
- Adds `delta` into the corresponding rows of the result (see the graph sketch below).
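A rough ggml-graph sketch of that blend, assuming F32 matmul outputs and a single contiguous helper block of `n_rows` output rows starting at `row_start`; the fork's actual `build_lora_mm` changes may differ in detail.

```cpp
#include "ggml.h"

// w_base:   quantized weight                    [n_embd, n_out]
// w_helper: requantized copy of the helper rows [n_embd, n_rows]
// cur:      activations                         [n_embd, n_tokens]
static ggml_tensor * build_hybrid_mm(ggml_context * ctx,
                                     ggml_tensor  * w_base,
                                     ggml_tensor  * w_helper,
                                     ggml_tensor  * cur,
                                     int64_t row_start, int64_t n_rows) {
    ggml_tensor * base_out   = ggml_mul_mat(ctx, w_base,   cur); // [n_out,  n_tokens]
    ggml_tensor * helper_out = ggml_mul_mat(ctx, w_helper, cur); // [n_rows, n_tokens]

    // View of the covered rows of the base result, made contiguous for the sub.
    ggml_tensor * base_rows = ggml_cont(ctx, ggml_view_2d(ctx, base_out,
            n_rows, base_out->ne[1],
            base_out->nb[1],
            row_start * ggml_element_size(base_out)));

    // delta = helper_out - base_out, restricted to the covered rows.
    ggml_tensor * delta = ggml_sub(ctx, helper_out, base_rows);

    // Accumulate delta into the corresponding rows of the full result.
    return ggml_acc(ctx, base_out, delta,
            base_out->nb[1], base_out->nb[2], base_out->nb[3],
            row_start * ggml_element_size(base_out));
}
```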
### API / params surface
- `common/common.h`: `common_params_model::hyb_enable` (default false)
- `include/llama.h`: `llama_model_params::hyb_enable` (load-time toggle, see the sketch below)
- `llama_model_quantize_params`: `hyb_enable`, `hyb_helper_fraction`, `hyb_helper_type`, `hyb_tile_size`
- `llama_model` exposes `has_hybrid_helpers()` and `get_hybrid_helper(base_tensor)` → rows/start/tile_size/fraction/ggml tensor
- `print_info()` reports helper count, avg coverage, extra MiB
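A sketch of the load-time toggle through the C API. The `hyb_enable` field on `llama_model_params` is the fork's addition (it is not in upstream llama.cpp), so the field name here is taken from the list above.

```cpp
#include "llama.h"
#include <cstdio>

int main(int argc, char ** argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s model-with-helpers.gguf\n", argv[0]);
        return 1;
    }

    llama_model_params mparams = llama_model_default_params();
    mparams.hyb_enable = true; // fork-specific: evaluate .helper overlays if present

    llama_model * model = llama_model_load_from_file(argv[1], mparams);
    if (!model) {
        fprintf(stderr, "failed to load model\n");
        return 1;
    }

    // In the fork, print_info() reports helper count, coverage, and extra MiB at load.
    llama_model_free(model);
    return 0;
}
```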
### Loader / GGUF
Loads `.helper` tensors, parses `llama.hyb.*` metadata, and tracks helper tensor count, total bytes, and average fraction.
### Defaults & compatibility
If a model contains helper tensors but is run without `--hyb-enable`, they are ignored.
### Quick start
- Quantize with helpers (a programmatic sketch follows below)
- Run using helpers (`--hyb-enable`)
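Hypothetically, "quantize with helpers" could also be driven through the C API using the fork's extra `llama_model_quantize_params` fields listed above (field names assumed from this post; upstream llama.cpp does not have them), equivalent to passing the `--hyb-*` flags to the quantize tool.

```cpp
#include "llama.h"
#include <cstdio>

int main(int argc, char ** argv) {
    if (argc < 3) {
        fprintf(stderr, "usage: %s in-f16.gguf out-q2-hyb.gguf\n", argv[0]);
        return 1;
    }

    llama_model_quantize_params qparams = llama_model_quantize_default_params();
    qparams.ftype = LLAMA_FTYPE_MOSTLY_Q2_K;   // base quantization

    // Fork-specific helper settings (defaults per this post: q8_0, 0.2, 128).
    qparams.hyb_enable          = true;
    qparams.hyb_helper_type     = GGML_TYPE_Q8_0;
    qparams.hyb_helper_fraction = 0.2f;
    qparams.hyb_tile_size       = 128;

    if (llama_model_quantize(argv[1], argv[2], &qparams) != 0) {
        fprintf(stderr, "quantization failed\n");
        return 1;
    }
    return 0;
}
```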
### Bench table (Qwen3-8B, various helper configs)
A very small wiki text (39 kB) was used for quick perplexity tests.

| Config | Perplexity |
|---|---|
| Q2 (pure) | 16.2050 ± 0.7387 |
Experimental code:
https://github.com/espentveit/llama.cpp-hybridquant