feat: add ParoQuant (pairwise rotation quantization) support #164
Open
spokvulcan wants to merge 7 commits into ml-explore:main from
Conversation
Add support for loading PARO-quantized models (AutoAWQ format) with pairwise Givens rotation. Pre-rotates weights at load time to eliminate per-token rotation kernel dispatch, yielding +23% generation throughput.

Key changes:
- RotateQuantizedLinear: Metal rotation kernel + weight pre-rotation
- ParoQuantLoader: AWQ conversion, loading, disk caching of pre-rotated weights
- Qwen35: in_proj_b/a fusion via sanitize for single matmul dispatch

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The sanitize-based fusion only concatenated .weight but not .scales and .biases, leaving the fused module unquantized and causing shape mismatches when loading standard quantized Qwen3.5 checkpoints. Reverts to separate in_proj_b / in_proj_a modules from the base branch. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PARO checkpoints store in_proj_b + in_proj_a as a single fused in_proj_ba key. Add sanitize logic to split it back into separate in_proj_b / in_proj_a weights (and scales/biases if quantized). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
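The split described in this commit can be sketched as follows. The key names mirror the commit message, but the split point `d_b`, the key prefix, and the row-wise split of `scales`/`biases` are assumptions for illustration (written in Python rather than the repo's Swift for brevity):

```python
def split_fused_in_proj(weights, d_b, prefix="layers.0.in_proj"):
    """Split a fused `in_proj_ba` tensor back into `in_proj_b` / `in_proj_a`.

    `d_b` is the number of output rows belonging to in_proj_b (assumed
    known from the model config). Quantized checkpoints carry matching
    `scales` / `biases` tensors whose output rows split the same way.
    """
    out = dict(weights)
    for suffix in ("weight", "scales", "biases"):
        key = f"{prefix}_ba.{suffix}"
        if key not in out:
            continue  # unquantized checkpoints have no scales/biases
        fused = out.pop(key)          # list of rows, length d_b + d_a
        out[f"{prefix}_b.{suffix}"] = fused[:d_b]
        out[f"{prefix}_a.{suffix}"] = fused[d_b:]
    return out

# toy fused checkpoint: 6 output rows, of which the first 4 belong to in_proj_b
ckpt = {"layers.0.in_proj_ba.weight": [[float(i)] * 2 for i in range(6)]}
split = split_fused_in_proj(ckpt, d_b=4)
```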
On hybrid models (e.g. Qwen 3.5 mixing attention + Mamba layers), cache[0] is a MambaCache with offset=0. The guard failed and no layers got quantized. Now finds the first KVCacheSimple entry instead. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
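The guard fix amounts to scanning the cache list for the first attention-style entry instead of assuming index 0. The class names follow the commit message; the surrounding structure is an illustrative Python sketch, not the repo's actual cache types:

```python
class MambaCache:
    # state-space layer cache; always reports offset == 0
    offset = 0

class KVCacheSimple:
    # attention layer cache; offset counts tokens cached so far
    def __init__(self, offset):
        self.offset = offset

def first_kv_offset(cache):
    """Return the token offset of the first attention cache entry.

    Checking cache[0] breaks on hybrid models where layer 0 is a Mamba
    layer: MambaCache reports offset == 0, so an `offset > 0` guard
    never fires and no layers get quantized.
    """
    for entry in cache:
        if isinstance(entry, KVCacheSimple):
            return entry.offset
    return 0  # model has no attention layers at all

# hybrid layout: Mamba layer first, attention layer later
cache = [MambaCache(), KVCacheSimple(offset=128)]
```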
Add 8 unit tests covering pair packing, AutoAWQ conversion, quantization round-trip, and pre-rotation equivalence. Document the channel-scales asymmetry (diag(s²) factor) on callAsFunction for future maintainers. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
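For reference, a pairwise Givens rotation applies an independent 2-D rotation to each disjoint channel pair, so the full transform is orthogonal and exactly invertible by negating the angles. A round-trip check along the lines of the tests above can be sketched like this (pair layout and parameter names are illustrative, not the actual checkpoint format):

```python
import math

def apply_pairwise_rotation(x, pairs, thetas):
    """Apply a 2x2 Givens rotation to each channel pair (i, j) of x.

    Pairs are disjoint, so the whole transform is an orthogonal
    block-diagonal matrix of 2x2 rotations (up to a channel permutation).
    """
    y = list(x)
    for (i, j), t in zip(pairs, thetas):
        c, s = math.cos(t), math.sin(t)
        y[i] = c * x[i] - s * x[j]
        y[j] = s * x[i] + c * x[j]
    return y

x = [1.0, 2.0, 3.0, 4.0]
pairs = [(0, 2), (1, 3)]
thetas = [0.3, -1.1]

y = apply_pairwise_rotation(x, pairs, thetas)
# orthogonality: negating every angle inverts the transform exactly
back = apply_pairwise_rotation(y, pairs, [-t for t in thetas])
```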
It is not appropriate to pre-rotate the weights. The purpose of the rotations is to make the weights more quantization-friendly. If you rotate the weights and then quantize them, then fold the rotation back in, the rotation will have no effect at all.
Contributor
Author
@liang2kl Thank you for pointing this out. I've removed the pre-rotation and results improved significantly on our 14-scenario tool-calling agent benchmark (file edits, multi-step instructions, exact string matching): Qwen 3.5 4B PARO now outperforms standard 8-bit quantization on scenarios passed, duplicate rate, throughput, and memory, which is a great result. I believe there's still room to squeeze more decode performance out of this, so I'll keep iterating. Thanks again for the feedback!
Summary
Adds support for ParoQuant (ICLR 2026) INT4-quantized models. ParoQuant uses learned pairwise Givens rotations to suppress weight outliers before quantization, achieving ~2.4% accuracy improvement over standard AWQ on reasoning tasks.
What's included:
- RotateQuantizedLinear — applies channel-scaling + pairwise Givens rotation to activations at runtime via a Metal kernel, then feeds into the standard quantizedMatmul
- ParoQuantLoader — loads AutoAWQ-format checkpoints, converts weight layout, and patches rotation layers with their learned parameters (theta, pairs, channel_scales)
- in_proj_ba split in sanitize for PARO checkpoint compatibility, hybrid cache handling for mixed attention/Mamba layers

Agent benchmark results
We run a 14-scenario tool-calling benchmark (file reads, edits, multi-step instructions, clarification requests) on Apple Silicon. Results are multi-run averages.
The 9B PARO outperforms 4.5-bit mixed-precision (OptiQ) on scenarios passed, duplicate rate, and throughput while using comparable memory.
Key finding: an earlier version pre-baked rotations into quantized weights for +23% throughput. This degraded structured output quality on the 9B model (5/14 passed, 11% dups). Applying rotation to activations at runtime instead — keeping weights in their original INT4 form — restored accuracy at a small throughput cost.
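In exact arithmetic the identity behind this is simple: for an orthogonal rotation R, (xR)(WR)ᵀ = x R Rᵀ Wᵀ = x Wᵀ, so rotating activations at runtime against rotation-then-quantized weights reproduces the original linear map, while the quantizer only ever sees the outlier-suppressed W R. A minimal numeric check with a single 2×2 Givens rotation (toy values, illustrative only):

```python
import math

def matmul(A, B):
    # plain row-by-column matrix product on nested lists
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

theta = 0.7
R = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta),  math.cos(theta)]]   # orthogonal Givens rotation

W = [[4.0, 0.1],    # toy weight matrix
     [3.5, 0.2]]
x = [[1.0, 2.0]]    # single activation row

W_rot = matmul(W, R)                              # what the quantizer sees
runtime = matmul(matmul(x, R), transpose(W_rot))  # rotate activations at runtime
direct = matmul(x, transpose(W))                  # original linear map
```

In the real pipeline W_rot would additionally pass through the INT4 quantizer, which is where the rotation pays off; this sketch only checks that the rotation itself is transparent to the output.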
Test plan