feat: add ParoQuant (pairwise rotation quantization) support#164

Open
spokvulcan wants to merge 7 commits into ml-explore:main from spokvulcan:feat/paroquant-support

Conversation

@spokvulcan
Contributor

@spokvulcan spokvulcan commented Mar 27, 2026

Summary

Adds support for ParoQuant (ICLR 2026) INT4-quantized models. ParoQuant uses learned pairwise Givens rotations to suppress weight outliers before quantization, achieving ~2.4% accuracy improvement over standard AWQ on reasoning tasks.

What's included:

  • RotateQuantizedLinear — applies channel-scaling + pairwise Givens rotation to activations at runtime via Metal kernel, then feeds into standard quantizedMatmul
  • ParoQuantLoader — loads AutoAWQ-format checkpoints, converts weight layout, and patches rotation layers with their learned parameters (theta, pairs, channel_scales)
  • Qwen 3.5 fixes — in_proj_ba split in sanitize for PARO checkpoint compatibility, hybrid cache handling for mixed attention/Mamba layers
  • 7 unit tests (pair packing, AWQ conversion, quantization round-trip)
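For intuition, the runtime path in RotateQuantizedLinear can be sketched in a few lines of numpy (hypothetical helper names; the real implementation is a Metal kernel, and this sketch ignores quantization entirely):

```python
import numpy as np

def apply_pairwise_givens(x, pairs, theta):
    """Rotate disjoint channel pairs of x by learned angles theta.

    x:     (..., d) activations
    pairs: (k, 2) integer array of disjoint channel indices
    theta: (k,) learned rotation angles
    """
    y = x.copy()
    c, s = np.cos(theta), np.sin(theta)
    i, j = pairs[:, 0], pairs[:, 1]
    y[..., i] = c * x[..., i] - s * x[..., j]
    y[..., j] = s * x[..., i] + c * x[..., j]
    return y

def rotate_quantized_linear(x, channel_scales, pairs, theta, w):
    # 1) per-channel scaling, 2) pairwise Givens rotation,
    # 3) matmul -- where the real layer calls quantizedMatmul on INT4 weights.
    x = apply_pairwise_givens(x * channel_scales, pairs, theta)
    return x @ w.T
```

Each Givens rotation mixes exactly two channels, so the whole transform is orthogonal and costs O(d) per token, which is why applying it to activations at runtime stays cheap.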

Agent benchmark results

We run a 14-scenario tool-calling benchmark (file reads, edits, multi-step instructions, clarification requests) on Apple Silicon. Results are multi-run averages.

| Model | Runs | Passed | Tool Acc | Dup Rate | tok/s | Peak Mem |
|---|---|---|---|---|---|---|
| Qwen3.5-9B PARO (INT4) | 4 | 7.2/14 | 85.7% | 0.9% | 51 | 9.0 GB |
| Qwen3.5-9B OptiQ (4.5-bit) | 4 | 6.2/14 | 84.3% | 3.1% | 40 | 8.3 GB |
| Qwen3.5-4B PARO (INT4) | 7 | 5.6/14 | 77.5% | 9.7% | 67 | 4.2 GB |
| Qwen3.5-4B (8-bit) | 7 | 4.3/14 | 79.6% | 12.3% | 58 | 5.6 GB |

The 9B PARO outperforms 4.5-bit mixed-precision (OptiQ) on scenarios passed, duplicate rate, and throughput while using comparable memory.

Key finding: an earlier version pre-baked rotations into quantized weights for +23% throughput. This degraded structured output quality on the 9B model (5/14 passed, 11% dups). Applying rotation to activations at runtime instead — keeping weights in their original INT4 form — restored accuracy at a small throughput cost.
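The effect is easy to reproduce on a toy example (a numpy sketch, not the shipped code): with a single outlier channel, quantizing in the rotated basis and rotating activations at runtime cuts the round-trip error, while folding the rotation back into re-quantized weights re-exposes the outlier:

```python
import numpy as np

def quant4(w):
    """Symmetric round-to-nearest INT4: scale chosen from max |w|."""
    scale = np.abs(w).max() / 7
    return np.round(w / scale) * scale

w = np.array([10.0, 0.5])           # one outlier channel
c = np.sqrt(0.5)
R = np.array([[c, -c], [c, c]])     # 45-degree Givens rotation

# Quantize as-is: the small channel is crushed by the outlier's scale.
err_direct = np.abs(w - quant4(w)).max()

# ParoQuant-style: quantize in the rotated basis and rotate activations
# at runtime (equivalent here to un-rotating the dequantized weights).
err_runtime = np.abs(w - R.T @ quant4(R @ w)).max()

# Pre-baked: fold the rotation back and re-quantize to INT4 --
# the outlier returns and the benefit is lost.
err_baked = np.abs(w - quant4(R.T @ quant4(R @ w))).max()
```

On this toy pair, err_runtime is half of err_direct, and err_baked is worse than either.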

Test plan

  • Unit tests pass
  • End-to-end inference with Qwen3.5-4B-PARO and Qwen3.5-9B-PARO
  • Non-PARO Qwen 3.5 models still load correctly (regression check)
  • Agent benchmark on 4 model variants (table above)

spokvulcan and others added 5 commits March 28, 2026 00:25
Add support for loading PARO-quantized models (AutoAWQ format) with
pairwise Givens rotation. Pre-rotates weights at load time to eliminate
per-token rotation kernel dispatch, yielding +23% generation throughput.

Key changes:
- RotateQuantizedLinear: Metal rotation kernel + weight pre-rotation
- ParoQuantLoader: AWQ conversion, loading, disk caching of pre-rotated weights
- Qwen35: in_proj_b/a fusion via sanitize for single matmul dispatch

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The sanitize-based fusion only concatenated .weight but not .scales
and .biases, leaving the fused module unquantized and causing shape
mismatches when loading standard quantized Qwen3.5 checkpoints.

Reverts to separate in_proj_b / in_proj_a modules from the base branch.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

PARO checkpoints store in_proj_b + in_proj_a as a single fused
in_proj_ba key. Add sanitize logic to split it back into separate
in_proj_b / in_proj_a weights (and scales/biases if quantized).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
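The split described above might look roughly like this (a Python sketch with hypothetical key names; the fused tensor is assumed to be concatenated along the output dimension):

```python
import numpy as np

def split_in_proj_ba(weights, b_out, a_out):
    """Split fused in_proj_ba checkpoint entries back into separate
    in_proj_b / in_proj_a entries.

    Assumes the fused tensors are stacked along the output dimension,
    with b_out rows for in_proj_b followed by a_out rows for in_proj_a.
    """
    out = {}
    for key, value in weights.items():
        if "in_proj_ba." in key:
            # Also covers .scales / .biases in quantized checkpoints,
            # which share the same output-dimension layout as .weight.
            b_key = key.replace("in_proj_ba.", "in_proj_b.")
            a_key = key.replace("in_proj_ba.", "in_proj_a.")
            out[b_key] = value[:b_out]
            out[a_key] = value[b_out:b_out + a_out]
        else:
            out[key] = value
    return out
```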
On hybrid models (e.g. Qwen 3.5 mixing attention + Mamba layers),
cache[0] is a MambaCache with offset=0. The guard failed and no
layers got quantized. Now finds the first KVCacheSimple entry instead.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
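The fix amounts to scanning the cache list for the first attention cache instead of trusting index 0 (a Python sketch with stand-in classes, not the actual MLX types):

```python
class MambaCache:
    """Stand-in for a state-space cache; carries no token offset."""

class KVCacheSimple:
    """Stand-in for an attention KV cache with a token offset."""
    def __init__(self, offset=0):
        self.offset = offset

def first_kv_cache(cache):
    """Find the first attention KV cache in a hybrid model's cache list.

    On hybrid models, cache[0] may be a MambaCache, so a guard like
    `cache[0].offset > 0` silently fails; scan for a KVCacheSimple instead.
    """
    return next((c for c in cache if isinstance(c, KVCacheSimple)), None)
```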
Add 8 unit tests covering pair packing, AutoAWQ conversion, quantization
round-trip, and pre-rotation equivalence. Document the channel-scales
asymmetry (diag(s²) factor) on callAsFunction for future maintainers.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@liang2kl

It is not appropriate to pre-rotate the weights. The purpose of the rotations is to make the weights more quantization-friendly. If you rotate the weights and then dequantize them, the rotation will have no effect at all.

@spokvulcan
Contributor Author

spokvulcan commented Mar 29, 2026

@liang2kl Thank you for pointing this out. I've removed the pre-rotation and re-ran our 14-scenario tool-calling agent benchmark (file edits, multi-step instructions, exact string matching); results improved significantly.

The Qwen3.5-4B PARO now outperforms standard 8-bit quantization on scenarios passed, duplicate rate, throughput, and memory, which is a great result.

I believe there's still room to squeeze more decode performance out of this, so I'll keep iterating. Thanks again for the feedback!

@spokvulcan spokvulcan marked this pull request as ready for review March 29, 2026 15:58