feat: add ParoQuant (pairwise rotation quantization) support by spokvulcan · Pull Request #164 · ml-explore/mlx-swift-lm

spokvulcan · 2026-03-27T23:05:36Z

Summary

Adds support for ParoQuant (ICLR 2026) INT4-quantized models. ParoQuant uses learned pairwise Givens rotations to suppress weight outliers before quantization, achieving ~2.4% accuracy improvement over standard AWQ on reasoning tasks.

What's included:

RotateQuantizedLinear — applies channel-scaling + pairwise Givens rotation to activations at runtime via Metal kernel, then feeds into standard quantizedMatmul
ParoQuantLoader — loads AutoAWQ-format checkpoints, converts weight layout, and patches rotation layers with their learned parameters (theta, pairs, channel_scales)
Qwen 3.5 fixes — in_proj_ba split in sanitize for PARO checkpoint compatibility, hybrid cache handling for mixed attention/Mamba layers
7 unit tests (pair packing, AWQ conversion, quantization round-trip)

Agent benchmark results

We run a 14-scenario tool-calling benchmark (file reads, edits, multi-step instructions, clarification requests) on Apple Silicon. Results are multi-run averages.

Model	Runs	Passed	Tool Acc	Dup Rate	tok/s	Peak Mem
Qwen3.5-9B PARO (INT4)	4	7.2/14	85.7%	0.9%	51	9.0 GB
Qwen3.5-9B OptiQ (4.5-bit)	4	6.2/14	84.3%	3.1%	40	8.3 GB
Qwen3.5-4B PARO (INT4)	7	5.6/14	77.5%	9.7%	67	4.2 GB
Qwen3.5-4B (8-bit)	7	4.3/14	79.6%	12.3%	58	5.6 GB

The 9B PARO outperforms 4.5-bit mixed-precision (OptiQ) on scenarios passed, duplicate rate, and throughput while using comparable memory.

Key finding: an earlier version pre-baked rotations into quantized weights for +23% throughput. This degraded structured output quality on the 9B model (5/14 passed, 11% dups). Applying rotation to activations at runtime instead — keeping weights in their original INT4 form — restored accuracy at a small throughput cost.

Test plan

Unit tests pass
End-to-end inference with Qwen3.5-4B-PARO and Qwen3.5-9B-PARO
Non-PARO Qwen 3.5 models still load correctly (regression check)
Agent benchmark on 4 model variants (table above)

Add support for loading PARO-quantized models (AutoAWQ format) with pairwise Givens rotation. Pre-rotates weights at load time to eliminate per-token rotation kernel dispatch, yielding +23% generation throughput. Key changes: - RotateQuantizedLinear: Metal rotation kernel + weight pre-rotation - ParoQuantLoader: AWQ conversion, loading, disk caching of pre-rotated weights - Qwen35: in_proj_b/a fusion via sanitize for single matmul dispatch Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The sanitize-based fusion only concatenated .weight but not .scales and .biases, leaving the fused module unquantized and causing shape mismatches when loading standard quantized Qwen3.5 checkpoints. Reverts to separate in_proj_b / in_proj_a modules from the base branch. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

PARO checkpoints store in_proj_b + in_proj_a as a single fused in_proj_ba key. Add sanitize logic to split it back into separate in_proj_b / in_proj_a weights (and scales/biases if quantized). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

On hybrid models (e.g. Qwen 3.5 mixing attention + Mamba layers), cache[0] is a MambaCache with offset=0. The guard failed and no layers got quantized. Now finds the first KVCacheSimple entry instead. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add 8 unit tests covering pair packing, AutoAWQ conversion, quantization round-trip, and pre-rotation equivalence. Document the channel-scales asymmetry (diag(s²) factor) on callAsFunction for future maintainers. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

liang2kl · 2026-03-29T07:00:41Z

It is not appropriate to pre-rotate the weights. The purpose of the rotations is to make the weights more quantization-friendly. If you rotate the weights then dequantize them the rotation will have no effects at all.

spokvulcan · 2026-03-29T14:28:35Z

@liang2kl Thank you for pointing this out I've removed the pre-rotation and results improved significantly. On our 14-scenario tool-calling agent benchmark (file edits, multi-step instructions, exact string matching).

The Qwen 3.5 4B PARO now outperforms standard 8-bit quantization on scenarios passed, duplicate rate, throughput, and memory which is a great result.

I believe there's still room to squeeze more decode performance out of this, so I'll keep iterating. Thanks again for the feedback!

spokvulcan and others added 5 commits March 28, 2026 00:25

Remove ParoQuant weight prerotation

1cd4438

Remove copyright comments from ParoQuant files

2f3e290

spokvulcan marked this pull request as ready for review March 29, 2026 15:58

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add ParoQuant (pairwise rotation quantization) support#164

feat: add ParoQuant (pairwise rotation quantization) support#164
spokvulcan wants to merge 7 commits intoml-explore:mainfrom
spokvulcan:feat/paroquant-support

spokvulcan commented Mar 27, 2026 •

edited

Loading

Uh oh!

liang2kl commented Mar 29, 2026

Uh oh!

spokvulcan commented Mar 29, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

spokvulcan commented Mar 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Agent benchmark results

Test plan

Uh oh!

liang2kl commented Mar 29, 2026

Uh oh!

spokvulcan commented Mar 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

spokvulcan commented Mar 27, 2026 •

edited

Loading

spokvulcan commented Mar 29, 2026 •

edited

Loading