Commit 1f66d13 (parent: acc217a)

Update README and fix serve CLI import

Reorganize Quick Start into clear sections (chat, serve, Docker). Move the Qwen3.5 models to the top of the model list. Fix the serve CLI to use vLLM's FlexibleArgumentParser and async run_server.

Made-with: Cursor

File tree: 2 files changed (+34, −22 lines)


README.md (32 additions, 19 deletions):

````diff
@@ -19,28 +19,50 @@ State-of-the-art INT4 quantization for LLMs. ParoQuant uses learned pairwise rot
 
 ## Quick Start
 
-**NVIDIA GPU:**
+### Interactive Chat
 
 ```bash
+# NVIDIA GPU
 pip install "paroquant[vllm]"
 python -m paroquant.cli.chat --model z-lab/Qwen3-8B-PARO
 
-# or with Docker
-docker run --pull=always --rm -it --gpus all --ipc=host \
-  ghcr.io/z-lab/paroquant:chat --model z-lab/Qwen3-8B-PARO
+# Apple Silicon
+pip install "paroquant[mlx]"
+python -m paroquant.cli.chat --model z-lab/Qwen3-8B-PARO
 ```
 
-**Apple Silicon:**
+### OpenAI-Compatible API Server
 
 ```bash
-pip install "paroquant[mlx]"
-python -m paroquant.cli.chat --model z-lab/Qwen3-8B-PARO
+pip install "paroquant[vllm]"
+python -m paroquant.cli.serve --model z-lab/Qwen3-8B-PARO
+```
+
+### Docker
+
+```bash
+# Interactive chat
+docker run --pull=always --rm -it --gpus all --ipc=host \
+  ghcr.io/z-lab/paroquant:chat --model z-lab/Qwen3-8B-PARO
+
+# API server (port 8000)
+docker run --pull=always --rm -it --gpus all --ipc=host -p 8000:8000 \
+  ghcr.io/z-lab/paroquant:serve --model z-lab/Qwen3-8B-PARO
 ```
 
 ## Models
 
 All models are available on [Hugging Face](https://huggingface.co/collections/z-lab/paroquant). Swap the model name in the commands above to try any of them.
 
+**Qwen3.5**
+
+| Model | Checkpoint |
+|---|---|
+| Qwen3.5-0.8B | [`z-lab/Qwen3.5-0.8B-PARO`](https://huggingface.co/z-lab/Qwen3.5-0.8B-PARO) |
+| Qwen3.5-2B | [`z-lab/Qwen3.5-2B-PARO`](https://huggingface.co/z-lab/Qwen3.5-2B-PARO) |
+| Qwen3.5-4B | [`z-lab/Qwen3.5-4B-PARO`](https://huggingface.co/z-lab/Qwen3.5-4B-PARO) |
+| Qwen3.5-9B | [`z-lab/Qwen3.5-9B-PARO`](https://huggingface.co/z-lab/Qwen3.5-9B-PARO) |
+
 **Qwen3**
 
 | Model | Checkpoint |
@@ -51,15 +73,6 @@ All models are available on [Hugging Face](https://huggingface.co/collections/z-
 | Qwen3-8B | [`z-lab/Qwen3-8B-PARO`](https://huggingface.co/z-lab/Qwen3-8B-PARO) |
 | Qwen3-14B | [`z-lab/Qwen3-14B-PARO`](https://huggingface.co/z-lab/Qwen3-14B-PARO) |
 
-**Qwen3.5**
-
-| Model | Checkpoint |
-|---|---|
-| Qwen3.5-0.8B | [`z-lab/Qwen3.5-0.8B-PARO`](https://huggingface.co/z-lab/Qwen3.5-0.8B-PARO) |
-| Qwen3.5-2B | [`z-lab/Qwen3.5-2B-PARO`](https://huggingface.co/z-lab/Qwen3.5-2B-PARO) |
-| Qwen3.5-4B | [`z-lab/Qwen3.5-4B-PARO`](https://huggingface.co/z-lab/Qwen3.5-4B-PARO) |
-| Qwen3.5-9B | [`z-lab/Qwen3.5-9B-PARO`](https://huggingface.co/z-lab/Qwen3.5-9B-PARO) |
-
 **Llama**
 
 | Model | Checkpoint |
@@ -106,10 +119,10 @@ python -m paroquant.cli.convert \
 
 | Image | Purpose |
 |---|---|
-| `ghcr.io/z-lab/paroquant:latest` | Optimization & evaluation |
-| `ghcr.io/z-lab/paroquant:chat` | Interactive chat (CUDA 13.0) |
-| `ghcr.io/z-lab/paroquant:serve` | OpenAI-compatible API server |
+| `ghcr.io/z-lab/paroquant:chat` | Interactive chat |
 | `ghcr.io/z-lab/paroquant:chat-cu129` | Interactive chat (CUDA 12.9) |
+| `ghcr.io/z-lab/paroquant:serve` | OpenAI-compatible API server |
+| `ghcr.io/z-lab/paroquant:latest` | Optimization & evaluation |
 | `ghcr.io/z-lab/paroquant:eval` | Reasoning task evaluation |
 
 ## Citation
````
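The README's serve instructions publish an OpenAI-compatible API on port 8000. As a sketch of what a client interaction looks like, here is a minimal stdlib-only Python client, assuming the server exposes the standard OpenAI `/v1/chat/completions` route (as vLLM's OpenAI-compatible server does); the `API_URL` and `ask` helper are hypothetical names, not part of this repository.

```python
import json
from urllib import request

# Assumed endpoint: the serve container publishes port 8000 per the README.
API_URL = "http://localhost:8000/v1/chat/completions"

def build_chat_request(model: str, prompt: str) -> bytes:
    """Encode a single-turn OpenAI-style chat completion payload."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")

def ask(model: str, prompt: str) -> str:
    """POST the chat request and return the assistant's reply text."""
    req = request.Request(
        API_URL,
        data=build_chat_request(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Any model name from the tables above can be passed as `model`, mirroring how the CLI commands swap checkpoints.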

paroquant/cli/serve.py (2 additions, 3 deletions):

```diff
@@ -1,17 +1,16 @@
-"""Thin wrapper around ``vllm serve`` that auto-registers the ParoQuant quantization plugin."""
-
 from __future__ import annotations
 
 import asyncio
 import sys
 
-import paroquant.inference.backends.vllm.plugin  # noqa: F401 — registers quantization config
 from vllm.entrypoints.openai.api_server import (
     FlexibleArgumentParser,
     make_arg_parser,
     run_server,
 )
 
+import paroquant.inference.backends.vllm.plugin  # noqa: F401 — registers quantization config
+
 
 def main():
     parser = make_arg_parser(FlexibleArgumentParser())
```
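Per the commit message, the fix builds the parser with vLLM's `FlexibleArgumentParser` and drives the async `run_server` coroutine. The shape of that entrypoint can be illustrated with a stdlib-only mock; `make_arg_parser` and `run_server` below are stand-ins for the vLLM functions of the same names, not their real implementations.

```python
import argparse
import asyncio

def make_arg_parser(parser: argparse.ArgumentParser) -> argparse.ArgumentParser:
    # Stand-in for vllm's make_arg_parser: attaches the server's CLI flags.
    parser.add_argument("--model", required=True)
    parser.add_argument("--port", type=int, default=8000)
    return parser

async def run_server(args: argparse.Namespace) -> str:
    # Stand-in for vllm's async run_server; the real one serves requests.
    return f"serving {args.model} on port {args.port}"

def main(argv=None) -> str:
    parser = make_arg_parser(argparse.ArgumentParser())
    args = parser.parse_args(argv)
    # run_server is a coroutine, so it must be driven by an event loop;
    # calling it directly would only create an un-awaited coroutine object,
    # which is why the fixed CLI wraps it in asyncio.run.
    return asyncio.run(run_server(args))
```

This also shows why the plugin import matters in the real module: it must execute (registering the ParoQuant quantization config) before the server starts, even though nothing references it by name, hence the `noqa: F401` in the diff.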
