Skip to content

Commit 2be645f

Browse files
authored
Add agent CLI, Qwen3.5 vLLM support, and Docker improvements (#7)
- Add paroquant.cli.agent: interactive agent with MCP tool calling - Unify paroquant.cli.serve: auto-detect vLLM/MLX backend - Fix vLLM plugin for Qwen3.5: pad Marlin partitions to tile boundary, fix modules_to_not_convert for hybrid Mamba architectures - Add warmup request in chat and agent for kernel compilation - Bump Docker vLLM to 0.17.0, add TRITON_PTXAS_BLACKWELL_PATH for Jetson Thor - Update README with Qwen3.5 examples, agent usage, and install notes - Add agent optional dependency group (qwen-agent, mcp, soundfile) Made-with: Cursor
1 parent 1f66d13 commit 2be645f

File tree

9 files changed

+433
-43
lines changed

9 files changed

+433
-43
lines changed

.github/workflows/build-docker-images.yml

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -30,15 +30,15 @@ jobs:
3030
target: chat
3131
cuda_version: "13.0.2"
3232
torch_cuda_arch_list: "8.0 8.6 8.7 8.9 9.0 10.0 12.0 12.1"
33-
vllm_version: "0.15.1"
33+
vllm_version: "0.17.0"
3434
cuda_toolkit: "cu130"
3535
platforms: linux/amd64,linux/arm64
3636

3737
- tag: serve
3838
target: serve
3939
cuda_version: "13.0.2"
4040
torch_cuda_arch_list: "8.0 8.6 8.7 8.9 9.0 10.0 12.0 12.1"
41-
vllm_version: "0.15.1"
41+
vllm_version: "0.17.0"
4242
cuda_toolkit: "cu130"
4343
platforms: linux/amd64,linux/arm64
4444

README.md

Lines changed: 28 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -19,35 +19,55 @@ State-of-the-art INT4 quantization for LLMs. ParoQuant uses learned pairwise rot
1919

2020
## Quick Start
2121

22-
### Interactive Chat
22+
### Installation
2323

2424
```bash
2525
# NVIDIA GPU
2626
pip install "paroquant[vllm]"
27-
python -m paroquant.cli.chat --model z-lab/Qwen3-8B-PARO
2827

2928
# Apple Silicon
3029
pip install "paroquant[mlx]"
31-
python -m paroquant.cli.chat --model z-lab/Qwen3-8B-PARO
30+
```
31+
32+
Pick a model from our [Hugging Face collection](https://huggingface.co/collections/z-lab/paroquant):
33+
34+
```bash
35+
export MODEL=z-lab/Qwen3.5-4B-PARO
36+
```
37+
38+
### Interactive Chat
39+
40+
```bash
41+
python -m paroquant.cli.chat --model $MODEL
3242
```
3343

3444
### OpenAI-Compatible API Server
3545

3646
```bash
37-
pip install "paroquant[vllm]"
38-
python -m paroquant.cli.serve --model z-lab/Qwen3-8B-PARO
47+
python -m paroquant.cli.serve --model $MODEL --port 8000
48+
```
49+
50+
### Agent with Tool Calling
51+
52+
Start the API server first, then install the agent dependencies and run:
53+
54+
```bash
55+
pip install "paroquant[agent]"
56+
python -m paroquant.cli.agent --model $MODEL
3957
```
4058

41-
### Docker
59+
Tool use (web fetch, filesystem, time) requires [uv](https://docs.astral.sh/uv/) and [Node.js](https://nodejs.org/en/download).
60+
61+
### Docker (NVIDIA GPU)
4262

4363
```bash
4464
# Interactive chat
4565
docker run --pull=always --rm -it --gpus all --ipc=host \
46-
ghcr.io/z-lab/paroquant:chat --model z-lab/Qwen3-8B-PARO
66+
ghcr.io/z-lab/paroquant:chat --model $MODEL
4767

4868
# API server (port 8000)
4969
docker run --pull=always --rm -it --gpus all --ipc=host -p 8000:8000 \
50-
ghcr.io/z-lab/paroquant:serve --model z-lab/Qwen3-8B-PARO
70+
ghcr.io/z-lab/paroquant:serve --model $MODEL
5171
```
5272

5373
## Models

docker/Dockerfile

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -55,6 +55,7 @@ RUN --mount=type=cache,target=/root/.cache/pip \
5555
pip install -e ".[vllm]"; \
5656
fi
5757
ENV TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas
58+
ENV TRITON_PTXAS_BLACKWELL_PATH=/usr/local/cuda/bin/ptxas
5859
ENTRYPOINT ["python", "-m", "paroquant.cli.chat"]
5960

6061
# ---- serve: OpenAI-compatible vLLM API server ----
@@ -72,6 +73,7 @@ RUN --mount=type=cache,target=/root/.cache/pip \
7273
pip install -e ".[vllm]"; \
7374
fi
7475
ENV TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas
76+
ENV TRITON_PTXAS_BLACKWELL_PATH=/usr/local/cuda/bin/ptxas
7577
ENTRYPOINT ["python", "-m", "paroquant.cli.serve"]
7678

7779
# ---- optim: optimization & evaluation ----

0 commit comments

Comments
 (0)