# ParoQuant: Pairwise Rotation Quantization for Efficient Reasoning LLM Inference

State-of-the-art INT4 quantization for LLMs. ParoQuant uses learned pairwise rotations to suppress weight outliers, closing the accuracy gap with FP16 while running at near-AWQ speed. It supports NVIDIA GPUs (vLLM, Transformers) and Apple Silicon (MLX).
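ParoQuant's rotations are learned as described in the paper; purely as an illustration of why *pairwise* (Givens) rotations help, the sketch below rotates an outlier row of a toy weight matrix against a partner row. The rotation preserves the overall norm but shrinks the maximum magnitude the quantizer must represent. The matrix and angle here are toy assumptions, not the learned procedure:

```python
import numpy as np

def pairwise_rotate(w, i, j, theta):
    """Apply a Givens (pairwise) rotation to rows i and j of w."""
    c, s = np.cos(theta), np.sin(theta)
    w = w.copy()
    wi, wj = w[i].copy(), w[j].copy()
    w[i] = c * wi - s * wj
    w[j] = s * wi + c * wj
    return w

# Toy weight matrix: row 0 is an outlier, row 1 is its partner.
w = np.array([[8.0, 8.0],
              [1.0, 1.0]])

# A 45-degree rotation redistributes magnitude across the pair,
# reducing the dynamic range the INT4 quantizer must cover.
w_rot = pairwise_rotate(w, 0, 1, np.pi / 4)
```

Because the rotation is orthogonal, it can be folded into the adjacent computation without changing the network's output, which is what makes it attractive for quantization.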
```bash
# NVIDIA GPU
pip install "paroquant[vllm]"
python -m paroquant.cli.chat --model z-lab/Qwen3-8B-PARO

# Apple Silicon
pip install "paroquant[mlx]"
python -m paroquant.cli.chat --model z-lab/Qwen3-8B-PARO
```

To run an OpenAI-compatible API server instead of the interactive chat:

```bash
pip install "paroquant[vllm]"
python -m paroquant.cli.serve --model z-lab/Qwen3-8B-PARO
```
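Once the server is up, any OpenAI-style client can talk to it. A minimal stdlib-only sketch; the `/v1/chat/completions` route follows the usual vLLM/OpenAI convention and port 8000 matches the Docker example, but both are assumptions about your local setup:

```python
import json
from urllib import request

def chat_request(prompt, model="z-lab/Qwen3-8B-PARO",
                 base_url="http://localhost:8000/v1"):
    """Build a chat-completion POST request for an OpenAI-compatible server."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )

# With the server running:
# resp = json.load(request.urlopen(chat_request("What is 2 + 2?")))
# print(resp["choices"][0]["message"]["content"])
```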
Or with Docker:

```bash
# Interactive chat
docker run --pull=always --rm -it --gpus all --ipc=host \
    ghcr.io/z-lab/paroquant:chat --model z-lab/Qwen3-8B-PARO

# API server (port 8000)
docker run --pull=always --rm -it --gpus all --ipc=host -p 8000:8000 \
    ghcr.io/z-lab/paroquant:serve --model z-lab/Qwen3-8B-PARO
```

All models are available on Hugging Face. Swap the model name in the commands above to try any of them.
### Qwen3.5
| Model | Checkpoint |
|---|---|
| Qwen3.5-0.8B | z-lab/Qwen3.5-0.8B-PARO |
| Qwen3.5-2B | z-lab/Qwen3.5-2B-PARO |
| Qwen3.5-4B | z-lab/Qwen3.5-4B-PARO |
| Qwen3.5-9B | z-lab/Qwen3.5-9B-PARO |
### Qwen3
| Model | Checkpoint |
|---|---|
| Qwen3-0.6B | z-lab/Qwen3-0.6B-PARO |
| Qwen3-1.7B | z-lab/Qwen3-1.7B-PARO |
| Qwen3-4B | z-lab/Qwen3-4B-PARO |
| Qwen3-8B | z-lab/Qwen3-8B-PARO |
| Qwen3-14B | z-lab/Qwen3-14B-PARO |
### Llama
| Model | Checkpoint |
|---|---|
| Llama-2-7B | z-lab/Llama-2-7b-hf-PARO |
| Llama-3-8B | z-lab/Meta-Llama-3-8B-PARO |
| Llama-3.1-8B-Instruct | z-lab/Llama-3.1-8B-Instruct-PARO |
Want a model that's not listed? Open an issue and let us know.
> [!NOTE]
> The `main` branch of this repository is under active development, and reproducibility is not guaranteed.
> Please use the `legacy` branch to reproduce results from the paper.
```bash
git clone https://github.com/z-lab/paroquant && cd paroquant
pip install -e ".[vllm]"         # vLLM backend (GPU, recommended)
pip install -e ".[transformers]" # Transformers backend (GPU)
pip install -e ".[mlx]"          # MLX backend (Apple Silicon)
pip install -e ".[optim,eval]"   # Optimization & evaluation
```

Or use Docker:

```bash
docker run -it --gpus all --ipc=host ghcr.io/z-lab/paroquant:latest
```
```bash
# 1. Optimize rotation parameters
experiments/optimize/4bit.sh Qwen/Qwen3-8B

# 2. Export to HF checkpoint (--mode real for INT4, --mode pseudo for FP16)
python -m paroquant.cli.convert \
    --model Qwen/Qwen3-8B \
    --result-dir output/Qwen3-8B \
    --output-path models/Qwen3-8B-PARO
```

Available Docker images:

| Image | Purpose |
|---|---|
| `ghcr.io/z-lab/paroquant:chat` | Interactive chat |
| `ghcr.io/z-lab/paroquant:chat-cu129` | Interactive chat (CUDA 12.9) |
| `ghcr.io/z-lab/paroquant:serve` | OpenAI-compatible API server |
| `ghcr.io/z-lab/paroquant:latest` | Optimization & evaluation |
| `ghcr.io/z-lab/paroquant:eval` | Reasoning task evaluation |
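In the export step above, `--mode real` stores packed INT4 weights, while `--mode pseudo` stores FP16 weights that have been rounded through the INT4 grid. For intuition, here is a generic group-wise symmetric INT4 fake-quantization round trip; the group size of 128 and the symmetric scale are assumptions for illustration, not necessarily ParoQuant's exact scheme:

```python
import numpy as np

def int4_fake_quant(w, group_size=128):
    """Round weights through a symmetric INT4 grid, one scale per group."""
    flat = w.reshape(-1, group_size)
    # One scale per group, mapping the group's max magnitude to level 7.
    scale = np.abs(flat).max(axis=1, keepdims=True) / 7
    q = np.clip(np.round(flat / scale), -8, 7)          # INT4 range [-8, 7]
    w_dq = (q * scale).reshape(w.shape)                 # dequantized FP weights
    return w_dq, q.astype(np.int8)

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)
w_dq, q = int4_fake_quant(w)
```

Outlier suppression matters here because each group's scale is set by its largest magnitude: one outlier inflates the scale and coarsens the grid for every other weight in its group.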
```bibtex
@inproceedings{liang2026paroquant,
    title     = {{ParoQuant: Pairwise Rotation Quantization for Efficient Reasoning LLM Inference}},
    author    = {Liang, Yesheng and Chen, Haisheng and Zhang, Zihan and Han, Song and Liu, Zhijian},
    booktitle = {International Conference on Learning Representations (ICLR)},
    year      = {2026}
}
```