
# ParoQuant

Pairwise Rotation Quantization for Efficient Reasoning LLM Inference


State-of-the-art INT4 quantization for LLMs. ParoQuant uses learned pairwise rotations to suppress weight outliers, closing the accuracy gap with FP16 while running at near-AWQ speed. Supports NVIDIA GPUs (vLLM, Transformers) and Apple Silicon (MLX).
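To build intuition for why pairwise rotations help, here is a toy NumPy sketch (an illustration, not ParoQuant's implementation): a single Givens rotation mixes an outlier-heavy channel with its partner, shrinking the dynamic range that symmetric INT4 must cover, and the rotation is undone exactly after dequantization because it is orthogonal. The angle here is hand-picked; ParoQuant learns its rotations.

```python
import numpy as np

def givens(theta):
    """2x2 Givens (pairwise) rotation."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def int4_roundtrip(x):
    """Symmetric INT4 quantize-dequantize with one scale per tensor."""
    scale = np.abs(x).max() / 7          # map the max magnitude to level 7
    return np.round(x / scale) * scale

# Two channels: the first carries outliers (~8x larger than the second).
w = np.array([[8.0, 0.5],
              [7.5, -0.5]])

R = givens(np.pi / 4)                    # illustrative fixed angle
err_plain = np.abs(int4_roundtrip(w) - w).max()
# Rotate, quantize, then undo the rotation (R is orthogonal: R @ R.T = I).
err_rot = np.abs(int4_roundtrip(w @ R) @ R.T - w).max()
```

Rotating first shrinks the worst-case round-trip error in this toy case (from 0.5 to about 0.39); the outlier's energy is spread across both channels, so the quantization grid is used more evenly.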

## Quick Start

### Interactive Chat

```bash
# NVIDIA GPU
pip install "paroquant[vllm]"
python -m paroquant.cli.chat --model z-lab/Qwen3-8B-PARO

# Apple Silicon
pip install "paroquant[mlx]"
python -m paroquant.cli.chat --model z-lab/Qwen3-8B-PARO
```

### OpenAI-Compatible API Server

```bash
pip install "paroquant[vllm]"
python -m paroquant.cli.serve --model z-lab/Qwen3-8B-PARO
```
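Once the server is up, any OpenAI-compatible client can talk to it. A minimal stdlib-only sketch, assuming the server listens on `localhost:8000` (the common default for OpenAI-compatible servers); the payload follows the standard chat-completions schema:

```python
import json
from urllib import request

def chat_request(base_url: str, model: str, prompt: str) -> request.Request:
    """Build (but do not send) an OpenAI-style chat completion request.
    Hypothetical helper for illustration."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

req = chat_request("http://localhost:8000", "z-lab/Qwen3-8B-PARO",
                   "Summarize INT4 quantization in one sentence.")
# To actually send it (requires the server to be running):
#   with request.urlopen(req) as r:
#       print(json.loads(r.read())["choices"][0]["message"]["content"])
```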

### Docker

```bash
# Interactive chat
docker run --pull=always --rm -it --gpus all --ipc=host \
  ghcr.io/z-lab/paroquant:chat --model z-lab/Qwen3-8B-PARO

# API server (port 8000)
docker run --pull=always --rm -it --gpus all --ipc=host -p 8000:8000 \
  ghcr.io/z-lab/paroquant:serve --model z-lab/Qwen3-8B-PARO
```

## Models

All models are available on Hugging Face. Swap the model name in the commands above to try any of them.

### Qwen3.5

| Model | Checkpoint |
|---|---|
| Qwen3.5-0.8B | `z-lab/Qwen3.5-0.8B-PARO` |
| Qwen3.5-2B | `z-lab/Qwen3.5-2B-PARO` |
| Qwen3.5-4B | `z-lab/Qwen3.5-4B-PARO` |
| Qwen3.5-9B | `z-lab/Qwen3.5-9B-PARO` |

### Qwen3

| Model | Checkpoint |
|---|---|
| Qwen3-0.6B | `z-lab/Qwen3-0.6B-PARO` |
| Qwen3-1.7B | `z-lab/Qwen3-1.7B-PARO` |
| Qwen3-4B | `z-lab/Qwen3-4B-PARO` |
| Qwen3-8B | `z-lab/Qwen3-8B-PARO` |
| Qwen3-14B | `z-lab/Qwen3-14B-PARO` |

### Llama

| Model | Checkpoint |
|---|---|
| Llama-2-7B | `z-lab/Llama-2-7b-hf-PARO` |
| Llama-3-8B | `z-lab/Meta-Llama-3-8B-PARO` |
| Llama-3.1-8B-Instruct | `z-lab/Llama-3.1-8B-Instruct-PARO` |

Want a model that's not listed? Open an issue and let us know.

## Reproduction

> [!NOTE]
> The `main` branch of this repository is under active development, and reproducibility is not guaranteed. Please use the `legacy` branch to reproduce results from the paper.

### Installation

```bash
git clone https://github.com/z-lab/paroquant && cd paroquant

pip install -e ".[vllm]"            # vLLM backend (GPU, recommended)
pip install -e ".[transformers]"    # Transformers backend (GPU)
pip install -e ".[mlx]"             # MLX backend (Apple Silicon)
pip install -e ".[optim,eval]"      # Optimization & evaluation
```

Or use Docker: `docker run -it --gpus all --ipc=host ghcr.io/z-lab/paroquant:latest`

### Quantize Your Own Model

```bash
# 1. Optimize rotation parameters
experiments/optimize/4bit.sh Qwen/Qwen3-8B

# 2. Export to HF checkpoint (--mode real for INT4, --mode pseudo for FP16)
python -m paroquant.cli.convert \
  --model Qwen/Qwen3-8B \
  --result-dir output/Qwen3-8B \
  --output-path models/Qwen3-8B-PARO
```
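To clarify what a pseudo-quantized (FP16) export means conceptually, here is a toy sketch: weights are kept in floating point but round-tripped through group-wise symmetric INT4. This is an illustration only, not ParoQuant's actual exporter; the group size and the symmetric level grid are assumptions for the sketch.

```python
import numpy as np

def fake_quant_int4(w: np.ndarray, group_size: int = 128) -> np.ndarray:
    """Group-wise symmetric INT4 quantize-dequantize (toy sketch)."""
    flat = w.reshape(-1, group_size)               # size must divide evenly
    scales = np.abs(flat).max(axis=1, keepdims=True) / 7
    scales = np.where(scales == 0, 1.0, scales)    # avoid divide-by-zero
    q = np.clip(np.round(flat / scales), -8, 7)    # INT4 integer range
    return (q * scales).reshape(w.shape)

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 256)).astype(np.float32)
w_hat = fake_quant_int4(w)                         # same shape, 16 levels/group
```

A real (INT4) export would instead store the integer codes `q` and the per-group `scales` in a packed layout for the quantized kernels.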

## Docker Images

| Image | Purpose |
|---|---|
| `ghcr.io/z-lab/paroquant:chat` | Interactive chat |
| `ghcr.io/z-lab/paroquant:chat-cu129` | Interactive chat (CUDA 12.9) |
| `ghcr.io/z-lab/paroquant:serve` | OpenAI-compatible API server |
| `ghcr.io/z-lab/paroquant:latest` | Optimization & evaluation |
| `ghcr.io/z-lab/paroquant:eval` | Reasoning task evaluation |

## Citation

```bibtex
@inproceedings{liang2026paroquant,
  title     = {{ParoQuant: Pairwise Rotation Quantization for Efficient Reasoning LLM Inference}},
  author    = {Liang, Yesheng and Chen, Haisheng and Zhang, Zihan and Han, Song and Liu, Zhijian},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year      = {2026}
}
```
