Intelligent LLM compression benchmarks. Less memory, same accuracy.
Tested on Qwen2.5-0.5B-Instruct — Tesla T4 GPU (Google Colab)
| Method | Memory | Speed | Accuracy | Verdict |
|---|---|---|---|---|
| Float16 (baseline) | 0.99 GB | 13.8 tok/s | ✅ Correct | Baseline |
| INT8 (bitsandbytes) | 0.63 GB | 6.7 tok/s | ❌ Hallucinations | ❌ Worse in every metric |
| NF4 + Double Quant | 0.45 GB | 12.4 tok/s | ✅ Correct | ✅ Best ratio |
NF4 reduces memory by 54% with no accuracy loss and near-identical speed.
Standard INT8 is both roughly 2× slower and less accurate in this test. That is why NF4 is the right baseline, and why smarter per-layer strategies can push even further.
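The NF4 + double quantization column corresponds to the standard bitsandbytes setup exposed through `transformers`. A minimal loading sketch, assuming the model from this benchmark; the compute dtype is our choice (T4 GPUs lack bfloat16 support), and the actual script may configure things differently:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# NF4 4-bit weights; double quantization also quantizes the quantization constants.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,  # T4 has no bfloat16 support
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
```

Loading requires a CUDA GPU and a model download; treat this as a configuration sketch rather than the benchmark's exact code.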
- Float16: "Miguel de Cervantes Saavedra" ✅
- INT8: "Miguel de Cervantes y Góngora" ❌ (hallucination)
- NF4: "Miguel de Cervantes Saavedra" ✅
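A correctness check like the one above can be sketched as a simple substring match. The function name and logic are our illustration, not necessarily what the benchmark script does:

```python
def answer_is_correct(output: str, expected: str) -> bool:
    """Check whether the model's output contains the expected answer (case-insensitive)."""
    return expected.lower() in output.lower()

# Float16 and NF4 keep the canonical surname; INT8 hallucinates a different one.
print(answer_is_correct("Miguel de Cervantes Saavedra", "Cervantes Saavedra"))   # True
print(answer_is_correct("Miguel de Cervantes y Góngora", "Cervantes Saavedra"))  # False
```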
Running a model that is 54% smaller means:
- Same GPU, double the capacity — fit 2 models where you ran 1
- ~50% less cloud cost — direct impact on inference bills
- No accuracy trade-off — users see no difference
- Near-identical throughput: 12.4 vs 13.8 tok/s is an imperceptible difference
At scale (1M requests/day), the difference between Float16 and NF4 is roughly $15,000–40,000/month in saved compute, depending on provider.
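The percentage claims follow directly from the table; a quick sanity check in plain Python (the dollar figure depends on provider pricing, which is not modeled here):

```python
# Numbers from the benchmark table above
fp16_gb, nf4_gb = 0.99, 0.45      # memory footprint
fp16_tps, nf4_tps = 13.8, 12.4    # tokens per second

mem_reduction = (fp16_gb - nf4_gb) / fp16_gb
speed_retained = nf4_tps / fp16_tps

print(f"memory reduction: {mem_reduction:.1%}")   # ~54.5%
print(f"speed retained:   {speed_retained:.1%}")  # ~89.9%
```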
- Open Google Colab → New notebook
- Runtime → Change runtime type → T4 GPU
- Copy and run `scripts/benchmark_nf4.py`
```bash
git clone https://github.com/YOUR_USERNAME/neuralzip
cd neuralzip
python -m venv env
source env/bin/activate  # Windows: env\Scripts\activate
pip install -r requirements.txt
python scripts/benchmark_nf4.py
```

Requirements: Python 3.9+, 8 GB RAM minimum, NVIDIA GPU recommended.
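The tok/s figures reported above come from timing generation. A minimal, model-agnostic sketch of that measurement; the helper name is ours, and the actual script may time things differently:

```python
import time

def measure_tokens_per_second(generate_fn, n_new_tokens: int) -> float:
    """Time one generation call and return throughput in tokens/second."""
    start = time.perf_counter()
    generate_fn()  # e.g. model.generate(**inputs, max_new_tokens=n_new_tokens)
    elapsed = time.perf_counter() - start
    return n_new_tokens / elapsed

# Stand-in for a real model call, so the sketch runs anywhere:
tps = measure_tokens_per_second(lambda: time.sleep(0.05), n_new_tokens=100)
print(f"{tps:.1f} tok/s")
```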
```
neuralzip/
├── scripts/
│   ├── benchmark_baseline.py   # INT8 vs Float16 (v0.1)
│   └── benchmark_nf4.py        # NF4 vs INT8 vs Float16 (v0.2) ← current
├── results/
│   └── t4_qwen05b.md           # Raw benchmark output
├── requirements.txt
├── LICENSE
└── README.md
```
v0.1 ✅ INT8 quantization baseline
v0.2 ✅ NF4 4-bit with double quantization ← you are here
v0.3 🔧 Adaptive per-layer thresholds (beat NF4 standard)
v0.4 📋 Domain-specific calibration data
v0.5 📋 Structured pruning (attention head scoring)
v1.0 📋 Full pipeline: Quantization + Pruning + Distillation
| Hardware | Float16 | INT8 | NF4 |
|---|---|---|---|
| Tesla T4 (Colab) | 13.8 tok/s | 6.7 tok/s | 12.4 tok/s |
| CPU only | 6.3 tok/s | 0.8 tok/s | — |
More hardware results welcome via PR.
This project is in its early stages. If you run the benchmarks on different hardware
or models, add a file under results/ and open a PR. All results are welcome.
MIT License — see LICENSE
NeuralZip — Making AI inference cheaper, one layer at a time.