Intelligent LLM compression benchmarks. Less memory, same accuracy.
Tested on Qwen2.5-0.5B-Instruct — Tesla T4 GPU (Google Colab)
| Method | Memory | Speed | Accuracy | Verdict |
|---|---|---|---|---|
| Float16 (baseline) | 0.99 GB | 13.8 tok/s | ✅ Correct | Baseline |
| INT8 (bitsandbytes) | 0.63 GB | 6.7 tok/s | ❌ Hallucinations | ❌ Worse in every metric |
| NF4 + Double Quant | 0.45 GB | 12.4 tok/s | ✅ Correct | ✅ Best ratio |
NF4 reduces memory by 54% with no accuracy loss and near-identical speed.
Standard INT8 is both roughly 2× slower and less accurate in this test. That is why NF4 is the right baseline, and why smarter per-layer strategies can push even further.
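The NF4 + double quantization column corresponds to the standard bitsandbytes setup exposed through `transformers`. A minimal loading sketch, assuming the model from this benchmark; the compute dtype is our choice (T4 GPUs lack bfloat16 support), and the actual script may configure things differently:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# NF4 4-bit weights; double quantization also quantizes the quantization constants.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,  # T4 has no bfloat16 support
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
```

Loading requires a CUDA GPU and a model download; treat this as a configuration sketch rather than the benchmark's exact code.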
- Float16: "Miguel de Cervantes Saavedra" ✅
- INT8: "Miguel de Cervantes y Góngora" ❌ (hallucination)
- NF4: "Miguel de Cervantes Saavedra" ✅
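A correctness check like the one above can be sketched as a simple substring match. The function name and logic are our illustration, not necessarily what the benchmark script does:

```python
def answer_is_correct(output: str, expected: str) -> bool:
    """Check whether the model's output contains the expected answer (case-insensitive)."""
    return expected.lower() in output.lower()

# Float16 and NF4 keep the canonical surname; INT8 hallucinates a different one.
print(answer_is_correct("Miguel de Cervantes Saavedra", "Cervantes Saavedra"))   # True
print(answer_is_correct("Miguel de Cervantes y Góngora", "Cervantes Saavedra"))  # False
```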
Running a model that is 54% smaller means:
- Same GPU, double the capacity — fit 2 models where you ran 1
- ~50% less cloud cost — direct impact on inference bills
- No accuracy trade-off — users see no difference
- Near-identical throughput: 12.4 vs 13.8 tok/s is an imperceptible difference
At scale (1M requests/day), the difference between Float16 and NF4 is roughly $15,000–40,000/month in saved compute, depending on provider.
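The percentage claims follow directly from the table; a quick sanity check in plain Python (the dollar figure depends on provider pricing, which is not modeled here):

```python
# Numbers from the benchmark table above
fp16_gb, nf4_gb = 0.99, 0.45      # memory footprint
fp16_tps, nf4_tps = 13.8, 12.4    # tokens per second

mem_reduction = (fp16_gb - nf4_gb) / fp16_gb
speed_retained = nf4_tps / fp16_tps

print(f"memory reduction: {mem_reduction:.1%}")   # ~54.5%
print(f"speed retained:   {speed_retained:.1%}")  # ~89.9%
```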
- Open Google Colab → New notebook
- Runtime → Change runtime type → T4 GPU
- Copy and run `scripts/benchmark_nf4.py`
```bash
git clone https://github.com/YOUR_USERNAME/neuralzip
cd neuralzip
python -m venv env
source env/bin/activate  # Windows: env\Scripts\activate
pip install -r requirements.txt
python scripts/benchmark_nf4.py
```

Requirements: Python 3.9+, 8 GB RAM minimum, NVIDIA GPU recommended.
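The tok/s figures reported above come from timing generation. A minimal, model-agnostic sketch of that measurement; the helper name is ours, and the actual script may time things differently:

```python
import time

def measure_tokens_per_second(generate_fn, n_new_tokens: int) -> float:
    """Time one generation call and return throughput in tokens/second."""
    start = time.perf_counter()
    generate_fn()  # e.g. model.generate(**inputs, max_new_tokens=n_new_tokens)
    elapsed = time.perf_counter() - start
    return n_new_tokens / elapsed

# Stand-in for a real model call, so the sketch runs anywhere:
tps = measure_tokens_per_second(lambda: time.sleep(0.05), n_new_tokens=100)
print(f"{tps:.1f} tok/s")
```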
```
neuralzip/
├── scripts/
│   ├── benchmark_baseline.py   # INT8 vs Float16 (v0.1)
│   └── benchmark_nf4.py        # NF4 vs INT8 vs Float16 (v0.2) ← current
├── results/
│   └── t4_qwen05b.md           # Raw benchmark output
├── requirements.txt
├── LICENSE
└── README.md
```
v0.1 ✅ INT8 quantization baseline
v0.2 ✅ NF4 4-bit with double quantization ← you are here
v0.3 🔧 Adaptive per-layer thresholds (beat NF4 standard)
v0.4 📋 Domain-specific calibration data
v0.5 📋 Structured pruning (attention head scoring)
v1.0 📋 Full pipeline: Quantization + Pruning + Distillation
| Hardware | Float16 | INT8 | NF4 |
|---|---|---|---|
| Tesla T4 (Colab) | 13.8 tok/s | 6.7 tok/s | 12.4 tok/s |
| CPU only | 6.3 tok/s | 0.8 tok/s | — |
More hardware results welcome via PR.
This project is in its early stages. If you run the benchmarks on different hardware
or models, add a file under results/ and open a PR. All results are welcome.
MIT License — see LICENSE
NeuralZip — Making AI inference cheaper, one layer at a time.