This repository contains a series of Jupyter notebooks demonstrating various quantization and optimization techniques for NLP and vision models. These notebooks provide practical implementations of state-of-the-art methods for model compression and efficient inference.
- `01_large_language_model_optimization.ipynb`: Optimizing a large language model for low-latency inference.
- `02_vision_transformer_edge_optimization.ipynb`: Fine-tuning and quantizing a Vision Transformer for edge devices.
- `03_bert_question_answering_quantization.ipynb`: Quantizing a BERT-based model for question answering tasks.
- `04_multitask_nlp_quantization.ipynb`: Transfer learning and quantization for multi-task NLP.
- Mixed precision training
- Post-training quantization (PTQ)
- Quantization-aware fine-tuning (QAF)
- Dynamic quantization
- Pruning techniques
- Layer fusion
- Efficient attention mechanisms
- Knowledge distillation
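To give a flavor of what post-training quantization does under the hood, here is a minimal sketch of asymmetric (affine) int8 quantization in plain Python. The function names are illustrative, not taken from the notebooks, and real frameworks handle this per-tensor or per-channel with calibration data:

```python
# Minimal sketch of asymmetric (affine) post-training quantization.
# Illustrative only -- frameworks like PyTorch do this per-channel
# with calibration, but the core arithmetic looks like this.

def quantize_params(values, num_bits=8):
    """Compute scale and zero-point so the float range maps onto [0, 2^bits - 1]."""
    qmin, qmax = 0, 2 ** num_bits - 1
    lo, hi = min(values), max(values)
    lo, hi = min(lo, 0.0), max(hi, 0.0)  # the representable range must include 0
    scale = (hi - lo) / (qmax - qmin) or 1.0  # avoid zero scale for constant inputs
    zero_point = int(round(qmin - lo / scale))
    return scale, zero_point

def quantize(values, scale, zero_point, num_bits=8):
    """Map floats to clamped integer codes."""
    qmin, qmax = 0, 2 ** num_bits - 1
    return [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]

def dequantize(qvalues, scale, zero_point):
    """Recover approximate floats from integer codes."""
    return [(q - zero_point) * scale for q in qvalues]

weights = [-1.5, -0.3, 0.0, 0.4, 1.2]
scale, zp = quantize_params(weights)
codes = quantize(weights, scale, zp)
recovered = dequantize(codes, scale, zp)
```

The round trip loses at most about half a quantization step per value, which is the error the PTQ notebooks measure end-to-end on task accuracy.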
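Knowledge distillation pairs naturally with quantization: a small quantized student is trained to match a full-precision teacher's softened output distribution. A minimal sketch of the standard distillation loss (temperature-scaled KL divergence), in plain Python with hypothetical logit values:

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-softened softmax over a list of logits."""
    m = max(l / temperature for l in logits)  # subtract max for numerical stability
    exps = [math.exp(l / temperature - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) over temperature-softened distributions,
    scaled by T^2 so gradients keep a consistent magnitude across temperatures."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return kl * temperature ** 2
```

In practice this term is combined with the ordinary cross-entropy loss on the hard labels, weighted by a mixing coefficient.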
Check out my cheatsheet called "Quantization and Precision Tuning for Optimization" for some more info! Feel free to share :)
- Clone this repository:

  ```bash
  git clone https://github.com/ethanshebley/quantization-recipes.git
  cd quantization-recipes
  ```

- Install the required packages:

  ```bash
  pip install -r requirements.txt
  ```

- Open the Jupyter notebooks and run them!
Contributions are welcome! Please feel free to submit a pull request.