The LION-series models are trained with an empirically optimized pipeline that consists of three stages: SFT, DPO, and online preference learning (online DPO). We find that simple techniques, such as sequence packing and loss masking in SFT, increasing the preference dataset size in DPO, and online DPO training, can significantly improve the performance of language models. Our best models (the LION-series) exceed the performance of the official instruct models tuned with closed-source data and algorithms.
We fine-tune the base Gemma-2b and LLaMA-3-8b models, and we release all models and intermediate checkpoints for reproducibility.
| Model | Method | Size | Arena-Hard | AlpacaEval-2 | MT-Bench | OpenLLM |
|---|---|---|---|---|---|---|
| Gemma-2b | - | 2B | - | - | - | 46.69 |
| Gemma-2b-it | SFT+RLHF | 2B | 3.4 | 5.44 | 5.63 | 42.75 |
| Gemma-2b-zephyr | SFT+DPO | 2B | 0.9 | 2.65 | 4.13 | 46.92 |
| LLaMA-2-7b-chat | SFT | 7B | 4.6 | 5.35 | 6.22 | 53.16 |
| Vicuna-7b-v1.5 | SFT | 7B | 2.5 | 7.62 | 6.57 | 52.06 |
| LION-Gemma-2b-sft-v1.0 (ours) | SFT | 2B | 2.4 | 7.79 | 6.37 | 54.78 |
| LION-Gemma-2b-dpo-v1.0 (ours) | SFT+DPO | 2B | 4.6 | 8.75 | 6.58 | 55.35 |
| LION-Gemma-2b-odpo-v1.0 (ours) | SFT+DPO+ODPO | 2B | 5.0 | 9.57 | 6.75 | 55.98 |
| Model | Method | Size | Arena-Hard | AlpacaEval-2 | MT-Bench | OpenLLM |
|---|---|---|---|---|---|---|
| LLaMA-3-8b | - | 8B | - | - | - | 63.05 |
| LLaMA-3-8b-it | SFT+RS+DPO+PPO | 8B | 20.6 | 22.9 | 8.00 | 68.28 |
| LION-LLaMA-3-8b-sft-v1.0 (ours) | SFT | 8B | 11.3 | 17.9 | 7.58 | 68.71 |
| LION-LLaMA-3-8b-dpo-v1.0 (ours) | SFT+DPO | 8B | 19.1 | 21.8 | 8.12 | 71.28 |
| LION-LLaMA-3-8b-odpo-v1.0 (ours) | SFT+DPO+ODPO | 8B | 22.0 | 26.8 | 8.19 | 71.41 |
We recommend using Python 3.10 as the environment. To install the required dependencies, run:
```bash
pip install -r requirements.txt
```

You will also need to install Flash Attention 2 by running:
```bash
python -m pip install flash-attn --no-build-isolation
```

Training requires (at most) 4x A100 80GB GPUs. Please adjust the batch size and gradient accumulation steps if you have a different system.
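For instance, the effective batch size is the per-device batch size times the gradient accumulation steps times the number of GPUs, so halving the GPU count can be offset by doubling the accumulation steps. A minimal sanity-check sketch (the specific numbers below are illustrative, not the values used in our configs):

```python
# Illustrative only: the per-device batch size and accumulation steps below are
# example numbers, not the values from this repo's training configs.
def effective_batch_size(per_device_batch_size: int, grad_accum_steps: int, num_gpus: int) -> int:
    return per_device_batch_size * grad_accum_steps * num_gpus

# Reference setup: 4x A100 80GB.
assert effective_batch_size(4, 8, 4) == 128
# On 2 GPUs, doubling gradient accumulation keeps the effective batch size unchanged.
assert effective_batch_size(4, 16, 2) == 128
```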
For SFT training, refer to the following script for an example of training LION-LLaMA-3-8b-sft-v1.0:

```bash
./examples/llama-3/sft.sh
```
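As a rough illustration of the loss masking mentioned above (a minimal sketch of the general technique, not this repo's preprocessing code; `build_sft_example` is a hypothetical helper): prompt tokens are labeled with `-100` so the cross-entropy loss is computed only on the response tokens.

```python
import torch

IGNORE_INDEX = -100  # labels with this value are ignored by PyTorch's cross-entropy loss

def build_sft_example(prompt_ids: list[int], response_ids: list[int]) -> dict:
    """Hypothetical helper: supervise only the response tokens by masking the prompt."""
    input_ids = prompt_ids + response_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids
    return {
        "input_ids": torch.tensor(input_ids),
        "labels": torch.tensor(labels),
    }

# The four prompt tokens contribute nothing to the loss; only the response tokens do.
example = build_sft_example(prompt_ids=[1, 15, 27, 9], response_ids=[42, 7, 2])
print(example["labels"])  # tensor([-100, -100, -100, -100,   42,    7,    2])
```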
To speed up DPO training, we 1) pre-compute the reference logits and 2) run the DPO trainer without initializing any reference model. We find this reduces both training time and memory usage!
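As a rough sketch of why this works (our own minimal illustration, not the repo's trainer code): the reference model is run once offline to store per-sequence log-probabilities for the chosen and rejected responses, and the DPO loss then reads those cached values, so no reference model has to be loaded during training.

```python
import torch
import torch.nn.functional as F

def dpo_loss_precomputed_ref(
    policy_chosen_logps: torch.Tensor,    # log pi_theta(y_chosen | x), shape (batch,)
    policy_rejected_logps: torch.Tensor,  # log pi_theta(y_rejected | x), shape (batch,)
    ref_chosen_logps: torch.Tensor,       # cached log pi_ref(y_chosen | x), computed offline
    ref_rejected_logps: torch.Tensor,     # cached log pi_ref(y_rejected | x), computed offline
    beta: float = 0.1,
) -> torch.Tensor:
    """Standard DPO objective; the reference terms come from a one-time offline pass
    over the dataset instead of a live reference-model forward pass."""
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_logratios - rejected_logratios)).mean()
```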
As an example of training LION-LLaMA-3-8b-dpo-v1.0, you can use the following script:
```bash
./examples/llama-3/dpo.sh
```

Online DPO is split into two parts: 1) generate online preference data and 2) train the model with DPO using the online data.
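A minimal sketch of the first part (illustrative only; `policy_generate` and `score` are hypothetical stand-ins for the actual generation and ranking code): sample several responses from the current policy for each prompt, score them, and keep the best and worst as the chosen/rejected pair.

```python
def build_online_preference_pair(prompt, policy_generate, score, num_samples: int = 8) -> dict:
    """Sample candidates from the current policy and turn them into a DPO preference pair.

    `policy_generate(prompt)` and `score(prompt, response)` are hypothetical callables;
    the scoring could come from a reward model or an LLM judge.
    """
    candidates = [policy_generate(prompt) for _ in range(num_samples)]
    ranked = sorted(candidates, key=lambda response: score(prompt, response))
    return {"prompt": prompt, "chosen": ranked[-1], "rejected": ranked[0]}
```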
As an example, you can train LION-LLaMA-3-8b-odpo-v1.0 with the following script:

```bash
./examples/llama-3/odpo.sh
```

We provide an easy all-in-one evaluation pipeline in this directory for popular benchmarks, including:
- MT-Bench
- Alpaca Eval 2.0
- Arena Hard Auto
- Huggingface OpenLLM 1.0
- Huggingface OpenLLM 2.0
For each benchmark, we additionally include the following features:
- automatically hosting your model on an `sglang` server for faster evaluation (except for the OpenLLM evals)
- support for multiple runs to compute the average performance (`--num_runs`)
- logging performance to `wandb` (`--to_wandb`)
For setup and run details, please refer to the `test` directory.
If you find our repo useful, please consider citing it as follows:
```bibtex
@misc{yu2024lionsempiricallyoptimizedapproach,
      title={LIONs: An Empirically Optimized Approach to Align Language Models},
      author={Xiao Yu and Qingyang Wu and Yu Li and Zhou Yu},
      year={2024},
      eprint={2407.06542},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2407.06542},
}
```