Disentangling the Origins of Cognitive Biases in Language Models
This repository contains the code for our paper *Planted in Pretraining, Swayed by Finetuning: A Case Study on the Origins of Cognitive Biases in LLMs* (CoLM 2025).
We investigate the origins of cognitive biases in large language models (LLMs). Prior work showed that these biases emerge, and even intensify, after instruction tuning, but it remains unclear whether they are caused by the pretraining backbone, the finetuning data, or training randomness.
We propose a two-step causal analysis framework:
- First, we quantify how much random-seed fluctuations alone affect bias scores.
- Second, we perform cross-tuning: swapping instruction datasets between pretrained models to determine whether biases are driven by the pretraining backbone or by the finetuning data.
Our results show that:
- Training randomness introduces some noise in bias scores.
- However, pretraining consistently dominates as the primary source of biases, with instruction tuning playing a secondary role.
This repository integrates and builds on three main sub-repositories:
- 📦 open-instruct: Parameter-efficient LoRA finetuning framework.
- 📊 instructed-to-bias: Evaluation for belief and certainty biases.
- 🧠 cognitive-biases-in-llms: Benchmark suite for 30 cognitive biases.
Refer to those repositories for dataset structures, implementation details, and original evaluation scripts.
All trained models across seeds and the subsampled Flan instruction dataset are hosted on Hugging Face:
🤗 Hugging Face Collection: planted_in_pretraining
We recommend setting up with conda:
```bash
conda create -n bias_origin python=3.10 -y
conda activate bias_origin
pip install -r requirements.txt

# Optional: install submodules in editable mode
git clone https://github.com/allenai/open-instruct.git
pip install -e open-instruct/

git clone https://github.com/itay1itzhak/InstructedToBias.git
pip install -e InstructedToBias/

git clone https://github.com/simonmalberg/cognitive-biases-in-llms.git
pip install -e cognitive-biases-in-llms/
```

What it checks:
This experiment finetunes the same model on the same dataset with several different random seeds to test how much training randomness affects bias scores.
```bash
python run_randomness_analysis.py --granularity-levels model_bias
```

What we found:
Randomness introduces minor fluctuations in individual bias scores, but averaging across seeds recovers stable patterns. This suggests randomness alone is not a primary driver of cognitive bias.
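The seed-averaging step can be illustrated with a minimal sketch. The dictionary layout and all score values below are made up for illustration and do not reflect the repository's actual data format or results:

```python
import statistics

# Illustrative only: per-seed bias scores for one finetuned model.
# Keys are example bias names; each list holds one score per random seed.
scores_by_seed = {
    "belief_bias":    [0.62, 0.58, 0.65],
    "certainty_bias": [0.41, 0.44, 0.39],
}

# Averaging across seeds smooths out per-seed noise; the standard
# deviation indicates how much randomness alone moves each score.
for bias, scores in scores_by_seed.items():
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)
    print(f"{bias}: mean={mean:.3f} std={std:.3f}")
```

In this toy setup, a small standard deviation relative to the mean is what "randomness introduces only minor fluctuations" would look like numerically.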
What it checks:
This analysis swaps instruction datasets between two pretrained models (e.g., Flan vs Tulu) and compares their bias vectors. We cluster models either by pretraining backbone or instruction data.
```bash
python run_similarity_analysis.py \
    --granularity-levels model_bias_scenario \
    --models-to-include T5,OLMo
```

What we found:
Models cluster strongly by pretraining identity. Even after swapping instruction data, bias patterns remain closer to the original backbone than to the new data. This supports our main claim: biases are planted during pretraining.
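The clustering intuition can be sketched with cosine similarity between bias vectors. The vectors below are made-up toy values (one entry per bias), and the comparison logic is a simplification of the actual analysis; only the model/dataset names follow the ones mentioned above:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Illustrative bias vectors (values are invented for this sketch).
vectors = {
    "T5-flan":   [0.60, 0.20, 0.80],
    "T5-tulu":   [0.55, 0.25, 0.75],  # same backbone, swapped instruction data
    "OLMo-flan": [0.10, 0.70, 0.30],  # same instruction data, different backbone
}

same_backbone = cosine(vectors["T5-flan"], vectors["T5-tulu"])
same_data = cosine(vectors["T5-flan"], vectors["OLMo-flan"])
print(f"same backbone: {same_backbone:.3f}, same data: {same_data:.3f}")
```

If biases are planted in pretraining, the same-backbone pair should be more similar than the same-data pair, as in this toy example.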
Example output plots are saved as PDFs to plots/.
To cite our work, use the CoLM BibTeX entry from Google Scholar or:
```bibtex
@misc{itzhak2025plantedpretrainingswayedfinetuning,
  title={Planted in Pretraining, Swayed by Finetuning: A Case Study on the Origins of Cognitive Biases in LLMs},
  author={Itay Itzhak and Yonatan Belinkov and Gabriel Stanovsky},
  year={2025},
  eprint={2507.07186},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2507.07186},
}
```

MIT License, Copyright (c) 2025 Itay Itzhak
For questions or collaborations, please reach out via GitHub Issues or email:
📧 itay1itzhak@gmail.com


