
Planted in Pretraining, Swayed by Finetuning

Disentangling the Origins of Cognitive Biases in Language Models.


📘 Introduction

This repository contains the code for our paper:

Planted in Pretraining, Swayed by Finetuning: A Case Study on the Origins of Cognitive Biases in LLMs

We investigate the origin of cognitive biases in large language models (LLMs). While prior work showed these biases emerge and even intensify after instruction tuning, it's unclear whether they are caused by pretraining, finetuning data, or training randomness.

We propose a two-step causal analysis framework:

  • First, we assess how much random seed fluctuations affect bias scores.
  • Second, we perform cross-tuning: swapping instruction datasets between pretrained models to identify if biases are driven by the pretraining backbone or the finetuning data.

Our results show that:

  • Training randomness introduces some noise in bias scores.
  • However, pretraining consistently dominates as the primary source of biases, with instruction tuning playing a secondary role.

🧭 Repository Structure

This repository integrates and builds on three main sub-repositories:

  • open-instruct (allenai/open-instruct)
  • InstructedToBias (itay1itzhak/InstructedToBias)
  • cognitive-biases-in-llms (simonmalberg/cognitive-biases-in-llms)

Refer to those repositories for dataset structures, implementation details, and original evaluation scripts.


🔗 Model & Dataset Access

All trained models across seeds and the subsampled Flan instruction dataset are hosted on Hugging Face:

🤗 Hugging Face Collection: planted_in_pretraining
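
To sanity-check access, you can load one of the checkpoints with transformers. Below is a minimal sketch assuming a T5-based (seq2seq) checkpoint; the repo id is a placeholder, so check the collection page for the exact model names.

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Placeholder repo id; see the Hugging Face collection for the real checkpoint names.
# OLMo-based checkpoints are decoder-only and would use AutoModelForCausalLM instead.
model_id = "itay1itzhak/T5-Tulu-seed-0"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

inputs = tokenizer("Would you prefer a sure gain of $50 or a 50% chance to win $100?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))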


⚙️ Environment Setup

We recommend setting up with conda:

conda create -n bias_origin python=3.10 -y
conda activate bias_origin
pip install -r requirements.txt

# Optional: install submodules in editable mode
git clone https://github.com/allenai/open-instruct.git
pip install -e open-instruct/

git clone https://github.com/itay1itzhak/InstructedToBias.git
pip install -e InstructedToBias/

git clone https://github.com/simonmalberg/cognitive-biases-in-llms.git
pip install -e cognitive-biases-in-llms/

🚀 Key Analyses

🎲 Step 1: Training Randomness Analysis

What it checks:
This experiment finetunes the same model on the same dataset with multiple random seeds to test how much training randomness affects bias scores.

python run_randomness_analysis.py --granularity-levels model_bias

What we found:
Randomness introduces minor fluctuations in individual bias scores, but averaging across seeds recovers stable patterns. This suggests randomness alone is not a primary driver of cognitive bias.
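
As a rough illustration of the aggregation step, the sketch below averages per-seed bias scores and reports their spread. The CSV path and column names are assumptions made for illustration, not the script's actual output format.

import pandas as pd

# Hypothetical per-seed results file with columns: model, seed, bias, score
df = pd.read_csv("results/bias_scores_per_seed.csv")

# Mean and standard deviation of each bias score across seeds
summary = (
    df.groupby(["model", "bias"])["score"]
      .agg(["mean", "std"])
      .reset_index()
)
print(summary)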


🔁 Step 2: Cross-Tuning Clustering Analysis

What it checks:
This analysis swaps instruction datasets (e.g., Flan vs. Tulu) between two pretrained models and compares their bias vectors, then clusters the models either by pretraining backbone or by instruction data.

python run_similarity_analysis.py \
    --granularity-levels model_bias_scenario \
    --models-to-include T5,OLMo

What we found:
Models cluster strongly by pretraining identity. Even after swapping instruction data, bias patterns remain closer to the original backbone than to the new data. This supports our main claim: biases are planted during pretraining.
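
To illustrate the clustering idea, here is a minimal sketch that runs hierarchical clustering over per-model bias vectors. The vectors and labels are toy placeholders; in the actual analysis they come from the similarity script's output.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Rows are models (backbone x instruction data), columns are bias scores (toy values)
labels = ["T5-Flan", "T5-Tulu", "OLMo-Flan", "OLMo-Tulu"]
bias_vectors = np.array([
    [0.62, 0.10, 0.45],
    [0.58, 0.12, 0.41],
    [0.20, 0.55, 0.05],
    [0.24, 0.52, 0.08],
])

# Average-linkage clustering with cosine distance, cut into two clusters
Z = linkage(bias_vectors, method="average", metric="cosine")
clusters = fcluster(Z, t=2, criterion="maxclust")
print(dict(zip(labels, clusters)))  # in this toy setup, models pair up by backbone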


📊 Visual Outputs

Example outputs (PDFs saved to plots/):

  • Randomness plot
  • Cross-tuning PCA

📚 Citation

To cite our work, use the CoLM BibTeX entry from Google Scholar or the arXiv entry below:

@misc{itzhak2025plantedpretrainingswayedfinetuning,
      title={Planted in Pretraining, Swayed by Finetuning: A Case Study on the Origins of Cognitive Biases in LLMs}, 
      author={Itay Itzhak and Yonatan Belinkov and Gabriel Stanovsky},
      year={2025},
      eprint={2507.07186},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2507.07186}, 
}

📜 License

MIT License, Copyright (c) 2025 Itay Itzhak


📬 Contact

For questions or collaborations, please reach out via GitHub Issues or email:
📧 itay1itzhak@gmail.com
