Code for the paper "Sycophancy Is Not One Thing: Causal Separation of Sycophantic Behaviors in LLMs".
Option 1: Using uv (recommended)

```bash
uv sync
```

Note: Modify the CUDA version in `pyproject.toml` if needed. The default configuration uses CUDA 12.6; update the `[[tool.uv.index]]` URL and torch source to match your CUDA version.
Option 2: Using pip

```bash
pip install -r requirements.txt
```

Note: Modify the CUDA version in `requirements.txt` if needed. The default PyTorch installation uses CUDA 12.6 (`torch==2.8.0+cu126`).
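For example, to target a different CUDA build (CUDA 12.8 here; which version tags exist depends on the PyTorch release, so check the PyTorch wheel index before pinning), the relevant lines in `requirements.txt` would look roughly like:

```
--extra-index-url https://download.pytorch.org/whl/cu128
torch==2.8.0+cu128
```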
Follow this order to run the complete pipeline:
1. Generate diffmean files and AUROC results:

   ```bash
   bash scripts/run_diffmean_analysis.sh
   ```

2. Train the praise detector model:

   ```bash
   bash scripts/run_roberta.sh
   ```

3. Run steering experiments:

   ```bash
   bash scripts/run_all_steering.sh
   ```
Optional: Generate diffmean vectors with subspace removal (as demonstrated in the paper):

```bash
bash scripts/run_nullspace_weight_gen.sh
```

This can be run after step 1 to generate modified diffmean vectors, which can then be used with `run_all_steering.sh` as usual.
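The steps above can also be driven from a small Python script; a minimal sketch (the `run_pipeline` helper is illustrative, not part of the repository, and it simply shells out to the scripts listed above):

```python
import pathlib
import subprocess

# Pipeline stages in dependency order (steps 1-3 above).
STEPS = [
    "scripts/run_diffmean_analysis.sh",  # 1. diffmean files + AUROC results
    "scripts/run_roberta.sh",            # 2. train the praise detector
    "scripts/run_all_steering.sh",       # 3. steering experiments
]

def run_pipeline(steps=STEPS):
    """Run each stage in order, stopping at the first failure."""
    for script in steps:
        print(f"running {script} ...")
        subprocess.run(["bash", script], check=True)

# Only run if invoked from the repository root, where the scripts exist.
if __name__ == "__main__" and all(pathlib.Path(s).exists() for s in STEPS):
    run_pipeline()
```

`check=True` makes a failing stage raise `CalledProcessError`, so later stages never run on stale intermediate files.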
- `scripts/run_diffmean_analysis.sh` - Run diffmean analysis and generate diffmean files
- `scripts/run_roberta.sh` - Train the RoBERTa praise detector
- `scripts/run_all_steering.sh` - Run the complete steering pipeline
- `scripts/run_nullspace_weight_gen.sh` - Generate diffmean vectors with subspace removal
The main components can be run individually:
- `src/praise_steering.py` - Praise-based steering experiments
- `src/truthful_steering.py` - Truthfulness steering experiments
- `src/cross_effects.py` - Cross-effects analysis
- `src/diffmean_analysis.py` - Diffmean nullspace projection analysis
- `src/praise_roberta.py` - RoBERTa praise detection model
The `data/` directory contains:

- `praise.json` - Praise detection training data
- `truthfulqa/` - TruthfulQA train/test split
  - `train.jsonl` - Training set
  - `test.jsonl` - Test set
- `factorial/` - Factorial experimental data
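The `.jsonl` splits follow the usual JSON Lines convention of one JSON object per line; a minimal loader sketch (the `load_jsonl` helper is ours for illustration, and the fields inside each record are not assumed):

```python
import json

def load_jsonl(path):
    """Read a JSON Lines file: one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Usage from the repository root, e.g.:
# train = load_jsonl("data/truthfulqa/train.jsonl")
```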
- Python ≥3.11
- CUDA-compatible GPU (recommended)