Code for the paper: "Sycophancy Is Not One Thing: Causal Separation of Sycophantic Behaviors in LLMs".

Installation

Option 1: Using uv (recommended)

uv sync

Note: Modify the CUDA version in pyproject.toml if needed. The default configuration uses CUDA 12.6. Update the [[tool.uv.index]] URL and torch source to match your CUDA version.

Option 2: Using pip

pip install -r requirements.txt

Note: Modify the CUDA version in requirements.txt if needed. The default PyTorch installation uses CUDA 12.6 (torch==2.8.0+cu126).
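Before editing the pin, it can help to confirm which CUDA build a version string targets. The sketch below reads the tag from the local-version suffix of a PyTorch version string. The helper name `cuda_tag` is hypothetical and not part of this repository.

```python
import re
from typing import Optional


def cuda_tag(version: str) -> Optional[str]:
    """Return the CUDA tag (for example 'cu126') from a PyTorch version string.

    Returns None when the string has no '+cuNNN' suffix, such as a CPU-only build.
    """
    match = re.search(r"\+(cu\d+)$", version)
    return match.group(1) if match else None


print(cuda_tag("2.8.0+cu126"))  # → cu126 (the default pin in requirements.txt)
```

After installation, the same check can be applied to `torch.__version__`, and `torch.cuda.is_available()` reports whether a GPU is actually usable.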

Usage

Step-by-Step Guide

Follow this order to run the complete pipeline:

1. Generate diffmean files and AUROC results:
```
bash scripts/run_diffmean_analysis.sh
```
2. Train the praise detector model:
```
bash scripts/run_roberta.sh
```
3. Run steering experiments:
```
bash scripts/run_all_steering.sh
```
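The steps above can also be chained so that a failure in one step stops the rest. The following is a minimal sketch, not a script shipped with this repository: the script paths come from the steps above, while `run_pipeline` and its `runner` parameter are hypothetical names introduced for illustration.

```python
import subprocess
from typing import Callable, List, Sequence

# Script paths are taken from the step-by-step guide; the order matters.
PIPELINE = (
    "scripts/run_diffmean_analysis.sh",
    "scripts/run_roberta.sh",
    "scripts/run_all_steering.sh",
)


def run_pipeline(
    steps: Sequence[str] = PIPELINE,
    runner: Callable[..., object] = subprocess.run,
) -> List[str]:
    """Run each script with bash, in order, and return the scripts that finished."""
    completed: List[str] = []
    for script in steps:
        # check=True makes subprocess.run raise CalledProcessError on a
        # non-zero exit code, so later steps never run on stale outputs.
        runner(["bash", script], check=True)
        completed.append(script)
    return completed
```

Calling `run_pipeline()` from the repository root runs all three steps. The `runner` argument exists only so the order can be exercised without launching the real scripts.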

Optional: Generate diffmean vectors with subspace removal (as demonstrated in the paper):

bash scripts/run_nullspace_weight_gen.sh

Run this after step 1 (the diffmean analysis) to generate modified diffmean vectors, which can then be used with run_all_steering.sh as usual.
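The repository's scripts are the reference for how the paper does this. As a rough illustration of the idea only, the sketch below assumes that a diffmean vector is the difference between the mean activations of two groups of examples, and that subspace removal means projecting out an unwanted direction. It uses toy two-dimensional vectors and handles a single direction; all function names are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def mean(rows: Sequence[Sequence[float]]) -> Vector:
    """Column-wise mean of a non-empty list of equal-length vectors."""
    n = len(rows)
    return [sum(column) / n for column in zip(*rows)]


def diffmean(pos: Sequence[Sequence[float]], neg: Sequence[Sequence[float]]) -> Vector:
    """Mean of the positive group minus mean of the negative group."""
    return [a - b for a, b in zip(mean(pos), mean(neg))]


def remove_direction(v: Sequence[float], u: Sequence[float]) -> Vector:
    """Project v onto the orthogonal complement of u (u need not be unit length)."""
    uu = sum(x * x for x in u)
    if uu == 0:
        return list(v)  # nothing to remove for a zero direction
    coef = sum(a * b for a, b in zip(v, u)) / uu
    return [a - coef * b for a, b in zip(v, u)]


# Toy activations (assumed data, not taken from the repository).
pos = [[2.0, 1.0], [4.0, 3.0]]
neg = [[1.0, 1.0], [1.0, 1.0]]
v = diffmean(pos, neg)
print(v)                                # → [2.0, 1.0]
print(remove_direction(v, [1.0, 0.0]))  # → [0.0, 1.0]
```

Removing a subspace spanned by several directions would require an orthonormal basis for that subspace; applying `remove_direction` repeatedly is only equivalent when the directions are mutually orthogonal.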

Scripts Overview

- scripts/run_diffmean_analysis.sh - Run diffmean analysis and generate diffmean files
- scripts/run_roberta.sh - Train RoBERTa praise detector
- scripts/run_all_steering.sh - Run complete steering pipeline
- scripts/run_nullspace_weight_gen.sh - Generate diffmean vectors with subspace removal
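The diffmean analysis step reports AUROC results. As a reminder of what that metric measures, here is a small dependency-free sketch: AUROC is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. Which scores and labels the repository feeds into its AUROC computation is not stated here, so the inputs below are assumed toy values and `auroc` is a hypothetical helper.

```python
from typing import Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Pairwise AUROC for binary labels (1 = positive, 0 = negative)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


print(auroc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))  # → 0.75
```

The pairwise form is quadratic in the number of examples, which is fine for a sanity check but slow for large evaluation sets.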

Individual Components

The main components can be run individually:

- src/praise_steering.py - Praise-based steering experiments
- src/truthful_steering.py - Truthfulness steering experiments
- src/cross_effects.py - Cross-effects analysis
- src/diffmean_analysis.py - Diffmean nullspace projection analysis
- src/praise_roberta.py - RoBERTa praise detection model
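For readers new to steering, the general idea behind experiments of this kind can be sketched in a few lines. This assumes the common formulation in which a scaled direction vector is added to a hidden activation; it is not a description of what src/praise_steering.py or src/truthful_steering.py do internally, and `steer`, `hidden`, `direction`, and `alpha` are hypothetical names.

```python
from typing import List, Sequence


def steer(hidden: Sequence[float], direction: Sequence[float], alpha: float) -> List[float]:
    """Return hidden + alpha * direction, element by element."""
    if len(hidden) != len(direction):
        raise ValueError("hidden and direction must have the same length")
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy values: a positive alpha pushes along the direction, a negative one against it.
print(steer([1.0, 2.0], [0.5, -1.0], 2.0))  # → [2.0, 0.0]
```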

Data

The data/ directory contains:

- praise.json - Praise detection training data
- truthfulqa/ - TruthfulQA train/test split
  - train.jsonl - Training set
  - test.jsonl - Test set
- factorial/ - Factorial experimental data
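To inspect the TruthfulQA split before running the pipeline, a JSON Lines file can be read with the standard library alone. This sketch makes no assumption about the field names inside each record; `read_jsonl` is a hypothetical helper, not a function from this repository.

```python
import json
from pathlib import Path
from typing import Any, List, Union


def read_jsonl(path: Union[str, Path]) -> List[Any]:
    """Parse one JSON value per non-empty line and return them in file order."""
    records: List[Any] = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines, including a trailing newline
                records.append(json.loads(line))
    return records
```

For example, `read_jsonl("data/truthfulqa/train.jsonl")` run from the repository root returns the training records as a list, and printing the first element shows which fields are available.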

Requirements

Python ≥3.11
CUDA-compatible GPU (recommended)
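A quick way to confirm the interpreter meets the version requirement before installing dependencies is sketched below. The minimum comes from the list above; `meets_minimum` is a hypothetical helper.

```python
import sys
from typing import Optional, Sequence, Tuple

MINIMUM: Tuple[int, int] = (3, 11)  # from the Requirements list


def meets_minimum(version: Optional[Sequence[int]] = None,
                  minimum: Tuple[int, int] = MINIMUM) -> bool:
    """Compare (major, minor) of the given or running interpreter to the minimum."""
    current = tuple(sys.version_info[:2]) if version is None else tuple(version[:2])
    return current >= minimum


if not meets_minimum():
    print("Python %d.%d or newer is required" % MINIMUM)
```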
