This repository implements the research on superactivator tokens: a small subset of highly activated tokens in the extreme tail of the activation distribution that can reliably signal concept presence. It contains the code for studying concept detection and inversion using these superactivator tokens.
- Overview
- Prerequisites
- Installation
- Main Concept Detection Analysis
- Alternative Analysis Methods
- Visualization & Analysis
- Directory Structure
Vision: CLIP ViT-L/14, Llama-3.2-11B-Vision-Instruct
Text: Llama-3.2-11B-Vision-Instruct, Gemma-2-9B, Qwen3-Embedding-4B
The codebase supports:
- Both supervised and unsupervised concept learning
- Token-level (patches for images, tokens for text) and global-level analysis
- Comprehensive evaluation across multiple datasets and modalities
This codebase is designed to work with the following datasets:
- CLEVR - Synthetic scenes with objects of different colors (Blue, Green, Red) and shapes (Cube, Cylinder, Sphere). Generated using the CLEVR generator with single-object scenes.
- COCO - Subset of the MS COCO dataset with 80 common object categories. We reference image indices and annotations only; original images must be obtained from the official COCO dataset.
- Broden-Pascal & Broden-OpenSurfaces - Concept annotations from the Broden dataset for network dissection. We include metadata referencing concept labels from the original Broden dataset.
- Sarcasm - Synthetic sarcasm dataset created for this work. Contains paragraph-level and word-level sarcasm annotations.
- iSarcasm - Extended version of the iSarcasm dataset with additional context. Due to licensing restrictions, the base iSarcasm text must be obtained from the original source. The augmentation process is detailed in the paper.
- GoEmotions - Enhanced version of Google's GoEmotions dataset with additional filler text. Based on GoEmotions (CC BY 4.0).
Download CLEVR, Sarcasm, and GoEmotions datasets from: https://drive.google.com/drive/folders/1rwrZjWGRF2OpWv6ESMHn87OVl55KsL65?usp=sharing
(COCO, Broden, and iSarcasm must be obtained from their original sources due to licensing restrictions)
Each dataset folder in Data/ contains:
- `metadata.csv` - Sample identifiers, concept/label information, and file paths
- `patches_w_image_mask_inputsize_(224, 224).pt` - Padding masks for CLIP (vision datasets only)
- `patches_w_image_mask_inputsize_(560, 560).pt` - Padding masks for Llama Vision (vision datasets only)
The padding masks indicate which patches contain actual image content versus padding, which is essential for accurate patch-level analysis.
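To inspect one of these mask files, here is a minimal sketch using `torch.load`; the dataset path is an example, and the exact shape and layout of the stored masks depend on how they were generated:

```python
import torch

# Load the CLIP padding masks for one dataset (path is an example; adjust to your layout).
masks = torch.load("Data/CLEVR/patches_w_image_mask_inputsize_(224, 224).pt")

# Assumption: a boolean entry per patch, True where the patch contains real image
# content and False where it is padding.
print(type(masks))
if torch.is_tensor(masks):
    print(masks.shape, masks.dtype)
```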
To use these datasets:
- Download from the Google Drive link above or the original sources
- Update the `image_path` or `text_path` columns in `metadata.csv` to reflect your local paths (see the sketch below)
- Run the analysis scripts with appropriate dataset arguments
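If your local copy of a dataset lives under a different prefix, the path columns can be rewritten in bulk. A minimal pandas sketch, assuming the column names described above; the old and new prefixes are placeholders:

```python
import pandas as pd

metadata_path = "Data/CLEVR/metadata.csv"  # adjust per dataset
df = pd.read_csv(metadata_path)

# Vision datasets use an image_path column; text datasets use text_path.
path_col = "image_path" if "image_path" in df.columns else "text_path"

# Swap the old path prefix for the location of your local copy (placeholders).
df[path_col] = df[path_col].str.replace("/old/prefix", "/your/local/prefix", regex=False)

df.to_csv(metadata_path, index=False)
```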
- Clone the repository:
git clone <repository-url>
cd SuperActivators
- Install dependencies:
pip install -r requirements.txt
# then install this project
pip install -e .
- Set up environment variables if needed:
export HF_TOKEN=<TOKEN>
export HF_HOME=/path/to/huggingface/cache
export CUDA_VISIBLE_DEVICES=0  # Select GPU
- Download the additional data from Google Drive (too large to include in the repository), then unzip it and move it into place:
# unzip the file
unzip SuperActivator_Data.zip
# move downloaded data to the correct location
mv SuperActivator\ Data/CLEVR/* Data/CLEVR/
mv SuperActivator\ Data/Augmented_GoEmotions/* Data/Augmented_GoEmotions/
mv SuperActivator\ Data/Sarcasm/* Data/Sarcasm/

The concept detection analysis extracts embeddings from transformer models and evaluates concept detection performance. Run these scripts sequentially from the repository root:
# 1. Extract embeddings (note: this will always download both the CLIP and Llama models from HF)
# For images:
python scripts/compute_image_gt_samples.py
# → Identifies ground truth sample indices for concept evaluation
# → Saves to: GT_Samples/{dataset}/
python scripts/embed_image_datasets.py
# → Computes CLIP/Llama embeddings for image patches and CLS tokens
# → Saves to: Embeddings/{dataset}/
# For text:
python scripts/embed_text_datasets.py
# → Computes text embeddings and GT samples in one step
# → Saves to: Embeddings/{dataset}/ and GT_Samples/{dataset}/
# 2. Learn concepts
python scripts/compute_all_concepts.py
# → Learns concept vectors using avg, linear separators, and k-means
# → Saves to: Concepts/{dataset}/
# 3. Compute activations
python scripts/compute_activations.py
# → Computes cosine similarities and signed distances for all concepts
# → Saves to: Cosine_Similarities/{dataset}/ and Distances/{dataset}/
# 4. Find thresholds for different percentiles
python scripts/validation_thresholds.py
# → Computes detection thresholds for different N% of positive calibration samples
# → Saves to: Thresholds/{dataset}/
# 5. Compute detection statistics
python scripts/all_detection_stats.py
# → Evaluates concept detection performance (F1, precision, recall)
# → Saves to: Quant_Results/{dataset}/
# 6. Compute direct alignment inversion statistics
python scripts/all_inversion_stats.py
# → Performs direct alignment inversion for concept localization and attribution
# → Saves to: Quant_Results/{dataset}/ (inversion metrics)

After completing the analysis, all quantitative results (detection metrics, F1 scores, precision/recall curves, etc.) will be saved in the Quant_Results/ folder.
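To browse those results programmatically, here is a minimal sketch that lists whatever the detection and inversion scripts wrote for one dataset; the assumption that metrics are stored as CSV files is ours, so adjust to the actual file format of your run:

```python
from pathlib import Path
import pandas as pd

results_dir = Path("Quant_Results/CLEVR")  # one subdirectory per dataset

# List everything the detection/inversion scripts produced for this dataset.
for f in sorted(results_dir.rglob("*")):
    if f.is_file():
        print(f.relative_to(results_dir))

# Assumption: metrics are stored as CSV; load one file to inspect F1/precision/recall columns.
csv_files = sorted(results_dir.rglob("*.csv"))
if csv_files:
    print(pd.read_csv(csv_files[0]).head())
```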
After the main analysis, run these for additional insights:
# Compare with baseline aggregation methods (max token, mean token, last token, random token)
python scripts/baseline_detections.py
# Find optimal percentthrumodel for each concept
python scripts/per_concept_ptm_optimization.py
# → Finds best layer (percentthrumodel) for each concept based on F1 scores
# → Saves to: Per_Concept_PTM_Optimization/{dataset}/

All analysis scripts support command line arguments. Examples:
# Process specific datasets and models
python scripts/embed_image_datasets.py --models CLIP Llama --datasets CLEVR Coco
# Use specific percentthrumodel values
python scripts/compute_all_concepts.py --percentthrumodels 0 25 50 75 100
# Process single dataset with specific model
python scripts/compute_activations.py --model CLIP --dataset CLEVR

Most scripts support:
- `--model` or `--models`: Specify which model(s) to use
- `--dataset` or `--datasets`: Specify which dataset(s) to process
- `--percentthrumodels`: List of layer percentages to analyze
- `--sample_type`: Choose between 'patch' (same as token in this context) or 'cls' analysis
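To sweep several configurations without retyping commands, the scripts can also be driven from Python. A minimal sketch using only the flags documented above; the dataset and model names are examples:

```python
import subprocess

datasets = ["CLEVR", "Coco"]   # example dataset names
models = ["CLIP", "Llama"]     # example model names

for dataset in datasets:
    for model in models:
        # Invoke the activation script with the documented --model/--dataset flags.
        subprocess.run(
            ["python", "scripts/compute_activations.py",
             "--model", model, "--dataset", dataset],
            check=True,
        )
```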
Extract concepts using vision-language models through prompting:
# Extract concepts
python scripts/extract_prompt_concepts.py --dataset CLEVR --model llama3.2-11
# Evaluate performance
python scripts/extract_prompt_concepts.py --dataset CLEVR --model llama3.2-11 --eval
Supported models:
- `llama3.2-11` (Llama-3.2-11B-Vision-Instruct)
- `qwen2.5-vl-3` (Qwen2.5-VL-3B-Instruct)
Results are saved in prompt_results/{dataset}/.
Analyze pretrained sparse autoencoders:
cd scripts/pretrained_saes/
python embed_image_datasets_sae.py
python compute_activations_sae_sparse.py
python postprocess_sae_activations.py
python sae_validation_thresholds_dense.py
python sae_detection_stats_dense.py
python sae_inversion_stats_dense.py

For text datasets:

cd scripts/pretrained_saes/
python embed_text_datasets_sae.py
# Continue with same steps as images

The repository includes four analysis notebooks in the notebooks/ directory:
jupyter lab notebooks/

- `Activation-Distributions.ipynb` - Visualizes in-concept and out-of-concept activation distributions, demonstrating the separation in the extreme tails that enables the superactivator mechanism
- `Compare-Methods.ipynb` - Shows quantitative results comparing concept detection performance and direct alignment inversion accuracy across different methods
- `Image-Concept-Evals.ipynb` - Provides qualitative examples of superactivator tokens on image datasets, visualizing which patches activate most strongly for different concepts
- `Text-Concepts.ipynb` - Shows qualitative examples of superactivator tokens in text datasets, highlighting which words activate most strongly for different concepts
SuperActivators/
├── scripts/ # Main analysis scripts
│ ├── embed_*.py # Embedding extraction
│ ├── compute_*.py # Concept learning & activation
│ ├── validation_*.py # Threshold optimization
│ └── pretrained_saes/ # SAE analysis scripts
├── notebooks/ # Jupyter notebooks for visualization
├── utils/ # Utility functions
├── Data/ # Dataset metadata and padding masks
├── requirements.txt # Python dependencies
└── pyproject.toml # Project configuration
Pipeline Output Directories (created during analysis):
- `Embeddings/` - Model embeddings for each dataset
- `Concepts/` - Learned concept vectors (avg, linsep, kmeans)
- `Cosine_Similarities/` - Cosine similarity activations
- `Distances/` - Signed distances for linear separators
- `GT_Samples/` - Ground truth sample indices
- `Thresholds/` - Optimal thresholds per concept
- `Quant_Results/` - Final detection metrics, F1 scores, precision/recall
- `activation_distributions/` - Activation distributions for visualization
- `prompt_results/` - Prompt-based concept extraction results
- `Best_Inversion_Percentiles_Cal/` - Optimal percentiles for inversion
- `Best_Detection_Percentiles_Cal/` - Optimal percentiles for detection
- `Per_Concept_PTM_Optimization/` - Optimal layer (percentthrumodel) for each concept
Each directory contains subdirectories for: CLEVR, Coco, Broden-Pascal, Broden-OpenSurfaces, Sarcasm, iSarcasm, GoEmotions