The goal of this repo is to provide a full code implementation of the paper "Bimodality of Sparse Autoencoder Features is Still There and Can Be Fixed" (spotlight at the Mechanistic Interpretability Workshop at NeurIPS 2025), available at https://openreview.net/forum?id=DbXKjT00yK.
Specifically:
- Generation of alignment score histograms for sparse autoencoders. The alignment score is a proposed simple measure of feature quality; mathematically, it is just the inner product between a feature's encoder and decoder vectors (see the sketch after this list).
- Aligned training, a proposed method of training sparse autoencoders that achieves improved results across several benchmarks.
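To make the first bullet concrete, here is a minimal sketch of computing the scores and plotting their histogram. The weight names and shapes (`W_enc` of shape `(d_model, n_features)`, `W_dec` of shape `(n_features, d_model)`) are illustrative assumptions, and random data stands in for a trained SAE; this is not the repo's actual interface.

```python
import numpy as np
import matplotlib.pyplot as plt

# Random weights stand in for a trained SAE; `W_enc`/`W_dec` and their
# shapes are illustrative assumptions, not this repo's actual interface.
rng = np.random.default_rng(0)
d_model, n_features = 512, 4096
W_enc = rng.normal(size=(d_model, n_features))  # encoder: one column per feature
W_dec = rng.normal(size=(n_features, d_model))  # decoder: one row per feature

# Alignment score of feature i: inner product of its encoder column
# with its decoder row.
alignment_scores = np.einsum("di,id->i", W_enc, W_dec)

plt.hist(alignment_scores, bins=100)
plt.xlabel("alignment score")
plt.ylabel("number of features")
plt.show()
```

With a real SAE, the bimodality referenced in the paper's title would show up as two modes in this histogram.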
For more details, history and motivation, consult the paper.
We use uv for dependency management. After installing uv, run `uv sync` to install all requirements specified in `pyproject.toml`.
The core rule is to provide an easy and direct way to replicate the main results. More precisely, every figure in the paper should have a corresponding notebook here that is sufficient to reproduce it:
- Figure 1: histograms of the alignment scores demonstrating bimodality
- Figure 2: scatter plot of MCS vs. the alignment score (illustrated in the sketch after this list)
- Figure 3: scatter plot of autointerpretability vs the alignment score
- Figure 4 and Figure 5: reconstruction error and dead neurons
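As a rough illustration of what a Figure-2-style plot involves, the sketch below assumes MCS denotes the maximum cosine similarity of each feature's decoder direction with the decoder directions of a reference SAE; the variable names, shapes, and random weights are placeholder assumptions, not this repo's code.

```python
import numpy as np
import matplotlib.pyplot as plt

# Random weights stand in for two trained SAEs; all names and shapes
# here are placeholder assumptions.
rng = np.random.default_rng(0)
d_model, n_features = 512, 4096
W_enc = rng.normal(size=(d_model, n_features))
W_dec = rng.normal(size=(n_features, d_model))
W_dec_ref = rng.normal(size=(n_features, d_model))  # reference SAE's decoder

def unit_rows(m):
    # Normalize each row to unit length so dot products are cosines.
    return m / np.linalg.norm(m, axis=1, keepdims=True)

# MCS (under this assumption): for each feature, the maximum cosine
# similarity between its decoder direction and any reference direction.
mcs = (unit_rows(W_dec) @ unit_rows(W_dec_ref).T).max(axis=1)
alignment_scores = np.einsum("di,id->i", W_enc, W_dec)

plt.scatter(alignment_scores, mcs, s=2, alpha=0.3)
plt.xlabel("alignment score")
plt.ylabel("MCS")
plt.show()
```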
For formatting and linting we use ruff, black, and isort. To check formatting, run `make check-format`; to fix it, run `make format`.
This repo builds on the following open-source repositories:
- For SAE training, we adapted the code from https://github.com/adamkarvonen/dictionary_learning_demo and https://github.com/saprmarks/dictionary_learning:
  - We added the code for aligned training.
  - We changed experiment tracking from wandb to mlflow (see the sketch after this list).
- For SAE evaluation, we adapted the code from https://github.com/adamkarvonen/SAEBench to be able to load the aligned SAE.
- For autointerpretability scores, we adapted the code from https://github.com/HoagyC/sparse_coding and https://github.com/openai/automated-interpretability:
  - We modified the code to use the open-source Gemma 3 27B as a judge instead of the proprietary GPT model.
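As a loose illustration of the wandb-to-mlflow swap mentioned above, the sketch below logs parameters and a per-step metric; the experiment name, parameters, and training loop are invented placeholders, while the mlflow calls themselves are the library's standard API.

```python
import mlflow

# Everything here (experiment name, params, loop, loss) is an invented
# placeholder; only the mlflow calls themselves are the standard API.
mlflow.set_experiment("aligned-sae")
with mlflow.start_run():
    mlflow.log_params({"n_features": 4096, "lr": 3e-4})
    for step in range(100):
        loss = 1.0 / (step + 1)  # placeholder for the real training loss
        mlflow.log_metric("loss", loss, step=step)
```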
If you find this work useful, please cite our paper:
@inproceedings{brzozowski2025bimodality,
  title={Bimodality of Sparse Autoencoder Features is Still There and Can Be Fixed},
  author={Micha{\l} Brzozowski},
  booktitle={Mechanistic Interpretability Workshop at NeurIPS 2025},
  year={2025},
  url={https://openreview.net/forum?id=DbXKjT00yK}
}