The goal of this repo is to provide a full code implementation of the paper "Bimodality of Sparse Autoencoder Features is Still There and Can Be Fixed" (spotlight at the Mechanistic Interpretability Workshop at NeurIPS 2025), available at https://openreview.net/forum?id=DbXKjT00yK.
Specifically:
- Generation of alignment score histograms for sparse autoencoders. The alignment score is a proposed simple measure of feature quality; mathematically, it is just the inner product between a feature's encoder and decoder vectors (see the sketch after this list).
- Aligned training, a proposed method of training sparse autoencoders that achieves improved results across several benchmarks.
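To make the first bullet concrete, here is a minimal sketch of computing the scores and plotting their histogram. The weight names and shapes (`W_enc` of shape `(d_model, n_features)`, `W_dec` of shape `(n_features, d_model)`) are illustrative assumptions, and random data stands in for a trained SAE; this is not the repo's actual interface.

```python
import numpy as np
import matplotlib.pyplot as plt

# Random weights stand in for a trained SAE; `W_enc`/`W_dec` and their
# shapes are illustrative assumptions, not this repo's actual interface.
rng = np.random.default_rng(0)
d_model, n_features = 512, 4096
W_enc = rng.normal(size=(d_model, n_features))  # encoder: one column per feature
W_dec = rng.normal(size=(n_features, d_model))  # decoder: one row per feature

# Alignment score of feature i: inner product of its encoder column
# with its decoder row.
alignment_scores = np.einsum("di,id->i", W_enc, W_dec)

plt.hist(alignment_scores, bins=100)
plt.xlabel("alignment score")
plt.ylabel("number of features")
plt.show()
```

With a real SAE, the bimodality referenced in the paper's title would show up as two modes in this histogram.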
For more details, history and motivation, consult the paper.
We use uv for dependency management. After installing uv, run `uv sync` to install all requirements specified in `pyproject.toml`.
The core rule is to provide an easy and direct way to replicate the main results. More precisely, every figure in the paper should have a corresponding notebook here that is sufficient to reproduce it:
- Figure 1: histograms of the alignment scores demonstrating bimodality
- Figure 2: scatter plot of MCS vs. the alignment score (illustrated in the sketch after this list)
- Figure 3: scatter plot of autointerpretability vs the alignment score
- Figure 4 and Figure 5: reconstruction error and dead neurons
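As a rough illustration of what a Figure-2-style plot involves, the sketch below assumes MCS denotes the maximum cosine similarity of each feature's decoder direction with the decoder directions of a reference SAE; the variable names, shapes, and random weights are placeholder assumptions, not this repo's code.

```python
import numpy as np
import matplotlib.pyplot as plt

# Random weights stand in for two trained SAEs; all names and shapes
# here are placeholder assumptions.
rng = np.random.default_rng(0)
d_model, n_features = 512, 4096
W_enc = rng.normal(size=(d_model, n_features))
W_dec = rng.normal(size=(n_features, d_model))
W_dec_ref = rng.normal(size=(n_features, d_model))  # reference SAE's decoder

def unit_rows(m):
    # Normalize each row to unit length so dot products are cosines.
    return m / np.linalg.norm(m, axis=1, keepdims=True)

# MCS (under this assumption): for each feature, the maximum cosine
# similarity between its decoder direction and any reference direction.
mcs = (unit_rows(W_dec) @ unit_rows(W_dec_ref).T).max(axis=1)
alignment_scores = np.einsum("di,id->i", W_enc, W_dec)

plt.scatter(alignment_scores, mcs, s=2, alpha=0.3)
plt.xlabel("alignment score")
plt.ylabel("MCS")
plt.show()
```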
For formatting and linting we use ruff, black, and isort. To check formatting, run `make check-format`; to fix it, run `make format`.
This repo builds on the following open-source repositories:
- For SAE training, we adapted the code from https://github.com/adamkarvonen/dictionary_learning_demo and https://github.com/saprmarks/dictionary_learning:
  - We added the code for aligned training.
  - We changed experiment tracking from wandb to mlflow (see the sketch after this list).
- For SAE evaluation, we adapted the code from https://github.com/adamkarvonen/SAEBench to be able to load the aligned SAE.
- For autointerpretability scores, we adapted the code from https://github.com/HoagyC/sparse_coding and https://github.com/openai/automated-interpretability:
  - We modified the code to use the open-source Gemma 3 27B as a judge instead of the proprietary GPT model.
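As a loose illustration of the wandb-to-mlflow swap mentioned above, the sketch below logs parameters and a per-step metric; the experiment name, parameters, and training loop are invented placeholders, while the mlflow calls themselves are the library's standard API.

```python
import mlflow

# Everything here (experiment name, params, loop, loss) is an invented
# placeholder; only the mlflow calls themselves are the standard API.
mlflow.set_experiment("aligned-sae")
with mlflow.start_run():
    mlflow.log_params({"n_features": 4096, "lr": 3e-4})
    for step in range(100):
        loss = 1.0 / (step + 1)  # placeholder for the real training loss
        mlflow.log_metric("loss", loss, step=step)
```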
If you find this work useful, please cite our paper:
@inproceedings{brzozowski2025bimodality,
  title={Bimodality of Sparse Autoencoder Features is Still There and Can Be Fixed},
  author={Micha{\l} Brzozowski},
  booktitle={Mechanistic Interpretability Workshop at NeurIPS 2025},
  year={2025},
  url={https://openreview.net/forum?id=DbXKjT00yK}
}