Code implementation for the paper "Bimodality of Sparse Autoencoder Features is Still There and Can Be Fixed"

SamsungLabs/sae_bimodality

Bimodality of Sparse Autoencoder Features

This repo provides a full code implementation of the paper "Bimodality of Sparse Autoencoder Features is Still There and Can Be Fixed" (spotlight at the Mechanistic Interpretability Workshop at NeurIPS 2025), which can be read here.

Specifically:

  • Generation of alignment score histograms for sparse autoencoders. The alignment score is a proposed simple measure of feature quality; mathematically, it is just the inner product between the encoder and decoder vectors.
  • Aligned training: a proposed method of training sparse autoencoders that achieves improved results across several benchmarks.
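
As a rough illustration of the first point, the sketch below computes one alignment score per feature and bins the scores into a histogram. It is not the code from this repo: the function name `alignment_scores`, the `(n_features, d_model)` weight layout, the optional unit normalisation, and the random stand-in weights are all assumptions made for the example.

```python
import numpy as np


def alignment_scores(w_enc: np.ndarray, w_dec: np.ndarray, normalize: bool = True) -> np.ndarray:
    """Inner product between each feature's encoder and decoder vector.

    Both arrays are assumed to have shape (n_features, d_model), with one
    vector per row. This layout is an assumption, not the repo's convention.
    """
    if w_enc.shape != w_dec.shape:
        raise ValueError("encoder and decoder weights must have the same shape")
    if normalize:
        # Assumption: scale every vector to unit length so that the score
        # is a cosine similarity in [-1, 1].
        w_enc = w_enc / np.linalg.norm(w_enc, axis=1, keepdims=True)
        w_dec = w_dec / np.linalg.norm(w_dec, axis=1, keepdims=True)
    # Row-wise dot product: one score per feature.
    return np.einsum("fd,fd->f", w_enc, w_dec)


# Random stand-in weights (hypothetical, not a trained sparse autoencoder).
rng = np.random.default_rng(0)
w_dec = rng.normal(size=(512, 64))
w_enc = w_dec + 0.5 * rng.normal(size=(512, 64))

scores = alignment_scores(w_enc, w_dec)
counts, bin_edges = np.histogram(scores, bins=20, range=(-1.0, 1.0))
```

With real weights, `counts` and `bin_edges` can be passed to any plotting library to draw the histogram. Whether the paper normalises the vectors before taking the inner product should be checked against the paper and notebooks; pass `normalize=False` for the raw inner product.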

For more details, history and motivation, consult the paper.

Setup

We use uv to manage dependencies. After installing uv, run uv sync to install all the requirements specified in the pyproject.toml file.

Philosophy

The core rule is to provide an easy and direct way to replicate the main results. More precisely, every figure in the paper should have a corresponding notebook here that is sufficient to reproduce it.

Formatting

For formatting and linting we use ruff, black and isort. To check the formatting, run make check-format; to fix it, run make format.

Building blocks

In this repo we used the following open source repositories:

Citation

If you find this work useful, please cite our paper:

@inproceedings{
brzozowski2025bimodality,
title={Bimodality of Sparse Autoencoder Features is Still There and Can Be Fixed},
author={Micha{\l} Brzozowski},
booktitle={Mechanistic Interpretability Workshop at NeurIPS 2025},
year={2025},
url={https://openreview.net/forum?id=DbXKjT00yK}
}
