
Awesome Mechanistic Interpretability


A carefully curated collection of high-quality libraries, projects, tutorials, research papers, and other essential resources focused on Mechanistic Interpretability, a growing subfield in machine learning interpretability research that aims to reverse-engineer neural networks into understandable computational components. This repository serves as a comprehensive and well-organized knowledge base for researchers, engineers, and enthusiasts working to uncover the inner workings of modern AI systems, particularly large language models (LLMs).

To keep the community current on the latest developments, this repository is automatically updated with recent mechanistic interpretability papers from arXiv, providing timely access to the new techniques, discoveries, and frameworks that are shaping model transparency and alignment.
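
The update pipeline itself is not shown in this README. As a minimal, hypothetical sketch of the kind of arXiv API query such automation could run, consider the Python snippet below; the search string, sort options, and `feedparser` dependency are illustrative assumptions, not the repository's actual implementation.

```python
# Hypothetical sketch of an arXiv "recent papers" query -- NOT the
# repository's actual update script. Assumes: pip install feedparser
import urllib.parse

import feedparser

ARXIV_API = "http://export.arxiv.org/api/query"

def fetch_recent_mech_interp_papers(max_results: int = 10):
    """Return (title, link) pairs for the newest matching arXiv papers."""
    query = urllib.parse.urlencode({
        "search_query": 'all:"mechanistic interpretability"',
        "start": 0,
        "max_results": max_results,
        "sortBy": "submittedDate",   # newest submissions first
        "sortOrder": "descending",
    })
    feed = feedparser.parse(f"{ARXIV_API}?{query}")
    return [(entry.title, entry.link) for entry in feed.entries]

for title, link in fetch_recent_mech_interp_papers():
    print(f"- {title}\n  {link}")
```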

Note

📢 Announcement: Our paper from AIT Lab is now available on SSRN!
Title: Bridging the Black Box: A Survey on Mechanistic Interpretability in AI
If you find this paper interesting, please consider citing our work. Thank you for your support!

@article{somvanshi2025bridging,
  title={Bridging the Black Box: A Survey on Mechanistic Interpretability in AI},
  author={Somvanshi, Shriyank and Islam, Md Monzurul and Rafe, Amir and Tusti, Anannya Ghosh and Chakraborty, Arka and Baitullah, Anika and Chowdhury, Tausif Islam and Alnawmasi, Nawaf and Dutta, Anandi and Das, Subasish},
  journal={Available at SSRN 5345552},
  year={2025}
}

Whether you are investigating the circuits behind in-context learning, decoding attention heads in transformers, or exploring interpretability tools like activation patching and causal tracing, this collection serves as a centralized hub for everything related to Mechanistic Interpretability — enriched by original peer-reviewed contributions and hands-on research from the broader interpretability community.
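
For readers new to these tools, the sketch below shows the core idea behind activation patching using the open-source TransformerLens library. It is an illustration, not code from this repository; the GPT-2 model choice, the prompt pair, and the logit-difference metric are assumptions made for the example. The technique caches activations from a "clean" run and patches them into a "corrupted" run, one layer at a time, to localize where the model computes the answer.

```python
# Minimal activation-patching sketch with TransformerLens (hypothetical
# example; model, prompts, and metric are illustrative assumptions).
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")

# Prompts differ only at the repeated name, so they tokenize to equal lengths.
clean   = "After John and Mary went to the store, John gave a drink to"
corrupt = "After John and Mary went to the store, Mary gave a drink to"
clean_tokens, corrupt_tokens = model.to_tokens(clean), model.to_tokens(corrupt)
answer, distractor = model.to_single_token(" Mary"), model.to_single_token(" John")

# Cache every activation from the clean run.
_, clean_cache = model.run_with_cache(clean_tokens)

def logit_diff(logits):
    # How strongly the final position prefers the clean answer over the distractor.
    return (logits[0, -1, answer] - logits[0, -1, distractor]).item()

# Overwrite the corrupt run's residual stream with the clean one, layer by layer.
for layer in range(model.cfg.n_layers):
    hook_name = utils.get_act_name("resid_pre", layer)

    def patch_resid(resid, hook):
        return clean_cache[hook.name]  # replace activation with the clean value

    patched = model.run_with_hooks(corrupt_tokens, fwd_hooks=[(hook_name, patch_resid)])
    print(f"layer {layer:2d}: logit diff = {logit_diff(patched):+.3f}")
```

Layers where the patched logit difference recovers the clean run's preference are candidates for where the relevant circuit acts; causal tracing applies the same restore-and-measure idea, typically corrupting inputs with noise rather than a token swap.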

Last Updated

January 28, 2026 at 01:19:34 AM UTC

Contents

  • Papers (490)
  • Dedicated Publication Threads
  • Library
  • Tutorial
    • Written Tutorials
    • Video Tutorials

Contributing

We welcome contributions to this repository! If you have a resource that you believe should be included, please submit a pull request or open an issue. Contributions can include:

  • New libraries or tools related to mechanistic interpretability
  • Tutorials or guides that help users understand and implement mechanistic interpretability techniques
  • Research papers that advance the field of mechanistic interpretability
  • Any other resources that you find valuable for the community

How to Contribute

  1. Fork the repository.
  2. Create a new branch for your changes.
  3. Make your changes and commit them with a clear message.
  4. Push your changes to your forked repository.
  5. Submit a pull request to the main repository.

Before contributing, take a look at the existing resources to avoid duplicates.

License

This repository is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0). You are free to share and adapt the material, provided you give appropriate credit, link to the license, and indicate if changes were made.

Star History

[Star history chart for this repository]
