A carefully curated collection of high-quality libraries, projects, tutorials, research papers, and other essential resources focused on Mechanistic Interpretability, a growing subfield of machine learning interpretability research that aims to reverse-engineer neural networks into understandable computational components. This repository serves as a comprehensive, well-organized knowledge base for researchers, engineers, and enthusiasts working to uncover the inner workings of modern AI systems, particularly large language models (LLMs).
To keep the community current with the latest developments, the repository is automatically updated with recent mechanistic interpretability papers from arXiv, providing timely access to the new techniques, discoveries, and frameworks that are shaping the future of model transparency and alignment.
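The repository's actual update pipeline is not included here, but as a rough sketch of what such automation can look like, the snippet below pulls recent titles from the public arXiv API. The query string, result limit, and the helper name `fetch_recent_papers` are illustrative assumptions rather than details of this project.

```python
# Illustrative sketch (not this repository's pipeline) of fetching recent
# mechanistic-interpretability papers from the public arXiv API.
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom namespace used by the arXiv API


def fetch_recent_papers(query='all:"mechanistic interpretability"', max_results=20):
    params = urllib.parse.urlencode({
        "search_query": query,
        "sortBy": "submittedDate",
        "sortOrder": "descending",
        "max_results": max_results,
    })
    url = f"http://export.arxiv.org/api/query?{params}"
    with urllib.request.urlopen(url) as response:
        feed = ET.fromstring(response.read())
    # Each Atom <entry> is one paper; keep just the title and the abstract link.
    return [
        (entry.findtext(f"{ATOM}title").strip(), entry.findtext(f"{ATOM}id"))
        for entry in feed.findall(f"{ATOM}entry")
    ]


if __name__ == "__main__":
    for title, link in fetch_recent_papers():
        print(f"- {title} ({link})")
```

Each result is printed as a markdown bullet, mirroring the paper list format used further down in this README.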
Note
📢 Announcement: Our paper from AIT Lab is now available on SSRN!
Title: Bridging the Black Box: A Survey on Mechanistic Interpretability in AI
If you find this paper interesting, please consider citing our work. Thank you for your support!
```bibtex
@article{somvanshi2025bridging,
  title={Bridging the Black Box: A Survey on Mechanistic Interpretability in AI},
  author={Somvanshi, Shriyank and Islam, Md Monzurul and Rafe, Amir and Tusti, Anannya Ghosh and Chakraborty, Arka and Baitullah, Anika and Chowdhury, Tausif Islam and Alnawmasi, Nawaf and Dutta, Anandi and Das, Subasish},
  journal={Available at SSRN 5345552},
  year={2025}
}
```

Whether you are investigating the circuits behind in-context learning, decoding attention heads in transformers, or exploring interpretability tools like activation patching and causal tracing, this collection serves as a centralized hub for everything related to Mechanistic Interpretability — enriched by original peer-reviewed contributions and hands-on research from the broader interpretability community.
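For readers new to the tools named above, here is a minimal sketch of activation patching using the open-source TransformerLens library; the model choice (GPT-2 small), prompts, patched layer, and hook point are illustrative assumptions, not a prescription from this repository.

```python
# Minimal activation-patching sketch with TransformerLens
# (assumed installed via: pip install transformer-lens).
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")

# Clean and corrupted prompts that tokenize to the same length.
clean_tokens = model.to_tokens("The city of Rome is the capital of")
corrupt_tokens = model.to_tokens("The city of Paris is the capital of")

# Run the clean prompt once and cache every intermediate activation.
_, clean_cache = model.run_with_cache(clean_tokens)

layer = 6
hook_name = utils.get_act_name("resid_pre", layer)  # residual stream entering layer 6


def patch_resid(activation, hook):
    # Replace the corrupted residual stream at this layer with the clean one.
    return clean_cache[hook_name]


# Forward the corrupted prompt while patching in the clean activation.
patched_logits = model.run_with_hooks(
    corrupt_tokens, fwd_hooks=[(hook_name, patch_resid)]
)

# If the patched layer carries the relevant information, the logit for the
# clean answer (" Italy") should recover on the corrupted prompt.
italy = model.to_single_token(" Italy")
print("patched logit for ' Italy':", patched_logits[0, -1, italy].item())
```

In practice, such experiments sweep the patched layer and token position and measure how much of the clean answer's logit is recovered, which is how patching localizes the components that carry the relevant information.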
Last updated: January 28, 2026 at 01:19:34 AM UTC
- Mechanistic Decomposition of Sentence Representations
- Domain Switching on the Pareto Front: Multi-Objective Deep Kernel Learning in Automated Piezoresponse Force Microscopy
- Rethinking Crowd-Sourced Evaluation of Neuron Explanations
- MIB: A Mechanistic Interpretability Benchmark
- Training Superior Sparse Autoencoders for Instruct Models
- RomanLens: The Role Of Latent Romanization In Multilinguality In LLMs
- Performance Gap in Entity Knowledge Extraction Across Modalities in Vision Language Models
- A CRISP approach to QSP: XAI enabling fit-for-purpose models
- Diversity of Transformer Layers: One Aspect of Parameter Scaling Laws
- Tug-of-war between idiom's figurative and literal meanings in LLMs
- TinySQL: A Progressive Text-to-SQL Dataset for Mechanistic Interpretability Research
- Evaluating Neuron Explanations: A Unified Framework with Sanity Checks
- A Survey on Sparse Autoencoders: Interpreting the Internal Mechanisms of Large Language Models
- Dissecting Bias in LLMs: A Mechanistic Interpretability Perspective
- The Vector Grounding Problem
- An analytic theory of creativity in convolutional diffusion models
- Behavioural vs. Representational Systematicity in End-to-End Models: An Opinionated Survey
- Visualizing and Controlling Cortical Responses Using Voxel-Weighted Activation Maximization
- Is the end of Insight in Sight?
- Forecasting Seasonal Influenza Epidemics with Physics-Informed Neural Networks
- Time Course MechInterp: Analyzing the Evolution of Components and Knowledge in Large Language Models
- Tracking the Feature Dynamics in LLM Training: A Mechanistic Study
- Prisma: An Open Source Toolkit for Mechanistic Interpretability in Vision and Video
- Interpreting Large Text-to-Image Diffusion Models with Dictionary Learning
- Do Large Language Models (Really) Need Statistical Foundations?
- Identifying interactions across brain areas while accounting for individual-neuron dynamics with a Transformer-based variational autoencoder
- SafeSteer: Interpretable Safety Steering with Refusal-Evasion in LLMs
- Anchored Answers: Unravelling Positional Bias in GPT-2's Multiple-Choice Questions
- : Interpreting and leveraging semantic information in diffusion models
- Circuit Stability Characterizes Language Model Generalization
- Stochastic Chameleons: Irrelevant Context Hallucinations Reveal Class-Based (Mis)Generalization in LLMs
- Planning in a recurrent neural network that plays Sokoban
- Towards Unified Attribution in Explainable AI, Data-Centric AI, and Mechanistic Interpretability
- BioReason: Incentivizing Multimodal Biological Reasoning within a DNA-LLM Model
- Enhancing Automated Interpretability with Output-Centric Feature Descriptions
- Measuring and Controlling Solution Degeneracy across Task-Trained Recurrent Neural Networks
- Sparsification and Reconstruction from the Perspective of Representation Geometry
- Mitigating Overthinking in Large Reasoning Models via Manifold Steering
- Understanding Synthetic Context Extension via Retrieval Heads
- In-Context Linear Regression Demystified: Training Dynamics and Mechanistic Interpretability of Multi-Head Softmax Attention
- MARS-Bench: A Multi-turn Athletic Real-world Scenario Benchmark for Dialogue Evaluation
- The Hidden Dimensions of LLM Alignment: A Multi-Dimensional Analysis of Orthogonal Safety Directions
- Position: Mechanistic Interpretability Should Prioritize Feature Consistency in SAEs
- From What to How: Attributing CLIP's Latent Components Reveals Unexpected Semantic Reliance
- Tensorization is a powerful but underexplored tool for compression and interpretability of neural networks
- MoRE-Brain: Routed Mixture of Experts for Interpretable and Generalizable Cross-Subject fMRI Visual Decoding
- The Birth of Knowledge: Emergent Features across Time, Space, and Scale in Large Language Models
- DB-KSVD: Scalable Alternating Optimization for Disentangling High-Dimensional Embedding Spaces
- Multi-Scale Probabilistic Generation Theory: A Hierarchical Framework for Interpreting Large Language Models
- The Origins of Representation Manifolds in Large Language Models
- The Remarkable Robustness of LLMs: Stages of Inference?
- Monet: Mixture of Monosemantic Experts for Transformers
- Decomposing MLP Activations into Interpretable Features via Semi-Nonnegative Matrix Factorization
- PAL: Probing Audio Encoders via LLMs -- A Study of Information Transfer from Audio Encoders to LLMs
- Revisiting Transformers with Insights from Image Filtering
- Towards Understanding Fine-Tuning Mechanisms of LLMs via Circuit Analysis
- Understanding the Repeat Curse in Large Language Models from a Feature Perspective
- Sectoral Coupling in Linguistic State Space
- SAE-V: Interpreting Multimodal Models for Enhanced Alignment
- Taming Polysemanticity in LLMs: Provable Feature Recovery via Sparse Autoencoders
- Interpretability and Generalization Bounds for Learning Spatial Physics
- A Implies B: Circuit Analysis in LLMs for Propositional Logical Reasoning
- Mechanistic Interpretability in the Presence of Architectural Obfuscation
- Out of Control -- Why Alignment Needs Formal Control Theory (and an Alignment Control Stack)
- Reasoning Circuits in Language Models: A Mechanistic Interpretation of Syllogistic Inference
- Validating Mechanistic Interpretations: An Axiomatic Approach
- Six Fallacies in Substituting Large Language Models for Human Participants
- Mechanistic Interpretability of Diffusion Models: Circuit-Level Analysis and Causal Validation
- From memories to maps: Mechanisms of in context reinforcement learning in transformers
- Measuring and Guiding Monosemanticity
- Emergent collective dynamics from motile photokinetic organisms
- Mechanistic Interpretability Needs Philosophy
- Interpreting Global Perturbation Robustness of Image Models using Axiomatic Spectral Importance Decomposition
- Circuit Compositions: Exploring Modular Structures in Transformer-Based Language Models
- Separating Tongue from Thought: Activation Patching Reveals Language-Agnostic Concept Representations in Transformers
- Bilinear MLPs enable weight-based mechanistic interpretability
- From Memories to Maps: Mechanisms of In-Context Reinforcement Learning in Transformers
- Amortizing personalization in virtual brain twins
- Stochastic Parameter Decomposition
- Understanding Verbatim Memorization in LLMs Through Circuit Discovery
- Mechanistic Interpretability of Emotion Inference in Large Language Models
- Data-Driven Multiscale Topology Optimization of Spinodoid Architected Materials with Controllable Anisotropy
- SAFER: Probing Safety in Reward Models with Sparse Autoencoder
- Prompting as Scientific Inquiry
- Emerging AI Approaches for Cancer Spatial Omics
- Learning Modular Exponentiation with Transformers
- Constraint-Guided Symbolic Regression for Data-Efficient Kinetic Model Discovery
- Transformers Don't Need LayerNorm at Inference Time: Scaling LayerNorm Removal to GPT-2 XL and the Implications for Mechanistic Interpretability
- Circuit-tuning: A Mechanistic Approach for Identifying Parameter Redundancy and Fine-tuning Neural Networks
- Re-Emergent Misalignment: How Narrow Fine-Tuning Erodes Safety Alignment in LLMs
- Scientific Machine Learning of Chaotic Systems Discovers Governing Equations for Neural Populations
- Quantifying Cross-Attention Interaction in Transformers for Interpreting TCR-pMHC Binding
- Concept-Based Mechanistic Interpretability Using Structured Knowledge Graphs
- Dynamical Archetype Analysis: Autonomous Computation
- A statistical approach to latent dynamic modeling with differential equations
- The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?
- Mechanistic Indicators of Understanding in Large Language Models
- Adversarial Activation Patching: A Framework for Detecting and Mitigating Emergent Deception in Safety-Aligned Transformers
- Towards Interpretable Drug-Drug Interaction Prediction: A Graph-Based Approach with Molecular and Network-Level Explanations
- Simulation as Supervision: Mechanistic Pretraining for Scientific Discovery
- A PBN-RL-XAI Framework for Discovering a "Hit-and-Run" Therapeutic Strategy in Melanoma
- Propensity score weighting across counterfactual worlds: longitudinal effects under positivity violations
- Mechanistic Interpretability of LoRA-Adapted Language Models for Nuclear Reactor Safety Applications
- Comprehension Without Competence: Architectural Limits of LLMs in Symbolic Computation and Reasoning
- CytoSAE: Interpretable Cell Embeddings for Hematology
- FADE: Why Bad Descriptions Happen to Good Features
- Tracing Facts or just Copies? A critical investigation of the Competitions of Mechanisms in Large Language Models
- Teach Old SAEs New Domain Tricks with Boosting
- Insights into a radiology-specialised multimodal large language model with sparse autoencoders
- Is This Just Fantasy? Language Model Representations Reflect Human Judgments of Event Plausibility
- Towards scientific discovery with dictionary learning: Extracting biological concepts from microscopy foundation models
- Understanding Matching Mechanisms in Cross-Encoders
- Decoding Translation-Related Functional Sequences in 5'UTRs Using Interpretable Deep Learning Models
- Probing Ranking LLMs: A Mechanistic Analysis for Information Retrieval
- Deep Learning for Blood-Brain Barrier Permeability Prediction
- Residual Koopman Model Predictive Control for Enhanced Vehicle Dynamics with Small On-Track Data Input
- CircuitProbe: Dissecting Spatiotemporal Visual Semantics with Circuit Tracing
- HumorDB: Can AI understand graphical humor?
- Analyze Feature Flow to Enhance Interpretation and Steering in Language Models
- A nonparametric approach to practical identifiability of nonlinear mixed effects models
- Large Language Models Are Human-Like Internally
- How Chain-of-Thought Works? Tracing Information Flow from Decoding, Projection, and Activation
- Can You Trust an LLM with Your Life-Changing Decision? An Investigation into AI High-Stakes Responses
- Semiclassical Spin Exchange via Temperature-Dependent Transition States
- Model Directions, Not Words: Mechanistic Topic Models Using Sparse Autoencoders
- Trustworthy Reasoning: Evaluating and Enhancing Factual Accuracy in LLM Intermediate Thought Processes
- How does Chain of Thought Think? Mechanistic Interpretability of Chain-of-Thought Reasoning with Sparse Autoencoding
- Decomposing Representation Space into Interpretable Subspaces with Unsupervised Learning
- Modeling the Temperature-Humidity Coupling Dynamics of Soybean Pod Borer Population and Assessing the Predictive Performance of the PCM-NN Algorithm
- I Have Covered All the Bases Here: Interpreting Reasoning Features in Large Language Models via Sparse Autoencoders
- Covariance spectrum in nonlinear recurrent neural networks
- Discovery of Disease Relationships via Transcriptomic Signature Analysis Powered by Agentic AI
- Evaluating and Designing Sparse Autoencoders by Approximating Quasi-Orthogonality
- Surgical Knowledge Rewrite in Compact LLMs: An 'Unlearn-then-Learn' Strategy with (IA)^3 for Localized Factual Modulation and Catastrophic Forgetting Mitigation
- Activation Steering for Bias Mitigation: An Interpretable Approach to Safer LLMs
- Entangled in Representations: Mechanistic Investigation of Cultural Biases in Large Language Models
- BiasGym: Fantastic Biases and How to Find (and Remove) Them
- From Transformer to Biology: A Hierarchical Model for Attention in Complex Problem-Solving
- How Post-Training Reshapes LLMs: A Mechanistic View on Knowledge, Truthfulness, Refusal, and Confidence
- Resurrecting the Salmon: Rethinking Mechanistic Interpretability with Domain-Specific Sparse Autoencoders
- BiasGym: Fantastic LLM Biases and How to Find (and Remove) Them
- eDIF: A European Deep Inference Fabric for Remote Interpretability of LLM
- Maximum Entropy Models for Unimodal Time Series: Case Studies of Universe 25 and St. Matthew Island
- Mantis: A Simulation-Grounded Foundation Model for Disease Forecasting
- CALYPSO: Forecasting and Analyzing MRSA Infection Patterns with Community and Healthcare Transmission Dynamics
- Counterfactual Probabilistic Diffusion with Expert Models
- Modeling GRNs with a Probabilistic Categorical Framework
- Disentangling concept semantics via multilingual averaging in Sparse Autoencoders
- LLMSymGuard: A Symbolic Safety Guardrail Framework Leveraging Interpretable Jailbreak Concepts
- From Indirect Object Identification to Syllogisms: Exploring Binary Mechanisms in Transformer Circuits
- Fidelity Isn't Accuracy: When Linearly Decodable Functions Fail to Match the Ground Truth
- Beyond Transcription: Mechanistic Interpretability in ASR
- Mechanistic Exploration of Backdoored Large Language Model Attention Patterns
- KL-based self-distillation for large language models
- Adversarial Examples Are Not Bugs, They Are Superposition
- LLM Assertiveness can be Mechanistically Decomposed into Emotional and Logical Components
- Disentangling Polysemantic Neurons with a Null-Calibrated Polysemanticity Index and Causal Patch Interventions
- Rethinking scale in network neuroscience: Contributions and opportunities at the nanoscale
- Do VLMs Have Bad Eyes? Diagnosing Compositional Failures via Mechanistic Interpretability
- Biologically Disentangled Multi-Omic Modeling Reveals Mechanistic Insights into Pan-Cancer Immunotherapy Resistance
- Tracing Positional Bias in Financial Decision-Making: Mechanistic Insights from Qwen2.5
- Linear Power System Modeling and Analysis Across Wide Operating Ranges: A Hierarchical Neural State-Space Equation Approach
- Even Heads Fix Odd Errors: Mechanistic Discovery and Surgical Repair in Transformer Attention
- RelP: Faithful and Efficient Circuit Discovery via Relevance Patching
- Prompt the Unseen: Evaluating Visual-Language Alignment Beyond Supervision
- Mechanistic interpretability for steering vision-language-action models
- Can LLMs Lie? Investigation beyond Hallucination
- Challenges in Understanding Modality Conflict in Vision-Language Models
- Non-Linear Model-Based Sequential Decision-Making in Agriculture
- Preserving Bilinear Weight Spectra with a Signed and Shrunk Quadratic Activation Function
- Pulling Back the Curtain on ReLU Networks
- Sparse Autoencoder Neural Operators: Model Recovery in Function Spaces
- Personality as a Probe for LLM Evaluation: Method Trade-offs and Downstream Effects
- Interpreting Transformer Architectures as Implicit Multinomial Regression
- ProtSAE: Disentangling and Interpreting Protein Language Models via Semantically-Guided Sparse Autoencoders
- Towards explainable decision support using hybrid neural models for logistic terminal automation
- Measuring Uncertainty in Transformer Circuits with Effective Information Consistency
- Think-to-Talk or Talk-to-Think? When LLMs Come Up with an Answer in Multi-Hop Arithmetic Reasoning
- Data-driven discovery of dynamical models in biology
- Small Vectors, Big Effects: A Mechanistic Study of RL-Induced Reasoning via Steering Vectors
- The Quest for the Right Mediator: Surveying Mechanistic Interpretability Through the Lens of Causal Mediation Analysis
- Interpretability as Alignment: Making Internal Understanding a Design Principle
- Behind the Scenes: Mechanistic Interpretability of LoRA-adapted Whisper for Speech Emotion Recognition
- Decoding the Stability of Transition-Metal Alloys with Theory-infused Deep Learning
- An Agentic AI Workflow to Simplify Parameter Estimation of Complex Differential Equation Systems
- Beyond I'm Sorry, I Can't: Dissecting Large Language Model Refusal
- The power of dynamic causality in observer-based design for soft sensor applications
- Modelling Under-Reported Data: Pitfalls of Naïve Approaches and a New Statistical Framework for Epidemic Curve Reconstruction
- The Anatomy of Alignment: Decomposing Preference Optimization by Steering Sparse Features
- Towards Inclusive Toxic Content Moderation: Addressing Vulnerabilities to Adversarial Attacks in Toxicity Classifiers Tackling LLM-generated Content
- Swarm Intelligence for Chemical Reaction Optimisation
- Learning Mechanistic Subtypes of Neurodegeneration with a Physics-Informed Variational Autoencoder Mixture Model
- Unified Spatiotemporal Physics-Informed Learning (USPIL): A Framework for Modeling Complex Predator-Prey Dynamics
- Mechanistic Understanding and Mitigation of Language Confusion in English-Centric Large Language Models
- DeepMech: A Machine Learning Framework for Chemical Reaction Mechanism Prediction
- Modeling Transformers as complex networks to analyze learning dynamics
- Bayesian Calibration and Model Assessment of Cell Migration Dynamics with Surrogate Model Integration
- Learning From Simulators: A Theory of Simulation-Grounded Learning
- Beyond the Leaderboard: Understanding Performance Disparities in Large Language Models via Model Diffing
- Mechanistic Interpretability with SAEs: Probing Religion, Violence, and Geography in Large Language Models
- Tikhonov-Fenichel Reductions and their Application to a Novel Modelling Approach for Mutualism
- A Machine Learning Framework for Pathway-Driven Therapeutic Target Discovery in Metabolic Disorders
- From Parameters to Performance: A Data-Driven Study on LLM Structure and Development
- Safe-SAIL: Towards a Fine-grained Safety Landscape of Large Language Models via Sparse Autoencoder Interpretation Framework
- BioBO: Biology-informed Bayesian Optimization for Perturbation Design
- Interpreting ResNet-based CLIP via Neuron-Attention Decomposition
- Integrating Mechanistic Modeling and Machine Learning to Study CD4+/CD8+ CAR-T Cell Dynamics with Tumor Antigen Regulation
- Fine-Tuning is Subgraph Search: A New Lens on Learning Dynamics
- Binary Autoencoder for Mechanistic Interpretability of Large Language Models
- CLUE: Conflict-guided Localization for LLM Unlearning Framework
- Rethinking Circuit Completeness in Language Models: AND, OR, and ADDER Gates
- GALAX: Graph-Augmented Language Model for Explainable Reinforcement-Guided Subgraph Reasoning in Precision Medicine
- Towards Atoms of Large Language Models
- RAPTOR-GEN: RApid PosTeriOR GENerator for Bayesian Learning in Biomanufacturing
- Concept-SAE: Active Causal Probing of Visual Model Behavior
- Attributing Response to Context: A Jensen-Shannon Divergence Driven Mechanistic Study of Context Attribution in Retrieval-Augmented Generation
- Backdoor Attribution: Elucidating and Controlling Backdoor in Language Models
- Elucidating Mechanisms of Demographic Bias in LLMs for Healthcare
- Hedonic Neurons: A Mechanistic Mapping of Latent Coalitions in Transformer MLPs
- LLM Interpretability with Identifiable Temporal-Instantaneous Representation
- Mechanistic Fine-tuning for In-context Learning
- Bayesian Inference for Sexual Contact Networks Using Longitudinal Survey Data
- Toward a Theory of Generalizability in LLM Mechanistic Interpretability Research
- Thrust based on changes in angular momentum
- Latent Concept Disentanglement in Transformer-based Language Models
- ASGuard: Activation-Scaling Guard to Mitigate Targeted Jailbreaking Attack
- Minimalist Explanation Generation and Circuit Discovery
- Causal Interventions Reveal Shared Structure Across English Filler-Gap Constructions
- Circuit-Aware Reward Training: A Mechanistic Framework for Longtail Robustness in RLHF
- Excitonic Energy Transfer in Red Algal Photosystem I Reveals an Evolutionary Bridge between Cyanobacteria and Plants
- Localizing Task Recognition and Task Learning in In-Context Learning via Attention Head Analysis
- From What to Why: A Multi-Agent System for Evidence-based Chemical Reaction Condition Reasoning
- Measuring Sparse Autoencoder Feature Sensitivity
- Cyclic Ablation: Testing Concept Localization against Functional Regeneration in AI
- Mechanistic Interpretability as Statistical Estimation: A Variance Analysis of EAP-IG
- Beyond Semantics: Rediscovering Spatial Awareness in Vision-Language Models
- Feature Identification via the Empirical NTK
- Commutative algebra neural network reveals genetic origins of diseases
- Interpret, prune and distill Donut: towards lightweight VLMs for VQA on document
- BioBlobs: Differentiable Graph Partitioning for Protein Representation Learning
- When Reasoning Meets Compression: Understanding the Effects of LLMs Compression on Large Reasoning Models
- Integrative modelling of biomolecular dynamics
- Interpreting Language Models Through Concept Descriptions: A Survey
- Neural Theorem Proving: Generating and Structuring Proofs for Formal Verification
- Benchmark Profiling: Mechanistic Diagnosis of LLM Benchmarks
- Understanding Addition and Subtraction in Transformers
- Mechanistic Interpretability of Code Correctness in LLMs via Sparse Autoencoders
- Learning Explicit Single-Cell Dynamics Using ODE Representations
- Evaluation Framework for Highlight Explanations of Context Utilisation in Language Models
- Towards CONUS-Wide ML-Augmented Conceptually-Interpretable Modeling of Catchment-Scale Precipitation-Storage-Runoff Dynamics
- Deep learning for flash drought forecasting and interpretation
- Mechanistic Interpretability of Socio-Political Frames in Language Models
- Grokking in LLM Pretraining? Monitor Memorization-to-Generalization without Test
- The Argument is the Explanation: Structured Argumentation for Trust in Agents
- Disentangling Recall and Reasoning in Transformer Models through Layer-wise Attention and Activation Analysis
- Decomposing Attention To Find Context-Sensitive Neurons
- SoC-DT: Standard-of-Care Aligned Digital Twins for Patient-Specific Tumor Dynamics
- Discovering Transformer Circuits via a Hybrid Attribution and Pruning Framework
- Distributional Semantics Tracing: A Framework for Explaining Hallucinations in Large Language Models
- Mechanistic-statistical inference of mosquito dynamics from mark-release-recapture data
- Refusal Falls off a Cliff: How Safety Alignment Fails in Reasoning?
- The Logical Implication Steering Method for Conditional Interventions on Transformer Generation
- Illusion or Algorithm? Investigating Memorization, Emergence, and Symbolic Processing in In-Context Learning
- Causal Abstractions, Categorically Unified
- Visual Representations inside the Language Model
- BlackboxNLP-2025 MIB Shared Task: Exploring Ensemble Strategies for Circuit Localization Methods
- Tug-of-war between idioms' figurative and literal interpretations in LLMs
- ACE: Attribution-Controlled Knowledge Editing for Multi-hop Factual Recall
- Advancing AI Research Assistants with Expert-Involved Learning
- Biology-driven assessment of deep learning super-resolution imaging of the porosity network in dentin
- Iterated Agent for Symbolic Regression
- RADAR: Mechanistic Pathways for Detecting Data Contamination in LLM Evaluation
- InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques
- Egocentric Visual Navigation through Hippocampal Sequences
- Causality ≠ Decodability, and Vice Versa: Lessons from Interpreting Counting ViTs
- Impact of Oxygen on DNA Damage Distribution in 3D Genome and Its Correlation to Oxygen Enhancement Ratio under High LET Irradiation
- Analysing Moral Bias in Finetuned LLMs through Mechanistic Interpretability
- QLENS: Towards A Quantum Perspective of Language Transformers
- A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models
- Why Is Spatial Reasoning Hard for VLMs? An Attention Mechanism Perspective on Focus Areas
- Data-Driven Topology Optimization for Multiscale Biomimetic Spinodal Design
- Physical models of embryonic epithelial healing: A review
- Map the Flow: Revealing Hidden Pathways of Information in VideoLLMs
- Multi-Scale Probabilistic Generation Theory: A Unified Information-Theoretic Framework for Hierarchical Structure in Large Language Models
- Constrained belief updates explain geometric structures in transformer representations
- CAST: Compositional Analysis via Spectral Tracking for Understanding Transformer Layer Functions
- Functional and parametric identifiability for universal differential equations applied to chemical reaction networks
- Unlocking Out-of-Distribution Generalization in Transformers via Recursive Latent Space Reasoning
- Circuit Insights: Towards Interpretability Beyond Activations
- Causal Time Series Modeling of Supraglacial Lake Evolution in Greenland under Distribution Shift
- Game-Theoretic Discovery of Quantum Error-Correcting Codes Through Nash Equilibria
- Intrinsic Self-Correction in LLMs: Towards Explainable Prompting via Mechanistic Interpretability
- DePass: Unified Feature Attributing by Simple Decomposed Forward Pass
- Extracting Rule-based Descriptions of Attention Features in Transformers
- How role-play shapes relevance judgment in zero-shot LLM rankers
- Layer Specialization Underlying Compositional Reasoning in Transformers
- Explainability of Large Language Models: Opportunities and Challenges toward Generating Trustworthy Explanations
- Fighter: Unveiling the Graph Convolutional Nature of Transformers in Time Series Modeling
- Atomic Literary Styling: Mechanistic Manipulation of Prose Generation in Neural Language Models
- Base Models Know How to Reason, Thinking Models Learn When
- I Spy With My Model's Eye: Visual Search as a Behavioural Test for MLLMs
- A Class of Markovian Self-Reinforcing Processes with Power-Law Distributions
- Prospects for Using Artificial Intelligence to Understand Intrinsic Kinetics of Heterogeneous Catalytic Reactions
- Foundation Models for Discovery and Exploration in Chemical Space
- Towards Understanding Safety Alignment: A Mechanistic Perspective from Safety Neurons
- MolBridge: Atom-Level Joint Graph Refinement for Robust Drug-Drug Interaction Event Prediction
- Stream: Scaling up Mechanistic Interpretability to Long Context in LLMs via Sparse Attention
- Some Attention is All You Need for Retrieval
- Causal Head Gating: A Framework for Interpreting Roles of Attention Heads in Transformers
- ProxySPEX: Inference-Efficient Interpretability via Sparse Feature Interactions in LLMs
- Mapping Faithful Reasoning in Language Models
- Transformer brain encoders explain human high-level visual responses
- Mechanistic Interpretability for Neural TSP Solvers
- Mechanism-Guided Residual Lifting and Control Consistent Modeling for Pneumatic Drying Processes
- Overshoot-resolved transition modeling based on field inversion and symbolic regression
- Learning Interpretable Features in Audio Latent Spaces via Sparse Autoencoders
- PAHQ: Accelerating Automated Circuit Discovery through Mixed-Precision Inference Optimization
- Interpreting and Mitigating Unwanted Uncertainty in LLMs
- Sparsity and Superposition in Mixture of Experts
- Mechanistic Interpretability of RNNs emulating Hidden Markov Models
- FaCT: Faithful Concept Traces for Explaining Neural Network Decisions
- Emergence of Minimal Circuits for Indirect Object Identification in Attention-Only Transformers
- Large Language Models Report Subjective Experience Under Self-Referential Processing
- RelP: Faithful and Efficient Circuit Discovery in Language Models via Relevance Patching
- Chain-of-Thought Hijacking
- In Defence of Post-hoc Explainability
- MedSAE: Dissecting MedCLIP Representations with Sparse Autoencoders
- BlackboxNLP-2025 MIB Shared Task: Improving Circuit Faithfulness via Better Edge Selection
- StreetMath: Study of LLMs' Approximation Behaviors
- Modelling ion channels with a view towards identifiability
- Atlas-Alignment: Making Interpretability Transferable Across Language Models
- Pregnancy as a dynamical paradox: robustness, control and birth onset
- Space as Time Through Neuron Position Learning
- TRISKELION-1: Unified Descriptive-Predictive-Generative AI
- Neural Transparency: Mechanistic Interpretability Interfaces for Anticipating Model Behaviors for Personalized AI
- Automatically Finding Rule-Based Neurons in OthelloGPT
- Causal Graph Neural Networks for Healthcare
- Interpreting Emergent Features in Deep Learning-based Side-channel Analysis
- LLM Probing with Contrastive Eigenproblems: Improving Understanding and Applicability of CCS
- Addressing divergent representations from causal interventions on neural networks
- APP: Accelerated Path Patching with Task-Specific Pruning
- SOM Directions are Better than One: Multi-Directional Refusal Suppression in Language Models
- Knowledge-Augmented Long-CoT Generation for Complex Biomolecular Reasoning
- Causal Tracing of Object Representations in Large Vision Language Models: Mechanistic Interpretability and Hallucination Mitigation
- The Trilemma of Truth in Large Language Models
- SCALAR: Benchmarking SAE Interaction Sparsity in Toy LLMs
- Automated Circuit Interpretation via Probe Prompting
- Rank-1 LoRAs Encode Interpretable Reasoning Signals
- Learning Biomolecular Motion: The Physics-Informed Machine Learning Paradigm
- Visual Exploration of Feature Relationships in Sparse Autoencoders with Curated Concepts
- Fractional neural attention for efficient multiscale sequence processing
- Group Equivariance Meets Mechanistic Interpretability: Equivariant Sparse Autoencoders
- Decomposition of Small Transformer Models
- Weight-sparse transformers have interpretable circuits
- An Automated Framework for Analyzing Structural Evolution in On-the-fly Non-adiabatic Molecular Dynamics Using Autoencoder and Multiple Molecular Descriptors
- Bridging the genotype-phenotype gap with generative artificial intelligence
- From Black-Box to White-Box: Control-Theoretic Neural Network Interpretability
- Explainable deep learning framework for cancer therapeutic target prioritization leveraging PPI centrality and node embeddings
- Comment on "Repair of DNA Double-Strand Breaks Leaves Heritable Impairment to Genome Function"
- Which Sparse Autoencoder Features Are Real? Model-X Knockoffs for False Discovery Rate Control
- Differentiable Electrochemistry: A paradigm for uncovering hidden physical phenomena in electrochemical systems
- nnterp: A Standardized Interface for Mechanistic Interpretability of Transformers
- Judicial Sentencing Prediction Based on Hybrid Models and Two-Stage Learning Algorithms
- Chromatographic Peak Shape from Stochastic Model: Analytic Time-Domain Expression in Terms of Physical Parameters and Conditions under which Heterogeneity Reduces Tailing
- Beyond Protein Language Models: An Agentic LLM Framework for Mechanistic Enzyme Design
- Unboxing the Black Box: Mechanistic Interpretability for Algorithmic Understanding of Neural Networks
- Interpreting GFlowNets for Drug Discovery: Extracting Actionable Insights for Medicinal Chemistry
- Pathway to Relevance: How Cross-Encoders Implement a Semantic Variant of BM25
- Findings of the BlackboxNLP 2025 Shared Task: Localizing Circuits and Causal Variables in Language Models
- Vector Arithmetic in Concept and Token Subspaces
- The Horcrux: Mechanistically Interpretable Task Decomposition for Detecting and Mitigating Reward Hacking in Embodied AI Systems
- Understanding Counting Mechanisms in Large Language and Vision-Language Models
- BlockCert: Certified Blockwise Extraction of Transformer Mechanisms
- Beyond Components: Singular Vector-Based Interpretability of Transformer Circuits
- Approximate Bayesian Computation Made Easy: A Practical Guide to ABC-SMC for Dynamical Systems with `pymc`
- Mechanistic Interpretability for Transformer-based Time Series Classification
- Physics Steering: Causal Control of Cross-Domain Concepts in a Physics Foundation Model
- Interpretability for Time Series Transformers using A Concept Bottleneck Framework
- Mechanistic Finetuning of Vision-Language-Action Models via Few-Shot Demonstrations
- A race to belief: How Evidence Accumulation shapes trust in AI and Human informants
- FoldSAE: Learning to Steer Protein Folding Through Sparse Representations
- Unsupervised decoding of encoded reasoning using language model interpretability
- TrendGNN: Towards Understanding of Epidemics, Beliefs, and Behaviors
- VCWorld: A Biological World Model for Virtual Cell Simulation
- EXP-CAM: Explanation Generation and Circuit Discovery Using Classifier Activation Matching
- HyperADRs: A Hierarchical Hypergraph Framework for Drug-Gene-ADR Prediction
- ProteinPNet: Prototypical Part Networks for Concept Learning in Spatial Proteomics
- Translating Measures onto Mechanisms: The Cognitive Relevance of Higher-Order Information
- Mechanisms of Symbol Processing for In-Context Learning in Transformer Networks
- Beyond Top Activations: Efficient and Reliable Crowdsourced Evaluation of Automated Interpretability
- Approximate Bayesian Inference on Mechanisms of Network Growth and Evolution
- AtomDisc: An Atom-level Tokenizer that Boosts Molecular LLMs and Reveals Structure-Property Associations
- Path Channels and Plan Extension Kernels: a Mechanistic Description of Planning in a Sokoban RNN
- Neural Policy Composition from Free Energy Minimization
- Towards Ethical Multi-Agent Systems of Large Language Models: A Mechanistic Interpretability Perspective
- Minuet: A Diffusion Autoencoder for Compact Semantic Compression of Multi-Band Galaxy Images
- Sparse Attention Post-Training for Mechanistic Interpretability
- Mechanistic Interpretability of Antibody Language Models Using SAEs
- Rethinking Sparse Autoencoders: Select-and-Project for Fairness and Control from Encoder Features Alone
- On the Theoretical Foundation of Sparse Dictionary Learning in Mechanistic Interpretability
- A network-driven framework for enhancing gene-disease association studies in coronary artery disease
- XMCQDPT2-Fidelity Transfer-Learning Potentials and a Wavepacket Oscillation Model with Power-Law Decay for Ultrafast Photodynamics
- ExPUFFIN: Thermodynamic Consistent Viscosity Prediction in an Extended Path-Unifying Feed-Forward Interfaced Network
- Mechanistic Interpretability of GPT-2: Lexical and Contextual Layers in Sentiment Analysis
- Same Content, Different Answers: Cross-Modal Inconsistency in MLLMs
- A microstructural rheological model for transient creep in polycrystalline ice
- Toward Faithful Retrieval-Augmented Generation with Sparse Autoencoders
- Bayesian Co-Navigation of a Computational Physical Model and AFM Experiment to Autonomously Survey a Combinatorial Materials Library
- Detecting Hallucinations in Graph Retrieval-Augmented Generation via Attention Patterns and Semantic Alignment
- SDialog: A Python Toolkit for End-to-End Agent Building, User Simulation, Dialog Generation, and Evaluation
- Interpretable and Steerable Concept Bottleneck Sparse Autoencoders
- Unlocking the Address Book: Dissecting the Sparse Semantic Structure of LLM Key-Value Caches via Sparse Autoencoders
- Physics-informed neural network for fatigue life prediction of irradiated austenitic and ferritic/martensitic steels
- ConceptGuard: Neuro-Symbolic Safety Guardrails via Sparse Interpretable Jailbreak Concepts
- Learning continuous SOC-dependent thermal decomposition kinetics for Li-ion cathodes using KA-CRNNs
- Who is In Charge? Dissecting Role Conflicts in Instruction Following
- SALVE: Sparse Autoencoder-Latent Vector Editing for Mechanistic Control of Neural Networks
- AI Epidemiology: achieving explainable AI through expert oversight patterns
- Machine Learning Framework for Thrombosis Risk Prediction in Rotary Blood Pumps
- RP-CATE: Recurrent Perceptron-based Channel Attention Transformer Encoder for Industrial Hybrid Modeling
- R-GenIMA: Integrating Neuroimaging and Genetics with Interpretable Multimodal AI for Alzheimer's Disease Progression
- Faithful and Stable Neuron Explanations for Trustworthy Mechanistic Interpretability
- Block-Recurrent Dynamics in Vision Transformers
- Sign-Aware Multistate Jaccard Kernels and Geometry for Real and Complex-Valued Signals
- The Deepfake Detective: Interpreting Neural Forensics Through Sparse Features and Manifolds
- Omni-Weather: Unified Multimodal Foundation Model for Weather Generation and Understanding
- Information is localized in growing network models
- Mono- and Polyauxic Growth Kinetics: A Semi-Mechanistic Framework for Complex Biological Dynamics
- Interpretable Safety Alignment via SAE-Constructed Low-Rank Subspace Adaptation
- Mechanistic Analysis of Circuit Preservation in Federated Learning
- A Paradigm Shift in Human Neuroscience Research: Progress, Prospects, and a Proof of Concept for Population Neuroscience
- EvoXplain: When Machine Learning Models Agree on Predictions but Disagree on Why -- Measuring Mechanistic Multiplicity Across Training Runs
- Triangulation as an Acceptance Rule for Multilingual Mechanistic Interpretability
- BIOME-Bench: A Benchmark for Biomolecular Interaction Inference and Multi-Omics Pathway Mechanism Elucidation from Scientific Literature
- Connecting strain rate dependence of fcc metals to dislocation avalanche signatures
- Towards mechanistic understanding in a data-driven weather model: internal activations reveal interpretable physical features
- "X-ray Coulomb Counting" to understand electrochemical systems
- Bridging Visual Intuition and Chemical Expertise: An Autonomous Analysis Framework for Nonadiabatic Dynamics Simulations via Mentor-Engineer-Student Collaboration
- Adversarial Lens: Exploiting Attention Layers to Generate Adversarial Examples for Evaluation
- Emergent World Beliefs: Exploring Transformers in Stochastic Games
- How much neuroscience does a neuroscientist need to know?
- Trustworthy Data-Driven Wildfire Risk Prediction and Understanding in Western Canada
- Central Dogma Transformer: Towards Mechanism-Oriented AI for Cellular Understanding
- Discovering and Causally Validating Emotion-Sensitive Neurons in Large Audio-Language Models
- When the Coffee Feature Activates on Coffins: An Analysis of Feature Extraction and Steering for Mechanistic Interpretability
- Mechanistic Interpretability of Large-Scale Counting in LLMs through a System-2 Strategy
- Mechanistic Knobs in LLMs: Retrieving and Steering High-Order Semantic Features via Sparse Autoencoders
- Bridging Mechanistic Interpretability and Prompt Engineering with Gradient Ascent for Interpretable Persona Control
- A Pre-trained Reaction Embedding Descriptor Capturing Bond Transformation Patterns
- Local Multimodal Dynamics in Mixed Ionic-Electronic Conductors and Their Fingerprints in Organic Electrochemical Transistor Operation
- When Models Manipulate Manifolds: The Geometry of a Counting Task
- Interpreting Transformers Through Attention Head Intervention
- Analytical review of nanoplastic bioaccumulation data and a unified toxicokinetic model: from teleosts to human brain
- Molecular signatures of pressure-induced phase transitions in a lipid bilayer
- A Backpropagation-Free Feedback-Hebbian Network for Continual Learning Dynamics
- AlignSAE: Concept-Aligned Sparse Autoencoders
- Physics-constrained Gaussian Processes for Predicting Shockwave Hugoniot Curves
- Time Travel Engine: A Shared Latent Chronological Manifold Enables Historical Navigation in Large Language Models
- Circuit Mechanisms for Spatial Relation Generation in Diffusion Transformers
- LLM-Powered Social Digital Twins: A Framework for Simulating Population Behavioral Response to Policy Interventions
- Vocabulary Expansion of Large Language Models via Kullback-Leibler-Based Self-Distillation
- Toward Understanding Unlearning Difficulty: A Mechanistic Perspective and Circuit-Guided Difficulty Metric
- Where Knowledge Collides: A Mechanistic Study of Intra-Memory Knowledge Conflict in Language Models
- Diagnosing Generalization Failures in Fine-Tuned LLMs: A Cross-Architectural Study on Phishing Detection
- Dedifferentiation stabilizes stem cell lineages: From CTMC to diffusion theory and thresholds
- An Epidemiological Modeling Take on Religion Dynamics
- From Adversarial Poetry to Adversarial Tales: An Interpretability Research Agenda
- Metabolomic Biomarker Discovery for ADHD Diagnosis Using Interpretable Machine Learning
- Mechanistic Learning for Survival Prediction in NSCLC Using Routine Blood Biomarkers and Tumor Kinetics
- From Interpretability to Performance: Optimizing Retrieval Heads for Long-Context Language Models
- Reasoning Models Generate Societies of Thought
- BACH-V: Bridging Abstract and Concrete Human-Values in Large Language Models
- Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models
- Patterning: The Dual of Interpretability
- Hierarchical Sparse Circuit Extraction from Billion-Parameter Language Models through Scalable Attribution Graph Decomposition
- The Physics of the Dancing *Deity*: Coupled Oscillators in Himalayan Processions
- Race, Ethnicity and Their Implication on Bias in Large Language Models
- A Machine Learning-Based Surrogate EKMA Framework for Diagnosing Urban Ozone Formation Regimes: Evidence from Los Angeles
- Long-term prediction of ENSO with physics-guided Deep Echo State Networks
- Persistent Sheaf Laplacian Analysis of Protein Stability and Solubility Changes upon Mutation
- Single-Node Wilson-Cowan Model Accounts for Speech-Evoked γ-Band Deficits in Schizophrenia
- DiSPA: Differential Substructure-Pathway Attention for Drug Response Prediction
- White-Box mHC: Electromagnetic Spectrum-Aware and Interpretable Stream Interactions for Hyperspectral Image Classification
- Emergence and Evolution of Interpretable Concepts in Diffusion Models
- Latent Causal Diffusions for Single-Cell Perturbation Modeling
- A Monosemantic Attribution Framework for Stable Interpretability in Clinical Neuroscience Large Language Models
- Interpretability of the Intent Detection Problem: A New Approach
We welcome contributions to this repository! If you have a resource that you believe should be included, please submit a pull request or open an issue. Contributions can include:
- New libraries or tools related to mechanistic interpretability
- Tutorials or guides that help users understand and implement mechanistic interpretability techniques
- Research papers that advance the field of mechanistic interpretability
- Any other resources that you find valuable for the community
To contribute:
- Fork the repository.
- Create a new branch for your changes.
- Make your changes and commit them with a clear message.
- Push your changes to your forked repository.
- Submit a pull request to the main repository.
Before contributing, take a look at the existing resources to avoid duplicates.
This repository is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0). You are free to share and adapt the material, provided you give appropriate credit, link to the license, and indicate if changes were made.