Beyond External Monitors: Enhancing Transparency of Large Language Models for Easier Monitoring

Paper: https://arxiv.org/abs/2502.05242

🛠️ Usage

  • Installation
conda create -n seer python=3.10
conda activate seer
pip install -r requirements.txt
  • Before running, change the /path/to/model and /path/to/SEER placeholders in the scripts to the actual paths. You also need to set up the wandb config in train.py or simply disable it (a sketch for disabling it follows the commands below).
  • Run the Scripts
# experiments to verify the effectiveness of SEER
sh domr.sh
# experiments to detoxify LLMs
sh safety_seer_both_wokl.sh
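
If you only want to turn off Weights & Biases logging, one minimal option is sketched below. It assumes train.py initializes wandb with a standard wandb.init call; the project name is a placeholder, not the repository's actual config.

import os
import wandb

# Option 1: disable wandb globally via the environment (no code changes needed).
os.environ["WANDB_MODE"] = "disabled"

# Option 2: pass mode="disabled" to the init call inside train.py.
# "tellme-seer" is a placeholder project name.
wandb.init(project="tellme-seer", mode="disabled")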

💡 Motivation

As shown in the figure, existing methods introduce additional "black-box" modules to explain "black-box" LLMs, which increases the potential uncertainty.

[figure]
  • additional "black-box" modules -> self-explaining, without external modules or post-processing.

  • "black-box" LLMs -> enhancing the explainability of LLMs' representations.

In a trustworthiness-related scenario, the ideal situation is that representations of similar concepts (e.g., those related to "violence") fall into the same region, while representations of different concepts (e.g., "honesty," "bias," and "violence") stay far away from each other. In this way, we can easily tell whether the inference logic of LLMs involves dangerous concepts, which may also inspire potential interventions. Therefore, we can improve LLMs' self-explainability by disentangling the representations of different concepts.
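
As a concrete illustration of what such representation-level monitoring could look like, the sketch below mean-pools hidden states and compares a new input against per-concept centroids. The "gpt2" checkpoint and the toy concept sentences are placeholders for illustration only; in practice you would use the SEER-enhanced model.

import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

# "gpt2" is a stand-in checkpoint; substitute your SEER-enhanced model.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_hidden_states=True).eval()

def sentence_representation(text, layer=-1):
    # Mean-pool one layer's hidden states into a single vector.
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).hidden_states[layer]  # (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0)

# Toy concept examples; each concept's centroid marks its region in representation space.
concepts = {
    "violence": ["He threatened to hurt them.", "They attacked the crowd."],
    "honesty": ["She told the truth about the mistake.", "He answered honestly."],
}
centroids = {
    name: torch.stack([sentence_representation(t) for t in texts]).mean(dim=0)
    for name, texts in concepts.items()
}

# Monitoring: check which concept region a new input's representation falls into.
query = sentence_representation("He punched the stranger without warning.")
for name, centroid in centroids.items():
    sim = F.cosine_similarity(query, centroid, dim=0)
    print(f"{name}: cosine similarity = {sim.item():.3f}")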

📖 Method

[figure]
  • Disentanglement of representations between concepts. SEER maximizes the similarity between representations of the same concept and minimizes the similarity between representations of different concepts (InfoNCE loss; see the sketch after this list).

  • Maintenance of LLMs' general performance. SEER constrains the $l_2$ distance between representations and the KL divergence between output probabilities before and after the disentanglement to maintain LLMs' general capabilities.
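
A minimal sketch of how such an objective could be assembled is given below. The temperature, the weights alpha and beta, and the tensors reps_ref / logits_ref (taken from the frozen model before disentanglement) are illustrative assumptions, not the repository's actual implementation or hyperparameters.

import torch
import torch.nn.functional as F

def info_nce(reps, labels, temperature=0.1):
    # reps: (batch, dim) pooled representations; labels: (batch,) concept ids.
    # Pulls together representations of the same concept, pushes apart different concepts.
    reps = F.normalize(reps, dim=-1)
    sim = reps @ reps.T / temperature                        # pairwise similarities
    self_mask = torch.eye(len(reps), dtype=torch.bool, device=reps.device)
    sim = sim.masked_fill(self_mask, float("-inf"))          # exclude self-pairs
    pos = (labels[:, None] == labels[None, :]) & ~self_mask  # same-concept pairs
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_log_prob = log_prob.masked_fill(~pos, 0.0)           # keep only positive pairs
    per_anchor = -pos_log_prob.sum(dim=1) / pos.sum(dim=1).clamp(min=1)
    return per_anchor.mean()

def seer_style_loss(reps, labels, reps_ref, logits, logits_ref, alpha=1.0, beta=1.0):
    # reps_ref / logits_ref come from the frozen model before disentanglement.
    disentangle = info_nce(reps, labels)
    l2_keep = (reps - reps_ref).pow(2).sum(dim=-1).mean()    # l2 constraint on representations
    kl_keep = F.kl_div(F.log_softmax(logits, dim=-1),        # KL constraint on output probabilities
                       F.softmax(logits_ref, dim=-1),
                       reduction="batchmean")
    return disentangle + alpha * l2_keep + beta * kl_keep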

📃 Results

  • Effectiveness of SEER in disentangling representations between concepts.
[figure]
  • Consistent improvement in explainability and safety performance.
[figure]

Acknowledgement

Part of the code is borrowed from Circuit-Breakers.

Citation

If you find our paper and tool interesting and useful, please feel free to give us a star and cite us via:

@misc{chen2025tellme,
      title={Beyond External Monitors: Enhancing Transparency of Large Language Models for Easier Monitoring}, 
      author={Guanxu Chen and Dongrui Liu and Tao Luo and Lijie Hu and Jing Shao},
      year={2025},
      eprint={2502.05242},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.05242}, 
}
