This repository provides the official implementation of RepreGuard, a robust and efficient framework for detecting LLM-generated text (LGT) by leveraging hidden representation patterns, achieving state-of-the-art performance in both in-domain and out-of-domain settings.
- [2025.09.01] 🔥 Our work will be presented at EMNLP 2025. Looking forward to insightful discussions at the conference! See you in Suzhou, China!
- [2025.08.01] 🎉 Our paper is accepted by Transactions of the Association for Computational Linguistics (TACL 2025)!
We observed that the hidden representation patterns of HWT and LGT differ markedly; these patterns serve as higher-dimensional features and strong signals for detecting LGT.
RepreGuard is based on the hypothesis that the internal representation patterns of LLMs, when processing human-written texts (HWT) and LLM-generated texts (LGT), are distinct and can be systematically modeled. By employing a surrogate model as an observer, we extract neural activation patterns and identify discriminative features. The resulting RepreScore enables robust classification between HWT and LGT with minimal training data.
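To illustrate the core idea, the sketch below estimates a discriminative direction from pooled activations and scores texts by projecting onto it. This is a minimal toy example, not the paper's actual pipeline: the arrays are synthetic stand-ins for surrogate-model hidden states, and a simple difference-of-means direction stands in for the richer pattern modeling described in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for pooled hidden states from a surrogate LLM:
# each row is one text's activation vector (dim 16 here for brevity).
hwt_reps = rng.normal(loc=0.0, scale=1.0, size=(200, 16))  # human-written
lgt_reps = rng.normal(loc=0.5, scale=1.0, size=(200, 16))  # LLM-generated

# Difference-of-means direction: a simple proxy for the discriminative
# activation pattern between the two text types.
direction = lgt_reps.mean(axis=0) - hwt_reps.mean(axis=0)
direction /= np.linalg.norm(direction)

def repre_score(rep: np.ndarray) -> float:
    """Project a representation onto the LGT-vs-HWT direction."""
    return float(rep @ direction)

# LGT texts should, on average, score higher than HWT texts.
mean_lgt = np.mean([repre_score(r) for r in lgt_reps])
mean_hwt = np.mean([repre_score(r) for r in hwt_reps])
```

With any reasonable separation between the two distributions, the projected scores cluster by class, which is what makes a single calibrated threshold sufficient for classification.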
- Zero-shot detection: Only a small sample of LGT/HWT pairs is needed for threshold calibration.
- Strong OOD robustness: Outperforms all previous methods across different models, domains, text sizes, and attacks.
- Resource-efficient: Competitive performance even with smaller surrogate models.
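The few-shot threshold calibration can be pictured with a toy sketch: given RepreScores for a small labeled sample, pick a single decision threshold. The scores below are synthetic, and the midpoint-of-means rule is a hypothetical illustration; the released code may calibrate differently.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic RepreScores for a small calibration set (16 pairs, as in the
# few-shot setting); higher scores are assumed to indicate LGT.
hwt_scores = rng.normal(loc=-1.0, scale=0.5, size=16)
lgt_scores = rng.normal(loc=+1.0, scale=0.5, size=16)

# Calibrate one threshold as the midpoint of the class means; anything
# scoring above it is classified as LLM-generated.
threshold = (hwt_scores.mean() + lgt_scores.mean()) / 2.0

def is_llm_generated(score: float) -> bool:
    return score > threshold

# Accuracy on the calibration sample itself.
accuracy = (
    sum(is_llm_generated(s) for s in lgt_scores)
    + sum(not is_llm_generated(s) for s in hwt_scores)
) / 32.0
```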
We use the DetectRL dataset (XSum, Writing Prompts, Yelp Review, and ArXiv abstracts), which covers diverse domains and provides HWT/LGT pairs generated by ChatGPT, Claude, Google-PaLM, and Llama-2-70b, as well as the RAID dataset (llama-chat, mistral-chat, mpt-chat, mistral, mpt, and gpt2), generated with both greedy and sampling decoding strategies, with and without repetition penalties. Details: https://drive.google.com/file/d/1mdTBSRLTo7ob-ofithewt8eDTEzZpb7V/view
```bash
wget https://drive.google.com/uc?id=1mdTBSRLTo7ob-ofithewt8eDTEzZpb7V
unzip datasets.zip
```

```bash
conda create -n repre_guard python==3.10
conda activate repre_guard
pip install -r requirements.txt
```

```bash
python3 vis/repre_vis.py \
    --model_path meta-llama/Llama-2-7b-hf \
    --data_path datasets/detectrl_dataset/main_dataset/detectrl_train_dataset_llm_type_mix_llms_interleaved.json \
    --sample_size 1000 \
    --truncate_len 210 \
    --batch_size 1 \
    --save_plot_name Neural_Activity.pdf
```

```bash
python3 repreGuard_evaluation.py \
    --model_name_or_path meta-llama/Llama-3.1-8B \
    --train_data_path datasets/detectrl_dataset/main_dataset/detectrl_train_dataset_llm_type_ChatGPT.json \
    --test_data_paths datasets/detectrl_dataset/main_dataset/detectrl_test_dataset_llm_type_ChatGPT.json,datasets/detectrl_dataset/main_dataset/detectrl_test_dataset_llm_type_Google-PaLM.json,datasets/detectrl_dataset/main_dataset/detectrl_test_dataset_llm_type_Claude-instant.json,datasets/detectrl_dataset/main_dataset/detectrl_test_dataset_llm_type_Llama-2-70b.json \
    --ntrain 512 \
    --batch_size 8 \
    --rep_token 0.1 \
    --bootstrap_iter -1
```

You can specify the LLM surrogate model (e.g., Llama-3-8B, Phi-2, Gemma-2B-Instruct) via the `--model_name_or_path` argument.
| Detector | ID | OOD | 16-shots | Text w/ Attack | Text w/ Various Size | Text w/ Various Sampling Methods |
|---|---|---|---|---|---|---|
| RoBERTa | 90.85 | 84.26 | 83.60 | 72.97 | 46.81 | 82.17 |
| Binoculars | 88.18 | 86.19 | 88.07 | 78.15 | 81.70 | 85.63 |
| RepreGuard | 96.34 | 93.49 | 90.21 | 92.61 | 84.61 | 92.05 |
- Attack: paraphrase & perturbation attacks.
- Various Size: 64–256 token sizes.
- Various Sampling Methods: w/ and w/o decoding strategy and repetition penalty.
Please see the paper for more detailed results.
If you find our paper/code useful, please cite us and give us a ⭐!
```bibtex
@article{chen2025repreguard,
  author  = {Xin Chen and Junchao Wu and Shu Yang and Runzhe Zhan and Zeyu Wu and Ziyang Luo and Di Wang and Min Yang and Lidia S. Chao and Derek F. Wong},
  title   = {RepreGuard: Detecting LLM-Generated Text by Revealing Hidden Representation Patterns},
  journal = {Transactions of the Association for Computational Linguistics},
  year    = {2025},
  url     = {https://arxiv.org/abs/2508.13152},
  note    = {Accepted at TACL 2025}
}
```

