
TSM-Bench: Detecting LLM-Generated Text in Real-World Wikipedia Editing Practices

This repository contains the official implementation of TSM-Bench: Detecting LLM-Generated Text in Real-World Wikipedia Editing Practices.

🚧 Note: To ensure anonymity, we will release the data on Hugging Face after the review process.


Overview of our TSM-Bench

Automatically detecting machine-generated text (MGT) is critical to maintaining the knowledge integrity of user-generated content (UGC) platforms such as Wikipedia. Existing detection benchmarks primarily focus on generic text generation tasks (e.g., “Write an article about machine learning.”). However, editors frequently employ LLMs for specific writing tasks (e.g., summarisation). These task-specific MGT instances tend to resemble human-written text more closely due to their constrained task formulation and contextual conditioning. In this work, we show that a range of MGT detectors struggle to identify task-specific MGT reflecting real-world editing on Wikipedia. We introduce TSM-BENCH, a multilingual, multi-generator, and multi-task benchmark for evaluating MGT detectors on common, real-world Wikipedia editing tasks. Our findings demonstrate that (i) average detection accuracy drops by 10–40% compared to prior benchmarks, and (ii) a generalisation asymmetry exists: fine-tuning on task-specific data enables generalisation to generic data (even across domains) but not vice versa. We demonstrate that models fine-tuned exclusively on generic MGT overfit to superficial artefacts of machine generation. Our results suggest that, in contrast to prior benchmarks, most detectors remain unreliable for automated detection in real-world contexts such as UGC platforms. TSM-BENCH therefore provides a crucial foundation for developing and evaluating future models.


Requirements

1. Clone the Repository

git clone tbd
cd TSM-Bench

2. Download Pretrained Models

We host the pre-trained models for Experiment 4 (generalisation) on Google Drive. Alternatively, you can train the models yourself using the script generalise/code/train_hp_g.sh.
The following script downloads and unzips the models into generalise/code/hp_len.
The download is ~18 GB; the unzipped size is ~25 GB.

bash download_models.sh

Note: This script requires gdown. Install it via:

pip install gdown
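For reference, a minimal sketch of what such a download step looks like with gdown; the Drive file ID below is a placeholder, not the real one, so use download_models.sh for the actual download:

# Hypothetical sketch: FILE_ID is a placeholder, not the real Drive ID
FILE_ID="<GOOGLE_DRIVE_FILE_ID>"
gdown "https://drive.google.com/uc?id=${FILE_ID}" -O models.zip
unzip -q models.zip -d generalise/code/hp_len
rm models.zip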

3. Set Up the Python Environment

We recommend using a virtual environment to manage dependencies.

python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
export HF_HOME="" # optional: set to control where Hugging Face caches models

Main Results

Note:

  • Experiments 1-2 were run on either a single NVIDIA A100 80GB or two NVIDIA A100 40GB GPUs.
  • Experiments 3-5 were run on a single NVIDIA A100 40GB.
  • We strongly recommend using GPUs to replicate results.
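Before launching the experiments, a quick sanity check that PyTorch can see a GPU (assuming requirements.txt installs PyTorch):

python -c "import torch; print(torch.cuda.is_available())"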

Experiment 1: Off-the-shelf detectors

Figure 3: SOTA Off-the-Shelf Detector Performance

Run the following script to reproduce the results:

bash run_ots.sh

Experiment 2: Supervised and zero-shot detectors

Table 1: Within-task Detection (ACC = accuracy, F1 = F1-score)

To run black-box detectors, provide your OpenAI API key. If you skip this, only local models will be evaluated.

Note:
Zero-shot evaluations may take up to 1.5 days, so we recommend splitting the scripts across HPC jobs (see the sketch below).
Supervised detectors run much faster. To run only those:

bash detect_train_hp.sh

To run all:

export OPENAI_API_KEY=sk-...
bash run_detection.sh
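If you split the zero-shot runs across HPC jobs as recommended above, a minimal SLURM wrapper might look like the following; the job name, resource requests, and the script to submit are placeholders to adapt to your cluster and to whichever detector scripts you want to run separately:

#!/usr/bin/env bash
#SBATCH --job-name=tsm-zeroshot   # hypothetical job name
#SBATCH --gres=gpu:1              # one GPU per job
#SBATCH --time=36:00:00           # generous limit; zero-shot runs are slow
bash run_detection.sh             # or one of the individual detector scripts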

Experiment 3: Out-of-domain generalisation

Out-of-domain (OOD) generalisation with mDeBERTa and GPT-4o across domains.

Run the following script to reproduce the results:

bash run_generalisation.sh

This will populate generalise/data/detect with files named trainFile_2_testFile_model_language.jsonl.
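To sanity-check the outputs afterwards (each .jsonl file holds one JSON record per line; the glob assumes the naming convention above):

ls -lh generalise/data/detect/*_2_*.jsonl
wc -l generalise/data/detect/*_2_*.jsonl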

Experiment 4: Feature analysis

SHAP Values for mDeBERTa trained on task-specific vs generic data

To generate the SHAP plot, run:

bash run_shap_vals.sh

Experiment 5: Cross-task generalisation

Cross-task generalisation with mDeBERTa and GPT-4o across domains.

The script for this experiment is tbd and will be released shortly. It will populate generalise/data/detect with files named trainFile_2_testFile_model_language.jsonl.

Other Results

Linguistic Analysis

Comparison of linguistic features.

Run the commands in linguistic_analysis/la.sh on a machine with a GPU.

Prompt Selection

Prompt Evaluation (values in parentheses show the improvement in percentage points over baseline prompts)

You can run this without QAFactEval if it causes issues.

To replicate our prompt selection evaluation:

1. Create a Conda Environment for QAFactEval

We recommend using Conda:

conda env create -f environment_qafe.yml
pip install -r requirements_qafe.txt

Then clone and install QAFactEval into the current directory, following the setup instructions at https://github.com/salesforce/QAFactEval. Don't forget to set model_folder in scorers/qafe.py.
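A sketch of that step, assuming the default clone location and a hypothetical local model path:

git clone https://github.com/salesforce/QAFactEval.git
# then, in scorers/qafe.py, point model_folder at your downloaded QAFactEval models:
# model_folder = "/path/to/qafacteval_models"   # hypothetical path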

2. Download Style Classifiers

Same procedure as above. Ensure gdown is installed.

bash download_sc.sh

3. Run the Evaluation

This example runs the evaluation for Vietnamese. Adjust the language as needed.

bash run_prompt_eval.sh

Contributing

Valuable contributions include:

  • Implementing robust data cleaning with mwparserfromhtml
  • Extending to more languages, adding generators, and expanding task coverage

Citation

tbd
