This repository contains the official implementation of TSM-Bench: Detecting LLM-Generated Text in Real-World Wikipedia Editing Practices.
🚧 Note: To ensure anonymity, we will release the data on Hugging Face after the review process.
Automatically detecting machine-generated text (MGT) is critical to maintaining the knowledge integrity of user-generated content (UGC) platforms such as Wikipedia. Existing detection benchmarks primarily focus on generic text generation tasks (e.g., “Write an article about machine learning.”). However, editors frequently employ LLMs for specific writing tasks (e.g., summarisation). These task-specific MGT instances tend to resemble human-written text more closely due to their constrained task formulation and contextual conditioning. In this work, we show that a range of MGT detectors struggle to identify task-specific MGT reflecting real-world editing on Wikipedia. We introduce TSM-BENCH, a multilingual, multi-generator, and multi-task benchmark for evaluating MGT detectors on common, real-world Wikipedia editing tasks. Our findings demonstrate that (i) average detection accuracy drops by 10–40% compared to prior benchmarks, and (ii) a generalisation asymmetry exists: fine-tuning on task-specific data enables generalisation to generic data—even across domains—but not vice versa. We demonstrate that models fine-tuned exclusively on generic MGT overfit to superficial artefacts of machine generation. Our results suggest that, in contrast to prior benchmarks, most detectors remain unreliable for automated detection in real-world contexts such as UGC platforms. TSM-BENCH therefore provides a crucial foundation for developing and evaluating future models.
```bash
git clone tbd
cd TSM-Bench
```

We host the pre-trained models for Experiment 4 (generalisation) on Google Drive. You can also train the models yourself using the script `generalise/code/train_hp_g.sh`.
The following script downloads and unzips the models into `generalise/code/hp_len`.
The download is ~18GB; the unzipped size is ~25GB.
```bash
bash download_models.sh
```

Note: This script requires `gdown`. Install it via:

```bash
pip install gdown
```

We recommend using a virtual environment to manage dependencies.
```bash
python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
```
```bash
export HF_HOME=""  # optional: set to manage the Hugging Face cache location
```

Note:
- Experiments 1-2 were run on either a single NVIDIA A100 80GB or two NVIDIA A100 40GB GPUs.
- Experiments 3-5 were run on a single NVIDIA A100 40GB.
- We strongly recommend using GPUs to replicate results.
Run the following script to reproduce the results:

```bash
bash run_ots.sh
```

To run black-box detectors, provide your OpenAI API key. If you skip this, only local models will be evaluated.
Note:
Zero-shot evaluations may take up to 1.5 days. We recommend splitting scripts across HPC jobs.
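One way to split the work is to submit each evaluation script as its own cluster job. The sketch below is illustrative, not part of the repository: the script names are invented, and `echo` stands in for your scheduler's submit command (e.g. `sbatch` on SLURM).

```shell
# Hypothetical: submit each zero-shot evaluation as a separate HPC job.
# The script names below are placeholders, not files from this repository.
for script in detect_zero_shot_en.sh detect_zero_shot_de.sh; do
  echo "would submit: sbatch $script"   # replace echo with sbatch on your cluster
done
```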
Supervised detectors run much faster. To run only those:

```bash
bash detect_train_hp.sh
```

To run all:

```bash
export OPENAI_API_KEY=sk-...
bash run_detection.sh
```

This will populate `generalise/data/detect` with files named `trainFile_2_testFile_model_language.jsonl`.
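If you want to group results by train set, test set, generator model, and language, the filenames can be split programmatically. This is a minimal sketch assuming the `_2_` separator divides the train and test descriptors and the last two underscore-separated fields are model and language; the example filename is invented for illustration.

```python
from pathlib import Path

def parse_result_name(path):
    """Split a result filename of the form
    trainFile_2_testFile_model_language.jsonl into its parts.
    Assumes '_2_' separates train from test, and that the final two
    underscore-separated fields are the model and the language."""
    stem = Path(path).stem                       # drop the .jsonl suffix
    train, rest = stem.split("_2_", 1)           # left of '_2_' is the train set
    test, model, language = rest.rsplit("_", 2)  # peel off model and language
    return {"train": train, "test": test, "model": model, "language": language}

# Hypothetical filename, following the pattern above:
parts = parse_result_name("wiki_2_generic_gpt4_en.jsonl")
# parts == {"train": "wiki", "test": "generic", "model": "gpt4", "language": "en"}
```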
```bash
bash run_generalisation.sh
```

To generate the SHAP plot, run:

```bash
bash run_shap_vals.sh
```
tbd shortly!

Run the files in `linguistic_analysis/la.sh` with a GPU.
You can run this without QAFactEval if it causes issues.
To replicate our prompt selection evaluation:
We recommend using Conda:

```bash
conda env create -f environment_qafe.yml
pip install -r requirements_qafe.txt
```

Clone and install QAFactEval into the current directory. Follow the setup instructions at https://github.com/salesforce/QAFactEval. Don't forget to add the `model_folder` in `scorers/qafe.py`.
Same procedure as above; ensure `gdown` is installed.
```bash
bash download_sc.sh
```

This example runs the evaluation for Vietnamese. Adjust the language as needed.

```bash
bash run_prompt_eval.sh
```

Valuable contributions include:
- Implementing robust data cleaning with `mwparserfromhtml`
- Extending to more languages, adding generators, and expanding task coverage
tbd







