A web-based interactive demo for the GuessArena evaluation framework
Note
GuessArena Demo is a lightweight web application that simulates a card-guessing game with both player interaction and AI-versus-AI simulation.
It provides an intuitive, hands-on interface to explore the evaluation methodology introduced in our paper, “GuessArena: Guess Who I Am? A Self-Adaptive Framework for Evaluating LLMs in Domain-Specific Knowledge and Reasoning”.
- Player Mode – Interactively play the guessing game by asking yes/no questions to an AI judge
- AI Simulation Mode – Observe two LLMs engaging in a self-play guessing process
- Leaderboard Tracking – Compare model performance across different domains and settings
- Customizable Decks – Use built-in card sets or define your own domain-specific decks
- Domain-Specific Scenarios – Evaluate reasoning in different industries and knowledge areas
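To make the yes/no guessing loop behind Player Mode and AI Simulation Mode concrete, here is a toy sketch. It is not the demo's implementation: the real application uses LLMs for both roles, whereas this sketch replaces them with deterministic stubs. The deck, the card attributes, and the names `ask`, `judge`, and `play_round` are all assumptions made for illustration.

```python
# Toy deck: each card is a dict of attributes (hypothetical data, not a built-in deck).
DECK = [
    {"name": "sparrow", "kind": "bird", "can_fly": True},
    {"name": "penguin", "kind": "bird", "can_fly": False},
    {"name": "bat", "kind": "mammal", "can_fly": True},
    {"name": "whale", "kind": "mammal", "can_fly": False},
]


def judge(target, question):
    """Stub judge: answer an (attribute, value) question truthfully for the target card."""
    attribute, value = question
    return target[attribute] == value


def ask(candidates):
    """Stub guesser: pick the (attribute, value) pair that splits the candidates most evenly."""
    best, best_gap = None, None
    for card in candidates:
        for attribute, value in card.items():
            if attribute == "name":
                continue
            yes = sum(1 for c in candidates if c[attribute] == value)
            # Skip questions that every remaining card answers the same way.
            if not 0 < yes < len(candidates):
                continue
            gap = abs(2 * yes - len(candidates))
            if best_gap is None or gap < best_gap:
                best, best_gap = (attribute, value), gap
    return best


def play_round(deck, target, max_turns=10):
    """Ask yes/no questions until one candidate remains; return (guess, turns_used)."""
    candidates = list(deck)
    turns = 0
    while len(candidates) > 1 and turns < max_turns:
        question = ask(candidates)
        if question is None:  # no attribute separates the remaining cards
            break
        answer = judge(target, question)
        attribute, value = question
        # Keep only the cards consistent with the judge's answer.
        candidates = [c for c in candidates if (c[attribute] == value) == answer]
        turns += 1
    guess = candidates[0]["name"] if len(candidates) == 1 else None
    return guess, turns


if __name__ == "__main__":
    print(play_round(DECK, DECK[1]))  # prints ('penguin', 2)
```

Counting the turns needed to isolate the target card gives a simple efficiency signal, which is the kind of quantity a leaderboard could compare across models.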
- Python 3.8+
- Flask
- OpenAI API access
- Clone the repository:
      git clone git@github.com:Duguce/GuessArena-Demo.git
      cd GuessArena-Demo
- Create and activate a conda environment:
      conda create -n guessarena python=3.10
      conda activate guessarena
- Install dependencies:
      pip install -r requirements.txt
- Configure your API settings in config/settings.json
- Set up your API keys in config/models.ini for the AI models you want to use.
- Start the application:
      python app.py
- Open your browser and go to http://localhost:8888
- Choose between Player Mode or AI Simulation
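Before starting the application, it can help to check that the API keys in config/models.ini are actually filled in. The sketch below uses Python's standard `configparser` for that. The section name `example-model` and the keys `api_key` and `base_url` are hypothetical; the layout of the real config/models.ini is not documented here, so adjust the names to match the file in the repository.

```python
import configparser

# Hypothetical contents: the real config/models.ini may use different sections and keys.
EXAMPLE_INI = """
[example-model]
api_key = sk-your-key-here
base_url = https://api.openai.com/v1
"""


def load_models(ini_text):
    """Parse INI text into {section_name: {key: value}}."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    return {name: dict(parser[name]) for name in parser.sections()}


def missing_keys(models, required=("api_key",)):
    """List "section.key" entries that are absent or blank."""
    return [
        f"{name}.{key}"
        for name, cfg in models.items()
        for key in required
        if not cfg.get(key, "").strip()
    ]


if __name__ == "__main__":
    models = load_models(EXAMPLE_INI)
    print(missing_keys(models))  # prints []
```

To check the file on disk instead of the inline example, pass its text to `load_models`, for example `load_models(open("config/models.ini").read())`.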
- /config - Configuration files and model settings
- /data - Leaderboard data, logs, and card decks
- /prompts - Prompt templates for AI models
- /static - Static assets (CSS, JavaScript)
- /templates - HTML templates
The application includes several security features:
- Content Security Policy
- Rate limiting for API endpoints
- Path traversal prevention
- Secure file access
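Two of the features listed above, path traversal prevention and rate limiting, can be illustrated with a few lines of standard-library Python. This is a generic sketch of the techniques, not the code used in this application: `resolve_inside` and `SlidingWindowLimiter` are hypothetical names, and the limits shown are arbitrary.

```python
import time
from collections import deque
from pathlib import Path


def resolve_inside(base_dir, user_path):
    """Return the resolved path if it stays inside base_dir, else raise ValueError."""
    base = Path(base_dir).resolve()
    candidate = (base / user_path).resolve()
    # relative_to raises ValueError when candidate is not located under base,
    # which covers "../" sequences as well as absolute user paths.
    candidate.relative_to(base)
    return candidate


class SlidingWindowLimiter:
    """Allow at most max_calls per client within the last window_seconds."""

    def __init__(self, max_calls, window_seconds):
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self._calls = {}  # client id -> deque of call timestamps

    def allow(self, client_id, now=None):
        now = time.monotonic() if now is None else now
        calls = self._calls.setdefault(client_id, deque())
        # Drop timestamps that have fallen out of the window.
        while calls and now - calls[0] >= self.window_seconds:
            calls.popleft()
        if len(calls) >= self.max_calls:
            return False
        calls.append(now)
        return True
```

In a Flask application, checks like these would typically run before a route reads a file from /data or forwards a request to a model API, with a rejected request answered by an error status instead of being processed.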
@inproceedings{GuessArena,
title = "{G}uess{A}rena: Guess Who {I} Am? A Self-Adaptive Framework for Evaluating {LLM}s in Domain-Specific Knowledge and Reasoning",
author = "Yu, Qingchen and
Zheng, Zifan and
Chen, Ding and
Niu, Simin and
Tang, Bo and
Xiong, Feiyu and
Li, Zhiyu",
editor = "Che, Wanxiang and
Nabende, Joyce and
Shutova, Ekaterina and
Pilehvar, Mohammad Taher",
booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = jul,
year = "2025",
address = "Vienna, Austria",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.acl-long.534/",
doi = "10.18653/v1/2025.acl-long.534",
pages = "10897--10912",
ISBN = "979-8-89176-251-0",
}