GUIEvalKit is an open-source evaluation toolkit for GUI agents that lets practitioners assess these agents on a variety of (offline) benchmarks. Its goal is to provide an easy-to-use toolkit that simplifies the evaluation process for researchers and developers while keeping results easy to reproduce.
This work has been tested in the following environment:
```
python == 3.10.12
torch == 2.7.1+cu126
transformers == 4.56.1
vllm == 0.10.1
```
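This section does not prescribe an installation command; as a minimal sketch, the pinned versions above could be installed roughly as follows (the CUDA 12.6 index URL is an assumption and should match your local CUDA setup):

```bash
# Minimal sketch (assumed commands): install the pinned versions listed above.
# The cu126 index URL is an assumption; pick the wheel matching your CUDA version.
pip install torch==2.7.1 --index-url https://download.pytorch.org/whl/cu126
pip install transformers==4.56.1 vllm==0.10.1
```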
| Model | Model Name | Organization |
|---|---|---|
| Qwen2.5-VL | qwen2.5-vl-3/7/32/72b-instruct | Alibaba |
| GUI-Owl | gui-owl-7/32b | Alibaba |
| UI-Venus | ui-venus-navi-7/72b | Ant Group |
| UI-TARS | ui-tars-2/7/72b-sft, ui-tars-7/72b-dpo | Bytedance |
| UI-TARS-1.5 | ui-tars-1.5-7b | Bytedance |
| MagicGUI | magicgui-cpt/rft | Honor |
| AgentCPM-GUI | agentcpm-gui-8b | ModelBest |
| MiMo-VL | mimo-vl-7b-sft/rl, mimo-vl-7b-sft/rl-2508 | Xiaomi |
| GLM-V | glm-4.1v-9b-thinking, glm-4.5v | Zhipu AI |
| Dataset | Task Name | Task | Description |
|---|---|---|---|
| AndroidControl | androidcontrol_low/high | Agent | 1680 episodes, (10814 - 653) steps |
| CAGUI | cagui_agent | Agent | 600 episodes, 4516 steps |
| GUI Odyssey | gui_odyssey | Agent | 1933 episodes, 29426 steps |
| AiTZ | aitz | Agent | 506 episodes, 4724 steps |
Please follow the instructions to download and preprocess the datasets.
Please update the files `config/dataset_info.json` and `config/model_info.json` with your own information.
```bash
python3 run.py \
    --model agentcpm-gui-8b \
    --dataset cagui_agent \
    --mode all \
    --outputs outputs/agentcpm-gui-8b/cagui_agent \
    --use-vllm
```

Arguments:

- `--model` (str): The model name supported by GUIEvalKit (defined in `config/model_info.json`).
- `--dataset` (str): The benchmark name supported by GUIEvalKit (defined in `config/dataset_info.json`).
- `--mode` (str, default `'all'`, choices `['all', 'infer', 'eval']`): When `mode` is set to `all`, both inference and evaluation are performed; `infer` performs only inference; `eval` performs only evaluation.
- `--outputs` (str, default `'./outputs'`): The directory to save evaluation results.
- `--batch-size` (int, default `64`): The batch size used for inference.
- `--no-think`: Pass this flag to disable thinking mode (if applicable).
- `--use-vllm`: Pass this flag to run inference with `vllm`; otherwise `transformers` is used.
- `--over-size`: Pass this flag to deploy large models on four GPUs and run inference with `vllm`.
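For example, inference and evaluation can be run as separate steps by reusing the flags documented above; the model and dataset names below are simply those from the example command, and it is assumed that the same `--outputs` directory is reused so the evaluation step can find the saved predictions:

```bash
# Step 1: run inference only and save predictions
python3 run.py \
    --model agentcpm-gui-8b \
    --dataset cagui_agent \
    --mode infer \
    --outputs outputs/agentcpm-gui-8b/cagui_agent \
    --use-vllm

# Step 2: evaluate the saved predictions without re-running inference
python3 run.py \
    --model agentcpm-gui-8b \
    --dataset cagui_agent \
    --mode eval \
    --outputs outputs/agentcpm-gui-8b/cagui_agent
```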
Please check here for the detailed evaluation results.
This repo benefits from AgentCPM-GUI/eval and VLMEvalKit. Thanks for their wonderful work.