GUIEvalKit is an open-source evaluation toolkit for GUI agents that lets practitioners assess these agents on a variety of (offline) benchmarks. Its goal is to provide an easy-to-use toolkit that simplifies the evaluation process for researchers and developers while keeping results easy to reproduce.
This work has been tested in the following environment:
```
python == 3.10.12
torch == 2.7.1+cu126
transformers == 4.56.1
vllm == 0.10.1
```
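This section does not prescribe an installation command; as a minimal sketch, the pinned versions above could be installed roughly as follows (the CUDA 12.6 index URL is an assumption and should match your local CUDA setup):

```bash
# Minimal sketch (assumed commands): install the pinned versions listed above.
# The cu126 index URL is an assumption; pick the wheel matching your CUDA version.
pip install torch==2.7.1 --index-url https://download.pytorch.org/whl/cu126
pip install transformers==4.56.1 vllm==0.10.1
```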
| Model | Model Name | Organization |
|---|---|---|
| Qwen2.5-VL | qwen2.5-vl-3/7/32/72b-instruct | Alibaba |
| GUI-Owl | gui-owl-7/32b | Alibaba |
| UI-Venus | ui-venus-navi-7/72b | Ant Group |
| UI-TARS | ui-tars-2/7/72b-sft, ui-tars-7/72b-dpo | Bytedance |
| UI-TARS-1.5 | ui-tars-1.5-7b | Bytedance |
| MagicGUI | magicgui-cpt/rft | Honor |
| AgentCPM-GUI | agentcpm-gui-8b | ModelBest |
| MiMo-VL | mimo-vl-7b-sft/rl, mimo-vl-7b-sft/rl-2508 | Xiaomi |
| GLM-V | glm-4.1v-9b-thinking, glm-4.5v | Zhipu AI |
| Dataset | Task Name | Task | Description |
|---|---|---|---|
| AndroidControl | androidcontrol_low/high | Agent | 1680 episodes, (10814 - 653) steps |
| CAGUI | cagui_agent | Agent | 600 episodes, 4516 steps |
| GUI Odyssey | gui_odyssey | Agent | 1933 episodes, 29426 steps |
| AiTZ | aitz | Agent | 506 episodes, 4724 steps |
Please follow the instructions to download and preprocess the datasets.
Please update the files `config/dataset_info.json` and `config/model_info.json` with your own information.
```bash
python3 run.py \
    --model agentcpm-gui-8b \
    --dataset cagui_agent \
    --mode all \
    --outputs outputs/agentcpm-gui-8b/cagui_agent \
    --use-vllm
```

Arguments:

- `--model` (str): The model name supported by GUIEvalKit (defined in `config/model_info.json`).
- `--dataset` (str): The benchmark name supported by GUIEvalKit (defined in `config/dataset_info.json`).
- `--mode` (str, default `'all'`, choices `['all', 'infer', 'eval']`): When `mode` is set to `all`, both inference and evaluation are performed; `infer` performs only inference; `eval` performs only evaluation.
- `--outputs` (str, default `'./outputs'`): The directory to save evaluation results.
- `--batch-size` (int, default `64`): The batch size used for inference.
- `--no-think`: Pass this flag to disable thinking mode (if applicable).
- `--use-vllm`: Pass this flag to run inference with `vllm`; otherwise `transformers` is used.
- `--over-size`: Pass this flag to deploy large models on four GPUs and run inference with `vllm`.
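For example, inference and evaluation can be run as separate steps by reusing the flags documented above; the model and dataset names below are simply those from the example command, and it is assumed that the same `--outputs` directory is reused so the evaluation step can find the saved predictions:

```bash
# Step 1: run inference only and save predictions
python3 run.py \
    --model agentcpm-gui-8b \
    --dataset cagui_agent \
    --mode infer \
    --outputs outputs/agentcpm-gui-8b/cagui_agent \
    --use-vllm

# Step 2: evaluate the saved predictions without re-running inference
python3 run.py \
    --model agentcpm-gui-8b \
    --dataset cagui_agent \
    --mode eval \
    --outputs outputs/agentcpm-gui-8b/cagui_agent
```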
Please check here for the detailed evaluation results.
This repo benefits from AgentCPM-GUI/eval and VLMEvalKit. Thanks for their wonderful work.