
A Unified Toolkit for Evaluating GUI Agents

GUIEvalKit is an open-source evaluation toolkit for GUI agents that lets practitioners assess these agents on a range of (offline) benchmarks. Its goal is to give researchers and developers an easy-to-use toolkit that simplifies the evaluation process while keeping results easily reproducible.

Requirements and Installation

This work has been tested in the following environment:

  • python == 3.10.12
  • torch == 2.7.1+cu126
  • transformers == 4.56.1
  • vllm == 0.10.1
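For reproducibility, the pins above can be captured in a requirements file. A sketch (note the tested `torch` build is `2.7.1+cu126`, which comes from the PyTorch CUDA wheel index, so a plain `pip install` may resolve a different torch build):

```text
torch==2.7.1
transformers==4.56.1
vllm==0.10.1
```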

Supported Models

| Model | Model Name | Organization |
| --- | --- | --- |
| Qwen2.5-VL | qwen2.5-vl-3/7/32/72b-instruct | Alibaba |
| GUI-Owl | gui-owl-7/32b | Alibaba |
| UI-Venus | ui-venus-navi-7/72b | Ant Group |
| UI-TARS | ui-tars-2/7/72b-sft, ui-tars-7/72b-dpo | ByteDance |
| UI-TARS-1.5 | ui-tars-1.5-7b | ByteDance |
| MagicGUI | magicgui-cpt/rft | Honor |
| AgentCPM-GUI | agentcpm-gui-8b | ModelBest |
| MiMo-VL | mimo-vl-7b-sft/rl, mimo-vl-7b-sft/rl-2508 | Xiaomi |
| GLM-V | glm-4.1v-9b-thinking, glm-4.5v | Zhipu AI |

Supported Benchmarks

| Dataset | Task Name | Task | Description |
| --- | --- | --- | --- |
| AndroidControl | androidcontrol_low/high | Agent | 1680 episodes, (10814 - 653) steps |
| CAGUI | cagui_agent | Agent | 600 episodes, 4516 steps |
| GUI Odyssey | gui_odyssey | Agent | 1933 episodes, 29426 steps |
| AiTZ | aitz | Agent | 506 episodes, 4724 steps |

Data Preparation

Please follow the instructions to download and preprocess the datasets.

Data & Model Registration

Please update the files dataset_info.json and model_info.json with your own information.
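The exact schema of these files is defined by the repository, so the field names below are purely hypothetical. As an illustration only, a `model_info.json` entry might map a model name to its checkpoint location:

```json
{
  "agentcpm-gui-8b": {
    "model_path": "/path/to/AgentCPM-GUI"
  }
}
```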

Evaluation

```shell
python3 run.py \
  --model agentcpm-gui-8b \
  --dataset cagui_agent \
  --mode all \
  --outputs outputs/agentcpm-gui-8b/cagui_agent \
  --use-vllm
```

Arguments:

  • --model (str): Name of a model supported by GUIEvalKit (defined in config/model_info.json).
  • --dataset (str): Name of a benchmark supported by GUIEvalKit (defined in config/dataset_info.json).
  • --mode (str, default 'all', choices ['all', 'infer', 'eval']): With all, both inference and evaluation are performed; with infer, only inference; with eval, only evaluation.
  • --outputs (str, default './outputs'): Directory in which to save evaluation results.
  • --batch-size (int, default 64): Batch size used for inference.
  • --no-think: Disable the thinking mode (if the model supports one).
  • --use-vllm: Run inference with vLLM; otherwise Transformers is used.
  • --over-size: Deploy large models across four GPUs when inferring with vLLM.
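The infer/eval split is handy when inference runs on a GPU machine and scoring happens elsewhere. A small helper for scripting such sweeps (hypothetical, not part of the toolkit; it only assembles the command lines using the flags documented above):

```python
def build_run_cmd(model, dataset, mode, outputs,
                  use_vllm=False, no_think=False):
    """Assemble a run.py command line for the given mode.

    Hypothetical convenience helper; flags mirror the
    arguments documented above.
    """
    assert mode in ("all", "infer", "eval")
    cmd = ["python3", "run.py",
           "--model", model,
           "--dataset", dataset,
           "--mode", mode,
           "--outputs", outputs]
    if no_think:
        cmd.append("--no-think")
    if use_vllm:
        cmd.append("--use-vllm")
    return cmd

# Stage 1: inference on the GPU machine.
infer_cmd = build_run_cmd("agentcpm-gui-8b", "cagui_agent", "infer",
                          "outputs/agentcpm-gui-8b/cagui_agent",
                          use_vllm=True)
# Stage 2: scoring over the saved predictions.
eval_cmd = build_run_cmd("agentcpm-gui-8b", "cagui_agent", "eval",
                         "outputs/agentcpm-gui-8b/cagui_agent")
```

The two command lists can then be passed to `subprocess.run` on their respective machines, as long as both stages point `--outputs` at the same directory.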

Please check here for the detailed evaluation results.

Acknowledgement

This repo benefits from AgentCPM-GUI/eval and VLMEvalKit. Thanks for their wonderful work.
