This project is designed to evaluate the performance of multimodal large language models on the SpatialViz-Bench benchmark.
- 2025.5.28 Released SpatialViz-Bench, the first benchmark to evaluate spatial visualization for MLLMs.
- 2026.1.5 EASI (Holistic Evaluation of Multimodal LLMs on Spatial Intelligence) integrated SpatialViz-Bench into its open-source evaluation platform (EASI on GitHub).
- 2026.1.26 🎉 Accepted as a poster at ICLR 2026.
- 2026.2.2 Released the code for generating data.
- Clone the Repository:
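  A typical checkout looks like the following; the URL below is a placeholder and should be replaced with the actual SpatialViz-Bench repository address.

  ```bash
  # Placeholder URL: substitute the real SpatialViz-Bench repository
  git clone https://github.com/your-org/SpatialViz-Bench.git
  cd SpatialViz-Bench
  ```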
- Create and Activate a Virtual Environment and Install Dependencies:
  For open-source models, follow the environment requirements specified in each model's code repository. For evaluating closed-source (API-served) models, only the following packages are required: `openai`, `datasets`, and `tqdm`.
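  For the closed-source/API path, a minimal setup might look like this (the environment name and tooling are arbitrary choices, not requirements of the project):

  ```bash
  # Create and activate a virtual environment (any environment manager works)
  python -m venv .venv
  source .venv/bin/activate

  # Packages needed for evaluating closed-source / API-served models
  pip install openai datasets tqdm
  ```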
Before running the script, you may need to configure API keys. The script accepts these keys via command-line arguments.
- Qwen API Key: For accessing Qwen series models.
- Doubao API Key: For accessing Doubao series models.
- OpenAI API Key: For accessing OpenAI models (e.g., GPT-4o).
- Gemini API Key: For accessing Gemini series models.
- OpenRouter API Key: For accessing various models via OpenRouter.
Please ensure you have valid API keys for the models you intend to use.
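As an illustration only, the invocation below shows keys being passed on the command line; the flag names (`--openai_api_key`, `--qwen_api_key`) are hypothetical stand-ins, so check `evaluation/evaluate.py --help` for the actual argument names.

```bash
# Hypothetical flag names for illustration; the real argument names are defined in evaluation/evaluate.py
python evaluation/evaluate.py \
    --model_list "gpt-4o" \
    --openai_api_key "sk-..." \
    --qwen_api_key "..."
```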
You can use evaluation/evaluate.py to run evaluations for closed-source or API-served models. This script supports two modes:
- `modify` (default): re-evaluate missing/failed entries in existing results.
- `evaluate`: run a fresh evaluation.
The basic command structure is as follows:
```bash
python evaluation/evaluate.py \
--model_list "qwen2.5-vl-3b-instruct" "gpt-4o" \
--benchmark_test_path "path/to/your/SpatialVizBench/SpatialViz_Bench_images" \
--save_dir "path/to/your/results_directory" \
--data_file "SpatialViz_Bench_test.json" \
--run_mode "evaluate"
```

Common optional flags:
- `--text_only`: skip images and run text-only prompts
- `--use_direct_answer`: use direct-answer prompts when available
- `--choice_prompt` / `--direct_prompt`: select prompt keys or pass raw prompt text
- `--enable_sampling --sample_per_level N --sample_seed S`: subsample per (Category, Task, Level)
- `--logprobs --top_logprobs K`: request log probabilities (if the API supports it)
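For example, a fresh run on a small per-level subsample could combine the flags above like this (paths and counts are illustrative):

```bash
# Fresh evaluation on 5 instances per (Category, Task, Level), with a fixed seed
python evaluation/evaluate.py \
    --model_list "gpt-4o" \
    --benchmark_test_path "path/to/your/SpatialVizBench/SpatialViz_Bench_images" \
    --save_dir "path/to/your/results_directory" \
    --data_file "SpatialViz_Bench_test.json" \
    --run_mode "evaluate" \
    --enable_sampling --sample_per_level 5 --sample_seed 42
```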
You can use the evaluation/evaluate_xxx.py scripts to run evaluations for specific open-source models. Available scripts:
- `evaluation/evaluate_deepseekvl.py`
- `evaluation/evaluate_internvl.py`
- `evaluation/evaluate_kimivl.py`
- `evaluation/evaluate_llava_ov.py`
- `evaluation/evaluate_sail.py`
The basic command structure is as follows:
```bash
python evaluation/evaluate_xxxvl.py \
--model_paths "path/to/download/xxx/models" \
--benchmark_test_path "path/to/your/SpatialVizBench/SpatialViz_Bench_images" \
--results_dir "path/to/your/results_directory" \
--data_file "SpatialViz_Bench_test.json"
```

These scripts share common flags such as `--run_mode`, `--text_only`, `--enable_sampling`, and `--enable_tail_fallback`.
The get_answer function in evaluate.py processes a results file (in JSONL format) generated by model inference. Its main purposes are:
- Extracting Answers: It parses the model's output to identify the predicted answer (A, B, C, or D) for each question. It can handle outputs with and without explicit `<answer>` tags, attempting to find the answer even in less structured responses.
- Calculating Accuracy:
  - It compares the predicted answer with the ground-truth answer.
  - It calculates and stores accuracy at different granularities:
    - `overall`: accuracy across all test instances.
    - `category`: accuracy for each main category in the benchmark.
    - `task`: accuracy for each specific task type.
    - `level`: accuracy for combined `category-task-level` instances.
- Recording Samples:
  - It separates the evaluated instances into `positives` (correctly answered) and `negatives` (incorrectly answered).
  - Each sample in these lists includes the `DataID`, `InputText`, `Answer` (ground truth), and the model's `Response` or `ThinkingProcess` and `FinalAnswer`.
- Saving Results:
  - Counting File: It saves the accuracy statistics (number of correct predictions, total number of predictions, and accuracy percentage) for overall, category, task, and level into a JSON file (e.g., `results_MODELNAME_counting.json`) in the specified `counting` subdirectory.
  - Samples File: It saves the lists of positive and negative samples into a separate JSON file (e.g., `results_MODELNAME_samples.json`) in the specified `samples` subdirectory.
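As a concrete illustration, the Python sketch below shows the general shape of this post-processing: extracting an A–D choice with or without `<answer>` tags and aggregating accuracy at the granularities listed above. It is a simplified sketch that assumes the record keys named in this section (`Answer`, `Response`, plus `Category`, `Task`, `Level`), not the actual `get_answer` implementation.

```python
import re
from collections import defaultdict

def extract_choice(response: str) -> str | None:
    """Pull an A-D answer out of a model response.

    Prefer an explicit <answer>...</answer> tag; otherwise fall back to
    the last standalone A/B/C/D letter in the text.
    """
    tagged = re.search(r"<answer>\s*([A-D])\s*</answer>", response, re.IGNORECASE)
    if tagged:
        return tagged.group(1).upper()
    letters = re.findall(r"\b([A-D])\b", response.upper())
    return letters[-1] if letters else None

def accumulate(records):
    """Aggregate accuracy overall and per category / task / category-task-level.

    Each record is assumed to carry 'Category', 'Task', 'Level', 'Answer'
    (ground truth), and the raw model 'Response'; the real JSONL keys may differ.
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [correct, total]
    for rec in records:
        pred = extract_choice(rec["Response"])
        correct = pred == rec["Answer"]
        keys = [
            "overall",
            f"category/{rec['Category']}",
            f"task/{rec['Task']}",
            f"level/{rec['Category']}-{rec['Task']}-{rec['Level']}",
        ]
        for key in keys:
            counts[key][1] += 1
            counts[key][0] += int(correct)
    return {k: {"correct": c, "total": t, "acc": c / t} for k, (c, t) in counts.items()}
```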
If you use SpatialViz-Bench in your research, please cite our paper:
@misc{wang2026spatialvizbenchcognitivelygroundedbenchmarkdiagnosing,
title={SpatialViz-Bench: A Cognitively-Grounded Benchmark for Diagnosing Spatial Visualization in MLLMs},
author={Siting Wang and Minnan Pei and Luoyang Sun and Cheng Deng and Kun Shao and Zheng Tian and Haifeng Zhang and Jun Wang},
year={2026},
eprint={2507.07610},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2507.07610},
}