This project is designed to evaluate the performance of multimodal large language models on the SpatialViz-Bench benchmark.
- 2025.5.28 Released SpatialViz-Bench, the first benchmark to evaluate spatial visualization for MLLMs.
- 2026.1.5 EASI (Holistic Evaluation of Multimodal LLMs on Spatial Intelligence) integrated SpatialViz-Bench into its open-source evaluation platform (EASI on GitHub).
- 2026.1.26 🎉 Accepted as a poster at ICLR 2026.
- 2026.2.2 Released the code for generating data.
- Clone the Repository:
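  A typical checkout looks like the following; the URL below is a placeholder and should be replaced with the actual SpatialViz-Bench repository address.

  ```bash
  # Placeholder URL: substitute the real SpatialViz-Bench repository
  git clone https://github.com/your-org/SpatialViz-Bench.git
  cd SpatialViz-Bench
  ```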
- Create and Activate a Virtual Environment and Install Dependencies:
  For open-source models, follow the environment requirements specified in each model's code repository. For evaluating closed-source (API-served) models, only the following packages are required: `openai`, `datasets`, and `tqdm`.
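  For the closed-source/API path, a minimal setup might look like this (the environment name and tooling are arbitrary choices, not requirements of the project):

  ```bash
  # Create and activate a virtual environment (any environment manager works)
  python -m venv .venv
  source .venv/bin/activate

  # Packages needed for evaluating closed-source / API-served models
  pip install openai datasets tqdm
  ```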
Before running the script, you may need to configure API keys. The script accepts these keys via command-line arguments.
- Qwen API Key: For accessing Qwen series models.
- Doubao API Key: For accessing Doubao series models.
- OpenAI API Key: For accessing OpenAI models (e.g., GPT-4o).
- Gemini API Key: For accessing Gemini series models.
- OpenRouter API Key: For accessing various models via OpenRouter.
Please ensure you have valid API keys for the models you intend to use.
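As an illustration only, the invocation below shows keys being passed on the command line; the flag names (`--openai_api_key`, `--qwen_api_key`) are hypothetical stand-ins, so check `evaluation/evaluate.py --help` for the actual argument names.

```bash
# Hypothetical flag names for illustration; the real argument names are defined in evaluation/evaluate.py
python evaluation/evaluate.py \
    --model_list "gpt-4o" \
    --openai_api_key "sk-..." \
    --qwen_api_key "..."
```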
You can use evaluation/evaluate.py to run evaluations for closed-source or API-served models. This script supports two modes:
- `modify` (default): re-evaluate missing/failed entries in existing results.
- `evaluate`: run a fresh evaluation.
The basic command structure is as follows:
```bash
python evaluation/evaluate.py \
--model_list "qwen2.5-vl-3b-instruct" "gpt-4o" \
--benchmark_test_path "path/to/your/SpatialVizBench/SpatialViz_Bench_images" \
--save_dir "path/to/your/results_directory" \
--data_file "SpatialViz_Bench_test.json" \
--run_mode "evaluate"
```

Common optional flags:
- `--text_only`: skip images and run text-only prompts
- `--use_direct_answer`: use direct-answer prompts when available
- `--choice_prompt` / `--direct_prompt`: select prompt keys or pass raw prompt text
- `--enable_sampling --sample_per_level N --sample_seed S`: subsample per (Category, Task, Level)
- `--logprobs --top_logprobs K`: request log probabilities (if the API supports it)
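For example, a fresh run on a small per-level subsample could combine the flags above like this (paths and counts are illustrative):

```bash
# Fresh evaluation on 5 instances per (Category, Task, Level), with a fixed seed
python evaluation/evaluate.py \
    --model_list "gpt-4o" \
    --benchmark_test_path "path/to/your/SpatialVizBench/SpatialViz_Bench_images" \
    --save_dir "path/to/your/results_directory" \
    --data_file "SpatialViz_Bench_test.json" \
    --run_mode "evaluate" \
    --enable_sampling --sample_per_level 5 --sample_seed 42
```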
You can use the evaluation/evaluate_xxx.py scripts to run evaluations for specific open-source models. Available scripts:
- `evaluation/evaluate_deepseekvl.py`
- `evaluation/evaluate_internvl.py`
- `evaluation/evaluate_kimivl.py`
- `evaluation/evaluate_llava_ov.py`
- `evaluation/evaluate_sail.py`
The basic command structure is as follows:
```bash
python evaluation/evaluate_xxxvl.py \
--model_paths "path/to/download/xxx/models" \
--benchmark_test_path "path/to/your/SpatialVizBench/SpatialViz_Bench_images" \
--results_dir "path/to/your/results_directory" \
--data_file "SpatialViz_Bench_test.json"
```

These scripts share common flags such as `--run_mode`, `--text_only`, `--enable_sampling`, and `--enable_tail_fallback`.
The get_answer function in evaluate.py processes a results file (in JSONL format) generated by model inference. Its main purposes are:
- Extracting Answers: It parses the model's output to identify the predicted answer (A, B, C, or D) for each question. It can handle outputs with and without explicit `<answer>` tags, attempting to find the answer even in less structured responses.
- Calculating Accuracy:
  - It compares the predicted answer with the ground-truth answer.
  - It calculates and stores accuracy at different granularities:
    - `overall`: accuracy across all test instances.
    - `category`: accuracy for each main category in the benchmark.
    - `task`: accuracy for each specific task type.
    - `level`: accuracy for combined `category-task-level` instances.
- Recording Samples:
  - It separates the evaluated instances into `positives` (correctly answered) and `negatives` (incorrectly answered).
  - Each sample in these lists includes the `DataID`, `InputText`, `Answer` (ground truth), and the model's `Response` or `ThinkingProcess` and `FinalAnswer`.
- Saving Results:
  - Counting File: It saves the accuracy statistics (number of correct predictions, total number of predictions, and accuracy percentage) for overall, category, task, and level into a JSON file (e.g., `results_MODELNAME_counting.json`) in the specified `counting` subdirectory.
  - Samples File: It saves the lists of positive and negative samples into a separate JSON file (e.g., `results_MODELNAME_samples.json`) in the specified `samples` subdirectory.
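As a concrete illustration, the Python sketch below shows the general shape of this post-processing: extracting an A–D choice with or without `<answer>` tags and aggregating accuracy at the granularities listed above. It is a simplified sketch that assumes the record keys named in this section (`Answer`, `Response`, plus `Category`, `Task`, `Level`), not the actual `get_answer` implementation.

```python
import re
from collections import defaultdict

def extract_choice(response: str) -> str | None:
    """Pull an A-D answer out of a model response.

    Prefer an explicit <answer>...</answer> tag; otherwise fall back to
    the last standalone A/B/C/D letter in the text.
    """
    tagged = re.search(r"<answer>\s*([A-D])\s*</answer>", response, re.IGNORECASE)
    if tagged:
        return tagged.group(1).upper()
    letters = re.findall(r"\b([A-D])\b", response.upper())
    return letters[-1] if letters else None

def accumulate(records):
    """Aggregate accuracy overall and per category / task / category-task-level.

    Each record is assumed to carry 'Category', 'Task', 'Level', 'Answer'
    (ground truth), and the raw model 'Response'; the real JSONL keys may differ.
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [correct, total]
    for rec in records:
        pred = extract_choice(rec["Response"])
        correct = pred == rec["Answer"]
        keys = [
            "overall",
            f"category/{rec['Category']}",
            f"task/{rec['Task']}",
            f"level/{rec['Category']}-{rec['Task']}-{rec['Level']}",
        ]
        for key in keys:
            counts[key][1] += 1
            counts[key][0] += int(correct)
    return {k: {"correct": c, "total": t, "acc": c / t} for k, (c, t) in counts.items()}
```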
If you use SpatialViz-Bench in your research, please cite our paper:
@misc{wang2026spatialvizbenchcognitivelygroundedbenchmarkdiagnosing,
title={SpatialViz-Bench: A Cognitively-Grounded Benchmark for Diagnosing Spatial Visualization in MLLMs},
author={Siting Wang and Minnan Pei and Luoyang Sun and Cheng Deng and Kun Shao and Zheng Tian and Haifeng Zhang and Jun Wang},
year={2026},
eprint={2507.07610},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2507.07610},
}