
SpatialViz-Bench

This repository provides the code for evaluating multimodal large language models (MLLMs) on the SpatialViz-Bench benchmark.

News

  • 2025.5.28 Released SpatialViz-Bench, the first benchmark to evaluate spatial visualization for MLLMs.
  • 2026.1.5 EASI (Holistic Evaluation of Multimodal LLMs on Spatial Intelligence) integrated SpatialViz-Bench into its open-source evaluation platform (EASI on GitHub).
  • 2026.1.26 🎉 Accepted as a poster at ICLR 2026.
  • 2026.2.2 Released the data-generation code.

Table of Contents

  • Installation
  • Evaluation
  • Citation

Installation

  1. Clone the Repository
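
    For example (assuming the repository path of this GitHub project):

    git clone https://github.com/wangst0181/SpatialViz-Bench.git
    cd SpatialViz-Bench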

  2. Create and Activate a Virtual Environment and Install Dependencies:

    When setting up the environment for open-source models, follow the requirements specified in each model's own code repository. For evaluating closed-source (API) models, only the following packages are required:

    openai
    datasets
    tqdm
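
    A minimal setup for the closed-source evaluation environment might look like this (the virtual-environment name is illustrative; package versions are not pinned here):

    python -m venv .venv
    source .venv/bin/activate   # on Windows: .venv\Scripts\activate
    pip install openai datasets tqdm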

Evaluation

Configuration for Closed-Source Models

Before running the script, you may need to configure API keys. The script accepts these keys via command-line arguments.

  • Qwen API Key: For accessing Qwen series models.
  • Doubao API Key: For accessing Doubao series models.
  • OpenAI API Key: For accessing OpenAI models (e.g., GPT-4o).
  • Gemini API Key: For accessing Gemini series models.
  • OpenRouter API Key: For accessing various models via OpenRouter.

Please ensure you have valid API keys for the models you intend to use.

Running Evaluations (Closed-source / API Models)

You can use evaluation/evaluate.py to run evaluations for closed-source or API-served models. This script supports two modes:

  • modify (default): re-evaluate missing/failed entries in existing results.
  • evaluate: run a fresh evaluation.

The basic command structure is as follows:

python evaluation/evaluate.py \
    --model_list "qwen2.5-vl-3b-instruct" "gpt-4o" \
    --benchmark_test_path "path/to/your/SpatialVizBench/SpatialViz_Bench_images" \
    --save_dir "path/to/your/results_directory" \
    --data_file "SpatialViz_Bench_test.json" \
    --run_mode "evaluate"

Common optional flags:

  • --text_only: skip images and run text-only prompts
  • --use_direct_answer: use direct-answer prompts when available
  • --choice_prompt / --direct_prompt: select prompt keys or pass raw prompt text
  • --enable_sampling --sample_per_level N --sample_seed S: subsample per (Category, Task, Level)
  • --logprobs --top_logprobs K: request log probabilities (if the API supports it)
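
For example, to run a fresh, sampled evaluation with two items per (Category, Task, Level) cell (paths and the sample size are placeholders; the flag combination is illustrative):

python evaluation/evaluate.py \
    --model_list "gpt-4o" \
    --benchmark_test_path "path/to/your/SpatialVizBench/SpatialViz_Bench_images" \
    --save_dir "path/to/your/results_directory" \
    --data_file "SpatialViz_Bench_test.json" \
    --run_mode "evaluate" \
    --enable_sampling --sample_per_level 2 --sample_seed 42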

Running Evaluations (Open-source / Local Models)

You can use the evaluation/evaluate_xxx.py scripts to run evaluations for specific open-source models. Available scripts:

  • evaluation/evaluate_deepseekvl.py
  • evaluation/evaluate_internvl.py
  • evaluation/evaluate_kimivl.py
  • evaluation/evaluate_llava_ov.py
  • evaluation/evaluate_sail.py

The basic command structure is as follows:

python evaluation/evaluate_xxx.py \
    --model_paths "path/to/download/xxx/models" \
    --benchmark_test_path "path/to/your/SpatialVizBench/SpatialViz_Bench_images" \
    --results_dir "path/to/your/results_directory" \
    --data_file "SpatialViz_Bench_test.json"

These scripts share common flags such as --run_mode, --text_only, --enable_sampling, and --enable_tail_fallback.
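
For example, to re-check an earlier InternVL run and fill in missing or failed entries (paths are placeholders, and this assumes the modify/evaluate semantics described above also apply to these scripts):

python evaluation/evaluate_internvl.py \
    --model_paths "path/to/download/internvl/models" \
    --benchmark_test_path "path/to/your/SpatialVizBench/SpatialViz_Bench_images" \
    --results_dir "path/to/your/results_directory" \
    --data_file "SpatialViz_Bench_test.json" \
    --run_mode "modify"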

Extract Answer from Results

The get_answer function in evaluate.py processes a results file (in JSONL format) generated by model inference. Its main purposes are:

  1. Extracting Answers: It parses the model's output to identify the predicted answer (A, B, C, or D) for each question. It handles outputs with and without explicit <answer> tags, attempting to recover the answer even from less structured responses (see the sketch after this list).
  2. Calculating Accuracy:
    • It compares the predicted answer with the ground truth answer.
    • It calculates and stores accuracy at different granularities:
      • overall: Accuracy across all test instances.
      • category: Accuracy for each main category in the benchmark.
      • task: Accuracy for each specific task type.
      • level: Accuracy for combined category-task-level instances.
  3. Recording Samples:
    • It separates the evaluated instances into positives (correctly answered) and negatives (incorrectly answered).
    • Each sample in these lists includes the DataID, InputText, ground-truth Answer, and the model's Response (or its ThinkingProcess and FinalAnswer).
  4. Saving Results:
    • Counting File: It saves the accuracy statistics (number of correct predictions, total number of predictions, and accuracy percentage) for overall, category, task, and level into a JSON file (e.g., results_MODELNAME_counting.json) in the specified counting subdirectory.
    • Samples File: It saves the lists of positive and negative samples into a separate JSON file (e.g., results_MODELNAME_samples.json) in the specified samples subdirectory.
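
A simplified sketch of the answer-extraction step described above (illustrative only; this is not the exact logic of get_answer in evaluate.py):

import re

def extract_choice(response: str):
    """Return the predicted option letter (A-D) from a model response, or None."""
    # Prefer an explicit <answer>X</answer> tag when the model produced one.
    tagged = re.search(r"<answer>\s*([A-D])\s*</answer>", response, re.IGNORECASE)
    if tagged:
        return tagged.group(1).upper()
    # Otherwise fall back to the last standalone option letter in the text.
    letters = re.findall(r"\b([A-D])\b", response.upper())
    return letters[-1] if letters else None

# Example: both of these yield "B".
print(extract_choice("The correct choice is <answer>B</answer>"))
print(extract_choice("After comparing the views, the answer is B."))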

Citation

If you use SpatialViz-Bench in your research, please cite our paper:

@misc{wang2026spatialvizbenchcognitivelygroundedbenchmarkdiagnosing,
      title={SpatialViz-Bench: A Cognitively-Grounded Benchmark for Diagnosing Spatial Visualization in MLLMs}, 
      author={Siting Wang and Minnan Pei and Luoyang Sun and Cheng Deng and Kun Shao and Zheng Tian and Haifeng Zhang and Jun Wang},
      year={2026},
      eprint={2507.07610},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2507.07610}, 
}
