
Easier Painting Than Thinking: Can Text-to-Image Models Set the Stage, but Not Direct the Play?

Ouxiang Li1*, Yuan Wang1, Xinting Hu, Huijuan Huang2‡, Rui Chen2, Jiarong Ou2,
Xin Tao2†, Pengfei Wan2, Xiaojuan Qi3, Fuli Feng1

1University of Science and Technology of China, 2Kling Team, Kuaishou Technology, 3The University of Hong Kong
*Work done during internship at Kling Team, Kuaishou Technology. †Corresponding authors. ‡Project lead.


Overview of our T2I-CoReBench. (a) Our benchmark comprehensively covers two fundamental T2I capabilities (i.e., composition and reasoning), further refined into 12 dimensions. (b–e) Our benchmark poses greater challenges to advanced T2I models, with higher compositional density than DPG-Bench and greater reasoning intensity than R2I-Bench, enabling clearer performance differentiation across models under real-world complexities. Each image is scored based on the ratio of correctly generated elements.

📣 News

  • 2026/01 🌟 We have updated the evaluation results of GPT-Image-1.5.
  • 2026/01 🌟 We have optimized evaluate.py to improve evaluation efficiency for open-source evaluators and updated the human alignment study results (see 📏 Run Evaluation).
  • 2026/01 🌟 We have updated the evaluation results of Qwen-Image-2512.
  • 2025/12 🌟 We have updated the evaluation results of FLUX.2-dev and LongCat-Image.
  • 2025/12 🌟 We have updated the evaluation results of HunyuanImage-3.0 and Z-Image-Turbo.
  • 2025/11 🌟 We have updated the evaluation results of 🍌 Nano Banana Pro, which achieves a new SOTA across all 12 dimensions by a substantial margin (see our 🏆 Leaderboard for more details).
  • 2025/10 🌟 We have integrated the Qwen3-VL series MLLMs into evaluate.py.
  • 2025/09 🌟 We have updated the evaluation results of Seedream 4.0.
  • 2025/09 🌟 We have released our benchmark dataset and code.

Benchmark Comparison


T2I-CoReBench comprehensively covers 12 evaluation dimensions spanning both composition and reasoning scenarios. The comparison distinguishes three coverage levels: coverage with high compositional (visual elements > 5) or reasoning (one-to-many or many-to-one inference) complexity; coverage only under simple settings (visual elements ≤ 5 or one-to-one inference); and no coverage of the dimension.

🚀 Quick Start

To evaluate text-to-image models on our T2I-CoReBench, follow these steps:

🖼️ Generate Images

Use the provided script to generate images from the benchmark prompts in ./data. You can customize the T2I models by editing MODELS and adjust GPU usage by setting GPUS. Here, we take Qwen-Image as an example; for the corresponding Python environment, please refer to its official repository.

```bash
bash sample.sh
```

If you wish to sample with your own model, simply modify the sampling code in sample.py, i.e., the model loading part in lines 44–72 and the sampling part in line 94; no other changes are required.
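
For orientation, a custom sampler could look roughly like the sketch below, assuming a diffusers-style text-to-image pipeline (Qwen-Image as an example) and a JSON prompt file. The prompt-file layout, paths, and function names here are placeholders and do not reproduce the repository's sample.py.

```python
# Minimal sketch of a custom sampler (illustrative only, not the repo's sample.py).
# Assumptions: a diffusers-style pipeline and a JSON prompt file of the form
# [{"id": ..., "prompt": ...}, ...]; adjust both to match the actual benchmark files.
import json
from pathlib import Path

import torch
from diffusers import DiffusionPipeline


def load_pipeline(model_id: str = "Qwen/Qwen-Image"):
    # Swap this block for your own model's loading code.
    pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
    return pipe.to("cuda")


def sample(pipe, prompt_file: Path, out_dir: Path) -> None:
    out_dir.mkdir(parents=True, exist_ok=True)
    for item in json.loads(prompt_file.read_text()):
        # Swap this call for your own model's sampling code.
        image = pipe(prompt=item["prompt"]).images[0]
        image.save(out_dir / f"{item['id']}.png")


if __name__ == "__main__":
    pipeline = load_pipeline()
    sample(pipeline, Path("./data/example_prompts.json"), Path("./outputs/qwen-image"))
```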

📏 Run Evaluation

We provide evaluation code supporting multiple MLLM evaluators for assessing the generated images in our benchmark, including Gemini 2.5 Flash (used in our main paper) and the Qwen series (complementary open-source evaluators).

Note

If Gemini 2.5 Flash is unavailable (e.g., due to closed-source API costs), we recommend Qwen3-VL-32B-Thinking or Qwen3-VL-30B-A3B-Thinking as alternatives. Both offer a strong balance between human consistency and computational cost among open-source MLLMs (see the table below). Qwen3-VL-30B-A3B-Thinking is more efficient thanks to its MoE architecture, making it the more cost-effective choice. Comprehensive evaluation results for different MLLM evaluators are available in our 🏆 Leaderboard.

For the Gemini series, please refer to the Gemini documentation for environment setup, and set an official API key as the GEMINI_API_KEY environment variable so that evaluate.py can read it. For the Qwen series, please follow the vLLM User Guide and consult the official Qwen repository for environment setup.
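
As a reference for how an MLLM evaluator is queried, the snippet below scores a single generated image with Gemini 2.5 Flash via the google-genai SDK. It is only an illustration: the actual checklist questions, prompting format, and score parsing live in evaluate.py, and the image path and question here are placeholders.

```python
# Illustrative single-image check with Gemini 2.5 Flash (google-genai SDK).
# The real evaluation prompts and parsing are implemented in evaluate.py.
import os

from google import genai
from PIL import Image

client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])

image = Image.open("outputs/qwen-image/0001.png")  # placeholder path
question = (
    "Does the image faithfully depict every element described in the prompt? "
    "Answer Yes or No for each element."  # placeholder question
)

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[image, question],
)
print(response.text)
```

A Qwen3-VL evaluator served with vLLM would be queried analogously through its OpenAI-compatible endpoint.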

```bash
bash eval.sh
```

The evaluation process automatically assesses the generated images across all 12 dimensions of our benchmark and writes a mean_score for each dimension to an individual JSON file.
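
For a quick overview across dimensions, a small script along the following lines can collect those per-dimension files into one summary. The output directory and file naming below are assumptions, so adapt them to your run.

```python
# Illustrative aggregation of the per-dimension mean_score JSON files produced
# by eval.sh. Directory layout and file names are assumptions; adjust as needed.
import json
from pathlib import Path

result_dir = Path("./results/qwen-image")  # placeholder output directory

scores = {}
for result_file in sorted(result_dir.glob("*.json")):
    with result_file.open() as f:
        scores[result_file.stem] = json.load(f)["mean_score"]

for dimension, score in scores.items():
    print(f"{dimension:<35s} {score:.4f}")
if scores:
    print(f"{'overall mean':<35s} {sum(scores.values()) / len(scores):.4f}")
```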

Table: Human alignment study using balanced accuracy (%) and GPU (80GB) requirement for different MLLMs.
| MLLM | MI | MA | MR | TR | Mean | #GPUs |
| --- | --- | --- | --- | --- | --- | --- |
| Qwen2.5-VL-72B-Instruct | 81.3 | 63.1 | 64.2 | 73.7 | 70.6 | 4 |
| InternVL3-78B | 70.8 | 56.8 | 56.5 | 67.7 | 62.9 | 4 |
| GLM4.5V-106B | 78.0 | 61.3 | 60.3 | 71.8 | 67.8 | 4 |
| Qwen3-VL-8B-Instruct | 72.0 | 56.2 | 56.6 | 65.4 | 62.5 | 1 |
| Qwen3-VL-8B-Thinking | 79.6 | 68.9 | 70.7 | 76.2 | 73.8 | 1 |
| Qwen3-VL-32B-Instruct | 80.8 | 63.4 | 60.6 | 73.3 | 69.5 | 2 |
| Qwen3-VL-32B-Thinking | 81.9 | 72.9 | 75.4 | 79.8 | 77.5 | 2 |
| Qwen3-VL-30B-A3B-Instruct | 83.1 | 61.9 | 59.1 | 74.2 | 69.6 | 2 |
| Qwen3-VL-30B-A3B-Thinking | 82.5 | 73.9 | 75.4 | 77.7 | 77.4 | 2 |
| GPT-4o | 78.3 | 67.5 | 63.6 | 72.0 | 70.3 | - |
| OpenAI o3 | 83.5 | 77.8 | 80.4 | 86.8 | 82.1 | - |
| OpenAI o4 mini | 81.9 | 74.7 | 77.0 | 83.0 | 79.1 | - |
| Gemini 2.5 Pro | 83.4 | 76.5 | 82.2 | 88.4 | 82.6 | - |
| Gemini 2.5 Flash | 83.8 | 76.9 | 78.0 | 85.7 | 81.1 | - |
| Gemini 2.5 Flash Lite | 69.1 | 60.1 | 58.0 | 74.5 | 65.4 | - |
| Gemini 2.0 Flash | 73.5 | 61.0 | 67.7 | 77.1 | 69.8 | - |

📊 Examples of Each Dimension

(Figure: example prompts and generated images for each of the 12 evaluation dimensions.)

✍️ Citation

If you find this repo useful, please consider citing:

@article{li2025easier,
  title={Easier Painting Than Thinking: Can Text-to-Image Models Set the Stage, but Not Direct the Play?},
  author={Li, Ouxiang and Wang, Yuan and Hu, Xinting and Huang, Huijuan and Chen, Rui and Ou, Jiarong and Tao, Xin and Wan, Pengfei and Qi, Xiaojuan and Feng, Fuli},
  journal={arXiv preprint arXiv:2509.03516},
  year={2025}
}
