
Easier Painting Than Thinking: Can Text-to-Image Models Set the Stage, but Not Direct the Play?

Ouxiang Li1*, Yuan Wang1, Xinting Hu, Huijuan Huang2‡, Rui Chen2, Jiarong Ou2,
Xin Tao2†, Pengfei Wan2, Xiaojuan Qi3, Fuli Feng1

1University of Science and Technology of China, 2Kling Team, Kuaishou Technology, 3The University of Hong Kong
*Work done during internship at Kling Team, Kuaishou Technology. †Corresponding authors. ‡Project lead.


Overview of our T2I-CoReBench. (a) Our benchmark comprehensively covers two fundamental T2I capabilities (i.e., composition and reasoning), further refined into 12 dimensions. (b–e) Our benchmark poses greater challenges to advanced T2I models, with higher compositional density than DPG-Bench and greater reasoning intensity than R2I-Bench, enabling clearer performance differentiation across models under real-world complexities. Each image is scored based on the ratio of correctly generated elements.

📣 News

  • 2026/01 🌟 We have updated the evaluation results of GPT-Image-1.5.
  • 2026/01 🌟 We have optimized evaluate.py to improve evaluation efficiency for open-source evaluators and updated the human alignment study results (see 📏 Run Evaluation).
  • 2026/01 🌟 We have updated the evaluation results of Qwen-Image-2512.
  • 2025/12 🌟 We have updated the evaluation results of FLUX.2-dev and LongCat-Image.
  • 2025/12 🌟 We have updated the evaluation results of HunyuanImage-3.0 and Z-Image-Turbo.
  • 2025/11 🌟 We have updated the evaluation results of 🍌 Nano Banana Pro, which achieves a new SOTA across all 12 dimensions by a substantial margin (see our 🏆 Leaderboard for more details).
  • 2025/10 🌟 We have integrated the Qwen3-VL series MLLMs into evaluate.py.
  • 2025/09 🌟 We have updated the evaluation results of Seedream 4.0.
  • 2025/09 🌟 We have released our benchmark dataset and code.

Benchmark Comparison


T2I-CoReBench comprehensively covers 12 evaluation dimensions spanning both composition and reasoning scenarios. The comparison distinguishes three coverage levels: coverage with high compositional (visual elements > 5) or reasoning (one-to-many or many-to-one inference) complexity; coverage only under simple settings (visual elements ≤ 5 or one-to-one inference); and no coverage of the dimension.

🚀 Quick Start

To evaluate text-to-image models on our T2I-CoReBench, follow these steps:

🖼️ Generate Images

Use the provided script to generate images from the benchmark prompts in ./data. You can customize the T2I models by editing MODELS and adjust GPU usage by setting GPUS. Here, we take Qwen-Image as an example; for the corresponding Python environment, please refer to its official repository.

```bash
bash sample.sh
```

If you wish to sample with your own model, simply modify the sampling code in sample.py, i.e., the model loading part in lines 44–72 and the sampling part in line 94; no other changes are required.
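
For orientation, a custom sampler could look roughly like the sketch below, assuming a diffusers-style text-to-image pipeline (Qwen-Image as an example) and a JSON prompt file. The prompt-file layout, paths, and function names here are placeholders and do not reproduce the repository's sample.py.

```python
# Minimal sketch of a custom sampler (illustrative only, not the repo's sample.py).
# Assumptions: a diffusers-style pipeline and a JSON prompt file of the form
# [{"id": ..., "prompt": ...}, ...]; adjust both to match the actual benchmark files.
import json
from pathlib import Path

import torch
from diffusers import DiffusionPipeline


def load_pipeline(model_id: str = "Qwen/Qwen-Image"):
    # Swap this block for your own model's loading code.
    pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
    return pipe.to("cuda")


def sample(pipe, prompt_file: Path, out_dir: Path) -> None:
    out_dir.mkdir(parents=True, exist_ok=True)
    for item in json.loads(prompt_file.read_text()):
        # Swap this call for your own model's sampling code.
        image = pipe(prompt=item["prompt"]).images[0]
        image.save(out_dir / f"{item['id']}.png")


if __name__ == "__main__":
    pipeline = load_pipeline()
    sample(pipeline, Path("./data/example_prompts.json"), Path("./outputs/qwen-image"))
```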

📏 Run Evaluation

We provide evaluation code supporting multiple MLLM evaluators for assessing the generated images in our benchmark, including Gemini 2.5 Flash (used in our main paper) and the Qwen series (complementary open-source evaluators).

Note

If Gemini 2.5 Flash is unavailable (e.g., due to closed-source API costs), we recommend Qwen3-VL-32B-Thinking or Qwen3-VL-30B-A3B-Thinking as alternatives. Both offer a strong balance between human consistency and computational cost among open-source MLLMs (see the table below). Qwen3-VL-30B-A3B-Thinking is more efficient thanks to its MoE architecture, making it the more cost-effective choice. Comprehensive evaluation results for different MLLM evaluators are available in our 🏆 Leaderboard.

For the Gemini series, please refer to the Gemini documentation for environment setup, and set an official API key as the GEMINI_API_KEY environment variable so that evaluate.py can read it. For the Qwen series, please follow the vLLM User Guide and consult the official Qwen repository for environment setup.
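
As a reference for how an MLLM evaluator is queried, the snippet below scores a single generated image with Gemini 2.5 Flash via the google-genai SDK. It is only an illustration: the actual checklist questions, prompting format, and score parsing live in evaluate.py, and the image path and question here are placeholders.

```python
# Illustrative single-image check with Gemini 2.5 Flash (google-genai SDK).
# The real evaluation prompts and parsing are implemented in evaluate.py.
import os

from google import genai
from PIL import Image

client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])

image = Image.open("outputs/qwen-image/0001.png")  # placeholder path
question = (
    "Does the image faithfully depict every element described in the prompt? "
    "Answer Yes or No for each element."  # placeholder question
)

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[image, question],
)
print(response.text)
```

A Qwen3-VL evaluator served with vLLM would be queried analogously through its OpenAI-compatible endpoint.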

```bash
bash eval.sh
```

The evaluation process automatically assesses the generated images across all 12 dimensions of our benchmark and writes a mean_score for each dimension to an individual JSON file.
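
For a quick overview across dimensions, a small script along the following lines can collect those per-dimension files into one summary. The output directory and file naming below are assumptions, so adapt them to your run.

```python
# Illustrative aggregation of the per-dimension mean_score JSON files produced
# by eval.sh. Directory layout and file names are assumptions; adjust as needed.
import json
from pathlib import Path

result_dir = Path("./results/qwen-image")  # placeholder output directory

scores = {}
for result_file in sorted(result_dir.glob("*.json")):
    with result_file.open() as f:
        scores[result_file.stem] = json.load(f)["mean_score"]

for dimension, score in scores.items():
    print(f"{dimension:<35s} {score:.4f}")
if scores:
    print(f"{'overall mean':<35s} {sum(scores.values()) / len(scores):.4f}")
```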

Table: Human alignment study using balanced accuracy (%) and GPU (80GB) requirement for different MLLMs.
| MLLM | MI | MA | MR | TR | Mean | #GPUs |
| --- | --- | --- | --- | --- | --- | --- |
| Qwen2.5-VL-72B-Instruct | 81.3 | 63.1 | 64.2 | 73.7 | 70.6 | 4 |
| InternVL3-78B | 70.8 | 56.8 | 56.5 | 67.7 | 62.9 | 4 |
| GLM4.5V-106B | 78.0 | 61.3 | 60.3 | 71.8 | 67.8 | 4 |
| Qwen3-VL-8B-Instruct | 72.0 | 56.2 | 56.6 | 65.4 | 62.5 | 1 |
| Qwen3-VL-8B-Thinking | 79.6 | 68.9 | 70.7 | 76.2 | 73.8 | 1 |
| Qwen3-VL-32B-Instruct | 80.8 | 63.4 | 60.6 | 73.3 | 69.5 | 2 |
| Qwen3-VL-32B-Thinking | 81.9 | 72.9 | 75.4 | 79.8 | 77.5 | 2 |
| Qwen3-VL-30B-A3B-Instruct | 83.1 | 61.9 | 59.1 | 74.2 | 69.6 | 2 |
| Qwen3-VL-30B-A3B-Thinking | 82.5 | 73.9 | 75.4 | 77.7 | 77.4 | 2 |
| GPT-4o | 78.3 | 67.5 | 63.6 | 72.0 | 70.3 | - |
| OpenAI o3 | 83.5 | 77.8 | 80.4 | 86.8 | 82.1 | - |
| OpenAI o4 mini | 81.9 | 74.7 | 77.0 | 83.0 | 79.1 | - |
| Gemini 2.5 Pro | 83.4 | 76.5 | 82.2 | 88.4 | 82.6 | - |
| Gemini 2.5 Flash | 83.8 | 76.9 | 78.0 | 85.7 | 81.1 | - |
| Gemini 2.5 Flash Lite | 69.1 | 60.1 | 58.0 | 74.5 | 65.4 | - |
| Gemini 2.0 Flash | 73.5 | 61.0 | 67.7 | 77.1 | 69.8 | - |

📊 Examples of Each Dimension

(Figure: example prompts and generated images for each of the 12 evaluation dimensions.)

✍️ Citation

If you find this repo useful, please consider citing:

@article{li2025easier,
  title={Easier Painting Than Thinking: Can Text-to-Image Models Set the Stage, but Not Direct the Play?},
  author={Li, Ouxiang and Wang, Yuan and Hu, Xinting and Huang, Huijuan and Chen, Rui and Ou, Jiarong and Tao, Xin and Wan, Pengfei and Qi, Xiaojuan and Feng, Fuli},
  journal={arXiv preprint arXiv:2509.03516},
  year={2025}
}
