This project provides a framework for evaluating diffusion models on a wide range of image-text matching tasks. It supports multiple Stable Diffusion versions and various benchmark datasets.
The framework supports the following model versions:
- `1.5`: Stable Diffusion v1.5
- `2.0`: Stable Diffusion v2.0
- `3-m`: Stable Diffusion 3 Medium
- `3-lt`: Stable Diffusion 3 Large Turbo (distilled model)
- `flux`: Flux model
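For reference, these version flags presumably resolve to Hugging Face checkpoints inside `diffusion_itm.py`. Below is a minimal sketch of such a mapping, assuming the repository IDs implied by the dataset directory names shown later in this README; the `3-lt` and `flux` entries are guesses and may differ from the actual code:

```python
# Hypothetical sketch of how --version strings might map to Hugging Face
# checkpoints. The first three IDs mirror the dataset directory names used
# in the layout below; the last two are assumptions.
VERSION_TO_CHECKPOINT = {
    "1.5": "runwayml/stable-diffusion-v1-5",
    "2.0": "stabilityai/stable-diffusion-2-base",
    "3-m": "stabilityai/stable-diffusion-3-medium-diffusers",
    "3-lt": "stabilityai/stable-diffusion-3.5-large-turbo",  # assumption
    "flux": "black-forest-labs/FLUX.1-dev",                  # assumption
}
```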
The framework supports numerous benchmark tasks including:
- `cola_multi`: Compositional Language tasks
- `vg_relation`: Visual Genome relations
- `vg_attribution`: Visual Genome attributes
- `coco_order`: COCO ordering tasks
- `winoground`: Winoground benchmark
- `flickr30k`: Flickr30k dataset
- `flickr30k_text`: Flickr30k text-only evaluation
- `imagenet`: ImageNet evaluation
- `clevr`: CLEVR dataset
- `pets`: Oxford Pets dataset
- `mmbias`: Multimodal bias evaluation
- `genderbias`: Gender bias evaluation
- `eqbench`: Equality benchmark
- `vismin`: Visual minimal-change evaluation
- Color Tasks
  - `geneval_color`: Basic color understanding
  - `geneval_color_attr`: Color attribution
- Position Tasks
  - `geneval_position`: Spatial understanding
- Counting Tasks
  - `geneval_counting`: Object counting
- Object Tasks
  - `geneval_single`: Single object understanding
  - `geneval_two`: Two object understanding
  - `geneval_two_subset`: Subset of two object tasks
```
python diffusion_itm.py --task TASK_NAME --version MODEL_VERSION
```

- `--task`: Specify the benchmark task (required)
- `--version`: Model version to use (required)
- `--batchsize`: Batch size (default: 64)
- `--sampling_steps`: Number of sampling steps (default: 30)
- `--guidance_scale`: Guidance scale for generation (default: 0.0)
- `--img_retrieval`: Enable image retrieval mode
- `--encoder_drop`: Drop encoder for certain tasks
- `--save`: Save results
- `--wandb`: Enable Weights & Biases logging
- Basic evaluation with SD 3 Medium:

```
python diffusion_itm.py --task winoground --version 3-m
```

- Compositional evaluation with a specific subset:

```
python diffusion_itm.py --task cola_multi --version compdiff --comp_subset color
```

- Image retrieval mode:

```
python diffusion_itm.py --task flickr30k --version 2.0 --img_retrieval
```

- Full evaluation with saving:

```
python diffusion_itm.py --task clevr --version 3-m --save --save_results --wandb --batchsize 16
```

When running experiments with Self-bench datasets, you need to specify:
- The model version (`--geneval_version`): choose from "1.5", "2.0", "3-m", or "flux"
- The CFG value (`--geneval_cfg`): default is 9.0
- The filter flag (`--geneval_filter`): set to "True" or "False"
Example command for running a Geneval task:
```
python diffusion_itm.py --task geneval_color --version 2.0 --geneval_version 2.0 --geneval_cfg 9.0 --geneval_filter True
```

The dataset should be organized as follows:
```
dataset_root/
├── 9.0/                                      # CFG value
│   ├── stable-diffusion-v1-5/
│   ├── stable-diffusion-2-base/
│   └── stable-diffusion-3-medium-diffusers/
├── prompts/
│   └── zero_shot_prompts.json
└── filter/
    └── SD-{version}-CFG={cfg}.json
```
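To illustrate how the layout above fits together, here is a minimal, hypothetical sketch of resolving the image directory, prompts, and filter file for a run; the actual loading logic lives in `datasets_loading.py` and may differ:

```python
import json
import os

def resolve_geneval_paths(dataset_root, version, cfg=9.0):
    """Hypothetical helper: build the paths implied by the layout above."""
    # Map a --geneval_version value to its checkpoint directory name.
    version_dirs = {
        "1.5": "stable-diffusion-v1-5",
        "2.0": "stable-diffusion-2-base",
        "3-m": "stable-diffusion-3-medium-diffusers",
    }
    image_dir = os.path.join(dataset_root, str(cfg), version_dirs[version])
    prompts_path = os.path.join(dataset_root, "prompts", "zero_shot_prompts.json")
    filter_path = os.path.join(dataset_root, "filter", f"SD-{version}-CFG={cfg}.json")
    with open(prompts_path) as f:
        prompts = json.load(f)
    with open(filter_path) as f:
        filter_spec = json.load(f)  # schema of the filter file is not documented here
    return image_dir, prompts, filter_spec
```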
The repository is organized as follows:

```
.
├── diffusion_itm.py       # Main evaluation script
├── datasets_loading.py    # Dataset loading utilities
├── utils.py               # Utility functions
├── results/               # Output directory for results
└── diffusers/             # Modified diffusers library
    └── src/
        └── diffusers/
            └── schedulers/    # Custom schedulers
```
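Because the repository ships its own modified copy of diffusers, that fork must take precedence over any pip-installed release. One way to verify which copy is being imported (an illustrative check, not necessarily what the scripts do internally):

```python
import sys

# Prefer the bundled fork over a site-packages installation.
sys.path.insert(0, "diffusers/src")

import diffusers

# Should print a path inside ./diffusers/src/diffusers when the fork is active.
print(diffusers.__file__)
```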
- Python 3.8+
- PyTorch
- diffusers
- transformers
- wandb (optional, for logging)
- accelerate
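Assuming the dependencies match their usual PyPI names, and that the bundled diffusers fork should be installed in place of the PyPI release, setup might look like:

```
pip install torch transformers accelerate wandb
pip install -e ./diffusers   # the bundled, modified fork
```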
- For SD3 models, you can use `--sd3_resize` to enable 512x512 resizing
- Use `--use_normed_classifier` for normalized classifier evaluation
- For compositional tasks, specify the subset using `--comp_subset` (color, shape, texture, complex, spatial, non_spatial)
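As an illustration, the tips above could combine into a single invocation like the following (a hypothetical, untested combination of flags):

```
python diffusion_itm.py --task cola_multi --version 3-m --sd3_resize --comp_subset color
```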
If you use this code in your research, please cite the original Self-bench repository and this work.
This project is licensed under the same terms as the original McGill-NLP/diffusion-itm repository, upon which it builds. Please refer to their license for more details.
This project builds upon the work from the McGill-NLP/diffusion-itm repository. We gratefully acknowledge their contributions to the field.
Please find the original repository here: https://github.com/McGill-NLP/diffusion-itm