This project provides a framework for evaluating diffusion models on a wide range of image-text matching tasks. It supports multiple Stable Diffusion versions and various benchmark datasets.
The framework supports the following model versions:
- `1.5`: Stable Diffusion v1.5
- `2.0`: Stable Diffusion v2.0
- `3-m`: Stable Diffusion 3 Medium
- `3-lt`: Stable Diffusion 3 Large Turbo (distilled model)
- `flux`: Flux model
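For reference, these version flags presumably resolve to Hugging Face checkpoints inside `diffusion_itm.py`. Below is a minimal sketch of such a mapping, assuming the repository IDs implied by the dataset directory names shown later in this README; the `3-lt` and `flux` entries are guesses and may differ from the actual code:

```python
# Hypothetical sketch of how --version strings might map to Hugging Face
# checkpoints. The first three IDs mirror the dataset directory names used
# in the layout below; the last two are assumptions.
VERSION_TO_CHECKPOINT = {
    "1.5": "runwayml/stable-diffusion-v1-5",
    "2.0": "stabilityai/stable-diffusion-2-base",
    "3-m": "stabilityai/stable-diffusion-3-medium-diffusers",
    "3-lt": "stabilityai/stable-diffusion-3.5-large-turbo",  # assumption
    "flux": "black-forest-labs/FLUX.1-dev",                  # assumption
}
```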
The framework supports numerous benchmark tasks including:
- `cola_multi`: Compositional Language tasks
- `vg_relation`: Visual Genome relations
- `vg_attribution`: Visual Genome attributes
- `coco_order`: COCO ordering tasks
- `winoground`: Winoground benchmark
- `flickr30k`: Flickr30k dataset
- `flickr30k_text`: Flickr30k text-only evaluation
- `imagenet`: ImageNet evaluation
- `clevr`: CLEVR dataset
- `pets`: Oxford Pets dataset
- `mmbias`: Multimodal bias evaluation
- `genderbias`: Gender bias evaluation
- `eqbench`: Equality benchmark
- `vismin`: Visual minimal-change evaluation
- Color Tasks
  - `geneval_color`: Basic color understanding
  - `geneval_color_attr`: Color attribution
- Position Tasks
  - `geneval_position`: Spatial understanding
- Counting Tasks
  - `geneval_counting`: Object counting
- Object Tasks
  - `geneval_single`: Single object understanding
  - `geneval_two`: Two object understanding
  - `geneval_two_subset`: Subset of two object tasks
```
python diffusion_itm.py --task TASK_NAME --version MODEL_VERSION
```

- `--task`: Specify the benchmark task (required)
- `--version`: Model version to use (required)
- `--batchsize`: Batch size (default: 64)
- `--sampling_steps`: Number of sampling steps (default: 30)
- `--guidance_scale`: Guidance scale for generation (default: 0.0)
- `--img_retrieval`: Enable image retrieval mode
- `--encoder_drop`: Drop encoder for certain tasks
- `--save`: Save results
- `--wandb`: Enable Weights & Biases logging
- Basic evaluation with SD 3 Medium:

```
python diffusion_itm.py --task winoground --version 3-m
```

- Compositional evaluation with a specific subset:

```
python diffusion_itm.py --task cola_multi --version compdiff --comp_subset color
```

- Image retrieval mode:

```
python diffusion_itm.py --task flickr30k --version 2.0 --img_retrieval
```

- Full evaluation with saving:

```
python diffusion_itm.py --task clevr --version 3-m --save --save_results --wandb --batchsize 16
```

When running experiments with Self-bench datasets, you need to specify:
- The model version (`--geneval_version`): choose from "1.5", "2.0", "3-m", or "flux"
- The CFG value (`--geneval_cfg`): default is 9.0
- The filter flag (`--geneval_filter`): set to "True" or "False"
Example command for running a Geneval task:
```
python diffusion_itm.py --task geneval_color --version 2.0 --geneval_version 2.0 --geneval_cfg 9.0 --geneval_filter True
```

The dataset should be organized as follows:
```
dataset_root/
├── 9.0/                                      # CFG value
│   ├── stable-diffusion-v1-5/
│   ├── stable-diffusion-2-base/
│   └── stable-diffusion-3-medium-diffusers/
├── prompts/
│   └── zero_shot_prompts.json
└── filter/
    └── SD-{version}-CFG={cfg}.json
```
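To illustrate how the layout above fits together, here is a minimal, hypothetical sketch of resolving the image directory, prompts, and filter file for a run; the actual loading logic lives in `datasets_loading.py` and may differ:

```python
import json
import os

def resolve_geneval_paths(dataset_root, version, cfg=9.0):
    """Hypothetical helper: build the paths implied by the layout above."""
    # Map a --geneval_version value to its checkpoint directory name.
    version_dirs = {
        "1.5": "stable-diffusion-v1-5",
        "2.0": "stable-diffusion-2-base",
        "3-m": "stable-diffusion-3-medium-diffusers",
    }
    image_dir = os.path.join(dataset_root, str(cfg), version_dirs[version])
    prompts_path = os.path.join(dataset_root, "prompts", "zero_shot_prompts.json")
    filter_path = os.path.join(dataset_root, "filter", f"SD-{version}-CFG={cfg}.json")
    with open(prompts_path) as f:
        prompts = json.load(f)
    with open(filter_path) as f:
        filter_spec = json.load(f)  # schema of the filter file is not documented here
    return image_dir, prompts, filter_spec
```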
The repository is organized as follows:

```
.
├── diffusion_itm.py       # Main evaluation script
├── datasets_loading.py    # Dataset loading utilities
├── utils.py               # Utility functions
├── results/               # Output directory for results
└── diffusers/             # Modified diffusers library
    └── src/
        └── diffusers/
            └── schedulers/    # Custom schedulers
```
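Because the repository ships its own modified copy of diffusers, that fork must take precedence over any pip-installed release. One way to verify which copy is being imported (an illustrative check, not necessarily what the scripts do internally):

```python
import sys

# Prefer the bundled fork over a site-packages installation.
sys.path.insert(0, "diffusers/src")

import diffusers

# Should print a path inside ./diffusers/src/diffusers when the fork is active.
print(diffusers.__file__)
```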
- Python 3.8+
- PyTorch
- diffusers
- transformers
- wandb (optional, for logging)
- accelerate
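Assuming the dependencies match their usual PyPI names, and that the bundled diffusers fork should be installed in place of the PyPI release, setup might look like:

```
pip install torch transformers accelerate wandb
pip install -e ./diffusers   # the bundled, modified fork
```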
- For SD3 models, you can use `--sd3_resize` to enable 512x512 resizing
- Use `--use_normed_classifier` for normalized classifier evaluation
- For compositional tasks, specify the subset using `--comp_subset` (color, shape, texture, complex, spatial, non_spatial)
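As an illustration, the tips above could combine into a single invocation like the following (a hypothetical, untested combination of flags):

```
python diffusion_itm.py --task cola_multi --version 3-m --sd3_resize --comp_subset color
```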
If you use this code in your research, please cite the original Self-bench repository and this work.
This project is licensed under the same terms as the original McGill-NLP/diffusion-itm repository, upon which it builds. Please refer to their license for more details.
This project builds upon the work from the McGill-NLP/diffusion-itm repository. We gratefully acknowledge their contributions to the field.
Please find the original repository here: https://github.com/McGill-NLP/diffusion-itm