EdiVal-Agent: An Object-Centric Framework for Automated, Fine-Grained Evaluation of Multi-Turn Editing
Want to run EdiVal-Agent on your own images? Jump to the Bring Your Own Images section for a step-by-step walkthrough.
Project Website • Hugging Face Repository
Welcome to the official repository for EdiVal-Agent: An Object-Centric Framework for Automated, Fine-Grained Evaluation of Multi-Turn Editing (arXiv:2509.13399). With the toolkit in this repo you can generate fresh multi-turn instructions, run your own editing models (or ours) against the benchmark, evaluate instruction-following, consistency, and quality across turns, and reproduce every experiment from the paper with the provided scripts and notebooks.
Overview
- Goal: benchmark instruction-following, consistency and perceptual quality in sequential (multi-turn) image editing.
- Inputs: 512×512 images and curated 3-turn editing instructions.
- Outputs: multipass & singlepass generations, instruction-following scores, consistency metrics, and quality scores (including optional HPSv3).
- env_setup/ – Conda environment specification (env.yaml) and bootstrap script (setup_edival.sh).
- generate_instructions/ – Full instruction pipeline: object parsing, grounding filter, CSV export, and candidate pools.
- generate.py – Runs your editor over the instructions to generate edited images, with Qwen-Image-Edit as the example model.
- baseline_generate/ – Historical baseline scripts retained for comparison, including GPT-Image-1 and the Flux models.
- detector/ – Instruction-following, consistency, and quality evaluation modules.
- eval.py / eval_bash.sh – Core evaluator and batch helper.
- example_evaluate_results/ – Reference outputs for sanity checking; your output should have a similar structure.
- analysis.ipynb – Notebook used to analyze your final results in example_evaluate_results/.
- oai_instruction_generation_output.csv – Sample 3-turn instruction CSV; these are the instructions used in the paper.
- update_hps_scores.py – Utility to backfill HPSv3 scores into evaluation JSONs. Use this script if you need the HPSv3 quality score, since the HPSv3 environment conflicts with the other metrics.
# 1. Create the environment (once)
bash env_setup/setup_edival.sh
# 2. Activate for every new shell
conda activate edival

The bootstrap script installs PyTorch (CUDA 12.1), GroundingDINO (editable mode), diffusers, vLLM dependencies, and all evaluation packages. Modify env_setup/env.yaml if you need different versions.
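Once the environment is active, a quick sanity check can confirm the GPU stack is visible. This is a minimal sketch using only standard PyTorch calls; nothing in it is specific to EdiVal:

import torch

# Confirm the CUDA build installed by setup_edival.sh is usable
print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))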
All assets live in the Hugging Face repository C-Tianyu/EdiVal:
- input_images_resize_512.zip – canonical 512×512 image set.
- baseline_generations/* – pre-generated outputs: GPT-Image-1, Nano Banana, SeedDream v4, etc.
Download the resources you need and place them at the repository root (paths can be overridden via CLI flags).
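If you prefer scripting the download, here is a minimal sketch using huggingface_hub. It assumes the assets are published as a dataset repo under C-Tianyu/EdiVal; adjust repo_type (and the pattern filter) if the Hub page says otherwise:

from huggingface_hub import snapshot_download

# Fetch only the canonical image archive into the repository root
snapshot_download(
    repo_id="C-Tianyu/EdiVal",
    repo_type="dataset",  # assumption: assets are hosted as a dataset repo
    allow_patterns=["input_images_resize_512.zip"],
    local_dir=".",
)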
All scripts reside in generate_instructions/. Candidate vocabularies are stored in generate_instructions/candidate_pools/.
Before you start: set OPENAI_API_KEY (and optionally OPENAI_API_BASE if you use a custom endpoint).
- Object Extraction

  export OPENAI_API_KEY=sk-...
  python generate_instructions/oai_all_objects.py \
    --input-dir input_images_resize_512 \
    --output-dir generate_instructions/oai_all_objects

  Produces rich JSON metadata (<index>_input_raw.json) for every image.

- Grounding Filter

  python generate_instructions/grounding_filter.py \
    --input-dir generate_instructions/oai_all_objects \
    --output-dir generate_instructions/grounding_all_objects \
    --image-dir input_images_resize_512 \
    --num-gpus 2 \
    --box-threshold 0.35 \
    --text-threshold 0.35

  Uses GroundingDINO to keep only visually grounded objects, adding bounding boxes and counts.

- CSV Export

  export OPENAI_API_KEY=sk-...
  python generate_instructions/oai_instruction_generator.py \
    --grounding-dir generate_instructions/grounding_all_objects \
    --input-images input_images_resize_512 \
    --output oai_instruction_generation_output.csv \
    --seed 42

  Generates the multi-turn instruction CSV used downstream. Candidate pools come from generate_instructions/candidate_pools/*.txt; regenerate them with generate_instructions/candidate_pools/generate_objects_txt.py if you need to refresh vocabularies. A quick way to inspect the exported CSV follows this list.
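Before handing the CSV to an editor, a quick inspection can catch obvious problems. This is a minimal sketch; it makes no assumptions about the column schema, which depends on your run:

import pandas as pd

# Load the exported instruction CSV and show its size, columns, and first rows
df = pd.read_csv("oai_instruction_generation_output.csv")
print(df.shape)
print(list(df.columns))
print(df.head(3))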
Run Qwen/Qwen-Image-Edit (or your own editor) over the instructions:
python generate.py \
--csv oai_instruction_generation_output.csv \
--zip input_images_resize_512.zip \
--output-dir your_generations \
--num-gpus 2

Outputs land in your_generations/multipass and your_generations/singlepass. To plug in a custom model, implement a class that exposes generate_single_edit and point to it via:

python generate.py --editor-class my_module:MyCustomGenerator ...
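A minimal sketch of such a class is shown below. The exact arguments generate.py passes to generate_single_edit are not spelled out here, so the signature is an assumption; check the call site in generate.py and adapt accordingly:

from PIL import Image

class MyCustomGenerator:
    """Hypothetical editor loaded via --editor-class my_module:MyCustomGenerator."""

    def __init__(self, device: str = "cuda"):
        # Load your editing pipeline here (diffusion, autoregressive, API-backed, ...).
        self.device = device

    def generate_single_edit(self, image: Image.Image, instruction: str, **kwargs) -> Image.Image:
        # Apply a single editing instruction to one input image and return the result.
        # Replace this stub with a real call into your model.
        return image.copy()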
- Single Folder

  python eval.py \
    --generation_folder your_generations \
    --modes multipass singlepass

  Computes instruction-following, consistency, and quality metrics (HPSv3 if available) and writes JSON summaries to evaluate_results/your_generations/ (see the inspection sketch after this list).

- Batch Mode

  bash eval_bash.sh

  Adjust BASE_DIR, JOBS, and GPU_GROUPS at the top of the script to suit your hardware; the helper loops over all subfolders in BASE_DIR.

- Fill in HPSv3 (Optional)

  python update_hps_scores.py \
    --results_root evaluate_results/your_generations \
    --num_gpus 2 \
    --batch_size 4

  Backfills missing HPSv3 scores into the evaluation JSONs.
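To skim what the evaluator wrote, the sketch below walks the results directory and prints the top-level keys of each JSON summary. It assumes nothing about the file names or schema beyond "JSON files under evaluate_results/":

import json
from pathlib import Path

# Print the top-level keys of every JSON summary produced by eval.py
results_root = Path("evaluate_results/your_generations")
for json_path in sorted(results_root.rglob("*.json")):
    with open(json_path) as f:
        summary = json.load(f)
    keys = list(summary) if isinstance(summary, dict) else type(summary).__name__
    print(json_path, "->", keys)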
- analysis.ipynb – Aggregates evaluation outputs and generates the plots/tables from the paper.
- example_evaluate_results/ – Reference outputs to verify your setup.
Launch Jupyter or VS Code within the edival environment to explore the results interactively.
Want to evaluate your own dataset? Follow the same three-stage pipeline used for EdiVal's release.
- Prepare Inputs
  - Collect/source your raw images (ideally resized to 512×512 for parity with the benchmark).
  - Place the ZIP (or directory) alongside the existing input_images_resize_512.zip, or point the scripts to your custom location via CLI flags.
- Generate Instructions
  - Run generate_instructions/oai_all_objects.py to extract object metadata for every image.
  - Filter the results with generate_instructions/grounding_filter.py to keep grounded objects.
  - Export multi-turn instructions using generate_instructions/oai_instruction_generator.py, targeting a new CSV (e.g. my_dataset_instructions.csv).
- Run Editors
  - Use generate.py (or your own editor runner) to create multipass/singlepass generations from the new instruction CSV.
  - Custom editors can be plugged in via --editor-class if you want to benchmark alternative diffusion or autoregressive systems.
- Evaluate
  - Invoke eval.py --generation_folder <path> (and optionally eval_bash.sh for batch jobs) to produce instruction-following, consistency, and quality metrics.
  - If desired, run update_hps_scores.py to backfill HPSv3 scores.
  - Feed the outputs into analysis.ipynb to generate the same tables/figures used in the paper.
That's it—by mirroring the release pipeline you can obtain apples-to-apples results for any custom image collection.
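If you want a single entry point that chains these stages, here is a minimal driver sketch. The paths my_images/, my_images.zip, my_dataset_instructions.csv, and my_generations are placeholders, OPENAI_API_KEY must already be exported, and the GPU counts should match your machine:

import subprocess

# Illustrative end-to-end driver mirroring the release pipeline; each step shells
# out to the scripts documented above, using placeholder paths.
steps = [
    "python generate_instructions/oai_all_objects.py "
    "--input-dir my_images --output-dir generate_instructions/oai_all_objects",
    "python generate_instructions/grounding_filter.py "
    "--input-dir generate_instructions/oai_all_objects "
    "--output-dir generate_instructions/grounding_all_objects "
    "--image-dir my_images --num-gpus 2",
    "python generate_instructions/oai_instruction_generator.py "
    "--grounding-dir generate_instructions/grounding_all_objects "
    "--input-images my_images --output my_dataset_instructions.csv --seed 42",
    "python generate.py --csv my_dataset_instructions.csv --zip my_images.zip "
    "--output-dir my_generations --num-gpus 2",
    "python eval.py --generation_folder my_generations --modes multipass singlepass",
]

for cmd in steps:
    print(f">>> {cmd}")
    subprocess.run(cmd, shell=True, check=True)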
@article{ediValAgent2025,
title = {EdiVal-Agent: An Object-Centric Framework for Automated, Fine-Grained Evaluation of Multi-Turn Editing},
author = {Tianyu Chen and Yasi Zhang and Zhi Zhang and Peiyu Yu and Shu Wang and Zhendong Wang and Kevin Lin and Xiaofei Wang and Zhengyuan Yang and Linjie Li and Chung-Ching Lin and Jianwen Xie and Oscar Leong and Lijuan Wang and Ying Nian Wu and Mingyuan Zhou},
journal = {arXiv preprint arXiv:2509.13399},
year = {2025}
}
- Latest checkpoints, instruction files, and baseline generations: Hugging Face C-Tianyu/EdiVal
- Project updates and interactive demos: EdiVal Website
- Pull requests and issues are welcome if you build new editors, metrics, or analysis tools on top of this codebase.
