We introduce the task of visual causal discovery, which requires models to infer cause-and-effect relations among visual entities across diverse scenarios instead of merely perceiving their presence. We first construct the Visual Causal Graph dataset (VCG-32K), a large-scale collection of over 32,000 images annotated with entity-level causal graphs, and further develop CauSight, a novel vision-language model that performs visual causal discovery through causally aware reasoning. Our training recipe integrates three components: (1) training data curation from VCG-32K, (2) Tree-of-Causal-Thought (ToCT) for synthesizing reasoning trajectories, and (3) reinforcement learning with a designed causal reward to refine the reasoning policy. Experiments show that CauSight outperforms GPT-4.1 on visual causal discovery, achieving over a threefold performance boost.
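The exact causal reward used in step (3) is defined in the paper rather than in this README; purely as an illustrative sketch (not the actual training reward), one simple choice is edge-level F1 between the predicted and annotated causal graphs:

```python
# Illustrative sketch only: an edge-level F1 reward between a predicted causal
# graph and the annotated ground truth. The reward actually used for RL training
# may differ; see the paper and code for the exact definition.
def causal_reward(pred_edges, gold_edges):
    """Edges are (cause, effect) pairs over visual entities."""
    pred, gold = set(pred_edges), set(gold_edges)
    if not pred or not gold:
        return 0.0
    tp = len(pred & gold)
    precision = tp / len(pred)
    recall = tp / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: one correct edge and one spurious edge -> reward 0.5
print(causal_reward([("person", "ball"), ("ball", "window")],
                    [("person", "ball"), ("ball", "vase")]))
```

An F1-style reward penalizes both missed and spurious causal edges, rather than rewarding edge count alone.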
To get started, clone the repository:

```bash
git clone https://github.com/OpenCausaLab/CauSight.git
cd CauSight
```

We recommend using conda:

```bash
conda create -n causight python=3.11
conda activate causight
pip install -r requirements.txt
pip install -e .
```
Download the VCG-32K dataset and unpack the images:

```bash
mkdir -p VCG-32K
pip install huggingface_hub
hf login
hf download OpenCausaLab/VCG-32K \
    --repo-type dataset \
    --local-dir ./VCG-32K
tar -xzf ./VCG-32K/COCO/images.tar.gz -C ./VCG-32K/COCO
tar -xzf ./VCG-32K/365/images.tar.gz -C ./VCG-32K/365
```
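The annotation schema is documented in the dataset itself; as a rough sketch, assuming each image comes with a JSON-style record listing its entities and directed causal edges (the file path and field names below are hypothetical), loading and inspecting one graph might look like:

```python
import json

# Hypothetical layout: the path and the "entities"/"causal_edges" field names are
# assumptions for illustration only; check the downloaded VCG-32K files for the real schema.
with open("./VCG-32K/COCO/annotations.json") as f:
    records = json.load(f)

record = records[0]
print("image:", record["image"])
print("entities:", record["entities"])          # e.g. ["person", "ball", "window"]
for cause, effect in record["causal_edges"]:    # e.g. [["person", "ball"], ["ball", "window"]]
    print(f"{cause} -> {effect}")
```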
Download the CauSight model checkpoint:

```bash
mkdir -p model
huggingface-cli download OpenCausaLab/CauSight \
    --repo-type model \
    --local-dir ./model
```

Start the model server, then run inference:

```bash
bash model_server.sh
python run_inference.py
```
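`run_inference.py` already handles querying the served model; purely as a sketch of what a single request could look like, assuming `model_server.sh` exposes an OpenAI-compatible chat endpoint on `localhost:8000` (an assumption; check the script for the actual host, port, and model name):

```python
import base64
import requests

# Assumed endpoint and model name; adjust to whatever model_server.sh actually serves.
URL = "http://localhost:8000/v1/chat/completions"

# Any local image; the path is just an example.
with open("example.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

payload = {
    "model": "CauSight",
    "messages": [{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            {"type": "text",
             "text": "List the cause-and-effect relations between entities in this image."},
        ],
    }],
}
response = requests.post(URL, json=payload, timeout=120)
print(response.json()["choices"][0]["message"]["content"])
```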
If you want to make your own SFT data with Tree-of-Causal-Thought, run:

```bash
bash model_server.sh
python run.py
```
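The actual pipeline lives in `run.py`; the sketch below only illustrates the general tree-search idea behind Tree-of-Causal-Thought, namely proposing several candidate reasoning steps per node and keeping the highest-scoring branches. The `propose_step` and `score_fn` callables are placeholders, not the repo's code:

```python
from dataclasses import dataclass, field

@dataclass
class ThoughtNode:
    """One partial causal-reasoning trajectory for an image."""
    steps: list = field(default_factory=list)
    score: float = 0.0

def expand(node, propose_step, score_fn, branching=3):
    """Propose several candidate next reasoning steps and score each branch."""
    children = []
    for _ in range(branching):
        step = propose_step(node.steps)        # placeholder: e.g. a model call
        child = ThoughtNode(node.steps + [step])
        child.score = score_fn(child.steps)    # placeholder: e.g. a causal-consistency check
        children.append(child)
    return children

def tree_of_causal_thought(propose_step, score_fn, depth=3, beam=2, branching=3):
    """Beam-style search over reasoning trajectories; returns the best one found."""
    frontier = [ThoughtNode()]
    for _ in range(depth):
        candidates = [c for node in frontier
                        for c in expand(node, propose_step, score_fn, branching)]
        frontier = sorted(candidates, key=lambda n: n.score, reverse=True)[:beam]
    return max(frontier, key=lambda n: n.score)

# Toy usage with dummy placeholders
best = tree_of_causal_thought(
    propose_step=lambda steps: f"step-{len(steps) + 1}",
    score_fn=lambda steps: len(steps),
)
print(best.steps)
```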
If you find our work useful, please cite:

```bibtex
@article{zhang2025causight,
  title={CauSight: Learning to Supersense for Visual Causal Discovery},
  author={Zhang, Yize and Chen, Meiqi and Chen, Sirui and Peng, Bo and Zhang, Yanxi and Li, Tianyu and Lu, Chaochao},
  journal={arXiv preprint arXiv:2512.01827},
  year={2025}
}
```
