georgong/VLG-CBM
Result Replication of Vision-Language-Guided Concept Bottleneck Model (VLG-CBM) (NeurIPS 2024)

This repository reproduces the paper VLG-CBM: Training Concept Bottleneck Models with Vision-Language Guidance, NeurIPS 2024. [Original Paper] [Original Repository]

Our Medium article: Medium Article

  • VLG-CBM provides a novel method to train Concept Bottleneck Models (CBMs) with guidance from both the vision and language domains.
  • VLG-CBM provides concise and accurate concept attributions for the decisions made by the model. The following figure compares the decision explanations of VLG-CBM with those of existing methods by listing the top five contributions to their decisions.

Decision Explanation
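
The explanation style shown above can be sketched numerically: the contribution of a concept to a prediction is the product of its bottleneck activation and the corresponding weight in the sparse final layer, and the five largest contributions are reported. The concept names, shapes, and values below are illustrative, not the repository's actual API or data.

```python
import numpy as np

rng = np.random.default_rng(0)

concepts = ["black wing", "yellow belly", "short beak", "red crown", "white eye-ring"]
# Hypothetical concept activations for one image and the sparse
# final-layer weight row for the predicted class.
activations = rng.random(len(concepts))          # concept-bottleneck outputs
class_weights = rng.normal(size=len(concepts))   # linear-head row for the class

# Contribution of each concept to the class logit.
contributions = activations * class_weights

# Top-five concepts by absolute contribution, as in the figure above.
top5 = sorted(zip(concepts, contributions), key=lambda t: abs(t[1]), reverse=True)[:5]
for name, value in top5:
    print(f"{name}: {value:+.3f}")
```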

Setup

  1. Set up the conda environment and install dependencies:

     conda create -n vlg-cbm python=3.12
     conda activate vlg-cbm
     pip install -r requirements.txt

  2. (Optional) Install Grounding DINO for generating annotations on custom datasets:

     git clone https://github.com/IDEA-Research/GroundingDINO
     cd GroundingDINO
     pip install -e .
     wget https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha2/groundingdino_swinb_cogcoor.pth
     cd ..

Quick Start

We provide scripts to download and evaluate pretrained models for CIFAR10, CIFAR100, CUB200, Places365, and ImageNet. To quickly evaluate the pretrained models, follow the steps below:

  1. Download pretrained models from here, unzip them, and place them in the saved_models folder.
  2. Run the evaluation script to evaluate the pretrained models under different NEC values and obtain the Accuracy at NEC (ANEC) for each dataset:

     python sparse_evaluation.py --load_path <path-to-model-dir>

For example, to evaluate the pretrained model for CUB200, run

python sparse_evaluation.py --load_path saved_models/cub

Training

Overview

VLG-CBM Overview

Annotation Generation (Optional)

To train VLG-CBM, images must first be annotated with concepts by a vision-language model; this work uses Grounding DINO for annotation generation. Use the following command to generate annotations for a dataset:

python -m scripts.generate_annotations --dataset <dataset-name> --device cuda --batch_size 32 --text_threshold 0.15 --output_dir annotations

Note: Supported datasets include cifar10, cifar100, cub, places365, and imagenet. The generated annotations are saved under the annotations folder.
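
Conceptually, the annotation step turns detector output into multi-label concept targets for each image. The sketch below assumes a hypothetical detection format (concept phrase plus confidence score); the actual files written by scripts.generate_annotations may be structured differently.

```python
# Hypothetical post-processing of Grounding DINO detections into concept targets.
concept_bank = ["black wing", "yellow belly", "short beak", "red crown"]
concept_index = {name: i for i, name in enumerate(concept_bank)}

# Assumed detection format: (concept phrase, confidence) pairs for one image.
detections = [("black wing", 0.42), ("short beak", 0.12), ("yellow belly", 0.31)]

TEXT_THRESHOLD = 0.15  # mirrors --text_threshold in the command above

# Binary multi-label target: 1 if any detection of the concept clears the threshold.
target = [0] * len(concept_bank)
for phrase, score in detections:
    if score >= TEXT_THRESHOLD and phrase in concept_index:
        target[concept_index[phrase]] = 1

print(target)  # → [1, 1, 0, 0]; "short beak" falls below the threshold
```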

Training Pipeline

  1. Download the annotated data from here, unzip it, and place it in the annotations folder, or generate it with Grounding DINO as described in the previous section.

  2. All datasets must be placed in a single folder specified by the environment variable $DATASET_FOLDER. By default, $DATASET_FOLDER is set to datasets.

Note: To download and process the CUB dataset, run bash download_cub.sh and move the resulting folder under $DATASET_FOLDER. To use the ImageNet dataset, you need to download it yourself and put it under $DATASET_FOLDER. The other datasets can be downloaded automatically by Torchvision.

  3. Train a concept bottleneck model using the config files in ./configs. For instance, to train a CUB model, run the following command:

     python train_cbm.py --config configs/cub.json --annotation_dir annotations

Evaluate trained models

The Number of Effective Concepts (NEC) must be controlled to enable a fair comparison of model performance. To evaluate a trained model under different NEC values, run the following command:

python sparse_evaluation.py --load_path <path-to-model-dir> --lam <lambda-value>
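
Informally, NEC is the average number of nonzero concept weights per class in the final sparse layer, and sweeping the regularization strength lambda trades accuracy against NEC. Below is a minimal sketch of computing NEC and of reducing a dense head to a target NEC by per-class magnitude pruning; this is an illustrative stand-in, not the repository's exact procedure.

```python
import numpy as np

def nec(weight: np.ndarray) -> float:
    """Average number of nonzero concept weights per class (rows = classes)."""
    return float((weight != 0).sum(axis=1).mean())

def prune_to_nec(weight: np.ndarray, target: int) -> np.ndarray:
    """Keep only the `target` largest-magnitude weights in each class row."""
    pruned = np.zeros_like(weight)
    for c in range(weight.shape[0]):
        keep = np.argsort(-np.abs(weight[c]))[:target]
        pruned[c, keep] = weight[c, keep]
    return pruned

rng = np.random.default_rng(0)
W = rng.normal(size=(200, 4096))   # e.g. 200 CUB classes x 4096 concepts
W5 = prune_to_nec(W, target=5)
print(nec(W), nec(W5))             # dense head vs. NEC = 5
```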

Scripts for replication

Run the scripts provided here to reproduce some of the results in the paper.

Results

Accuracy at NEC=5 (ANEC-5) for non-CLIP backbone models

| Method | CIFAR10 | CIFAR100 | CUB200 | Places365 | ImageNet |
| --- | --- | --- | --- | --- | --- |
| Random | 67.55% | 29.52% | 68.91% | 17.57% | 41.49% |
| LF-CBM | 84.05% | 56.52% | 53.51% | 37.65% | 60.30% |
| LM4CV | 53.72% | 14.64% | N/A | N/A | N/A |
| LaBo | 78.69% | 44.82% | N/A | N/A | N/A |
| VLG-CBM (Ours) | 88.55% | 65.73% | 75.79% | 41.92% | 73.15% |

Accuracy at NEC=5 (ANEC-5) for CLIP backbone models

| Method | CIFAR10 | CIFAR100 | ImageNet | CUB |
| --- | --- | --- | --- | --- |
| Random | 67.55% | 29.52% | 18.04% | 25.37% |
| LF-CBM | 84.05% | 56.52% | 52.88% | 31.35% |
| LM4CV | 53.72% | 14.64% | 3.77% | 3.63% |
| LaBo | 78.69% | 44.82% | 24.27% | 41.97% |
| VLG-CBM (Ours) | 88.55% | 65.73% | 59.74% | 60.38% |

Explainable Decisions

Visualization of activated images

Result Replication

To replicate our results, we ran the following experiments:

ANEC-5 and ANEC-avg across CIFAR-10, CIFAR-100, and CUB
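
The two replication metrics relate simply: ANEC-5 is the accuracy measured at NEC = 5, while ANEC-avg averages accuracy over a grid of NEC values. The sketch below uses a made-up NEC grid and made-up accuracies purely to show the arithmetic; these are not paper results.

```python
# Illustrative ANEC computation: accuracy measured at several NEC values.
# The NEC grid and accuracies below are invented numbers, not paper results.
accuracy_at_nec = {5: 0.755, 10: 0.771, 15: 0.778, 20: 0.781, 25: 0.783, 30: 0.784}

anec_5 = accuracy_at_nec[5]                                   # accuracy at NEC = 5
anec_avg = sum(accuracy_at_nec.values()) / len(accuracy_at_nec)

print(f"ANEC-5 = {anec_5:.3f}, ANEC-avg = {anec_avg:.3f}")
```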


Weight Pruning


About

[NeurIPS 24] A new training and evaluation framework for learning interpretable deep vision models and benchmarking different interpretable concept-bottleneck-models (CBMs)
