
Vision-Language Retrieval Project

Usage

Run the training script from the project root using one of the configuration files:

```bash
# Baseline with frozen backbones
python src/main.py --config configs/baseline_frozen.yaml

# LoRA fine-tuning (rank 16)
python src/main.py --config configs/lora_r16.yaml

# Full fine-tuning
python src/main.py --config configs/full_finetune.yaml
```

Configuration Parameters

Configuration is managed via YAML files. Key parameters include:

Model

| Parameter | Description | Default/Example |
|---|---|---|
| `image_model_name` | Name of the image backbone (torchvision or HF) | `resnet50` |
| `text_model_name` | Name of the text backbone (HF) | `bert-base-uncased` |
| `embed_dim` | Dimension of the shared embedding space | `512` |
| `freeze_backbones` | Whether to freeze pre-trained weights | `true`/`false` |
| `use_lora` | Enable LoRA fine-tuning | `true`/`false` |
| `lora_r` | LoRA rank | `16` |
| `lora_target_modules` | Modules to apply LoRA to | `["query", "value"]` |
| `load_in_4bit` | Enable 4-bit quantization (QLoRA) | `false` |
| `load_in_8bit` | Enable 8-bit quantization | `false` |
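Putting these parameters together, a model section of a config file might look like the sketch below. This is illustrative only; the exact key nesting in this repository's YAML files may differ.

```yaml
model:
  image_model_name: resnet50
  text_model_name: bert-base-uncased
  embed_dim: 512
  freeze_backbones: false
  use_lora: true
  lora_r: 16
  lora_target_modules: ["query", "value"]
  load_in_4bit: false
  load_in_8bit: false
```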

Training

| Parameter | Description | Default/Example |
|---|---|---|
| `loss` | Loss function to use | `contrastive` |
| `mixed_precision` | Enable FP16 mixed-precision training | `true` |
| `batch_size` | Training batch size | `64` |
| `num_epochs` | Number of training epochs | `5` |
| `optimizer.name` | Optimizer name | `AdamW` |
| `optimizer.params.lr` | Learning rate | `0.0001` |
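The dotted parameter names above (`optimizer.name`, `optimizer.params.lr`) suggest nested YAML keys. A training section might look like this sketch (the actual files in `configs/` may nest things differently):

```yaml
training:
  loss: contrastive
  mixed_precision: true
  batch_size: 64
  num_epochs: 5
  optimizer:
    name: AdamW
    params:
      lr: 0.0001
```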

Supported Loss Functions

Set the `training.loss` parameter in the config to one of the following:

- `contrastive`: Standard symmetric cross-entropy loss (CLIP-style).
- `contrastive_semihard`: Contrastive loss with semi-hard negative mining.
- `siglip`: Sigmoid Loss for Language-Image Pre-Training (requires `model.use_bias: true`).
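For intuition, the CLIP-style symmetric loss can be sketched in NumPy: cosine-similarity logits between every image/text pair, with cross-entropy applied both row-wise (image-to-text) and column-wise (text-to-image) against the diagonal of matching pairs. This is a minimal illustration, not the implementation in this repository (which presumably uses PyTorch and a learnable temperature).

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric cross-entropy over an (N, N) similarity matrix."""
    # L2-normalize so dot products are cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature   # (N, N); matching pairs on the diagonal
    labels = np.arange(len(img))

    def cross_entropy(lg):
        # log-softmax per row, then pick the diagonal (correct-pair) entries
        lg = lg - lg.max(axis=1, keepdims=True)
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # average the image->text and text->image directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Correctly paired embeddings drive the diagonal logits up and the loss toward zero; mismatched pairs are penalized in both retrieval directions.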

Embedding Space Visualization

`embedding_analysis.py` visualizes the learned embedding space of a trained model using t-SNE. It fetches the model checkpoint from Weights & Biases and runs inference on a balanced subset of the COCO dataset.

Usage

```bash
python src/analysis/embedding_analysis.py --run-path <entity>/<project>/<run_id> [options]
```

Arguments

- `--run-path`: Required. The W&B run path (e.g., `username/semantic-image-search/123456`).
- `--model-filename`: Filename of the model checkpoint in W&B artifacts (default: `main.pth`).
- `--coco-annotation-file`: Path to the COCO annotations JSON (default: `data/annotations/captions_val2017.json`).
- `--coco-image-dir`: Path to the COCO images directory (default: `data/val2017`).
- `--samples-per-category`: Number of samples to select per category (default: `20`).
- `--categories`: List of categories to visualize (default: `cat dog car pizza`).
- `--output-dir`: Directory to save the plot (default: `analysis_results`).
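The interface described above could be declared with `argparse` roughly as follows. This is a hypothetical reconstruction for reference; the actual script may define its arguments differently.

```python
import argparse

def build_parser():
    # Hypothetical mirror of the CLI documented above
    p = argparse.ArgumentParser(
        description="Visualize the learned embedding space with t-SNE")
    p.add_argument("--run-path", required=True,
                   help="W&B run path: <entity>/<project>/<run_id>")
    p.add_argument("--model-filename", default="main.pth")
    p.add_argument("--coco-annotation-file",
                   default="data/annotations/captions_val2017.json")
    p.add_argument("--coco-image-dir", default="data/val2017")
    p.add_argument("--samples-per-category", type=int, default=20)
    p.add_argument("--categories", nargs="+",
                   default=["cat", "dog", "car", "pizza"])
    p.add_argument("--output-dir", default="analysis_results")
    return p
```

`nargs="+"` lets `--categories` accept a space-separated list, matching the example invocation below.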

Example

```bash
python src/analysis/embedding_analysis.py \
    --run-path ebrahimpichka/semantic-image-search/3x8j9k2l \
    --categories cat dog car \
    --samples-per-category 50
```

About

Code for the final project of the CS 7643 - Deep Learning course: "Multimodal Representation Learning for Semantic Image Retrieval"
