
# Darkbird

Multi-node distributed training examples on HPC with PyTorch DDP.

License: MIT · Python 3.11+ · PyTorch 2.4 · Hugging Face

## Experiments

All experiments were run on the Swansea Sunbird HPC cluster. The `logs/` and `runs/` directories contain the actual training logs and TensorBoard event files.

### GPU Scaling Tests (A100)

| Test | GPUs | Partition | Script |
|---|---|---|---|
| Single GPU MatMul | 1× A100-40GB | accel_ai | `test_single_gpu.py` |
| Multi-GPU MatMul (DP + DDP) | 2× A100-40GB | accel_ai | `test_multi_gpu.py` |
| Multi-GPU MatMul (DP + DDP) | 4× A100-40GB | accel_ai | `test_multi_gpu.py` |
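The actual scaling scripts run on A100s and are in the repo; purely as a CPU-side NumPy sketch of the kind of measurement a matmul throughput test performs (all names here are illustrative, not taken from `test_single_gpu.py`):

```python
import time
import numpy as np

def matmul_gflops(n: int = 1024, repeats: int = 5) -> float:
    """Time an n×n matmul and return rough throughput in GFLOP/s.

    A dense n×n matmul costs about 2·n³ floating-point operations.
    """
    a = np.random.rand(n, n).astype(np.float32)
    b = np.random.rand(n, n).astype(np.float32)
    a @ b  # warm-up, so one-time setup cost is excluded from timing
    start = time.perf_counter()
    for _ in range(repeats):
        a @ b
    elapsed = (time.perf_counter() - start) / repeats
    return (2 * n**3) / elapsed / 1e9

print(f"{matmul_gflops():.1f} GFLOP/s")
```

The GPU versions of such a test time the same kernel per device (with a synchronization before reading the clock) to compare single-GPU, DataParallel, and DDP throughput.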

### Training Runs (V100)

| Task | Model | Dataset | Hardware | Time | Best Metric |
|---|---|---|---|---|---|
| Image Generation | DDPM (U-Net) | CIFAR-10 | 3 nodes × 2 V100 | ~47 min | Loss: 0.030 |
| Face Segmentation | DeepLabV3+ (pretrained backbone) | CelebAMask-HQ | 3 nodes × 2 V100 | ~1.7 hrs | 72.72% mIoU |
| Face Segmentation | DeepLabV3+ (scratch) | CelebAMask-HQ | 3 nodes × 2 V100 | ~3.4 hrs | 75.36% mIoU |

## Results

### Diffusion (DDPM on CIFAR-10)

*(Figures: generated samples and denoising process)*

#### Training Curve

*(Figure: diffusion loss curve)*

### Face Segmentation (CelebAMask-HQ)

*(Figure: segmentation comparison)*

| Model | Epochs | mIoU | Pixel Acc | Mean Dice |
|---|---|---|---|---|
| DeepLabV3+ (pretrained) | 100 | 72.72% | 95.06% | 82.92% |
| DeepLabV3+ (scratch) | 200 | 75.36% | 94.78% | 84.96% |
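The repo's exact metric code isn't shown here; as a minimal NumPy sketch of how mIoU, pixel accuracy, and mean Dice are conventionally computed from predicted and ground-truth label maps (function name is hypothetical):

```python
import numpy as np

def segmentation_metrics(pred: np.ndarray, target: np.ndarray, num_classes: int):
    """Return (mean IoU, pixel accuracy, mean Dice) over integer label maps."""
    ious, dices = [], []
    for c in range(num_classes):
        p, t = pred == c, target == c
        inter = np.logical_and(p, t).sum()
        union = np.logical_or(p, t).sum()
        if union == 0:
            continue  # class absent in both maps; skip to avoid 0/0
        ious.append(inter / union)
        dices.append(2 * inter / (p.sum() + t.sum()))
    pixel_acc = (pred == target).mean()
    return float(np.mean(ious)), float(pixel_acc), float(np.mean(dices))

pred = np.array([[0, 1], [1, 2]])
target = np.array([[0, 1], [2, 2]])
miou, acc, dice = segmentation_metrics(pred, target, num_classes=3)
```

In practice these are accumulated over the whole validation set (e.g. via a per-class confusion matrix) rather than per image, but the per-class IoU and Dice definitions are the same.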
#### Training Curves

*(Figures: pretrained mIoU and scratch mIoU curves)*

## Quick Start

```bash
git clone https://github.com/arvinsingh/Darkbird.git
cd Darkbird

pip install -r requirements.txt
```

```bash
# generate samples from the trained diffusion model
python scripts/sample.py \
    --checkpoint checkpoints/ddpm_cifar10_v100_8124355/model_ema.pt \
    --output-dir samples --gif

# run segmentation inference on any face image
python scripts/visualize_segmentation.py \
    --images your_face.jpg \
    --pretrained-ckpt checkpoints/segmentation/seg_pretrained/checkpoint_best.pt \
    --scratch-ckpt checkpoints/segmentation/seg_scratch/checkpoint_best.pt
```

## Model Zoo

| Model | Dataset | Checkpoint |
|---|---|---|
| DDPM | CIFAR-10 | HuggingFace |
| DeepLabV3+ | CelebAMask-HQ | HuggingFace |

## Training on SLURM

```bash
# submits the diffusion training job
sbatch jobs/train_diffusion.sh

# submits jobs for pretrained & from-scratch training
sbatch jobs/run_segmentation.sh
```

Key job script settings:

```bash
#SBATCH --partition=gpu
#SBATCH --nodes=3
#SBATCH --ntasks-per-node=2
#SBATCH --gres=gpu:2
```

- `gpu` partition for V100
- `accel_ai`, `accel_ai_dev` partitions for A100
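With `--nodes=3 --ntasks-per-node=2`, SLURM launches 6 processes (one per GPU), so the DDP world size is 6. When launching with `srun`, the rendezvous variables that `torch.distributed`'s `env://` init method reads are typically derived from SLURM's per-task environment. A minimal sketch, assuming an already-expanded node list and a fixed port (the repo's actual launcher may differ):

```python
import os

def ddp_env_from_slurm(env: dict) -> dict:
    """Map SLURM's standard per-task variables to the ones
    torch.distributed's env:// rendezvous expects."""
    return {
        "RANK": env["SLURM_PROCID"],        # global rank of this task
        "WORLD_SIZE": env["SLURM_NTASKS"],  # total tasks = nodes × tasks-per-node
        "LOCAL_RANK": env["SLURM_LOCALID"], # task index within the node → GPU index
        # NOTE: SLURM often reports a compressed list like "scs[0001-0003]";
        # real launchers expand it first (e.g. with `scontrol show hostnames`).
        "MASTER_ADDR": env["SLURM_NODELIST"].split(",")[0],
        "MASTER_PORT": "29500",             # any free port, identical on all ranks
    }

# example: the 4th process (rank 3) of a 3-node × 2-task job (hostnames hypothetical)
slurm = {
    "SLURM_PROCID": "3",
    "SLURM_NTASKS": "6",
    "SLURM_LOCALID": "1",
    "SLURM_NODELIST": "scs0001,scs0002,scs0003",
}
print(ddp_env_from_slurm(slurm))
```

Each rank would then export these variables and call `torch.distributed.init_process_group("nccl")`, binding itself to GPU `LOCAL_RANK`.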

## Logs & TensorBoard

The repository includes the actual training artifacts from Sunbird HPC:

```bash
# training logs
cat logs/diff_v100_8124355.out            # diffusion training
cat logs/seg_pretrained_8124763.out       # segmentation training
cat logs/test_4gpu_8124513.out            # 4× A100 scaling test

# view TensorBoard
tensorboard --logdir runs/
```

## Blog Post

Detailed writeup: Multi-Node Distributed Training

## License

MIT
