Multi-node distributed training examples on HPC with PyTorch DDP.
All experiments were run on Swansea Sunbird HPC. The `logs/` and `runs/` directories contain actual training logs and TensorBoard events.

| Test | GPUs | Partition | Script |
|---|---|---|---|
| Single GPU MatMul | 1× A100-40GB | accel_ai | test_single_gpu.py |
| Multi-GPU MatMul (DP + DDP) | 2× A100-40GB | accel_ai | test_multi_gpu.py |
| Multi-GPU MatMul (DP + DDP) | 4× A100-40GB | accel_ai | test_multi_gpu.py |

| Task | Model | Dataset | Hardware | Time | Best Metric |
|---|---|---|---|---|---|
| Image Generation | DDPM (U-Net) | CIFAR-10 | 3 nodes × 2 V100 | ~47 min | Loss: 0.030 |
| Face Segmentation | DeepLabV3+ (pretrained backbone) | CelebAMask-HQ | 3 nodes × 2 V100 | ~1.7 hrs | 72.72% mIoU |
| Face Segmentation | DeepLabV3+ (scratch) | CelebAMask-HQ | 3 nodes × 2 V100 | ~3.4 hrs | 75.36% mIoU |

| Model | Epochs | mIoU | Pixel Acc | Mean Dice |
|---|---|---|---|---|
| DeepLabV3+ (pretrained) | 100 | 72.72% | 95.06% | 82.92% |
| DeepLabV3+ (scratch) | 200 | 75.36% | 94.78% | 84.96% |
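
For readers unfamiliar with the three segmentation metrics in the table above, the following is a minimal sketch of how they are commonly computed from a confusion matrix. It is not taken from the repository: the function names `confusion_matrix` and `segmentation_metrics` are hypothetical, and the repository's own evaluation may differ in details such as ignored labels, absent classes, or per-image versus dataset-level averaging.

```python
from typing import Dict, List, Sequence


def confusion_matrix(y_true: Sequence[int], y_pred: Sequence[int],
                     num_classes: int) -> List[List[int]]:
    """Count pixels; rows are ground-truth classes, columns are predictions."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def segmentation_metrics(cm: List[List[int]]) -> Dict[str, float]:
    """Pixel accuracy, mean IoU and mean Dice from a confusion matrix.

    Assumption: classes absent from both labels and predictions are
    skipped when averaging (one common convention, not the only one).
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[c][c] for c in range(n))
    ious, dices = [], []
    for c in range(n):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                         # true class c, predicted otherwise
        fp = sum(cm[r][c] for r in range(n)) - tp    # predicted c, true class differs
        if tp + fp + fn == 0:
            continue
        ious.append(tp / (tp + fp + fn))
        dices.append(2 * tp / (2 * tp + fp + fn))
    return {
        "pixel_acc": correct / total if total else 0.0,
        "miou": sum(ious) / len(ious) if ious else 0.0,
        "mean_dice": sum(dices) / len(dices) if dices else 0.0,
    }


# Toy example: six "pixels", three classes (flattened label maps).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
metrics = segmentation_metrics(confusion_matrix(y_true, y_pred, num_classes=3))
print(metrics)
```

Per class, Dice is never lower than IoU, which is why a model's Mean Dice sits above its mIoU. Pixel accuracy weights every pixel equally, so it is dominated by large classes and can rank models differently from mIoU.
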

    git clone https://github.com/arvinsingh/Darkbird.git
    cd Darkbird
    pip install -r requirements.txt

    # generate samples from the trained diffusion model
    python scripts/sample.py \
        --checkpoint checkpoints/ddpm_cifar10_v100_8124355/model_ema.pt \
        --output-dir samples --gif

    # run segmentation inference on any face image
    python scripts/visualize_segmentation.py \
        --images your_face.jpg \
        --pretrained-ckpt checkpoints/segmentation/seg_pretrained/checkpoint_best.pt \
        --scratch-ckpt checkpoints/segmentation/seg_scratch/checkpoint_best.pt

| Model | Dataset | Checkpoint |
|---|---|---|
| DDPM | CIFAR-10 | HuggingFace |
| DeepLabV3+ | CelebAMask-HQ | HuggingFace |

    sbatch jobs/train_diffusion.sh
    # submits jobs for pretrained & from-scratch training
    sbatch jobs/run_segmentation.sh

Key job script settings:

    #SBATCH --partition=gpu
    #SBATCH --nodes=3
    #SBATCH --ntasks-per-node=2
    #SBATCH --gres=gpu:2

Use the `gpu` partition for V100 and the `accel_ai` or `accel_ai_dev` partitions for A100.
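
With the settings above, Slurm starts 3 × 2 = 6 tasks, one per GPU. The sketch below shows one way a training script could derive its distributed rank information from the environment variables that `srun` sets for each task. It assumes the job launches the script with `srun` (the repository's job scripts may use a different launcher such as `torchrun`), and `DistInfo` and `dist_info_from_slurm` are hypothetical names, not part of the repository.

```python
import os
from typing import Mapping, NamedTuple


class DistInfo(NamedTuple):
    rank: int        # global rank across all nodes
    world_size: int  # total number of tasks in the job step
    local_rank: int  # task index within this node (maps to a GPU index)
    node_rank: int   # index of this node within the allocation


def dist_info_from_slurm(env: Mapping[str, str] = os.environ) -> DistInfo:
    """Read per-task rank information from variables set by srun."""
    return DistInfo(
        rank=int(env["SLURM_PROCID"]),
        world_size=int(env["SLURM_NTASKS"]),
        local_rank=int(env["SLURM_LOCALID"]),
        node_rank=int(env["SLURM_NODEID"]),
    )


# Simulated environment for the last task of a 3-node, 2-tasks-per-node job.
fake_env = {
    "SLURM_PROCID": "5",
    "SLURM_NTASKS": "6",
    "SLURM_LOCALID": "1",
    "SLURM_NODEID": "2",
}
info = dist_info_from_slurm(fake_env)
print(info)
```

In a PyTorch DDP script, `rank` and `world_size` would typically be passed to `torch.distributed.init_process_group`, and `local_rank` to `torch.cuda.set_device`, so that each task drives exactly one GPU.
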

The repository includes actual training artifacts from Sunbird HPC:

    # training logs
    cat logs/diff_v100_8124355.out         # diffusion training
    cat logs/seg_pretrained_8124763.out    # segmentation training
    cat logs/test_4gpu_8124513.out         # 4× A100 scaling test

    # view TensorBoard
    tensorboard --logdir runs/

Detailed writeup: Multi-Node Distributed Training

License: MIT





