A PyTorch implementation of pipeline parallelism strategies for training large neural networks across multiple GPUs/devices.
This project implements three pipeline parallelism schedules:
- Naive (GPipe-style): Simple stop-and-wait pipeline with all-forward then all-backward
- GPipe: Micro-batch pipeline with forward warmup, then backward drain
- 1F1B (PipeDream): One-Forward-One-Backward schedule for improved pipeline efficiency
# Using uv (recommended)
uv pip install -e .
# Or using pip
pip install -e .

Run with 4 pipeline stages:
uv run torchrun --nproc-per-node 4 src/main.py

Edit src/config/config.py to adjust:
- BATCH_SIZE: Input batch size
- HIDDEN_DIM: Model hidden dimension
- TOTAL_LAYERS: Total number of layers (divided across stages)
- CHUNKS: Number of micro-batches for GPipe/1F1B
- STEPS: Training steps
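A hypothetical sketch of what src/config/config.py might contain — the names match the list above, but the specific values here are illustrative, not the project's defaults:

```python
# Illustrative values only; the actual defaults live in src/config/config.py.
BATCH_SIZE = 64      # input batch size
HIDDEN_DIM = 1024    # model hidden dimension
TOTAL_LAYERS = 8     # total layers, split evenly across pipeline stages
CHUNKS = 4           # number of micro-batches for GPipe/1F1B
STEPS = 10           # training steps

# Micro-batching only works if the batch splits evenly.
assert BATCH_SIZE % CHUNKS == 0, "BATCH_SIZE must be divisible by CHUNKS"
```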
src/
├── main.py # Main training loop
├── model.py # ShardedMLP model definition
├── schedule.py # Pipeline schedules (Naive, GPipe, 1F1B)
├── communication.py # Distributed communication primitives
├── profiler.py # Performance profiling utilities
└── config/
    └── config.py # Training configuration

Simple sequential execution: forward all micro-batches, then backward all micro-batches.
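The all-forward-then-all-backward ordering can be sketched as a plain op list (a simplified model of the schedule, not the project's actual implementation; backwards are shown in reverse order, as activations are typically consumed stack-like):

```python
def naive_schedule(num_microbatches: int):
    """Op order for the naive stop-and-wait schedule:
    every forward first, then every backward in reverse."""
    ops = [("F", i) for i in range(num_microbatches)]
    ops += [("B", i) for i in reversed(range(num_microbatches))]
    return ops
```

All activations for all micro-batches must stay live until the backwards begin, which is why this schedule has the highest memory footprint of the three.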
Splits batches into micro-batches and pipelines them through stages:
- Forward warmup phase
- Backward drain in reverse order
- Reduces pipeline bubbles compared to naive approach
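The bubble reduction can be quantified: in an idealized GPipe pipeline with p stages and m micro-batches of equal forward/backward cost, each stage idles for a fraction (p - 1) / (m + p - 1) of the step, so more micro-batches shrink the bubble. A small helper (illustrative, not part of the project's code):

```python
def gpipe_bubble_fraction(stages: int, microbatches: int) -> float:
    """Idle-time fraction per stage in an ideal GPipe pipeline,
    assuming equal per-micro-batch forward/backward cost."""
    return (stages - 1) / (microbatches + stages - 1)
```

For 4 stages and 4 micro-batches the bubble is 3/7 of the step; bumping CHUNKS to 16 drops it below 1/6.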
Interleaves forward and backward passes for better efficiency:
- Warmup: Forward-only passes to fill pipeline
- Steady state: Alternating forward and backward
- Drain: Backward-only passes to complete remaining gradients
- Uses async communication to prevent blocking
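The warmup/steady/drain phases above can be sketched as a per-stage op-order generator (a simplified model of 1F1B scheduling; the project's schedule.py may structure this differently):

```python
def one_f_one_b(stage: int, num_stages: int, num_microbatches: int):
    """Per-stage op order for 1F1B: warmup forwards to fill the pipeline,
    then alternating forward/backward, then drain the remaining backwards."""
    # Earlier stages need more in-flight forwards before their first backward.
    warmup = min(num_stages - stage - 1, num_microbatches)
    ops = [("F", i) for i in range(warmup)]
    f, b = warmup, 0
    while f < num_microbatches:   # steady state: one forward, one backward
        ops.append(("F", f)); f += 1
        ops.append(("B", b)); b += 1
    while b < num_microbatches:   # drain: backwards only
        ops.append(("B", b)); b += 1
    return ops
```

The last stage (warmup of 0) alternates F/B from the start, while stage 0 runs all its forwards before its first backward; either way, at most num_stages activations are live per stage, versus all num_microbatches under the naive schedule.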
The profiler tracks:
- Compute time (forward/backward)
- Communication time (send/receive)
- Pipeline bubbles (idle time)
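A minimal sketch of this kind of per-category wall-clock accounting (hypothetical API; the actual interface in src/profiler.py may differ):

```python
import time
from collections import defaultdict
from contextlib import contextmanager

class PipelineProfiler:
    """Accumulates wall-clock time per category (e.g. forward, backward, comm)."""

    def __init__(self):
        self.totals = defaultdict(float)

    @contextmanager
    def track(self, category: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[category] += time.perf_counter() - start

    def report(self, rank: int):
        for category, seconds in sorted(self.totals.items()):
            print(f"[rank {rank}] {category}: {seconds:.4f}s")
```

Usage would look like `with profiler.track("forward"): stage(x)`, with bubble time derived as step time minus tracked compute and communication time.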
Results are printed after training for each rank.
See LICENSE file for details.
Thanks to freeCodeCamp and kiankyars for teaching this; goated stuff.
(I needed Claude's help while implementing onef_oneb, as I was struggling with the async behaviour.)