GPzoo is a Python library for scalable Gaussian Process models tailored for spatial transcriptomics data analysis. It provides implementations of several state-of-the-art GP approximations with multi-group extensions for cell type-aware modeling.
- Multiple GP Approximations:
  - SVGP (Sparse Variational GP) with inducing points
  - VNNGP (Variational Nearest Neighbor GP) for large datasets
  - Multi-Group extensions (MGGP) for cell type-aware modeling
- Spatial Transcriptomics Integration:
  - Built-in support for SlideseqV2, 10x Visium, and liver datasets
  - Automatic data loading and preprocessing
  - Integration with Scanpy and Squidpy ecosystems
- Scalable Training:
  - Batched training for memory efficiency
  - GPU acceleration support
  - TensorBoard logging for monitoring
  - Checkpoint resumption
- Model Variants:
  - SVGP-NSF (Neural Spectral Factorization)
  - VNNGP-NSF
  - Multi-Group versions of all models

Install from source:

git clone https://github.com/luisdiaz1997/GPzoo.git
cd GPzoo
pip install -e .

Install required packages:

pip install -r requirements.txt

For GPU support, install PyTorch with CUDA:

# Visit https://pytorch.org/get-started/locally/ for the appropriate command
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118

GPzoo provides three ways to create models, from simplest to most flexible:
Convenience classes are the simplest way to get started; they handle all initialization automatically:
from gpzoo.models import SVGP_NSF, VNNGP_NSF, MGGP_SVGP_NSF, MGGP_VNNGP_NSF
# Load your spatial data
X, Y = load_your_data() # X: (N, 2) coordinates, Y: (genes, N) counts
# Standard SVGP for medium-sized datasets
model = SVGP_NSF(X, Y, L=12, lengthscale=8.0)
# VNNGP for large datasets (100k+ cells)
model = VNNGP_NSF(X, Y, L=12, K=50, lengthscale=8.0)
# Multi-group models for cell type-aware analysis
model = MGGP_SVGP_NSF(X, Y, groupsX, L=12, lengthscale=8.0)
model = MGGP_VNNGP_NSF(X, Y, groupsX, L=12, K=50, lengthscale=8.0)
# Move to GPU and train
model.to("cuda")

Use build_model with dataset-specific configurations:
from gpzoo.models import build_model
model, metadata = build_model(
"slideseq/svgp_nsf",
X=X,
Y=Y,
L=10,
lengthscale=8.0,
device="cuda"
)
# Resume from checkpoint
model, _ = build_model(
"slideseq/svgp_nsf",
checkpoint_path="models/slideseq_svgp.pth",
X=X, Y=Y, L=10, lengthscale=8.0
)

Build models from individual components for maximum flexibility:
from gpzoo.kernels import batched_Matern32, batched_MGGP_RBF
from gpzoo.gp import SVGP, VNNGP, MGGP_SVGP, MGGP_VNNGP
from gpzoo.likelihoods import NSF2
# 1. Choose a kernel
kernel = batched_Matern32(sigma=1.0, lengthscale=8.0)
# Or for multi-group:
kernel = batched_MGGP_RBF(sigma=1.0, lengthscale=8.0, group_diff_param=10.0, n_groups=5)
# 2. Choose a GP approximation
gp = SVGP(kernel, M=1000) # Sparse variational GP
gp = VNNGP(kernel, M=N, K=50) # Variational nearest neighbor GP
gp = MGGP_SVGP(kernel, M=1000) # Multi-group SVGP
gp = MGGP_VNNGP(kernel, M=N, K=50) # Multi-group VNNGP
# 3. Wrap with likelihood model
model = NSF2(gp, Y, L=12)

Use the provided training scripts for each dataset:
python -m gpzoo.datasets.slideseq.svgp_nsf
python -m gpzoo.datasets.slideseq.vnngp_nsf
python -m gpzoo.datasets.slideseq.svgp_mggp_nsf
python -m gpzoo.datasets.slideseq.vnngp_mggp_nsf

| Model | Command | Description |
|---|---|---|
| SVGP | python -m gpzoo.datasets.slideseq.svgp_nsf | Standard sparse variational GP |
| SVGP-MGGP | python -m gpzoo.datasets.slideseq.svgp_mggp_nsf | Multi-group SVGP |
| VNNGP | python -m gpzoo.datasets.slideseq.vnngp_nsf | Nearest neighbor GP |
| VNNGP-MGGP | python -m gpzoo.datasets.slideseq.vnngp_mggp_nsf | Multi-group VNNGP |

# Run multiple models on different GPUs
CUDA_VISIBLE_DEVICES=0 python -m gpzoo.datasets.slideseq.vnngp_nsf &
CUDA_VISIBLE_DEVICES=1 python -m gpzoo.datasets.slideseq.svgp_nsf &

GPzoo/
├── gpzoo/ # Main library
│ ├── gp.py # GP implementations (SVGP, VNNGP, MGGP_*)
│ ├── kernels.py # GP kernel implementations
│ ├── likelihoods.py # Likelihood functions (NSF2, Gaussian)
│ ├── models/ # Model convenience classes and registry
│ │ ├── nsf.py # SVGP_NSF, VNNGP_NSF, MGGP_* classes
│ │ └── registry.py # build_model() and model registry
│ ├── model_utilities.py # Model construction utilities
│ ├── training_utilities.py # Training loops and logging
│ ├── utilities.py # General utilities
│ └── datasets/ # Dataset-specific code
│ ├── slideseq/ # SlideseqV2 dataset
│ ├── liver/ # Liver spatial transcriptomics
│ └── tenxvisium/ # 10x Visium dataset
├── notebooks/ # Example notebooks and analysis
├── models/ # Saved model checkpoints
└── requirements.txt # Python dependencies
Models are configured via shared configuration files in each dataset directory. Key parameters:
- STEPS = 34000 - Training iterations
- X_BATCH = 7000 - Spatial locations per batch
- Y_BATCH = 1000 - Genes per batch
- L_FACTORS = 10 - Number of latent factors
- LR = 1e-2 - Base learning rate
- LR_SCALE = 1e-4 - Learning rate for scale parameters
- LENGTHSCALE = 8.0 - Kernel lengthscale
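
The two batch sizes suggest that each training step sees only a random subset of spatial locations and genes. The following is a minimal sketch of that idea, assuming uniform sampling without replacement. The helper `sample_batch` is hypothetical and is not part of GPzoo; the actual training loop may batch differently.

```python
import random

# Values taken from the parameter list above.
X_BATCH = 7000   # spatial locations per batch
Y_BATCH = 1000   # genes per batch


def sample_batch(n_locations, n_genes, x_batch=X_BATCH, y_batch=Y_BATCH, seed=None):
    """Pick the location and gene indices used in one training step.

    Hypothetical helper: batch sizes are capped at the dataset size and
    indices are drawn uniformly without replacement.
    """
    rng = random.Random(seed)
    x_idx = sorted(rng.sample(range(n_locations), min(x_batch, n_locations)))
    y_idx = sorted(rng.sample(range(n_genes), min(y_batch, n_genes)))
    return x_idx, y_idx


x_idx, y_idx = sample_batch(n_locations=50_000, n_genes=20_000, seed=0)
print(len(x_idx), len(y_idx))  # 7000 1000
```

Smaller batches lower peak memory per step at the cost of noisier gradient estimates.
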
SVGP:
- Uses inducing points for scalability
- Matern32 kernel for spatial smoothness
- Neural Spectral Factorization (NSF) for covariance modeling
- Suitable for medium-sized datasets
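
For reference, the Matern-3/2 covariance can be written in a few lines. This sketch uses the standard textbook formula and treats `sigma` as a standard deviation, which is an assumption; it is not the code behind `batched_Matern32`, whose parameterization may differ.

```python
import math


def matern32(x1, x2, sigma=1.0, lengthscale=8.0):
    """Matern-3/2 covariance between two points (textbook form)."""
    r = math.dist(x1, x2)
    scaled = math.sqrt(3.0) * r / lengthscale
    return sigma**2 * (1.0 + scaled) * math.exp(-scaled)


print(matern32((0.0, 0.0), (0.0, 0.0)))  # 1.0
print(matern32((0.0, 0.0), (8.0, 0.0)))  # about 0.483
```

With `lengthscale=8.0`, two spots one lengthscale apart still have a covariance of roughly 0.48, so the lengthscale sets how quickly spatial correlation decays.
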
VNNGP:
- Uses K-nearest neighbors for approximation
- More scalable for very large datasets
- Default: K=50 nearest neighbors
- 1 expectation sample during training
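
The neighbor sets that a nearest-neighbor approximation relies on can be illustrated with a brute-force search. This is a sketch only: the function name is hypothetical, the search costs O(N²), and it excludes each point from its own neighbor set, which may not match how GPzoo builds its neighbor structure.

```python
import math


def k_nearest_neighbors(points, k):
    """Return, for each point, the indices of its k nearest other points."""
    neighbors = []
    for i, p in enumerate(points):
        dists = [(math.dist(p, q), j) for j, q in enumerate(points) if j != i]
        dists.sort()
        neighbors.append([j for _, j in dists[:k]])
    return neighbors


coords = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
print(k_nearest_neighbors(coords, k=2))
# [[1, 2], [0, 2], [0, 1], [2, 1]]
```

Each location then conditions on only K neighbors rather than on every other location, which is what makes the approach scale to very large datasets.
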
MGGP:
- Extends kernels with group-specific parameters
- Incorporates cell type/cluster information
- group_diff_param controls between-group similarity
- Requires groupsX tensor indicating cell types
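
Cell type annotations usually arrive as strings, while a group tensor needs integer codes. One way to produce them is sketched below; `encode_groups` is a hypothetical helper, not a GPzoo function, and the ordering of codes (sorted label order) is an arbitrary choice.

```python
def encode_groups(labels):
    """Map cell type labels to integer codes 0..n_groups-1.

    Codes follow the sorted order of the distinct labels.
    """
    categories = sorted(set(labels))
    index = {name: code for code, name in enumerate(categories)}
    codes = [index[name] for name in labels]
    return codes, categories


cell_types = ["neuron", "astrocyte", "neuron", "microglia"]
codes, categories = encode_groups(cell_types)
print(codes)       # [2, 0, 2, 1]
print(categories)  # ['astrocyte', 'microglia', 'neuron']
```

The codes can then be wrapped with `torch.tensor(codes, dtype=torch.long)`, and `len(categories)` gives the number of groups.
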
Scale parameters:
- Diagonal stored in log-space for positive definiteness
- During forward pass: diagonal exponentiated via apply_constraints
- Separate learning rate (LR_SCALE) for scale parameters
- Cholesky initialization for SVGP, SVD for VNNGP
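
Why a log-space diagonal guarantees a valid covariance can be seen on a 2 x 2 example. The sketch below assumes the scale parameter is a lower-triangular factor L with covariance S = L Lᵀ; the function names are hypothetical and do not reproduce `apply_constraints`.

```python
import math


def constrained_factor(log_diag, off_diag):
    """Build a 2x2 lower-triangular factor from unconstrained parameters.

    log_diag: two real numbers, exponentiated to give a positive diagonal.
    off_diag: the single below-diagonal entry, left unconstrained.
    """
    l11, l22 = math.exp(log_diag[0]), math.exp(log_diag[1])
    return [[l11, 0.0], [off_diag, l22]]


def covariance(L):
    """Return S = L @ L.T for a 2x2 lower-triangular L."""
    (a, _), (b, c) = L
    return [[a * a, a * b], [a * b, b * b + c * c]]


L = constrained_factor(log_diag=[-3.0, 0.5], off_diag=-2.0)
S = covariance(L)
det = S[0][0] * S[1][1] - S[0][1] ** 2
print(det > 0)  # True
```

Because the determinant equals the squared product of the diagonal entries, any real-valued `log_diag` yields a positive definite S, so the optimizer can update the parameters without constraints.
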
Each training run produces:
- *.pth - Model checkpoint
- *_losses.csv - Loss history
- *_losses.npy - Loss history (numpy)
- *.json - Training metadata (time, memory, device)
- TensorBoard logs in models/*/tb/
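
The loss history can also be inspected without TensorBoard. The sketch below assumes the CSV holds one loss value per row in its first column, possibly preceded by a header; that layout is a guess, so check the actual file first. Both helper functions are hypothetical.

```python
import csv
import io


def read_losses(handle):
    """Read one loss value per row from the first column (assumed layout)."""
    losses = []
    for row in csv.reader(handle):
        try:
            losses.append(float(row[0]))
        except (ValueError, IndexError):
            continue  # skip a header or blank row
    return losses


def moving_average(values, window):
    """Trailing mean over up to `window` most recent values."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


# In-memory stand-in for a *_losses.csv file.
demo = io.StringIO("loss\n10.0\n8.0\n9.0\n5.0\n")
losses = read_losses(demo)
print(moving_average(losses, window=2))  # [10.0, 9.0, 8.5, 7.0]
```

For a real run, replace the `io.StringIO` object with `open(path, newline="")` on the CSV produced by training.
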
View TensorBoard logs:
tensorboard --logdir models/slideseq/tb/

SlideseqV2:
- Mouse brain spatial transcriptomics
- ~50,000 spots, ~20,000 genes (filtered)
- Cell type clusters from original publication
- Automatic loading via load_slideseq_with_groups()

Liver:
- Human liver disease vs healthy
- Spatial coordinates and gene counts
- Disease state labels for MGGP

10x Visium:
- Standard spatial transcriptomics format
- Compatible with Scanpy/Squidpy workflows

from gpzoo.models import SVGP_NSF, MGGP_VNNGP_NSF
import torch
# Your data: X (coordinates), Y (gene counts), groups (optional)
X = torch.tensor(coordinates, dtype=torch.float32)
Y = torch.tensor(gene_counts, dtype=torch.float32)
groups = torch.tensor(cell_types, dtype=torch.long) # for MGGP
# Simple model
model = SVGP_NSF(X, Y, L=12, lengthscale=10.0)
# Multi-group model
model = MGGP_VNNGP_NSF(
X, Y, groups,
L=12,
K=50,
lengthscale=10.0,
group_diff_param=10.0
)
model.to("cuda")

Load a trained model from a checkpoint:

from gpzoo.models import SVGP_NSF, build_model, load_checkpoint
# Option 1: Via build_model (creates new model and loads weights)
model, metadata = build_model(
"slideseq/svgp_nsf",
checkpoint_path="models/slideseq/slideseq_svgp.pth",
X=X, Y=Y, L=10, lengthscale=8.0
)
# Option 2: Load into existing model
model = SVGP_NSF(X, Y, L=10, lengthscale=8.0)
load_checkpoint(model, "models/slideseq/slideseq_svgp.pth")

Modify configuration in dataset-specific config.py:
- Adjust LENGTHSCALE for spatial scale
- Change X_BATCH, Y_BATCH for memory usage
- Set LENGTHSCALE_TRAIN_AFTER to unfreeze lengthscale
- Modify GROUP_DIFF_PARAM for MGGP group similarity
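
One rough way to judge whether a lengthscale suits the spatial scale is to compare it with the typical spacing between neighboring spots. The sketch below computes the median nearest-neighbor distance by brute force; the helper is hypothetical, and the comparison is a heuristic rather than a rule from GPzoo.

```python
import math
import statistics


def median_neighbor_distance(points):
    """Median distance from each point to its nearest other point."""
    nearest = []
    for i, p in enumerate(points):
        nearest.append(min(math.dist(p, q) for j, q in enumerate(points) if j != i))
    return statistics.median(nearest)


# Toy 3 x 3 grid with unit spacing.
grid = [(float(x), float(y)) for x in range(3) for y in range(3)]
spacing = median_neighbor_distance(grid)
print(spacing)        # 1.0
print(8.0 / spacing)  # 8.0
```

On this toy grid a LENGTHSCALE of 8.0 spans about eight neighbor spacings. If the coordinates are in different units (for example microns rather than pixel indices), the same value would mean a very different amount of smoothing.
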
Memory issues:
- Reduce X_BATCH and Y_BATCH sizes
- VNNGP uses more memory due to neighbor storage
- Monitor GPU memory with nvidia-smi
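
A back-of-the-envelope estimate shows where memory goes. The sizes below use values quoted in this README (X_BATCH = 7000, Y_BATCH = 1000, about 50,000 spots, K = 50). The int64 neighbor table and the per-location K x K block are assumptions about how a nearest-neighbor model might store its data, not a description of GPzoo internals.

```python
def tensor_megabytes(shape, bytes_per_element):
    """Size of a dense tensor in megabytes (1 MB = 1e6 bytes)."""
    n = 1
    for dim in shape:
        n *= dim
    return n * bytes_per_element / 1e6


# One float32 batch of counts: Y_BATCH genes x X_BATCH locations.
batch_mb = tensor_megabytes((1000, 7000), 4)

# Assumed neighbor index table: N locations x K neighbors, stored as int64.
neighbors_mb = tensor_megabytes((50_000, 50), 8)

# Assumed K x K float32 covariance block for each location in a batch.
blocks_mb = tensor_megabytes((7000, 50, 50), 4)

print(batch_mb, neighbors_mb, blocks_mb)  # 28.0 20.0 70.0
```

These figures cover raw tensors only; gradients, optimizer state, and intermediate activations multiply the real footprint, and they shrink in proportion when the batch sizes are reduced.
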
Training instability:
- Reduce learning rate (LR, LR_SCALE)
- Increase JITTER for numerical stability
- Check LENGTHSCALE is appropriate for spatial scale
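
Jitter helps because kernel matrices can be singular or nearly so, which makes a Cholesky factorization fail. The sketch below uses a small pure-Python Cholesky routine written for illustration, not GPzoo code, to show a singular matrix failing without jitter and succeeding once a small value is added to the diagonal.

```python
import math


def cholesky(A):
    """Plain Cholesky factorization; raises ValueError if A is not positive definite."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    raise ValueError("matrix is not positive definite")
                L[i][j] = math.sqrt(s)
            else:
                L[i][j] = s / L[j][j]
    return L


def add_jitter(A, jitter):
    """Return A + jitter * I."""
    return [[a + (jitter if i == j else 0.0) for j, a in enumerate(row)]
            for i, row in enumerate(A)]


# Two identical locations give a singular kernel matrix.
K = [[1.0, 1.0], [1.0, 1.0]]
try:
    cholesky(K)
except ValueError as err:
    print("without jitter:", err)

L = cholesky(add_jitter(K, 1e-4))
print("with jitter: ok")
```

Larger jitter makes the factorization more robust but slightly perturbs the model, so it is worth increasing it only as far as needed.
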
Checkpoint problems:
- Ensure checkpoint paths exist in config.py
- Delete corrupt checkpoints to start fresh
- Check TensorBoard logs for training curves

If you use GPzoo in your research, please cite:
@software{gpzoo2024,
title = {GPzoo: Gaussian Process Models for Spatial Transcriptomics},
author = {Diaz, Luis},
year = {2024},
url = {https://github.com/luisdiaz1997/GPzoo}
}

- SVGP: Titsias (2009), "Variational Learning of Inducing Variables in Sparse Gaussian Processes"
- VNNGP: Wu et al. (2022), "Variational Nearest Neighbor Gaussian Process"
- SlideseqV2: Stickels et al. (2021), "Highly sensitive spatial transcriptomics at near-cellular resolution with Slide-seqV2"
- Scanpy: Wolf et al. (2018), "SCANPY: large-scale single-cell gene expression data analysis"
- Squidpy: Palla et al. (2022), "Squidpy: a scalable framework for spatial omics analysis"
MIT License
For questions or issues, please open an issue on GitHub or contact the maintainers.