PyTorch CUDA 12.8 project template - GPU-accelerated ML with modern Python packaging (uv), checkpointing, early stopping, and reproducibility utilities.

synapticore-io/torch-cuda

⚡ PyTorch CUDA Template


🚀 A blazing-fast Python template for GPU-accelerated machine learning

Harness the full power of modern PyTorch with CUDA 12.8 acceleration 🔥

🌟 Overview

PyTorch CUDA Template provides everything you need to jumpstart your GPU-accelerated machine learning projects. Built with modern Python packaging standards and optimized for PyTorch 2.7+ with CUDA 12.8 support, this template eliminates setup friction so you can focus on building amazing models.

🎯 Key Features

  • 🔥 Cutting-Edge PyTorch - Latest PyTorch 2.7+ with optimized CUDA 12.8 support
  • ⚡ GPU-Ready Architecture - Pre-configured CUDA acceleration with intelligent CPU fallback
  • 🛠️ Modern Development Stack - Integrated linting, formatting, testing, and type checking
  • 📊 ML Ops Ready - MLflow experiment tracking and Polars for high-performance data processing
  • 🚀 Lightning-Fast Setup - Powered by uv for blazing-fast dependency resolution
  • 🏗️ Production-Ready Structure - Following modern Python packaging best practices

📋 Requirements

  • 🐍 Python ≥ 3.11
  • 🎮 CUDA 12.8 (for GPU acceleration)
  • 💻 GPU - CUDA-compatible NVIDIA GPU (optional; the code falls back to CPU when none is available)
  • ⚡ uv - Package manager (recommended for fastest installs)

🚀 Installation

⚡ Lightning-Fast Setup

# Clone the template
git clone https://github.com/bjoernbethge/torch-cuda.git
cd torch-cuda

# Install everything with uv (recommended)
uv sync

🎛️ Customized Installation

Choose exactly what you need:

# 🔥 Basic PyTorch setup
uv sync

# 🧪 Development environment (testing, linting, formatting)
uv sync --extra dev

# 📊 ML Ops toolkit (MLflow, Polars, Plotly, profiling tools)
uv sync --extra extras

# 🌟 Everything included (the full experience)
uv sync --extra all

# Add new packages
uv add torchvision

🚀 Quick Start Guide

1. 🔍 Verify Your GPU Setup

import torch

print(f"🔥 PyTorch version: {torch.__version__}")
print(f"⚡ CUDA available: {torch.cuda.is_available()}")
print(f"🎮 CUDA version: {torch.version.cuda}")
print(f"💻 GPU count: {torch.cuda.device_count()}")

if torch.cuda.is_available():
    print(f"🚀 Current GPU: {torch.cuda.get_device_name()}")
    print(f"💾 GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

2. 🧠 Create Your First Model

import torch
import torch.nn as nn

# 🎯 Automatically detect best device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"🚀 Using device: {device}")

# 🧠 Build a neural network
class SimpleNet(nn.Module):
    def __init__(self, input_size=784, hidden_size=256, num_classes=10):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(input_size, hidden_size),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(hidden_size, num_classes)
        )

    def forward(self, x):
        return self.network(x)

# 🚀 Instantiate and move to GPU
model = SimpleNet().to(device)

# 📊 Model info
total_params = sum(p.numel() for p in model.parameters())
print(f"🧠 Model parameters: {total_params:,}")

# 🎯 Test forward pass
sample_input = torch.randn(32, 784).to(device)
output = model(sample_input)
print(f"📊 Input shape: {sample_input.shape}")
print(f"📈 Output shape: {output.shape}")
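The weights and the sample input above are drawn at random, so two runs print different values. The repository description mentions reproducibility utilities, but their API is not shown in this README, so the following is only a minimal sketch of seeding. `seed_everything` is a hypothetical helper name and is not confirmed to exist in the template.

```python
import random

import torch


def seed_everything(seed: int = 42) -> None:
    """Seed the Python and PyTorch RNGs (hypothetical helper, not the template's API)."""
    random.seed(seed)
    torch.manual_seed(seed)  # seeds the PyTorch generators on all devices
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)  # redundant with manual_seed, kept for explicitness


# Re-seeding with the same value reproduces the same random tensor
seed_everything(123)
first = torch.randn(4)
seed_everything(123)
second = torch.randn(4)
print(torch.equal(first, second))  # prints True
```

Seeding alone does not guarantee bit-identical results on a GPU, because some CUDA kernels are nondeterministic unless `torch.use_deterministic_algorithms(True)` is also enabled.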

3. 🏋️ Train with MLflow Tracking

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
from tqdm import tqdm
import mlflow
import mlflow.pytorch

# 📊 Initialize MLflow experiment
mlflow.set_experiment("pytorch-cuda-training")
mlflow.start_run()

# 🎯 Setup training environment (SimpleNet is the model class defined in step 2)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = SimpleNet().to(device)
optimizer = optim.AdamW(model.parameters(), lr=0.001, weight_decay=0.01)
criterion = nn.CrossEntropyLoss()

# 📈 Log hyperparameters
mlflow.log_params({
    "learning_rate": 0.001,
    "batch_size": 32,
    "epochs": 10,
    "device": str(device),
    "model_params": sum(p.numel() for p in model.parameters())
})

# 📊 Create sample dataset (random data, so accuracy stays near chance level)
X = torch.randn(1000, 784)
y = torch.randint(0, 10, (1000,))
dataset = TensorDataset(X, y)
dataloader = DataLoader(
    dataset,
    batch_size=32,
    shuffle=True,
    num_workers=4,  # 🚀 Parallel data loading (on Windows/macOS, run under `if __name__ == "__main__":`)
    pin_memory=torch.cuda.is_available()  # ⚡ Faster GPU transfer; only useful with CUDA
)

# 🏋️ Training loop with MLflow logging
model.train()
for epoch in range(10):
    epoch_loss = 0
    correct_predictions = 0

    pbar = tqdm(dataloader, desc=f"🏋️ Epoch {epoch+1}/10")

    for batch_x, batch_y in pbar:
        batch_x, batch_y = batch_x.to(device), batch_y.to(device)

        optimizer.zero_grad()
        outputs = model(batch_x)
        loss = criterion(outputs, batch_y)

        loss.backward()
        optimizer.step()

        epoch_loss += loss.item()
        pred = outputs.argmax(dim=1)
        correct_predictions += (pred == batch_y).sum().item()

        pbar.set_postfix({'Loss': f'{loss.item():.4f}'})

    # 📊 Log metrics to MLflow, using the epoch number as the step
    avg_loss = epoch_loss / len(dataloader)
    accuracy = correct_predictions / len(dataset)

    mlflow.log_metrics({
        "loss": avg_loss,
        "accuracy": accuracy
    }, step=epoch + 1)

    print(f"🎯 Epoch {epoch+1}: Loss = {avg_loss:.4f}, Accuracy = {accuracy:.3f}")

# 💾 Save model
mlflow.pytorch.log_model(model, "model")
mlflow.end_run()
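The repository description also lists early stopping and checkpointing, which the loop above does not use. The template's own utilities are not shown in this README, so the class below is a hypothetical, framework-free sketch of the usual patience-based rule: stop once the monitored loss has failed to improve for `patience` consecutive epochs. In practice the monitored value would normally be a validation loss rather than the training loss.

```python
class EarlyStopping:
    """Patience-based early stopping (illustrative sketch, not the template's API)."""

    def __init__(self, patience: int = 3, min_delta: float = 0.0) -> None:
        self.patience = patience      # epochs to wait without improvement
        self.min_delta = min_delta    # minimum decrease that counts as improvement
        self.best_loss = float("inf")
        self.bad_epochs = 0
        self.improved = False         # True when the latest epoch set a new best

    def step(self, loss: float) -> bool:
        """Record one epoch's loss and return True when training should stop."""
        self.improved = loss < self.best_loss - self.min_delta
        if self.improved:
            self.best_loss = loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


stopper = EarlyStopping(patience=2)
losses = [1.0, 0.8, 0.81, 0.79, 0.80, 0.80, 0.70]  # made-up values for the demo
stopped_at = None
for epoch, loss in enumerate(losses, start=1):
    should_stop = stopper.step(loss)
    if stopper.improved:
        # A real loop could checkpoint here, e.g.
        # torch.save(model.state_dict(), "best.pt")  # hypothetical path
        pass
    if should_stop:
        stopped_at = epoch
        break

print(stopped_at, stopper.best_loss)  # prints 6 0.79
```

Saving only on improvement means the checkpoint on disk always corresponds to `best_loss`, even though training runs for `patience` more epochs before stopping.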

4. 📊 High-Performance Data Processing with Polars

import polars as pl
import torch
from torch.utils.data import Dataset, DataLoader

# 📊 Create and process data with Polars (often faster than pandas for columnar transformations)
def create_sample_dataset():
    """Create a sample dataset using Polars for high-performance processing"""

    # 🚀 Generate sample data with Polars
    df = pl.DataFrame({
        "feature_1": pl.Series([i * 0.1 for i in range(10000)]),
        "feature_2": pl.Series([i * 0.2 + 1 for i in range(10000)]),
        "feature_3": pl.Series([i * 0.05 - 0.5 for i in range(10000)]),
        "target": pl.Series([i % 3 for i in range(10000)])
    })

    # 📈 High-performance data transformations
    processed_df = (
        df
        .with_columns([
            # 🔄 Feature engineering
            ((pl.col("feature_1") * pl.col("feature_2")).alias("interaction_1")),
            (pl.col("feature_3").pow(2).alias("feature_3_squared")),
            # 📊 Normalization (z-score over the full column)
            ((pl.col("feature_1") - pl.col("feature_1").mean()) / pl.col("feature_1").std()).alias("feature_1_norm"),
            ((pl.col("feature_2") - pl.col("feature_2").mean()) / pl.col("feature_2").std()).alias("feature_2_norm")
        ])
        .filter(pl.col("feature_1") > 0.5)  # 🎯 Fast filtering
    )

    print(f"📊 Processed {len(processed_df)} samples")
    return processed_df

# 🎯 Custom Dataset class for Polars integration
class PolarsDataset(Dataset):
    def __init__(self, df: pl.DataFrame, feature_cols: list[str], target_col: str):
        # Convert once up front so __getitem__ is a cheap tensor index
        self.features = torch.tensor(df.select(feature_cols).to_numpy(), dtype=torch.float32)
        self.targets = torch.tensor(df.select(target_col).to_numpy().flatten(), dtype=torch.long)

    def __len__(self):
        return len(self.features)

    def __getitem__(self, idx):
        return self.features[idx], self.targets[idx]

# 🚀 Use the high-performance dataset
df = create_sample_dataset()
feature_cols = ["feature_1_norm", "feature_2_norm", "feature_3_squared", "interaction_1"]

dataset = PolarsDataset(df, feature_cols, "target")
dataloader = DataLoader(dataset, batch_size=64, shuffle=True, pin_memory=torch.cuda.is_available())

print(f"✅ Created dataset with {len(dataset)} samples and {len(feature_cols)} features")

5. 📈 Interactive Visualization with Plotly

import plotly.graph_objects as go
from plotly.subplots import make_subplots

def visualize_training_metrics(losses, accuracies, gpu_utilization=None):
    """Create interactive training visualizations"""

    # 📊 Create subplots
    fig = make_subplots(
        rows=2, cols=2,
        subplot_titles=('🏋️ Training Loss', '🎯 Accuracy', '⚡ GPU Utilization', '📈 Learning Curve'),
        specs=[[{"secondary_y": False}, {"secondary_y": False}],
               [{"secondary_y": False}, {"secondary_y": False}]]
    )

    epochs = list(range(1, len(losses) + 1))

    # 📉 Loss curve
    fig.add_trace(
        go.Scatter(x=epochs, y=losses, mode='lines+markers', name='Loss', line=dict(color='red')),
        row=1, col=1
    )

    # 🎯 Accuracy curve
    fig.add_trace(
        go.Scatter(x=epochs, y=accuracies, mode='lines+markers', name='Accuracy', line=dict(color='green')),
        row=1, col=2
    )

    # ⚡ GPU utilization (if available)
    if gpu_utilization:
        fig.add_trace(
            go.Scatter(x=epochs, y=gpu_utilization, mode='lines+markers', name='GPU %', line=dict(color='blue')),
            row=2, col=1
        )

    # 📈 Combined learning curve: scale the loss to [0, 1] so it shares an axis with accuracy
    max_loss = max(losses) or 1.0
    normalized_losses = [loss / max_loss for loss in losses]
    fig.add_trace(
        go.Scatter(x=epochs, y=normalized_losses, mode='lines', name='Loss (normalized)', line=dict(color='red', dash='dot')),
        row=2, col=2
    )
    fig.add_trace(
        go.Scatter(x=epochs, y=accuracies, mode='lines', name='Accuracy', line=dict(color='green')),
        row=2, col=2
    )

    # 🎨 Update layout
    fig.update_layout(
        title="🚀 PyTorch CUDA Training Dashboard",
        showlegend=True,
        height=600
    )

    return fig

# 📊 Example usage (illustrative values, not measured results)
sample_losses = [2.3, 1.8, 1.4, 1.1, 0.9, 0.7, 0.6, 0.5, 0.4, 0.35]
sample_accuracies = [0.1, 0.3, 0.5, 0.65, 0.75, 0.82, 0.87, 0.91, 0.94, 0.96]
sample_gpu_util = [85, 87, 90, 88, 92, 89, 91, 88, 90, 87]

fig = visualize_training_metrics(sample_losses, sample_accuracies, sample_gpu_util)
fig.show()  # 🎯 Interactive visualization in browser

6. ⚡ Performance Monitoring with GPU Profiling

import torch
from torch.profiler import profile, record_function, ProfilerActivity
import psutil

def profile_training_step(model, data_loader, device):
    """Profile a few training steps (CPU always, CUDA when a GPU is available)"""

    # Request CUDA activity only when a GPU is present
    activities = [ProfilerActivity.CPU]
    if torch.cuda.is_available():
        activities.append(ProfilerActivity.CUDA)

    # Create the optimizer once so its state persists across steps
    optimizer = torch.optim.Adam(model.parameters())

    # 🔍 Start profiling
    with profile(
        activities=activities,
        record_shapes=True,
        profile_memory=True,
        with_stack=True
    ) as prof:

        model.train()
        for i, (batch_x, batch_y) in enumerate(data_loader):
            if i >= 5:  # Profile first 5 batches
                break

            batch_x, batch_y = batch_x.to(device), batch_y.to(device)
            optimizer.zero_grad()

            with record_function("forward_pass"):
                outputs = model(batch_x)
                loss = torch.nn.functional.cross_entropy(outputs, batch_y)

            with record_function("backward_pass"):
                loss.backward()

            with record_function("optimizer_step"):
                optimizer.step()

    # 📊 Print profiling results, sorted by GPU time when available
    sort_key = "cuda_time_total" if torch.cuda.is_available() else "cpu_time_total"
    print("🔥 GPU Profiling Results:")
    print(prof.key_averages().table(sort_by=sort_key, row_limit=10))

    # 💾 Export for visualization
    prof.export_chrome_trace("trace.json")
    print("📈 Trace exported to trace.json - open in chrome://tracing")

def monitor_system_resources():
    """Monitor CPU, memory, and GPU memory usage"""

    # 💻 System resources
    cpu_percent = psutil.cpu_percent(interval=1)
    memory = psutil.virtual_memory()

    print(f"💻 CPU Usage: {cpu_percent}%")
    print(f"💾 RAM Usage: {memory.percent}% ({memory.used / 1e9:.1f}GB / {memory.total / 1e9:.1f}GB)")

    # 🎮 GPU resources
    if torch.cuda.is_available():
        gpu_memory = torch.cuda.memory_allocated() / 1e9
        gpu_cached = torch.cuda.memory_reserved() / 1e9
        gpu_total = torch.cuda.get_device_properties(0).total_memory / 1e9

        print(f"🎮 GPU Memory: {gpu_memory:.1f}GB allocated, {gpu_cached:.1f}GB cached, {gpu_total:.1f}GB total")
        # This is the share of memory in use, not compute utilization
        print(f"📊 GPU Memory Usage: {(gpu_memory/gpu_total)*100:.1f}%")

# 🚀 Example usage (SimpleNet is the model class defined in step 2)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = SimpleNet().to(device)

# Monitor during training
monitor_system_resources()

🧪 Development Workflow

🛠️ Setup Development Environment

# 📦 Install all development tools
uv sync --extra dev

# 🪝 Set up pre-commit hooks for code quality
pre-commit install

# 🧪 Verify everything works
pytest --version && black --version && mypy --version

✨ Code Quality Arsenal

# 🎨 Format your code beautifully
black src/ tests/
isort src/ tests/

# 🔍 Lint and catch issues
ruff check src/ tests/

# 🎯 Type checking for better code
mypy src/

# 🧪 Run comprehensive tests
pytest

# 📊 Test coverage analysis
pytest --cov=src --cov-report=html

🚀 Performance Optimization Guide

⚡ GPU Memory Optimization

import torch

# 💾 Monitor GPU memory usage
def print_gpu_memory():
    if torch.cuda.is_available():
        allocated = torch.cuda.memory_allocated() / 1e9
        cached = torch.cuda.memory_reserved() / 1e9
        print(f"💾 GPU Memory - Allocated: {allocated:.2f}GB, Cached: {cached:.2f}GB")

# 🧹 Memory cleanup strategies
def cleanup_gpu_memory():
    """Release unused cached GPU memory (tensors still referenced are not freed)"""
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        torch.cuda.synchronize()

# 📊 Gradient accumulation for large effective batch sizes
# (assumes model, criterion, optimizer, dataloader, and device from the training example)
accumulation_steps = 4
optimizer.zero_grad()
for i, (batch_x, batch_y) in enumerate(dataloader):
    batch_x, batch_y = batch_x.to(device), batch_y.to(device)
    outputs = model(batch_x)
    loss = criterion(outputs, batch_y) / accumulation_steps  # Scale so gradients average over the group
    loss.backward()

    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
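Dividing each loss by `accumulation_steps` is what makes the accumulated gradient match the gradient of one large batch. The following small check illustrates this on CPU, assuming equally sized micro-batches and a mean-reduced loss; the tiny `nn.Linear` model and the tensor sizes are made up for the demonstration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(8, 3)
criterion = nn.CrossEntropyLoss()  # mean reduction by default
X = torch.randn(16, 8)
y = torch.randint(0, 3, (16,))

# Reference: gradient of the loss over the full batch of 16 samples
model.zero_grad()
criterion(model(X), y).backward()
full_grad = model.weight.grad.clone()

# Accumulated: 4 micro-batches of 4 samples, each loss divided by accumulation_steps
accumulation_steps = 4
model.zero_grad()
for chunk_x, chunk_y in zip(X.chunk(accumulation_steps), y.chunk(accumulation_steps)):
    loss = criterion(model(chunk_x), chunk_y) / accumulation_steps
    loss.backward()  # backward() adds into .grad across calls
accumulated_grad = model.weight.grad.clone()

# The two gradients agree up to floating-point rounding
grads_match = torch.allclose(full_grad, accumulated_grad, atol=1e-6)
print(grads_match)
```

If the number of batches is not a multiple of `accumulation_steps`, the last partial group in the loop above never reaches `optimizer.step()`, so its gradients are discarded unless a final step is added after the loop.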

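🔥 Training Acceleration

import os

import torch
import torch.optim as optim
from torch.utils.data import DataLoader

# Assumes dataset, model, and optimizer from the training example
num_epochs = 10

# ⚡ DataLoader optimization
dataloader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    num_workers=min(8, os.cpu_count() or 1),  # Heuristic cap; os.cpu_count() can return None
    pin_memory=torch.cuda.is_available(),  # Faster GPU transfer
    persistent_workers=True,  # Keep workers alive (requires num_workers > 0)
    prefetch_factor=2  # Batches prefetched per worker
)

# 🚀 Model compilation (PyTorch 2.0+)
model = torch.compile(
    model,
    mode="max-autotune",  # Most aggressive tuning, at the cost of longer compile time
    dynamic=False  # Static shapes for better optimization
)

# 💡 Learning rate scheduling (call scheduler.step() after every optimizer.step())
scheduler = optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=0.01,
    steps_per_epoch=len(dataloader),
    epochs=num_epochs,
    pct_start=0.3,  # First 30% of steps increase the LR (warmup)
    anneal_strategy='cos'
)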
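`OneCycleLR` is stepped once per batch rather than once per epoch, which is easy to get wrong. The sketch below records the learning rate at every step of a short run so the warmup and annealing phases become visible. The model, the random data, and the 5 x 2 step counts are illustrative assumptions, not values from the template.

```python
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(4, 2)
optimizer = optim.AdamW(model.parameters(), lr=0.001)

steps_per_epoch, num_epochs = 5, 2  # illustrative values: 10 steps in total
scheduler = optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=0.01,
    steps_per_epoch=steps_per_epoch,
    epochs=num_epochs,
    pct_start=0.3,
    anneal_strategy="cos",
)

lrs = []
for epoch in range(num_epochs):
    for _ in range(steps_per_epoch):
        lrs.append(scheduler.get_last_lr()[0])  # LR used for this step
        optimizer.zero_grad()
        loss = model(torch.randn(8, 4)).pow(2).mean()
        loss.backward()
        optimizer.step()
        scheduler.step()  # once per batch, after optimizer.step()

peak_step = lrs.index(max(lrs))
print(f"start LR {lrs[0]:.5f}, peak LR {max(lrs):.5f} at step {peak_step} of {len(lrs)}")
```

The schedule starts at `max_lr / 25` (the default `div_factor`), rises to `max_lr` during the first 30% of steps, and then anneals along a cosine curve. Calling `scheduler.step()` more times than `steps_per_epoch * epochs` raises an error, so the totals passed to the scheduler need to match the real loop.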

🤝 Contributing

We welcome contributions from the community! Here's how to get involved:

🛠️ Development Setup

  1. 🍴 Fork the repository on GitHub
  2. 📥 Clone your fork: git clone https://github.com/yourusername/torch-cuda.git
  3. 📦 Install in development mode: uv sync --extra dev
  4. 🌿 Create a feature branch: git checkout -b feature/amazing-feature
  5. ✨ Make your changes and add comprehensive tests
  6. 🧪 Run the test suite: pytest
  7. 🎨 Format your code: black . && isort .
  8. 📝 Commit your changes: git commit -m 'Add amazing feature'
  9. 🚀 Push to your branch: git push origin feature/amazing-feature
  10. 🔄 Submit a Pull Request

🆘 Troubleshooting

🔥 Common CUDA Issues

❌ CUDA Out of Memory

# 💡 Solutions:
# 1. Reduce batch size
batch_size = 16  # Instead of 64

# 2. Use gradient accumulation
accumulation_steps = 4

# 3. Enable mixed precision (torch.cuda.amp.autocast is deprecated in recent PyTorch)
import torch
with torch.autocast(device_type="cuda", dtype=torch.float16):
    outputs = model(inputs)

# 4. Clear cache periodically
torch.cuda.empty_cache()

🐌 Slow Training Performance

# 💡 Performance boosters:
# 1. Optimize DataLoader
dataloader = DataLoader(
    dataset,
    num_workers=4,      # Parallel loading
    pin_memory=True,    # Faster GPU transfer
    persistent_workers=True  # Keep workers alive
)

# 2. Enable optimizations
torch.backends.cudnn.benchmark = True  # Helps most when input shapes stay fixed
model = torch.compile(model)

# 3. Use appropriate batch sizes
# 32-128 is a common starting range; the best value depends on model size and GPU memory
🚫 Installation Issues

# 🔄 Refresh installation
uv sync --extra all

# 🧹 Clean cache and reinstall
uv cache clean && uv sync

# 🎯 Inspect the resolved dependency tree
uv tree

🆘 Getting Help


📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


🙏 Acknowledgments

  • 🔥 PyTorch Team - For creating the most amazing deep learning framework
  • 🎮 NVIDIA - For the CUDA toolkit and the GPU computing revolution
  • ⚡ Astral Team - For the blazing-fast uv package manager
  • 📊 Polars Team - For lightning-fast data processing
  • 🌟 Open Source Community - For continuous inspiration and collaboration

📞 Connect & Links

GitHub Email

Made with ❤️ and ⚡ GPU acceleration

Built with 🔥 PyTorch • Accelerated by ⚡ CUDA • Powered by 🚀 uv & Modern Python
