HSPMN v2.1: Bio-Inspired Adaptive Computation 🚀


HSPMN v2.1 is a next-generation LLM architecture optimized for NVIDIA Blackwell (RTX 5090). It achieves roughly a 5x reduction in attention FLOPs via Context-Aware Target-Sparsity Routing, reaching up to 1.41M tok/s on a single GPU by combining bio-inspired adaptive computation patterns with hardware-native block sparsity.

"The brain doesn't use the full weight of the neocortex to process a simple 'hello'. Why should our models?"


🎯 Key Features

  • Hybrid Execution Strategy: FlexAttention (Training) + Triton SQDK Kernels (Inference).
  • Hardware-Native Sparsity: Custom Triton kernels optimized for H100/Blackwell (num_warps=4).
  • 262k Context Window: Verified on RTX 5090 (11.94 GB VRAM usage).
  • High Throughput: 1.41M tokens/sec (Production Scale, BF16).
  • Entropy Minimization: Router learns crisp binary decisions.
  • Zero Graph Breaks: Fully compatible with torch.compile.

🚀 Performance Verified (RTX 5090)

Metric           Value              Notes
Throughput       1,406,304 tok/s    Batch=64, Seq=4096, Dim=2048 (Triton kernel)
Max Context      262,144 tokens     Batch=1, Dim=2048 (11.94 GB VRAM)
Latency          186 ms             End-to-end forward pass (Batch=64, 4k seq)
Training Speed   ~980k tok/s        Real training with gradients (FlexAttention)
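
As a quick consistency check, the throughput and latency rows agree with each other: a batch of 64 sequences of 4,096 tokens is 64 × 4,096 = 262,144 tokens, and 262,144 tokens / 0.186 s ≈ 1.41M tok/s, matching the measured 1,406,304 tok/s.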

📦 Installation

Prerequisites: NVIDIA Driver 550+, CUDA 12.4+, Python 3.10+

# Clone repository
git clone https://github.com/your-username/HSPMN-v2.git
cd HSPMN-v2

# Install dependencies (Strictly pinned for stability)
pip install -r requirements.txt
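
Optionally, confirm that the CUDA build of PyTorch, Triton, and the GPU are all visible before running anything. This is just a quick sanity check, not part of the repository's scripts:

# Optional environment sanity check
python -c "import torch, triton; print(torch.__version__, triton.__version__, torch.cuda.get_device_name(0))"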

🎓 Quick Start

1. Run Benchmarks

Verify your hardware capability immediately:

# Run both throughput and stress tests
python benchmark_v2_1.py --mode all

2. Simple Inference

import torch
from hspmn_v2_1 import HSPMNBlock, HSPMNConfig

# Configure for speed
config = HSPMNConfig(dim=2048, num_heads=16, sparsity_k=0.2)
model = HSPMNBlock(config).cuda().bfloat16()

# CRITICAL: Model uses compiled flex_attention internally
# Compile the full model for maximum performance
model = torch.compile(model, mode="reduce-overhead")

# Run (first call will be slow due to compilation)
x = torch.randn(1, 4096, 2048).cuda().bfloat16()
output, aux_loss = model(x)
print(output.shape)
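
The "compiled flex_attention" mentioned in the comment above refers to PyTorch's FlexAttention API, which the training path uses for autograd support. The minimal sketch below shows how a router-selected sparse pattern can be expressed as a FlexAttention block mask; the keep tensor, shapes, and mask function are illustrative assumptions, not the repository's internal code.

import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

B, H, S, D = 1, 16, 4096, 128
q = torch.randn(B, H, S, D, device="cuda", dtype=torch.bfloat16)
k = torch.randn(B, H, S, D, device="cuda", dtype=torch.bfloat16)
v = torch.randn(B, H, S, D, device="cuda", dtype=torch.bfloat16)

# Stand-in for the router: True where a token is routed to the contextual stream
keep = torch.rand(B, S, device="cuda") < 0.2

def routed_causal_mask(b, h, q_idx, kv_idx):
    # Causal attention restricted to router-selected query tokens;
    # non-routed tokens keep only the diagonal so no row is fully masked
    return (keep[b, q_idx] & (q_idx >= kv_idx)) | (q_idx == kv_idx)

block_mask = create_block_mask(routed_causal_mask, B, None, S, S, device="cuda")
out = flex_attention(q, k, v, block_mask=block_mask)
print(out.shape)  # (1, 16, 4096, 128)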

3. High-Performance Training

python train_v2_1.py \
    --batch 32 \
    --seq_len 4096 \
    --dim 2048 \
    --steps 1000 \
    --grad_accum 4 \
    --wandb "hspmn-experiment-1"
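
With these flags, each optimizer step accumulates gradients over batch × grad_accum = 32 × 4 = 128 sequences, i.e. 128 × 4,096 = 524,288 tokens per weight update.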

4. Testing & Verification

Ensure kernel correctness and model integrity:

# Verify Triton kernels against PyTorch reference
python test_kernels_v2_1.py

# Verify saved checkpoints
python verify_models.py
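
The kernel test follows the standard pattern of checking the Triton output against a dense PyTorch reference. The sketch below shows that pattern only; the sqdk_attention import, its signature, and the routed-query semantics are assumptions for illustration, not the actual interface in kernels_v2_1.py.

import torch
import torch.nn.functional as F
# Hypothetical entry point; the real kernel in kernels_v2_1.py may differ.
from kernels_v2_1 import sqdk_attention

B, H, S, D, K = 2, 16, 1024, 128, 256            # K = number of routed query tokens
q = torch.randn(B, H, S, D, device="cuda", dtype=torch.bfloat16)
k = torch.randn_like(q)
v = torch.randn_like(q)
q_idx = torch.randperm(S, device="cuda")[:K].sort().values   # routed query positions

# Dense PyTorch reference: selected queries attend over all keys/values
ref = F.scaled_dot_product_attention(q[:, :, q_idx], k, v)

# Triton kernel under test (assumed to accept the routed indices directly)
out = sqdk_attention(q, k, v, q_idx)

# Loose BF16 tolerances: the two paths accumulate in different orders
torch.testing.assert_close(out, ref, atol=2e-2, rtol=2e-2)
print("kernel matches reference")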

🧠 Architecture Highlights

  1. Reflexive Stream (System 1):
    • Runs on all tokens.
    • Components: RMSNorm -> Depthwise Conv1d -> SwiGLU MLP.
    • Role: Syntax, grammar, shallow processing.
  2. Contextual Stream (System 2):
    • Runs on sparse tokens (Top-K Router).
    • Inference: Uses custom Triton SQDK Kernel for max speed.
    • Training: Uses FlexAttention for autograd support.
    • Role: Logic, reasoning, long-range dependencies.
  3. Router:
    • Learned Top-K selection with Gumbel-Softmax.
    • Entropy Minimization ensures the router makes confident (0 or 1) decisions (see the sketch below).
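
The sketch below illustrates the dual-stream pattern in plain PyTorch. It is illustrative only: module names and shapes are assumptions, a plain sigmoid score stands in for the Gumbel-Softmax router, and standard multi-head attention stands in for the Triton/FlexAttention paths; the real block in hspmn_v2_1.py is more involved.

import torch
import torch.nn as nn

class ToyDualStreamBlock(nn.Module):
    # Illustrative only: a cheap path for every token plus sparse attention
    # for the router-selected subset, with an entropy-style auxiliary loss.
    def __init__(self, dim: int, num_heads: int, sparsity_k: float = 0.2):
        super().__init__()
        self.norm = nn.RMSNorm(dim)
        self.reflexive = nn.Sequential(                 # System 1: all tokens
            nn.Linear(dim, 4 * dim), nn.SiLU(), nn.Linear(4 * dim, dim)
        )
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)  # System 2
        self.router = nn.Linear(dim, 1)                 # per-token routing score
        self.sparsity_k = sparsity_k

    def forward(self, x: torch.Tensor):
        b, s, d = x.shape
        h = self.norm(x)

        # System 1: cheap, dense processing of every token (linear in s)
        x = x + self.reflexive(h)

        # Router: keep the top sparsity_k fraction of tokens for heavy attention
        logits = self.router(h).squeeze(-1)             # (b, s)
        k = max(1, int(self.sparsity_k * s))
        idx = logits.topk(k, dim=-1).indices            # (b, k)
        p = torch.sigmoid(logits)
        # Entropy minimization: pushing p toward 0 or 1 makes routing decisions crisp
        aux_loss = -(p * p.clamp_min(1e-6).log()
                     + (1 - p) * (1 - p).clamp_min(1e-6).log()).mean()

        # System 2: attention only among the selected tokens, results scattered back
        gather_idx = idx.unsqueeze(-1).expand(-1, -1, d)
        selected = torch.gather(h, 1, gather_idx)       # (b, k, d)
        attended, _ = self.attn(selected, selected, selected)
        x = x.scatter_add(1, gather_idx, attended)
        return x, aux_loss

block = ToyDualStreamBlock(dim=256, num_heads=8).cuda()
out, aux = block(torch.randn(2, 512, 256, device="cuda"))
print(out.shape, aux.item())  # torch.Size([2, 512, 256]) and a scalar aux loss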

💡 Core Concept & Applications

HSPMN v2.1 addresses the quadratic bottleneck of traditional Transformers by decoupling memory capacity from compute cost. While standard models process every token with equal intensity, HSPMN uses a Dual-System Architecture:

  1. Reflexive Stream (System 1): Handles syntax and local patterns for all tokens (Linear complexity).
  2. Contextual Stream (System 2): Activates heavy attention only for semantically dense tokens (sparse complexity; a rough FLOPs estimate follows below).
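
As a back-of-the-envelope check on the roughly 5x attention-FLOP reduction quoted in the introduction: assuming the default sparsity_k=0.2 and that each selected query still attends over all keys (the query-sparse reading suggested by the SQDK kernel name), the score computation shrinks from ~N² to ~0.2·N² query-key pairs.

# Rough FLOP comparison under the query-sparse assumption above
seq = 4096
dense_pairs = seq * seq                    # full attention: every query vs every key
sparse_pairs = int(0.2 * seq) * seq        # 20% of queries vs every key
print(dense_pairs / sparse_pairs)          # ≈ 5.0

If keys were also restricted to the selected 20% of tokens, the reduction would be closer to 25x, so the 5x figure is consistent with the more conservative query-sparse reading.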

Real-World Use Cases

  • Private Long-Document Analysis: Process 500+ page legal/medical contracts locally on a single GPU without data leaving the premises.
  • Repository-Level Coding Agents: Ingest entire codebases (200k+ tokens) into context for "whole-project" awareness with low interactive latency.
  • Real-Time Log Filtering: Efficiently scan terabytes of server logs, where the Router automatically learns to ignore repetitive noise and attend only to anomalies.

📂 Project Structure

HSPMN-v2/
├── hspmn_v2_1.py           # Core architecture (Clean, Type-hinted)
├── kernels_v2_1.py         # Custom Triton SQDK kernels
├── utils_v2_1.py           # Configuration and helper functions
├── train_v2_1.py           # Production-grade training script
├── benchmark_v2_1.py       # Unified benchmarking tool
├── test_kernels_v2_1.py    # Unit tests for Triton kernels
├── verify_models.py        # Checkpoint verification script
├── requirements.txt        # Minimal dependencies
└── README.md               # Documentation

Author: Szymon Jędryczko
License: MIT
