A Swift port of Andrej Karpathy's microgpt, built directly on Apple Silicon GPUs using Metal Performance Shaders Graph (MPSGraph).
It's mathematically identical to the original Python version (same architecture, same Adam optimizer, same loss math) but runs roughly 40x faster in the benchmarks below, since it's fully native on the GPU.
I stuck to Karpathy's minimal GPT-2-inspired design with a few simplifications:
- Decoder-only transformer (configurable depth/width)
- Multi-head causal self-attention with separate Q, K, V, O projection matrices
- RMSNorm instead of LayerNorm, and no biases anywhere
- Pre-norm residual connections on attention and MLP blocks
- MLP block: standard linear -> ReLU -> linear (4x expansion)
- Adam optimizer with full bias correction and learning rate decay
- Masked loss: cross-entropy averaged over real tokens (ignores padding)
- Temperature sampling with numerically stable softmax for generation
- Top-level async/await to play nice with Swift 6 strict concurrency
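As one concrete piece of the list above, RMSNorm is simple enough to show in a few lines. This is a plain-Swift sketch for illustration (the real model builds the equivalent ops as an MPSGraph tensor expression, and the function name here is not from the repo): each vector is scaled by the reciprocal of its root-mean-square, with no mean subtraction and no bias.

```swift
import Foundation

/// RMSNorm: scale a vector by the reciprocal of its root-mean-square.
/// Unlike LayerNorm there is no mean subtraction, and this model uses no biases.
func rmsNorm(_ x: [Float], eps: Float = 1e-5) -> [Float] {
    // Mean of squared elements, stabilized by a small epsilon.
    let meanSquare = x.reduce(0) { $0 + $1 * $1 } / Float(x.count)
    let inverseRMS = 1 / sqrt(meanSquare + eps)
    return x.map { $0 * inverseRMS }
}
```

After normalization the output vector has an RMS of (approximately) 1, which keeps activations in a stable range across residual blocks.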
Benchmarks on M-series hardware (training on 32K names):
| Script | 1,000 steps | 5,000 steps | CPU/GPU |
|---|---|---|---|
| microgpt.py (Karpathy) | ~80s | ~6.5m | CPU (single-thread scalar) |
| SwiftMicroGPT | ~1.7s | ~9.5s | GPU (MPSGraph, vectorized) |
Why is the Swift/Metal version so fast?
- Parallelism: MPSGraph spreads the matrix math across the GPU's cores simultaneously, instead of looping over scalars in Python.
- Graph compilation: The forward pass, backward pass, and optimizer are compiled into one giant pre-optimized operation tree. No step-by-step overhead.
- Metal kernels: Apple's backend is hyper-optimized for their own SIMD/matrix silicon.
- Unified memory: CPU and GPU share the same RAM, so we don't have to copy state back and forth across a slow PCIe bus like on traditional discrete GPUs.
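The "graph compilation" point is the core of the design: operations are declared symbolically up front, then MPSGraph compiles and dispatches them as one fused unit. A minimal sketch of that pattern (illustrative shapes and names, not SwiftMicroGPT's actual graph):

```swift
import Metal
import MetalPerformanceShadersGraph

// Declare ops symbolically; nothing executes yet.
let graph = MPSGraph()
let x = graph.placeholder(shape: [4, 8], dataType: .float32, name: "x")
let w = graph.placeholder(shape: [8, 8], dataType: .float32, name: "w")
let h = graph.matrixMultiplication(primary: x, secondary: w, name: nil)
let y = graph.reLU(with: h, name: nil)
// Running the graph (graph.run / MPSGraphExecutable) then executes the whole
// fused matmul + ReLU on the GPU, rather than op-by-op from the CPU.
```

In SwiftMicroGPT the forward pass, backward pass, and Adam update are all expressed this way, so one dispatch per training step covers the entire computation.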
After 5,000 training steps on the names dataset:
```
Dataset loaded. Vocab Size: 27, Documents: 32033
--- GPU Training Initiated ---
Step 0    | Loss: 9.2592
Step 1000 | Loss: 2.5203
Step 2000 | Loss: 2.8649
Step 3000 | Loss: 2.6742
Step 4000 | Loss: 2.5009
Step 4999 | Loss: 2.3209
Training Completed in 11.06 seconds.
--- Inference Generation ---
Sample 1: kanare
Sample 2: andie
Sample 3: bannalie
Sample 4: jadona
Sample 5: mailann
Sample 6: rinan
Sample 7: adilen
Sample 8: danion
Sample 9: mashan
Sample 10: maman
```
- macOS with Apple Silicon (M1 or newer)
- Swift 6.1+ (included with Xcode 16+)
```
swiftc -O -o microgpt src/SwiftMicroGPT.swift src/main.swift \
  -framework Metal \
  -framework MetalPerformanceShaders \
  -framework MetalPerformanceShadersGraph

./microgpt
```

The dataset is expected at `src/data/names.txt` relative to the working directory.
Hyperparameters are defined in the GPTConfig struct:
```swift
struct GPTConfig {
    let numberOfLayers: Int = 1
    let embeddingDimension: Int = 16
    let contextWindowSize: Int = 16
    let numberOfHeads: Int = 4
    var headDimension: Int { embeddingDimension / numberOfHeads }
    let learningRate: Float = 0.01
    let beta1: Float = 0.85
    let beta2: Float = 0.99
    let adamEpsilon: Float = 1e-8
    let numberOfTrainingSteps: Int = 5000
    let temperature: Float = 0.5
}
```

| Parameter | Description |
|---|---|
| `numberOfLayers` | Number of transformer blocks |
| `embeddingDimension` | Embedding dimension (model width) |
| `contextWindowSize` | Maximum context length (attention window) |
| `numberOfHeads` | Number of attention heads (must divide `embeddingDimension`) |
| `learningRate` | Peak learning rate (linearly decayed to zero) |
| `beta1`, `beta2` | Adam momentum coefficients |
| `adamEpsilon` | Denominator stabilizer in the Adam update |
| `numberOfTrainingSteps` | Total training iterations |
| `temperature` | Sampling temperature for inference (lower = more conservative) |
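The temperature parameter feeds the sampling step described earlier: logits are divided by the temperature, pushed through a numerically stable softmax, and a token is drawn from the resulting distribution. A plain-Swift sketch of that logic (function names here are illustrative, not the repo's actual API):

```swift
import Foundation

/// Numerically stable softmax with temperature. Subtracting the max logit
/// before exponentiating prevents overflow; temperature < 1 sharpens the
/// distribution, temperature > 1 flattens it.
func softmax(_ logits: [Float], temperature: Float) -> [Float] {
    let scaled = logits.map { $0 / temperature }
    let maxLogit = scaled.max() ?? 0
    let exps = scaled.map { exp($0 - maxLogit) }
    let sum = exps.reduce(0, +)
    return exps.map { $0 / sum }
}

/// Draw one token index from the categorical distribution via inverse CDF.
func sample(_ probs: [Float]) -> Int {
    var r = Float.random(in: 0..<1)
    for (i, p) in probs.enumerated() {
        r -= p
        if r < 0 { return i }
    }
    return probs.count - 1
}
```

At the default `temperature = 0.5` the softmax is sharpened, so generation leans toward high-probability characters and produces the fairly conservative names shown in the sample output above.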
MIT License. See LICENSE for details.
*P.S. Expecting a kid and can't decide on a name? Dump your shortlist into src/data/names.txt, spin up a quick run, and let your Mac's GPU name your baby.*