A faithful Rust port of @karpathy's microgpt.py — the most atomic way to train and run inference for a GPT, now in pure Rust with zero ML framework dependencies.
"This file is the complete algorithm. Everything else is just efficiency."
A single-file, from-scratch implementation of a GPT language model that:
- Downloads a dataset of ~32K human names
- Trains a character-level transformer to learn the patterns
- Generates new, hallucinated names
Everything is built from first principles — scalar autograd, transformer architecture, Adam optimizer — with no ML libraries.
The entire implementation lives in src/main.rs and mirrors the structure of the original Python script. Here's what each piece does:
The foundation of the system. Every number in the model is wrapped in a `ValueRef` — a reference-counted smart pointer (`Rc<RefCell<ValueInner>>`) to a node in a computation graph.
```rust
ValueInner {
    data: f64,               // the scalar value (forward pass)
    grad: f64,               // ∂loss/∂this_node (backward pass)
    children: Vec<ValueRef>, // parent nodes in the graph
    local_grads: Vec<f64>,   // local derivatives w.r.t. children
}
```
Supported operations (all tracked for backprop):
- Arithmetic: `+`, `-`, `*`, `/` via Rust operator overloading (`Add`, `Sub`, `Mul`, `Div`, `Neg` traits)
- Functions: `pow`, `log`, `exp`, `relu`
Backward pass (`backward()`): performs an iterative DFS to build a topological ordering of the graph, then walks it in reverse to accumulate gradients via the chain rule.
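The same idea can be sketched in miniature. This is a hypothetical standalone version (names like `Val` and `Inner` are illustrative, not the repo's actual types): a reference-counted graph node, two tracked operations, and a `backward()` that topologically sorts the graph with an explicit stack and then pushes gradients to children in reverse order.

```rust
use std::cell::RefCell;
use std::rc::Rc;

// Hypothetical miniature of the ValueRef idea: just add and mul, enough to
// show the forward graph and the reverse-order gradient accumulation.
#[derive(Clone)]
struct Val(Rc<RefCell<Inner>>);

struct Inner {
    data: f64,
    grad: f64,
    children: Vec<Val>,    // nodes this value was computed from
    local_grads: Vec<f64>, // d(this)/d(child), one per child
}

impl Val {
    fn new(data: f64) -> Val {
        Val::from_op(data, vec![], vec![])
    }
    fn from_op(data: f64, children: Vec<Val>, local_grads: Vec<f64>) -> Val {
        Val(Rc::new(RefCell::new(Inner { data, grad: 0.0, children, local_grads })))
    }
    fn add(&self, other: &Val) -> Val {
        Val::from_op(self.0.borrow().data + other.0.borrow().data,
                     vec![self.clone(), other.clone()], vec![1.0, 1.0])
    }
    fn mul(&self, other: &Val) -> Val {
        let (a, b) = (self.0.borrow().data, other.0.borrow().data);
        // d(a*b)/da = b, d(a*b)/db = a
        Val::from_op(a * b, vec![self.clone(), other.clone()], vec![b, a])
    }
    fn data(&self) -> f64 { self.0.borrow().data }
    fn grad(&self) -> f64 { self.0.borrow().grad }

    // Iterative DFS to topologically order the graph, then walk it in
    // reverse, pushing each node's grad into its children via the chain rule.
    fn backward(&self) {
        let mut topo: Vec<Val> = vec![];
        let mut visited: Vec<*const Inner> = vec![];
        let mut stack: Vec<(Val, usize)> = vec![(self.clone(), 0)];
        while let Some((node, child_idx)) = stack.pop() {
            let ptr = node.0.as_ptr() as *const Inner;
            if child_idx == 0 {
                if visited.contains(&ptr) { continue; }
                visited.push(ptr);
            }
            let n_children = node.0.borrow().children.len();
            if child_idx < n_children {
                let child = node.0.borrow().children[child_idx].clone();
                stack.push((node, child_idx + 1)); // resume here after the child
                stack.push((child, 0));
            } else {
                topo.push(node); // all children done: emit in topological order
            }
        }
        self.0.borrow_mut().grad = 1.0; // d(loss)/d(loss) = 1
        for node in topo.iter().rev() {
            let (grad, pairs): (f64, Vec<(Val, f64)>) = {
                let n = node.0.borrow();
                (n.grad, n.children.iter().cloned().zip(n.local_grads.iter().cloned()).collect())
            };
            for (child, local) in pairs {
                child.0.borrow_mut().grad += local * grad; // chain rule
            }
        }
    }
}

fn main() {
    // z = x*y + x at x=3, y=4  →  dz/dx = y + 1 = 5, dz/dy = x = 3
    let x = Val::new(3.0);
    let y = Val::new(4.0);
    let z = x.mul(&y).add(&x);
    z.backward();
    println!("z = {}, dz/dx = {}, dz/dy = {}", z.data(), x.grad(), y.grad());
}
```

The explicit `(node, child_index)` stack replaces the recursion you'd write in Python, which matters here because deep graphs would otherwise overflow the call stack.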
- Dataset: Downloads `names.txt` (~32K names) on first run, caches it as `input.txt`
- Tokenizer: Character-level — each unique character gets an integer ID (0..n-1), plus a special BOS (Beginning of Sequence) token
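A character-level tokenizer is only a few lines. This sketch (hypothetical helper names, not the repo's) collects the unique characters, assigns each an ID by sorted position, and reserves one extra ID for the BOS token that brackets each name:

```rust
use std::collections::BTreeSet;

// Hypothetical sketch of a character-level tokenizer: each unique character
// gets an id 0..n-1, and one extra id is reserved for the BOS token.
fn build_vocab(names: &[&str]) -> (Vec<char>, usize) {
    let chars: BTreeSet<char> = names.iter().flat_map(|n| n.chars()).collect();
    let vocab: Vec<char> = chars.into_iter().collect(); // id = index into this Vec
    let bos_id = vocab.len();                           // one past the last char id
    (vocab, bos_id)
}

fn encode(name: &str, vocab: &[char], bos_id: usize) -> Vec<usize> {
    let mut ids = vec![bos_id]; // BOS | c h a r s | BOS
    ids.extend(name.chars().map(|c| vocab.iter().position(|&v| v == c).unwrap()));
    ids.push(bos_id);
    ids
}

fn main() {
    let names = ["emma", "ava"];
    let (vocab, bos_id) = build_vocab(&names);
    println!("vocab size (incl. BOS) = {}", vocab.len() + 1);
    println!("encode(\"ava\") = {:?}", encode("ava", &vocab, bos_id));
}
```

Using BOS on both ends lets the same token mean "start generating" and "this name is finished", which is how sampling knows when to stop.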
A simplified GPT-2 architecture with a few deliberate simplifications:
| Component | Detail |
|---|---|
| Normalization | RMSNorm (instead of LayerNorm) |
| Activation | ReLU (instead of GeLU) |
| Biases | None |
| Layers | 1 transformer layer |
| Embedding dim | 16 |
| Attention heads | 4 (head_dim = 4) |
| Context length | 16 tokens |
Model functions:
- `linear(x, W)` — matrix-vector multiply (each row of `W` is a neuron)
- `softmax(logits)` — numerically stable softmax with max subtraction
- `rmsnorm(x)` — root-mean-square normalization
- `gpt(token, pos, kv_cache, ...)` — full forward pass for one token:
  - Embed token + position
  - RMSNorm
  - Multi-head self-attention (with KV cache)
  - Residual connection
  - MLP (fc1 → ReLU → fc2)
  - Residual connection
  - Project to logits via `lm_head`
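Two of these helpers are easy to state on plain `f64` slices. This is a sketch only — the real versions in `src/main.rs` do the same arithmetic on graph nodes so gradients flow, and the epsilon shown is an assumed value:

```rust
// Sketch of rmsnorm and the numerically stable softmax on plain f64 slices.
fn rmsnorm(x: &[f64]) -> Vec<f64> {
    let ms = x.iter().map(|v| v * v).sum::<f64>() / x.len() as f64;
    let scale = 1.0 / (ms + 1e-5).sqrt(); // small epsilon avoids divide-by-zero
    x.iter().map(|v| v * scale).collect()
}

fn softmax(logits: &[f64]) -> Vec<f64> {
    // Subtracting the max first means exp() never sees a large positive
    // argument, so it cannot overflow; the result is mathematically identical.
    let max = logits.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = logits.iter().map(|v| (v - max).exp()).collect();
    let sum: f64 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

fn main() {
    let p = softmax(&[1.0, 2.0, 3.0]);
    println!("softmax sums to {}", p.iter().sum::<f64>());
    println!("rmsnorm([3, 4]) = {:?}", rmsnorm(&[3.0, 4.0]));
}
```

RMSNorm is LayerNorm minus the mean-centering and (here) the learned gain, which is why it suits a minimal implementation.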
- Optimizer: Adam (lr=0.01, β₁=0.85, β₂=0.99, ε=1e-8) with linear learning rate decay
- Loss: Cross-entropy (negative log probability of the correct next token)
- Steps: 1000 iterations, each processing one name
- Procedure per step:
- Tokenize a name, wrap with BOS tokens
- Forward each token through the model, collecting losses
- Average the losses
- Backpropagate gradients through the entire computation graph
- Adam update on all parameters
- Zero gradients
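The Adam update itself, per scalar parameter, can be sketched as follows. This is a hypothetical standalone version using the hyperparameters listed above (lr=0.01, β₁=0.85, β₂=0.99, ε=1e-8); whether the repo applies bias correction exactly this way is an assumption:

```rust
// Sketch of one Adam step for a single scalar parameter.
struct AdamState { m: f64, v: f64 } // running first/second moment estimates

fn adam_step(p: &mut f64, grad: f64, s: &mut AdamState, lr: f64, t: u32) {
    let (b1, b2, eps) = (0.85, 0.99, 1e-8);
    s.m = b1 * s.m + (1.0 - b1) * grad;        // EMA of gradients
    s.v = b2 * s.v + (1.0 - b2) * grad * grad; // EMA of squared gradients
    let m_hat = s.m / (1.0 - b1.powi(t as i32)); // bias correction: early
    let v_hat = s.v / (1.0 - b2.powi(t as i32)); // EMAs are biased toward 0
    *p -= lr * m_hat / (v_hat.sqrt() + eps);
}

fn main() {
    let mut p = 1.0;
    let mut s = AdamState { m: 0.0, v: 0.0 };
    // With bias correction, the very first step moves p by almost exactly -lr.
    adam_step(&mut p, 1.0, &mut s, 0.01, 1);
    println!("p after one step = {}", p);
}
```

In the real loop this runs once per parameter per step, with `lr` decayed linearly toward zero over the 1000 iterations.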
After training, the model generates 20 new names by:
- Starting with BOS token
- Running the forward pass to get logits
- Applying temperature scaling (temperature=0.5)
- Sampling next token from the probability distribution
- Repeating until BOS is generated or max length reached
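The sampling step above can be sketched like this. To keep the example dependency-free it uses a hand-rolled xorshift RNG rather than the `rand` crate the project actually depends on; the function name and RNG are illustrative:

```rust
// Sketch of temperature sampling: divide logits by the temperature before
// softmax, then draw an index from the resulting distribution by inverse CDF.
fn sample(logits: &[f64], temperature: f64, rng_state: &mut u64) -> usize {
    // softmax(logits / T), numerically stable via max subtraction.
    // T < 1 sharpens the distribution toward the most likely tokens.
    let scaled: Vec<f64> = logits.iter().map(|v| v / temperature).collect();
    let max = scaled.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = scaled.iter().map(|v| (v - max).exp()).collect();
    let sum: f64 = exps.iter().sum();

    // uniform [0,1) from one xorshift64 step (stand-in for the rand crate)
    *rng_state ^= *rng_state << 13;
    *rng_state ^= *rng_state >> 7;
    *rng_state ^= *rng_state << 17;
    let u = (*rng_state >> 11) as f64 / (1u64 << 53) as f64;

    let mut acc = 0.0;
    for (i, e) in exps.iter().enumerate() {
        acc += e / sum;
        if u < acc { return i; }
    }
    exps.len() - 1 // guard against floating-point rounding at acc ≈ 1.0
}

fn main() {
    let mut rng = 42u64;
    let logits = [0.0, 0.0, 5.0]; // token 2 dominates, even more so at T=0.5
    let draws: Vec<usize> = (0..10).map(|_| sample(&logits, 0.5, &mut rng)).collect();
    println!("draws = {:?}", draws);
}
```

At temperature 0.5 every logit gap doubles before the softmax, so generation stays close to the model's favorite characters while still varying between samples.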
- Rust (edition 2021+)
- Internet connection (for first-run dataset download)
```sh
# Clone the repo
git clone <repo-url> && cd microgpt

# Run (release mode recommended — ~2 min for 1000 steps)
cargo run --release
```

Expected output:

```
num docs: 32033
vocab size: 27
num params: 4720
step 1000 / 1000 | loss 2.0378
--- inference (new, hallucinated names) ---
sample 1: rirela
sample 2: rarnil
sample 3: emile
sample 4: jama
...
```
Note: This is pure scalar autograd — every multiplication is a separate graph node. It's intentionally slow to emphasize clarity over performance. The ~2 minute runtime (release mode) is expected.
You can tweak the model hyperparameters directly in main.rs:
| Parameter | Default | Description |
|---|---|---|
| `n_layer` | 1 | Number of transformer layers |
| `n_embd` | 16 | Embedding dimension |
| `block_size` | 16 | Max context length |
| `n_head` | 4 | Number of attention heads |
| `num_steps` | 1000 | Training iterations |
| `learning_rate` | 0.01 | Initial learning rate |
| `temperature` | 0.5 | Inference sampling temperature (0,1] |
```
microgpt/
├── Cargo.toml    # Dependencies: rand, ureq
├── LICENSE       # MIT License
├── README.md     # You are here
├── src/
│   └── main.rs   # The entire implementation (~580 lines)
└── input.txt     # Downloaded on first run (names dataset)
```
Only two crates, both lightweight:
| Crate | Purpose |
|---|---|
| `rand` 0.8 | Seeded RNG, weighted sampling for inference |
| `ureq` 2 | Blocking HTTP to download the dataset |
- Original Python implementation: Andrej Karpathy — microgpt.py
- Dataset: makemore/names.txt
This project is licensed under the MIT License — see the LICENSE file for details.