A faithful Rust port of @karpathy's microgpt.py — the most atomic way to train and run inference for a GPT, now in pure Rust with zero ML framework dependencies.
"This file is the complete algorithm. Everything else is just efficiency."
A single-file, from-scratch implementation of a GPT language model that:
- Downloads a dataset of ~32K human names
- Trains a character-level transformer to learn the patterns
- Generates new, hallucinated names
Everything is built from first principles — scalar autograd, transformer architecture, Adam optimizer — with no ML libraries.
The entire implementation lives in src/main.rs and mirrors the structure of the original Python script. Here's what each piece does:
The foundation of the system. Every number in the model is wrapped in a `ValueRef` — a reference-counted smart pointer (`Rc<RefCell<ValueInner>>`) to a node in a computation graph.
```rust
ValueInner {
    data: f64,               // the scalar value (forward pass)
    grad: f64,               // ∂loss/∂this_node (backward pass)
    children: Vec<ValueRef>, // parent nodes in the graph
    local_grads: Vec<f64>,   // local derivatives w.r.t. children
}
```
Supported operations (all tracked for backprop):
- Arithmetic: `+`, `-`, `*`, `/` via Rust operator overloading (`Add`, `Sub`, `Mul`, `Div`, `Neg` traits)
- Functions: `pow`, `log`, `exp`, `relu`
Backward pass (`backward()`): performs an iterative DFS to build a topological ordering of the graph, then walks it in reverse to accumulate gradients via the chain rule.
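The same idea can be sketched in miniature. This is a hypothetical standalone version (names like `Val` and `Inner` are illustrative, not the repo's actual types): a reference-counted graph node, two tracked operations, and a `backward()` that topologically sorts the graph with an explicit stack and then pushes gradients to children in reverse order.

```rust
use std::cell::RefCell;
use std::rc::Rc;

// Hypothetical miniature of the ValueRef idea: just add and mul, enough to
// show the forward graph and the reverse-order gradient accumulation.
#[derive(Clone)]
struct Val(Rc<RefCell<Inner>>);

struct Inner {
    data: f64,
    grad: f64,
    children: Vec<Val>,    // nodes this value was computed from
    local_grads: Vec<f64>, // d(this)/d(child), one per child
}

impl Val {
    fn new(data: f64) -> Val {
        Val::from_op(data, vec![], vec![])
    }
    fn from_op(data: f64, children: Vec<Val>, local_grads: Vec<f64>) -> Val {
        Val(Rc::new(RefCell::new(Inner { data, grad: 0.0, children, local_grads })))
    }
    fn add(&self, other: &Val) -> Val {
        Val::from_op(self.0.borrow().data + other.0.borrow().data,
                     vec![self.clone(), other.clone()], vec![1.0, 1.0])
    }
    fn mul(&self, other: &Val) -> Val {
        let (a, b) = (self.0.borrow().data, other.0.borrow().data);
        // d(a*b)/da = b, d(a*b)/db = a
        Val::from_op(a * b, vec![self.clone(), other.clone()], vec![b, a])
    }
    fn data(&self) -> f64 { self.0.borrow().data }
    fn grad(&self) -> f64 { self.0.borrow().grad }

    // Iterative DFS to topologically order the graph, then walk it in
    // reverse, pushing each node's grad into its children via the chain rule.
    fn backward(&self) {
        let mut topo: Vec<Val> = vec![];
        let mut visited: Vec<*const Inner> = vec![];
        let mut stack: Vec<(Val, usize)> = vec![(self.clone(), 0)];
        while let Some((node, child_idx)) = stack.pop() {
            let ptr = node.0.as_ptr() as *const Inner;
            if child_idx == 0 {
                if visited.contains(&ptr) { continue; }
                visited.push(ptr);
            }
            let n_children = node.0.borrow().children.len();
            if child_idx < n_children {
                let child = node.0.borrow().children[child_idx].clone();
                stack.push((node, child_idx + 1)); // resume here after the child
                stack.push((child, 0));
            } else {
                topo.push(node); // all children done: emit in topological order
            }
        }
        self.0.borrow_mut().grad = 1.0; // d(loss)/d(loss) = 1
        for node in topo.iter().rev() {
            let (grad, pairs): (f64, Vec<(Val, f64)>) = {
                let n = node.0.borrow();
                (n.grad, n.children.iter().cloned().zip(n.local_grads.iter().cloned()).collect())
            };
            for (child, local) in pairs {
                child.0.borrow_mut().grad += local * grad; // chain rule
            }
        }
    }
}

fn main() {
    // z = x*y + x at x=3, y=4  →  dz/dx = y + 1 = 5, dz/dy = x = 3
    let x = Val::new(3.0);
    let y = Val::new(4.0);
    let z = x.mul(&y).add(&x);
    z.backward();
    println!("z = {}, dz/dx = {}, dz/dy = {}", z.data(), x.grad(), y.grad());
}
```

The explicit `(node, child_index)` stack replaces the recursion you'd write in Python, which matters here because deep graphs would otherwise overflow the call stack.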
- Dataset: Downloads `names.txt` (~32K names) on first run, caches it as `input.txt`
- Tokenizer: Character-level — each unique character gets an integer ID (0..n-1), plus a special BOS (Beginning of Sequence) token
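A character-level tokenizer is only a few lines. This sketch (hypothetical helper names, not the repo's) collects the unique characters, assigns each an ID by sorted position, and reserves one extra ID for the BOS token that brackets each name:

```rust
use std::collections::BTreeSet;

// Hypothetical sketch of a character-level tokenizer: each unique character
// gets an id 0..n-1, and one extra id is reserved for the BOS token.
fn build_vocab(names: &[&str]) -> (Vec<char>, usize) {
    let chars: BTreeSet<char> = names.iter().flat_map(|n| n.chars()).collect();
    let vocab: Vec<char> = chars.into_iter().collect(); // id = index into this Vec
    let bos_id = vocab.len();                           // one past the last char id
    (vocab, bos_id)
}

fn encode(name: &str, vocab: &[char], bos_id: usize) -> Vec<usize> {
    let mut ids = vec![bos_id]; // BOS | c h a r s | BOS
    ids.extend(name.chars().map(|c| vocab.iter().position(|&v| v == c).unwrap()));
    ids.push(bos_id);
    ids
}

fn main() {
    let names = ["emma", "ava"];
    let (vocab, bos_id) = build_vocab(&names);
    println!("vocab size (incl. BOS) = {}", vocab.len() + 1);
    println!("encode(\"ava\") = {:?}", encode("ava", &vocab, bos_id));
}
```

Using BOS on both ends lets the same token mean "start generating" and "this name is finished", which is how sampling knows when to stop.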
A simplified GPT-2 architecture with a few deliberate simplifications:
| Component | Detail |
|---|---|
| Normalization | RMSNorm (instead of LayerNorm) |
| Activation | ReLU (instead of GeLU) |
| Biases | None |
| Layers | 1 transformer layer |
| Embedding dim | 16 |
| Attention heads | 4 (head_dim = 4) |
| Context length | 16 tokens |
Model functions:
- `linear(x, W)` — matrix-vector multiply (each row of `W` is a neuron)
- `softmax(logits)` — numerically stable softmax with max subtraction
- `rmsnorm(x)` — root-mean-square normalization
- `gpt(token, pos, kv_cache, ...)` — full forward pass for one token:
  - Embed token + position
  - RMSNorm
  - Multi-head self-attention (with KV cache)
  - Residual connection
  - MLP (fc1 → ReLU → fc2)
  - Residual connection
  - Project to logits via `lm_head`
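Two of these helpers are easy to state on plain `f64` slices. This is a sketch only — the real versions in `src/main.rs` do the same arithmetic on graph nodes so gradients flow, and the epsilon shown is an assumed value:

```rust
// Sketch of rmsnorm and the numerically stable softmax on plain f64 slices.
fn rmsnorm(x: &[f64]) -> Vec<f64> {
    let ms = x.iter().map(|v| v * v).sum::<f64>() / x.len() as f64;
    let scale = 1.0 / (ms + 1e-5).sqrt(); // small epsilon avoids divide-by-zero
    x.iter().map(|v| v * scale).collect()
}

fn softmax(logits: &[f64]) -> Vec<f64> {
    // Subtracting the max first means exp() never sees a large positive
    // argument, so it cannot overflow; the result is mathematically identical.
    let max = logits.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = logits.iter().map(|v| (v - max).exp()).collect();
    let sum: f64 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

fn main() {
    let p = softmax(&[1.0, 2.0, 3.0]);
    println!("softmax sums to {}", p.iter().sum::<f64>());
    println!("rmsnorm([3, 4]) = {:?}", rmsnorm(&[3.0, 4.0]));
}
```

RMSNorm is LayerNorm minus the mean-centering and (here) the learned gain, which is why it suits a minimal implementation.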
- Optimizer: Adam (lr=0.01, β₁=0.85, β₂=0.99, ε=1e-8) with linear learning rate decay
- Loss: Cross-entropy (negative log probability of the correct next token)
- Steps: 1000 iterations, each processing one name
- Procedure per step:
- Tokenize a name, wrap with BOS tokens
- Forward each token through the model, collecting losses
- Average the losses
- Backpropagate gradients through the entire computation graph
- Adam update on all parameters
- Zero gradients
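The Adam update itself, per scalar parameter, can be sketched as follows. This is a hypothetical standalone version using the hyperparameters listed above (lr=0.01, β₁=0.85, β₂=0.99, ε=1e-8); whether the repo applies bias correction exactly this way is an assumption:

```rust
// Sketch of one Adam step for a single scalar parameter.
struct AdamState { m: f64, v: f64 } // running first/second moment estimates

fn adam_step(p: &mut f64, grad: f64, s: &mut AdamState, lr: f64, t: u32) {
    let (b1, b2, eps) = (0.85, 0.99, 1e-8);
    s.m = b1 * s.m + (1.0 - b1) * grad;        // EMA of gradients
    s.v = b2 * s.v + (1.0 - b2) * grad * grad; // EMA of squared gradients
    let m_hat = s.m / (1.0 - b1.powi(t as i32)); // bias correction: early
    let v_hat = s.v / (1.0 - b2.powi(t as i32)); // EMAs are biased toward 0
    *p -= lr * m_hat / (v_hat.sqrt() + eps);
}

fn main() {
    let mut p = 1.0;
    let mut s = AdamState { m: 0.0, v: 0.0 };
    // With bias correction, the very first step moves p by almost exactly -lr.
    adam_step(&mut p, 1.0, &mut s, 0.01, 1);
    println!("p after one step = {}", p);
}
```

In the real loop this runs once per parameter per step, with `lr` decayed linearly toward zero over the 1000 iterations.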
After training, the model generates 20 new names by:
- Starting with BOS token
- Running the forward pass to get logits
- Applying temperature scaling (temperature=0.5)
- Sampling next token from the probability distribution
- Repeating until BOS is generated or max length reached
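The sampling step above can be sketched like this. To keep the example dependency-free it uses a hand-rolled xorshift RNG rather than the `rand` crate the project actually depends on; the function name and RNG are illustrative:

```rust
// Sketch of temperature sampling: divide logits by the temperature before
// softmax, then draw an index from the resulting distribution by inverse CDF.
fn sample(logits: &[f64], temperature: f64, rng_state: &mut u64) -> usize {
    // softmax(logits / T), numerically stable via max subtraction.
    // T < 1 sharpens the distribution toward the most likely tokens.
    let scaled: Vec<f64> = logits.iter().map(|v| v / temperature).collect();
    let max = scaled.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = scaled.iter().map(|v| (v - max).exp()).collect();
    let sum: f64 = exps.iter().sum();

    // uniform [0,1) from one xorshift64 step (stand-in for the rand crate)
    *rng_state ^= *rng_state << 13;
    *rng_state ^= *rng_state >> 7;
    *rng_state ^= *rng_state << 17;
    let u = (*rng_state >> 11) as f64 / (1u64 << 53) as f64;

    let mut acc = 0.0;
    for (i, e) in exps.iter().enumerate() {
        acc += e / sum;
        if u < acc { return i; }
    }
    exps.len() - 1 // guard against floating-point rounding at acc ≈ 1.0
}

fn main() {
    let mut rng = 42u64;
    let logits = [0.0, 0.0, 5.0]; // token 2 dominates, even more so at T=0.5
    let draws: Vec<usize> = (0..10).map(|_| sample(&logits, 0.5, &mut rng)).collect();
    println!("draws = {:?}", draws);
}
```

At temperature 0.5 every logit gap doubles before the softmax, so generation stays close to the model's favorite characters while still varying between samples.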
- Rust (edition 2021+)
- Internet connection (for first-run dataset download)
```sh
# Clone the repo
git clone <repo-url> && cd microgpt

# Run (release mode recommended — ~2 min for 1000 steps)
cargo run --release
```

Expected output:

```
num docs: 32033
vocab size: 27
num params: 4720
step 1000 / 1000 | loss 2.0378
--- inference (new, hallucinated names) ---
sample 1: rirela
sample 2: rarnil
sample 3: emile
sample 4: jama
...
```
Note: This is pure scalar autograd — every multiplication is a separate graph node. It's intentionally slow to emphasize clarity over performance. The ~2 minute runtime (release mode) is expected.
You can tweak the model hyperparameters directly in main.rs:
| Parameter | Default | Description |
|---|---|---|
| `n_layer` | 1 | Number of transformer layers |
| `n_embd` | 16 | Embedding dimension |
| `block_size` | 16 | Max context length |
| `n_head` | 4 | Number of attention heads |
| `num_steps` | 1000 | Training iterations |
| `learning_rate` | 0.01 | Initial learning rate |
| `temperature` | 0.5 | Inference sampling temperature (0,1] |
```
microgpt/
├── Cargo.toml    # Dependencies: rand, ureq
├── LICENSE       # MIT License
├── README.md     # You are here
├── src/
│   └── main.rs   # The entire implementation (~580 lines)
└── input.txt     # Downloaded on first run (names dataset)
```
Only two crates, both lightweight:
| Crate | Purpose |
|---|---|
| `rand` 0.8 | Seeded RNG, weighted sampling for inference |
| `ureq` 2 | Blocking HTTP to download the dataset |
- Original Python implementation: Andrej Karpathy — microgpt.py
- Dataset: makemore/names.txt
This project is licensed under the MIT License — see the LICENSE file for details.