# MicroGPT — Rust 🦀

A faithful Rust port of @karpathy's microgpt.py — the most atomic way to train and run inference for a GPT, now in pure Rust with zero ML framework dependencies.

> "This file is the complete algorithm. Everything else is just efficiency."

## What is this?

A single-file, from-scratch implementation of a GPT language model that:

  1. Downloads a dataset of ~32K human names
  2. Trains a character-level transformer to learn the patterns
  3. Generates new, hallucinated names

Everything is built from first principles — scalar autograd, transformer architecture, Adam optimizer — with no ML libraries.

## How it works

The entire implementation lives in src/main.rs and mirrors the structure of the original Python script. Here's what each piece does:

### 1. Autograd Engine (`ValueRef`)

The foundation of the system. Every number in the model is wrapped in a `ValueRef` — a reference-counted smart pointer (`Rc<RefCell<ValueInner>>`) to a node in a computation graph.

```rust
ValueInner {
    data: f64,               // the scalar value (forward pass)
    grad: f64,               // ∂loss/∂this_node (backward pass)
    children: Vec<ValueRef>, // the input nodes this value was computed from
    local_grads: Vec<f64>,   // local derivatives w.r.t. each child
}
```

Supported operations (all tracked for backprop):

- Arithmetic: `+`, `-`, `*`, `/` via Rust operator overloading (`Add`, `Mul`, `Sub`, `Div`, `Neg` traits)
- Functions: `pow`, `log`, `exp`, `relu`

**Backward pass** (`backward()`): performs an iterative DFS to build a topological ordering of the graph, then walks it in reverse to accumulate gradients via the chain rule.

### 2. Dataset & Tokenizer

- **Dataset:** Downloads `names.txt` (~32K names) on first run, caches it as `input.txt`
- **Tokenizer:** Character-level — each unique character gets an integer ID (0..n-1), plus a special BOS (Beginning of Sequence) token

### 3. GPT Model

A GPT-2-style architecture with a few deliberate simplifications:

| Component | Detail |
|---|---|
| Normalization | RMSNorm (instead of LayerNorm) |
| Activation | ReLU (instead of GeLU) |
| Biases | None |
| Layers | 1 transformer layer |
| Embedding dim | 16 |
| Attention heads | 4 (head_dim = 4) |
| Context length | 16 tokens |

Model functions:

- `linear(x, W)` — matrix-vector multiply (each row of `W` is a neuron)
- `softmax(logits)` — numerically stable softmax with max subtraction
- `rmsnorm(x)` — root-mean-square normalization
- `gpt(token, pos, kv_cache, ...)` — full forward pass for one token:
  1. Embed token + position
  2. RMSNorm
  3. Multi-head self-attention (with KV cache)
  4. Residual connection
  5. MLP (`fc1` → ReLU → `fc2`)
  6. Residual connection
  7. Project to logits via `lm_head`

### 4. Training

- **Optimizer:** Adam (lr=0.01, β₁=0.85, β₂=0.99, ε=1e-8) with linear learning-rate decay
- **Loss:** Cross-entropy (negative log-probability of the correct next token)
- **Steps:** 1000 iterations, each processing one name
- **Procedure per step:**
  1. Tokenize a name and wrap it with BOS tokens
  2. Forward each token through the model, collecting losses
  3. Average the losses
  4. Backpropagate gradients through the entire computation graph
  5. Apply the Adam update to all parameters
  6. Zero the gradients

### 5. Inference

After training, the model generates 20 new names by:

  1. Starting with BOS token
  2. Running the forward pass to get logits
  3. Applying temperature scaling (temperature=0.5)
  4. Sampling next token from the probability distribution
  5. Repeating until BOS is generated or max length reached

## How to run

### Prerequisites

- Rust (edition 2021+)
- Internet connection (for first-run dataset download)

### Build & run

```sh
# Clone the repo
git clone <repo-url> && cd microgpt

# Run (release mode recommended — ~2 min for 1000 steps)
cargo run --release
```

### Expected output

```text
num docs: 32033
vocab size: 27
num params: 4720
step 1000 / 1000 | loss 2.0378
--- inference (new, hallucinated names) ---
sample  1: rirela
sample  2: rarnil
sample  3: emile
sample  4: jama
...
```

Note: This is pure scalar autograd — every multiplication is a separate graph node. It's intentionally slow to emphasize clarity over performance. The ~2 minute runtime (release mode) is expected.

## Configuration

You can tweak the model hyperparameters directly in `main.rs`:

| Parameter | Default | Description |
|---|---|---|
| `n_layer` | 1 | Number of transformer layers |
| `n_embd` | 16 | Embedding dimension |
| `block_size` | 16 | Max context length |
| `n_head` | 4 | Number of attention heads |
| `num_steps` | 1000 | Training iterations |
| `learning_rate` | 0.01 | Initial learning rate |
| `temperature` | 0.5 | Inference sampling temperature, in (0, 1] |

## Project structure

```text
microgpt/
├── Cargo.toml      # Dependencies: rand, ureq
├── LICENSE         # MIT License
├── README.md       # You are here
├── src/
│   └── main.rs     # The entire implementation (~580 lines)
└── input.txt       # Downloaded on first run (names dataset)
```

## Dependencies

Only two crates, both lightweight:

| Crate | Purpose |
|---|---|
| `rand` 0.8 | Seeded RNG and weighted sampling for inference |
| `ureq` 2 | Blocking HTTP client to download the dataset |

## Credits

The original microgpt.py is by @karpathy; this repository is a faithful Rust port of it.

## License

This project is licensed under the MIT License — see the LICENSE file for details.
