🚧 This project is a work in progress. Features, optimizations, and additional language ports may be added over time.
A multi-language port of Andrej Karpathy's beautifully minimal GPT trainer. One file per language. Zero (or near-zero) external dependencies. The complete algorithm — everything else is just efficiency.
MicroGPT implements a complete GPT-2 style language model from scratch — including autograd, training, and inference — in a single file per language. It trains on a character-level dataset (names by default) and learns to generate novel, hallucinated names.
The project comprises the original Python implementation and faithful ports to seven additional languages, eight in total:
| Language | File | Dependencies | Compile / Run |
|---|---|---|---|
| 🐍 Python | `microgpt.py` | None (stdlib only) | `python microgpt.py` |
| ⚙️ C++ | `microgpt.cpp` | STL only | `g++ -std=c++17 -O2 -o microgpt microgpt.cpp && ./microgpt` |
| 🦀 Rust | `microgpt.rs` | `std` only | `rustc -O microgpt.rs -o microgpt && ./microgpt` |
| 🍎 Swift | `microgpt.swift` | Foundation only | `swiftc -O microgpt.swift -o microgpt && ./microgpt` |
| 🎯 Dart | `microgpt.dart` | `dart:io` + `dart:math` | `dart compile exe microgpt.dart -o microgpt && ./microgpt` |
| 🟣 Kotlin | `microgpt.kt` | `kotlin.math` + `java.io` | `kotlinc microgpt.kt -include-runtime -d microgpt.jar && java -jar microgpt.jar` |
| 🟡 JavaScript | `microgpt.js` | `fs` (Node.js built-in) | `node microgpt.js` |
| 🔵 Go | `microgpt.go` | stdlib only | `go run microgpt.go` |
Key principle: Every port preserves the exact same structure, variable naming conventions, and algorithmic flow as the original Python. Reading any version is reading the same algorithm.
The algorithm is structured into six cleanly separated sections, identical across all languages:
`input.txt` → list of documents → character-level tokenizer

- Reads `input.txt` (a newline-separated list of names)
- Extracts unique characters as tokens (IDs `0..n-1`)
- Adds a special BOS (Beginning of Sequence) token as the final ID
- Shuffles the dataset with a fixed seed (`42`) for reproducibility
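A minimal sketch of this stage in Python (variable names are illustrative, not necessarily those used in `microgpt.py`):

```python
import random

docs = open("input.txt").read().splitlines()   # one name per line
chars = sorted(set("".join(docs)))             # unique characters
stoi = {ch: i for i, ch in enumerate(chars)}   # char -> token ID (0..n-1)
BOS = len(chars)                               # BOS token takes the final ID
vocab_size = len(chars) + 1

random.seed(42)                                # fixed seed for reproducibility
random.shuffle(docs)

def tokenize(doc):
    # wrap each document in BOS delimiters for training
    return [BOS] + [stoi[ch] for ch in doc] + [BOS]
```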
A scalar-valued autograd system that tracks computation graphs and computes gradients via reverse-mode differentiation (backpropagation).
Supported operations:
| Operation | Python | C++ | Rust | Swift | Dart | Kotlin | JavaScript | Go |
|---|---|---|---|---|---|---|---|---|
| Addition | `a + b` | `a + b` | `a.add(&b)` | `a + b` | `a + b` | `a + b` | `a.add(b)` | `Add(a,b)` |
| Multiplication | `a * b` | `a * b` | `a.mul(&b)` | `a * b` | `a * b` | `a * b` | `a.mul(b)` | `Mul(a,b)` |
| Power | `a ** n` | `vpow(a, n)` | `a.vpow(n)` | `a.vpow(n)` | `a.vpow(n)` | `a.vpow(n)` | `a.pow(n)` | `Pow(a,n)` |
| Log | `a.log()` | `vlog(a)` | `a.vlog()` | `a.vlog()` | `a.vlog()` | `a.vlog()` | `a.log()` | `Log(a)` |
| Exp | `a.exp()` | `vexp(a)` | `a.vexp()` | `a.vexp()` | `a.vexp()` | `a.vexp()` | `a.exp()` | `Exp(a)` |
| ReLU | `a.relu()` | `vrelu(a)` | `a.relu()` | `a.relu()` | `a.relu()` | `a.relu()` | `a.relu()` | `Relu(a)` |
| Negation | `-a` | `-a` | `a.neg()` | `-a` | `-a` | `-a` | `a.neg()` | `Neg(a)` |
| Subtraction | `a - b` | `a - b` | `a.sub(&b)` | `a - b` | `a - b` | `a - b` | `a.sub(b)` | `Sub(a,b)` |
| Division | `a / b` | `a / b` | `a.div(&b)` | `a / b` | `a / b` | `a / b` | `a.div(b)` | `Div(a,b)` |
Backward pass uses topological sorting (DFS) to propagate gradients through the computation graph — the chain rule, applied recursively.
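In code, the backward pass is only a few lines. A sketch, assuming each node stores its gradient, its children, and the matching local gradients (field names here are illustrative):

```python
def backward(root):
    # build a topological order of the graph via DFS
    topo, visited = [], set()
    def visit(v):
        if id(v) not in visited:
            visited.add(id(v))
            for child in v._children:
                visit(child)
            topo.append(v)
    visit(root)

    # chain rule, applied in reverse topological order
    root.grad = 1.0
    for v in reversed(topo):
        for child, local_grad in zip(v._children, v._local_grads):
            child.grad += local_grad * v.grad
```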
Weights are initialized as 2D matrices of Value nodes, drawn from a Gaussian distribution (mean=0, std=0.08):
```
state_dict:
  wte      → [vocab_size × n_embd]   Token embeddings
  wpe      → [block_size × n_embd]   Position embeddings
  lm_head  → [vocab_size × n_embd]   Output projection

Per layer:
  attn_wq  → [n_embd × n_embd]       Query projection
  attn_wk  → [n_embd × n_embd]       Key projection
  attn_wv  → [n_embd × n_embd]       Value projection
  attn_wo  → [n_embd × n_embd]       Output projection
  mlp_fc1  → [4·n_embd × n_embd]     MLP up-projection
  mlp_fc2  → [n_embd × 4·n_embd]     MLP down-projection
```
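A sketch of that initialization, assuming the `Value` class from the autograd section; the per-layer key naming is illustrative, not necessarily the one used in the source files:

```python
import random

random.seed(42)

def init_matrix(nout, nin, std=0.08):
    # each entry is an autograd Value drawn from N(0, std^2)
    return [[Value(random.gauss(0.0, std)) for _ in range(nin)]
            for _ in range(nout)]

state_dict = {
    "wte": init_matrix(vocab_size, n_embd),       # token embeddings
    "wpe": init_matrix(block_size, n_embd),       # position embeddings
    "lm_head": init_matrix(vocab_size, n_embd),   # output projection
}
for l in range(n_layer):                          # per-layer weights
    for name in ["attn_wq", "attn_wk", "attn_wv", "attn_wo"]:
        state_dict[f"{name}_{l}"] = init_matrix(n_embd, n_embd)
    state_dict[f"mlp_fc1_{l}"] = init_matrix(4 * n_embd, n_embd)
    state_dict[f"mlp_fc2_{l}"] = init_matrix(n_embd, 4 * n_embd)
```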
Default hyperparameters:
| Parameter | Value | Description |
|---|---|---|
| `n_embd` | 16 | Embedding dimension |
| `n_head` | 4 | Number of attention heads |
| `n_layer` | 1 | Number of transformer layers |
| `block_size` | 16 | Maximum sequence length |
| `head_dim` | 4 | Dimension per head (`n_embd / n_head`) |
Follows GPT-2 with minor simplifications:
```
┌──────────────────────────────────┐
│         Token + Position         │
│            Embedding             │
└──────────┬───────────────────────┘
           │
      ┌────▼────┐
      │ RMSNorm │
      └────┬────┘
           │
┌────────────────▼────────────────┐
│    Multi-Head Self-Attention    │
│  (Q, K, V projections + output) │
│      with causal KV-cache       │
└────────────────┬────────────────┘
                 │ + residual
            ┌────▼────┐
            │ RMSNorm │
            └────┬────┘
                 │
┌────────────────▼────────────────┐
│            MLP Block            │
│        FC1 → ReLU → FC2         │
└────────────────┬────────────────┘
                 │ + residual
            ┌────▼────┐
            │ lm_head │
            └────┬────┘
                 │
            [ logits ]
```
Differences from standard GPT-2:
- LayerNorm → RMSNorm (no learned scale/bias)
- GeLU → ReLU (simpler activation)
- No biases anywhere in the model
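RMSNorm is small enough to show inline. A sketch over a list of `Value` activations, assuming the class supports mixing with plain Python floats (as micrograd-style autograd classes do):

```python
def rmsnorm(x, eps=1e-5):
    # mean of squares across the embedding dimension
    ms = x[0] * x[0]
    for xi in x[1:]:
        ms = ms + xi * xi
    ms = ms * (1.0 / len(x))
    # scale each element by 1 / sqrt(ms + eps); no learned gain or bias
    scale = (ms + eps) ** -0.5
    return [xi * scale for xi in x]
```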
For each step (1..1000):
1. Pick a document, tokenize it with BOS delimiters
2. Forward each token through the model (autoregressive)
3. Compute cross-entropy loss against next-token targets
4. Backward pass (compute all gradients)
5. Adam optimizer update (sketched after the table below) with:
- Linear learning rate decay
- Bias-corrected first moment (β₁ = 0.85)
- Bias-corrected second moment (β₂ = 0.99)
Optimizer hyperparameters:
| Parameter | Value |
|---|---|
| Learning rate | 0.01 (linearly decayed to 0) |
| β₁ | 0.85 |
| β₂ | 0.99 |
| ε | 1e-8 |
| Steps | 1000 |
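A sketch of one update (step 5 above), assuming `params` is a flat list of `Value` parameters; the buffer layout and names are illustrative:

```python
import math

beta1, beta2, eps_adam = 0.85, 0.99, 1e-8
base_lr, num_steps = 0.01, 1000
m = [0.0] * len(params)   # first-moment buffers
v = [0.0] * len(params)   # second-moment buffers

def adam_step(step):  # step is 1-based
    lr = base_lr * (1 - step / num_steps)        # linear decay to 0
    for i, p in enumerate(params):
        m[i] = beta1 * m[i] + (1 - beta1) * p.grad
        v[i] = beta2 * v[i] + (1 - beta2) * p.grad ** 2
        m_hat = m[i] / (1 - beta1 ** step)       # bias-corrected moments
        v_hat = v[i] / (1 - beta2 ** step)
        p.data -= lr * m_hat / (math.sqrt(v_hat) + eps_adam)
        p.grad = 0.0                             # zero grads for next step
```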
After training, the model generates 20 new samples:
For each sample:
1. Start with BOS token
2. Feed through model to get next-token probabilities
3. Apply temperature scaling (0.5 — slightly conservative)
4. Sample from the probability distribution
5. Repeat until BOS is generated or block_size is reached
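A sketch of that loop, with `forward`, `BOS`, and `itos` standing in for the real model plumbing (names are illustrative, and `forward` is assumed to return plain-float logits for the next token):

```python
import math, random

def sample_name(temperature=0.5):
    tokens = [BOS]
    while len(tokens) < block_size:
        logits = forward(tokens)                     # next-token logits
        scaled = [l / temperature for l in logits]   # sharpen the distribution
        mx = max(scaled)
        exps = [math.exp(l - mx) for l in scaled]    # numerically stable softmax
        total = sum(exps)
        probs = [e / total for e in exps]
        tok = random.choices(range(len(probs)), weights=probs)[0]
        if tok == BOS:                               # model ended the name
            break
        tokens.append(tok)
    return "".join(itos[t] for t in tokens[1:])
```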
C++ implementation notes:

- Value ownership: `std::shared_ptr<Value>` enables shared graph node references
- Operator overloads: Global `operator+`, `operator*`, etc., for natural mathematical syntax
- RNG: `std::mt19937` (Mersenne Twister) + `std::normal_distribution` + `std::discrete_distribution`
- Requires: C++17 (`-std=c++17`) for structured bindings
Rust implementation notes:

- Value ownership: `Rc<RefCell<ValueInner>>` — Rust's interior mutability pattern for shared mutable nodes
- No external crates: Everything is `std`-only, including a hand-rolled xoshiro256** PRNG (sketched below)
- Gaussian sampling: Box-Muller transform (no external dependency needed)
- Weighted sampling: Hand-rolled cumulative distribution function
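The hand-rolled PRNG is shared in spirit by all the non-Python ports. For reference, a Python sketch of xoshiro256** with manual 64-bit masking (the ports use native integer types instead); the Box-Muller step is sketched after the Swift notes below:

```python
MASK = (1 << 64) - 1  # emulate 64-bit wraparound

def rotl(x, k):
    return ((x << k) | (x >> (64 - k))) & MASK

class Xoshiro256SS:
    def __init__(self, s0, s1, s2, s3):
        self.s = [s0, s1, s2, s3]  # 256 bits of state, seeded e.g. via SplitMix64

    def next_u64(self):
        s = self.s
        result = (rotl((s[1] * 5) & MASK, 7) * 9) & MASK
        t = (s[1] << 17) & MASK
        s[2] ^= s[0]
        s[3] ^= s[1]
        s[1] ^= s[2]
        s[0] ^= s[3]
        s[2] ^= t
        s[3] = rotl(s[3], 45)
        return result
```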
Swift implementation notes:

- Value as class: Reference semantics (like Python) via Swift's `class` keyword
- Operator overloads: Global `+`, `*`, `-`, `/` operators matching Python's syntax
- Identity tracking: `ObjectIdentifier` for the visited set in topological sort
- RNG: Hand-rolled xoshiro256** + Box-Muller for Gaussian sampling
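And the Box-Muller transform the ports pair with it to draw Gaussians, sketched here with Python's `random.random()` for brevity:

```python
import math, random

def gauss(mean=0.0, std=1.0):
    # two uniforms in (0, 1] -> one standard normal
    u1 = 1.0 - random.random()  # avoid log(0)
    u2 = random.random()
    z = math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)
    return mean + std * z
```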
Dart implementation notes:

- Value class: Dart classes are reference types by default (a perfect fit)
- Operator overloads: `operator +`, `operator *`, `operator -`, `operator /`
- Named methods: `vpow()`, `vlog()`, `vexp()`, `relu()` for operations Dart can't overload
- RNG: Hand-rolled xoshiro256** + Box-Muller for Gaussian sampling
- String handling: Character-by-character via index access (`string[i]`)
Kotlin implementation notes:

- Value class: Kotlin classes are reference types; direct operator overloads (`plus`, `times`, `minus`, `div`, `unaryMinus`)
- Data class for weights: `data class LayerWeights` for clean per-layer storage
- RNG: Hand-rolled xoshiro256** with `Long` (64-bit signed) + Box-Muller
- Signed 64-bit workaround: SplitMix64 seed constants expressed as signed `Long` literals
JavaScript implementation notes:

- Value class: ES6 class with a method-based API (`.add()`, `.mul()`, `.sub()`, `.div()`)
- No operator overloads: JS doesn't support them, so all operations are method calls
- RNG: Hand-rolled xoshiro256** using `BigInt` for 64-bit precision + Box-Muller
- Typed arrays: `Float64Array` for Adam optimizer buffers (memory-efficient)
- IIFE pattern: Entire program wrapped in an immediately invoked function expression
Go implementation notes:

- Value struct: Pointer-based (`*Value`) for shared graph references
- Free functions: Go has no operator overloading — all ops are top-level functions (`Add`, `Mul`, `Pow`, `Log`, `Exp`, `Relu`, etc.)
- RNG: Hand-rolled xoshiro256** with `uint64` + Box-Muller for Gaussian sampling
- Shuffle: Uses `Rng.Shuffle` with a swap callback (matches the `sort.Slice` idiom)
- Idiomatic Go: Exported fields (`Data`, `Grad`), unexported internals (`children`, `localGrads`)
```
microgpt/
├── input.txt          # Training dataset (32K+ names)
├── microgpt.py        # Original Python implementation
├── microgpt.cpp       # C++17 port
├── microgpt.rs        # Rust port
├── microgpt.swift     # Swift port
├── microgpt.dart      # Dart port
├── microgpt.kt        # Kotlin port
├── microgpt.js        # JavaScript (Node.js) port
├── microgpt.go        # Go port
├── benchmark.sh       # All-languages benchmark script
└── README.md          # This file
```
Requirements:

- An `input.txt` file in the working directory (the Python version auto-downloads it; other versions expect it to exist)
To auto-download `input.txt`:

```bash
curl -o input.txt https://raw.githubusercontent.com/karpathy/makemore/refs/heads/master/names.txt
```

```bash
# Python (original)
python microgpt.py

# C++
g++ -std=c++17 -O2 -o microgpt_cpp microgpt.cpp && ./microgpt_cpp

# Rust
rustc -O microgpt.rs -o microgpt_rs && ./microgpt_rs

# Swift
swiftc -O microgpt.swift -o microgpt_swift && ./microgpt_swift

# Dart
dart compile exe microgpt.dart -o microgpt && ./microgpt

# Kotlin
kotlinc microgpt.kt -include-runtime -d microgpt.jar && java -jar microgpt.jar

# JavaScript (Node.js)
node microgpt.js

# Go
go run microgpt.go
```

Sample output:

```
num docs: 32033
vocab size: 27
num params: 7185
step 1 / 1000 | loss 3.2958
step 2 / 1000 | loss 3.2730
...
step 1000 / 1000 | loss 2.1042
--- inference (new, hallucinated names) ---
sample 1: Maliya
sample 2: Kori
sample 3: Aden
...
```
Note: Exact output values may vary slightly between languages due to differences in floating-point arithmetic and RNG implementations, but the overall behavior and convergence should be consistent.
A benchmark script is included that compiles and runs all implementations with maximum optimizations, then prints a ranked comparison table.
```bash
bash benchmark.sh
```

This compiles each language with its best optimization flags (`-O3` for C++, `-O` for Rust/Swift, AOT for Dart, etc.), runs all 1000 training steps, and produces a results table like:

```
╔══════════════════════════════════════════════════════════════════════╗
║                          BENCHMARK RESULTS                           ║
╠══════════════════════════════════════════════════════════════════════╣
║ Rank │ Language              │ Compile (s) │ Run (s)   │ vs Fastest  ║
║ ─────┼───────────────────────┼─────────────┼───────────┼─────────────║
║ #1   │ Rust (rustc -O)       │ 3s          │ 6s        │ 1.00x       ║
║ #2   │ C++ (g++ -O3)         │ 1s          │ 7s        │ 1.17x       ║
╚══════════════════════════════════════════════════════════════════════╝
```
Note: The above is a sample — actual timings depend on your hardware. Missing toolchains are gracefully skipped.
| Language | External Deps | Stdlib Imports |
|---|---|---|
| Python | 0 | os, math, random |
| C++ | 0 | <iostream>, <fstream>, <vector>, <cmath>, <random>, <algorithm>, <set>, <map>, <string>, <sstream>, <memory>, <functional>, <numeric> |
| Rust | 0 | std::cell, std::collections, std::fs, std::rc |
| Swift | 0 | Foundation |
| Dart | 0 | dart:io, dart:math |
| Kotlin | 0 | java.io.File, kotlin.math.* |
| JavaScript | 0 | fs (Node.js built-in) |
| Go | 0 | fmt, math, os, bufio, sort, strings |
Every implementation achieves zero external dependencies — only standard library / language built-ins are used.
This project demonstrates that:
- A GPT can be implemented in ~200-300 lines in any modern language
- Autograd doesn't require a framework — it's just the chain rule + a topological sort
- The algorithm is language-agnostic — the same structure maps naturally to C++, Rust, Swift, Dart, Kotlin, JavaScript, and Go
- Dependencies are optional — even RNG and Gaussian sampling can be done from scratch
It's a powerful educational tool for understanding transformers at the most fundamental level.
Massive shoutout to Andrej Karpathy for the original Python implementation. His work on making deep learning accessible — from micrograd to nanoGPT to this beautifully minimal `microgpt.py` — has been an incredible gift to the ML community. This multi-language port is a tribute to the elegance of his original algorithm. 🎉
"This file is the complete algorithm. Everything else is just efficiency."
— Andrej Karpathy