Build and train your own GPT-style transformer from scratch in PyTorch
- 🎯 What This Is
- ✨ Key Features
- 🏗️ How It Works
- 🚀 Quick Start
- 💻 Usage Examples
- 📊 Results
- 🛠️ Project Structure
- 🤝 Contributing
A complete GPT-style language model built entirely from scratch using PyTorch. No black boxes - every component is implemented and explained.
Perfect for:
- 📚 Learning how transformers actually work
- 🔬 Experimenting with language model architectures
- 🚀 Building your own AI text generation projects
| Feature | Description | Status |
|---|---|---|
| 🧠 Custom Transformer | Multi-head attention, feed-forward networks, layer norm | ✅ Complete |
| 🔤 Smart Tokenization | GPT-2 BPE tokenizer via tiktoken | ✅ Complete |
| ⚡ Fast Training | Mixed precision, gradient accumulation, CUDA support | ✅ Complete |
| 🎨 Text Generation | Temperature sampling, top-k filtering | ✅ Complete |
| ☁️ Colab Ready | One-click deployment in Google Colab | ✅ Complete |
```mermaid
graph LR
    A[📝 Text Input] --> B[🔤 Tokenizer]
    B --> C[🧮 Embeddings]
    C --> D[🔄 Transformer Blocks]
    D --> E[🎯 Output Layer]
    E --> F[📄 Generated Text]

    subgraph "🔄 Transformer Block"
        G[🎯 Self-Attention] --> H[➕ Add & Norm]
        H --> I[🧠 Feed Forward]
        I --> J[➕ Add & Norm]
    end

    style A fill:#e3f2fd
    style F fill:#e8f5e8
    style G fill:#fff3e0
    style I fill:#f3e5f5
```
The Process:
- Tokenize text using GPT-2's tokenizer
- Embed tokens and add position information
- Transform through multiple attention layers
- Generate next token predictions
- Sample from predictions to create new text
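The pipeline above can be sketched end-to-end in a few lines. This is a minimal, self-contained illustration: a character-level vocabulary stands in for the GPT-2 BPE tokenizer so it runs without tiktoken, and a stock PyTorch encoder layer stands in for the notebook's hand-written transformer block.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A char-level vocab stands in for GPT-2 BPE to keep the sketch self-contained
text = "hello world"
vocab = sorted(set(text))                                  # 8 unique characters
stoi = {ch: i for i, ch in enumerate(vocab)}

# 1. Tokenize: text -> token ids
ids = torch.tensor([[stoi[ch] for ch in text]])            # (1, 11)

# 2. Embed tokens and add position information
d_model = 32
tok_emb = nn.Embedding(len(vocab), d_model)
pos_emb = nn.Embedding(len(text), d_model)
x = tok_emb(ids) + pos_emb(torch.arange(ids.size(1)))      # (1, 11, 32)

# 3. Transform through attention layers (one stock block as a stand-in)
block = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
h = block(x)

# 4. Project hidden states to vocabulary logits (next-token predictions)
head = nn.Linear(d_model, len(vocab))
logits = head(h)                                           # (1, 11, 8)

# 5. Sample the next token from the last position's distribution
probs = torch.softmax(logits[0, -1], dim=-1)
next_id = torch.multinomial(probs, num_samples=1).item()
print(vocab[next_id])
```

The real model swaps in the 50,257-token GPT-2 vocabulary and stacks several of these blocks, but the data flow is exactly this.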
Click the badge above → Run all cells → Start generating text!
```bash
# Clone the repo
git clone https://github.com/Yashjain0099/Small_Language_Model_from_Scratch.git
cd Small_Language_Model_from_Scratch

# Install dependencies
pip install torch tiktoken numpy matplotlib tqdm

# Run the notebook
jupyter notebook SLM_\(scratch\).ipynb
```

```python
# Load your trained model
model = SmallLanguageModel()
model.load_state_dict(torch.load('model.pth'))

# Generate text
prompt = "A little girl went to the woods"
output = model.generate(prompt, max_length=50, temperature=0.8)
print(output)
```

```python
# Creative mode (higher temperature)
creative = model.generate("Once upon a time", temperature=1.2, top_k=50)

# Focused mode (lower temperature)
focused = model.generate("The capital of France is", temperature=0.3, top_k=10)
```

```python
# Quick training setup
trainer = LanguageModelTrainer(model)
trainer.train(
    data_path="your_text_data.txt",
    batch_size=16,
    learning_rate=3e-4,
    epochs=5
)
```

Input: "A little girl went to the woods"

Output:

```
A little girl went to the woods and he was looking at the animals and he saw
a little boy with a big smile on its face. He knew she would never bring
medicine before. One day, the girl called Jeff went for a walk...
```
Model learns to predict text better over time
- Model Size: ~25M parameters
- Training Time: ~2 hours on GPU
- Generation Speed: 50+ tokens/second
- Memory Usage: <4GB GPU memory
```
📦 Small-Language-Model
├── 📓 SLM_(scratch).ipynb      # Main notebook with everything
├── 📄 README.md                # This file
└── 📁 images/                  # Screenshots and plots
    ├── training_progress.png
    ├── model_architecture.png
    └── generation_examples.png
```
| Component | What It Does | Lines of Code |
|---|---|---|
| Tokenizer | Converts text → numbers | ~50 lines |
| Model | Transformer architecture | ~200 lines |
| Training | Loss calculation & optimization | ~100 lines |
| Generation | Text sampling & generation | ~80 lines |
- ✅ How self-attention actually works
- ✅ Why transformers are so powerful
- ✅ How language models generate text
- ✅ Modern training techniques (mixed precision, scheduling)
- ✅ Building neural networks from scratch
- ✅ Implementing attention mechanisms
- ✅ Training large models efficiently
- ✅ Text generation and sampling methods
```python
# Default model settings
MODEL_CONFIG = {
    'vocab_size': 50257,   # GPT-2 vocabulary
    'd_model': 512,        # Hidden dimension
    'n_heads': 8,          # Attention heads
    'n_layers': 6,         # Transformer layers
    'max_seq_len': 1024,   # Maximum sequence length
}
```

Want bigger models? Just increase the parameters:
- 🌱 Tiny: 256 dim, 4 heads, 4 layers (~6M params)
- 🖥️ Small: 512 dim, 8 heads, 6 layers (~25M params)
- 🚀 Medium: 768 dim, 12 heads, 12 layers (~85M params)
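You can sanity-check these sizes with a back-of-envelope count. The formula below is an approximation (it assumes the standard 4x feed-forward expansion and ignores biases and layer norms), and note that whether you count the embedding tables, which dominate at a 50K vocabulary, makes a big difference to the headline number:

```python
def approx_params(vocab_size, d_model, n_layers, max_seq_len):
    """Rough transformer parameter count: embedding tables + per-block weights."""
    emb = vocab_size * d_model + max_seq_len * d_model  # token + position embeddings
    attn = 4 * d_model * d_model                        # Q, K, V, and output projections
    ffn = 2 * d_model * (4 * d_model)                   # FFN up- and down-projection (4x)
    return emb + n_layers * (attn + ffn)

for name, d_model, n_layers in [("Tiny", 256, 4), ("Small", 512, 6), ("Medium", 768, 12)]:
    total = approx_params(50257, d_model, n_layers, 1024)
    blocks_only = total - (50257 + 1024) * d_model
    print(f"{name}: ~{total/1e6:.0f}M total, ~{blocks_only/1e6:.0f}M in the transformer blocks")
```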
- Mixed Precision: 2x faster training
- Gradient Accumulation: Larger effective batch sizes
- Learning Rate Scheduling: Warmup + cosine decay
- Checkpointing: Resume training anytime
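All four optimizations fit a common PyTorch pattern, sketched here on a tiny stand-in model (the names and hyperparameters are illustrative; the notebook applies the same structure to the transformer):

```python
import math
import torch

# Stand-in model; in the notebook this is the SmallLanguageModel
device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(64, 64).to(device)
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))  # no-op on CPU
accum_steps, warmup, total_steps = 4, 10, 100

def lr_lambda(step):
    if step < warmup:                                   # linear warmup...
        return (step + 1) / warmup
    t = (step - warmup) / max(1, total_steps - warmup)  # ...then cosine decay
    return 0.5 * (1.0 + math.cos(math.pi * t))

sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)

for step in range(total_steps):
    x = torch.randn(8, 64, device=device)
    with torch.autocast(device_type=device, enabled=(device == "cuda")):  # mixed precision
        loss = model(x).pow(2).mean() / accum_steps     # scale loss for accumulation
    scaler.scale(loss).backward()
    if (step + 1) % accum_steps == 0:                   # 4x effective batch size
        scaler.step(opt)
        scaler.update()
        opt.zero_grad(set_to_none=True)
        sched.step()

# Checkpoint model + optimizer state so training can resume anytime
torch.save({"model": model.state_dict(), "opt": opt.state_dict()}, "checkpoint.pt")
```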
- Temperature: Control randomness (0.1 = boring, 1.5 = wild)
- Top-K: Only sample from K most likely tokens
- Max Length: Control output length
- Repetition Penalty: Avoid repetitive text
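Here is how the first two knobs shape the next-token distribution, using made-up logits (a sketch, not the notebook's exact `generate` implementation; the repetition penalty shown at the end is a simplified version):

```python
import torch

# Toy logits over a 4-token vocabulary
logits = torch.tensor([4.0, 2.0, 1.0, -1.0])

def sample_next(logits, temperature=1.0, top_k=None):
    logits = logits / temperature                   # <1.0 sharpens, >1.0 flattens
    if top_k is not None:
        kth = torch.topk(logits, top_k).values[-1]  # keep only the k most likely tokens
        logits = logits.masked_fill(logits < kth, float("-inf"))
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()

focused = sample_next(logits, temperature=0.3, top_k=2)  # nearly always the top token
wild = sample_next(logits, temperature=1.5)              # any token is plausible

# Simplified repetition penalty: damp logits of already-generated tokens
generated = [0]
penalized = logits.clone()
penalized[generated] /= 1.3
```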
Found a bug? Want to add features? Contributions welcome!
- Fork the repo
- Create a feature branch
- Make your changes
- Submit a pull request
Ideas for contributions:
- 🎯 Different attention mechanisms
- 📊 Better evaluation metrics
- 🎨 New generation techniques
- 📚 More example datasets
- 🐛 Bug fixes and improvements
- The Illustrated Transformer - Visual explanation
- Attention Is All You Need - Original paper
- Let's Build GPT - Andrej Karpathy's tutorial
Coming soon: Full video explanation of the code!
Yash Jain
- 🌐 GitHub: @Yashjain0099
- 📧 Questions? Open an issue!
If this helped you understand language models:
- ⭐ Star the repository
- 🍴 Fork for your experiments
- 📢 Share with friends
- 🐛 Report bugs you find

Built with ❤️ for learning and understanding AI

Ready to dive in? Click the Colab badge and start experimenting! 🚀