
# 🤖 Small Language Model from Scratch

**Build and train your own GPT-style transformer from scratch in PyTorch**

Open In Colab · Python · PyTorch


## 🎯 What This Is

A complete GPT-style language model built entirely from scratch using PyTorch. No black boxes — every component is implemented and explained.

Perfect for:

- 🎓 Learning how transformers actually work
- 🔬 Experimenting with language model architectures
- 🚀 Building your own AI text-generation projects

## ✨ Key Features

| Feature | Description | Status |
|---|---|---|
| 🧠 Custom Transformer | Multi-head attention, feed-forward networks, layer norm | ✅ Complete |
| 🔤 Smart Tokenization | GPT-2 BPE tokenizer via `tiktoken` | ✅ Complete |
| ⚡ Fast Training | Mixed precision, gradient accumulation, CUDA support | ✅ Complete |
| 🎨 Text Generation | Temperature sampling, top-k filtering | ✅ Complete |
| ☁️ Colab Ready | One-click deployment in Google Colab | ✅ Complete |

๐Ÿ—๏ธ How It Works

graph LR
    A[๐Ÿ“ Text Input] --> B[๐Ÿ”ค Tokenizer]
    B --> C[๐Ÿงฎ Embeddings]
    C --> D[๐Ÿ”„ Transformer Blocks]
    D --> E[๐ŸŽฏ Output Layer]
    E --> F[๐Ÿ“– Generated Text]
    
    subgraph "๐Ÿ”„ Transformer Block"
        G[๐ŸŽฏ Self-Attention] --> H[โž• Add & Norm]
        H --> I[๐Ÿง  Feed Forward]
        I --> J[โž• Add & Norm]
    end
    
    style A fill:#e3f2fd
    style F fill:#e8f5e8
    style G fill:#fff3e0
    style I fill:#f3e5f5
Loading
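The "Transformer Block" subgraph above can be sketched in a few lines of PyTorch. This is a minimal post-norm block matching the diagram's Add & Norm ordering; it uses the built-in `nn.MultiheadAttention` as a stand-in for the notebook's hand-written attention, and the class name `TransformerBlock` is illustrative, not necessarily the notebook's.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Self-attention -> add & norm -> feed-forward -> add & norm."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),  # standard 4x expansion
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):  # x: (batch, seq, d_model)
        T = x.size(1)
        # Boolean causal mask: True marks positions that may NOT be attended to
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        a, _ = self.attn(x, x, x, attn_mask=causal)
        x = self.ln1(x + a)            # add & norm
        x = self.ln2(x + self.ffn(x))  # add & norm
        return x

block = TransformerBlock(d_model=64, n_heads=4)
out = block(torch.randn(2, 10, 64))    # shape is preserved: (2, 10, 64)
```

Because each block maps `(batch, seq, d_model)` to the same shape, you can stack as many as `n_layers` calls for.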

**The Process:**

1. **Tokenize** text using GPT-2's tokenizer
2. **Embed** tokens and add position information
3. **Transform** through multiple attention layers
4. **Generate** next-token predictions
5. **Sample** from the predictions to create new text
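The cycle behind steps 4–5 can be sketched with a toy model. Everything here — the five-word vocabulary, the `TinyLM` class — is illustrative (the real notebook uses the 50257-token GPT-2 vocabulary and full transformer blocks), but the autoregressive loop is the same: feed the ids in, take the logits at the last position, pick a token, append, repeat.

```python
import torch
import torch.nn as nn

# Toy five-word vocabulary standing in for the GPT-2 BPE tokenizer
vocab = {"<bos>": 0, "a": 1, "little": 2, "girl": 3, "went": 4}
words = {i: w for w, i in vocab.items()}

class TinyLM(nn.Module):
    """Embeddings -> output layer; the real model puts transformer blocks between."""
    def __init__(self, vocab_size, d_model=8):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d_model)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, ids):               # ids: (T,) -> logits: (T, vocab)
        return self.head(self.emb(ids))

torch.manual_seed(0)
model = TinyLM(len(vocab))
ids = [vocab["<bos>"]]
for _ in range(4):                        # the autoregressive loop
    logits = model(torch.tensor(ids))[-1]  # step 4: next-token logits
    ids.append(int(logits.argmax()))       # step 5 (greedy instead of sampling)
print(" ".join(words[i] for i in ids))     # untrained, so the words are random
```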

## 🚀 Quick Start

### Option 1: Google Colab (Recommended) ☁️

Click the badge above → run all cells → start generating text!

### Option 2: Local Setup 💻

```bash
# Clone the repo
git clone https://github.com/Yashjain0099/Small_Language_Model_from_Scratch.git
cd Small_Language_Model_from_Scratch

# Install dependencies
pip install torch tiktoken numpy matplotlib tqdm

# Run the notebook
jupyter notebook SLM_\(scratch\).ipynb
```

## 💻 Usage Examples

### 🎯 Basic Text Generation

```python
# Load your trained model
model = SmallLanguageModel()
model.load_state_dict(torch.load('model.pth'))

# Generate text
prompt = "A little girl went to the woods"
output = model.generate(prompt, max_length=50, temperature=0.8)
print(output)
```

### 🎨 Creative vs. Focused Generation

```python
# Creative mode (higher temperature)
creative = model.generate("Once upon a time", temperature=1.2, top_k=50)

# Focused mode (lower temperature)
focused = model.generate("The capital of France is", temperature=0.3, top_k=10)
```

### 🔧 Training Your Own Model

```python
# Quick training setup
trainer = LanguageModelTrainer(model)
trainer.train(
    data_path="your_text_data.txt",
    batch_size=16,
    learning_rate=3e-4,
    epochs=5
)
```

## 📊 Results

### 📝 Sample Generation

**Input:** "A little girl went to the woods"

**Output:**

> A little girl went to the woods and he was looking at the animals and he saw
> a little boy with a big smile on its face. He knew she would never bring
> medicine before. One day, the girl called Jeff went for a walk...

### 📈 Training Progress

*Training loss (see `images/training_progress.png`): the model learns to predict text better over time.*

### ⚡ Performance Stats

- **Model size:** ~25M parameters
- **Training time:** ~2 hours on GPU
- **Generation speed:** 50+ tokens/second
- **Memory usage:** <4 GB GPU memory

๐Ÿ› ๏ธ Project Structure

๐Ÿ“ฆ Small-Language-Model
โ”œโ”€โ”€ ๐Ÿ“œ SLM_(scratch).ipynb    # Main notebook with everything
โ”œโ”€โ”€ ๐Ÿ“„ README.md              # This file
โ””โ”€โ”€ ๐Ÿ“ images/                # Screenshots and plots
    โ”œโ”€โ”€ training_progress.png
    โ”œโ”€โ”€ model_architecture.png
    โ””โ”€โ”€ generation_examples.png

## 🧩 Code Components

| Component | What It Does | Lines of Code |
|---|---|---|
| Tokenizer | Converts text ↔ numbers | ~50 |
| Model | Transformer architecture | ~200 |
| Training | Loss calculation & optimization | ~100 |
| Generation | Text sampling & generation | ~80 |

## 🎓 What You'll Learn

### 🔍 Core Concepts

- ✅ How self-attention actually works
- ✅ Why transformers are so powerful
- ✅ How language models generate text
- ✅ Modern training techniques (mixed precision, LR scheduling)

### 🧠 Technical Skills

- ✅ Building neural networks from scratch
- ✅ Implementing attention mechanisms
- ✅ Training models efficiently
- ✅ Text generation and sampling methods

## 🎯 Model Configuration

```python
# Default model settings
MODEL_CONFIG = {
    'vocab_size': 50257,      # GPT-2 vocabulary
    'd_model': 512,           # Hidden dimension
    'n_heads': 8,             # Attention heads
    'n_layers': 6,            # Transformer layers
    'max_seq_len': 1024,      # Maximum sequence length
}
```

Want bigger models? Just increase the parameters:

- 📱 **Tiny:** 256 dim, 4 heads, 4 layers (~6M params)
- 🖥️ **Small:** 512 dim, 8 heads, 6 layers (~25M params)
- 🚀 **Medium:** 768 dim, 12 heads, 12 layers (~85M params)
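As a back-of-the-envelope check on those tiers, you can tally the dominant weight matrices. This sketch (the helper `approx_params` is illustrative) ignores biases, LayerNorm parameters, and weight tying, so exact totals depend on the notebook's implementation and on which terms you count:

```python
def approx_params(d_model, n_layers, vocab_size=50257, max_seq_len=1024):
    """Rough tally of the dominant weight matrices (illustrative only)."""
    emb = vocab_size * d_model + max_seq_len * d_model  # token + position tables
    attn = 4 * d_model * d_model        # Q, K, V, output projections per layer
    ffn = 2 * d_model * (4 * d_model)   # two linear layers with 4x expansion
    return emb + n_layers * (attn + ffn)

# Note: the head count splits attention across heads but doesn't change the total
for name, (d, l) in {"Tiny": (256, 4), "Small": (512, 6), "Medium": (768, 12)}.items():
    print(f"{name}: ~{approx_params(d, l) / 1e6:.1f}M weights")
```

Tallies like this are sensitive to what you include — at GPT-2's 50257-token vocabulary the embedding table alone dominates — so treat the tier figures above as rough.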

## 🔧 Advanced Features

### ⚡ Training Optimizations

- **Mixed precision:** up to 2x faster training
- **Gradient accumulation:** larger effective batch sizes
- **Learning-rate scheduling:** warmup + cosine decay
- **Checkpointing:** resume training anytime
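Mixed precision and gradient accumulation compose in one training loop. A minimal sketch, with stand-in data and an `nn.Linear` in place of the real model; the scaler is disabled on CPU so the same code runs anywhere:

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 4)                      # stand-in for the transformer
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
accum_steps = 4                               # micro-batches per optimizer step
use_cuda = torch.cuda.is_available()
dev = "cuda" if use_cuda else "cpu"
scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)  # no-op on CPU

opt.zero_grad()
for step in range(8):
    x, y = torch.randn(2, 16), torch.randn(2, 4)      # stand-in batch
    # Mixed precision: run the forward pass in a lower-precision dtype
    with torch.autocast(device_type=dev,
                        dtype=torch.float16 if use_cuda else torch.bfloat16):
        loss = nn.functional.mse_loss(model(x), y) / accum_steps
    scaler.scale(loss).backward()   # gradients accumulate across micro-batches
    if (step + 1) % accum_steps == 0:
        scaler.step(opt)            # one weight update per accum_steps batches
        scaler.update()
        opt.zero_grad()
```

Dividing the loss by `accum_steps` keeps the accumulated gradient equal to the average over one large batch, which is what makes the effective batch size `batch_size * accum_steps`.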

### 🎨 Generation Options

- **Temperature:** control randomness (0.1 = boring, 1.5 = wild)
- **Top-K:** only sample from the K most likely tokens
- **Max length:** control output length
- **Repetition penalty:** avoid repetitive text
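Temperature and top-k combine naturally in a single sampling step: scale the logits, keep only the K best, softmax over those, then draw. A minimal pure-Python sketch (the function name and the example logits are illustrative):

```python
import math
import random

def sample_next(logits, temperature=0.8, top_k=3):
    """Pick one token id: temperature-scale the logits, keep the top_k
    highest, softmax over those, then sample proportionally."""
    scaled = [l / temperature for l in logits]
    top = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    m = max(scaled[i] for i in top)                   # for numerical stability
    weights = [math.exp(scaled[i] - m) for i in top]  # unnormalized softmax
    return random.choices(top, weights=weights)[0]

random.seed(0)
logits = [2.0, 0.5, -1.0, 3.1, 0.0]
print(sample_next(logits))  # always one of ids 3, 0, 1 — the three most likely
```

Lower temperature sharpens the distribution toward the argmax; `top_k=1` makes it fully greedy.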

๐Ÿค Contributing

Found a bug? Want to add features? Contributions welcome!

  1. Fork the repo
  2. Create a feature branch
  3. Make your changes
  4. Submit a pull request

Ideas for contributions:

  • ๐ŸŽฏ Different attention mechanisms
  • ๐Ÿ“Š Better evaluation metrics
  • ๐ŸŽจ New generation techniques
  • ๐Ÿ“š More example datasets
  • ๐Ÿ› Bug fixes and improvements

## 📚 Learn More

### 📖 Helpful Resources

### 🎬 Video Walkthrough

**Coming soon:** a full video explanation of the code!


## 📞 Contact

**Yash Jain**

- 🐙 GitHub: [@Yashjain0099](https://github.com/Yashjain0099)
- 📧 Questions? Open an issue!

โญ Show Your Support

If this helped you understand language models:

  • โญ Star the repository
  • ๐Ÿด Fork for your experiments
  • ๐Ÿ“ข Share with friends
  • ๐Ÿ› Report bugs you find

Built with โค๏ธ for learning and understanding AI

Ready to dive in? Click the Colab badge and start experimenting! ๐Ÿš€
