Semantic Search Engine from Scratch

Week 2 Learning Track: Building Intuition and Hands-On Skills

A complete, educational implementation of a semantic search system. Learn how embeddings, similarity metrics, chunking strategies, and vector databases work together to power modern retrieval systems.

🎯 Quick Start (< 10 minutes)

1. Prerequisites

  • Python 3.8+
  • Ollama installed and running
  • ~2GB disk space for embeddings index

2. Setup

# Clone and navigate
cd /path/to/RAG_101

# Create virtual environment
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Copy example config
cp .env.example .env

3. Start Ollama

In a separate terminal:

# Start Ollama server
ollama serve

# In another terminal, pull the embedding model
ollama pull nomic-embed-text

4. Run the App

streamlit run app.py

This opens a web app at http://localhost:8501

5. Try It Out

  1. Click "Index Documents" tab
  2. Click "📂 Index Documents" button (uses sample documents)
  3. Click "🔎 Search" tab
  4. Try queries like:
    • "What are embeddings?"
    • "How does machine learning work?"
    • "What is a vector database?"

Done! You've built a semantic search system.


📚 Understanding the System

What You're Building

A complete semantic search pipeline:

Documents → Chunks → Embeddings → Vector DB → Search Results
  1. Documents: PDF, TXT, MD files
  2. Chunks: Break into semantic units
  3. Embeddings: Convert to vectors via Ollama
  4. Vector DB: Index in ChromaDB
  5. Search: Find most similar chunks

Key Concepts

🧮 Embeddings

  • Convert text → 768-dimensional vectors
  • Similar text = similar vectors
  • Enabled by Ollama + nomic-embed-text

Why embeddings?

  • Enable semantic similarity calculations
  • Capture meaning, not just keywords
  • Much cheaper than API calls (free!)
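
Under the hood, src/embeddings.py talks to Ollama's embedding endpoint. As a rough sketch of what that call looks like (using requests directly rather than the repo's actual wrapper):

import requests

def embed(text, model="nomic-embed-text", base_url="http://localhost:11434"):
    # Ollama exposes embeddings at POST /api/embeddings
    resp = requests.post(
        f"{base_url}/api/embeddings",
        json={"model": model, "prompt": text},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["embedding"]  # 768 floats for nomic-embed-text

vector = embed("Embeddings turn text into vectors.")
print(len(vector))  # 768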

🔍 Similarity Search

  • Measure distance between vectors
  • Cosine similarity: angle between vectors
  • Returns top-k most similar results

Why cosine similarity?

  • Works regardless of vector magnitude
  • Scale-invariant
  • Industry standard for NLP
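
Cosine similarity is just the normalized dot product. A minimal NumPy version (illustrative; the repo's src/similarity.py may differ in detail):

import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    # Dot product divided by the product of the vector lengths
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity([1, 2, 3], [2, 4, 6]))  # 1.0 (same direction)
print(cosine_similarity([1, 0], [0, 1]))        # 0.0 (orthogonal)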

✂️ Chunking

  • Break documents into manageable pieces
  • Trade-off: smaller chunks vs. context

Chunk size impact:

  • Too small: loses context
  • Too large: less relevant results
  • Overlap: improves recall but increases storage
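
Fixed-size chunking with overlap is only a few lines. A sketch of the idea (src/chunking.py implements the real strategy):

def chunk_text(text, chunk_size=500, overlap=100):
    # Slide a window of chunk_size chars, stepping by chunk_size - overlap
    assert 0 <= overlap < chunk_size
    chunks, step = [], chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks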

🗄️ Vector Database

  • Efficient storage of embeddings
  • Fast similarity search via HNSW indexing
  • Metadata support for filtering

Why ChromaDB?

  • Free, lightweight, Python-native
  • No separate server needed
  • Perfect for learning and small-to-medium projects
  • Easily swappable for Pinecone/Weaviate at scale
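
The core ChromaDB workflow is small enough to show inline. A minimal sketch (the embed() helper is the hypothetical one from the embeddings sketch above; src/vector_store.py wraps similar calls):

import chromadb

client = chromadb.PersistentClient(path="./data/chroma_db")
collection = client.get_or_create_collection("docs")

# Store a pre-computed embedding with its text and metadata
collection.add(
    ids=["chunk-1"],
    embeddings=[embed("Embeddings turn text into vectors.")],
    documents=["Embeddings turn text into vectors."],
    metadatas=[{"source": "embeddings_guide.md"}],
)

# Embed the query into the same space, then fetch the nearest chunks
hits = collection.query(query_embeddings=[embed("What are embeddings?")], n_results=5)
print(hits["documents"][0])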

🏗️ Project Structure

RAG_101/
├── src/                          # Core modules
│   ├── __init__.py
│   ├── config.py                # Configuration
│   ├── ingestion.py             # Load PDF/TXT/MD
│   ├── chunking.py              # Split text strategically
│   ├── embeddings.py            # Generate embeddings via Ollama
│   ├── similarity.py            # Cosine, dot product, L2 metrics
│   ├── vector_store.py          # ChromaDB integration
│   └── search_engine.py         # Main orchestrator
├── data/
│   ├── documents/               # Put your PDFs/TXT/MD here
│   │   ├── machine_learning_intro.md
│   │   ├── embeddings_guide.md
│   │   └── vector_databases.md
│   └── chroma_db/               # ChromaDB storage (auto-created)
├── app.py                       # Streamlit web interface
├── requirements.txt             # Python dependencies
├── .env.example                 # Configuration template
├── README.md                    # This file
└── LEARNING_GUIDE.md            # Detailed educational content

🔧 Configuration

Edit .env to customize behavior:

# Ollama Configuration
OLLAMA_MODEL=nomic-embed-text                    # Embedding model
OLLAMA_BASE_URL=http://localhost:11434           # Ollama server URL

# ChromaDB Configuration
CHROMA_DB_PATH=./data/chroma_db                  # Where to store embeddings

# Search Configuration
TOP_K=5                                          # Number of results
CHUNK_SIZE=500                                   # Characters per chunk
CHUNK_OVERLAP=100                                # Overlap between chunks
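
src/config.py most likely loads these values with python-dotenv; a sketch of that pattern (variable names match .env, defaults are assumptions):

import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the working directory

OLLAMA_MODEL = os.getenv("OLLAMA_MODEL", "nomic-embed-text")
OLLAMA_BASE_URL = os.getenv("OLLAMA_BASE_URL", "http://localhost:11434")
CHROMA_DB_PATH = os.getenv("CHROMA_DB_PATH", "./data/chroma_db")
TOP_K = int(os.getenv("TOP_K", "5"))
CHUNK_SIZE = int(os.getenv("CHUNK_SIZE", "500"))
CHUNK_OVERLAP = int(os.getenv("CHUNK_OVERLAP", "100"))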

Configuration Trade-offs

Setting         Impact
CHUNK_SIZE      Larger = more context but fewer results; smaller = more granular but less context
CHUNK_OVERLAP   Larger = better recall but more storage; 0 = minimal storage
TOP_K           Larger = more results but slower; smaller = faster but may miss relevant docs
OLLAMA_MODEL    Different models trade quality vs. speed

📖 Usage Examples

Via Web Interface (Streamlit)

streamlit run app.py

Three tabs:

  1. 📤 Index Documents: Upload documents, configure chunking
  2. 🔎 Search: Natural language queries
  3. 📊 Stats: View index statistics

Via Python Code

from src.search_engine import SemanticSearchEngine
from src.config import Config

# Initialize engine
engine = SemanticSearchEngine(
    persist_dir=Config.CHROMA_DB_PATH,
    embedding_provider="ollama",
    chunking_strategy="fixed",
    chunk_size=500,
    chunk_overlap=100,
    top_k=5
)

# Index documents
stats = engine.index_documents("./data/documents")
print(f"Indexed {stats['chunks_created']} chunks")

# Search
results = engine.search("What are embeddings?")

# Display results
for result in results:
    print(f"Score: {result['similarity_score']}")
    print(f"Source: {result['source_document']}")
    print(f"Text: {result['text'][:100]}...")
    print("---")

Command Line (if you add a CLI)

python -m src.cli index ./data/documents
python -m src.cli search "Your query here"
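
The repo doesn't ship src/cli.py, so the commands above are aspirational. A minimal argparse version could look like this (hypothetical module; assumes SemanticSearchEngine works with default arguments):

# src/cli.py (hypothetical; not included in the repo)
import argparse
from src.search_engine import SemanticSearchEngine

def main():
    parser = argparse.ArgumentParser(description="Semantic search CLI")
    sub = parser.add_subparsers(dest="command", required=True)
    sub.add_parser("index").add_argument("path")
    sub.add_parser("search").add_argument("query")
    args = parser.parse_args()

    engine = SemanticSearchEngine()
    if args.command == "index":
        stats = engine.index_documents(args.path)
        print(f"Indexed {stats['chunks_created']} chunks")
    else:
        for result in engine.search(args.query):
            print(f"{result['similarity_score']:.2f}  {result['source_document']}")

if __name__ == "__main__":
    main()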

🧪 Testing Different Configurations

Experiment 1: Chunk Size Impact

# Test different chunk sizes (GNU sed shown; on macOS use: sed -i '' ...)
for size in 200 500 1000; do
    sed -i "s/CHUNK_SIZE=.*/CHUNK_SIZE=$size/" .env
    streamlit run app.py  # Re-run and observe; Ctrl+C before the next iteration
done

Expected findings:

  • Smaller chunks: More results, more specific
  • Larger chunks: Fewer results, more context
  • Find your sweet spot (usually 300-700 characters)

Experiment 2: Overlap Impact

# Test impact of overlap
python -c "
from src.search_engine import SemanticSearchEngine

for overlap in [0, 50, 150]:
    engine = SemanticSearchEngine(chunk_overlap=overlap)
    engine.index_documents('./data/documents')
    results = engine.search('machine learning')
    print(f'Overlap {overlap}: {len(results)} results')
"

Expected findings:

  • No overlap: Fewer chunks, potential gaps
  • High overlap: More chunks, better coverage
  • Storage trade-off matters at scale

Experiment 3: Similarity Metrics

from src.search_engine import SemanticSearchEngine

# Compare metrics
for metric in ["cosine", "dot_product", "euclidean"]:
    engine = SemanticSearchEngine(similarity_metric=metric)
    results = engine.search("embeddings")
    # Compare result rankings

📊 How It Works: Data Flow

Indexing Flow

┌──────────────────────────────────────────────────────────┐
│ 1. INGESTION (ingestion.py)                              │
│    Load PDF/TXT/MD → Raw text                            │
└────────────────┬─────────────────────────────────────────┘
                 ▼
┌──────────────────────────────────────────────────────────┐
│ 2. CHUNKING (chunking.py)                                │
│    Split text → Fixed-size pieces with overlap           │
│    E.g., 500 chars/chunk, 100 char overlap               │
└────────────────┬─────────────────────────────────────────┘
                 ▼
┌──────────────────────────────────────────────────────────┐
│ 3. EMBEDDINGS (embeddings.py)                            │
│    Each chunk → 768-dim vector                           │
│    Via Ollama + nomic-embed-text                         │
└────────────────┬─────────────────────────────────────────┘
                 ▼
┌──────────────────────────────────────────────────────────┐
│ 4. VECTOR STORE (vector_store.py)                        │
│    Store embeddings + metadata in ChromaDB               │
│    Build HNSW indexes for fast search                    │
└────────────────┬─────────────────────────────────────────┘
                 ▼
┌──────────────────────────────────────────────────────────┐
│ 5. PERSISTENCE                                           │
│    Save to ./data/chroma_db                              │
│    Load on startup (no re-indexing needed)               │
└──────────────────────────────────────────────────────────┘

Search Flow

┌──────────────────────────────────────────────────────────┐
│ 1. USER QUERY                                            │
│    "What is semantic search?"                            │
└────────────────┬─────────────────────────────────────────┘
                 ▼
┌──────────────────────────────────────────────────────────┐
│ 2. EMBED QUERY (embeddings.py)                           │
│    Query → 768-dim vector (same space as documents)      │
└────────────────┬─────────────────────────────────────────┘
                 ▼
┌──────────────────────────────────────────────────────────┐
│ 3. SEARCH (vector_store.py)                              │
│    Find top-K vectors closest to query                   │
│    Use HNSW index (fast approximate search)              │
└────────────────┬─────────────────────────────────────────┘
                 ▼
┌──────────────────────────────────────────────────────────┐
│ 4. SIMILARITY (similarity.py)                            │
│    Calculate cosine similarity scores                    │
│    Range: -1.0 to 1.0 (higher = more similar)            │
└────────────────┬─────────────────────────────────────────┘
                 ▼
┌──────────────────────────────────────────────────────────┐
│ 5. RETURN RESULTS                                        │
│    [                                                     │
│      {score: 0.87, source: "embeddings_guide.md", ...}, │
│      {score: 0.81, source: "ml_intro.md", ...},         │
│      ...                                                 │
│    ]                                                     │
└──────────────────────────────────────────────────────────┘

🎓 Key Learnings

1. How Semantic Similarity Works

  • Text → vectors (embeddings)
  • Similar meaning → vectors close together
  • Distance between vectors = dissimilarity score
  • Cosine similarity = industry standard

2. Why Chunking Matters

  • Embeddings work better on focused text
  • Smaller chunks → more retrieval results
  • Overlap → improves recall
  • Trade-off between granularity and context

3. Vector Databases Are Essential

  • Can't do similarity search efficiently with traditional SQL indexes
  • Need specialized indexes (HNSW, IVF, etc.)
  • ChromaDB provides this with minimal setup
  • Scales from thousands to millions of vectors

4. Strengths of Semantic Search

✅ Finds semantically related content
✅ Works across different phrasings
✅ Fast at scale (with proper indexing)
✅ No need for manual tagging/labels

5. Limitations to Know

❌ Semantic ambiguity (query could mean multiple things)
❌ False positives (unrelated but similar vectors)
❌ Struggles with negation ("NOT machine learning" still retrieves ML)
❌ Requires good embeddings model


🚀 Extension Ideas

1. Add More Embedding Models

# In src/embeddings.py, add support for:
- mxbai-embed-large (larger, better quality)
- all-minilm (smaller, faster)
- Custom fine-tuned models

2. Implement Reranking

# After initial search, rerank results with:
- Different similarity metric
- Cross-encoder model
- Custom scoring function
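
For example, a cross-encoder scores (query, passage) pairs jointly, which is slower than vector search but usually sharper. A sketch using sentence-transformers (an extra dependency, not in requirements.txt; the model name is one common choice):

from sentence_transformers import CrossEncoder

def rerank(query, results, model_name="cross-encoder/ms-marco-MiniLM-L-6-v2"):
    model = CrossEncoder(model_name)
    scores = model.predict([(query, r["text"]) for r in results])
    # Sort by cross-encoder score, highest first
    return [r for _, r in sorted(zip(scores, results), key=lambda p: -p[0])]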

3. Add LLM Answer Generation

# Use retrieved chunks to generate answers:
- Integrate with local LLM (Ollama)
- Create RAG (Retrieval Augmented Generation) pipeline
- Combine search + generation for QA
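
The generation step can reuse the same local Ollama server. A hedged sketch that stuffs retrieved chunks into a prompt (assumes you have pulled a chat model such as llama3):

import requests

def answer(query, chunks, model="llama3", base_url="http://localhost:11434"):
    context = "\n\n".join(c["text"] for c in chunks)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    # Ollama's text-generation endpoint; stream=False returns one JSON blob
    resp = requests.post(
        f"{base_url}/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"]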

4. Web UI Improvements

# Enhance Streamlit app:
- File upload for documents
- Show chunk relationships
- Visualize embedding space (UMAP/t-SNE)
- Export results as PDF

5. Scale to Production

# Move to production:
- Switch to Pinecone for larger scale
- Add API endpoints (FastAPI)
- Implement caching
- Add authentication
- Set up continuous indexing
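
A FastAPI wrapper can start as small as this sketch (assumes the engine's defaults are usable; run with uvicorn api:app):

# api.py (illustrative only)
from fastapi import FastAPI
from src.search_engine import SemanticSearchEngine

app = FastAPI()
engine = SemanticSearchEngine()

@app.get("/search")
def search(q: str, k: int = 5):
    # Delegate to the existing engine; add caching/auth here later
    return {"query": q, "results": engine.search(q)[:k]}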

🔧 Troubleshooting

"Cannot connect to Ollama"

Symptom: RuntimeError: Cannot connect to Ollama at http://localhost:11434

Solutions:

# 1. Check Ollama is running
ollama serve

# 2. Verify connectivity
curl http://localhost:11434/api/tags

# 3. Check .env has correct URL
grep OLLAMA_BASE_URL .env

# 4. On macOS, try different host
# Edit .env: OLLAMA_BASE_URL=http://127.0.0.1:11434

"Model not found"

Symptom: Error: Model 'nomic-embed-text' not found

Solutions:

# Pull the model
ollama pull nomic-embed-text

# Check installed models
ollama list

# Try a different model
# Edit .env: OLLAMA_MODEL=all-minilm
# Then: ollama pull all-minilm

"No documents found"

Symptom: Index succeeds but finds 0 documents

Solutions:

# 1. Check directory exists
ls ./data/documents

# 2. Add sample documents
# Already included in this repo

# 3. Check file extensions
# Must be .pdf, .txt, .md (extension matching is case-sensitive on Linux)

"Search returns no results"

Symptom: Search runs but no results found

Solutions:

# 1. Check index isn't empty
# Use Stats tab in Streamlit

# 2. Try simpler query
# Complex queries might not match anything

# 3. Reduce similarity threshold
# Modify vector_store.py to return lower-scored results

# 4. Check ChromaDB folder
ls -la ./data/chroma_db

"Search is very slow"

Symptom: Queries take >5 seconds

Solutions:

# 1. Reduce chunk count
# Edit .env: TOP_K=3 (instead of 5)

# 2. Smaller chunks (each chunk embeds faster)
# Edit .env: CHUNK_SIZE=300 (instead of 500)

# 3. Reduce overlap
# Edit .env: CHUNK_OVERLAP=0 (instead of 100)

# 4. Check Ollama isn't overloaded
# Look at Ollama server console

📈 Performance Metrics

On typical modern hardware:

Operation                  Typical Time     Notes
Embed 1000 chunks          30-60 seconds    Depends on Ollama hardware
Index in ChromaDB          5-10 seconds     1000 chunks
Single query               100-500 ms       Depends on index size
Full pipeline (100 docs)   2-5 minutes      Single thread

📝 Example Queries to Try

Use the sample documents included (machine_learning_intro.md, embeddings_guide.md, vector_databases.md):

Semantic Match Queries

"What are embeddings?"
"How do neural networks work?"
"Tell me about vector databases"

Cross-document Queries

"What is the relationship between embeddings and machine learning?"
"How do vector databases enable semantic search?"

Challenging Queries

"Benefits of semantic search"
"Machine learning vs deep learning"
"Embeddings in production"

Likely False Positives

"Not machine learning"  # Will still find ML content
"Opposite of clustering"  # Might confuse

🏆 Design Decisions & Trade-offs

1. Fixed-Size Chunking vs Semantic Chunking

  • Chose: Fixed-size
  • Why: Simpler, more predictable, good learning tool
  • Alternative: Semantic would preserve meaning better but requires sentence detection

2. Ollama vs OpenAI Embeddings

  • Chose: Ollama (local)
  • Why: Free, private, no API dependency, perfect for learning
  • Alternative: OpenAI embeddings offer higher quality but cost money

3. ChromaDB vs Pinecone/Weaviate

  • Chose: ChromaDB
  • Why: No separate server, lightweight, good for learning
  • Alternative: Pinecone scales better but is less educational

4. Streamlit vs FastAPI/Flask

  • Chose: Streamlit
  • Why: Rapid development, great for demos, easy to understand
  • Alternative: FastAPI better for production APIs

5. Cosine vs Dot Product Similarity

  • Chose: Cosine
  • Why: Scale-invariant, industry standard, more intuitive
  • Alternative: Dot product is faster, and equivalent to cosine when vectors are normalized
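
The equivalence is easy to verify: once vectors are scaled to unit length, their dot product is their cosine similarity.

import numpy as np

v, w = np.array([3.0, 4.0]), np.array([6.0, 8.0])
v_hat, w_hat = v / np.linalg.norm(v), w / np.linalg.norm(w)

print(np.dot(v_hat, w_hat))                                    # 1.0
print(np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w)))  # 1.0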


🤝 Contributing

This is a learning project. Feel free to:

  • Modify configurations and experiment
  • Add new chunking strategies
  • Implement different similarity metrics
  • Add new embedding providers
  • Improve the Streamlit UI

📄 License

MIT License - feel free to use for learning and projects


🎓 What's Next?

Week 3: Multi-hop retrieval and query expansion
Week 4: Reranking and relevance optimization
Week 5: RAG (Retrieval Augmented Generation) with LLMs
Week 6: Production deployment and scaling


❓ Questions?

Refer to:

  • LEARNING_GUIDE.md - Detailed educational content
  • Code comments - Each module has detailed docstrings
  • This README - Overview and troubleshooting

Happy learning! 🚀
