Week 2 Learning Track: Building Intuition and Hands-On Skills
A complete, educational implementation of a semantic search system. Learn how embeddings, similarity metrics, chunking strategies, and vector databases work together to power modern retrieval systems.
- Python 3.8+
- Ollama installed and running
- ~2GB disk space for embeddings index
# Clone and navigate
cd /path/to/RAG_101
# Create virtual environment
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Copy example config
cp .env.example .env
In a separate terminal:
# Start Ollama server
ollama serve
# In another terminal, pull the embedding model
ollama pull nomic-embed-text
# Launch the web app
streamlit run app.py
This opens a web app at http://localhost:8501
- Click "Index Documents" tab
- Click "📂 Index Documents" button (uses sample documents)
- Click "🔎 Search" tab
- Try queries like:
- "What are embeddings?"
- "How does machine learning work?"
- "What is a vector database?"
Done! You've built a semantic search system.
A complete semantic search pipeline:
Documents → Chunks → Embeddings → Vector DB → Search Results
- Documents: PDF, TXT, MD files
- Chunks: Break into semantic units
- Embeddings: Convert to vectors via Ollama
- Vector DB: Index in ChromaDB
- Search: Find most similar chunks
- Convert text → 768-dimensional vectors
- Similar text = similar vectors
- Enabled by Ollama + nomic-embed-text
Why embeddings?
- Enable semantic similarity calculations
- Capture meaning, not just keywords
- Much cheaper than API calls (free!)
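To make this concrete, here is a minimal sketch of generating one embedding by calling the local Ollama server directly with requests. It assumes Ollama is running on its default port and that nomic-embed-text has been pulled; src/embeddings.py wraps this kind of call, though its exact helper names may differ.
# Minimal sketch (not the project's exact code): embed one piece of text
# via a locally running Ollama server.
from typing import List

import requests

def embed_text(text: str, model: str = "nomic-embed-text") -> List[float]:
    """Return the embedding vector for `text` via Ollama's /api/embeddings endpoint."""
    response = requests.post(
        "http://localhost:11434/api/embeddings",
        json={"model": model, "prompt": text},
        timeout=60,
    )
    response.raise_for_status()
    return response.json()["embedding"]

vector = embed_text("Embeddings turn text into vectors that capture meaning.")
print(len(vector))  # 768 for nomic-embed-text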
- Measure distance between vectors
- Cosine similarity: angle between vectors
- Returns top-k most similar results
Why cosine similarity?
- Works regardless of vector magnitude
- Scale-invariant
- Industry standard for NLP
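For reference, cosine similarity is just the dot product divided by the product of the vector norms. A small numpy sketch (illustrative, not the exact code in src/similarity.py):
# Cosine similarity between two embedding vectors.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """cos(theta) = (a . b) / (||a|| * ||b||) -- scale-invariant."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction, different magnitude
print(cosine_similarity(a, b))  # 1.0 -- magnitude does not matter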
- Break documents into manageable pieces
- Trade-off: granularity (smaller chunks) vs. context (larger chunks)
Chunk size impact:
- Too small: loses context
- Too large: less relevant results
- Overlap: improves recall but increases storage
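A fixed-size chunker with overlap fits in a few lines. The sketch below is illustrative; src/chunking.py implements the project's real version and may differ in details.
# Fixed-size character chunking with overlap (illustrative sketch).
from typing import List

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 100) -> List[str]:
    """Slide a window of `chunk_size` characters, stepping by chunk_size - overlap."""
    step = max(1, chunk_size - overlap)  # guard against overlap >= chunk_size
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk_text("Semantic search converts documents into vectors. " * 50)
print(len(chunks), len(chunks[0]))  # number of chunks, size of the first chunk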
- Efficient storage of embeddings
- Fast similarity search via HNSW indexing
- Metadata support for filtering
Why ChromaDB?
- Free, lightweight, Python-native
- No separate server needed
- Perfect for learning and small-to-medium projects
- Easily swappable for Pinecone/Weaviate at scale
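The core ChromaDB workflow is three calls: create a collection, add embeddings, query. A hedged sketch follows (the real integration lives in src/vector_store.py; the collection name and toy 3-dim vectors below are made up):
# Store and query embeddings with a local ChromaDB instance (sketch).
import chromadb

client = chromadb.PersistentClient(path="./data/chroma_db")
collection = client.get_or_create_collection(
    name="demo_docs",                    # hypothetical collection name
    metadata={"hnsw:space": "cosine"},   # use cosine distance for the HNSW index
)

collection.add(
    ids=["chunk-0", "chunk-1"],
    embeddings=[[0.1, 0.2, 0.3], [0.9, 0.1, 0.0]],   # toy 3-dim vectors
    documents=["first chunk text", "second chunk text"],
    metadatas=[{"source": "demo.md"}, {"source": "demo.md"}],
)

results = collection.query(query_embeddings=[[0.1, 0.2, 0.25]], n_results=1)
print(results["documents"], results["distances"])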
Project structure:
RAG_101/
├── src/ # Core modules
│ ├── __init__.py
│ ├── config.py # Configuration
│ ├── ingestion.py # Load PDF/TXT/MD
│ ├── chunking.py # Split text strategically
│ ├── embeddings.py # Generate embeddings via Ollama
│ ├── similarity.py # Cosine, dot product, L2 metrics
│ ├── vector_store.py # ChromaDB integration
│ └── search_engine.py # Main orchestrator
├── data/
│ ├── documents/ # Put your PDFs/TXT/MD here
│ │ ├── machine_learning_intro.md
│ │ ├── embeddings_guide.md
│ │ └── vector_databases.md
│ └── chroma_db/ # ChromaDB storage (auto-created)
├── app.py # Streamlit web interface
├── requirements.txt # Python dependencies
├── .env.example # Configuration template
├── README.md # This file
└── LEARNING_GUIDE.md # Detailed educational content
Edit .env to customize behavior:
# Ollama Configuration
OLLAMA_MODEL=nomic-embed-text # Embedding model
OLLAMA_BASE_URL=http://localhost:11434 # Ollama server URL
# ChromaDB Configuration
CHROMA_DB_PATH=./data/chroma_db # Where to store embeddings
# Search Configuration
TOP_K=5 # Number of results
CHUNK_SIZE=500 # Characters per chunk
CHUNK_OVERLAP=100 # Overlap between chunks
| Setting | Impact |
|---|---|
| CHUNK_SIZE | Larger = more context but fewer results; Smaller = more granular but less context |
| CHUNK_OVERLAP | Larger = better recall but more storage; 0 = minimal storage |
| TOP_K | Larger = more results but slower; Smaller = faster but might miss relevant docs |
| OLLAMA_MODEL | Different models trade quality vs. speed |
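Under the hood, src/config.py presumably reads these values from .env, roughly as sketched below. This assumes python-dotenv is installed (check requirements.txt); the actual file may differ.
# Sketch of how the .env settings might be loaded (see src/config.py for the real version).
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current directory

OLLAMA_MODEL = os.getenv("OLLAMA_MODEL", "nomic-embed-text")
OLLAMA_BASE_URL = os.getenv("OLLAMA_BASE_URL", "http://localhost:11434")
CHROMA_DB_PATH = os.getenv("CHROMA_DB_PATH", "./data/chroma_db")
TOP_K = int(os.getenv("TOP_K", "5"))
CHUNK_SIZE = int(os.getenv("CHUNK_SIZE", "500"))
CHUNK_OVERLAP = int(os.getenv("CHUNK_OVERLAP", "100"))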
Run the web interface:
streamlit run app.py
Three tabs:
- 📤 Index Documents: Upload documents, configure chunking
- 🔎 Search: Natural language queries
- 📊 Stats: View index statistics
from src.search_engine import SemanticSearchEngine
from src.config import Config
# Initialize engine
engine = SemanticSearchEngine(
    persist_dir=Config.CHROMA_DB_PATH,
    embedding_provider="ollama",
    chunking_strategy="fixed",
    chunk_size=500,
    chunk_overlap=100,
    top_k=5
)
# Index documents
stats = engine.index_documents("./data/documents")
print(f"Indexed {stats['chunks_created']} chunks")
# Search
results = engine.search("What are embeddings?")
# Display results
for result in results:
print(f"Score: {result['similarity_score']}")
print(f"Source: {result['source_document']}")
print(f"Text: {result['text'][:100]}...")
print("---")python -m src.cli index ./data/documents
python -m src.cli search "Your query here"# Test different chunk sizes
for size in 200 500 1000; do
  sed -i "s/CHUNK_SIZE=.*/CHUNK_SIZE=$size/" .env
  streamlit run app.py # Re-run and observe
done
Expected findings:
- Smaller chunks: More results, more specific
- Larger chunks: Fewer results, more context
- Find your sweet spot (usually 300-700)
# Test impact of overlap
python -c "
from src.search_engine import SemanticSearchEngine
for overlap in [0, 50, 150]:
    engine = SemanticSearchEngine(chunk_overlap=overlap)
    engine.index_documents('./data/documents')
    results = engine.search('machine learning')
    print(f'Overlap {overlap}: {len(results)} results')
"
Expected findings:
- No overlap: Fewer chunks, potential gaps
- High overlap: More chunks, better coverage
- Storage trade-off matters at scale
from src.search_engine import SemanticSearchEngine
# Compare metrics
for metric in ["cosine", "dot_product", "euclidean"]:
    engine = SemanticSearchEngine(similarity_metric=metric)
    results = engine.search("embeddings")
    # Compare result rankings
Indexing pipeline:
┌──────────────────────────────────────────────────────────┐
│ 1. INGESTION (ingestion.py) │
│ Load PDF/TXT/MD → Raw text │
└────────────────┬─────────────────────────────────────────┘
▼
┌──────────────────────────────────────────────────────────┐
│ 2. CHUNKING (chunking.py) │
│ Split text → Fixed-size pieces with overlap │
│ E.g., 500 chars/chunk, 100 char overlap │
└────────────────┬─────────────────────────────────────────┘
▼
┌──────────────────────────────────────────────────────────┐
│ 3. EMBEDDINGS (embeddings.py) │
│ Each chunk → 768-dim vector │
│ Via Ollama + nomic-embed-text │
└────────────────┬─────────────────────────────────────────┘
▼
┌──────────────────────────────────────────────────────────┐
│ 4. VECTOR STORE (vector_store.py) │
│ Store embeddings + metadata in ChromaDB │
│ Build HNSW indexes for fast search │
└────────────────┬─────────────────────────────────────────┘
▼
┌──────────────────────────────────────────────────────────┐
│ 5. PERSISTENCE │
│ Save to ./data/chroma_db │
│ Load on startup (no re-indexing needed) │
└──────────────────────────────────────────────────────────┘
Search pipeline:
┌──────────────────────────────────────────────────────────┐
│ 1. USER QUERY │
│ "What is semantic search?" │
└────────────────┬─────────────────────────────────────────┘
▼
┌──────────────────────────────────────────────────────────┐
│ 2. EMBED QUERY (embeddings.py) │
│ Query → 768-dim vector (same space as documents) │
└────────────────┬─────────────────────────────────────────┘
▼
┌──────────────────────────────────────────────────────────┐
│ 3. SEARCH (vector_store.py) │
│ Find top-K vectors closest to query │
│ Use HNSW index (fast approximate search) │
└────────────────┬─────────────────────────────────────────┘
▼
┌──────────────────────────────────────────────────────────┐
│ 4. SIMILARITY (similarity.py) │
│ Calculate cosine similarity scores │
│ Range: 0.0 (unrelated) to 1.0 (identical) │
└────────────────┬─────────────────────────────────────────┘
▼
┌──────────────────────────────────────────────────────────┐
│ 5. RETURN RESULTS │
│ [ │
│ {score: 0.87, source: "embeddings_guide.md", ...},│
│ {score: 0.81, source: "ml_intro.md", ...}, │
│ ... │
│ ] │
└──────────────────────────────────────────────────────────┘
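Put together, the query path is only a few lines: embed the query with the same model used for indexing, ask ChromaDB for the nearest vectors, and turn distances into similarity scores. A hedged sketch reusing the calls shown earlier (the collection name is hypothetical, and the 1 - distance conversion assumes the collection uses cosine space):
# Sketch of the query path: embed query -> nearest neighbors -> scores.
import chromadb
import requests

def embed(text: str):
    r = requests.post(
        "http://localhost:11434/api/embeddings",
        json={"model": "nomic-embed-text", "prompt": text},
        timeout=60,
    )
    r.raise_for_status()
    return r.json()["embedding"]

client = chromadb.PersistentClient(path="./data/chroma_db")
collection = client.get_or_create_collection("my_chunks")  # hypothetical name, assumed already populated

query = "What is semantic search?"
hits = collection.query(query_embeddings=[embed(query)], n_results=5)

for doc, dist in zip(hits["documents"][0], hits["distances"][0]):
    # cosine distance = 1 - cosine similarity (when the collection uses cosine space)
    print(f"similarity ~ {1 - dist:.2f}: {doc[:60]}...")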
- Text → vectors (embeddings)
- Similar meaning → vectors close together
- Distance between vectors = dissimilarity score
- Cosine similarity = industry standard
- Embeddings work better on focused text
- Smaller chunks → more retrieval results
- Overlap → improves recall
- Trade-off between granularity and context
- Can't do similarity search efficiently with SQL
- Need specialized indexes (HNSW, IVF, etc.)
- ChromaDB provides this with minimal setup
- Scales from thousands to millions of vectors
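To see what the index buys you, here is brute-force search in numpy: score the query against every stored vector and sort. This is fine for a few thousand vectors but far too slow for millions, which is the gap HNSW-style indexes close. (Illustrative only; ChromaDB handles this internally.)
# Brute-force top-k search: what a vector index saves you from at scale.
import numpy as np

rng = np.random.default_rng(0)
stored = rng.normal(size=(10_000, 768))                  # pretend these are chunk embeddings
stored /= np.linalg.norm(stored, axis=1, keepdims=True)  # normalize once

query = rng.normal(size=768)
query /= np.linalg.norm(query)

scores = stored @ query              # cosine similarity == dot product for normalized vectors
top_k = np.argsort(scores)[::-1][:5] # sorts over *every* stored vector
print(top_k, scores[top_k])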
✅ Finds semantically related content
✅ Works across different phrasings
✅ Fast at scale (with proper indexing)
✅ No need for manual tagging/labels
❌ Semantic ambiguity (query could mean multiple things)
❌ False positives (unrelated but similar vectors)
❌ Struggles with negation ("NOT machine learning" still retrieves ML)
❌ Requires good embeddings model
# In src/embeddings.py, add support for:
- mxbai-embed-large (larger, better quality)
- all-MiniLM-L6-v2 (smaller, faster)
- Custom fine-tuned models
# After initial search, rerank results with (see the sketch after this list):
- Different similarity metric
- Cross-encoder model
- Custom scoring function
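As a starting point for the cross-encoder idea, here is a hedged sketch using the sentence-transformers package (not part of this project's requirements) with a public MS MARCO cross-encoder; the retrieved chunks are toy strings standing in for real search results.
# Rerank retrieved chunks with a cross-encoder (sketch; requires sentence-transformers).
from sentence_transformers import CrossEncoder

query = "What are embeddings?"
retrieved_chunks = [
    "Embeddings map text to dense vectors that capture meaning.",
    "Vector databases store embeddings and support similarity search.",
    "Gradient descent minimizes a loss function during training.",
]

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(query, chunk) for chunk in retrieved_chunks])

# Highest cross-encoder score first
for score, chunk in sorted(zip(scores, retrieved_chunks), reverse=True):
    print(f"{score:.3f}  {chunk}")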
# Use retrieved chunks to generate answers:
- Integrate with local LLM (Ollama)
- Create RAG (Retrieval Augmented Generation) pipeline
- Combine search + generation for QA
# Enhance Streamlit app:
- File upload for documents
- Show chunk relationships
- Visualize embedding space (UMAP/t-SNE)
- Export results as PDF
# Move to production:
- Switch to Pinecone for larger scale
- Add API endpoints (FastAPI)
- Implement caching
- Add authentication
- Set up continuous indexing
Symptom: RuntimeError: Cannot connect to Ollama at http://localhost:11434
Solutions:
# 1. Check Ollama is running
ollama serve
# 2. Verify connectivity
curl http://localhost:11434/api/tags
# 3. Check .env has correct URL
grep OLLAMA_BASE_URL .env
# 4. On macOS, try different host
# Edit .env: OLLAMA_BASE_URL=http://127.0.0.1:11434
Symptom: Error: Model 'nomic-embed-text' not found
Solutions:
# Pull the model
ollama pull nomic-embed-text
# Check installed models
ollama list
# Try a different embedding model
# Edit .env: OLLAMA_MODEL=mxbai-embed-large
# Then: ollama pull mxbai-embed-large
Symptom: Index succeeds but finds 0 documents
Solutions:
# 1. Check directory exists
ls ./data/documents
# 2. Add sample documents
# Already included in this repo
# 3. Check file extensions
# Must be .pdf, .txt, .md (case-sensitive on Linux/Mac)
Symptom: Search runs but no results found
Solutions:
# 1. Check index isn't empty
# Use Stats tab in Streamlit
# 2. Try simpler query
# Complex queries might not match anything
# 3. Reduce similarity threshold
# Modify vector_store.py to return lower-scored results
# 4. Check ChromaDB folder
ls -la ./data/chroma_db
Symptom: Queries take >5 seconds
Solutions:
# 1. Reduce chunk count
# Edit .env: TOP_K=3 (instead of 5)
# 2. Smaller chunks (faster indexing)
# Edit .env: CHUNK_SIZE=300 (instead of 500)
# 3. Reduce overlap
# Edit .env: CHUNK_OVERLAP=0 (instead of 100)
# 4. Check Ollama isn't overloaded
# Look at Ollama server console
On typical modern hardware:
| Operation | Typical Time | Notes |
|---|---|---|
| Embed 1000 chunks | 30-60 seconds | Depends on Ollama hardware |
| Index in ChromaDB | 5-10 seconds | 1000 chunks |
| Single query | 100-500ms | Depends on index size |
| Full pipeline (100 docs) | 2-5 minutes | Single-threaded |
Use the sample documents included (machine_learning_intro.md, embeddings_guide.md, vector_databases.md):
"What are embeddings?"
"How do neural networks work?"
"Tell me about vector databases"
"What is the relationship between embeddings and machine learning?"
"How do vector databases enable semantic search?"
"Benefits of semantic search"
"Machine learning vs deep learning"
"Embeddings in production"
"Not machine learning" # Will still find ML content
"Opposite of clustering" # Might confuse
Chunking strategy:
- Chose: Fixed-size
- Why: Simpler, more predictable, good learning tool
- Alternative: Semantic would preserve meaning better but requires sentence detection
Embedding provider:
- Chose: Ollama (local)
- Why: Free, private, no API dependency, perfect for learning
- Alternative: OpenAI better quality but costs money
Vector database:
- Chose: ChromaDB
- Why: No separate server, lightweight, good for learning
- Alternative: Pinecone scales better but less educational
Web interface:
- Chose: Streamlit
- Why: Rapid development, great for demos, easy to understand
- Alternative: FastAPI better for production APIs
Similarity metric:
- Chose: Cosine
- Why: Scale-invariant, industry standard, more intuitive
- Alternative: Dot product faster if vectors normalized
This is a learning project. Feel free to:
- Modify configurations and experiment
- Add new chunking strategies
- Implement different similarity metrics
- Add new embedding providers
- Improve the Streamlit UI
MIT License - feel free to use for learning and projects
Week 3: Multi-hop retrieval and query expansion
Week 4: Reranking and relevance optimization
Week 5: RAG (Retrieval Augmented Generation) with LLMs
Week 6: Production deployment and scaling
Refer to:
- LEARNING_GUIDE.md - Detailed educational content
- Code comments - Each module has detailed docstrings
- This README - Overview and troubleshooting
Happy learning! 🚀