Merged
3 changes: 2 additions & 1 deletion AGENTS.md
@@ -4,7 +4,8 @@
- `latest/`: Current course code and the WildChat case study (`latest/case_study/{core,pipelines}`).
- `cohort_1/`, `cohort_2/`: Earlier cohort materials kept for reference.
- `docs/`: MkDocs book sources; site config in `mkdocs.yml`.
- `docs/workshops/`: Chapter content `chapterN.md` and subparts `chapterN-M.md`, plus `chapterN-slides.md`; entrypoint is `docs/workshops/index.md`.
- `docs/workshops/`: Chapter content `chapterN.md` and subparts `chapterN-M.md`; entrypoint is `docs/workshops/index.md`.
- `docs/slides/`: Slide decks `chapterN-slides.md` for workshop chapters.
- `md/`: Markdown exports of notebooks; images in `images/`.
- `scripts/`, `build_book.sh`: Utilities for diagrams and building the PDF/ebook.

195 changes: 79 additions & 116 deletions README.md
@@ -1,22 +1,25 @@
# Systematically Improving RAG Applications

A comprehensive course teaching data-driven approaches to building and improving Retrieval-Augmented Generation (RAG) systems. This repository contains course materials, code examples, and a companion book.
A comprehensive educational resource teaching data-driven approaches to building and improving Retrieval-Augmented Generation (RAG) systems. Learn from real case studies with concrete metrics showing how RAG systems improve from 60% to 85%+ accuracy through systematic measurement and iteration.

## 🎓 Take the Course
## What You'll Learn

All of this material is supported by the **Systematically Improving RAG Course**.
Transform RAG from a technical implementation into a continuously improving product through:

[**Click here to get 20% off →**](https://maven.com/applied-llms/rag-playbook?promoCode=EBOOK)
- **Data-driven evaluation**: Establish metrics before building features
- **Systematic improvement**: Turn evaluation insights into measurable gains
- **User feedback loops**: Design systems that learn from real usage
- **Specialized retrieval**: Build purpose-built retrievers for different content types
- **Intelligent routing**: Orchestrate multiple specialized components
- **Production deployment**: Maintain improvement velocity at scale

## Course Overview
### Real Case Studies Featured

This course teaches you how to systematically improve RAG applications through:
**Legal Tech Company**: 63% → 87% accuracy over 3 months through systematic error analysis, better chunking, and validation patterns. Generated 50,000+ citation examples for continuous training.

- Data-driven evaluation and metrics
- Embedding fine-tuning and optimization
- Query understanding and routing
- Structured data integration
- Production deployment strategies
**Construction Blueprint Search**: 27% → 85% recall in 4 days by using vision models for spatial descriptions. Further improved to 92% for counting queries through bounding box detection.

**Feedback Collection**: 10 → 40 daily submissions (4x improvement) through better UX copy and interactive elements, enabling faster improvement cycles.

### The RAG Flywheel

@@ -31,145 +34,105 @@ The core philosophy centers around the "RAG Flywheel" - a continuous improvement

```text
.
├── cohort_1/ # First cohort materials (6 weeks)
├── cohort_2/ # Second cohort materials (weeks 0-6)
├── latest/ # Current course version with latest updates
│ ├── week0/ # Getting started with Jupyter, LanceDB, and evals
│ ├── week1/ # RAG evaluation foundations
│ ├── week2/ # Embedding fine-tuning
│ ├── week4/ # Query understanding and routing
│ ├── week5/ # Structured data and metadata
│ ├── week6/ # Tool selection and product integration
│ ├── case_study/ # Comprehensive WildChat project
│ └── extra_kura/ # Advanced notebooks on clustering and classifiers
├── docs/ # MkDocs documentation source
│ ├── workshops/ # Detailed chapter guides (0-7) aligned with course weeks
│ ├── talks/ # Industry expert presentations and case studies
│ ├── office-hours/# Q&A summaries from cohorts 2 and 3
│ ├── assets/ # Images and diagrams for documentation
├── docs/ # Complete workshop series (Chapters 0-7)
│ ├── workshops/ # Progressive learning path from evaluation to production
│ ├── talks/ # Industry expert presentations with case studies
│ ├── office-hours/# Q&A summaries addressing real implementation challenges
│ └── misc/ # Additional learning resources
├── data/ # CSV files from industry talks
├── md/ # Markdown conversions of notebooks
├── latest/ # Reference implementations and case study code
│ ├── case_study/ # Comprehensive WildChat project demonstrating concepts
│ ├── week0-6/ # Code examples aligned with workshop chapters
│ └── examples/ # Standalone demonstrations
├── data/ # Real datasets from case studies and talks
└── mkdocs.yml # Documentation configuration
```

## Course Structure: Weekly Curriculum & Book Chapters
## Learning Path: Workshop Chapters

The course follows a 6-week structure where each week corresponds to specific workshop chapters in the companion book:
The workshops follow a systematic progression from evaluation to production:

### Week 1: Starting the Flywheel
### Chapter 0: Beyond Implementation to Improvement

- **Book Coverage**: Chapter 0 (Introduction) + Chapter 1 (Starting the Flywheel with Data)
- **Topics**:
- Shifting from static implementations to continuously improving products
- Overcoming the cold-start problem through synthetic data generation
- Establishing meaningful metrics aligned with business goals
- RAG as a recommendation engine wrapped around language models
Mindset shift from technical project to product. See how the legal tech company went from 63% to 87% accuracy by treating RAG as a recommendation engine with continuous feedback loops.

### Week 2: From Evaluation to Enhancement
### Chapter 1: Starting the Data Flywheel

- **Book Coverage**: Chapter 2 (From Evaluation to Product Enhancement)
- **Topics**:
- Transforming evaluation insights into concrete improvements
- Fine-tuning embeddings with Cohere and open-source models
- Re-ranking strategies and targeted capability development
Build evaluation frameworks before you have users. Learn from the blueprint search case: 27% → 85% recall in 4 days through synthetic data and task-specific vision model prompting.
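A recall evaluation like the one behind the blueprint numbers can be sketched in a few lines. This is a hypothetical illustration, not the course code: `toy_search`, `eval_set`, and the chunk ids are made up, and a real setup would swap in an actual retriever over synthetic question/chunk pairs.

```python
# Hypothetical sketch: recall@k over synthetic (question, expected chunk) pairs.
# `toy_search` and the corpus below are illustrative stand-ins only.

def recall_at_k(eval_set, search, k=5):
    """eval_set: list of (question, expected_chunk_id) pairs.
    search(question, k) -> list of chunk ids, best first."""
    hits = sum(
        1 for question, expected in eval_set
        if expected in search(question, k)
    )
    return hits / len(eval_set)

# Toy index: score chunks by word overlap with the question.
corpus = {"c1": "pour concrete footing", "c2": "electrical conduit run"}

def toy_search(question, k):
    scored = sorted(
        corpus,
        key=lambda cid: -len(set(question.split()) & set(corpus[cid].split())),
    )
    return scored[:k]

eval_set = [("where is the concrete footing", "c1"),
            ("trace the conduit run", "c2")]
print(recall_at_k(eval_set, toy_search, k=1))  # 1.0 on this toy set
```

The point of starting here is that the metric, not the retriever, is the fixed asset: the same `recall_at_k` loop scores every later retriever change against the same synthetic set.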

### Week 3: User Experience Design
### Chapter 2: From Evaluation to Enhancement

- **Book Coverage**: Chapter 3 (UX - 3 parts)
- Part 1: Design Principles
- Part 2: Feedback Collection
- Part 3: Iterative Improvement
- **Topics**:
- Building interfaces that delight users and gather feedback
- Creating virtuous cycles of improvement
- Continuous refinement based on user interaction
Turn evaluation insights into measurable improvements. Fine-tuning embeddings delivers 6-10% gains. Learn when to use re-rankers vs custom embeddings based on your data distribution.

### Week 4: Query Understanding & Topic Modeling
### Chapter 3: User Experience (3 Parts)

- **Book Coverage**: Chapter 4 (Topic Modeling - 2 parts)
- Part 1: Analysis - Segmenting users and queries
- Part 2: Prioritization - High-value opportunities
- **Topics**:
- Query classification with BERTopic
- Pattern discovery in user queries
- Creating improvement roadmaps based on usage patterns
**3.1 - Feedback Collection**: Zapier increased feedback from 10 to 40 submissions/day through better UX copy
**3.2 - Perceived Performance**: 11% perception improvement equals 40% reduction in perceived wait time
**3.3 - Quality of Life**: Citations, validation, chain-of-thought delivering 18% accuracy improvements

### Chapter 4: Understanding Users (2 Parts)

**4.1 - Finding Patterns**: Construction company discovered 8% of queries (scheduling) drove 35% of churn
**4.2 - Prioritization**: Use 2x2 frameworks to choose what to build next based on volume and impact

### Chapter 5: Specialized Retrieval (2 Parts)

**5.1 - Foundations**: Why one-size-fits-all fails. Different queries need different approaches
**5.2 - Implementation**: Documents, images, tables, SQL - each needs specialized handling

### Week 5: Multimodal & Structured Data
### Chapter 6: Unified Architecture (3 Parts)

- **Book Coverage**: Chapter 5 (Multimodal - 2 parts)
- Part 1: Understanding different content types
- Part 2: Implementation strategies
**6.1 - Query Routing**: Construction company: 65% → 78% through proper routing (95% × 82% ≈ 78%)
**6.2 - Tool Interfaces**: Clean APIs enable parallel development. 40 examples/tool = 95% routing accuracy
**6.3 - Performance Measurement**: Two-level metrics separate routing failures from retrieval failures
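The two-level decomposition can be sketched as follows; this is a hypothetical illustration (the field names `routed_ok`/`retrieved_ok` are invented), showing how overall success factors into routing accuracy times retrieval recall given the correct route.

```python
# Hypothetical sketch of two-level metrics: attribute each failure to
# routing or to retrieval, so overall ≈ routing_accuracy × retrieval_recall.

def two_level_metrics(logs):
    """logs: list of dicts with keys 'routed_ok' and 'retrieved_ok'
    (retrieval is only judged when the route was correct)."""
    n = len(logs)
    routed = [r for r in logs if r["routed_ok"]]
    routing_acc = len(routed) / n
    retrieval_recall = (
        sum(r["retrieved_ok"] for r in routed) / len(routed) if routed else 0.0
    )
    return routing_acc, retrieval_recall, routing_acc * retrieval_recall

# 100 synthetic log entries matching the chapter's 95% × 82% ≈ 78% example.
logs = (
    [{"routed_ok": True, "retrieved_ok": True}] * 78
    + [{"routed_ok": True, "retrieved_ok": False}] * 17
    + [{"routed_ok": False, "retrieved_ok": False}] * 5
)
print(two_level_metrics(logs))  # ≈ (0.95, 0.82, 0.78)
```

Separating the two levels tells you which team owns a regression: a drop in the first number is a router problem, a drop in the second is a retriever problem.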

### Chapter 7: Production Considerations

Maintain improvement velocity at scale. Construction company: 78% → 84% success while scaling 5x query volume and reducing unit costs from $0.09 to $0.04 per query.

- Part 1: Understanding different content types
- Part 2: Implementation strategies
- **Topics**:
- Working with documents, images, tables, and structured data
- Metadata filtering and Text-to-SQL integration
- PDF parsing and multimodal embeddings

### Week 6: Architecture & Product Integration

- **Book Coverage**: Chapter 6 (Architecture - 3 parts)
- Part 1: Intelligent routing to specialized components
- Part 2: Building and integrating specialized tools
- Part 3: Creating unified product experiences
- **Topics**:
- Tool evaluation and selection
- Performance optimization strategies
- Streaming implementations and production deployment

### Capstone Project

A comprehensive project using the WildChat dataset that covers:
## Technologies & Tools

- Data exploration and understanding
- Vector database integration (ChromaDB, LanceDB, Turbopuffer)
- Synthetic question generation
- Summarization strategies
- Complete test suite implementation

## Technologies Used
The workshops use industry-standard tools for production RAG systems:

- **LLM APIs**: OpenAI, Anthropic, Cohere
- **Vector Databases**: LanceDB, ChromaDB, Turbopuffer
- **ML/AI Frameworks**: Sentence-transformers, BERTopic, Transformers
- **Evaluation Tools**: Braintrust, Pydantic-evals
- **Monitoring**: Logfire, production monitoring strategies
- **Data Processing**: Pandas, NumPy, BeautifulSoup, SQLModel
- **Visualization**: Matplotlib, Seaborn, Streamlit
- **CLI Framework**: Typer + Rich for interactive command-line tools
- **Document Processing**: Docling for PDF parsing and analysis
- **Frameworks**: Sentence-transformers, BERTopic, Transformers, Instructor
- **Evaluation**: Synthetic data generation, precision/recall metrics, A/B testing
- **Monitoring**: Logfire, production observability patterns
- **Processing**: Pandas, SQLModel, Docling for PDF parsing

## Course Book & Documentation
## Documentation

The `/docs` directory contains a comprehensive book built with MkDocs that serves as the primary learning resource:
The `/docs` directory contains comprehensive workshop materials built with MkDocs:

### Book Structure
### Content Overview

- **Introduction & Core Concepts**: The RAG Flywheel philosophy and product-first thinking
- **Workshop Chapters (0-6)**: Detailed guides that map directly to each course week
- **Office Hours**: Q&A summaries from Cohorts 2 and 3 with real-world implementation insights
- **Industry Talks**: Expert presentations including:
- RAG Anti-patterns in the Wild
- Semantic Search Over the Web
- Understanding Embedding Performance
- Online Evals and Production Monitoring
- RAG Without APIs (Browser-based approaches)
- **Workshop Chapters (0-7)**: Complete learning path from evaluation to production
- **Office Hours**: Q&A summaries addressing real implementation challenges
- **Industry Talks**: Expert presentations on RAG anti-patterns, embedding performance, production monitoring
- **Case Studies**: Detailed examples with specific metrics and timelines

### Key Themes in the Book
### Core Philosophy

1. **Product-First Thinking**: Treating RAG as an evolving product, not a static implementation
2. **Data-Driven Improvement**: Using metrics, evaluations, and user feedback to guide development
3. **Systematic Approach**: Moving from ad-hoc tweaking to structured improvement processes
4. **User-Centered Design**: Focusing on user value and experience, not just technical capabilities
5. **Continuous Learning**: Building systems that improve with every interaction
1. **Product mindset**: RAG as evolving product, not static implementation
2. **Data-driven improvement**: Metrics and feedback guide development
3. **Systematic approach**: Structured improvement processes over ad-hoc tweaking
4. **User-centered design**: Focus on user value, not just technical capabilities
5. **Continuous learning**: Systems that improve with every interaction

To build and view the documentation:
Build and view documentation:

```bash
# Serve documentation locally (live reload)
mkdocs serve

# Build static documentation
mkdocs build
mkdocs serve # Local development with live reload
mkdocs build # Static site generation
```

## Getting Started
@@ -235,4 +198,4 @@ This course emphasizes:

## License

This is educational material for the "Systematically Improving RAG Applications" course.
This is educational material for the "Systematically Improving RAG Applications" course.
26 changes: 26 additions & 0 deletions all_providers_test.md
@@ -0,0 +1,26 @@
# Embedding Latency Benchmark Results

**Text analyzed:** 100 samples, avg 11.8 tokens each

## Key Finding

Embedding latency dominates RAG pipeline performance:
- Database reads: 8-20ms
- Embedding generation: 100-500ms (10-25x slower!)

## Results

| Provider/Model | Batch Size | P50 (ms) | P95 (ms) | P99 (ms) | Throughput (emb/s) | Embeddings | Status |
|:------------------------------|-------------:|:-------------|:-------------|:--------------|---------------------:|-------------:|:---------|
| Cohere/embed-v4.0 | 1 | 287.4 ±110.5 | 447.8 ±6.7 | 453.2 ±1.3 | 32.1 | 100 | ✅ OK |
| Cohere/embed-v4.0 | 10 | 909.6 ±49.7 | 954.5 ±4.8 | 958.4 ±1.0 | 27.6 | 100 | ✅ OK |
| Cohere/embed-v4.0 | 25 | 187.7 ±19.3 | 580.7 ±31.5 | 621.1 ±31.5 | 3.9 | 100 | ✅ OK |
| Gemini/gemini-embedding-001 | 1 | 334.9 ±282.4 | 634.1 ±12.4 | 644.1 ±2.5 | 24.3 | 100 | ✅ OK |
| Gemini/gemini-embedding-001 | 10 | 515.2 ±145.0 | 646.7 ±13.4 | 657.4 ±2.7 | 48.9 | 100 | ✅ OK |
| Gemini/gemini-embedding-001 | 25 | 305.5 ±21.0 | 482.0 ±103.0 | 625.7 ±453.7 | 3.1 | 100 | ✅ OK |
| Openai/text-embedding-3-large | 1 | 576.1 ±81.9 | 751.9 ±40.8 | 784.5 ±8.2 | 17.4 | 100 | ✅ OK |
| Openai/text-embedding-3-large | 10 | 607.0 ±41.4 | 646.2 ±2.2 | 647.9 ±0.4 | 43.5 | 100 | ✅ OK |
| Openai/text-embedding-3-large | 25 | 337.8 ±20.2 | 476.2 ±51.9 | 563.6 ±57.4 | 2.9 | 100 | ✅ OK |
| Openai/text-embedding-3-small | 1 | 986.3 ±31.9 | 1029.1 ±5.2 | 1033.3 ±1.0 | 10.2 | 100 | ✅ OK |
| Openai/text-embedding-3-small | 10 | 1032.0 ±69.6 | 1094.2 ±7.4 | 1100.2 ±1.5 | 24.4 | 100 | ✅ OK |
| Openai/text-embedding-3-small | 25 | 244.1 ±57.9 | 909.7 ±22.3 | 1133.2 ±793.4 | 2.8 | 100 | ✅ OK |
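A measurement harness for tables like the one above can be sketched as below. This is an assumed shape, not the actual benchmark script: `embed` is a placeholder for a real provider client, and a real run would use live API calls and many more repetitions.

```python
# Hypothetical sketch of the latency measurement behind the table above:
# time repeated embedding calls and report P50/P95/P99 in milliseconds.
# `embed` is a stand-in; swap in a real provider client to benchmark it.
import statistics
import time


def embed(batch):
    time.sleep(0.005)  # placeholder for a network round-trip (~5 ms)
    return [[0.0] * 8 for _ in batch]


def benchmark(texts, batch_size, runs=20):
    latencies_ms = []
    for _ in range(runs):
        batch = texts[:batch_size]
        start = time.perf_counter()
        embed(batch)
        latencies_ms.append((time.perf_counter() - start) * 1000)
    q = statistics.quantiles(latencies_ms, n=100)  # 99 percentile cut points
    return {"p50_ms": q[49], "p95_ms": q[94], "p99_ms": q[98]}


print(benchmark(["short query text"] * 25, batch_size=10))
```

Measuring per-call wall time this way captures the whole round trip (network plus provider-side compute), which is what a RAG pipeline actually waits on at query time.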
18 changes: 18 additions & 0 deletions benchmark_results.md
@@ -0,0 +1,18 @@
# Embedding Latency Benchmark Results

**Text analyzed:** 25 samples, avg 14.1 tokens each

## Key Finding

Embedding latency dominates RAG pipeline performance:
- Database reads: 8-20ms
- Embedding generation: 100-500ms (10-25x slower!)

## Results

| Provider/Model | Batch Size | P50 (ms) | P95 (ms) | P99 (ms) | Throughput (emb/s) | Embeddings | Status |
|:------------------------------|-------------:|-----------:|-----------:|-----------:|---------------------:|-------------:|:---------|
| Openai/text-embedding-3-large | 1 | 247.8 | 315 | 329.4 | 7.5 | 25 | ✅ OK |
| Openai/text-embedding-3-large | 2 | 312.8 | 940.5 | 1042.6 | 4.5 | 25 | ✅ OK |
| Openai/text-embedding-3-small | 1 | 390.4 | 689 | 751.4 | 2.5 | 25 | ✅ OK |
| Openai/text-embedding-3-small | 2 | 225.5 | 554.8 | 589.5 | 3.5 | 25 | ✅ OK |