diff --git a/docs/workshops/chapter0.md b/docs/workshops/chapter0.md index 2ad8220..2c75044 100644 --- a/docs/workshops/chapter0.md +++ b/docs/workshops/chapter0.md @@ -242,6 +242,14 @@ They added a simple feature: when someone asks that, the AI recommends the three Consider how this played out with a legal tech company building case law search: +| Month | Focus | Overall Accuracy | Key Change | +|-------|-------|-----------------|------------| +| 1 | Baseline | 63% | Generated 200 test queries | +| 2 | Chunking | 72% | Fixed legal citation splitting | +| 3 | Deployment | 72% | Added feedback collection | +| 4-5 | Discovery | 72% | Identified 3 query patterns | +| 6 | Specialization | 87% | Built dedicated retrievers | + **Month 1 - Baseline:** Basic RAG with standard embeddings. Lawyers complained it "never found the right cases." We generated 200 test queries from their actual case law. Baseline accuracy: 63%. **Month 2 - First Iteration:** Testing different approaches revealed that legal jargon broke standard chunking. Legal citations like "42 U.S.C. § 1983" were being split across chunks, destroying meaning. Fixed the chunking strategy to respect legal citation patterns. Accuracy improved to 72%. @@ -250,9 +258,11 @@ Consider how this played out with a legal tech company building case law search: **Months 4-5 - Pattern Discovery:** After 2 months and 5,000 queries, three distinct patterns emerged: -- Case citations: 40% of queries, 91% accuracy (worked great) -- Legal definitions: 35% of queries, 78% accuracy (acceptable) -- Procedural questions: 25% of queries, 34% accuracy (total failure) +| Query Type | Volume | Accuracy | Status | +|------------|--------|----------|--------| +| Case citations | 40% | 91% | Working well | +| Legal definitions | 35% | 78% | Acceptable | +| Procedural questions | 25% | 34% | Failing | **Month 6 - Specialized Solutions:** Built dedicated retrieval strategies for each type. Case citations got exact matching on citation format. 
Definitions got a specialized glossary index. Procedural questions got a separate index built from court rules and practice guides. Overall accuracy jumped to 87%. @@ -302,22 +312,40 @@ Learn how to overcome the cold-start problem through synthetic data generation, Discover how to transform evaluation insights into concrete product improvements through fine-tuning, re-ranking, and targeted capability development. -### [Chapter 3: The User Experience of AI](chapter3-1.md) +### Chapter 3: The User Experience of AI + +Explore how to design interfaces that both delight users and gather valuable feedback, creating the virtuous cycle at the heart of the improvement flywheel. This chapter has three parts: -Explore how to design interfaces that both delight users and gather valuable feedback, creating the virtuous cycle at the heart of the improvement flywheel. +- [Chapter 3.1: Feedback Collection](chapter3-1.md) - Getting users to actually give feedback +- [Chapter 3.2: Overcoming Latency](chapter3-2.md) - Making RAG feel fast +- [Chapter 3.3: Quality of Life](chapter3-3.md) - Small changes with big impact -### [Chapter 4: Understanding Your Users](chapter4-1.md) +### Chapter 4: Understanding Your Users Learn techniques for segmenting users and queries to identify high-value opportunities and create prioritized improvement roadmaps. -### [Chapter 5: Building Specialized Capabilities](chapter5-1.md) +- [Chapter 4.1: Topic Modeling](chapter4-1.md) - Finding patterns in user data +- [Chapter 4.2: Prioritization](chapter4-2.md) - Deciding what to build next + +### Chapter 5: Building Specialized Capabilities Develop purpose-built solutions for different user needs, spanning documents, images, tables, and structured data. 
-### [Chapter 6: Unified Product Architecture](chapter6-1.md) +- [Chapter 5.1: Understanding Specialization](chapter5-1.md) - When one size does not fit all +- [Chapter 5.2: Implementation](chapter5-2.md) - Search beyond text + +### Chapter 6: Unified Product Architecture Create a cohesive product experience that intelligently routes to specialized components while maintaining a seamless user experience. +- [Chapter 6.1: Query Routing](chapter6-1.md) - Routing basics +- [Chapter 6.2: Tool Interfaces](chapter6-2.md) - Building the router +- [Chapter 6.3: Measurement](chapter6-3.md) - Measuring and improving routers + +### [Chapter 7: Production Considerations](chapter7.md) + +Keep the improvement flywheel spinning at scale. Learn cost optimization strategies, monitoring approaches that connect back to your evaluation metrics, graceful degradation patterns, and how to maintain improvement velocity as usage grows from hundreds to thousands of daily queries. + ## How You'll Know It's Working Here's what changes when you get this right: @@ -351,3 +379,8 @@ Next up: we'll dive into the first step of the flywheel—creating synthetic dat _Note: This approach has been applied across legal, finance, healthcare, and e-commerce domains. The details change, but the core flywheel stays the same: focus on users, measure what matters, and improve based on data instead of hunches._ --- + +## Navigation + +- **Next**: [Chapter 1: Starting the Flywheel](chapter1.md) - Synthetic data and evaluation +- **Reference**: [Glossary](glossary.md) | [Quick Reference](quick-reference.md) | [How to Use This Book](how-to-use.md) diff --git a/docs/workshops/chapter1.md b/docs/workshops/chapter1.md index 13cb279..ce64e91 100644 --- a/docs/workshops/chapter1.md +++ b/docs/workshops/chapter1.md @@ -13,6 +13,15 @@ tags: # Kickstarting the Data Flywheel with Synthetic Data +!!! 
abstract "Chapter at a Glance" + **Time**: 45 min reading + 2-3 hours hands-on | **Prerequisites**: Basic Python, familiarity with embeddings + + **You will learn**: How to build evaluation frameworks using synthetic data, measure retrieval with precision/recall, and avoid common pitfalls like vague metrics and intervention bias. + + **Key outcome**: An evaluation pipeline that lets you measure improvements objectively before you have real users. + + **Case studies**: Consulting firm (50% → 90% recall), Blueprint search (27% → 85% in 4 days) + ### Key Insight **You can't improve what you can't measure—and you can measure before you have users.** Synthetic data isn't just a stopgap until real users arrive. It's a powerful tool for establishing baselines, testing edge cases, and building the evaluation infrastructure that will power continuous improvement. Start with retrieval metrics (precision and recall), not generation quality, because they're faster, cheaper, and more objective. @@ -372,6 +381,13 @@ Two examples demonstrate how focusing on retrieval metrics leads to rapid improv A consulting firm generates reports from user research interviews. Consultants conduct 15-30 interviews per project and need AI-generated summaries that capture all relevant insights. +| Stage | Recall | Time | Key Change | +|-------|--------|------|------------| +| Baseline | 50% | - | Missing half of relevant quotes | +| Iteration 1 | 70% | 1 week | Identified chunking issue | +| Iteration 2 | 85% | 2 weeks | Fixed Q&A splitting | +| Iteration 3 | 90% | 3 weeks | Added chunk overlap | + **Problem**: Reports were missing critical quotes. A consultant knew 6 experts said something similar, but the report only cited 3. That 50% recall rate destroyed trust. Consultants started spending hours manually verifying reports, defeating the automation's purpose. **Investigation**: Built manual evaluation sets from problematic examples. 
The issues turned out to be surprisingly straightforward—text chunking was breaking mid-quote and splitting speaker attributions from their statements. @@ -384,6 +400,12 @@ A consulting firm generates reports from user research interviews. Consultants c A construction technology company needed AI search for building blueprints. Workers asked questions like "Which rooms have north-facing windows?" or "Show me all electrical outlet locations in the second-floor bedrooms." +| Stage | Recall | Time | Key Change | +|-------|--------|------|------------| +| Baseline | 27% | - | Text embeddings on blueprints | +| Vision captions | 85% | 4 days | Added spatial descriptions | +| Counting queries | 92% | +2 weeks | Bounding box detection | + **Problem**: Only 27% recall when finding the right blueprint sections for questions. Workers would ask simple spatial questions and get completely unrelated blueprint segments. The system was essentially useless—workers abandoned it and went back to manually scrolling through PDFs. **Investigation**: Standard text embeddings couldn't handle the spatial and visual nature of blueprint queries. "North-facing windows" and "electrical outlets" are visual concepts that don't translate well to text chunks. @@ -834,6 +856,9 @@ Example: One team found 5 irrelevant documents (even marked as "potentially less - **[Prompttools](https://github.com/promptslab/prompttools)**: Toolkit for testing and evaluating LLM applications - **[MLflow for Experiment Tracking](https://mlflow.org/)**: Open-source platform for managing ML lifecycle +!!! tip "Hands-On Practice" + For step-by-step exercises to apply these concepts, see [Exercises: Chapter 1](exercises.md#chapter-1-evaluation-foundations). + ## This Week's Action Items Based on the content covered, here are your specific tasks: @@ -894,6 +919,14 @@ Take a minute to think about: 4. What experiment could you run this week to test an improvement hypothesis? 5. 
How will you incorporate real user feedback as it comes in? +!!! example "Hands-On Practice: WildChat Case Study" + Apply these concepts with the WildChat case study, which demonstrates the evaluation framework in action: + + - **[Part 1: Data Exploration](../../latest/case_study/teaching/part01/README.md)** - Understand your dataset before building evaluations + - **[Part 2: The Alignment Problem](../../latest/case_study/teaching/part02/README.md)** - See how synthetic query generation (v1 vs v2) reveals system limitations + + The case study shows a real example of going from 12% to 62% Recall@1 by understanding what you are actually measuring. + ## Conclusion and Next Steps We've covered the foundation for systematic RAG improvement through proper evaluation. No more subjective judgments or random changes - you now have tools to measure progress objectively and make data-driven decisions. @@ -917,3 +950,9 @@ The goal isn't chasing the latest AI techniques. It's building a flywheel of con As one client told me: "We spent three months trying to improve through prompt engineering and model switching. In two weeks with proper evaluations, we made more progress than all that time combined." --- + +## Navigation + +- **Previous**: [Introduction: Beyond Implementation to Improvement](chapter0.md) - The product mindset for RAG +- **Next**: [Chapter 2: From Evaluation to Enhancement](chapter2.md) - Converting evaluations into training data +- **Reference**: [Glossary](glossary.md) | [Quick Reference](quick-reference.md) diff --git a/docs/workshops/chapter2.md b/docs/workshops/chapter2.md index a2a3051..1d55b03 100644 --- a/docs/workshops/chapter2.md +++ b/docs/workshops/chapter2.md @@ -133,7 +133,7 @@ Build a library of few-shot examples systematically: ### Structured Few-Shot Prompt Example -``` +```text You are an assistant specialized in answering questions about [domain]. 
Here are some examples of how to answer questions: @@ -244,13 +244,13 @@ For RAG applications, there are several natural ways to create triplet datasets: Imagine a healthcare RAG application where a user asks: -``` +```text What are the side effects of medication X? ``` Our retrieval system might return several documents, including: -``` +```text Document A: "Medication X may cause drowsiness, nausea, and in rare cases, allergic reactions." Document B: "Medication X is used to treat high blood pressure and should be taken with food." @@ -605,33 +605,28 @@ Linear adapters add a small trainable layer on top of frozen embeddings: - Train ## Additional Resources !!! info "Tools and Libraries" + **Understanding Embedding Models** -``` -### Understanding Embedding Models + 1. **Sentence Transformers Library** ([https://www.sbert.net/](https://www.sbert.net/)): This library provides easy-to-use implementations for state-of-the-art embedding models, supporting both pairwise datasets and triplets for fine-tuning. It's my recommended starting point for most teams due to its balance of performance and ease of use. -1. **Sentence Transformers Library** ([https://www.sbert.net/](https://www.sbert.net/)): This library provides easy-to-use implementations for state-of-the-art embedding models, supporting both pairwise datasets and triplets for fine-tuning. It's my recommended starting point for most teams due to its balance of performance and ease of use. + 2. **Modern BERT** ([https://huggingface.co/sentence-transformers](https://huggingface.co/sentence-transformers)): These newer models offer 8,000 token sequence lengths and generally outperform classic BERT-based models. The BGE models in particular have shown excellent performance across many domains and are worth testing in your applications. -2. 
**Modern BERT** ([https://huggingface.co/sentence-transformers](https://huggingface.co/sentence-transformers)): These newer models offer 8,000 token sequence lengths and generally outperform classic BERT-based models. The BGE models in particular have shown excellent performance across many domains and are worth testing in your applications. + 3. **Cohere Re-ranking Models** ([https://cohere.com/rerank](https://cohere.com/rerank)): Cohere offers state-of-the-art re-ranking capabilities with a fine-tuning API that makes it relatively easy to customize for your specific needs. In my experience, even their base re-ranker without fine-tuning often provides substantial improvements to retrieval quality. -3. **Cohere Re-ranking Models** ([https://cohere.com/rerank](https://cohere.com/rerank)): Cohere offers state-of-the-art re-ranking capabilities with a fine-tuning API that makes it relatively easy to customize for your specific needs. In my experience, even their base re-ranker without fine-tuning often provides substantial improvements to retrieval quality. + 4. **Specialized Domains**: For specific domains like code, science, or legal documents, look for models pre-trained on related corpora. For example, CodeBERT for programming or SciBERT for scientific literature can provide better starting points than general models. -4. **Specialized Domains**: For specific domains like code, science, or legal documents, look for models pre-trained on related corpora. For example, CodeBERT for programming or SciBERT for scientific literature can provide better starting points than general models. - -5. **Comparison to Data Labeling**: Everything we're doing today with fine-tuning embedding models is what I used to pay data labeling teams hundreds of thousands of dollars to do annually. The ML playbook that was once only accessible to large companies with significant budgets is now available to teams of all sizes thanks to advances in transfer learning and fine-tuning techniques. 
-``` + 5. **Comparison to Data Labeling**: Everything we're doing today with fine-tuning embedding models is what I used to pay data labeling teams hundreds of thousands of dollars to do annually. The ML playbook that was once only accessible to large companies with significant budgets is now available to teams of all sizes thanks to advances in transfer learning and fine-tuning techniques. !!! info "Key Concepts" + **Contrastive Learning In-Depth** -``` -#### Contrastive Learning In-Depth - -Contrastive learning trains models to recognize similarities and differences between items by pushing and pulling examples in the embedding space: + Contrastive learning trains models to recognize similarities and differences between items by pushing and pulling examples in the embedding space: -- **Triplet Loss**: Optimizes the distance between anchor-positive pairs relative to anchor-negative pairs -- **InfoNCE Loss**: Contrasts a positive pair against multiple negative examples -- **Multiple Negatives Ranking Loss**: Handles batches of queries with multiple negatives per query + - **Triplet Loss**: Optimizes the distance between anchor-positive pairs relative to anchor-negative pairs + - **InfoNCE Loss**: Contrasts a positive pair against multiple negative examples + - **Multiple Negatives Ranking Loss**: Handles batches of queries with multiple negatives per query -#### Scaling and Efficiency Considerations + **Scaling and Efficiency Considerations** For large datasets or production workloads: @@ -702,6 +697,14 @@ Take a minute to think about: 4. If you had to prioritize one retrieval improvement for your system, would it be embeddings, re-ranking, or something else? Why? 5. What experiments could you run to test your hypotheses about improving retrieval quality? +!!! 
example "Hands-On Practice: WildChat Case Study" + The case study demonstrates the alignment problem and how to solve it through better embeddings: + + - **[Part 2: The Alignment Problem](../../latest/case_study/teaching/part02/README.md)** - See how v1 queries achieve 62% recall while v2 queries get only 12% on the same embeddings + - **[Part 3: Solving Through Summaries](../../latest/case_study/teaching/part03/README.md)** - Learn how changing what you embed (not just how you embed) can achieve 358% improvement + + This demonstrates the core insight: alignment between queries and embeddings matters more than model sophistication. + ## Conclusion and Next Steps We covered a lot: @@ -740,4 +743,9 @@ Do these things now: If you do this right, every piece of data makes your system better. The improvements compound over time and affect everything—clustering, topic modeling, all of it. --- -``` + +## Navigation + +- **Previous**: [Chapter 1: Starting the Flywheel](chapter1.md) - Synthetic data and evaluation +- **Next**: [Chapter 3.1: Feedback Collection](chapter3-1.md) - Getting users to actually give feedback +- **Reference**: [Glossary](glossary.md) | [Quick Reference](quick-reference.md) diff --git a/docs/workshops/chapter3-1.md b/docs/workshops/chapter3-1.md index c5e0d24..9a4b398 100644 --- a/docs/workshops/chapter3-1.md +++ b/docs/workshops/chapter3-1.md @@ -1,11 +1,26 @@ --- title: "Chapter 3.1: Feedback Collection" description: Building feedback flywheels into your RAG applications -author: Jason Liu +authors: + - Jason Liu +date: 2025-03-21 +tags: + - feedback + - user-experience + - data-collection --- # Feedback Collection: Building Your Improvement Flywheel +!!! 
abstract "Chapter at a Glance" + **Time**: 30 min reading + 1-2 hours implementation | **Prerequisites**: Basic web development + + **You will learn**: How to design feedback mechanisms that collect 5x more data, mine implicit signals from user behavior, and build enterprise feedback loops with Slack integration. + + **Key outcome**: A feedback system that generates training data from every user interaction. + + **Case study**: Zapier increased feedback from 10 to 40 submissions/day with better copy and visibility. + ### Key Insight **Good copy beats good UI—changing "How did we do?" to "Did we answer your question?" increases feedback rates by 5x.** The difference between 0.1% and 0.5% feedback isn't just more data. It's the difference between flying blind and having a clear view of what's working. Design your feedback mechanisms to be specific, contextual, and integrated into the natural user flow. @@ -127,6 +142,7 @@ Best: "Did this run do what you expected it to do?" **What Actually Works:** +```text "Did we answer your question? [Yes] [Somewhat] [No]" If "Somewhat" or "No": @@ -135,14 +151,12 @@ If "Somewhat" or "No": - [ ] More detailed explanation - [ ] Different information needed - [ ] Information was wrong - -Remember: users perceive animated progress bars as **11% faster** even when wait times are identical. Good UX matters for feedback collection too. - - [ ] Better formatting -- [ ] Other: \***\*\_\_\_\_\*\*** - +- [ ] Other: _______________ ``` +Remember: users perceive animated progress bars as **11% faster** even when wait times are identical. Good UX matters for feedback collection too. + The second approach not only makes feedback impossible to miss but also structures it in a way that provides more actionable insights. Data shows that visible feedback mechanisms can increase feedback rates from less than 1% to over 30%. 
### Implementation Strategies @@ -167,8 +181,8 @@ Claude's implementation of progress counters during response generation serves m - Creates natural moments for feedback collection **Implementation Pattern:** -``` +```text Searching documents... [████░░░░░░] 40% Found 5 relevant sources Analyzing content... [████████░░] 80% @@ -177,8 +191,7 @@ Generating response... [██████████] 100% [Response appears here] Did we find the right information? [Yes] [No] - -```` +``` This pattern makes feedback feel like a natural continuation of the interaction rather than an interruption. @@ -413,6 +426,9 @@ In the next chapter, explore how to reduce perceived latency through streaming a - **[Chapter 5](chapter5-1.md)**: User behavior patterns reveal which specialized retrievers to build - **[Chapter 6](chapter6-2.md)**: Feedback on router decisions improves tool selection +!!! tip "Hands-On Practice" + For step-by-step exercises to apply these concepts, see [Exercises: Chapter 3](exercises.md#chapter-3-feedback-collection). + ## This Week's Action Items Based on the content covered, here are your specific tasks for building effective feedback collection: @@ -528,4 +544,9 @@ Effective feedback collection is essential for systematic improvement of RAG sys 1. 
GitHub Repository: [RAG-Feedback-Collection](https://github.com/microsoft/rag-feedback-collection) - Templates and examples for implementing feedback mechanisms in RAG applications --- -```` + +## Navigation + +- **Previous**: [Chapter 2: From Evaluation to Enhancement](chapter2.md) - Converting evaluations into training data +- **Next**: [Chapter 3.2: Overcoming Latency](chapter3-2.md) - Streaming and perceived performance +- **Reference**: [Glossary](glossary.md) | [Quick Reference](quick-reference.md) diff --git a/docs/workshops/chapter3-2.md b/docs/workshops/chapter3-2.md index d42f3a1..52b1339 100644 --- a/docs/workshops/chapter3-2.md +++ b/docs/workshops/chapter3-2.md @@ -1,7 +1,13 @@ --- title: "Chapter 3.2: Overcoming Latency" description: Techniques for enhancing both actual and perceived performance in RAG applications -author: Jason Liu +authors: + - Jason Liu +date: 2025-03-21 +tags: + - latency + - streaming + - user-experience --- # Overcoming Latency: Streaming and Interstitials @@ -605,3 +611,9 @@ Remember: If you only implement one improvement from this chapter, make it strea 1. GitHub Repository: [React Skeleton Screens](https://github.com/danilowoz/react-content-loader) - Open-source library for implementing skeleton screens in React applications --- + +## Navigation + +- **Previous**: [Chapter 3.1: Feedback Collection](chapter3-1.md) - Getting users to actually give feedback +- **Next**: [Chapter 3.3: Quality of Life](chapter3-3.md) - Citations, chain of thought, and validation patterns +- **Reference**: [Glossary](glossary.md) | [Quick Reference](quick-reference.md) diff --git a/docs/workshops/chapter3-3.md b/docs/workshops/chapter3-3.md index bbcf4e6..b144e36 100644 --- a/docs/workshops/chapter3-3.md +++ b/docs/workshops/chapter3-3.md @@ -679,3 +679,11 @@ This completes our exploration of deployment and feedback collection. We've now In Chapter 4, shift our focus to analyzing the wealth of data you're now collecting. 
Through topic modeling and clustering techniques, learn to identify patterns in user queries and system performance, revealing focused opportunities for improvement. This marks an exciting transition from building a great system to understanding how it's being used in the real world and systematically enhancing its capabilities based on that understanding. By implementing the techniques from all three parts of Chapter 3, you've built the foundation for a continuous improvement cycle driven by user feedback and data analysis—a system that doesn't just answer questions but gets better with every interaction. + +--- + +## Navigation + +- **Previous**: [Chapter 3.2: Overcoming Latency](chapter3-2.md) - Streaming and perceived performance +- **Next**: [Chapter 4.1: Topic Modeling](chapter4-1.md) - Finding patterns in user data +- **Reference**: [Glossary](glossary.md) | [Quick Reference](quick-reference.md) diff --git a/docs/workshops/chapter5-1.md b/docs/workshops/chapter5-1.md index 37afa7e..f517b30 100644 --- a/docs/workshops/chapter5-1.md +++ b/docs/workshops/chapter5-1.md @@ -427,5 +427,12 @@ Measuring both levels tells you where to focus your efforts. - **Business Impact**: Reduced time-to-answer for users in your target segment - **System Health**: Clear separation between routing accuracy and individual retriever performance +!!! example "Hands-On Practice: WildChat Case Study" + The case study demonstrates specialization through different embedding strategies: + + - **[Part 3: Solving Through Summaries](../../latest/case_study/teaching/part03/README.md)** - See how different summary techniques (v1 content-focused vs v4 pattern-focused) create specialized indices for different query types + + The key insight: v1 summaries excel at content queries (58% recall) while v4 summaries excel at pattern queries (42% recall). Building both and routing between them outperforms any single approach. + !!! 
tip "Next Steps" -In [Chapter 6](chapter6-1.md), explore how to bring these specialized components together through intelligent routing, creating a unified system that seamlessly directs queries to the appropriate retrievers. + In [Chapter 6](chapter6-1.md), explore how to bring these specialized components together through intelligent routing, creating a unified system that seamlessly directs queries to the appropriate retrievers. diff --git a/docs/workshops/chapter6-1.md b/docs/workshops/chapter6-1.md index 2e97049..c07f239 100644 --- a/docs/workshops/chapter6-1.md +++ b/docs/workshops/chapter6-1.md @@ -12,6 +12,15 @@ tags: # Query Routing Foundations: Building a Cohesive RAG System +!!! abstract "Chapter at a Glance" + **Time**: 30 min reading + 1-2 hours implementation | **Prerequisites**: Chapters 1-5 + + **You will learn**: How to build query routing systems that direct queries to specialized retrievers, the two-level performance formula, and team organization patterns. + + **Key outcome**: A routing architecture where P(success) = P(right retriever) × P(finding data | right retriever). + + **Case study**: Construction company improved from 65% to 78% overall success with routing. + ### Key Insight **The best retriever is multiple retrievers—success = P(selecting right retriever) × P(retriever finding data).** Query routing isn't about choosing one perfect system. It's about building a portfolio of specialized tools and letting a smart router decide. Start simple with few-shot classification, then evolve to fine-tuned models as you collect routing decisions. 
diff --git a/docs/workshops/chapter7.md b/docs/workshops/chapter7.md index b0a9569..1691954 100644 --- a/docs/workshops/chapter7.md +++ b/docs/workshops/chapter7.md @@ -267,30 +267,32 @@ Graceful degradation strategies: The construction company from previous chapters maintained improvement velocity in production: +| Metric | Month 1-2 | Month 3-6 | Month 7-12 | +|--------|-----------|-----------|------------| +| **Daily Queries** | 500 | 500 | 2,500 | +| **Routing Accuracy** | 95% | 95% | 96% | +| **Retrieval Accuracy** | 82% | 85% | 87% | +| **Overall Success** | 78% | 81% | 84% | +| **Daily Cost** | $45 | $32 | $98 | +| **Cost per Query** | $0.09 | $0.064 | $0.04 | +| **Feedback/Day** | 40 | 45 | 60 | + **Month 1-2 (Initial Deploy)**: -- Overall success: 78% (95% routing × 82% retrieval) -- Daily queries: 500 -- Cost: $45/day -- Feedback: 40 submissions/day +- Baseline established with evaluation framework from Chapter 1 +- Feedback collection from Chapter 3 generating 40 submissions daily **Month 3-6 (First Improvement Cycle)**: - Used feedback to identify schedule search issues (dates parsed incorrectly) - Fine-tuned date extraction (Chapter 2 techniques) -- Routing accuracy maintained at 95% -- Retrieval improved: 82% → 85% -- New overall success: 95% × 85% = 81% -- Cost optimization: $45/day → $32/day (prompt caching) +- Cost optimization through prompt caching: $45/day → $32/day **Month 7-12 (Sustained Improvement)**: -- Daily queries scaled to 2,500 (5x growth) +- 5x query growth while improving unit economics - Added new tool for permit search based on usage patterns - Updated routing with 60 examples per tool -- Overall success: 96% × 87% = 84% -- Cost: $98/day (linear scale with usage) -- Unit economics improved: $0.09/query → $0.04/query **Key Insight**: Production success meant maintaining the improvement flywheel while managing costs and reliability. 
The evaluation framework from Chapter 1, feedback from Chapter 3, and routing from Chapter 6 all remained active in production—continuously measuring, collecting data, and improving. @@ -394,3 +396,11 @@ For deeper dives into production topics: - [Designing Data-Intensive Applications](https://dataintensive.net/) - Scalability patterns Production readiness is an ongoing process of optimization, monitoring, and improvement - not a final destination. + +--- + +## Navigation + +- **Previous**: [Chapter 6.3: Performance Measurement](chapter6-3.md) - Measuring and improving routers +- **Start Over**: [Introduction](chapter0.md) | [How to Use This Book](how-to-use.md) +- **Reference**: [Glossary](glossary.md) | [Quick Reference](quick-reference.md) diff --git a/docs/workshops/exercises.md b/docs/workshops/exercises.md new file mode 100644 index 0000000..4b2b648 --- /dev/null +++ b/docs/workshops/exercises.md @@ -0,0 +1,541 @@ +--- +title: Hands-On Exercises +description: Practical exercises to apply workshop concepts to your own RAG system +authors: + - Jason Liu +date: 2025-04-18 +tags: + - exercises + - practice + - hands-on +--- + +# Hands-On Exercises + +These exercises help you apply workshop concepts to your own RAG system. Each exercise includes clear objectives, step-by-step instructions, and expected outcomes. + +--- + +## Chapter 1: Evaluation Foundations + +### Exercise 1.1: Build Your First Evaluation Set + +**Objective**: Create 20 evaluation examples for your RAG system. + +**Time**: 1-2 hours + +**Steps**: + +1. Select 10 representative documents from your corpus +2. For each document, write 2 questions it should answer +3. Record the expected document(s) for each question +4. 
Format as JSON: + +```json +{ + "question": "What is the refund policy for digital products?", + "expected_docs": ["policies/refunds.md"], + "difficulty": "easy", + "category": "policy" +} +``` + +**Success criteria**: You have 20 question-document pairs covering at least 3 different query types. + +--- + +### Exercise 1.2: Measure Baseline Recall + +**Objective**: Establish your current retrieval performance. + +**Time**: 30 minutes + +**Steps**: + +1. Run your 20 evaluation questions through your retrieval system +2. For each question, check if the expected document appears in top-5 results +3. Calculate Recall@5 = (questions where expected doc found) / 20 + +```python +def calculate_recall_at_k(results, k=5): + found = 0 + for item in results: + retrieved_ids = [doc['id'] for doc in item['retrieved'][:k]] + if any(expected in retrieved_ids for expected in item['expected_docs']): + found += 1 + return found / len(results) +``` + +**Success criteria**: You have a baseline Recall@5 number (e.g., "Our current system achieves 65% Recall@5"). + +--- + +### Exercise 1.3: Generate Synthetic Questions + +**Objective**: Expand your evaluation set using LLM-generated questions. + +**Time**: 1 hour + +**Steps**: + +1. Select 5 documents you haven't used yet +2. Use this prompt to generate questions: + +```text +Given this document: +[DOCUMENT TEXT] + +Generate 3 questions that this document answers: +1. A factual question about specific information +2. A question requiring inference from the content +3. A question using different terminology than the document + +For each question, explain why this document is the correct answer. +``` + +3. Validate that your system can retrieve the source document for each question +4. Add passing questions to your evaluation set + +**Success criteria**: You have 15+ additional synthetic questions with validated retrievability. 
+ +--- + +## Chapter 2: Fine-Tuning Foundations + +### Exercise 2.1: Create Training Triplets + +**Objective**: Build a dataset of (query, positive, negative) triplets for embedding fine-tuning. + +**Time**: 1-2 hours + +**Steps**: + +1. Take your evaluation questions from Exercise 1.1 +2. For each question, identify: + - **Positive**: The correct document + - **Easy negative**: A completely unrelated document + - **Hard negative**: A document that seems related but doesn't answer the question + +```python +triplet = { + "query": "What is the refund policy?", + "positive": "policies/refunds.md", + "easy_negative": "blog/company-history.md", + "hard_negative": "policies/returns.md" # Related but different +} +``` + +**Success criteria**: You have 20+ triplets with at least 10 hard negatives. + +--- + +### Exercise 2.2: Test a Re-ranker + +**Objective**: Measure the impact of adding a re-ranker to your pipeline. + +**Time**: 1 hour + +**Steps**: + +1. Install a re-ranker (Cohere, cross-encoder, etc.) +2. Modify your retrieval to: + - Retrieve top-50 documents + - Re-rank to top-10 +3. Re-run your evaluation set +4. Compare Recall@10 before and after re-ranking + +```python +from sentence_transformers import CrossEncoder + +reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2') + +def rerank(query, documents, top_k=10): + pairs = [(query, doc['text']) for doc in documents] + scores = reranker.predict(pairs) + ranked = sorted(zip(documents, scores), key=lambda x: x[1], reverse=True) + return [doc for doc, score in ranked[:top_k]] +``` + +**Success criteria**: You can quantify the re-ranker's impact (e.g., "Re-ranking improved Recall@10 from 72% to 84%"). + +--- + +## Chapter 3: Feedback Collection + +### Exercise 3.1: Audit Your Feedback Copy + +**Objective**: Identify and improve feedback request language. + +**Time**: 30 minutes + +**Steps**: + +1. Screenshot your current feedback UI +2. List all feedback-related text (buttons, prompts, follow-ups) +3. 
For each piece of text, rate it:
+   - Is it specific to the task? (not generic "How did we do?")
+   - Is it visible? (not hidden in a corner)
+   - Is it actionable? (can you improve based on responses?)
+4. Rewrite any text that scores poorly
+
+**Before/After example**:
+
+- Before: "Rate this response"
+- After: "Did this answer your question? [Yes] [Partially] [No]"
+
+**Success criteria**: All feedback copy is specific, visible, and actionable.
+
+---
+
+### Exercise 3.2: Implement Implicit Signal Tracking
+
+**Objective**: Capture user behavior signals beyond explicit feedback.
+
+**Time**: 2 hours
+
+**Steps**:
+
+1. Identify 3 implicit signals relevant to your application:
+   - Query refinement (user rephrases immediately)
+   - Citation clicks (which sources users check)
+   - Copy actions (what users copy to clipboard)
+   - Abandonment (user leaves without action)
+
+2. Add logging for each signal:
+
+```python
+from datetime import datetime
+
+def log_implicit_signal(session_id, signal_type, data):
+    event = {
+        "timestamp": datetime.now().isoformat(),
+        "session_id": session_id,
+        "signal_type": signal_type,  # "refinement", "citation_click", "copy", "abandon"
+        "data": data
+    }
+    # Store in your logging system
+```
+
+3. After 1 week, analyze the signals
+
+**Success criteria**: You're capturing at least 3 implicit signals and can query them for analysis.
+
+---
+
+## Chapter 4: Query Segmentation
+
+### Exercise 4.1: Manual Query Clustering
+
+**Objective**: Identify natural query categories in your data.
+
+**Time**: 1-2 hours
+
+**Steps**:
+
+1. Export 100 recent queries from your system
+2. Read through them and create categories as you go (open coding)
+3. Assign each query to a category
+4. Calculate the distribution
+
+```python
+categories = {
+    "product_lookup": 35,
+    "policy_questions": 28,
+    "troubleshooting": 22,
+    "comparison": 10,
+    "other": 5
+}
+```
+
+5. 
For each category, note: + - Estimated satisfaction (from feedback if available) + - Typical query patterns + - Current system performance + +**Success criteria**: You have 4-6 categories with distribution percentages and performance estimates. + +--- + +### Exercise 4.2: Build a 2x2 Prioritization Matrix + +**Objective**: Identify which query segments to improve first. + +**Time**: 30 minutes + +**Steps**: + +1. Take your categories from Exercise 4.1 +2. For each category, estimate: + - Volume (% of total queries) + - Satisfaction (% of positive feedback) +3. Plot on a 2x2 matrix: + +```text + High Volume + │ + ┌───────────────┼───────────────┐ + │ DANGER │ STRENGTH │ + │ [category] │ [category] │ +Low ─────┼───────────────┼───────────────┼───── High +Satisfaction │ Satisfaction + │ MONITOR │ OPPORTUNITY │ + │ [category] │ [category] │ + └───────────────┼───────────────┘ + │ + Low Volume +``` + +4. Identify your top priority (high volume, low satisfaction) + +**Success criteria**: You have a clear #1 priority segment with justification. + +--- + +## Chapter 5: Specialized Retrieval + +### Exercise 5.1: Identify Specialization Opportunities + +**Objective**: Determine which query types need specialized retrievers. + +**Time**: 1 hour + +**Steps**: + +1. Review your query categories from Chapter 4 +2. For each category, answer: + - Does standard semantic search work well? (>80% recall) + - Does it need exact matching? (product IDs, codes) + - Does it need structured data? (dates, numbers, comparisons) + - Does it need multimodal content? (images, tables) + +3. 
Create a specialization plan: + +| Category | Current Recall | Needs | Proposed Solution | +|----------|---------------|-------|-------------------| +| Product lookup | 45% | Exact matching | Hybrid search with SKU index | +| Policy questions | 78% | - | Keep current approach | +| Troubleshooting | 52% | Step-by-step | Structured procedure index | + +**Success criteria**: You have a prioritized list of specialization opportunities. + +--- + +### Exercise 5.2: Build a Metadata Index + +**Objective**: Create a specialized index using extracted metadata. + +**Time**: 2-3 hours + +**Steps**: + +1. Choose one document type that would benefit from metadata extraction +2. Define the metadata schema: + +```python +schema = { + "product_id": str, + "category": str, + "price_range": str, # "budget", "mid", "premium" + "features": list[str] +} +``` + +3. Extract metadata from 50 documents (manually or with LLM) +4. Create a filtered search function: + +```python +def search_with_filters(query, filters): + # First filter by metadata + candidates = filter_by_metadata(filters) + # Then semantic search within candidates + return semantic_search(query, candidates) +``` + +5. Test on relevant queries and measure improvement + +**Success criteria**: Filtered search improves recall for the target query type by 15%+. + +--- + +## Chapter 6: Query Routing + +### Exercise 6.1: Build a Simple Router + +**Objective**: Create a few-shot classifier that routes queries to tools. + +**Time**: 1-2 hours + +**Steps**: + +1. Define your tools (from Chapter 5 specialization): + +```python +tools = [ + {"name": "product_search", "description": "Find products by name, ID, or features"}, + {"name": "policy_lookup", "description": "Answer questions about policies and procedures"}, + {"name": "troubleshoot", "description": "Help diagnose and fix problems"} +] +``` + +2. 
Create 5 examples per tool: + +```python +examples = [ + {"query": "What's the SKU for the blue widget?", "tool": "product_search"}, + {"query": "Can I return an opened item?", "tool": "policy_lookup"}, + # ... more examples +] +``` + +3. Build the router: + +```python +def route_query(query, examples, tools): + prompt = f"""Given these tools: {tools} + + Examples: + {format_examples(examples)} + + Which tool should handle: "{query}"? + + Respond with just the tool name.""" + + return llm.complete(prompt) +``` + +4. Test on 20 queries and measure accuracy + +**Success criteria**: Router achieves 85%+ accuracy on test queries. + +--- + +### Exercise 6.2: Measure End-to-End Performance + +**Objective**: Calculate your system's overall success rate. + +**Time**: 1 hour + +**Steps**: + +1. Run 50 queries through your full pipeline (router + retrievers) +2. For each query, record: + - Was it routed correctly? (manual judgment) + - Did retrieval find the right document? +3. Calculate: + - Router accuracy = correct routes / total + - Retrieval accuracy (per tool) = correct retrievals / queries to that tool + - Overall = router accuracy × average retrieval accuracy + +```python +results = { + "router_accuracy": 0.92, # 46/50 correct routes + "retrieval_by_tool": { + "product_search": 0.85, + "policy_lookup": 0.78, + "troubleshoot": 0.72 + }, + "overall": 0.92 * 0.78 # = 0.72 or 72% +} +``` + +**Success criteria**: You have quantified end-to-end performance and identified the limiting factor (routing vs retrieval). + +--- + +## Chapter 7: Production Readiness + +### Exercise 7.1: Cost Analysis + +**Objective**: Calculate your per-query cost and identify optimization opportunities. + +**Time**: 1 hour + +**Steps**: + +1. List all cost components: + - Embedding API calls + - LLM generation calls + - Vector database queries + - Infrastructure (servers, storage) + +2. 
Calculate cost per query: + +```python +costs = { + "embedding": 0.0001, # $0.0001 per query embedding + "retrieval": 0.0005, # Vector DB query cost + "generation": 0.003, # LLM generation (avg tokens) + "infrastructure": 0.001 # Amortized server cost +} +total_per_query = sum(costs.values()) # $0.0046 +``` + +3. Project monthly costs at different scales: + +| Daily Queries | Monthly Cost | +|--------------|--------------| +| 100 | $14 | +| 1,000 | $140 | +| 10,000 | $1,400 | + +4. Identify top optimization opportunities + +**Success criteria**: You have a cost model and identified the top 2 cost reduction opportunities. + +--- + +### Exercise 7.2: Build a Monitoring Dashboard + +**Objective**: Create visibility into production performance. + +**Time**: 2-3 hours + +**Steps**: + +1. Define key metrics to track: + - Query volume (per hour/day) + - Latency (p50, p95, p99) + - Feedback rate (positive/negative) + - Error rate + - Cost per query + +2. Set up logging for each metric +3. Create a simple dashboard (Grafana, custom, or spreadsheet) +4. Define alert thresholds: + +```python +alerts = { + "latency_p95": {"threshold": 3000, "unit": "ms"}, + "error_rate": {"threshold": 0.05, "unit": "ratio"}, + "feedback_rate": {"threshold": 0.001, "unit": "ratio", "direction": "below"} +} +``` + +**Success criteria**: You have a dashboard showing key metrics and alerts for anomalies. + +--- + +## Capstone Exercise: Full Improvement Cycle + +**Objective**: Complete one full iteration of the RAG improvement flywheel. + +**Time**: 4-8 hours over 1-2 weeks + +**Steps**: + +1. **Measure** (Chapter 1): Establish baseline metrics +2. **Analyze** (Chapter 4): Identify lowest-performing query segment +3. **Improve** (Chapters 2, 5, 6): Implement one targeted improvement +4. **Measure Again**: Quantify the impact +5. 
**Document**: Write up what you learned + +**Deliverable**: A one-page summary including: +- Baseline metrics +- Problem identified +- Solution implemented +- Results achieved +- Next improvement planned + +**Success criteria**: You have completed one measurable improvement cycle and documented the process. + +--- + +*Return to [Workshop Index](index.md) | [How to Use This Book](how-to-use.md)* diff --git a/docs/workshops/glossary.md b/docs/workshops/glossary.md new file mode 100644 index 0000000..ca7bc85 --- /dev/null +++ b/docs/workshops/glossary.md @@ -0,0 +1,311 @@ +--- +title: Glossary +description: Key terms and concepts used throughout the RAG improvement workshops +authors: + - Jason Liu +date: 2025-04-18 +tags: + - reference + - glossary +--- + +# Glossary + +This glossary defines key terms used throughout the workshops. Terms are organized alphabetically for quick reference. + +--- + +## A + +### Absence Blindness + +The tendency to focus on what you can see (like generation quality) while ignoring what you cannot easily observe (like retrieval failures). Teams often spend weeks fine-tuning prompts without checking whether retrieval returns relevant documents in the first place. + +**Example**: A team optimizes their prompt for three weeks, only to discover their retrieval system returns completely irrelevant documents for 40% of queries. + +**See**: [Chapter 1](chapter1.md) + +--- + +## B + +### Bi-encoder + +An embedding model architecture where queries and documents are encoded independently into vectors, then compared using similarity metrics like cosine distance. Fast at query time because document embeddings can be precomputed, but less accurate than cross-encoders for ranking. + +**Contrast with**: Cross-encoder, Re-ranker + +**See**: [Chapter 2](chapter2.md) + +--- + +## C + +### Cold Start Problem + +The challenge of building and improving a RAG system before you have real user data. 
Solved through synthetic data generation—creating realistic test queries from your document corpus. + +**Example**: Generating 200 synthetic queries from legal case documents to establish baseline metrics before launching to users. + +**See**: [Chapter 1](chapter1.md) + +### Contrastive Learning + +A training approach where models learn to distinguish between similar and dissimilar examples. For embeddings, this means training on triplets of (query, positive document, negative document) so the model learns to place queries closer to relevant documents in vector space. + +**See**: [Chapter 2](chapter2.md) + +### Cross-encoder + +A model architecture that processes query and document together as a single input, producing a relevance score. More accurate than bi-encoders but much slower because it cannot precompute document representations. + +**Contrast with**: Bi-encoder + +**See**: [Chapter 2](chapter2.md) + +--- + +## D + +### Data Flywheel + +A self-reinforcing cycle where user interactions generate data that improves the system, which attracts more users, generating more data. The core concept of this workshop series. + +``` +User Interactions → Data Collection → System Improvements → Better UX → More Users → ... +``` + +**See**: [Chapter 0](chapter0.md), [Chapter 1](chapter1.md) + +--- + +## E + +### Embedding + +A dense vector representation of text (or other content) that captures semantic meaning. Similar texts have similar embeddings, enabling semantic search through vector similarity. + +**Related**: Vector database, Cosine similarity + +### Embedding Alignment + +The match between what your queries ask about and what information your embeddings capture. If you embed only the first message of conversations but search for conversation patterns, you have an alignment problem—the embeddings do not contain the information the queries seek. 
+ +**Example**: Embedding product descriptions but searching for "products similar to what I bought last month" fails because purchase history is not in the embeddings. + +**See**: [Chapter 5](chapter5-1.md) + +### Experiment Velocity + +The rate at which you can test hypotheses about your RAG system. The most important leading metric for early-stage systems. Teams that run 10 experiments per week improve faster than teams that run 1 experiment per month. + +**See**: [Chapter 1](chapter1.md) + +--- + +## F + +### Few-shot Learning + +Providing examples in the prompt to guide model behavior. For routing, 10 examples might achieve 88% accuracy while 40 examples reach 95%. + +**See**: [Chapter 6](chapter6-2.md) + +--- + +## H + +### Hard Negative + +A document that appears relevant based on surface features (keywords, topic) but is actually not helpful for answering a specific query. Hard negatives are the most valuable training examples for improving retrieval because they teach the model subtle distinctions. + +**Example**: For the query "Python memory management," a document about "Python snake habitats" is an easy negative (obviously wrong). A document about "Python garbage collection in version 2.7" when the user needs Python 3.11 information is a hard negative (seems relevant but is not). + +**Contrast with**: Easy negative (completely unrelated documents) + +**See**: [Chapter 2](chapter2.md), [Chapter 3](chapter3-1.md) + +### Hybrid Search + +Combining lexical search (keyword matching) with semantic search (embedding similarity). Often outperforms either approach alone because lexical search handles exact matches and rare terms while semantic search handles paraphrasing and conceptual similarity. + +**See**: [Chapter 1](chapter1.md) + +--- + +## I + +### Implicit Feedback + +Signals about user satisfaction derived from behavior rather than explicit ratings. 
Includes query refinements (user rephrases immediately), abandonment, dwell time, citation clicks, and copy actions. + +**Contrast with**: Explicit feedback (thumbs up/down, ratings) + +**See**: [Chapter 3](chapter3-1.md) + +### Intervention Bias + +The tendency to make changes just to feel like progress is being made, without measuring impact. Manifests as constantly switching models, tweaking prompts, or adding features without clear hypotheses. + +**See**: [Chapter 1](chapter1.md) + +### Inventory Problem + +When a RAG system fails because the answer does not exist in the knowledge base—not because retrieval failed. No amount of better embeddings or re-ranking can fix missing data. + +**Contrast with**: Capabilities problem (answer exists but system cannot find it) + +**See**: [Chapter 0](chapter0.md) + +--- + +## L + +### Lagging Metric + +An outcome metric you care about but cannot directly control: user satisfaction, churn rate, revenue. Like body weight—easy to measure, hard to change directly. + +**Contrast with**: Leading metric + +**See**: [Chapter 1](chapter1.md) + +### Leading Metric + +An actionable metric that predicts future performance and that you can directly influence: experiment velocity, evaluation coverage, feedback collection rate. Like calories consumed—you have direct control. + +**Contrast with**: Lagging metric + +**See**: [Chapter 1](chapter1.md) + +--- + +## P + +### Precision + +Of the documents you retrieved, what percentage were actually relevant? If you returned 10 documents but only 2 were relevant, precision is 20%. + +**Formula**: Precision = (Relevant ∩ Retrieved) / Retrieved + +**Contrast with**: Recall + +**See**: [Chapter 1](chapter1.md) + +### Precision@K + +Precision calculated for the top K results. Precision@5 means: of the top 5 documents returned, how many were relevant? + +--- + +## Q + +### Query Routing + +Directing user queries to the appropriate specialized retriever or tool based on query characteristics. 
A router that achieves 95% accuracy with retrievers at 82% accuracy yields 78% end-to-end success (0.95 × 0.82). + +**See**: [Chapter 6](chapter6-1.md) + +--- + +## R + +### RAG (Retrieval-Augmented Generation) + +A pattern where relevant documents are retrieved from a knowledge base and provided as context to a language model for generating responses. Combines the knowledge storage of search systems with the language capabilities of LLMs. + +### RAPTOR + +Recursive Abstractive Processing for Tree-Organized Retrieval. A technique for handling long documents by creating hierarchical summaries—summaries of summaries—enabling retrieval at different levels of abstraction. + +**See**: [Chapter 5](chapter5-2.md) + +### Recall + +Of all the relevant documents that exist, what percentage did you find? If there are 10 relevant documents and you found 4, recall is 40%. + +**Formula**: Recall = (Relevant ∩ Retrieved) / Relevant + +**Contrast with**: Precision + +**See**: [Chapter 1](chapter1.md) + +### Recall@K + +Recall calculated when retrieving K documents. Recall@10 means: if you retrieve 10 documents, what percentage of all relevant documents did you find? + +### Re-ranker + +A model that re-scores retrieved documents to improve ranking. Typically a cross-encoder that is more accurate but slower than the initial bi-encoder retrieval. Applied to top-N results (e.g., retrieve 50, re-rank to top 10). + +**Typical improvement**: 12-20% at top-5 + +**See**: [Chapter 2](chapter2.md) + +--- + +## S + +### Semantic Cache + +A cache that returns stored responses for queries that are semantically similar (not just identical) to previous queries. Requires setting a similarity threshold (e.g., 0.95 cosine similarity). + +**See**: [Chapter 7](chapter7.md) + +### Synthetic Data + +Artificially generated evaluation data, typically created by having an LLM generate questions that a document chunk should answer. 
Used to overcome the cold start problem and establish baselines before real user data exists. + +**See**: [Chapter 1](chapter1.md) + +--- + +## T + +### Trellis Framework + +A framework for organizing production monitoring of AI systems: (1) Discretize infinite outputs into specific buckets, (2) Prioritize by Volume × Negative Sentiment × Achievable Delta × Strategic Relevance, (3) Recursively refine within buckets. + +**See**: [Chapter 1](chapter1.md) + +### Two-Level Performance Formula + +For systems with routing to specialized retrievers, overall success = P(correct router) × P(correct retrieval | correct router). A 95% router with 82% retrieval yields 78% overall, while a 67% router with 80% retrieval yields only 54%. + +**See**: [Chapter 6](chapter6-1.md) + +--- + +## V + +### Vector Database + +A database optimized for storing and querying high-dimensional vectors (embeddings). Supports approximate nearest neighbor search to find similar vectors efficiently. + +**Examples**: Pinecone, ChromaDB, pgvector, LanceDB, Weaviate + +**See**: [Chapter 1](chapter1.md) + +--- + +## W + +### Write-time vs Read-time Computation + +A fundamental architectural trade-off. Write-time computation (preprocessing) increases storage costs but improves query latency. Read-time computation (on-demand) reduces storage but increases latency. Choose based on content stability and latency requirements. 
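
A back-of-the-envelope sketch can make the trade-off concrete. The function and numbers below are illustrative assumptions, not figures from the workshops: precompute at write time when the daily read-time savings across queries outweigh the daily cost of recomputing after content updates.

```python
def prefer_write_time(queries_per_day, updates_per_day,
                      precompute_cost, read_cost_saved):
    # Precomputing pays off when the read-time cost saved across all daily
    # queries exceeds the cost of recomputing after each content update.
    daily_savings = queries_per_day * read_cost_saved
    daily_recompute = updates_per_day * precompute_cost
    return daily_savings > daily_recompute

# Stable content, heavy traffic: precompute at write time.
stable_heavy = prefer_write_time(10_000, 1, precompute_cost=0.50, read_cost_saved=0.001)   # True
# Fast-changing content, light traffic: compute at read time.
volatile_light = prefer_write_time(100, 500, precompute_cost=0.50, read_cost_saved=0.001)  # False
```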
+ +**See**: [Chapter 7](chapter7.md) + +--- + +## Quick Reference: Key Formulas + +| Metric | Formula | Use Case | +|--------|---------|----------| +| Precision@K | Relevant in top K / K | Measuring result quality | +| Recall@K | Relevant in top K / Total relevant | Measuring coverage | +| End-to-end success | P(router) × P(retrieval) | System performance | +| Prioritization score | Volume × (1 - Satisfaction) × Delta × Relevance | Roadmap planning | + +--- + +*Return to [Workshop Index](index.md)* diff --git a/docs/workshops/how-to-use.md b/docs/workshops/how-to-use.md new file mode 100644 index 0000000..eaa49f3 --- /dev/null +++ b/docs/workshops/how-to-use.md @@ -0,0 +1,193 @@ +--- +title: How to Use This Book +description: Reading paths, prerequisites, and guidance for getting the most from these workshops +authors: + - Jason Liu +date: 2025-04-18 +tags: + - guide + - getting-started +--- + +# How to Use This Book + +This guide helps you navigate the workshops based on your goals, experience level, and available time. + +--- + +## Three Reading Paths + +### Path 1: The Full Journey (Recommended) + +**Time**: 8-12 hours of reading + 10-20 hours of hands-on practice + +**For**: Teams building new RAG systems or significantly improving existing ones + +Read chapters in order from Introduction through Chapter 7. Each chapter builds on the previous one, and the concepts compound. The construction company case study threads through multiple chapters, showing how the same system evolves. + +``` +Introduction → Ch 1 → Ch 2 → Ch 3.1 → Ch 3.2 → Ch 3.3 → Ch 4.1 → Ch 4.2 → Ch 5.1 → Ch 5.2 → Ch 6.1 → Ch 6.2 → Ch 6.3 → Ch 7 +``` + +### Path 2: Quick Wins First + +**Time**: 3-4 hours of reading + 5-10 hours of implementation + +**For**: Teams with existing RAG systems that need immediate improvements + +Start with the chapters that typically deliver the fastest results: + +1. **[Chapter 1](chapter1.md)**: Set up evaluation (you cannot improve what you cannot measure) +2. 
**[Chapter 3.1](chapter3-1.md)**: Fix feedback collection (often 5x improvement with copy changes) +3. **[Chapter 2](chapter2.md)**: Add re-ranking (12-20% retrieval improvement) +4. **[Chapter 4.1](chapter4-1.md)**: Identify your worst-performing query segments + +Then return to fill gaps as needed. + +### Path 3: Reference Mode + +**Time**: As needed + +**For**: Experienced practitioners looking for specific techniques + +Jump directly to what you need: + +- **Evaluation setup**: [Chapter 1](chapter1.md) +- **Fine-tuning embeddings**: [Chapter 2](chapter2.md) +- **Feedback collection**: [Chapter 3.1](chapter3-1.md) +- **Streaming/latency**: [Chapter 3.2](chapter3-2.md) +- **Query clustering**: [Chapter 4.1](chapter4-1.md) +- **Prioritization**: [Chapter 4.2](chapter4-2.md) +- **Multimodal retrieval**: [Chapter 5.2](chapter5-2.md) +- **Query routing**: [Chapter 6.1](chapter6-1.md), [Chapter 6.2](chapter6-2.md) +- **Production operations**: [Chapter 7](chapter7.md) + +Use the [Glossary](glossary.md) for term definitions and [Quick Reference](quick-reference.md) for formulas and decision trees. 
+ +--- + +## Prerequisites by Chapter + +| Chapter | What You Should Know | +|---------|---------------------| +| **Introduction** | What RAG is at a high level | +| **Chapter 1** | Basic Python, familiarity with embeddings | +| **Chapter 2** | Chapter 1 concepts, basic ML training concepts | +| **Chapter 3.1-3.3** | Web development basics (for UI patterns) | +| **Chapter 4.1-4.2** | Chapter 1 concepts, basic statistics | +| **Chapter 5.1-5.2** | Chapters 1-2, understanding of different data types | +| **Chapter 6.1-6.3** | Chapters 1-5, API design concepts | +| **Chapter 7** | All previous chapters, basic DevOps/infrastructure | + +--- + +## Time Estimates + +| Chapter | Reading | Hands-on Practice | +|---------|---------|-------------------| +| Introduction | 30 min | - | +| Chapter 1 | 45 min | 2-3 hours | +| Chapter 2 | 45 min | 3-4 hours | +| Chapter 3.1 | 30 min | 1-2 hours | +| Chapter 3.2 | 30 min | 2-3 hours | +| Chapter 3.3 | 30 min | 1-2 hours | +| Chapter 4.1 | 45 min | 2-3 hours | +| Chapter 4.2 | 30 min | 1-2 hours | +| Chapter 5.1 | 30 min | 1-2 hours | +| Chapter 5.2 | 45 min | 3-4 hours | +| Chapter 6.1 | 30 min | 1-2 hours | +| Chapter 6.2 | 45 min | 2-3 hours | +| Chapter 6.3 | 30 min | 1-2 hours | +| Chapter 7 | 45 min | 2-3 hours | +| **Total** | **~8 hours** | **~25 hours** | + +--- + +## What You Will Build + +By the end of the full journey, you will have: + +1. **An evaluation framework** with synthetic data and retrieval metrics +2. **A feedback collection system** that gathers 5x more data than typical implementations +3. **Fine-tuned embeddings or re-rankers** tailored to your domain +4. **Query segmentation** showing which user needs are underserved +5. **Specialized retrievers** for different content types +6. **A routing layer** that directs queries to the right tools +7. 
**Production monitoring** that catches degradation before users notice + +--- + +## Hands-On Practice + +Each chapter includes: + +- **Action Items**: Specific tasks to implement that week +- **Reflection Questions**: Prompts to apply concepts to your system +- **Code Examples**: Patterns you can adapt + +For deeper hands-on practice, the [WildChat Case Study](../../latest/case_study/README.md) walks through a complete RAG improvement cycle with real data: + +| Case Study Part | Related Workshop Chapter | +|-----------------|-------------------------| +| Part 1: Data Exploration | Chapter 1 | +| Part 2: The Alignment Problem | Chapter 2, Chapter 5 | +| Part 3: Solving with Summaries | Chapter 5 | +| Part 4: Advanced Techniques | Chapter 2, Chapter 6 | + +--- + +## Common Questions + +### "I already have a RAG system. Where do I start?" + +Start with [Chapter 1](chapter1.md) to establish evaluation metrics. You cannot improve what you cannot measure. Even if your system is "working," you need baselines to know if changes help or hurt. + +### "I do not have any users yet. Is this relevant?" + +Yes. [Chapter 1](chapter1.md) specifically addresses the cold-start problem using synthetic data. You can build evaluation infrastructure and test improvements before launch. + +### "My team is skeptical about investing time in evaluation." + +Show them the $100M company example from [Chapter 1](chapter1.md)—companies with massive valuations operating with fewer than 30 evaluations. Then show the construction company case study: systematic evaluation led to 27% → 85% recall improvement in four days. + +### "We are using [specific vector database/LLM]. Does this apply?" + +Yes. The concepts are tool-agnostic. Specific code examples use common tools (OpenAI, LanceDB, ChromaDB), but the frameworks apply regardless of your stack. + +### "How do I convince my manager to let me work on this?" 
+ +Frame it in business terms: +- Evaluation prevents shipping regressions (risk reduction) +- Feedback collection generates training data (asset building) +- Query segmentation reveals product opportunities (revenue potential) +- The construction company reduced unit costs from $0.09 to $0.04 per query (cost savings) + +--- + +## Getting Help + +- **Glossary**: [Key terms and definitions](glossary.md) +- **Quick Reference**: [Formulas and decision trees](quick-reference.md) +- **Chapter Index**: [Full workshop listing](index.md) + +--- + +## Suggested Weekly Schedule + +For teams working through the material together: + +| Week | Focus | Chapters | +|------|-------|----------| +| 1 | Foundations | Introduction, Chapter 1 | +| 2 | Improvement Techniques | Chapter 2 | +| 3 | User Experience | Chapters 3.1, 3.2, 3.3 | +| 4 | User Understanding | Chapters 4.1, 4.2 | +| 5 | Specialized Retrieval | Chapters 5.1, 5.2 | +| 6 | System Architecture | Chapters 6.1, 6.2, 6.3 | +| 7 | Production | Chapter 7 | + +Each week: Read the chapters, implement the action items, discuss reflection questions as a team. + +--- + +*Ready to start? Begin with the [Introduction](chapter0.md) or jump to [Chapter 1](chapter1.md) if you are already familiar with the product mindset.* diff --git a/docs/workshops/index.md b/docs/workshops/index.md index aa96f4d..5b633e2 100644 --- a/docs/workshops/index.md +++ b/docs/workshops/index.md @@ -85,11 +85,14 @@ The progression: 4. **Learn from Users** (Ch 4): Find patterns, pick what to build 5. **Go Deep** (Ch 5): Build specialized tools that excel 6. **Tie It Together** (Ch 6): Make everything work as one system +7. **Ship It** (Ch 7): Run reliably in production at scale ## Prerequisites You should know what RAG is and have at least played with it. If you're totally new, start with the [Introduction](chapter0.md). +For guidance on different reading paths and time estimates, see [How to Use This Book](how-to-use.md). 
+ ## What You'll Have When Done A RAG system that: @@ -100,3 +103,9 @@ A RAG system that: - Makes improvement decisions based on data - Handles edge cases gracefully - Works in production, not just demos + +## Reference Materials + +- **[Glossary](glossary.md)** - Definitions of key terms like hard negatives, recall@K, and the data flywheel +- **[Quick Reference](quick-reference.md)** - Formulas, decision trees, and checklists for quick lookup +- **[How to Use This Book](how-to-use.md)** - Reading paths, prerequisites, and time estimates diff --git a/docs/workshops/quick-reference.md b/docs/workshops/quick-reference.md new file mode 100644 index 0000000..58cc0f0 --- /dev/null +++ b/docs/workshops/quick-reference.md @@ -0,0 +1,261 @@ +--- +title: Quick Reference +description: One-page reference for key metrics, formulas, and decision frameworks +authors: + - Jason Liu +date: 2025-04-18 +tags: + - reference + - cheatsheet +--- + +# Quick Reference + +A condensed reference for the key concepts, metrics, and decision frameworks from the workshops. + +--- + +## Core Metrics + +### Retrieval Metrics + +| Metric | Formula | What It Tells You | +|--------|---------|-------------------| +| **Precision@K** | Relevant in top K ÷ K | Are your results relevant? | +| **Recall@K** | Relevant in top K ÷ Total relevant | Are you finding everything? | +| **MRR** | 1 ÷ Rank of first relevant | How quickly do you find something useful? | + +**Rule of thumb**: With modern LLMs, prioritize recall over precision. They handle irrelevant context well. + +### System Performance + +| Metric | Formula | Target | +|--------|---------|--------| +| **End-to-end success** | P(router correct) × P(retrieval correct) | 75%+ | +| **Feedback rate** | Feedback submissions ÷ Total queries | 0.5%+ (5x better than typical) | +| **Experiment velocity** | Experiments run per week | 5-10 for early systems | + +--- + +## Decision Frameworks + +### Is It an Inventory Problem or Capabilities Problem? 
+
+```
+Can a human expert find the answer by manually searching?
+  │
+  ├── NO → Inventory Problem
+  │        Fix: Add missing content
+  │
+  └── YES → Capabilities Problem
+            Fix: Improve retrieval/routing
+```
+
+### Should You Fine-tune or Use a Re-ranker?
+
+```
+Do you have 5,000+ labeled examples?
+  │
+  ├── NO → Use re-ranker (12-20% improvement, no training needed)
+  │
+  └── YES → Do you have hard negatives?
+              │
+              ├── NO → Mine hard negatives first, then fine-tune
+              │
+              └── YES → Fine-tune embeddings (6-10% improvement)
+```
+
+### Write-time vs Read-time Computation
+
+| Factor | Write-time (Preprocess) | Read-time (On-demand) |
+|--------|------------------------|----------------------|
+| Content changes | Rarely | Frequently |
+| Latency requirements | Strict (<100ms) | Flexible (1-2s OK) |
+| Storage budget | Available | Constrained |
+| Query patterns | Predictable | Unpredictable |
+
+---
+
+## Cost Estimation
+
+### Quick Cost Formula
+
+```
+Monthly cost =
+    (Documents × Tokens/doc × Embedding cost)         # One-time
+  + (Queries/day × 30 × Input tokens × Input cost)    # Recurring
+  + (Queries/day × 30 × Output tokens × Output cost)  # Recurring
+  + Infrastructure                                    # Fixed
+```
+
+### Typical Cost Breakdown
+
+- Embedding generation: 5-10%
+- Retrieval infrastructure: 10-20%
+- LLM generation: 60-75%
+- Logging/monitoring: 5-10%
+
+### Cost Reduction Levers
+
+| Technique | Typical Savings | Complexity |
+|-----------|----------------|------------|
+| Prompt caching | 70-90% on repeat queries | Low |
+| Semantic caching | 20-30% | Medium |
+| Self-hosted embeddings | 50-80% on embedding costs | High |
+| Smaller context windows | 30-50% on generation | Low |
+
+---
+
+## Prioritization Matrix
+
+### The 2x2 for Query Segments
+
+```
+                      High Volume
+                           │
+           ┌───────────────┼───────────────┐
+           │    DANGER     │   STRENGTH    │
+           │   Fix first   │   Maintain    │
+           │               │               │
+Low ───────┼───────────────┼───────────────┼─────── High
+Satisfaction               │               Satisfaction
+           │               │               │
+           │    MONITOR    │  OPPORTUNITY  │
+           │  Low priority │    Expand     │
+           │               │               │
+           └───────────────┼───────────────┘
+                           │
+                       Low Volume
+```
+
+### Prioritization Score
+
+```
+Score = Volume% × (1 - Satisfaction%) × Achievable Delta × Strategic Relevance
+```
+
+**Example**: Scheduling queries are 8% of volume, 25% satisfaction, 50% achievable improvement, high strategic relevance → High priority fix
+
+---
+
+## Feedback Copy That Works
+
+### Do Use
+
+- "Did we answer your question?" (5x better than generic)
+- "Did this run do what you expected?"
+- "Was this information helpful for your task?"
+
+### Do Not Use
+
+- "How did we do?" (too vague)
+- "Rate your experience" (users think you mean UI)
+- "Was this helpful?" (without context)
+
+### After Negative Feedback
+
+Ask specific follow-up:
+
+- "Was the information wrong?"
+- "Was something missing?"
+- "Was it hard to understand?"
+
+---
+
+## Chunking Defaults
+
+| Content Type | Chunk Size | Overlap | Notes |
+|--------------|-----------|---------|-------|
+| General text | 800 tokens | 50% | Good starting point |
+| Legal/regulatory | 1500-2000 tokens | 30% | Preserve full clauses |
+| Technical docs | 400-600 tokens | 40% | Precise retrieval |
+| Conversations | Page-level | Minimal | Maintain context |
+
+**Warning**: Chunk optimization rarely gives >10% improvement. Focus on query understanding and metadata filtering first.
+
+---
+
+## Vector Database Selection
+
+```
+Do you have existing PostgreSQL expertise?
+  │
+  ├── YES → Is your dataset < 1M vectors?
+  │           │
+  │           ├── YES → pgvector
+  │           └── NO → pgvectorscale or migrate
+  │
+  └── NO → Do you want managed infrastructure?
+             │
+             ├── YES → Pinecone
+             │
+             └── NO → Want hybrid search experiments?
+                        │
+                        ├── YES → LanceDB
+                        └── NO → ChromaDB (prototypes) or Turbopuffer (performance)
+```
+
+---
+
+## Routing Performance
+
+### Few-shot Examples Impact
+
+| Examples | Typical Accuracy |
+|----------|-----------------|
+| 5 | 75-80% |
+| 10 | 85-88% |
+| 20 | 90-92% |
+| 40 | 94-96% |
+
+### End-to-end Impact
+
+| Router Accuracy | Retrieval Accuracy | Overall Success |
+|-----------------|-------------------|-----------------|
+| 67% | 80% | 54% |
+| 85% | 80% | 68% |
+| 95% | 82% | 78% |
+| 98% | 85% | 83% |
+
+---
+
+## Production Checklist
+
+### Before Launch
+
+- [ ] Baseline metrics established (Recall@5, Precision@5)
+- [ ] 50+ evaluation examples covering main query types
+- [ ] Feedback mechanism visible and specific
+- [ ] Error handling and fallbacks implemented
+- [ ] Cost monitoring in place
+
+### Weekly Review
+
+- [ ] Check retrieval metrics for degradation
+- [ ] Review negative feedback submissions
+- [ ] Analyze new query patterns
+- [ ] Run at least 2 experiments
+- [ ] Update evaluation set with edge cases
+
+### Monthly Review
+
+- [ ] Cost trend analysis
+- [ ] Query segment performance comparison
+- [ ] Model/embedding update evaluation
+- [ ] Roadmap prioritization refresh
+
+---
+
+## Key Numbers to Remember
+
+| Metric | Typical | Good | Excellent |
+|--------|---------|------|-----------|
+| Feedback rate | 0.1% | 0.5% | 2%+ |
+| Recall@10 | 50% | 75% | 90%+ |
+| Router accuracy | 70% | 90% | 95%+ |
+| Re-ranker improvement | 5% | 12% | 20%+ |
+| Fine-tuning improvement | 3% | 6% | 10%+ |
+| Hard negative boost | 6% | 15% | 30%+ |
+
+---
+
+*Return to [Workshop Index](index.md) | See [Glossary](glossary.md) for term definitions*
diff --git a/mkdocs.yml b/mkdocs.yml
index 5549e82..81df022 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -39,15 +39,16 @@ nav:
   - "Home": "index.md"
   - "Workshops":
       - "Overview": "workshops/index.md"
+      - "How to Use This Book": "workshops/how-to-use.md"
       - "Introduction": "workshops/chapter0.md"
       - "Chapter 1: Starting the Flywheel":
           - "Overview": "workshops/chapter1.md"
       - "Chapter 2: From Evaluation to Enhancement":
           - "Overview": "workshops/chapter2.md"
       - "Chapter 3: User Experience":
-          - "Design Principles": "workshops/chapter3-1.md"
-          - "Feedback Collection": "workshops/chapter3-2.md"
-          - "Iterative Improvement": "workshops/chapter3-3.md"
+          - "Feedback Collection": "workshops/chapter3-1.md"
+          - "Overcoming Latency": "workshops/chapter3-2.md"
+          - "Quality of Life": "workshops/chapter3-3.md"
       - "Chapter 4: Topic Modeling":
           - "Analysis": "workshops/chapter4-1.md"
           - "Prioritization": "workshops/chapter4-2.md"
@@ -58,6 +59,12 @@
       - "Routing": "workshops/chapter6-1.md"
       - "Tools": "workshops/chapter6-2.md"
       - "Improvement": "workshops/chapter6-3.md"
+      - "Chapter 7: Production":
+          - "Overview": "workshops/chapter7.md"
+      - "Reference":
+          - "Glossary": "workshops/glossary.md"
+          - "Quick Reference": "workshops/quick-reference.md"
+          - "Exercises": "workshops/exercises.md"
   - "Office Hours":
       - "office-hours/index.md"
       - "FAQ": "office-hours/faq.md"