Merged
49 changes: 41 additions & 8 deletions docs/workshops/chapter0.md
@@ -242,6 +242,14 @@ They added a simple feature: when someone asks that, the AI recommends the three

Consider how this played out with a legal tech company building case law search:

| Month | Focus | Overall Accuracy | Key Change |
|-------|-------|-----------------|------------|
| 1 | Baseline | 63% | Generated 200 test queries |
| 2 | Chunking | 72% | Fixed legal citation splitting |
| 3 | Deployment | 72% | Added feedback collection |
| 4-5 | Discovery | 72% | Identified 3 query patterns |
| 6 | Specialization | 87% | Built dedicated retrievers |

**Month 1 - Baseline:** Basic RAG with standard embeddings. Lawyers complained it "never found the right cases." We generated 200 test queries from their actual case law. Baseline accuracy: 63%.

**Month 2 - First Iteration:** Testing different approaches revealed that legal jargon broke standard chunking. Legal citations like "42 U.S.C. § 1983" were being split across chunks, destroying meaning. Fixed the chunking strategy to respect legal citation patterns. Accuracy improved to 72%.
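The fix can be sketched as a boundary guard that refuses to split inside a citation. This is a minimal illustration, not the team's production code: the regex covers only U.S.C.-style citations, and `chunk_size` is an arbitrary assumption.

```python
import re

# Matches statutory citations such as "42 U.S.C. § 1983" so a chunk
# boundary never lands in the middle of one. (Illustrative pattern:
# real legal corpora need a broader citation grammar.)
CITATION = re.compile(r"\d+\s+U\.S\.C\.\s+§\s*\d+[a-z]?")

def citation_safe_chunks(text: str, chunk_size: int = 500) -> list[str]:
    """Greedy chunker that pushes a boundary past any citation it would split."""
    spans = [m.span() for m in CITATION.finditer(text)]
    chunks, start = [], 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        for s, e in spans:
            if s < end < e:  # proposed boundary falls inside a citation
                end = e      # extend the chunk to keep the citation whole
                break
        chunks.append(text[start:end])
        start = end
    return chunks
```

The same idea generalizes to any domain token that must not be split: detect the spans first, then let the chunker treat them as atomic.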
@@ -250,9 +258,11 @@ Consider how this played out with a legal tech company building case law search:

**Months 4-5 - Pattern Discovery:** After 2 months and 5,000 queries, three distinct patterns emerged:

| Query Type | Volume | Accuracy | Status |
|------------|--------|----------|--------|
| Case citations | 40% | 91% | Working well |
| Legal definitions | 35% | 78% | Acceptable |
| Procedural questions | 25% | 34% | Failing |

**Month 6 - Specialized Solutions:** Built dedicated retrieval strategies for each type. Case citations got exact matching on citation format. Definitions got a specialized glossary index. Procedural questions got a separate index built from court rules and practice guides. Overall accuracy jumped to 87%.
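Per-type routing like this can be sketched in a few lines. The classifier heuristics and index names below are illustrative assumptions, not the company's actual implementation:

```python
import re

# Crude illustrative rules for the three observed query patterns.
CITATION = re.compile(r"\d+\s+U\.S\.C\.\s+§")

def classify(query: str) -> str:
    """Assign a query to one of the three discovered patterns."""
    if CITATION.search(query):
        return "citation"
    if query.lower().startswith(("what is", "define", "meaning of")):
        return "definition"
    return "procedural"

def route(query: str, indexes: dict) -> list:
    """Dispatch each query type to its dedicated retriever, e.g.
    exact citation matching, a glossary index, or a court-rules index."""
    return indexes[classify(query)].search(query)
```

In practice the classifier usually graduates from rules like these to a small trained model once you have labeled query logs, but the routing shape stays the same.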

@@ -302,22 +312,40 @@ Learn how to overcome the cold-start problem through synthetic data generation,

Discover how to transform evaluation insights into concrete product improvements through fine-tuning, re-ranking, and targeted capability development.

### Chapter 3: The User Experience of AI

Explore how to design interfaces that both delight users and gather valuable feedback, creating the virtuous cycle at the heart of the improvement flywheel. This chapter has three parts:

- [Chapter 3.1: Feedback Collection](chapter3-1.md) - Getting users to actually give feedback
- [Chapter 3.2: Overcoming Latency](chapter3-2.md) - Making RAG feel fast
- [Chapter 3.3: Quality of Life](chapter3-3.md) - Small changes with big impact

### Chapter 4: Understanding Your Users

Learn techniques for segmenting users and queries to identify high-value opportunities and create prioritized improvement roadmaps.

- [Chapter 4.1: Topic Modeling](chapter4-1.md) - Finding patterns in user data
- [Chapter 4.2: Prioritization](chapter4-2.md) - Deciding what to build next

### Chapter 5: Building Specialized Capabilities

Develop purpose-built solutions for different user needs, spanning documents, images, tables, and structured data.

- [Chapter 5.1: Understanding Specialization](chapter5-1.md) - When one size does not fit all
- [Chapter 5.2: Implementation](chapter5-2.md) - Search beyond text

### Chapter 6: Unified Product Architecture

Create a cohesive product experience that intelligently routes to specialized components while maintaining a seamless user experience.

- [Chapter 6.1: Query Routing](chapter6-1.md) - Routing basics
- [Chapter 6.2: Tool Interfaces](chapter6-2.md) - Building the router
- [Chapter 6.3: Measurement](chapter6-3.md) - Measuring and improving routers

### [Chapter 7: Production Considerations](chapter7.md)

Keep the improvement flywheel spinning at scale. Learn cost optimization strategies, monitoring approaches that connect back to your evaluation metrics, graceful degradation patterns, and how to maintain improvement velocity as usage grows from hundreds to thousands of daily queries.

## How You'll Know It's Working

Here's what changes when you get this right:
@@ -351,3 +379,8 @@ Next up: we'll dive into the first step of the flywheel—creating synthetic dat
_Note: This approach has been applied across legal, finance, healthcare, and e-commerce domains. The details change, but the core flywheel stays the same: focus on users, measure what matters, and improve based on data instead of hunches._

---

## Navigation

- **Next**: [Chapter 1: Starting the Flywheel](chapter1.md) - Synthetic data and evaluation
- **Reference**: [Glossary](glossary.md) | [Quick Reference](quick-reference.md) | [How to Use This Book](how-to-use.md)
39 changes: 39 additions & 0 deletions docs/workshops/chapter1.md
@@ -13,6 +13,15 @@ tags:

# Kickstarting the Data Flywheel with Synthetic Data

!!! abstract "Chapter at a Glance"
**Time**: 45 min reading + 2-3 hours hands-on | **Prerequisites**: Basic Python, familiarity with embeddings

**You will learn**: How to build evaluation frameworks using synthetic data, measure retrieval with precision/recall, and avoid common pitfalls like vague metrics and intervention bias.

**Key outcome**: An evaluation pipeline that lets you measure improvements objectively before you have real users.

**Case studies**: Consulting firm (50% → 90% recall), Blueprint search (27% → 85% in 4 days)

### Key Insight

**You can't improve what you can't measure—and you can measure before you have users.** Synthetic data isn't just a stopgap until real users arrive. It's a powerful tool for establishing baselines, testing edge cases, and building the evaluation infrastructure that will power continuous improvement. Start with retrieval metrics (precision and recall), not generation quality, because they're faster, cheaper, and more objective.
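Both metrics reduce to a few lines of set arithmetic. A minimal sketch over document IDs (the names here are illustrative):

```python
def retrieval_metrics(retrieved: list[str], relevant: set[str]) -> dict:
    """Precision: fraction of retrieved chunks that are relevant.
    Recall: fraction of relevant chunks that were retrieved."""
    hits = sum(1 for doc_id in retrieved if doc_id in relevant)
    return {
        "precision": hits / len(retrieved) if retrieved else 0.0,
        "recall": hits / len(relevant) if relevant else 0.0,
    }
```

For example, retrieving `["a", "b", "c", "d"]` when `{"a", "c", "e"}` are relevant gives precision 0.5 and recall 2/3. Running this over a few hundred synthetic queries is what turns "the retrieval feels off" into a number you can move.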
@@ -372,6 +381,13 @@ Two examples demonstrate how focusing on retrieval metrics leads to rapid improv

A consulting firm generates reports from user research interviews. Consultants conduct 15-30 interviews per project and need AI-generated summaries that capture all relevant insights.

| Stage | Recall | Time | Key Change |
|-------|--------|------|------------|
| Baseline | 50% | - | Missing half of relevant quotes |
| Iteration 1 | 70% | 1 week | Identified chunking issue |
| Iteration 2 | 85% | 2 weeks | Fixed Q&A splitting |
| Iteration 3 | 90% | 3 weeks | Added chunk overlap |

**Problem**: Reports were missing critical quotes. A consultant knew 6 experts said something similar, but the report only cited 3. That 50% recall rate destroyed trust. Consultants started spending hours manually verifying reports, defeating the automation's purpose.

**Investigation**: Built manual evaluation sets from problematic examples. The issues turned out to be surprisingly straightforward—text chunking was breaking mid-quote and splitting speaker attributions from their statements.
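One way to keep an attribution attached to its quote is to chunk on speaker turns rather than fixed character counts. A sketch under the assumption that transcripts mark each turn as `Name:` at the start of a line (the regex and `max_chars` are illustrative, not the firm's actual code):

```python
import re

# Assumes each turn begins with a capitalized speaker label like "Alice:".
SPEAKER = re.compile(r"^[A-Z][\w .'-]*:", re.MULTILINE)

def chunk_by_speaker_turns(transcript: str, max_chars: int = 1500) -> list[str]:
    """Split at speaker boundaries, packing whole turns into chunks so a
    quote is never separated from its attribution."""
    starts = [m.start() for m in SPEAKER.finditer(transcript)]
    if not starts or starts[0] != 0:
        starts = [0] + starts  # keep any preamble before the first speaker
    turns = [transcript[s:e].strip()
             for s, e in zip(starts, starts[1:] + [len(transcript)])]
    chunks, current = [], ""
    for turn in turns:
        if current and len(current) + len(turn) > max_chars:
            chunks.append(current)
            current = turn
        else:
            current = f"{current}\n{turn}".strip() if current else turn
    if current:
        chunks.append(current)
    return chunks
```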
@@ -384,6 +400,12 @@

A construction technology company needed AI search for building blueprints. Workers asked questions like "Which rooms have north-facing windows?" or "Show me all electrical outlet locations in the second-floor bedrooms."

| Stage | Recall | Time | Key Change |
|-------|--------|------|------------|
| Baseline | 27% | - | Text embeddings on blueprints |
| Vision captions | 85% | 4 days | Added spatial descriptions |
| Counting queries | 92% | +2 weeks | Bounding box detection |

**Problem**: Only 27% recall when finding the right blueprint sections for questions. Workers would ask simple spatial questions and get completely unrelated blueprint segments. The system was essentially useless—workers abandoned it and went back to manually scrolling through PDFs.

**Investigation**: Standard text embeddings couldn't handle the spatial and visual nature of blueprint queries. "North-facing windows" and "electrical outlets" are visual concepts that don't translate well to text chunks.
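The caption-then-embed fix changes what gets indexed, not the retriever itself. A minimal, model-agnostic sketch of the pipeline shape (the function names and dependency injection are illustrative assumptions; in practice `caption_fn` would call a vision-language model and `embed_fn` an embedding model):

```python
def index_blueprints(segments, caption_fn, embed_fn):
    """Index blueprint segments by embedding vision-model captions
    (spatial descriptions like 'bedroom, two north-facing windows')
    instead of raw extracted text, so spatial queries have something
    to match against."""
    records = []
    for seg_id, image in segments:
        caption = caption_fn(image)
        records.append({"id": seg_id,
                        "caption": caption,
                        "embedding": embed_fn(caption)})
    return records
```

The design choice worth noticing: because captions are plain text, the rest of the existing text-retrieval stack (embedding model, vector store, evaluation harness) is reused unchanged.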
@@ -834,6 +856,9 @@ Example: One team found 5 irrelevant documents (even marked as "potentially less
- **[Prompttools](https://github.com/promptslab/prompttools)**: Toolkit for testing and evaluating LLM applications
- **[MLflow for Experiment Tracking](https://mlflow.org/)**: Open-source platform for managing ML lifecycle

!!! tip "Hands-On Practice"
For step-by-step exercises to apply these concepts, see [Exercises: Chapter 1](exercises.md#chapter-1-evaluation-foundations).

## This Week's Action Items

Based on the content covered, here are your specific tasks:
@@ -894,6 +919,14 @@ Take a minute to think about:
4. What experiment could you run this week to test an improvement hypothesis?
5. How will you incorporate real user feedback as it comes in?

!!! example "Hands-On Practice: WildChat Case Study"
Apply these concepts with the WildChat case study, which demonstrates the evaluation framework in action:

- **[Part 1: Data Exploration](../../latest/case_study/teaching/part01/README.md)** - Understand your dataset before building evaluations
- **[Part 2: The Alignment Problem](../../latest/case_study/teaching/part02/README.md)** - See how synthetic query generation (v1 vs v2) reveals system limitations

The case study shows a real example of going from 12% to 62% Recall@1 by understanding what you are actually measuring.

## Conclusion and Next Steps

We've covered the foundation for systematic RAG improvement through proper evaluation. No more subjective judgments or random changes: you now have tools to measure progress objectively and make data-driven decisions.
@@ -917,3 +950,9 @@
As one client told me: "We spent three months trying to improve through prompt engineering and model switching. In two weeks with proper evaluations, we made more progress than all that time combined."

---

## Navigation

- **Previous**: [Introduction: Beyond Implementation to Improvement](chapter0.md) - The product mindset for RAG
- **Next**: [Chapter 2: From Evaluation to Enhancement](chapter2.md) - Converting evaluations into training data
- **Reference**: [Glossary](glossary.md) | [Quick Reference](quick-reference.md)
50 changes: 29 additions & 21 deletions docs/workshops/chapter2.md
@@ -133,7 +133,7 @@ Build a library of few-shot examples systematically:

### Structured Few-Shot Prompt Example

```text
You are an assistant specialized in answering questions about [domain].

Here are some examples of how to answer questions:
```
@@ -244,13 +244,13 @@ For RAG applications, there are several natural ways to create triplet datasets:

Imagine a healthcare RAG application where a user asks:

```text
What are the side effects of medication X?
```

Our retrieval system might return several documents, including:

```text
Document A: "Medication X may cause drowsiness, nausea, and in rare cases, allergic reactions."

Document B: "Medication X is used to treat high blood pressure and should be taken with food."
```
@@ -605,33 +605,28 @@ Linear adapters add a small trainable layer on top of frozen embeddings: - Train
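The healthcare example above already contains the makings of a training triplet: the query is the anchor, Document A the positive, and Document B a hard negative. A dependency-free sketch of turning retrieval results into triplets (the function and field names are illustrative):

```python
def build_triplets(query, retrieved, relevant_ids):
    """Turn one query's retrieval results into (anchor, positive, negative)
    triplets: every pairing of a relevant doc with an irrelevant one
    yields a training example."""
    positives = [d for d in retrieved if d["id"] in relevant_ids]
    negatives = [d for d in retrieved if d["id"] not in relevant_ids]
    return [{"anchor": query, "positive": p["text"], "negative": n["text"]}
            for p in positives for n in negatives]
```

Each dict maps directly onto a fine-tuning input, e.g. sentence-transformers' `InputExample(texts=[anchor, positive, negative])` trained with `losses.TripletLoss`. Note that Document B is an especially useful negative precisely because it mentions medication X while being irrelevant to side effects.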
## Additional Resources

!!! info "Tools and Libraries"
**Understanding Embedding Models**

1. **Sentence Transformers Library** ([https://www.sbert.net/](https://www.sbert.net/)): This library provides easy-to-use implementations for state-of-the-art embedding models, supporting both pairwise datasets and triplets for fine-tuning. It's my recommended starting point for most teams due to its balance of performance and ease of use.

2. **Modern BERT** ([https://huggingface.co/sentence-transformers](https://huggingface.co/sentence-transformers)): These newer models offer 8,000 token sequence lengths and generally outperform classic BERT-based models. The BGE models in particular have shown excellent performance across many domains and are worth testing in your applications.

3. **Cohere Re-ranking Models** ([https://cohere.com/rerank](https://cohere.com/rerank)): Cohere offers state-of-the-art re-ranking capabilities with a fine-tuning API that makes it relatively easy to customize for your specific needs. In my experience, even their base re-ranker without fine-tuning often provides substantial improvements to retrieval quality.

4. **Specialized Domains**: For specific domains like code, science, or legal documents, look for models pre-trained on related corpora. For example, CodeBERT for programming or SciBERT for scientific literature can provide better starting points than general models.

5. **Comparison to Data Labeling**: Everything we're doing today with fine-tuning embedding models is what I used to pay data labeling teams hundreds of thousands of dollars to do annually. The ML playbook that was once only accessible to large companies with significant budgets is now available to teams of all sizes thanks to advances in transfer learning and fine-tuning techniques.

!!! info "Key Concepts"
**Contrastive Learning In-Depth**

Contrastive learning trains models to recognize similarities and differences between items by pushing and pulling examples in the embedding space:

- **Triplet Loss**: Optimizes the distance between anchor-positive pairs relative to anchor-negative pairs
- **InfoNCE Loss**: Contrasts a positive pair against multiple negative examples
- **Multiple Negatives Ranking Loss**: Handles batches of queries with multiple negatives per query

**Scaling and Efficiency Considerations**

For large datasets or production workloads:

@@ -702,6 +697,14 @@ Take a minute to think about:
4. If you had to prioritize one retrieval improvement for your system, would it be embeddings, re-ranking, or something else? Why?
5. What experiments could you run to test your hypotheses about improving retrieval quality?

!!! example "Hands-On Practice: WildChat Case Study"
The case study demonstrates the alignment problem and how to solve it through better embeddings:

- **[Part 2: The Alignment Problem](../../latest/case_study/teaching/part02/README.md)** - See how v1 queries achieve 62% recall while v2 queries get only 12% on the same embeddings
- **[Part 3: Solving Through Summaries](../../latest/case_study/teaching/part03/README.md)** - Learn how changing what you embed (not just how you embed) can achieve 358% improvement

This demonstrates the core insight: alignment between queries and embeddings matters more than model sophistication.

## Conclusion and Next Steps

We covered a lot:
@@ -740,4 +743,9 @@ Do these things now:
If you do this right, every piece of data makes your system better. The improvements compound over time and affect everything—clustering, topic modeling, all of it.

---

## Navigation

- **Previous**: [Chapter 1: Starting the Flywheel](chapter1.md) - Synthetic data and evaluation
- **Next**: [Chapter 3.1: Feedback Collection](chapter3-1.md) - Getting users to actually give feedback
- **Reference**: [Glossary](glossary.md) | [Quick Reference](quick-reference.md)