A modular, research-grade evaluation and experimentation platform for Retrieval-Augmented Generation (RAG) pipelines. Built with Bun, TypeScript, React, and Python for high-performance evaluation of retrieval quality.
- Modular Architecture: Extend with custom preprocessors, filters, and search types
- Multiple Search Types: Vector, BM25, and Hybrid search out of the box
- Comprehensive Metrics: Precision, Recall, MRR, NDCG, and LLM-judged metrics
- Beautiful Dashboard: Modern React UI for configuration and visualization
- Domain Agnostic: Works with any RAG system via pluggable modules
Prerequisites:

- Bun v1.0+
- Python 3.10+ with pip
- (Optional) Ollama for LLM-judged metrics
```bash
# Clone the repository
git clone https://github.com/your-org/rag-lab.git
cd rag-lab
# Install dependencies
bun install
# Set up Python environment
python -m venv .venv-textdb
source .venv-textdb/bin/activate # or .venv-textdb\Scripts\activate on Windows
pip install -r python/requirements-text.txt
# Start the servers
bun run dev # Backend on :3100
cd web && bun run dev # Frontend on :3101
```

Visit http://localhost:3101 to access the dashboard.
Retrieval metrics:

| Metric | Description |
|---|---|
| Precision@K | Proportion of retrieved documents that are relevant |
| Recall@K | Proportion of relevant documents that were retrieved |
| Hit Rate@K | Whether at least one relevant document exists in top K |
| MRR | Mean Reciprocal Rank: the average of 1/rank of the first relevant result |
| NDCG | Normalized Discounted Cumulative Gain |
| F1@K | Harmonic mean of precision and recall |
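These follow the standard IR definitions. As a reference only (not the platform's own implementation), a minimal Python sketch of the top-K metrics with binary relevance looks like this:

```python
import math

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for doc in retrieved[:k] if doc in relevant) / len(relevant)

def mrr(retrieved, relevant):
    """Reciprocal rank of the first relevant document (0 if none retrieved)."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved, relevant, k):
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc in enumerate(retrieved[:k], start=1) if doc in relevant)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0
```

Hit Rate@K is then simply whether `precision_at_k` is greater than zero, and F1@K is the harmonic mean of the precision and recall values at the same K.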
LLM-judged generation metrics (require Ollama):

| Metric | Description |
|---|---|
| Faithfulness | Is the answer grounded in retrieved context? |
| Answer Relevancy | Does the answer address the query? |
| Answer Correctness | Is the answer factually accurate? |
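These metrics are scored by a judge model served through Ollama (see the OLLAMA_* settings below). As a rough illustration only, and not RAG-Lab's actual prompts or rubric, a faithfulness-style check against Ollama's `/api/generate` endpoint could look like:

```python
import requests  # assumes the requests package; any HTTP client works

OLLAMA_URL = "http://localhost:11434/api/generate"  # built from OLLAMA_HOST / OLLAMA_PORT

def judge_faithfulness(answer: str, context: str, model: str = "mistral:latest") -> str:
    """Ask a local judge model whether the answer is grounded in the retrieved context.

    The prompt and 0-1 rubric here are illustrative, not the platform's judge prompt.
    """
    prompt = (
        "You are grading a RAG answer for faithfulness.\n\n"
        f"Context:\n{context}\n\nAnswer:\n{answer}\n\n"
        "Reply with a single number between 0 (unsupported) and 1 (fully grounded)."
    )
    resp = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"].strip()
```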
RAG-Lab uses a pluggable module system for extensibility. Modules are discovered automatically from the modules/ directory.
- Query Preprocessors - Transform queries before retrieval
  - Query expansion, synonym mapping, intent detection
- Document Filters - Filter/rerank documents after retrieval
  - Relevance scoring, deduplication, domain filtering
- Search Types - Different retrieval strategies
  - Vector (dense), BM25 (sparse), Hybrid (combined)
```python
# modules/my_module/my_preprocessor.py
from rag_bench.modules.base import QueryPreprocessor, ModuleConfig


class MyPreprocessor(QueryPreprocessor):
    MODULE_ID = "my-preprocessor"
    MODULE_NAME = "My Preprocessor"
    MODULE_DESCRIPTION = "Enhances queries with custom logic"

    @classmethod
    def get_config_schema(cls):
        return [
            ModuleConfig(
                key="intensity",
                type="number",
                label="Intensity",
                default=0.5,
                min=0.0,
                max=1.0,
            ),
        ]

    def process(self, query, context):
        enhanced = self._enhance(query)
        return enhanced, context
```

```python
# modules/my_module/__init__.py
from .my_preprocessor import MyPreprocessor


def register(registry):
    registry.register(MyPreprocessor)
```

See the Module Development Guide for complete documentation.
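A document filter follows the same pattern. The sketch below is illustrative only: the `DocumentFilter` base class, the `process(documents, context)` hook, and the `.text` attribute on retrieved documents are assumptions; check the Module Development Guide for the actual interface.

```python
# modules/my_module/my_filter.py -- illustrative sketch; verify the real base-class
# name and hook signature in the Module Development Guide.
from rag_bench.modules.base import DocumentFilter  # assumed base class


class DedupFilter(DocumentFilter):
    MODULE_ID = "dedup-filter"
    MODULE_NAME = "Deduplication Filter"
    MODULE_DESCRIPTION = "Drops near-identical chunks from the retrieved set"

    def process(self, documents, context):
        # Keep the first occurrence of each chunk; assumes retrieved documents
        # expose a `text` attribute (hypothetical).
        seen, kept = set(), []
        for doc in documents:
            key = doc.text.strip().lower()
            if key not in seen:
                seen.add(key)
                kept.append(doc)
        return kept, context
```

Register it from `register()` in the module's `__init__.py`, exactly as with the preprocessor above.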
Vector - dense embedding similarity search using cosine similarity.

```yaml
# Best for semantic understanding
search_type: "vector"
```

BM25 - sparse lexical search using the BM25 algorithm.
```yaml
# Best for keyword matching
search_type: "bm25"
variants: ["bm25", "bm25_no_idf", "tf"]
```
Hybrid - combines vector and BM25 for the best overall results.

```yaml
# Best overall performance
search_type: "hybrid"
variants: ["weighted", "rrf"]
config:
  vector_weight: 0.5
  lexical_weight: 0.5
```
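The two variants correspond to the two common fusion strategies: a weighted sum of the vector and lexical scores, and reciprocal rank fusion (RRF). A minimal sketch of both, assuming scores have already been normalized to a comparable range (the platform's exact normalization may differ):

```python
def weighted_fusion(vector_scores, lexical_scores, vector_weight=0.5, lexical_weight=0.5):
    """Weighted sum of per-document scores from the two retrievers."""
    docs = set(vector_scores) | set(lexical_scores)
    return {
        doc: vector_weight * vector_scores.get(doc, 0.0)
        + lexical_weight * lexical_scores.get(doc, 0.0)
        for doc in docs
    }

def rrf_fusion(vector_ranking, lexical_ranking, k=60):
    """Reciprocal rank fusion: score(d) = sum over rankings of 1 / (k + rank(d))."""
    scores = {}
    for ranking in (vector_ranking, lexical_ranking):
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```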
```
RAG-Lab/
├── src/                      # TypeScript API server
│   ├── api/                  # REST endpoints
│   ├── core/                 # Evaluation engine
│   ├── modules/              # Module system (TS)
│   └── integrations/         # Backend integrations
│
├── python/                   # Python runtime
│   └── rag_bench/
│       ├── modules/          # Module system core
│       ├── search/           # Built-in search types
│       └── query.py          # Query runner
│
├── modules/                  # User modules (gitignored)
│
├── web/                      # React dashboard
│   └── src/
│       ├── components/       # UI components
│       └── lib/              # API client
│
├── datasets/                 # Evaluation datasets
│   └── general/              # General benchmarks
│
├── data/                     # Runtime data
│   └── text_dbs/             # Vector stores
│
└── docs/                     # Documentation
    ├── architecture.md       # System architecture
    ├── module-development.md
    └── agent-instructions.md
```
- Architecture Overview
- Module Development Guide
- AGENTS.md - Instructions for AI assistants working on this codebase
REST API endpoints:

```
GET /api/modules # List all modules
GET /api/modules/search-types # List search types
POST /api/modules/refresh # Re-discover modules
POST /api/evaluations/start # Start evaluation
GET /api/evaluations/:id/status
GET /api/evaluations/results
GET /api/datasets # List datasets
GET /api/datasets/:id/summary
GET /api/textdb/list # List vector stores
POST /api/textdb/build # Build new store
POST /api/textdb/active # Set active store
```
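As an illustration of driving a run over HTTP (a client-side sketch: the request body mirrors the evaluation configuration example further below, and the assumption that the start endpoint returns a run id should be verified against the API):

```python
import requests

BASE = "http://localhost:3100"  # the backend started by `bun run dev`

# Start an evaluation; the body mirrors the evaluation configuration example below.
run = requests.post(f"{BASE}/api/evaluations/start", json={
    "kValues": [5, 10],
    "searchType": "vector",
    "integrationMode": "text",
    "enableGenerationMetrics": False,
}).json()

# Poll the run, then fetch results. Assumes the start response contains an `id`;
# adjust to whatever the endpoint actually returns.
status = requests.get(f"{BASE}/api/evaluations/{run['id']}/status").json()
results = requests.get(f"{BASE}/api/evaluations/results").json()
print(status, results)
```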
Evaluation datasets use the following JSON format:

```json
{
  "id": "my-dataset",
  "name": "My Evaluation Dataset",
  "testCases": [
    {
      "id": "test-1",
      "query": "How do I implement feature X?",
      "category": "features",
      "difficulty": "medium",
      "groundTruth": {
        "expectedKeywords": ["feature", "implement", "X"],
        "relevantChunks": ["feature documentation"],
        "referenceAnswer": "To implement feature X..."
      }
    }
  ]
}
```
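A quick, standalone way to sanity-check a dataset file before using it (the path below is hypothetical; point it at your own file under datasets/):

```python
import json
from pathlib import Path

# Hypothetical location; adjust to where your dataset lives.
dataset = json.loads(Path("datasets/general/my-dataset.json").read_text())

for case in dataset["testCases"]:
    gt = case.get("groundTruth", {})
    assert case.get("query"), f"test case {case['id']} has an empty query"
    print(case["id"], case.get("difficulty", "n/a"),
          f'{len(gt.get("expectedKeywords", []))} expected keywords')
```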
Environment variables:

```bash
# Ollama (for LLM metrics)
OLLAMA_HOST=http://localhost
OLLAMA_PORT=11434
OLLAMA_MODEL=mistral:latest
# Embeddings
TEXT_EMBEDDING_MODEL=BAAI/bge-large-en-v1.5
TEXT_EMBEDDING_DEVICE=cpu
# Server
PORT=3100
```
Example evaluation configuration:

```ts
{
  kValues: [5, 10, 15, 20],
  enableGenerationMetrics: true,
  integrationMode: 'text',
  searchType: 'vector',
  moduleConfig: {
    "my-preprocessor": {
      enabled: true,
      config: { intensity: 0.8 }
    }
  }
}
```

Use cases:

- A/B Testing - Compare retrieval strategies
- Hyperparameter Tuning - Find optimal K, chunk sizes
- Model Comparison - Evaluate embedding models
- Regression Testing - Ensure changes don't degrade quality
- Domain Adaptation - Create domain-specific modules
Development commands:

```bash
# Run backend with hot reload
bun run dev
# Run frontend
cd web && bun run dev
# Type checking
bun run typecheck
# Build for production
bun run build
cd web && bun run build
```