This repository contains the code used to evaluate RAG systems on Legal RAG Bench.
If you're looking for the data behind Legal RAG Bench, you can find it here. A full interactive write-up of how Legal RAG Bench was built is also available here.
Install dependencies from requirements.txt:
```
pip install -r requirements.txt
```

We recommend creating a `.env` file in the repository root to store API keys for the providers you plan to use:
```
# Isaacus
ISAACUS_API_KEY=...

# OpenAI
OPENAI_API_KEY=...

# Google
GOOGLE_API_KEY=...
```

Replace `...` with your actual keys. You can omit any providers you won't be using.
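The keys are typically loaded from `.env` at startup (for example via a dotenv-style helper). Purely for illustration, here is a stdlib-only sketch of parsing `KEY=value` lines; the actual loading mechanism in this repo may differ:

```python
# Hypothetical stdlib-only .env parser for illustration; the repo itself
# may use python-dotenv or similar to load these keys automatically.
def load_env(text):
    env = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env

sample = "# Isaacus\nISAACUS_API_KEY=sk-test\n\n# OpenAI\nOPENAI_API_KEY=sk-test2\n"
print(load_env(sample))
```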
The default configuration evaluates n = 6 RAG pipeline permutations:
- Embedding models (retrieval at `k=5` only): `kanon-2-embedder`, `text-embedding-3-large`, `gemini-embedding-001`
- Generative models: `gpt-5.2`, `gemini-3.1-pro-preview`
- Judge LLM: `gpt-5.2` with `reasoning_effort="high"`
Counts:
- Runs = emb × gen × nK (default: 3 × 2 × 1 = 6)
- Iterations = emb × gen × nK × nQuestions (default: 3 × 2 × 1 × 100 = 600)
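The grid above multiplies out as a Cartesian product; using the default model names listed earlier, a quick sketch of the arithmetic:

```python
from itertools import product

# Default grid from the configuration described above.
embedding_models = ["kanon-2-embedder", "text-embedding-3-large", "gemini-embedding-001"]
generative_models = ["gpt-5.2", "gemini-3.1-pro-preview"]
Ks = [5]
n_questions = 100

# Each run is one (embedding, generative, k) combination.
runs = list(product(embedding_models, generative_models, Ks))
print(len(runs))                # 6 runs
print(len(runs) * n_questions)  # 600 iterations
```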
To run the default evaluation end-to-end:
1. Build vector DBs (one per embedding model):

   ```
   python db.py
   ```

2. Run the evaluation over the full QA dataset for each emb × gen pairing:

   ```
   python eval.py
   ```

3. Inspect results in `./results`:
   - Results are organized into folders per embedding model
   - Each embedding × generative pairing produces a `.jsonl` file
Each .jsonl contains one row per question (and per k, if multiple k values are used). Rows include metadata (labels, IDs), the RAG answer, the human-annotated answer, retrieved context documents, and the judge’s assessment.
To extract the judge outcome, read the `"judge_verdict"` field:

- `correct` (`true`/`false`): whether the judge deemed the answer correct given the human-annotated answer
- `grounded` (`true`/`false`): whether the judge deemed the answer grounded in the provided context
- `reasoning` (`str`): the judge's explanation for correctness and grounding
The `"relevant_passage_in_context"` field is also useful for determining whether the retrieval model delivered the relevant passage as context to the generative model.
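These per-row fields can be aggregated into summary metrics with a few lines of Python. The sketch below uses illustrative stand-in rows; real rows carry the full metadata, answers, and retrieved context described above:

```python
import json

# Illustrative sample rows; real .jsonl rows include labels, IDs, the RAG
# answer, the human-annotated answer, and retrieved context documents.
sample_jsonl = """\
{"id": "q1", "judge_verdict": {"correct": true, "grounded": true, "reasoning": "..."}, "relevant_passage_in_context": true}
{"id": "q2", "judge_verdict": {"correct": false, "grounded": true, "reasoning": "..."}, "relevant_passage_in_context": false}
"""

def summarize(lines):
    """Aggregate judge verdicts and retrieval hits from a results .jsonl."""
    rows = [json.loads(line) for line in lines.splitlines() if line.strip()]
    n = len(rows)
    return {
        "accuracy": sum(r["judge_verdict"]["correct"] for r in rows) / n,
        "groundedness": sum(r["judge_verdict"]["grounded"] for r in rows) / n,
        "retrieval_hit_rate": sum(r["relevant_passage_in_context"] for r in rows) / n,
    }

print(summarize(sample_jsonl))
```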
Before running `db.py` and `eval.py`, you can define and evaluate a custom configuration via the following knobs.
- Edit `config.py` to define new embedding and/or generative models using LangChain integrations. For example:

  ```python
  # Example embedding model
  kanon2 = {
      "model": IsaacusEmbeddings(
          model="kanon-2-embedder",
          api_key=ISAACUS_API_KEY,
      ),
      "model_name": "kanon2",
  }

  # Example generative model
  gpt52 = {
      "model": ChatOpenAI(
          model="gpt-5.2",
          api_key=OPENAI_API_KEY,
          temperature=0,
          reasoning_effort="none",
          seed=42,
      ),
      "model_name": "gpt52",
  }
  ```

  If your changes require additional LangChain packages or new provider API keys, update imports and your `.env` accordingly.

- Add your models to the lists the evaluation iterates over:

  ```python
  # For embedding models, add new models to this list
  embedding_models = [kanon2, ...]

  # For generative models, add new models to this list
  generative_models = [gpt52, ...]
  ```
- Change the judge model by assigning a different LangChain LLM to `judge_model`:

  ```python
  judge_model = {
      "model": ChatOpenAI(
          model="gpt-5.2",
          api_key=OPENAI_API_KEY,
          reasoning_effort="high",
          seed=42,
      ),
      "model_name": "gpt52_judge",
  }
  ```
- Evaluate different retrieval depths by editing the `Ks` list:

  ```python
  # Number of retrieved documents provided as context to the generative model
  Ks = [5, ...]
  ```
Edit `prompts.py`:

- Update `RAG_PROMPT` to change the generative model's prompt/style
- Update `JUDGE_PROMPT` to change the judge's rubric/prompt
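A custom `RAG_PROMPT` might look like the sketch below. The `{context}` and `{question}` placeholder names are assumptions for illustration; check `prompts.py` for the placeholders the evaluation actually substitutes.

```python
# Hypothetical custom RAG prompt; the placeholder names ({context},
# {question}) are illustrative and may differ from those in prompts.py.
RAG_PROMPT = """You are a legal research assistant.
Answer the question using only the provided context.
If the context does not contain the answer, say so.

Context:
{context}

Question: {question}

Answer:"""

print(RAG_PROMPT.format(context="<retrieved documents>", question="<user question>"))
```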
The default prompts are tuned for strong performance on the Legal RAG Bench corpus.
The results reported for Legal RAG Bench were generated on the 20th of February 2026, shortly after the public preview release of Gemini 3.1 Pro. You may observe different outputs and/or benchmark scores across runs, even with the same prompts and data, for several reasons:
- LLM outputs are non-deterministic. APIs can return different outputs across requests with identical inputs. Setting `temperature=0` may reduce variance but does not guarantee determinism. If your provider supports it, consider using a fixed `seed` and recording any response metadata to aid reproducibility. Note that doing this will deviate from the default settings tested in Legal RAG Bench.
- Model providers update models over time. A model "name" or alias may refer to an evolving system. Provider-side updates (weights, routing, safety layers, tool policies, decoding defaults, etc.) can change behavior and benchmark scores between dates, sometimes without a clear semantic-version signal.
- LangChain and other dependencies change quickly. Updates can modify prompt formatting, tokenization, retriever defaults, retry logic, and other behaviors that materially affect results.
- Runtime/infrastructure differences matter. Python versions, OS, vector DB implementations, embedding libraries, and hardware can affect retrieval behavior, latency, and occasionally outputs.
This project is licensed under the MIT license.
```bibtex
@misc{butler2026legalragbench,
  title={Legal RAG Bench: an end-to-end benchmark for legal RAG},
  author={Abdur-Rahman Butler and Umar Butler},
  year={2026},
  eprint={2603.01710},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2603.01710},
}
```