This repository contains the code used to evaluate RAG systems on Legal RAG Bench.
If you're looking for the data behind Legal RAG Bench, you can find it here. A full interactive write-up of how Legal RAG Bench was built is also available here.
Install dependencies from requirements.txt:
```
pip install -r requirements.txt
```

We recommend creating a `.env` file in the repository root to store API keys for the providers you plan to use:
```
# Isaacus
ISAACUS_API_KEY=...

# OpenAI
OPENAI_API_KEY=...

# Google
GOOGLE_API_KEY=...
```

Replace `...` with your actual keys. You can omit any providers you won't be using.
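The keys are typically loaded from `.env` at startup (for example via a dotenv-style helper). Purely for illustration, here is a stdlib-only sketch of parsing `KEY=value` lines; the actual loading mechanism in this repo may differ:

```python
# Hypothetical stdlib-only .env parser for illustration; the repo itself
# may use python-dotenv or similar to load these keys automatically.
def load_env(text):
    env = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env

sample = "# Isaacus\nISAACUS_API_KEY=sk-test\n\n# OpenAI\nOPENAI_API_KEY=sk-test2\n"
print(load_env(sample))
```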
The default configuration evaluates n = 6 RAG pipeline permutations:
- Embedding models (retrieval at `k=5` only): `kanon-2-embedder`, `text-embedding-3-large`, `gemini-embedding-001`
- Generative models: `gpt-5.2`, `gemini-3.1-pro-preview`
- Judge LLM: `gpt-5.2` with `reasoning_effort="high"`
Counts:
- Runs = emb × gen × nK (default: 3 × 2 × 1 = 6)
- Iterations = emb × gen × nK × nQuestions (default: 3 × 2 × 1 × 100 = 600)
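The grid above multiplies out as a Cartesian product; using the default model names listed earlier, a quick sketch of the arithmetic:

```python
from itertools import product

# Default grid from the configuration described above.
embedding_models = ["kanon-2-embedder", "text-embedding-3-large", "gemini-embedding-001"]
generative_models = ["gpt-5.2", "gemini-3.1-pro-preview"]
Ks = [5]
n_questions = 100

# Each run is one (embedding, generative, k) combination.
runs = list(product(embedding_models, generative_models, Ks))
print(len(runs))                # 6 runs
print(len(runs) * n_questions)  # 600 iterations
```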
To run the default evaluation end-to-end:
1. Build vector DBs (one per embedding model):

   ```
   python db.py
   ```

2. Run the evaluation over the full QA dataset for each emb × gen pairing:

   ```
   python eval.py
   ```

3. Inspect results in `./results`:
   - Results are organized into folders per embedding model
   - Each embedding × generative pairing produces a `.jsonl` file
Each .jsonl contains one row per question (and per k, if multiple k values are used). Rows include metadata (labels, IDs), the RAG answer, the human-annotated answer, retrieved context documents, and the judge’s assessment.
To extract the judge outcome, read the `"judge_verdict"` field:

- `correct` (`true`/`false`): whether the judge deemed the answer correct given the human-annotated answer
- `grounded` (`true`/`false`): whether the judge deemed the answer grounded in the provided context
- `reasoning` (`str`): the judge's explanation for correctness and grounding
The `"relevant_passage_in_context"` field is also useful for determining whether the retrieval model delivered the relevant passage as context to the generative model.
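These per-row fields can be aggregated into summary metrics with a few lines of Python. The sketch below uses illustrative stand-in rows; real rows carry the full metadata, answers, and retrieved context described above:

```python
import json

# Illustrative sample rows; real .jsonl rows include labels, IDs, the RAG
# answer, the human-annotated answer, and retrieved context documents.
sample_jsonl = """\
{"id": "q1", "judge_verdict": {"correct": true, "grounded": true, "reasoning": "..."}, "relevant_passage_in_context": true}
{"id": "q2", "judge_verdict": {"correct": false, "grounded": true, "reasoning": "..."}, "relevant_passage_in_context": false}
"""

def summarize(lines):
    """Aggregate judge verdicts and retrieval hits from a results .jsonl."""
    rows = [json.loads(line) for line in lines.splitlines() if line.strip()]
    n = len(rows)
    return {
        "accuracy": sum(r["judge_verdict"]["correct"] for r in rows) / n,
        "groundedness": sum(r["judge_verdict"]["grounded"] for r in rows) / n,
        "retrieval_hit_rate": sum(r["relevant_passage_in_context"] for r in rows) / n,
    }

print(summarize(sample_jsonl))
```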
Before running `db.py` and `eval.py`, you can define and evaluate a custom configuration via the following knobs.
- Edit `config.py` to define new embedding and/or generative models using LangChain integrations. For example:

  ```python
  # Example embedding model
  kanon2 = {
      "model": IsaacusEmbeddings(
          model="kanon-2-embedder",
          api_key=ISAACUS_API_KEY,
      ),
      "model_name": "kanon2",
  }

  # Example generative model
  gpt52 = {
      "model": ChatOpenAI(
          model="gpt-5.2",
          api_key=OPENAI_API_KEY,
          temperature=0,
          reasoning_effort="none",
          seed=42,
      ),
      "model_name": "gpt52",
  }
  ```

  If your changes require additional LangChain packages or new provider API keys, update imports and your `.env` accordingly.

- Add your models to the lists the evaluation iterates over:

  ```python
  # For embedding models, add new models to this list
  embedding_models = [kanon2, ...]

  # For generative models, add new models to this list
  generative_models = [gpt52, ...]
  ```
- Change the judge model by assigning a different LangChain LLM to `judge_model`:

  ```python
  judge_model = {
      "model": ChatOpenAI(
          model="gpt-5.2",
          api_key=OPENAI_API_KEY,
          reasoning_effort="high",
          seed=42,
      ),
      "model_name": "gpt52_judge",
  }
  ```
- Evaluate different retrieval depths by editing the `Ks` list:

  ```python
  # Number of retrieved documents provided as context to the generative model
  Ks = [5, ...]
  ```
Edit `prompts.py`:

- Update `RAG_PROMPT` to change the generative model's prompt/style
- Update `JUDGE_PROMPT` to change the judge's rubric/prompt
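A custom `RAG_PROMPT` might look like the sketch below. The `{context}` and `{question}` placeholder names are assumptions for illustration; check `prompts.py` for the placeholders the evaluation actually substitutes.

```python
# Hypothetical custom RAG prompt; the placeholder names ({context},
# {question}) are illustrative and may differ from those in prompts.py.
RAG_PROMPT = """You are a legal research assistant.
Answer the question using only the provided context.
If the context does not contain the answer, say so.

Context:
{context}

Question: {question}

Answer:"""

print(RAG_PROMPT.format(context="<retrieved documents>", question="<user question>"))
```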
The default prompts are tuned for strong performance on the Legal RAG Bench corpus.
The results reported for Legal RAG Bench were generated on the 20th of February 2026, shortly after the public preview release of Gemini 3.1 Pro. You may observe different outputs and/or benchmark scores across runs, even with the same prompts and data, for several reasons:
- LLM outputs are non-deterministic. APIs can return different outputs across requests with identical inputs. Setting `temperature=0` may reduce variance but does not guarantee determinism. If your provider supports it, consider using a fixed `seed` and recording any response metadata to aid reproducibility. Note that doing this will deviate from the default settings tested in Legal RAG Bench.
- Model providers update models over time. A model "name" or alias may refer to an evolving system. Provider-side updates (weights, routing, safety layers, tool policies, decoding defaults, etc.) can change behavior and benchmark scores between dates, sometimes without a clear semantic-version signal.
- LangChain and other dependencies change quickly. Updates can modify prompt formatting, tokenization, retriever defaults, retry logic, and other behaviors that materially affect results.
- Runtime/infrastructure differences matter. Python versions, OS, vector DB implementations, embedding libraries, and hardware can affect retrieval behavior, latency, and occasionally outputs.
This project is licensed under the MIT license.
```bibtex
@misc{butler2026legalragbench,
  title={Legal RAG Bench: an end-to-end benchmark for legal RAG},
  author={Abdur-Rahman Butler and Umar Butler},
  year={2026},
  eprint={2603.01710},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2603.01710},
}
```