Evaluation-First, Multilingual RAG Framework with Local Deployment Support
AuroraRAG is a production-ready Retrieval-Augmented Generation (RAG) framework designed for observable search quality, multilingual support, and flexible deployment options. Built with transparency and measurability at its core, AuroraRAG enables teams to build, evaluate, and deploy RAG systems with confidence.
Part of the Aurora Series:
AuroraRAG | Aurora SAR Change Detection
- Overview
- Key Features
- Architecture
- Getting Started
- API Reference
- Evaluation Framework
- Configuration
- Deployment
- Production Considerations
- Project Structure
- Troubleshooting
- Contributing
- License
Modern RAG systems often provide subjectively good results but lack measurable performance indicators. AuroraRAG addresses this challenge by making search quality observable and answer generation reproducible through:
- Clear Separation of Concerns: Distinct retrieval, reranking, and generation stages
- Transparent Scoring: Inner-product similarity (FAISS) combined with cross-encoder reranking
- Built-in Evaluation: Information retrieval metrics (MRR, nDCG) for continuous improvement
- Multilingual Excellence: Optimized for Finnish and English with BGE-M3 model family
- Flexible Deployment: Supports both cloud-based and fully offline operation
Typical use cases:

- Internal knowledge base search
- Policy and compliance Q&A systems
- Technical support assistants
- Multilingual document repositories
- Privacy-sensitive applications requiring local deployment
Built-in information retrieval metrics (Mean Reciprocal Rank, Normalized Discounted Cumulative Gain) enable systematic evaluation of retrieval quality and A/B testing of system components.
Leverages BAAI's BGE-M3 model family for robust multilingual retrieval and reranking, with particular strength in Finnish and English corpora.
Operates entirely offline using Ollama for local LLM inference and CPU-optimized FAISS indexing, ensuring data privacy and eliminating cloud dependencies.
FastAPI-based REST API with comprehensive endpoints for search, answer generation, and health monitoring, ready for integration into existing systems.
AuroraRAG implements a three-stage pipeline optimizing for both recall and precision:
┌─────────────────────────────────────────────────────────────────┐
│ Query Input │
└────────────────────────┬────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Dense Embedding Layer │
│ (BGE-M3 Encoder) │
└────────────────────────┬────────────────────────────────────────┘
│ Vector Representations
▼
┌─────────────────────────────────────────────────────────────────┐
│ Fast Retrieval Stage │
│ (FAISS Inner Product Search) │
│ Returns Top-K │
└────────────────────────┬────────────────────────────────────────┘
│ Candidate Passages + Similarity Scores
▼
┌─────────────────────────────────────────────────────────────────┐
│ Precision Reranking │
│ (BGE-reranker-v2-M3 Cross-Encoder) │
└────────────────────────┬────────────────────────────────────────┘
│ Reranked Results + Confidence Scores
▼
┌─────────────────────────────────────────────────────────────────┐
│ Answer Generation (Optional) │
│ (Ollama LLM / API) │
└────────────────────────┬────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Structured Response with Citations │
└─────────────────────────────────────────────────────────────────┘
- BGE-M3 Embeddings: State-of-the-art multilingual dense retrieval, effective across query lengths
- BGE-reranker-v2-M3: Powerful cross-encoder for improving precision on top-K candidates
- FAISS IndexFlatIP: Exact search baseline (upgradeable to approximate methods for scale)
- Evaluation Pipeline: Quantitative metrics (MRR@K, nDCG@K) over labeled query sets
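The components above fit together roughly as in the following minimal sketch (illustrative only: it uses `sentence-transformers` and FAISS directly, the sample passages are placeholders, and the repository's actual logic lives under `inference/`):

```python
import faiss
import numpy as np
from sentence_transformers import CrossEncoder, SentenceTransformer

# Placeholder corpus; the project indexes files from data/sample_docs/ instead.
passages = [
    "Aalto-yliopisto tarjoaa koneoppimisen kursseja Espoossa.",
    "The city library hosts weekly reading groups in English.",
]

# Stage 1: dense embeddings (BGE-M3), L2-normalized so inner product == cosine.
encoder = SentenceTransformer("BAAI/bge-m3")
doc_vecs = np.asarray(encoder.encode(passages, normalize_embeddings=True), dtype="float32")

# Stage 2: fast candidate retrieval with an exact inner-product index.
index = faiss.IndexFlatIP(doc_vecs.shape[1])
index.add(doc_vecs)

query = "Missä voin opiskella koneoppimista Suomessa?"
q_vec = np.asarray(encoder.encode([query], normalize_embeddings=True), dtype="float32")
sims, ids = index.search(q_vec, k=2)

# Stage 3: precision reranking of the top-K candidates with a cross-encoder.
reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")
candidates = [passages[i] for i in ids[0]]
scores = reranker.predict([(query, p) for p in candidates])
for passage, score in sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True):
    print(f"{score:.2f}  {passage}")
```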
Prerequisites:

- Python 3.9 or higher
- (Optional) Ollama for local LLM inference
Installation:

- Clone the repository

  ```bash
  git clone https://github.com/rikulauttia/aurora-rag.git
  cd aurora-rag
  ```

- Set up a virtual environment

  ```bash
  python -m venv .venv
  source .venv/bin/activate  # On Windows: .venv\Scripts\activate
  ```

- Install dependencies

  ```bash
  pip install -U pip
  pip install -r requirements.txt
  ```
Initialize the FAISS index using the sample documents:
```bash
python inference/encode_index.py
```

This processes documents in `data/sample_docs/` and creates the searchable index.
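For orientation, index building along these lines might look roughly like the sketch below (hedged: the real script's chunking, file layout, and output names may differ, although `index.faiss` and `meta.json` are referenced later in this README):

```python
import json
import pathlib

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Read the sample corpus shipped with the repository.
doc_paths = sorted(pathlib.Path("data/sample_docs").glob("*.txt"))
texts = [p.read_text(encoding="utf-8") for p in doc_paths]

# Encode with BGE-M3; normalized vectors make inner product behave like cosine.
model = SentenceTransformer("BAAI/bge-m3")
vecs = np.asarray(model.encode(texts, normalize_embeddings=True), dtype="float32")

# Build and persist an exact inner-product index plus a doc_id -> path mapping.
index = faiss.IndexFlatIP(vecs.shape[1])
index.add(vecs)
faiss.write_index(index, "index.faiss")
with open("meta.json", "w", encoding="utf-8") as f:
    json.dump({i: str(p) for i, p in enumerate(doc_paths)}, f)
```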
For offline answer generation with Ollama:
macOS:
```bash
brew install ollama
```

Linux/Windows:
Follow instructions at ollama.ai
Configure and run:
```bash
ollama serve   # Keep this running in a separate terminal
ollama pull qwen2.5:7b-instruct-q4_0
```

Set environment variables:

```bash
export OLLAMA_URL=http://127.0.0.1:11434
export OLLAMA_MODEL=qwen2.5:7b-instruct-q4_0
```

Start the FastAPI server:

```bash
uvicorn app.server:app --host 0.0.0.0 --port 8080 --reload
```

The API will be available at http://127.0.0.1:8080.
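For reference, a minimal sketch of how an answer-generation wrapper can call Ollama's HTTP API with these variables (illustrative; the repository's `inference/generate.py` may structure this differently):

```python
import os

import requests

OLLAMA_URL = os.getenv("OLLAMA_URL", "http://127.0.0.1:11434")
OLLAMA_MODEL = os.getenv("OLLAMA_MODEL", "qwen2.5:7b-instruct-q4_0")


def generate(prompt: str, max_tokens: int = 512) -> str:
    """Call Ollama's /api/generate endpoint and return the completion text."""
    resp = requests.post(
        f"{OLLAMA_URL}/api/generate",
        json={
            "model": OLLAMA_MODEL,
            "prompt": prompt,
            "stream": False,
            "options": {"num_predict": max_tokens},
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"]
```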
Search endpoint (retrieval + reranking):
```bash
curl -X POST http://127.0.0.1:8080/search \
  -H "Content-Type: application/json" \
  -d '{"query":"Missä voin opiskella koneoppimista Suomessa?", "k": 5}' | jq .
```

Answer endpoint (with LLM generation):

```bash
curl -X POST http://127.0.0.1:8080/answer \
  -H "Content-Type: application/json" \
  -d '{"query":"Missä voin opiskella koneoppimista Suomessa?"}' | jq .
```
The /search endpoint retrieves and reranks relevant passages without LLM generation.

Request Body:
```json
{
  "query": "string",
  "k": 5
}
```

Response:

```json
{
  "query": "Missä voin opiskella koneoppimista Suomessa?",
  "hits": [
    {
      "doc_id": 2,
      "text": "...relevant passage excerpt...",
      "path": "data/sample_docs/003.txt",
      "sim_ip": 0.41,
      "rerank": 3.22
    }
  ]
}
```

Fields:

- `sim_ip`: Inner-product similarity score from FAISS
- `rerank`: Cross-encoder reranking score (higher is better)
The /answer endpoint generates a natural language answer using retrieved passages.
Request Body:
```json
{
  "query": "string",
  "k": 5,
  "max_tokens": 512
}
```

Response:

```json
{
  "query": "...",
  "answer": "Generated answer with citations",
  "passages": [...],
  "metadata": {
    "model": "qwen2.5:7b-instruct-q4_0",
    "retrieval_count": 5
  }
}
```

Health check endpoint.
Response:
```json
{
  "status": "ok"
}
```

AuroraRAG includes built-in information retrieval evaluation capabilities.
- MRR@K (Mean Reciprocal Rank): Measures the rank quality of the first relevant result
- nDCG@K (Normalized Discounted Cumulative Gain): Evaluates graded relevance distribution
```bash
python -m training.eval_ir
```

Example Output:

```
MRR@5: 0.778
nDCG@5: 0.833
```
- Create labeled query-document pairs in `training/eval_ir.py`
- Define relevance judgments (binary or graded); a metric sketch follows this list
- Run evaluation after system modifications
- Compare metrics to quantify improvements
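As a reference point, MRR@K and nDCG@K over binary relevance judgments can be computed roughly as follows (an illustrative sketch; `training/eval_ir.py` defines the actual labeled data and may use graded gains):

```python
import math


def mrr_at_k(ranked_ids, relevant_ids, k=5):
    """Reciprocal rank of the first relevant document within the top K."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked_ids, relevant_ids, k=5):
    """nDCG with binary gains: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, min(len(relevant_ids), k) + 1))
    return dcg / ideal if ideal > 0 else 0.0


# Hypothetical labeled query: the system's ranking vs. the set of relevant doc ids.
print(mrr_at_k([3, 1, 7], {1, 7}), ndcg_at_k([3, 1, 7], {1, 7}))
```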
Best Practices:
- Maintain a diverse test set covering common query patterns
- Include edge cases and multilingual queries
- Re-evaluate after changes to embeddings, reranking, or chunking strategy
- Track metrics over time to detect regressions
| Variable | Default | Description |
|---|---|---|
| `EMB_MODEL` | `BAAI/bge-m3` | Hugging Face embedding model identifier |
| `RERANK_MODEL` | `BAAI/bge-reranker-v2-m3` | Cross-encoder reranking model |
| `K` | `5` | Number of candidates retrieved from FAISS |
| `OLLAMA_URL` | `http://127.0.0.1:11434` | Ollama API endpoint |
| `OLLAMA_MODEL` | `qwen2.5:7b-instruct-q4_0` | LLM model for answer generation |
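Inside the service these variables are typically read with the defaults above, roughly as in this sketch (the exact variable names used in `app/server.py` are not shown here):

```python
import os

# Environment-driven configuration with the documented defaults.
EMB_MODEL = os.getenv("EMB_MODEL", "BAAI/bge-m3")
RERANK_MODEL = os.getenv("RERANK_MODEL", "BAAI/bge-reranker-v2-m3")
K = int(os.getenv("K", "5"))
OLLAMA_URL = os.getenv("OLLAMA_URL", "http://127.0.0.1:11434")
OLLAMA_MODEL = os.getenv("OLLAMA_MODEL", "qwen2.5:7b-instruct-q4_0")
```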
To use different models, update environment variables before starting the server:
```bash
export EMB_MODEL=sentence-transformers/paraphrase-multilingual-mpnet-base-v2
export RERANK_MODEL=cross-encoder/ms-marco-MiniLM-L-6-v2
```

Note: Ensure models are compatible with the expected input/output format.
Build the image:
```bash
docker build -t aurora-rag .
```

Run the container:

```bash
docker run -p 8000:8000 aurora-rag
```

The API will be accessible at http://127.0.0.1:8000.
The Docker image automatically builds the FAISS index on first run if not pre-built.
For a web-based demo interface, deploy to Hugging Face Spaces:
- Create a Space repository
- Include: `app.py`, `requirements.txt`, `index.faiss` (optional), `meta.json` (optional)
- Push to trigger automatic deployment
Update Space:
```bash
cd aurora-rag-space
git add . && git commit -m "Update Space configuration" && git push
```

Large-scale indexing (>1M documents):
- Replace `IndexFlatIP` with `IndexIVFFlat` or `IndexHNSWFlat` (see the sketch after this list)
- Train index quantizers on representative data samples
- Implement index sharding for distributed search
- Persist FAISS indices to object storage (S3, GCS)
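A hedged sketch of the `IndexIVFFlat` swap (the random corpus array, `nlist`, and `nprobe` values below are illustrative stand-ins):

```python
import faiss
import numpy as np

d = 1024                                            # BGE-M3 dense embedding dimension
xb = np.random.rand(20_000, d).astype("float32")    # stand-in for real document embeddings
faiss.normalize_L2(xb)

nlist = 256                                         # number of coarse clusters
quantizer = faiss.IndexFlatIP(d)
index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_INNER_PRODUCT)

index.train(xb)        # train the coarse quantizer on representative vectors
index.add(xb)
index.nprobe = 16      # clusters probed per query: recall/latency trade-off

faiss.write_index(index, "index.ivf.faiss")
```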
Chunking strategy (a sketch follows this list):

- Target 256-512 token chunks with sentence boundary preservation
- Include document metadata (titles, sections) for context-aware reranking
- Experiment with overlapping chunks for long documents
- Maintain chunk-to-source mappings for citation generation
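A simple illustrative chunker along these lines (word counts stand in for exact token counts; the project's real chunking may differ):

```python
import re


def chunk_text(text: str, target_words: int = 300, overlap_sents: int = 1):
    """Greedy sentence-boundary chunking with a small sentence overlap between chunks."""
    sents = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, fresh = [], [], 0
    for sent in sents:
        current.append(sent)
        fresh += 1
        if sum(len(s.split()) for s in current) >= target_words:
            chunks.append(" ".join(current))
            current = current[-overlap_sents:]   # carry overlap into the next chunk
            fresh = 0
    if fresh:                                    # flush any remaining new sentences
        chunks.append(" ".join(current))
    return chunks
```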
Reranking performance (a sketch follows this list):

- Batch query-passage pairs to maximize GPU utilization
- Keep reranker models in memory (avoid reload overhead)
- Consider INT8 quantization for throughput improvements
- Implement caching for frequent query patterns
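A minimal sketch of batched scoring with a reranker kept in memory plus a small cache for repeated requests (batch size and cache policy are assumptions):

```python
from functools import lru_cache

from sentence_transformers import CrossEncoder

# Load once at startup and keep in memory to avoid per-request reload overhead.
_reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")


def rerank(query: str, passages: list[str], batch_size: int = 32) -> list[float]:
    """Score all (query, passage) pairs in batches to keep the model busy."""
    pairs = [(query, p) for p in passages]
    return _reranker.predict(pairs, batch_size=batch_size).tolist()


@lru_cache(maxsize=1024)
def cached_rerank(query: str, passages: tuple[str, ...]) -> tuple[float, ...]:
    """Cache scores for repeated query/passage combinations (hashable arguments only)."""
    return tuple(rerank(query, list(passages)))
```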
Answer generation (a sketch follows this list):

- Use structured prompts with clear instructions and formatting
- Enforce maximum context length to prevent truncation issues
- Implement evidence quality checks (refuse low-confidence answers)
- Add citation validation to ensure answer grounding
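One way to structure a grounded prompt with numbered evidence and a refusal path (a sketch; the repository's actual prompt template may differ, and `min_rerank` is an illustrative threshold):

```python
def build_prompt(query: str, passages: list[dict], min_rerank: float = 0.0) -> str:
    """Assemble a grounded prompt; return "" when no passage clears the confidence bar."""
    evidence = [p for p in passages if p["rerank"] >= min_rerank]
    if not evidence:
        return ""  # caller can answer "not enough evidence" instead of calling the LLM
    numbered = "\n".join(f"[{i + 1}] {p['text']}" for i, p in enumerate(evidence))
    return (
        "Answer the question using ONLY the numbered passages below. "
        "Cite passages as [1], [2], ... and say so if the answer is not in them.\n\n"
        f"Passages:\n{numbered}\n\nQuestion: {query}\nAnswer:"
    )
```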
Key metrics to monitor:
- Query latency by pipeline stage (retrieval, reranking, generation)
- Hit rate and median rank of first relevant document
- Cache hit rates for embeddings and results
- Model inference times and resource utilization
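The per-stage latencies listed above can be captured with simple timers around each pipeline step, for example (`search`, `rerank`, and `generate` in the usage comments are placeholders for the pipeline functions):

```python
import time
from contextlib import contextmanager

timings: dict[str, float] = {}


@contextmanager
def timed(stage: str):
    """Record wall-clock time (in milliseconds) for a named pipeline stage."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = (time.perf_counter() - start) * 1000

# Usage inside a request handler (illustrative):
# with timed("retrieval"):  hits = search(query, k)
# with timed("reranking"):  hits = rerank(query, hits)
# with timed("generation"): answer = generate(prompt)
```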
Security and privacy (a validation sketch follows this list):

- Implement input validation and sanitization
- Avoid indexing personally identifiable information (PII)
- Add content filtering for sensitive domains
- Use allowlists for external API calls
- Implement rate limiting and authentication for production APIs
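Input validation can live at the API boundary using the Pydantic models FastAPI already relies on; the bounds below are illustrative:

```python
from pydantic import BaseModel, Field


class SearchRequest(BaseModel):
    """Validated /search payload: bounded query length and candidate count."""

    query: str = Field(..., min_length=1, max_length=1000)
    k: int = Field(5, ge=1, le=50)
```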
aurora-rag/
├── app/
│ └── server.py # FastAPI application with REST endpoints
├── inference/
│ ├── encode_index.py # FAISS index builder
│ ├── search.py # Retrieval and reranking logic
│ └── generate.py # LLM inference wrapper
├── training/
│ ├── eval_ir.py # IR evaluation metrics (MRR, nDCG)
│ └── (additional utilities)
├── data/
│ └── sample_docs/ # Example document corpus
├── requirements.txt # Python dependencies
├── Dockerfile # Container definition
└── README.md # This file
ModuleNotFoundError for inference or training
Ensure `__init__.py` files exist in the module directories and run modules with the `-m` flag:

```bash
python -m training.eval_ir
```

Long initial startup time
First run downloads models from Hugging Face Hub. To mitigate:
- Pin model versions in `requirements.txt`
- Pre-cache models in the CI/CD pipeline
- Use Docker images with pre-downloaded models
Ollama not being used for answer generation
Verify environment variables are set:
```bash
echo $OLLAMA_URL
echo $OLLAMA_MODEL
```

If these are not configured, the /answer endpoint returns synthesized responses from retrieved passages without LLM generation.
FAISS index not found
Run the index builder before starting the API:
```bash
python inference/encode_index.py
```

For additional support:
- Check existing GitHub Issues
- Review the API documentation at `/docs` when the server is running
- Create a new issue with:
  - Reproduction steps
  - Error logs
  - Output of `pip freeze`
  - System information (OS, Python version)
We welcome contributions! To contribute:
- Fork the repository
- Create a feature branch (`git checkout -b feature/improvement`)
- Make your changes with clear commit messages
- Add or update tests as needed
- Submit a pull request
For major changes, please open an issue first to discuss the proposed modifications.
```bash
pip install -r requirements-dev.txt  # If available
pre-commit install                   # If using pre-commit hooks
```

This project is licensed under the MIT License. See the LICENSE file for details.
AuroraRAG builds upon excellent open-source projects:
- BAAI BGE Models for multilingual embeddings
- FAISS for efficient similarity search
- Ollama for local LLM inference
- FastAPI for API framework
Topics: rag · faiss · information-retrieval · reranking · transformers · fastapi · gradio · ollama · multilingual · evaluation · bge · semantic-search
Questions about implementation, scaling, or evaluation strategies?
Open a discussion or reach out via GitHub Issues.