A hands-on project to build a simple RAG system using LangChain, ChromaDB, and Google Gemini embeddings. Designed for learning and demo purposes: no paid OpenAI API key needed!
This repo shows how to:
- Load and split PDF documents into chunks
- Generate text embeddings using Google Gemini (or OpenAI if available)
- Store and query embeddings with ChromaDB (vector DB)
- Build a lightweight Retrieval-Augmented Generation pipeline for search and question answering (see the ingestion sketch after the feature list below)
- Use LangChain as an orchestrator for embeddings and retrieval pipelines
- PDF ingestion with metadata tracking
- Text splitting with overlap for context preservation
- Embedding generation via the Gemini API
- Persistent vector store with ChromaDB
- Query interface with top-k retrieval
- GitHub Actions for CI testing (mocked embedding generation)
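To make the pipeline concrete, here is a minimal ingestion sketch. It assumes the `langchain-community`, `langchain-text-splitters`, `langchain-google-genai`, and `langchain-chroma` packages, the `models/embedding-001` Gemini model, a `chroma_db` persist directory, and illustrative chunk sizes; the actual `src/ingest.py` may differ:

```python
from pathlib import Path

from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_chroma import Chroma

# Load every PDF in data/pdfs; each page becomes a Document with source/page metadata.
docs = []
for pdf_path in Path("data/pdfs").glob("*.pdf"):
    docs.extend(PyPDFLoader(str(pdf_path)).load())

# Split into overlapping chunks so context is preserved across chunk boundaries.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(docs)

# Embed chunks with Gemini and persist them in a local ChromaDB collection.
embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")
vectordb = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="chroma_db",
)
print(f"Indexed {len(chunks)} chunks")
```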
- Python 3.10+
- Google Gemini API key (set `GOOGLE_API_KEY` in your `.env`)
- (Optional) OpenAI API key if you want to switch embedding providers
```bash
git clone https://github.com/yourusername/mini-rag.git
cd mini-rag
python -m venv venv
source venv/bin/activate   # macOS/Linux
# venv\Scripts\activate    # Windows
pip install -r requirements.txt
```

- Place your PDF files in the `data/pdfs` folder.
- Add your API key to the `.env` file:

```
GOOGLE_API_KEY=your_google_gemini_api_key_here
```
- Run ingestion to build the vector store:

```bash
python src/ingest.py
```

- Query your RAG system (add your own query interface or notebook; a minimal query sketch follows below).
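A minimal query sketch against the persisted store, reusing the same assumptions as the ingestion example (Chroma collection in `chroma_db`, `models/embedding-001` embeddings):

```python
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_chroma import Chroma

# Reopen the persisted ChromaDB collection with the same embedding model used at ingestion.
embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")
vectordb = Chroma(persist_directory="chroma_db", embedding_function=embeddings)

# Top-k retrieval: embed the query and return the k most similar chunks.
results = vectordb.similarity_search("What does the document say about X?", k=5)
for doc in results:
    print(doc.metadata.get("source"), "-", doc.page_content[:120])
```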
Run tests locally with:

```bash
pytest
```

GitHub Actions automatically runs the tests on every push and pull request.
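For CI, real Gemini calls can be replaced with a fake embedding class so tests run without an API key. The sketch below is one way to do this; the repo's actual tests may mock differently, and the file name `tests/test_retrieval.py` is only illustrative:

```python
# tests/test_retrieval.py (a sketch; the repo's real tests may differ)
from langchain_chroma import Chroma
from langchain_core.embeddings import Embeddings


class StubEmbeddings(Embeddings):
    """Deterministic stand-in for Gemini embeddings so CI needs no API key."""

    def embed_documents(self, texts):
        return [self._vector(t) for t in texts]

    def embed_query(self, text):
        return self._vector(text)

    @staticmethod
    def _vector(text, size=768):
        # Spread character codes across a fixed-size vector for a crude but stable embedding.
        vec = [0.0] * size
        for i, ch in enumerate(text):
            vec[i % size] += ord(ch) / 1000.0
        return vec


def test_topk_retrieval(tmp_path):
    db = Chroma.from_texts(
        ["chroma stores vectors", "gemini embeds text", "langchain orchestrates"],
        embedding=StubEmbeddings(),
        persist_directory=str(tmp_path),
    )
    # Top-k retrieval should return exactly k results.
    results = db.similarity_search("vectors", k=2)
    assert len(results) == 2
```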
- Gemini embedding API usage is currently limited by quota, so be mindful of your request volume.
- Embeddings are 768-dimensional vectors by default (see the quick check below).
- This is a learning/demo project, not production ready.
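If you want to confirm the dimensionality yourself, a single embedding call is enough (it does count against your Gemini quota); `models/embedding-001` is assumed here:

```python
from langchain_google_genai import GoogleGenerativeAIEmbeddings

# One real embedding call to inspect the vector size returned by Gemini.
embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")
print(len(embeddings.embed_query("dimensionality check")))  # 768 by default
```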