This project implements a Retrieval-Augmented Generation (RAG) pipeline using LangChain, ChromaDB, and Llama 3 via Ollama. It allows you to:
- Load and split PDFs into text chunks.
- Store the chunks in ChromaDB (a vector database) with embeddings generated by a Hugging Face model.
- Query ChromaDB for relevant chunks based on a user’s input.
- Generate responses using Llama 3, a large language model, based on the retrieved chunks.
This solution is ideal for building local, privacy-preserving question-answering systems over PDF documents without relying on external APIs.
🚀 Get started in under 2 minutes:
1. Clone and set up:

   ```bash
   git clone https://github.com/devdaviddr/local-pdf-rag-solution
   cd local-pdf-rag-solution
   ./setup.sh
   ```

2. Add your PDFs:

   ```bash
   cp /path/to/your/*.pdf source_pdf/
   ```

3. Run the application:

   ```bash
   ./run.sh  # For subsequent runs
   ```
This project includes two powerful automation scripts:
**setup.sh:**

- ✅ Creates a Python virtual environment (`venv`)
- ✅ Installs all dependencies from `requirements.txt`
- ✅ Validates system prerequisites (Python 3, required files)
- ✅ Creates the `source_pdf` directory if it doesn't exist
- ✅ Checks for PDF files and starts the application automatically
- ✅ Provides colored output with progress indicators
- ✅ Handles errors gracefully with informative messages

**run.sh:**

- ✅ Activates the existing virtual environment
- ✅ Validates the setup before starting
- ✅ Runs the application with default settings
- ✅ Perfect for daily use after initial setup
Why use the scripts?
- Consistent environment: No more "works on my machine" issues
- Error prevention: Automatic validation of prerequisites
- Time-saving: One command does everything
- Beginner-friendly: No need to remember complex commands
- PDF Processing:
  - Loads PDFs from a specified directory and splits them into smaller text chunks using `RecursiveCharacterTextSplitter`.
  - Extracts metadata such as page numbers and document title for better context.
- Embedding and Storage:
  - Generates embeddings for the text chunks using `HuggingFaceEmbeddings`.
  - Stores the embeddings and metadata in ChromaDB for efficient retrieval.
- Querying and Response Generation:
  - Retrieves relevant chunks from ChromaDB using similarity search.
  - Generates responses using Llama 3 via Ollama.
- Customizable:
  - Adjustable chunk size, chunk overlap, and number of retrieved chunks (`k`).
  - Persistent storage of ChromaDB data for reuse across sessions.
  - Supports reindexing of ChromaDB for updating or replacing stored documents.
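The adjustable chunk size and overlap can be illustrated with a simplified character-based splitter. This is a sketch only: the actual pipeline uses LangChain's `RecursiveCharacterTextSplitter`, which additionally tries to split on paragraph and sentence boundaries.

```python
def split_text(text, chunk_size=1000, chunk_overlap=200):
    """Simplified splitter: fixed-size character windows with overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

text = "x" * 300
chunks = split_text(text, chunk_size=100, chunk_overlap=20)
print(len(chunks))     # 4 chunks cover 300 characters
print(len(chunks[0]))  # each full chunk is 100 characters
```

The overlap means the tail of each chunk is repeated at the head of the next, so a sentence cut at a chunk boundary still appears whole in one of the two chunks.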
- Embedding PDFs:
  - The script loads PDFs from a directory, splits them into chunks, and stores the chunks in ChromaDB.
- Querying ChromaDB:
  - The user inputs a query, and the script retrieves the most relevant chunks from ChromaDB.
- Generating Responses:
  - The script uses Llama 3 to generate a response based on the retrieved chunks.
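"Most relevant chunks" in the querying step means nearest neighbors in embedding space. As a rough sketch of the idea (ChromaDB does this internally over real embeddings; the tiny 3-dimensional vectors here are purely illustrative):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query_vec, chunk_vecs, k=2):
    """Indices of the k chunk vectors most similar to the query vector."""
    ranked = sorted(range(len(chunk_vecs)),
                    key=lambda i: cosine_similarity(query_vec, chunk_vecs[i]),
                    reverse=True)
    return ranked[:k]

chunk_vecs = [(1.0, 0.0, 0.0), (0.9, 0.1, 0.0), (0.0, 1.0, 0.0)]
query_vec = (1.0, 0.05, 0.0)
print(top_k(query_vec, chunk_vecs))  # → [0, 1]
```

The chunks whose vectors score highest are passed to Llama 3 as context for the answer.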
- Python 3.8+ installed on your system
- Ollama installed and running locally (for Llama 3)
  - Install Ollama from https://ollama.ai/
  - Pull the Llama 3.2 model: `ollama pull llama3.2`
- Git for cloning the repository
- At least 2GB of free disk space for models and vector database
1. Clone the repository:

   ```bash
   git clone https://github.com/devdaviddr/local-pdf-rag-solution
   cd local-pdf-rag-solution
   ```

2. One-command setup:

   ```bash
   # Run automated setup script - creates venv, installs dependencies, and starts the app
   ./setup.sh
   ```

   The setup script will:
   - ✅ Create a Python virtual environment (`venv`)
   - ✅ Install all dependencies from `requirements.txt`
   - ✅ Check for PDF files in the `source_pdf` directory
   - ✅ Automatically start the application

3. For subsequent runs (after initial setup):

   ```bash
   # Quick run script for future sessions
   ./run.sh
   ```
If you prefer manual control or encounter issues with the automated setup:

```bash
# Create virtual environment
python3 -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Run the application
python app.py --data_directory source_pdf
```

Verify your Ollama installation:

```bash
# Check if Ollama is running
ollama list

# If Llama 3.2 is not installed, pull it
ollama pull llama3.2

# Test the model
ollama run llama3.2 "Hello, how are you?"
```

The setup script automatically creates a source_pdf directory. Place your PDF files there:

```bash
# Copy your PDF files to the source_pdf directory
cp /path/to/your/*.pdf source_pdf/
```

First-time setup:

```bash
# Complete setup and start the application
./setup.sh
```

Subsequent runs:

```bash
# Quick start for already configured environment
./run.sh
```

Manual execution (if needed):

```bash
# Activate virtual environment and run manually
source venv/bin/activate
python app.py --data_directory source_pdf
```

Once the app is running, you can:
- Ask questions about your documents
- Request summaries of specific topics
- Search for specific information
- Type `exit` or `quit` to end the session
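The interactive loop behind this is straightforward. A minimal sketch (the function and parameter names here are illustrative, not the actual app.py code):

```python
def chat_loop(inputs, respond):
    """Feed user inputs to `respond` until 'exit' or 'quit' is entered."""
    answers = []
    for line in inputs:
        if line.strip().lower() in {"exit", "quit"}:
            break  # session ends; later inputs are never read
        answers.append(respond(line))
    return answers

# Stub responder standing in for the retrieval + Llama 3 call
replies = chat_loop(["What are the main conclusions?", "exit", "never reached"],
                    lambda q: f"answer to: {q}")
print(replies)  # → ['answer to: What are the main conclusions?']
```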
```text
You: Summarize the key findings in the research paper
You: What are the main conclusions?
You: Explain the methodology used in the study
You: What are the limitations mentioned?
You: List the references cited
You: exit
```
Both setup.sh and run.sh include:
- ✅ Colored output for better readability
- ✅ Error handling with informative messages
- ✅ Prerequisite checks (Python, files, PDFs)
- ✅ Automatic cleanup of old environments
- ✅ Progress indicators throughout the process
The app supports several command-line arguments for customization:
| Argument | Description | Default | Example |
|---|---|---|---|
| `--data_directory` | Directory containing PDF files | `source_pdf` | `--data_directory custom_pdfs` |
| `--persist_directory` | Directory to store ChromaDB | `my_chroma_db` | `--persist_directory custom_db` |
| `--chunk_size` | Size of each text chunk | `1000` | `--chunk_size 500` |
| `--chunk_overlap` | Overlap between chunks | `200` | `--chunk_overlap 100` |
| `--reindex` | Reindex the ChromaDB collection | `False` | `--reindex` |
Advanced Usage Examples:

```bash
# Custom chunk settings for better context
source venv/bin/activate
python app.py --data_directory source_pdf --chunk_size 1500 --chunk_overlap 300

# Use different directories
python app.py --data_directory research_papers --persist_directory research_db

# Rebuild the entire database
python app.py --data_directory source_pdf --reindex
```

Tuning tips:

- For large documents: Increase `chunk_size` to 1500-2000
- For better context: Increase `chunk_overlap` to 300-400
- For faster processing: Decrease `chunk_size` to 500-800
- For more precise answers: Decrease `chunk_size` to 300-500
1. "No module named 'ollama'" error

   ```bash
   # If using automated setup, try running setup.sh again
   ./setup.sh

   # Or activate the virtual environment manually
   source venv/bin/activate
   pip install ollama
   ```

2. "Connection refused" when querying Llama

   ```bash
   # Start Ollama service
   ollama serve

   # In another terminal, test the connection
   ollama run llama3.2 "test"
   ```

3. "No such file or directory" for PDFs

   ```bash
   # Check if your source_pdf directory exists and contains PDF files
   ls -la source_pdf/

   # Ensure PDFs are readable
   file source_pdf/*.pdf
   ```

4. Setup script fails

   ```bash
   # Make sure the scripts are executable
   chmod +x setup.sh run.sh

   # Check Python 3 installation
   python3 --version

   # Run setup with verbose output
   bash -x setup.sh
   ```

5. ChromaDB permission errors

   ```bash
   # Check directory permissions
   ls -la my_chroma_db/

   # Fix permissions if needed
   chmod -R 755 my_chroma_db/
   ```

6. Out of memory errors

   ```bash
   # Reduce chunk size and try again
   source venv/bin/activate
   python app.py --data_directory source_pdf --chunk_size 500
   ```

7. Virtual environment issues

   ```bash
   # Remove and recreate the environment
   rm -rf venv
   ./setup.sh
   ```

Inspect ChromaDB Collection:

```bash
source venv/bin/activate
python chroma_cli.py --persist_directory my_chroma_db
```

Check Ollama Models:

```bash
ollama list
ollama show llama3.2
```

Verify Python Environment:

```bash
source venv/bin/activate
python --version
pip list | grep -E "(langchain|chromadb|ollama)"
```

Check Script Status:

```bash
# Test if scripts are executable
ls -la *.sh

# View script output with debugging
bash -x setup.sh
```

Best practices:

- Start small: Test with 1-2 PDFs before processing large collections
- Backup your ChromaDB: Copy the persist directory before reindexing
- Monitor resources: Large PDF collections can use significant disk space
- Use meaningful directory names: Name your persist directories descriptively
- Regular maintenance: Periodically reindex if you frequently update documents
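The backup tip above amounts to copying the persist directory before any `--reindex` run. In Python this could look like the following sketch (the directory names match this project's defaults; the temporary directory only stands in for a real database):

```python
import os
import shutil
import tempfile

def backup_chroma(persist_dir, backup_dir):
    """Copy the whole ChromaDB persist directory; fails rather than overwrite an existing backup."""
    shutil.copytree(persist_dir, backup_dir)

# Demonstration against a throwaway stand-in for my_chroma_db
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "my_chroma_db")
    os.makedirs(src)
    open(os.path.join(src, "chroma.sqlite3"), "w").close()
    backup_chroma(src, os.path.join(tmp, "my_chroma_db_backup"))
    print(sorted(os.listdir(os.path.join(tmp, "my_chroma_db_backup"))))  # → ['chroma.sqlite3']
```

`shutil.copytree` refuses to overwrite an existing destination by default, which is the safe behavior for backups.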
**Workflow 1: Research papers**

```bash
# 1. Create directory for research papers (or use source_pdf)
mkdir research_papers
cp *.pdf research_papers/

# 2. Process papers using custom directory
source venv/bin/activate
python app.py --data_directory research_papers --persist_directory research_db
```

```text
# 3. Query examples
You: What are the main research questions addressed?
You: Summarize the methodology section
You: What are the key findings?
```

**Workflow 2: Default setup**

```bash
# 1. Copy PDFs to default directory
cp /path/to/documents/*.pdf source_pdf/

# 2. Run automated setup
./setup.sh

# 3. For future sessions
./run.sh
```

**Workflow 3: Multiple collections**

```bash
# 1. Initial setup with existing documents
./setup.sh  # Uses source_pdf directory by default

# 2. Later, add new documents to a different collection
source venv/bin/activate
python app.py --data_directory new_documents --persist_directory doc_library

# 3. Query the combined library
python app.py --persist_directory doc_library
```

**Workflow 4: Knowledge base by topic**

```bash
# 1. Organize documents by topic
mkdir -p knowledge_base/{tech,finance,health}

# 2. Process each category separately
source venv/bin/activate
python app.py --data_directory knowledge_base/tech --persist_directory kb_tech
python app.py --data_directory knowledge_base/finance --persist_directory kb_finance

# 3. Query specific domains
python app.py --persist_directory kb_tech
```

The project includes a utility tool for inspecting and debugging your ChromaDB collections.
- Database Inspection: View all documents and metadata stored in your ChromaDB collection
- Debugging Aid: Verify that PDFs were properly processed and stored
- Data Exploration: Understand what text chunks exist in your vector database
- Quality Control: Ensure text chunks are meaningful and metadata is preserved
```bash
# Inspect your ChromaDB collection
source venv/bin/activate
python chroma_cli.py --persist_directory my_chroma_db
```

ChromaDB Schema and Metadata:

```text
Document 1:
Text: This is a chunk of text from page 1 of document.pdf...
Metadata: {'source': '/path/to/document.pdf', 'page': 1}
--------------------------------------------------
Document 2:
Text: Another chunk of text from the same document...
Metadata: {'source': '/path/to/document.pdf', 'page': 1}
--------------------------------------------------
```
- After processing PDFs to verify successful ingestion
- Before querying to understand what data is available
- When troubleshooting retrieval issues
- To validate chunk size and overlap settings
The text chunking strategy significantly affects answer quality:

```bash
# For technical documents (more context needed)
python app/app.py --data_directory pdfs --chunk_size 1500 --chunk_overlap 300

# For general documents (balanced approach)
python app/app.py --data_directory pdfs --chunk_size 1000 --chunk_overlap 200

# For quick processing (less context, faster)
python app/app.py --data_directory pdfs --chunk_size 500 --chunk_overlap 100
```

Modify the number of chunks retrieved for each query by editing the `k` parameter in the `query_chromadb` method in `app.py`:

```python
# In ChromaDBManager.query_chromadb method
results = db.similarity_search(query, k=10)  # Retrieve more chunks for better context
```

Organize different document collections:

```bash
# Legal documents
python app/app.py --data_directory legal_docs --persist_directory legal_db

# Technical manuals
python app/app.py --data_directory manuals --persist_directory technical_db

# Academic papers
python app/app.py --data_directory papers --persist_directory academic_db
```

You can set default values using environment variables:
```bash
export PDF_DATA_DIR="./pdfs"
export CHROMA_PERSIST_DIR="./my_chroma_db"
export CHUNK_SIZE=1000
export CHUNK_OVERLAP=200
```

The application follows a modular design pattern:
- `PDFProcessor`: Handles loading and splitting PDFs into text chunks with metadata preservation
- `ChromaDBManager`: Manages all ChromaDB operations including storage, querying, and reindexing
- `LLMResponseGenerator`: Interfaces with Ollama to generate responses using Llama 3.2
- `RAGPipeline`: Orchestrates the entire pipeline, coordinating between components
- Ingestion: PDFs → Text Chunks → Embeddings → ChromaDB
- Retrieval: User Query → Similarity Search → Relevant Chunks
- Generation: Relevant Chunks + Query → Llama 3.2 → Response
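The three stages map directly onto the components listed above. A skeletal sketch of how they might fit together (stub bodies only, written for illustration; the real classes call LangChain, ChromaDB, and Ollama):

```python
class PDFProcessor:
    """Stub: the real class loads PDFs and splits them into chunks with metadata."""
    def load_and_split(self, directory):
        return [{"text": "example chunk", "metadata": {"source": directory, "page": 1}}]

class ChromaDBManager:
    """Stub: the real class embeds chunks and runs similarity search in ChromaDB."""
    def __init__(self):
        self._chunks = []
    def store(self, chunks):
        self._chunks.extend(chunks)
    def query(self, question, k=5):
        return self._chunks[:k]

class LLMResponseGenerator:
    """Stub: the real class sends the prompt and context to Llama 3.2 via Ollama."""
    def generate(self, question, context):
        return f"Answered '{question}' from {len(context)} chunk(s)"

class RAGPipeline:
    """Coordinates ingestion, retrieval, and generation."""
    def __init__(self, processor, db, llm):
        self.processor, self.db, self.llm = processor, db, llm
    def ingest(self, directory):
        self.db.store(self.processor.load_and_split(directory))
    def ask(self, question):
        return self.llm.generate(question, self.db.query(question))

pipeline = RAGPipeline(PDFProcessor(), ChromaDBManager(), LLMResponseGenerator())
pipeline.ingest("source_pdf")
print(pipeline.ask("What is on page 1?"))  # → Answered 'What is on page 1?' from 1 chunk(s)
```

Keeping the stages behind separate classes is what makes swapping the embedding model or the LLM a local change rather than a rewrite.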
```text
my_chroma_db/
├── chroma.sqlite3   # ChromaDB database
├── index/           # Vector indices
└── logs/            # Operation logs
```
- Primary: PDF files (.pdf)
- Future: The architecture supports extending to other document types
- RAM Usage: ~2-4GB for typical document collections
- Disk Usage: ~10-50MB per 100 pages of PDFs
- Processing Speed: ~1-5 seconds per page depending on content complexity
- LangChain: Framework for document loading, text splitting, and vector storage
- ChromaDB: Vector database for storing and retrieving embeddings
- Hugging Face Transformers: For generating text embeddings (all-MiniLM-L6-v2 model)
- Ollama: For running Llama 3.2 locally
- PyPDF: For PDF text extraction
- Operating System: macOS, Linux, Windows (WSL recommended)
- Python: 3.8 or higher
- Memory: Minimum 4GB RAM, 8GB recommended
- Storage: 2GB free space for models and databases
- Network: Internet connection for initial model downloads
- Embedding Model: `all-MiniLM-L6-v2` (384 dimensions, ~90MB)
- Language Model: `llama3.2` (varies by size: 1B or 3B parameters)
- Download Time: 5-30 minutes depending on model size and connection speed
**Q: How many PDFs can the system handle?**

A: The system can handle hundreds of PDFs, limited mainly by available disk space and memory. Start with 10-20 documents for testing.

**Q: Can I use a different language model?**

A: Yes! Modify the model parameter in the `LLMResponseGenerator.generate_response()` method. Ensure the model is available in Ollama:

```bash
ollama pull mistral  # Example alternative model
```

**Q: What if the answers are low quality?**

A: Try these approaches:

- Increase `chunk_overlap` for better context continuity
- Adjust the `k` parameter to retrieve more relevant chunks
- Use more specific queries
- Ensure your PDFs have good text quality (not just scanned images)

**Q: Does it work with scanned PDFs?**

A: This version works with text-based PDFs. For scanned PDFs, you'll need OCR preprocessing. Consider using tools like Tesseract or Adobe Acrobat to convert scanned PDFs to text-searchable format first.

**Q: Is my data private?**

A: Yes! Everything runs locally on your machine. No data is sent to external servers. Your documents and queries remain completely private.

**Q: How do I back up my database?**

A: Simply copy the entire persist directory:

```bash
cp -r my_chroma_db my_chroma_db_backup
```

**Q: Can it run on a server?**

A: Yes! The application can run on servers. For remote access, you might want to add a web interface using Streamlit or Flask.
- Web Interface: Streamlit or Gradio UI for easier interaction
- Multi-format Support: Word documents, PowerPoint, text files
- OCR Integration: Support for scanned PDFs and images
- Advanced Search: Boolean operators, date filtering, source filtering
- Export Features: Save conversations, export chunks, generate reports
- Multi-language Support: Documents in languages other than English
- Cloud Deployment: Docker containers and cloud deployment guides
We encourage contributions in these areas:
- UI improvements and web interfaces
- Support for additional document formats
- Performance optimizations
- Documentation translations
- Example use cases and tutorials
This project is open-source and available under the MIT License.
Contributions are welcome! Please follow these steps:
1. Fork the repository.
2. Create a new branch (`git checkout -b feature/YourFeature`).
3. Commit your changes (`git commit -m 'Add some feature'`).
4. Push to the branch (`git push origin feature/YourFeature`).
5. Open a pull request.
- LangChain for providing the framework for building RAG pipelines.
- ChromaDB for efficient vector storage and retrieval.
- Ollama for making Llama 3 accessible locally.
If you encounter any issues or have questions, please open an issue on the GitHub repository.