
Local PDF RAG Solution

Project Overview

This project implements a Retrieval-Augmented Generation (RAG) pipeline using LangChain, ChromaDB, and Llama 3.2 via Ollama. It allows you to:

  1. Load and split PDFs into text chunks.
  2. Store the chunks in ChromaDB (a vector database) with embeddings generated by a Hugging Face model.
  3. Query ChromaDB for relevant chunks based on a user’s input.
  4. Generate responses with Llama 3.2, a locally hosted large language model, grounded in the retrieved chunks.

This solution is ideal for building local, privacy-preserving question-answering systems over PDF documents without relying on external APIs.


Quick Start

🚀 Get started in under 2 minutes:

  1. Clone and setup:

    git clone https://github.com/devdaviddr/local-pdf-rag-solution
    cd local-pdf-rag-solution
    ./setup.sh
  2. Add your PDFs:

    cp /path/to/your/*.pdf source_pdf/
  3. Run the application:

    ./run.sh  # For subsequent runs

Automated Setup Scripts

This project includes two powerful automation scripts:

setup.sh - Complete Environment Setup

  • ✅ Creates Python virtual environment (venv)
  • ✅ Installs all dependencies from requirements.txt
  • ✅ Validates system prerequisites (Python 3, required files)
  • ✅ Creates source_pdf directory if it doesn't exist
  • ✅ Checks for PDF files and starts the application automatically
  • ✅ Provides colored output with progress indicators
  • ✅ Handles errors gracefully with informative messages

run.sh - Quick Application Launcher

  • ✅ Activates the existing virtual environment
  • ✅ Validates the setup before starting
  • ✅ Runs the application with default settings
  • ✅ Perfect for daily use after initial setup

Why use the scripts?

  • Consistent environment: No more "works on my machine" issues
  • Error prevention: Automatic validation of prerequisites
  • Time-saving: One command does everything
  • Beginner-friendly: No need to remember complex commands

Key Features

  • PDF Processing:
    • Loads PDFs from a specified directory and splits them into smaller text chunks using RecursiveCharacterTextSplitter.
    • Extracts metadata such as page numbers and document title for better context.
  • Embedding and Storage:
    • Generates embeddings for the text chunks using HuggingFaceEmbeddings.
    • Stores the embeddings and metadata in ChromaDB for efficient retrieval.
  • Querying and Response Generation:
    • Retrieves relevant chunks from ChromaDB using similarity search.
    • Generates responses using Llama 3.2 via Ollama.
  • Customizable:
    • Adjustable chunk size, chunk overlap, and number of retrieved chunks (k).
    • Persistent storage of ChromaDB data for reuse across sessions.
    • Supports reindexing of ChromaDB for updating or replacing stored documents.

How It Works

  1. Embedding PDFs:
    • The script loads PDFs from a directory, splits them into chunks, and stores the chunks in ChromaDB.
  2. Querying ChromaDB:
    • The user inputs a query, and the script retrieves the most relevant chunks from ChromaDB.
  3. Generating Responses:
    • The script uses Llama 3.2 to generate a response based on the retrieved chunks.
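The three steps above can be condensed into a short sketch. This is illustrative, not the project's actual app.py: the function names, prompt format, and embedding model identifier are assumptions, exact LangChain import paths vary between versions, and it presumes the ChromaDB store has already been populated by the ingestion step.

```python
def build_prompt(query, chunks):
    # Combine the retrieved chunks into one grounded prompt for the LLM.
    context = "\n\n".join(chunks)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )

def answer(query, persist_directory="my_chroma_db", k=5):
    # Imports are deferred so build_prompt stays usable without the heavy
    # dependencies; import paths shown are from langchain_community.
    from langchain_community.embeddings import HuggingFaceEmbeddings
    from langchain_community.vectorstores import Chroma
    import ollama

    embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
    db = Chroma(persist_directory=persist_directory, embedding_function=embeddings)
    docs = db.similarity_search(query, k=k)                  # step 2: retrieval
    prompt = build_prompt(query, [d.page_content for d in docs])
    reply = ollama.chat(model="llama3.2",                    # step 3: generation
                        messages=[{"role": "user", "content": prompt}])
    return reply["message"]["content"]
```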

Getting Started

Prerequisites

  • Python 3.8+ installed on your system
  • Ollama installed and running locally (for Llama 3.2)
  • Git for cloning the repository
  • At least 2GB of free disk space for models and vector database

Installation

Quick Start (Recommended)

  1. Clone the repository:

    git clone https://github.com/devdaviddr/local-pdf-rag-solution
    cd local-pdf-rag-solution
  2. One-command setup:

    # Run automated setup script - creates venv, installs dependencies, and starts the app
    ./setup.sh

    The setup script will:

    • ✅ Create a Python virtual environment (venv)
    • ✅ Install all dependencies from requirements.txt
    • ✅ Check for PDF files in source_pdf directory
    • ✅ Automatically start the application
  3. For subsequent runs (after initial setup):

    # Quick run script for future sessions
    ./run.sh

Manual Setup (Alternative)

If you prefer manual control or encounter issues with the automated setup:

# Create virtual environment
python3 -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Run the application
python app.py --data_directory source_pdf

Verify Ollama and Llama 3.2 Installation

# Check if Ollama is running
ollama list

# If Llama 3.2 is not installed, pull it
ollama pull llama3.2

# Test the model
ollama run llama3.2 "Hello, how are you?"

Usage

Step 1: Prepare Your PDFs

The setup script automatically creates a source_pdf directory. Place your PDF files there:

# Copy your PDF files to the source_pdf directory
cp /path/to/your/*.pdf source_pdf/

Step 2: Run the Application

First-time setup:

# Complete setup and start the application
./setup.sh

Subsequent runs:

# Quick start for already configured environment
./run.sh

Manual execution (if needed):

# Activate virtual environment and run manually
source venv/bin/activate
python app.py --data_directory source_pdf

Step 3: Interactive Querying

Once the app is running, you can:

  • Ask questions about your documents
  • Request summaries of specific topics
  • Search for specific information
  • Type exit or quit to end the session

Common Query Examples

You: Summarize the key findings in the research paper
You: What are the main conclusions?
You: Explain the methodology used in the study
You: What are the limitations mentioned?
You: List the references cited
You: exit

Script Features

Both setup.sh and run.sh include:

  • Colored output for better readability
  • Error handling with informative messages
  • Prerequisite checks (Python, files, PDFs)
  • Automatic cleanup of old environments
  • Progress indicators throughout the process

Advanced Configuration and Troubleshooting

Command Line Arguments

The app supports several command-line arguments for customization:

Argument            | Description                      | Default      | Example
--------------------|----------------------------------|--------------|------------------------------
--data_directory    | Directory containing PDF files   | source_pdf   | --data_directory custom_pdfs
--persist_directory | Directory to store ChromaDB      | my_chroma_db | --persist_directory custom_db
--chunk_size        | Size of each text chunk          | 1000         | --chunk_size 500
--chunk_overlap     | Overlap between chunks           | 200          | --chunk_overlap 100
--reindex           | Reindex the ChromaDB collection  | False        | --reindex
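The table above maps directly onto an argparse definition. The following is a plausible reconstruction, not the project's actual parser; option names and defaults are taken from the table:

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(description="Local PDF RAG pipeline")
    parser.add_argument("--data_directory", default="source_pdf",
                        help="Directory containing PDF files")
    parser.add_argument("--persist_directory", default="my_chroma_db",
                        help="Directory to store ChromaDB")
    parser.add_argument("--chunk_size", type=int, default=1000,
                        help="Size of each text chunk in characters")
    parser.add_argument("--chunk_overlap", type=int, default=200,
                        help="Overlap between consecutive chunks")
    parser.add_argument("--reindex", action="store_true",
                        help="Rebuild the ChromaDB collection from scratch")
    return parser

args = build_parser().parse_args([])   # no flags: the defaults from the table
print(args.data_directory, args.chunk_size, args.reindex)  # source_pdf 1000 False
```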

Advanced Usage Examples:

# Custom chunk settings for better context
source venv/bin/activate
python app.py --data_directory source_pdf --chunk_size 1500 --chunk_overlap 300

# Use different directories
python app.py --data_directory research_papers --persist_directory research_db

# Rebuild the entire database
python app.py --data_directory source_pdf --reindex

Performance Tuning

  • For large documents: Increase chunk_size to 1500-2000
  • For better context: Increase chunk_overlap to 300-400
  • For faster processing: Decrease chunk_size to 500-800
  • For more precise answers: Decrease chunk_size to 300-500
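To see concretely how chunk_size and chunk_overlap interact, here is a toy stdlib-only splitter. The real pipeline uses LangChain's RecursiveCharacterTextSplitter, which additionally respects separators like paragraphs and sentences; this simplified version only shows the fixed-size-with-overlap behavior:

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Toy splitter: each chunk starts chunk_size - chunk_overlap characters
    after the previous one, so consecutive chunks share chunk_overlap
    characters of context."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size]
            for i in range(0, len(text), step)
            if text[i:i + chunk_size]]

chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

Larger overlap means each boundary sentence appears in two chunks (better context continuity), at the cost of more chunks to embed and store.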

Troubleshooting

Common Issues and Solutions

1. "No module named 'ollama'" error

# If using automated setup, try running setup.sh again
./setup.sh

# Or activate the virtual environment manually
source venv/bin/activate
pip install ollama

2. "Connection refused" when querying Llama

# Start Ollama service
ollama serve
# In another terminal, test the connection
ollama run llama3.2 "test"

3. "No such file or directory" for PDFs

# Check if your source_pdf directory exists and contains PDF files
ls -la source_pdf/
# Ensure PDFs are readable
file source_pdf/*.pdf

4. Setup script fails

# Make sure the script is executable
chmod +x setup.sh run.sh

# Check Python 3 installation
python3 --version

# Run setup with verbose output
bash -x setup.sh

5. ChromaDB permission errors

# Check directory permissions
ls -la my_chroma_db/
# Fix permissions if needed
chmod -R 755 my_chroma_db/

6. Out of memory errors

# Reduce chunk size and try again
source venv/bin/activate
python app.py --data_directory source_pdf --chunk_size 500

7. Virtual environment issues

# Remove and recreate the environment
rm -rf venv
./setup.sh

Debugging Tools

Inspect ChromaDB Collection:

source venv/bin/activate
python chroma_cli.py --persist_directory my_chroma_db

Check Ollama Models:

ollama list
ollama show llama3.2

Verify Python Environment:

source venv/bin/activate
python --version
pip list | grep -E "(langchain|chromadb|ollama)"

Check Script Status:

# Test if scripts are executable
ls -la *.sh

# View script output with debugging
bash -x setup.sh

Best Practices

  1. Start small: Test with 1-2 PDFs before processing large collections
  2. Backup your ChromaDB: Copy the persist directory before reindexing
  3. Monitor resources: Large PDF collections can use significant disk space
  4. Use meaningful directory names: Name your persist directories descriptively
  5. Regular maintenance: Periodically reindex if you frequently update documents

Example Workflows

Workflow 1: Research Paper Analysis

# 1. Create directory for research papers (or use source_pdf)
mkdir research_papers
cp *.pdf research_papers/

# 2. Process papers using custom directory
source venv/bin/activate
python app.py --data_directory research_papers --persist_directory research_db

# 3. Query examples
You: What are the main research questions addressed?
You: Summarize the methodology section
You: What are the key findings?

Workflow 2: Quick Start with Default Setup

# 1. Copy PDFs to default directory
cp /path/to/documents/*.pdf source_pdf/

# 2. Run automated setup
./setup.sh

# 3. For future sessions
./run.sh

Workflow 3: Document Library Management

# 1. Initial setup with existing documents
./setup.sh  # Uses source_pdf directory by default

# 2. Later, add new documents to a different collection
source venv/bin/activate
python app.py --data_directory new_documents --persist_directory doc_library

# 3. Query the combined library
python app.py --persist_directory doc_library

Workflow 4: Personal Knowledge Base

# 1. Organize documents by topic
mkdir -p knowledge_base/{tech,finance,health}

# 2. Process each category separately
source venv/bin/activate
python app.py --data_directory knowledge_base/tech --persist_directory kb_tech
python app.py --data_directory knowledge_base/finance --persist_directory kb_finance

# 3. Query specific domains
python app.py --persist_directory kb_tech

Utilities and Tools

ChromaDB CLI (chroma_cli.py)

The project includes a utility tool for inspecting and debugging your ChromaDB collections.

Purpose:

  • Database Inspection: View all documents and metadata stored in your ChromaDB collection
  • Debugging Aid: Verify that PDFs were properly processed and stored
  • Data Exploration: Understand what text chunks exist in your vector database
  • Quality Control: Ensure text chunks are meaningful and metadata is preserved

Usage:

# Inspect your ChromaDB collection
source venv/bin/activate
python chroma_cli.py --persist_directory my_chroma_db

Example Output:

ChromaDB Schema and Metadata:
Document 1:
Text: This is a chunk of text from page 1 of document.pdf...
Metadata: {'source': '/path/to/document.pdf', 'page': 1}
--------------------------------------------------
Document 2:
Text: Another chunk of text from the same document...
Metadata: {'source': '/path/to/document.pdf', 'page': 1}
--------------------------------------------------

When to Use:

  • After processing PDFs to verify successful ingestion
  • Before querying to understand what data is available
  • When troubleshooting retrieval issues
  • To validate chunk size and overlap settings
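For reference, an inspection script like chroma_cli.py can be approximated in a few lines against the chromadb client API. This is a sketch under assumptions, not the bundled tool: the chromadb client API has changed across releases (for example, list_collections returns Collection objects in 0.4.x/0.5.x but may return only names in newer versions), so check your installed version:

```python
def format_record(i, text, metadata, width=80):
    # Pure helper: render one stored chunk like the example output above.
    snippet = text if len(text) <= width else text[:width] + "..."
    return f"Document {i}:\nText: {snippet}\nMetadata: {metadata}\n" + "-" * 50

def dump_collection(persist_directory="my_chroma_db"):
    # Deferred import so the formatting helper works without chromadb installed.
    import chromadb
    client = chromadb.PersistentClient(path=persist_directory)
    for collection in client.list_collections():
        data = collection.get()  # dict with "ids", "documents", "metadatas"
        for i, (text, meta) in enumerate(
                zip(data["documents"], data["metadatas"]), start=1):
            print(format_record(i, text, meta))
```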


Customization

Chunk Size and Overlap Optimization

The text chunking strategy significantly affects answer quality:

# For technical documents (more context needed)
python app.py --data_directory pdfs --chunk_size 1500 --chunk_overlap 300

# For general documents (balanced approach)
python app.py --data_directory pdfs --chunk_size 1000 --chunk_overlap 200

# For quick processing (less context, faster)
python app.py --data_directory pdfs --chunk_size 500 --chunk_overlap 100

Retrieval Configuration

Modify the number of chunks retrieved for each query by editing the k parameter in the query_chromadb method in app.py:

# In ChromaDBManager.query_chromadb method
results = db.similarity_search(query, k=10)  # Retrieve more chunks for better context

Custom Persist Directories

Organize different document collections:

# Legal documents
python app.py --data_directory legal_docs --persist_directory legal_db

# Technical manuals
python app.py --data_directory manuals --persist_directory technical_db

# Academic papers
python app.py --data_directory papers --persist_directory academic_db

Environment Variables

You can set default values using environment variables:

export PDF_DATA_DIR="./pdfs"
export CHROMA_PERSIST_DIR="./my_chroma_db"
export CHUNK_SIZE=1000
export CHUNK_OVERLAP=200
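The variable names above come from the export snippet; whether app.py actually reads them is an assumption. If you wire them up yourself, a minimal stdlib-only sketch of turning them into defaults looks like this:

```python
import os

def env_defaults():
    # Read the documented variables, falling back to the documented defaults.
    return {
        "data_directory": os.environ.get("PDF_DATA_DIR", "source_pdf"),
        "persist_directory": os.environ.get("CHROMA_PERSIST_DIR", "my_chroma_db"),
        "chunk_size": int(os.environ.get("CHUNK_SIZE", 1000)),
        "chunk_overlap": int(os.environ.get("CHUNK_OVERLAP", 200)),
    }
```

These defaults can then be passed to an argparse parser (e.g. as the `default=` values), so command-line flags still win over environment variables.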


Technical Architecture

Code Structure

The application follows a modular design pattern:

  • PDFProcessor: Handles loading and splitting PDFs into text chunks with metadata preservation
  • ChromaDBManager: Manages all ChromaDB operations including storage, querying, and reindexing
  • LLMResponseGenerator: Interfaces with Ollama to generate responses using Llama 3.2
  • RAGPipeline: Orchestrates the entire pipeline, coordinating between components
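The four components above suggest an interface along these lines. This is a skeleton with assumed method names (only query_chromadb and generate_response are mentioned elsewhere in this README), not the actual source:

```python
class PDFProcessor:
    def __init__(self, chunk_size=1000, chunk_overlap=200):
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap

    def load_and_split(self, data_directory):
        """Return text chunks with page/source metadata."""
        raise NotImplementedError

class ChromaDBManager:
    def __init__(self, persist_directory="my_chroma_db"):
        self.persist_directory = persist_directory

    def store(self, chunks):
        raise NotImplementedError

    def query_chromadb(self, query, k=5):  # method name cited in this README
        raise NotImplementedError

class LLMResponseGenerator:
    def generate_response(self, query, chunks, model="llama3.2"):
        raise NotImplementedError

class RAGPipeline:
    """Orchestrates the components: ingest once, then answer queries."""
    def __init__(self, processor, db, llm):
        self.processor, self.db, self.llm = processor, db, llm

    def answer(self, query):
        chunks = self.db.query_chromadb(query)
        return self.llm.generate_response(query, chunks)
```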

Data Flow

  1. Ingestion: PDFs → Text Chunks → Embeddings → ChromaDB
  2. Retrieval: User Query → Similarity Search → Relevant Chunks
  3. Generation: Relevant Chunks + Query → Llama 3.2 → Response

Storage Structure

my_chroma_db/
├── chroma.sqlite3          # ChromaDB database
├── index/                  # Vector indices
└── logs/                   # Operation logs

Supported File Types

  • Primary: PDF files (.pdf)
  • Future: The architecture supports extending to other document types

Memory and Performance

  • RAM Usage: ~2-4GB for typical document collections
  • Disk Usage: ~10-50MB per 100 pages of PDFs
  • Processing Speed: ~1-5 seconds per page depending on content complexity


Dependencies

Core Dependencies

  • LangChain: Framework for document loading, text splitting, and vector storage
  • ChromaDB: Vector database for storing and retrieving embeddings
  • Hugging Face Transformers: For generating text embeddings (all-MiniLM-L6-v2 model)
  • Ollama: For running Llama 3.2 locally
  • PyPDF: For PDF text extraction

System Requirements

  • Operating System: macOS, Linux, Windows (WSL recommended)
  • Python: 3.8 or higher
  • Memory: Minimum 4GB RAM, 8GB recommended
  • Storage: 2GB free space for models and databases
  • Network: Internet connection for initial model downloads

Model Information

  • Embedding Model: all-MiniLM-L6-v2 (384 dimensions, ~90MB)
  • Language Model: llama3.2 (varies by size: 1B, 3B, or 8B parameters)
  • Download Time: 5-30 minutes depending on model size and connection speed

Frequently Asked Questions (FAQ)

Q: How many PDFs can I process at once?

A: The system can handle hundreds of PDFs, limited mainly by available disk space and memory. Start with 10-20 documents for testing.

Q: Can I use a different language model?

A: Yes! Modify the model parameter in the LLMResponseGenerator.generate_response() method. Ensure the model is available in Ollama:

ollama pull mistral  # Example alternative model

Q: How do I improve answer accuracy?

A: Try these approaches:

  • Increase chunk_overlap for better context continuity
  • Adjust k parameter to retrieve more relevant chunks
  • Use more specific queries
  • Ensure your PDFs have good text quality (not just scanned images)

Q: Can I use this with scanned PDFs?

A: This version works with text-based PDFs. For scanned PDFs, you'll need OCR preprocessing. Consider using tools like Tesseract or Adobe Acrobat to convert scanned PDFs to text-searchable format first.

Q: Is my data secure?

A: Yes! Everything runs locally on your machine. No data is sent to external servers. Your documents and queries remain completely private.

Q: How do I back up my vector database?

A: Simply copy the entire persist directory:

cp -r my_chroma_db my_chroma_db_backup

Q: Can I run this on a server?

A: Yes! The application can run on servers. For remote access, you might want to add a web interface using Streamlit or Flask.


Roadmap and Future Enhancements

Planned Features

  • Web Interface: Streamlit or Gradio UI for easier interaction
  • Multi-format Support: Word documents, PowerPoint, text files
  • OCR Integration: Support for scanned PDFs and images
  • Advanced Search: Boolean operators, date filtering, source filtering
  • Export Features: Save conversations, export chunks, generate reports
  • Multi-language Support: Documents in languages other than English
  • Cloud Deployment: Docker containers and cloud deployment guides

Community Contributions Welcome

We encourage contributions in these areas:

  • UI improvements and web interfaces
  • Support for additional document formats
  • Performance optimizations
  • Documentation translations
  • Example use cases and tutorials

License

This project is open-source and available under the MIT License.


Contributing

Contributions are welcome! Please follow these steps:

  1. Fork the repository.
  2. Create a new branch (git checkout -b feature/YourFeature).
  3. Commit your changes (git commit -m 'Add some feature').
  4. Push to the branch (git push origin feature/YourFeature).
  5. Open a pull request.

Acknowledgments

  • LangChain for providing the framework for building RAG pipelines.
  • ChromaDB for efficient vector storage and retrieval.
  • Ollama for making Llama 3.2 accessible locally.

Support

If you encounter any issues or have questions, please open an issue on the GitHub repository.


