DhanushPrince/Rag

RAG Pipeline with LangChain and Streamlit

A Retrieval-Augmented Generation (RAG) pipeline that lets users upload documents, process them, and ask questions about their content, built with LangChain and Meta's Llama 3 model served through the Groq API.

Features

  • Upload and process multiple file types:
    • PDF documents
    • Text files (.txt)
    • Images (.jpg, .png) with OCR support
    • Web page crawling (optional)
  • Document processing with chunking and vector storage
  • Question answering with source citations
  • Modern Streamlit UI
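
To make the chunking step above concrete, here is a minimal sketch of overlapping character-window chunking in plain Python. It is an illustration only: the project itself most likely relies on a LangChain text splitter, and the function name `chunk_text` and the default sizes are assumptions, not values taken from this repository.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> List[str]:
    """Split text into character windows of chunk_size that overlap by `overlap`.

    Hypothetical helper for illustration; not part of this repository.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each iteration
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


if __name__ == "__main__":
    sample = "abcdefghij" * 3  # 30 characters of dummy text
    for index, chunk in enumerate(chunk_text(sample, chunk_size=12, overlap=4)):
        print(index, repr(chunk))
```

The overlap keeps sentences that straddle a chunk boundary retrievable from at least one chunk, at the cost of storing some text twice.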

Prerequisites

  • Python 3.8+
  • Tesseract OCR installed on your system
  • Groq API key

Installing Tesseract OCR

macOS:

brew install tesseract

Ubuntu/Debian:

sudo apt-get update
sudo apt-get install tesseract-ocr

Windows: Download and install from: https://github.com/UB-Mannheim/tesseract/wiki
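
After installing, you may want to confirm that the `tesseract` executable is reachable from Python before uploading images. The check below is a small standard-library sketch; the helper names `find_tesseract` and `tesseract_version` are hypothetical and are not part of this project.

```python
import shutil
import subprocess
from typing import Optional


def find_tesseract() -> Optional[str]:
    """Return the full path of the tesseract executable, or None if it is not on PATH."""
    return shutil.which("tesseract")


def tesseract_version(executable: str) -> str:
    """Return the first line printed by `tesseract --version` (empty string if none)."""
    result = subprocess.run(
        [executable, "--version"], capture_output=True, text=True, check=False
    )
    # Some builds print the version banner to stderr instead of stdout.
    output = result.stdout or result.stderr
    lines = output.splitlines()
    return lines[0] if lines else ""


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("tesseract was not found on PATH; image OCR will not work")
    else:
        print("found:", path)
        print(tesseract_version(path))
```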

Installation

  1. Clone the repository:
git clone <repository-url>
cd rag-pipeline
  2. Create a virtual environment (optional but recommended):
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies:
pip install -r requirements.txt
  4. Create a .env file in the project root:
GROQ_API_KEY=your_groq_api_key_here
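
As a rough illustration of how a key stored this way can be read, the sketch below parses a `.env` file with the standard library only and falls back to it when `GROQ_API_KEY` is not already set in the environment. The app may well use a package such as python-dotenv instead; `read_env_file` and `get_groq_api_key` are hypothetical names chosen for this example.

```python
import os
from pathlib import Path
from typing import Dict


def read_env_file(path: str = ".env") -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: Dict[str, str] = {}
    env_path = Path(path)
    if not env_path.is_file():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_groq_api_key(path: str = ".env") -> str:
    """Prefer the process environment, then fall back to the .env file."""
    key = os.environ.get("GROQ_API_KEY") or read_env_file(path).get("GROQ_API_KEY", "")
    if not key:
        raise RuntimeError("GROQ_API_KEY is not set; add it to .env or the environment")
    return key


if __name__ == "__main__":
    try:
        print("key loaded, length:", len(get_groq_api_key()))
    except RuntimeError as error:
        print(error)
```

Keep the `.env` file out of version control so the key is not committed by accident.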

Usage

  1. Start the Streamlit app:
streamlit run app.py
  2. Open your web browser and navigate to the provided URL (usually http://localhost:8501)

  3. Use the sidebar to:

    • Upload documents (PDF, TXT, images)
    • Enter a URL to crawl (optional)
    • Input your Groq API key
  4. Once documents are processed, use the main interface to ask questions about your documents

Notes

  • The application uses ChromaDB for vector storage, which persists data in the ./chroma_db directory
  • For image processing, make sure Tesseract OCR is properly installed and accessible
  • Large documents may take some time to process
  • The quality of answers depends on the quality of the input documents and text extraction
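
Because the vector store persists on disk, it can be useful to see how much space it takes or to start again from an empty store. The sketch below uses only the standard library and treats `./chroma_db` as an ordinary directory. It assumes the app recreates the directory on its next run, which this README does not confirm, and the helper names are hypothetical. Stop the app before clearing the directory.

```python
import shutil
from pathlib import Path


def store_size_bytes(store_dir: str = "./chroma_db") -> int:
    """Total size in bytes of all files under store_dir (0 if it does not exist)."""
    root = Path(store_dir)
    if not root.is_dir():
        return 0
    return sum(f.stat().st_size for f in root.rglob("*") if f.is_file())


def clear_store(store_dir: str = "./chroma_db") -> bool:
    """Delete the persisted directory. Returns True if something was removed."""
    root = Path(store_dir)
    if not root.is_dir():
        return False
    shutil.rmtree(root)
    return True


if __name__ == "__main__":
    # Only reports the size; call clear_store() yourself if you want a reset.
    print("persisted store size in bytes:", store_size_bytes())
```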

Limitations

  • Maximum file size depends on your system's memory
  • Image OCR quality depends on the image quality and Tesseract's capabilities
  • Web crawling may be limited by website robots.txt rules
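
To see whether a page is likely to be blocked by robots.txt before entering its URL, you can test it with Python's built-in `urllib.robotparser`. This is a sketch of the general check, not the crawler logic used by this app; `is_allowed` is a hypothetical helper and the rules below are made-up sample data.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt content permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Sample rules for illustration only.
    sample_rules = "User-agent: *\nDisallow: /private/\n"
    for candidate in (
        "https://example.com/docs/intro",
        "https://example.com/private/report",
    ):
        print(candidate, is_allowed(sample_rules, candidate))
```

For a live site, `RobotFileParser` can also fetch the rules itself through `set_url()` followed by `read()`.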
