# RAG Pipeline

A Retrieval-Augmented Generation (RAG) pipeline that lets users upload documents, process them, and ask questions about them, built with LangChain and Groq's LLaMA3 model.

## Features
- Upload and process multiple file types:
  - PDF documents
  - Text files (.txt)
  - Images (.jpg, .png) with OCR support
- Web page crawling (optional)
- Document processing with chunking and vector storage
- Question answering with source citations
- Modern Streamlit UI
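The document-processing step splits extracted text into overlapping chunks before embedding them into the vector store. A minimal sketch of that chunking idea, using only the standard library (the chunk size and overlap values are illustrative, not the app's actual settings):

```python
from typing import List

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """Split text into fixed-size chunks with overlap so that context
    spanning a chunk boundary is not lost."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap  # step back by `overlap` characters
    return chunks
```

Each resulting chunk is then embedded and stored in the vector database, so a question can be matched against small, focused passages rather than whole documents.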
## Prerequisites

- Python 3.8+
- Tesseract OCR installed on your system
- Groq API key
### Installing Tesseract OCR

**macOS:**

```bash
brew install tesseract
```

**Ubuntu/Debian:**

```bash
sudo apt-get update
sudo apt-get install tesseract-ocr
```

**Windows:** Download and install from https://github.com/UB-Mannheim/tesseract/wiki
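After installing, you can confirm that the `tesseract` binary is discoverable from Python before launching the app. A small sketch (the helper name is ours, not part of the project):

```python
import shutil

def tesseract_available() -> bool:
    """Return True if the `tesseract` executable is on the system PATH."""
    return shutil.which("tesseract") is not None

if __name__ == "__main__":
    print("Tesseract found" if tesseract_available() else "Tesseract missing")
```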
## Installation

- Clone the repository:

  ```bash
  git clone <repository-url>
  cd rag-pipeline
  ```

- Create a virtual environment (optional but recommended):

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Create a `.env` file in the project root:

  ```
  GROQ_API_KEY=your_groq_api_key_here
  ```

## Usage

- Start the Streamlit app:

  ```bash
  streamlit run app.py
  ```
- Open your web browser and navigate to the URL shown in the terminal (usually http://localhost:8501)
- Use the sidebar to:
  - Upload documents (PDF, TXT, images)
  - Enter a URL to crawl (optional)
  - Input your Groq API key
- Once documents are processed, use the main interface to ask questions about your documents
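The `GROQ_API_KEY` placed in the `.env` file is read into the environment at startup. The project most likely uses a library such as python-dotenv for this (an assumption); a minimal standard-library sketch of the same parsing:

```python
import os

def load_dotenv_minimal(path: str = ".env") -> None:
    """Parse simple KEY=VALUE lines into os.environ.
    Skips blank lines and `#` comments; no quoting or expansion rules."""
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # setdefault: a key already set in the real environment wins
            os.environ.setdefault(key.strip(), value.strip())
```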
## Notes

- The application uses ChromaDB for vector storage, which persists data in the `./chroma_db` directory
- For image processing, make sure Tesseract OCR is properly installed and accessible
- Large documents may take some time to process
- The quality of answers depends on the quality of the input documents and text extraction
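Answer quality also hinges on retrieval: the question is embedded and compared against the stored chunk vectors, and only the best-matching chunks reach the model. A toy sketch of that cosine-similarity ranking (illustrative only; ChromaDB performs this internally):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, chunk_vecs, k=2):
    """Return indices of the k stored chunk vectors most similar to the query."""
    ranked = sorted(range(len(chunk_vecs)),
                    key=lambda i: cosine_similarity(query_vec, chunk_vecs[i]),
                    reverse=True)
    return ranked[:k]
```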
## Limitations

- Maximum file size depends on your system's available memory
- Image OCR quality depends on the image quality and Tesseract's capabilities
- Web crawling may be limited by a website's robots.txt rules
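The robots.txt restriction can be checked programmatically with the standard library before crawling a page (the rules below are illustrative; in practice they would be fetched from the site's `/robots.txt`):

```python
from urllib.robotparser import RobotFileParser

# Parse example robots.txt rules instead of fetching them over the network.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(rp.can_fetch("*", "https://example.com/docs/page.html"))     # allowed -> True
print(rp.can_fetch("*", "https://example.com/private/page.html"))  # disallowed -> False
```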