# RAG Pipeline

A Retrieval-Augmented Generation (RAG) pipeline that lets users upload documents, process them, and ask questions about them, built with LangChain and Groq's LLaMA3 model.

## Features
- Upload and process multiple file types:
  - PDF documents
  - Text files (.txt)
  - Images (.jpg, .png) with OCR support
- Web page crawling (optional)
- Document processing with chunking and vector storage
- Question answering with source citations
- Modern Streamlit UI
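The document-processing step splits extracted text into overlapping chunks before embedding them into the vector store. A minimal sketch of that chunking idea, using only the standard library (the chunk size and overlap values are illustrative, not the app's actual settings):

```python
from typing import List

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """Split text into fixed-size chunks with overlap so that context
    spanning a chunk boundary is not lost."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap  # step back by `overlap` characters
    return chunks
```

Each resulting chunk is then embedded and stored in the vector database, so a question can be matched against small, focused passages rather than whole documents.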
## Prerequisites

- Python 3.8+
- Tesseract OCR installed on your system
- Groq API key
### Installing Tesseract OCR

**macOS:**

```bash
brew install tesseract
```

**Ubuntu/Debian:**

```bash
sudo apt-get update
sudo apt-get install tesseract-ocr
```

**Windows:** Download and install from https://github.com/UB-Mannheim/tesseract/wiki
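After installing, you can confirm that the `tesseract` binary is discoverable from Python before launching the app. A small sketch (the helper name is ours, not part of the project):

```python
import shutil

def tesseract_available() -> bool:
    """Return True if the `tesseract` executable is on the system PATH."""
    return shutil.which("tesseract") is not None

if __name__ == "__main__":
    print("Tesseract found" if tesseract_available() else "Tesseract missing")
```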
## Installation

- Clone the repository:

  ```bash
  git clone <repository-url>
  cd rag-pipeline
  ```

- Create a virtual environment (optional but recommended):

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Create a `.env` file in the project root:

  ```
  GROQ_API_KEY=your_groq_api_key_here
  ```

## Usage

- Start the Streamlit app:

  ```bash
  streamlit run app.py
  ```
- Open your web browser and navigate to the URL shown in the terminal (usually http://localhost:8501)
- Use the sidebar to:
  - Upload documents (PDF, TXT, images)
  - Enter a URL to crawl (optional)
  - Input your Groq API key
- Once documents are processed, use the main interface to ask questions about your documents
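The `GROQ_API_KEY` placed in the `.env` file is read into the environment at startup. The project most likely uses a library such as python-dotenv for this (an assumption); a minimal standard-library sketch of the same parsing:

```python
import os

def load_dotenv_minimal(path: str = ".env") -> None:
    """Parse simple KEY=VALUE lines into os.environ.
    Skips blank lines and `#` comments; no quoting or expansion rules."""
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # setdefault: a key already set in the real environment wins
            os.environ.setdefault(key.strip(), value.strip())
```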
## Notes

- The application uses ChromaDB for vector storage, which persists data in the `./chroma_db` directory
- For image processing, make sure Tesseract OCR is properly installed and accessible
- Large documents may take some time to process
- The quality of answers depends on the quality of the input documents and text extraction
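Answer quality also hinges on retrieval: the question is embedded and compared against the stored chunk vectors, and only the best-matching chunks reach the model. A toy sketch of that cosine-similarity ranking (illustrative only; ChromaDB performs this internally):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, chunk_vecs, k=2):
    """Return indices of the k stored chunk vectors most similar to the query."""
    ranked = sorted(range(len(chunk_vecs)),
                    key=lambda i: cosine_similarity(query_vec, chunk_vecs[i]),
                    reverse=True)
    return ranked[:k]
```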
## Limitations

- Maximum file size depends on your system's available memory
- Image OCR quality depends on the image quality and Tesseract's capabilities
- Web crawling may be limited by a website's robots.txt rules
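The robots.txt restriction can be checked programmatically with the standard library before crawling a page (the rules below are illustrative; in practice they would be fetched from the site's `/robots.txt`):

```python
from urllib.robotparser import RobotFileParser

# Parse example robots.txt rules instead of fetching them over the network.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(rp.can_fetch("*", "https://example.com/docs/page.html"))     # allowed -> True
print(rp.can_fetch("*", "https://example.com/private/page.html"))  # disallowed -> False
```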