RAG Document Assistant

A production-ready Retrieval-Augmented Generation (RAG) system built with LangChain, Pinecone, Google Gemini, and Streamlit.

🎯 Features

  • Document Upload: Support for .txt, .pdf, and .docx files
  • Intelligent Chunking: Automatic document chunking with overlap for better context
  • Vector Embeddings: Google Gemini embeddings for semantic understanding
  • Vector Database: Pinecone for scalable vector storage and retrieval
  • Conversational AI: Google Gemini LLM with RAG for accurate, context-grounded responses
  • Web Interface: Streamlit UI for easy interaction
  • Hallucination Prevention: Built-in safeguards that keep the model from inventing information not found in your documents
  • Source Attribution: Answers include references to source documents

📋 System Requirements

  • Python 3.8+
  • Active Pinecone account with API key
  • Google Cloud API key with Generative AI access

🚀 Installation

1. Clone the Repository

git clone <repository-url>
cd rag-project

2. Create Virtual Environment

# On Windows
python -m venv venv
venv\Scripts\activate

# On macOS/Linux
python3 -m venv venv
source venv/bin/activate

3. Install Dependencies

pip install -r requirements.txt

4. Configure Environment

  1. Copy .env.template to .env:
cp .env.template .env
  2. Edit .env and add your API keys:
GOOGLE_API_KEY=your_google_api_key_here
PINECONE_API_KEY=your_pinecone_api_key_here
PINECONE_ENVIRONMENT=your_pinecone_environment_here
PINECONE_INDEX_NAME=rag-documents-index
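As a minimal sketch of how these variables might be validated at startup (the variable names come from the template above; `load_settings` is a hypothetical helper, and the actual project presumably loads .env via python-dotenv inside src/config/config.py):

```python
import os

REQUIRED_KEYS = ["GOOGLE_API_KEY", "PINECONE_API_KEY",
                 "PINECONE_ENVIRONMENT", "PINECONE_INDEX_NAME"]

def load_settings(env=os.environ):
    """Collect the required settings, failing fast on anything missing."""
    missing = [k for k in REQUIRED_KEYS if not env.get(k)]
    if missing:
        raise RuntimeError(
            f"Missing required configuration: {', '.join(missing)}")
    return {k: env[k] for k in REQUIRED_KEYS}
```

Failing fast here is what produces the "Missing required configuration" error described in Troubleshooting, rather than a confusing failure deep inside a Pinecone or Gemini call.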

How to Get API Keys

Google Gemini API Key

  1. Go to Google AI Studio
  2. Click "Create API Key"
  3. Copy the key to your .env file

Pinecone API Key

  1. Sign up at Pinecone
  2. Create a project and get your API key
  3. Note your environment (e.g., us-east-1)
  4. Add both to .env file

📖 Usage

Method 1: Streamlit Web Interface (Recommended)

streamlit run app.py

Then:

  1. Upload your documents (supports multiple files)
  2. Click "Process Documents"
  3. Ask questions about your documents in the chat interface

Method 2: Command Line

Initialize Pinecone Index

python main.py init

This creates a Pinecone index for storing document embeddings.

Process Documents

# Process a single file
python main.py process path/to/document.txt

# Process all documents in a directory
python main.py process path/to/documents/

# Process with a specific namespace (for organization)
python main.py process path/to/documents/ --namespace my-project

Supported formats: .txt, .pdf, .docx
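The actual main.py may be structured differently; a minimal sketch of a CLI with the init and process subcommands and the --namespace flag shown above, using argparse:

```python
import argparse

def build_parser():
    """Build a parser for: main.py init | main.py process PATH [--namespace NS]."""
    parser = argparse.ArgumentParser(prog="main.py")
    sub = parser.add_subparsers(dest="command", required=True)

    sub.add_parser("init", help="Create the Pinecone index")

    process = sub.add_parser("process", help="Index a file or directory")
    process.add_argument("path", help="File or directory of documents")
    process.add_argument("--namespace", default=None,
                         help="Optional namespace for organizing documents")
    return parser
```

Subcommands keep the two workflows (one-time index creation vs. repeated ingestion) cleanly separated under a single entry point.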

πŸ“ Project Structure

rag-project/
├── src/
│   ├── config/
│   │   ├── __init__.py
│   │   └── config.py              # Configuration management
│   ├── rag/
│   │   ├── __init__.py
│   │   ├── pinecone_manager.py     # Pinecone operations
│   │   ├── embedding_service.py    # Google Gemini embeddings
│   │   ├── document_processor.py   # Document processing pipeline
│   │   └── rag_chain.py            # RAG retrieval and generation
│   └── utils/
│       ├── __init__.py
│       ├── helpers.py              # Utility functions and logging
│       ├── chunking.py             # Document chunking
│       └── text_processor.py       # Text extraction from files
├── app.py                          # Streamlit web interface
├── main.py                         # CLI entry point
├── .env.template                   # Environment variables template
├── .env                            # Environment variables (create from template)
├── requirements.txt                # Python dependencies
└── README.md                       # This file

🔧 Configuration

All configuration is managed through the src/config/config.py file and environment variables in .env:

# Google Gemini Configuration
GOOGLE_API_KEY=your_key_here
GOOGLE_MODEL_NAME=gemini-pro

# Pinecone Configuration
PINECONE_API_KEY=your_key_here
PINECONE_ENVIRONMENT=us-east-1
PINECONE_INDEX_NAME=rag-documents-index

# RAG Parameters
CHUNK_SIZE=1000              # Size of text chunks
CHUNK_OVERLAP=200            # Overlap between chunks
RETRIEVAL_TOP_K=5            # Number of results to retrieve
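The project's chunking.py likely uses a more sophisticated splitter (LangChain's text splitters are the usual choice), but a character-level sliding window is enough to show what CHUNK_SIZE and CHUNK_OVERLAP mean:

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=200):
    """Split text into chunks of up to chunk_size characters, where each
    chunk shares its first chunk_overlap characters with the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - chunk_overlap, 1), step)]
```

The overlap means a sentence falling on a chunk boundary still appears intact in at least one chunk, which is why increasing it improves context continuity at the cost of storing more redundant text.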

🎯 How It Works

  1. Document Upload: User uploads one or more files (.txt, .pdf, .docx)
  2. Text Extraction: System extracts text using appropriate loaders
  3. Chunking: Text is split into overlapping chunks for better context
  4. Embedding: Each chunk is converted to a vector using Google Gemini embeddings
  5. Storage: Vectors are stored in Pinecone with metadata
  6. Retrieval: User questions are embedded and matched against stored vectors
  7. Generation: Top-K relevant chunks are used as context for LLM
  8. Response: Google Gemini generates an answer based only on retrieved context
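Steps 6 and 7 above can be sketched with an in-memory stand-in for Pinecone (the real system embeds with Gemini and queries Pinecone; here `index` is just a list of (chunk, vector) pairs and similarity is plain cosine):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query_vec, index, top_k=5):
    """Return the top_k stored chunks most similar to the query vector.
    `index` is a list of (chunk_text, embedding) pairs."""
    scored = sorted(index, key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [text for text, _ in scored[:top_k]]
```

The retrieved chunks then become the only context the LLM sees in step 8, which is what grounds the answer in the uploaded documents.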

πŸ›‘οΈ Safety Features

  • Context Verification: Answers are generated only from retrieved documents
  • Hallucination Prevention: Built-in prompts instruct the model to refuse answering out-of-scope questions
  • Source Attribution: All answers include references to source documents
  • Logging: Comprehensive logging of all operations for debugging
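The actual prompt lives in rag_chain.py and may differ; a minimal sketch of the kind of grounding prompt these safeguards imply (`build_prompt` is a hypothetical helper; the refusal text matches the one shown under Unsupported Queries below):

```python
REFUSAL = "I don't have information in the uploaded documents to answer that."

def build_prompt(question, context_chunks):
    """Assemble a context-grounded prompt: the model is instructed to answer
    only from the supplied chunks and to refuse anything out of scope."""
    context = "\n\n".join(f"[Source {i + 1}] {chunk}"
                          for i, chunk in enumerate(context_chunks))
    return (
        "Answer the question using ONLY the context below. "
        f'If the answer is not in the context, reply exactly: "{REFUSAL}"\n\n'
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
```

Numbering the chunks as [Source N] is also what makes source attribution possible: the model can cite the chunk it drew each claim from.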

πŸ” Example Queries

After uploading documents, you can ask:

  • "What is the main topic of this document?"
  • "Summarize the key points from the uploaded files"
  • "What does the document say about [specific topic]?"
  • "Find information about [topic] in the uploaded documents"

❌ Unsupported Queries

The system will respond with "I don't have information in the uploaded documents to answer that." for:

  • Questions about topics not covered in uploaded documents
  • Requests for information from the internet or external sources
  • General knowledge questions unrelated to the documents

🚨 Troubleshooting

"Missing required configuration" Error

Solution: Ensure .env file exists and contains all required API keys:

cp .env.template .env
# Edit .env with your API keys

"Failed to connect to Pinecone" Error

Solution: Check that:

  • PINECONE_API_KEY is correct
  • PINECONE_ENVIRONMENT is correct (e.g., us-east-1)
  • Internet connection is active

"No embeddings generated" Error

Solution: Check that:

  • GOOGLE_API_KEY is correct
  • Your API key has Generative AI access enabled

Streamlit not starting

Solution: Ensure all dependencies are installed:

pip install -r requirements.txt

📊 Performance Tips

  1. Chunk Size: Increase CHUNK_SIZE for faster processing, decrease for more precise retrieval
  2. Overlap: Increase CHUNK_OVERLAP for better context continuity
  3. Retrieval: Increase RETRIEVAL_TOP_K for more comprehensive answers
  4. Namespaces: Use different namespaces to organize documents by project

🔄 Updating Documents

To update the indexed documents:

  1. Add documents: Process new documents (they will be added to the same index)
  2. Clear all data: Use the "Clear All Data" button in the Streamlit interface
  3. Rebuild the index: Delete the Pinecone index and run python main.py init again

πŸ“ Logging

Logs are output to the console with the following format:

2024-01-15 10:30:45,123 - module_name - INFO - Log message

Adjust LOG_LEVEL in .env to control verbosity: DEBUG, INFO, WARNING, ERROR, CRITICAL
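The project's logging setup lives in src/utils/helpers.py and may look different; a minimal sketch producing the format shown above, with the level taken from LOG_LEVEL (`setup_logging` and the logger name "rag" are assumptions):

```python
import logging
import os

LOG_FORMAT = "%(asctime)s - %(name)s - %(levelname)s - %(message)s"

def setup_logging(level=None):
    """Configure console logging; the level defaults to LOG_LEVEL from the
    environment (populated from .env), falling back to INFO."""
    level = level or os.environ.get("LOG_LEVEL", "INFO")
    logging.basicConfig(level=getattr(logging, level), format=LOG_FORMAT)
    return logging.getLogger("rag")
```

%(asctime)s with the default date format yields timestamps like "2024-01-15 10:30:45,123", matching the example line above.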

🧪 Testing

To test the RAG system with sample documents:

  1. Create a sample_docs/ directory:
mkdir sample_docs
  2. Add test .txt files to sample_docs/
  3. Process them:
python main.py process sample_docs/
  4. Start Streamlit and ask questions:
streamlit run app.py

📚 Dependencies

Key dependencies (see requirements.txt for complete list):

  • LangChain: RAG framework and NLP utilities
  • Pinecone: Vector database
  • Google Generative AI: Embeddings and LLM
  • Streamlit: Web interface
  • PyPDF2: PDF processing
  • python-docx: DOCX processing
  • python-dotenv: Environment variable management

📄 License

This project is provided as-is for educational and commercial use.

🤝 Contributing

Contributions are welcome! Please feel free to submit pull requests for improvements.

📞 Support

For issues or questions:

  1. Check the Troubleshooting section above
  2. Review logs for error details
  3. Check API key configurations
  4. Ensure all dependencies are installed

Built with ❤️ using LangChain, Pinecone, Google Gemini, and Streamlit
