A production-ready Retrieval-Augmented Generation (RAG) system built with LangChain, Pinecone, Google Gemini, and Streamlit.
- Document Upload: Support for `.txt`, `.pdf`, and `.docx` files
- Intelligent Chunking: Automatic document chunking with overlap for better context
- Vector Embeddings: Google Gemini embeddings for semantic understanding
- Vector Database: Pinecone for scalable vector storage and retrieval
- Conversational AI: Google Gemini LLM with RAG for accurate, context-grounded responses
- Web Interface: Streamlit UI for easy interaction
- Hallucination Prevention: Built-in safeguards that keep the model from fabricating information
- Source Attribution: Answers include references to source documents
- Python 3.8+
- Active Pinecone account with API key
- Google Cloud API key with Generative AI access
```bash
git clone <repository-url>
cd rag-project
```

```bash
# On Windows
python -m venv venv
venv\Scripts\activate

# On macOS/Linux
python3 -m venv venv
source venv/bin/activate
```

```bash
pip install -r requirements.txt
```

- Copy the `.env.template` to `.env`:

```bash
cp .env.template .env
```

- Edit `.env` and add your API keys:

```
GOOGLE_API_KEY=your_google_api_key_here
PINECONE_API_KEY=your_pinecone_api_key_here
PINECONE_ENVIRONMENT=your_pinecone_environment_here
PINECONE_INDEX_NAME=rag-documents-index
```

- Go to Google AI Studio
- Click "Create API Key"
- Copy the key to your `.env` file (a quick verification sketch follows)
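To confirm the key is active before wiring up the app, you can run a quick check with the google-generativeai client. This is a sketch; it assumes the package is installed and `GOOGLE_API_KEY` is already exported:

```python
# Quick sanity check for the Gemini key (assumes `pip install google-generativeai`).
import os

import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

# If the key is valid and has Generative AI access, this lists available models.
for model in genai.list_models():
    print(model.name)
```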
- Sign up at Pinecone
- Create a project and get your API key
- Note your environment (e.g., `us-east-1`)
- Add both to your `.env` file (verification sketch below)
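Likewise, a minimal sketch to confirm the Pinecone key works, assuming the v3+ `pinecone` client:

```python
# Quick sanity check for the Pinecone key (assumes `pip install pinecone`).
import os

from pinecone import Pinecone

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])

# A valid key returns the list of index names in your project.
print(pc.list_indexes().names())
```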
```bash
streamlit run app.py
```

Then:
- Upload your documents (supports multiple files)
- Click "Process Documents"
- Ask questions about your documents in the chat interface (sketched below)
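For orientation, here is a minimal sketch of that upload-and-chat flow. It is not the project's exact `app.py`; `process_documents` and `answer_question` are hypothetical stand-ins for the project's own helpers:

```python
# Sketch of the Streamlit flow, not the project's exact app.py.
# `process_documents` and `answer_question` are hypothetical placeholders.
import streamlit as st

uploaded = st.file_uploader(
    "Upload documents", type=["txt", "pdf", "docx"], accept_multiple_files=True
)

if st.button("Process Documents") and uploaded:
    process_documents(uploaded)  # chunk, embed, and store in Pinecone
    st.success(f"Processed {len(uploaded)} file(s)")

if question := st.chat_input("Ask about your documents"):
    with st.chat_message("user"):
        st.write(question)
    with st.chat_message("assistant"):
        st.write(answer_question(question))  # retrieval + generation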
```bash
python main.py init
```

This creates a Pinecone index for storing document embeddings.
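A hedged sketch of what the init step likely boils down to with the v3 Pinecone client. The serverless cloud/region and the 768 dimension (the output size of Gemini's `embedding-001` model) are assumptions to verify against your configuration:

```python
# Rough sketch of index creation; cloud/region and dimension are assumptions.
import os

from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
index_name = os.environ.get("PINECONE_INDEX_NAME", "rag-documents-index")

if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=768,  # Gemini embedding-001 produces 768-dimensional vectors
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),
    )
```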
```bash
# Process a single file
python main.py process path/to/document.txt

# Process all documents in a directory
python main.py process path/to/documents/

# Process with a specific namespace (for organization)
python main.py process path/to/documents/ --namespace my-project
```

Supported formats: `.txt`, `.pdf`, `.docx`
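Under the hood, the process step is a chunk-embed-upsert loop. Here is a hedged sketch for a single `.txt` file; helper names and metadata fields are illustrative, and `.pdf`/`.docx` files would go through their own loaders:

```python
# Illustrative chunk -> embed -> upsert loop for a single .txt file.
import os

import google.generativeai as genai
from langchain_text_splitters import RecursiveCharacterTextSplitter
from pinecone import Pinecone

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
index = Pinecone(api_key=os.environ["PINECONE_API_KEY"]).Index("rag-documents-index")
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

def process_file(path: str, namespace: str = "") -> None:
    with open(path, encoding="utf-8") as f:
        chunks = splitter.split_text(f.read())
    vectors = []
    for i, chunk in enumerate(chunks):
        # One embedding per chunk; id, vector, and metadata are upserted together.
        emb = genai.embed_content(model="models/embedding-001", content=chunk)
        vectors.append((f"{path}-{i}", emb["embedding"], {"source": path, "text": chunk}))
    index.upsert(vectors=vectors, namespace=namespace)
```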
```text
rag-project/
├── src/
│   ├── config/
│   │   ├── __init__.py
│   │   └── config.py              # Configuration management
│   ├── rag/
│   │   ├── __init__.py
│   │   ├── pinecone_manager.py    # Pinecone operations
│   │   ├── embedding_service.py   # Google Gemini embeddings
│   │   ├── document_processor.py  # Document processing pipeline
│   │   └── rag_chain.py           # RAG retrieval and generation
│   └── utils/
│       ├── __init__.py
│       ├── helpers.py             # Utility functions and logging
│       ├── chunking.py            # Document chunking
│       └── text_processor.py      # Text extraction from files
├── app.py                         # Streamlit web interface
├── main.py                        # CLI entry point
├── .env.template                  # Environment variables template
├── .env                           # Environment variables (create from template)
├── requirements.txt               # Python dependencies
└── README.md                      # This file
```
All configuration is managed through the `src/config/config.py` file and environment variables in `.env` (a loading sketch follows the block):
```
# Google Gemini Configuration
GOOGLE_API_KEY=your_key_here
GOOGLE_MODEL_NAME=gemini-pro

# Pinecone Configuration
PINECONE_API_KEY=your_key_here
PINECONE_ENVIRONMENT=us-east-1
PINECONE_INDEX_NAME=rag-documents-index

# RAG Parameters
CHUNK_SIZE=1000        # Size of text chunks
CHUNK_OVERLAP=200      # Overlap between chunks
RETRIEVAL_TOP_K=5      # Number of results to retrieve
```
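As a sketch of how `src/config/config.py` might surface these values with python-dotenv (the defaults mirror the template; the project's actual names may differ):

```python
# Sketch of config loading with python-dotenv; defaults mirror .env.template.
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the project root

GOOGLE_API_KEY = os.environ["GOOGLE_API_KEY"]
PINECONE_API_KEY = os.environ["PINECONE_API_KEY"]
PINECONE_INDEX_NAME = os.getenv("PINECONE_INDEX_NAME", "rag-documents-index")

CHUNK_SIZE = int(os.getenv("CHUNK_SIZE", "1000"))
CHUNK_OVERLAP = int(os.getenv("CHUNK_OVERLAP", "200"))
RETRIEVAL_TOP_K = int(os.getenv("RETRIEVAL_TOP_K", "5"))
```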
- Document Upload: User uploads one or more files (`.txt`, `.pdf`, `.docx`)
- Text Extraction: System extracts text using appropriate loaders
- Chunking: Text is split into overlapping chunks for better context
- Embedding: Each chunk is converted to a vector using Google Gemini embeddings
- Storage: Vectors are stored in Pinecone with metadata
- Retrieval: User questions are embedded and matched against stored vectors
- Generation: Top-K relevant chunks are used as context for LLM
- Response: Google Gemini generates an answer based only on retrieved context (see the sketch below)
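Steps 6-8 in code form, as a hedged sketch; the prompt wording and helper structure are illustrative, not the project's exact `rag_chain.py`:

```python
# Sketch of retrieval + generation; prompt wording is illustrative.
import os

import google.generativeai as genai
from pinecone import Pinecone

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
index = Pinecone(api_key=os.environ["PINECONE_API_KEY"]).Index("rag-documents-index")

def answer(question: str, top_k: int = 5) -> str:
    # Embed the question and fetch the top-k most similar chunks.
    q = genai.embed_content(model="models/embedding-001", content=question)
    results = index.query(vector=q["embedding"], top_k=top_k, include_metadata=True)
    context = "\n\n".join(m.metadata["text"] for m in results.matches)
    prompt = (
        "Answer using ONLY the context below. If the answer is not there, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return genai.GenerativeModel("gemini-pro").generate_content(prompt).text
```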
- Context Verification: Answers are generated only from retrieved documents
- Hallucination Prevention: Built-in prompts instruct the model to refuse answering out-of-scope questions (example template after this list)
- Source Attribution: All answers include references to source documents
- Logging: Comprehensive logging of all operations for debugging
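As an illustration of those guardrail prompts, here is the kind of template that enforces context-only answers. The wording is an example, not the project's exact prompt:

```python
# Illustrative guardrail template; the project's actual wording may differ.
GUARDRAIL_PROMPT = """You are a document assistant.
Answer ONLY from the context below; do not use outside knowledge.
If the context does not contain the answer, reply exactly:
"I don't have information in the uploaded documents to answer that."
Name the source file for every claim you make.

Context:
{context}

Question: {question}
"""

def build_prompt(context: str, question: str) -> str:
    return GUARDRAIL_PROMPT.format(context=context, question=question)
```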
After uploading documents, you can ask:
- "What is the main topic of this document?"
- "Summarize the key points from the uploaded files"
- "What does the document say about [specific topic]?"
- "Find information about [topic] in the uploaded documents"
The system will respond with "I don't have information in the uploaded documents to answer that." for:
- Questions about topics not covered in uploaded documents
- Requests for information from the internet or external sources
- General knowledge questions unrelated to the documents
Missing API keys. Solution: Ensure the `.env` file exists and contains all required API keys:

```bash
cp .env.template .env
# Edit .env with your API keys
```

Pinecone connection errors. Solution: Check that:
- `PINECONE_API_KEY` is correct
- `PINECONE_ENVIRONMENT` is correct (e.g., `us-east-1`)
- Your internet connection is active

Google API errors. Solution: Check that:
- `GOOGLE_API_KEY` is correct
- Your API key has Generative AI access enabled

Import errors. Solution: Ensure all dependencies are installed:

```bash
pip install -r requirements.txt
```
- Chunk Size: Increase `CHUNK_SIZE` for faster processing, decrease for more precise retrieval
- Overlap: Increase `CHUNK_OVERLAP` for better context continuity
- Retrieval: Increase `RETRIEVAL_TOP_K` for more comprehensive answers
- Namespaces: Use different namespaces to organize documents by project
To update the indexed documents:
- Add documents: Process new documents (they will be added to the same index)
- Delete documents: Use the "Clear All Data" button in the Streamlit interface
- Rebuild the index: Delete the Pinecone index and run `python main.py init` again (see the snippet below)
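The delete step can also be done from Python; a one-line sketch with the v3 client (destructive, so double-check the index name):

```python
# Permanently deletes the index so `python main.py init` can recreate it.
import os

from pinecone import Pinecone

Pinecone(api_key=os.environ["PINECONE_API_KEY"]).delete_index("rag-documents-index")
```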
Logs are output to the console with the following format:
```text
2024-01-15 10:30:45,123 - module_name - INFO - Log message
```
Adjust `LOG_LEVEL` in `.env` to control verbosity: `DEBUG`, `INFO`, `WARNING`, `ERROR`, `CRITICAL`.
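That format matches the standard library's logging module; here is a sketch of the setup `helpers.py` plausibly uses (reading `LOG_LEVEL` from the environment is an assumption):

```python
# Sketch of logging setup producing the format shown above.
import logging
import os

logging.basicConfig(
    level=os.getenv("LOG_LEVEL", "INFO"),
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
)

logger = logging.getLogger(__name__)
logger.info("Log message")
```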
To test the RAG system with sample documents:
- Create a `sample_docs/` directory:

```bash
mkdir sample_docs
```

- Add test `.txt` files to `sample_docs/`
- Process them:

```bash
python main.py process sample_docs/
```

- Start Streamlit and ask questions:

```bash
streamlit run app.py
```

Key dependencies (see `requirements.txt` for the complete list):
- LangChain: RAG framework and NLP utilities
- Pinecone: Vector database
- Google Generative AI: Embeddings and LLM
- Streamlit: Web interface
- PyPDF2: PDF processing
- python-docx: DOCX processing
- python-dotenv: Environment variable management
This project is provided as-is for educational and commercial use.
Contributions are welcome! Please feel free to submit pull requests for improvements.
For issues or questions:
- Check the Troubleshooting section above
- Review logs for error details
- Check API key configurations
- Ensure all dependencies are installed
Built with ❤️ using LangChain, Pinecone, Google Gemini, and Streamlit