A production-ready Retrieval-Augmented Generation (RAG) system built with LangChain, Pinecone, Google Gemini, and Streamlit.
- Document Upload: Support for `.txt`, `.pdf`, and `.docx` files
- Intelligent Chunking: Automatic document chunking with overlap for better context
- Vector Embeddings: Google Gemini embeddings for semantic understanding
- Vector Database: Pinecone for scalable vector storage and retrieval
- Conversational AI: Google Gemini LLM with RAG for accurate, context-grounded responses
- Web Interface: Streamlit UI for easy interaction
- Hallucination Prevention: Built-in safeguards that keep the model from fabricating information
- Source Attribution: Answers include references to source documents
- Python 3.8+
- Active Pinecone account with API key
- Google Cloud API key with Generative AI access
```bash
git clone <repository-url>
cd rag-project
```

```bash
# On Windows
python -m venv venv
venv\Scripts\activate

# On macOS/Linux
python3 -m venv venv
source venv/bin/activate
```

```bash
pip install -r requirements.txt
```

- Copy the `.env.template` to `.env`:

```bash
cp .env.template .env
```

- Edit `.env` and add your API keys:

```
GOOGLE_API_KEY=your_google_api_key_here
PINECONE_API_KEY=your_pinecone_api_key_here
PINECONE_ENVIRONMENT=your_pinecone_environment_here
PINECONE_INDEX_NAME=rag-documents-index
```

- Go to Google AI Studio
- Click "Create API Key"
- Copy the key to your `.env` file (a quick verification sketch follows)
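To confirm the key is active before wiring up the app, you can run a quick check with the google-generativeai client. This is a sketch; it assumes the package is installed and `GOOGLE_API_KEY` is already exported:

```python
# Quick sanity check for the Gemini key (assumes `pip install google-generativeai`).
import os

import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

# If the key is valid and has Generative AI access, this lists available models.
for model in genai.list_models():
    print(model.name)
```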
- Sign up at Pinecone
- Create a project and get your API key
- Note your environment (e.g., `us-east-1`)
- Add both to your `.env` file (verification sketch below)
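Likewise, a minimal sketch to confirm the Pinecone key works, assuming the v3+ `pinecone` client:

```python
# Quick sanity check for the Pinecone key (assumes `pip install pinecone`).
import os

from pinecone import Pinecone

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])

# A valid key returns the list of index names in your project.
print(pc.list_indexes().names())
```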
```bash
streamlit run app.py
```

Then:
- Upload your documents (supports multiple files)
- Click "Process Documents"
- Ask questions about your documents in the chat interface (sketched below)
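For orientation, here is a minimal sketch of that upload-and-chat flow. It is not the project's exact `app.py`; `process_documents` and `answer_question` are hypothetical stand-ins for the project's own helpers:

```python
# Sketch of the Streamlit flow, not the project's exact app.py.
# `process_documents` and `answer_question` are hypothetical placeholders.
import streamlit as st

uploaded = st.file_uploader(
    "Upload documents", type=["txt", "pdf", "docx"], accept_multiple_files=True
)

if st.button("Process Documents") and uploaded:
    process_documents(uploaded)  # chunk, embed, and store in Pinecone
    st.success(f"Processed {len(uploaded)} file(s)")

if question := st.chat_input("Ask about your documents"):
    with st.chat_message("user"):
        st.write(question)
    with st.chat_message("assistant"):
        st.write(answer_question(question))  # retrieval + generation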
```bash
python main.py init
```

This creates a Pinecone index for storing document embeddings.
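A hedged sketch of what the init step likely boils down to with the v3 Pinecone client. The serverless cloud/region and the 768 dimension (the output size of Gemini's `embedding-001` model) are assumptions to verify against your configuration:

```python
# Rough sketch of index creation; cloud/region and dimension are assumptions.
import os

from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
index_name = os.environ.get("PINECONE_INDEX_NAME", "rag-documents-index")

if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=768,  # Gemini embedding-001 produces 768-dimensional vectors
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),
    )
```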
```bash
# Process a single file
python main.py process path/to/document.txt

# Process all documents in a directory
python main.py process path/to/documents/

# Process with a specific namespace (for organization)
python main.py process path/to/documents/ --namespace my-project
```

Supported formats: `.txt`, `.pdf`, `.docx`
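Under the hood, the process step is a chunk-embed-upsert loop. Here is a hedged sketch for a single `.txt` file; helper names and metadata fields are illustrative, and `.pdf`/`.docx` files would go through their own loaders:

```python
# Illustrative chunk -> embed -> upsert loop for a single .txt file.
import os

import google.generativeai as genai
from langchain_text_splitters import RecursiveCharacterTextSplitter
from pinecone import Pinecone

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
index = Pinecone(api_key=os.environ["PINECONE_API_KEY"]).Index("rag-documents-index")
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

def process_file(path: str, namespace: str = "") -> None:
    with open(path, encoding="utf-8") as f:
        chunks = splitter.split_text(f.read())
    vectors = []
    for i, chunk in enumerate(chunks):
        # One embedding per chunk; id, vector, and metadata are upserted together.
        emb = genai.embed_content(model="models/embedding-001", content=chunk)
        vectors.append((f"{path}-{i}", emb["embedding"], {"source": path, "text": chunk}))
    index.upsert(vectors=vectors, namespace=namespace)
```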
```text
rag-project/
├── src/
│   ├── config/
│   │   ├── __init__.py
│   │   └── config.py              # Configuration management
│   ├── rag/
│   │   ├── __init__.py
│   │   ├── pinecone_manager.py    # Pinecone operations
│   │   ├── embedding_service.py   # Google Gemini embeddings
│   │   ├── document_processor.py  # Document processing pipeline
│   │   └── rag_chain.py           # RAG retrieval and generation
│   └── utils/
│       ├── __init__.py
│       ├── helpers.py             # Utility functions and logging
│       ├── chunking.py            # Document chunking
│       └── text_processor.py      # Text extraction from files
├── app.py                         # Streamlit web interface
├── main.py                        # CLI entry point
├── .env.template                  # Environment variables template
├── .env                           # Environment variables (create from template)
├── requirements.txt               # Python dependencies
└── README.md                      # This file
```
All configuration is managed through the `src/config/config.py` file and environment variables in `.env` (a loading sketch follows the block):
```
# Google Gemini Configuration
GOOGLE_API_KEY=your_key_here
GOOGLE_MODEL_NAME=gemini-pro

# Pinecone Configuration
PINECONE_API_KEY=your_key_here
PINECONE_ENVIRONMENT=us-east-1
PINECONE_INDEX_NAME=rag-documents-index

# RAG Parameters
CHUNK_SIZE=1000        # Size of text chunks
CHUNK_OVERLAP=200      # Overlap between chunks
RETRIEVAL_TOP_K=5      # Number of results to retrieve
```
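As a sketch of how `src/config/config.py` might surface these values with python-dotenv (the defaults mirror the template; the project's actual names may differ):

```python
# Sketch of config loading with python-dotenv; defaults mirror .env.template.
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the project root

GOOGLE_API_KEY = os.environ["GOOGLE_API_KEY"]
PINECONE_API_KEY = os.environ["PINECONE_API_KEY"]
PINECONE_INDEX_NAME = os.getenv("PINECONE_INDEX_NAME", "rag-documents-index")

CHUNK_SIZE = int(os.getenv("CHUNK_SIZE", "1000"))
CHUNK_OVERLAP = int(os.getenv("CHUNK_OVERLAP", "200"))
RETRIEVAL_TOP_K = int(os.getenv("RETRIEVAL_TOP_K", "5"))
```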
- Document Upload: User uploads one or more files (`.txt`, `.pdf`, `.docx`)
- Text Extraction: System extracts text using appropriate loaders
- Chunking: Text is split into overlapping chunks for better context
- Embedding: Each chunk is converted to a vector using Google Gemini embeddings
- Storage: Vectors are stored in Pinecone with metadata
- Retrieval: User questions are embedded and matched against stored vectors
- Generation: Top-K relevant chunks are used as context for LLM
- Response: Google Gemini generates an answer based only on retrieved context (see the sketch below)
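Steps 6-8 in code form, as a hedged sketch; the prompt wording and helper structure are illustrative, not the project's exact `rag_chain.py`:

```python
# Sketch of retrieval + generation; prompt wording is illustrative.
import os

import google.generativeai as genai
from pinecone import Pinecone

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
index = Pinecone(api_key=os.environ["PINECONE_API_KEY"]).Index("rag-documents-index")

def answer(question: str, top_k: int = 5) -> str:
    # Embed the question and fetch the top-k most similar chunks.
    q = genai.embed_content(model="models/embedding-001", content=question)
    results = index.query(vector=q["embedding"], top_k=top_k, include_metadata=True)
    context = "\n\n".join(m.metadata["text"] for m in results.matches)
    prompt = (
        "Answer using ONLY the context below. If the answer is not there, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return genai.GenerativeModel("gemini-pro").generate_content(prompt).text
```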
- Context Verification: Answers are generated only from retrieved documents
- Hallucination Prevention: Built-in prompts instruct the model to refuse answering out-of-scope questions (example template after this list)
- Source Attribution: All answers include references to source documents
- Logging: Comprehensive logging of all operations for debugging
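As an illustration of those guardrail prompts, here is the kind of template that enforces context-only answers. The wording is an example, not the project's exact prompt:

```python
# Illustrative guardrail template; the project's actual wording may differ.
GUARDRAIL_PROMPT = """You are a document assistant.
Answer ONLY from the context below; do not use outside knowledge.
If the context does not contain the answer, reply exactly:
"I don't have information in the uploaded documents to answer that."
Name the source file for every claim you make.

Context:
{context}

Question: {question}
"""

def build_prompt(context: str, question: str) -> str:
    return GUARDRAIL_PROMPT.format(context=context, question=question)
```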
After uploading documents, you can ask:
- "What is the main topic of this document?"
- "Summarize the key points from the uploaded files"
- "What does the document say about [specific topic]?"
- "Find information about [topic] in the uploaded documents"
The system will respond with "I don't have information in the uploaded documents to answer that." for:
- Questions about topics not covered in uploaded documents
- Requests for information from the internet or external sources
- General knowledge questions unrelated to the documents
Missing API keys. Solution: Ensure the `.env` file exists and contains all required API keys:

```bash
cp .env.template .env
# Edit .env with your API keys
```

Pinecone connection errors. Solution: Check that:
- `PINECONE_API_KEY` is correct
- `PINECONE_ENVIRONMENT` is correct (e.g., `us-east-1`)
- Your internet connection is active

Google API errors. Solution: Check that:
- `GOOGLE_API_KEY` is correct
- Your API key has Generative AI access enabled

Import errors. Solution: Ensure all dependencies are installed:

```bash
pip install -r requirements.txt
```
- Chunk Size: Increase `CHUNK_SIZE` for faster processing, decrease for more precise retrieval
- Overlap: Increase `CHUNK_OVERLAP` for better context continuity
- Retrieval: Increase `RETRIEVAL_TOP_K` for more comprehensive answers
- Namespaces: Use different namespaces to organize documents by project
To update the indexed documents:
- Add documents: Process new documents (they will be added to the same index)
- Delete documents: Use the "Clear All Data" button in the Streamlit interface
- Rebuild the index: Delete the Pinecone index and run `python main.py init` again (see the snippet below)
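The delete step can also be done from Python; a one-line sketch with the v3 client (destructive, so double-check the index name):

```python
# Permanently deletes the index so `python main.py init` can recreate it.
import os

from pinecone import Pinecone

Pinecone(api_key=os.environ["PINECONE_API_KEY"]).delete_index("rag-documents-index")
```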
Logs are output to the console with the following format:
```text
2024-01-15 10:30:45,123 - module_name - INFO - Log message
```
Adjust `LOG_LEVEL` in `.env` to control verbosity: `DEBUG`, `INFO`, `WARNING`, `ERROR`, `CRITICAL`.
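That format matches the standard library's logging module; here is a sketch of the setup `helpers.py` plausibly uses (reading `LOG_LEVEL` from the environment is an assumption):

```python
# Sketch of logging setup producing the format shown above.
import logging
import os

logging.basicConfig(
    level=os.getenv("LOG_LEVEL", "INFO"),
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
)

logger = logging.getLogger(__name__)
logger.info("Log message")
```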
To test the RAG system with sample documents:
- Create a `sample_docs/` directory:

```bash
mkdir sample_docs
```

- Add test `.txt` files to `sample_docs/`
- Process them:

```bash
python main.py process sample_docs/
```

- Start Streamlit and ask questions:

```bash
streamlit run app.py
```

Key dependencies (see `requirements.txt` for the complete list):
- LangChain: RAG framework and NLP utilities
- Pinecone: Vector database
- Google Generative AI: Embeddings and LLM
- Streamlit: Web interface
- PyPDF2: PDF processing
- python-docx: DOCX processing
- python-dotenv: Environment variable management
This project is provided as-is for educational and commercial use.
Contributions are welcome! Please feel free to submit pull requests for improvements.
For issues or questions:
- Check the Troubleshooting section above
- Review logs for error details
- Check API key configurations
- Ensure all dependencies are installed
Built with ❤️ using LangChain, Pinecone, Google Gemini, and Streamlit