A full Retrieval-Augmented Generation (RAG) chatbot for pharmaceutical SDS and other PDF documents. Upload drug labels, clinical trial reports, safety data sheets, and pharmacology papers, then ask questions and get answers with cited sources, generated by an open-source LLM.
PDF Upload → OCR / Extract → Chunk + Metadata → Embed → FAISS Index
↓
Answer + Sources ← Zephyr-7B ← Prompt ← Retrieve ← User Query
| Component | Technology |
|---|---|
| PDF Extraction | PyMuPDF (fitz) |
| OCR | Tesseract via pytesseract |
| Chunking | Word-level sliding window (400 words, 80-word overlap) |
| Metadata | Heuristic doc-type classifier (5 categories) |
| Embeddings | sentence-transformers/all-MiniLM-L6-v2 |
| Vector Store | FAISS IndexFlatIP (inner product; equals cosine similarity on L2-normalized embeddings) |
| LLM | HuggingFaceH4/zephyr-7b-beta |
| UI | Gradio Blocks |
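
The chunking row above can be illustrated with a minimal sketch. It assumes whitespace tokenization and a hypothetical helper name, `chunk_words`; the project's actual implementation may differ in how it tokenizes or handles the final window.

```python
from typing import List


def chunk_words(text: str, size: int = 400, overlap: int = 80) -> List[str]:
    """Split text into overlapping word windows (hypothetical helper)."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()          # assumption: words are whitespace-separated
    step = size - overlap         # 320 words with the defaults
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break                 # this window already reaches the end of the text
    return chunks
```

With the defaults, consecutive chunks share 80 words, so a sentence that straddles a window boundary still appears whole in at least one chunk as long as it is shorter than the overlap.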
- SDS — Safety Data Sheets
- Clinical Trial — Randomized trials, efficacy studies
- Drug Label — Prescribing information, contraindications
- Pharmacology — PK/PD, bioavailability, metabolism
- General — Any other pharmaceutical PDF
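
A heuristic classifier over these five categories could look roughly like the sketch below. The keyword lists and the `classify_doc` name are illustrative assumptions, not the project's actual rules; the idea is simply to count category-specific phrases and fall back to General when nothing matches.

```python
from typing import Dict, Tuple

# Hypothetical keyword lists, chosen only for illustration.
KEYWORDS: Dict[str, Tuple[str, ...]] = {
    "SDS": ("safety data sheet", "hazard identification", "first aid measures"),
    "Clinical Trial": ("randomized", "placebo", "primary endpoint"),
    "Drug Label": ("prescribing information", "contraindications",
                   "dosage and administration"),
    "Pharmacology": ("pharmacokinetic", "bioavailability", "metabolism", "half-life"),
}


def classify_doc(text: str) -> str:
    """Return the category whose keywords occur most often, else 'General'."""
    lowered = text.lower()
    scores = {
        label: sum(lowered.count(keyword) for keyword in keywords)
        for label, keywords in KEYWORDS.items()
    }
    best = max(scores, key=scores.get)  # ties resolve to the first category listed
    return best if scores[best] > 0 else "General"
```

In practice such a classifier would probably look only at the first page or two of extracted text, where titles and section headings carry most of the signal.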
- 🔬 Auto OCR fallback — detects scanned pages and switches to Tesseract automatically
- 🏷️ Smart metadata tagging — classifies every document on ingest
- 🎯 Confidence scores — per-chunk cosine similarity with visual bars
- 🔀 Document-type filter — scope retrieval to a single category
- 📎 Multi-file upload — ingest multiple PDFs in one session
- 💬 Sourced answers — every claim cited with `[Source N]` notation
- ⏳ Live status bar — shows retrieval latency and chunk count
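
The confidence bars and `[Source N]` citations could be produced along the lines of the sketch below. The function names, the bar characters, and the prompt wording are assumptions made for illustration; retrieved hits are modeled as `(chunk_text, similarity)` pairs.

```python
from typing import List, Tuple


def confidence_bar(score: float, width: int = 10) -> str:
    """Render a similarity score in [0, 1] as a fixed-width text bar."""
    clamped = max(0.0, min(1.0, score))   # guard against scores outside [0, 1]
    filled = round(clamped * width)
    return "█" * filled + "░" * (width - filled)


def build_prompt(question: str, hits: List[Tuple[str, float]]) -> str:
    """Number each retrieved chunk so the model can cite it as [Source N]."""
    sources = "\n".join(
        f"[Source {i}] {text}" for i, (text, _score) in enumerate(hits, start=1)
    )
    return (
        "Answer using only the sources below and cite each claim as [Source N].\n\n"
        f"{sources}\n\nQuestion: {question}\nAnswer:"
    )
```

Numbering the chunks in the prompt is what lets the UI map each `[Source N]` in the answer back to a specific chunk and its similarity score.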
MIT



