This project is a Proof of Concept (POC) for a terminal-based intelligent dialog system built on a Retrieval-Augmented Generation (RAG) architecture. It validates the RAG logic in a terminal environment, using LangGraph for orchestration, a Gemini LLM for generation, and Qdrant for vector storage. Data is ingested from external web pages via a resumable pipeline.
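At a high level, the agent is a small LangGraph state machine: a retrieval node queries Qdrant for relevant chunks, and a generation node prompts Gemini with the question plus that context. The sketch below shows that shape with stubbed node bodies; the node names and state fields are illustrative, not the project's actual code.

```python
# Minimal sketch of a two-node LangGraph RAG agent (illustrative only).
from typing import List, TypedDict

from langgraph.graph import END, StateGraph


class RAGState(TypedDict):
    question: str       # the user's query
    context: List[str]  # chunks retrieved from Qdrant
    answer: str         # the LLM's final response


def retrieve(state: RAGState) -> dict:
    # Stub: a real node would embed the question and query Qdrant.
    return {"context": ["<retrieved chunks would go here>"]}


def generate(state: RAGState) -> dict:
    # Stub: a real node would prompt Gemini with question + context.
    return {"answer": f"Answer grounded in {len(state['context'])} chunk(s)."}


graph = StateGraph(RAGState)
graph.add_node("retrieve", retrieve)
graph.add_node("generate", generate)
graph.set_entry_point("retrieve")
graph.add_edge("retrieve", "generate")
graph.add_edge("generate", END)
agent = graph.compile()

print(agent.invoke({"question": "What does the ingested site say about X?"}))
```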
- Local Qdrant: Runs a local Docker instance of Qdrant for vector storage.
- Intelligent Web Ingestion Pipeline: A fully automated, resumable pipeline (see the sketch after this list) that:
  - Crawls: Uses Playwright to fetch JavaScript-rendered HTML from the URLs in `data/data_sources.json`.
  - Processes: Extracts clean text from the HTML, preserving structure.
  - Translates: Performs document-level translation to English using a Gemini LLM.
  - Chunks & Ingests: Splits the translated text into semantically coherent chunks and stores them in Qdrant with metadata.
- RAG Agent: A LangGraph agent that performs RAG against the ingested data.
- Terminal Interaction: A simple CLI for interactive Q&A.
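The sketch below illustrates the crawl and translate steps under two assumptions: that `data/data_sources.json` is a plain JSON list of URLs, and that a `gemini-1.5-flash` model is used for translation. Neither is confirmed here; treat this as the shape of the pipeline, not its actual code.

```python
# Illustrative sketch of the crawl + translate pipeline steps.
import json
import os

import google.generativeai as genai
from playwright.sync_api import sync_playwright

genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-flash")  # illustrative model choice

with open("data/data_sources.json") as f:
    urls = json.load(f)  # assumed shape: ["https://example.com/page", ...]

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    for url in urls:
        # Fetch the fully rendered DOM, including JavaScript-inserted content.
        page.goto(url, wait_until="networkidle")
        html = page.content()
        # Document-level translation with Gemini. (The real pipeline extracts
        # clean text from the HTML first; that step is omitted here.)
        response = model.generate_content(
            "Translate the following document to English:\n\n" + html
        )
        print(url, "->", response.text[:120])
    browser.close()
```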
- Docker
- Python 3.9+
- uv - Fast Python package installer
- Install uv (if not already installed):
curl -LsSf https://astral.sh/uv/install.sh | sh
- Start Qdrant:
docker-compose up -d
- Configure API Key (see the sketch after these steps):
Create a `.env` file in the project root and add your Gemini API key:
GEMINI_API_KEY="your_gemini_api_key_here"
- Create Virtual Environment and Install Dependencies:
uv venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
uv pip install -r requirements.txt
- Install Playwright Browsers:
This one-time command is needed to download the browsers for the web crawler.
playwright install
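At runtime the key is typically read from `.env` via something like the following (a sketch; the project's actual configuration loading may differ):

```python
# Sketch: load GEMINI_API_KEY from .env (requires the python-dotenv package).
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory
api_key = os.getenv("GEMINI_API_KEY")
if not api_key:
    raise RuntimeError("GEMINI_API_KEY is not set; check your .env file.")
```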
Make sure your virtual environment is activated before running commands:
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
To run the entire resumable pipeline (crawl, process, translate, and ingest into Qdrant), use the `--crawl` flag:
python -m src.main --crawl --dir data/crawled/processed
This single command manages all steps. If interrupted, run it again to resume where it left off (see the sketch below).
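Resumability relies on `data/crawled/progress.json`. Its exact format isn't documented here; the sketch below assumes a simple URL-to-stage mapping to show how an interrupted run can skip completed work:

```python
# Sketch of resumable-pipeline bookkeeping; the real progress.json format
# may differ from the URL -> stage mapping assumed here.
import json
from pathlib import Path

PROGRESS_FILE = Path("data/crawled/progress.json")


def load_progress() -> dict:
    if PROGRESS_FILE.exists():
        return json.loads(PROGRESS_FILE.read_text())
    return {}


def mark_done(progress: dict, url: str, stage: str) -> None:
    # Persist after every step so an interrupted run loses almost nothing.
    progress[url] = stage
    PROGRESS_FILE.write_text(json.dumps(progress, indent=2))


progress = load_progress()
for url in ["https://example.com/a", "https://example.com/b"]:
    if progress.get(url) == "ingested":
        continue  # already fully processed on a previous run
    # ... crawl, process, translate, ingest ...
    mark_done(progress, url, "ingested")
```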
To test chunking and ingestion directly, you can run the `ingest.py` script, although the `main.py` entry point shown above is recommended.
Prerequisites:
- Qdrant must be running (`docker-compose up -d`).
- The virtual environment must be activated.
- A valid `GEMINI_API_KEY` must be set in your `.env` file.
- At least one translated English text file (`_en.txt`) must exist in `data/crawled/processed/`.
What to Expect:
The script will ingest new or updated `_en.txt` files, chunk them, and load them into the `web_content` collection in Qdrant.
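For orientation, here is a sketch of what that step does conceptually. The embedding model (`text-embedding-004`, 768 dimensions) and the blank-line chunker are illustrative assumptions; `ingest.py` may chunk and embed differently.

```python
# Sketch: embed each chunk of every *_en.txt file and upsert into Qdrant.
import os
import uuid
from pathlib import Path

import google.generativeai as genai
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

genai.configure(api_key=os.environ["GEMINI_API_KEY"])
client = QdrantClient(url="http://localhost:6333")

if not client.collection_exists("web_content"):
    client.create_collection(
        collection_name="web_content",
        vectors_config=VectorParams(size=768, distance=Distance.COSINE),
    )

for path in Path("data/crawled/processed").glob("*_en.txt"):
    text = path.read_text(encoding="utf-8")
    # Naive chunking on blank lines; the real pipeline aims for
    # semantically coherent chunks.
    chunks = [c for c in text.split("\n\n") if c.strip()]
    points = []
    for chunk in chunks:
        embedding = genai.embed_content(
            model="models/text-embedding-004", content=chunk
        )["embedding"]
        points.append(
            PointStruct(
                id=str(uuid.uuid4()),
                vector=embedding,
                payload={"source": path.name, "text": chunk},
            )
        )
    client.upsert(collection_name="web_content", points=points)
```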
Once data has been ingested, you can start the dialog system for Q&A:
python -m src.main
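The CLI is a simple read-eval-print loop around the agent. A sketch, where prompt strings and exit handling are assumptions:

```python
# `agent` is the compiled LangGraph graph from the sketch near the top of
# this README; the real CLI's prompts and exit handling may differ.
while True:
    question = input("You: ").strip()
    if question.lower() in {"exit", "quit", ""}:
        break
    result = agent.invoke({"question": question})
    print(f"Bot: {result['answer']}")
```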
Project Structure:
├── data/
│ ├── crawled/
│ │ ├── processed/
│ │ ├── structured/
│ │ └── progress.json
│ └── data_sources.json
├── src/
│ ├── agent/
│ ├── gemini/
│ └── ingestion/
├── .gitignore
├── docker-compose.yml
├── GEMINI.md
├── phases/
├── PROGRESS_LOG.md
└── README.md