HydroRAG is an agent that goes beyond generic RAG. It routes each water quality question through the retrieval path best suited to it, combining semantic retrieval, LLM-generated SQL queries, deterministic logic verification, and zero-shot LLM-generated persuasion strategies. Unlike standard RAG systems that rely solely on similarity matching, HydroRAG compares local water test data against EPA regulatory standards to provide numerically accurate, regulation-grounded answers. The architecture routes queries through structured SQL extraction, vector similarity search, and a two-tier web search. It then draws on LLM-generated persuasive strategies and validates all claims against authoritative standards before synthesizing the final response.
- Query Validation & Strategy Extraction: Validates water-related queries and rejects out-of-scope questions to save compute.
- Routing RAG Retrieval Path:
  - Structured (Text-to-SQL): Queries the local contamination database using LLM-generated SQL for precise measurements
- Data Sufficiency Check (Semantic Ranking): Uses embedding-based semantic similarity to rank the retrieved results by relevance to the question, then evaluates whether the retrieved local data can fully answer it
- Conditional Web Search: If local data is insufficient, triggers two-tier web search (curated trusted sources first, then open web if needed)
- EPA Logic Check: Deterministic comparison against EPA Maximum Contaminant Levels (MCLs) using unit conversion and regulatory standards
- Strategy-Based Generation: Extracts persuasive strategies from LLM-generated text, fact-checks them, and merges them into the final response
- ZIP Code-Specific Queries: Extracts ZIP codes from queries and retrieves location-specific contamination measurements
- Contaminant Extraction: Identifies specific contaminants mentioned in queries for targeted retrieval
- LLM-Generated SQL: Uses LLM to generate precise SQL queries for complex contamination data retrieval
- General Statistics: Provides aggregate statistics when specific location data isn't available
- Deterministic Verification: All contamination measurements are automatically compared against EPA Maximum Contaminant Levels (MCLs)
- Unit Conversion: Handles automatic unit normalization (mg/L, ppm, ppb) for accurate regulatory comparisons
- Exceedance Detection: Identifies contaminants that exceed EPA limits with precise exceedance ratios
- Safety Assessment: Provides risk-level classifications (CRITICAL, HIGH, MODERATE, LOW, SAFE) based on EPA standards
- Transparent Reporting: Generates comprehensive summaries showing both exceedances and contaminants within safe limits
- Curated Sources First: Prioritizes trusted domains (EPA.gov, CDC.gov, WHO.int, academic sources) with LLM-based sufficiency evaluation
- Open Web Fallback: Searches broader web only if curated sources are insufficient
- Caching: Caches search results to reduce API calls and improve response time
- Intelligent Query Routing: Automatically determines whether to use structured SQL queries, vector similarity search, or web search
- Confidence Score: Shows a confidence score to users for transparency; the same score guides which pipeline steps run, which saves compute
- Early Exit: Builds the response directly with the LLM when the retrieved RAG data is sufficient
- Strategy Extraction: Decomposes LLM-generated responses into distinct persuasive strategies using dynamic labeling
- Fact-Check & Information Retrieval: Validates each strategy claim against retrieved evidence using IR methods
- Strategy Merging: Intelligently combines verified strategies with facts using frequency-based scoring and semantic alignment
- Multi-Turn Context: Maintains conversation history for natural follow-up questions with location awareness
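
The deterministic EPA check listed above (unit normalization, exceedance ratio, risk level) can be illustrated with a minimal sketch. This is not HydroRAG's actual implementation: the names `normalize`, `assess`, and `Assessment` are hypothetical, the two MCL values are illustrative and should be confirmed against the current EPA tables, and the ratio cutoffs for the risk labels are assumptions chosen only for the example.

```python
from dataclasses import dataclass

# Conversion factors to mg/L. For dilute aqueous solutions, 1 ppm is treated
# as 1 mg/L and 1 ppb as 0.001 mg/L.
TO_MG_PER_L = {"mg/l": 1.0, "ppm": 1.0, "ppb": 0.001}

# Illustrative MCLs in mg/L; confirm against the current EPA tables before use.
MCL_MG_PER_L = {"arsenic": 0.010, "nitrate": 10.0}

# Hypothetical cutoffs on the ratio measured / MCL, checked from highest to lowest.
RISK_CUTOFFS = [(5.0, "CRITICAL"), (2.0, "HIGH"), (1.0, "MODERATE"), (0.5, "LOW")]


@dataclass
class Assessment:
    contaminant: str
    value_mg_per_l: float
    mcl_mg_per_l: float
    ratio: float      # measured value divided by the MCL
    exceeds: bool     # True when the measurement is above the MCL
    risk: str


def normalize(value: float, unit: str) -> float:
    """Convert a measurement to mg/L; raise on units this sketch does not know."""
    key = unit.strip().lower()
    if key not in TO_MG_PER_L:
        raise ValueError(f"unsupported unit: {unit!r}")
    return value * TO_MG_PER_L[key]


def assess(contaminant: str, value: float, unit: str) -> Assessment:
    """Compare one measurement against its MCL without involving an LLM."""
    name = contaminant.strip().lower()
    mcl = MCL_MG_PER_L[name]  # KeyError for contaminants not in the table
    mg_per_l = normalize(value, unit)
    ratio = mg_per_l / mcl
    # First cutoff the ratio exceeds wins; below every cutoff the result is SAFE.
    risk = next((label for cutoff, label in RISK_CUTOFFS if ratio > cutoff), "SAFE")
    return Assessment(name, mg_per_l, mcl, ratio, ratio > 1.0, risk)
```

Under these assumed cutoffs, `assess("Arsenic", 25, "ppb")` reports an exceedance at roughly 2.5 times the limit. Because the comparison is plain arithmetic, the same input always yields the same verdict, which is what lets it serve as a check on LLM-generated claims.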
```bash
# Clone the repository
git clone <repository-url>
cd WaterBot

# Install dependencies
pip install -r requirements.txt
```
4. **Set up data directories**
```bash
mkdir -p data/raw data/processed
```

Configure the API by setting the environment variables in your shell before running `main_ui.py`:
```bash
export AZURE_OPENAI_API_KEY='YOUR_KEY_HERE'
# if needed:
export AZURE_OPENAI_ENDPOINT='https://hana2025aut.openai.azure.com/'
export AZURE_OPENAI_DEPLOYMENT_NAME='gpt-4.1' # or your embedding deployment
python main_ui.py
```

Create or edit `config.yaml`:
```yaml
embeddings:
  enable_embeddings: true
  embedding_model: 'text-embedding-ada-002' # Your Azure embedding deployment name
web_search:
  provider: 'tavily'
  api_key: 'your-tavily-key'
database:
  path: 'data/water_facts.db'
```

```bash
# Load your contamination data
python scripts/load_local_data.py
```

```bash
python main_UI.py
```

Then open this URL in your browser: http://localhost:7860

```bash
python main_terminal.py
```

- Python 3.8+ installed
- Dependencies installed (`pip install -r requirements.txt`)
- `config.yaml` configured with API keys OR environment variables set
- Data directories created (`mkdir -p data/raw data/processed`)
- Local data loaded (`python scripts/load_local_data.py`)
- Choose an entry point (`python main_UI.py` or `python main_terminal.py`)
- Python 3.8+ (Python 3.10+ recommended)
- Azure OpenAI API key (required for LLM)
- Tavily API key (optional, for web search - can use mock mode)
- Excel file with contamination data (for local data features)
See requirements.txt for full dependency list.
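
The setup checklist and requirements above could be verified with a small preflight script before launching either entry point. The sketch below is an assumption, not part of the repository: the function name `preflight` is hypothetical, and it treats the mere presence of `config.yaml` as an alternative to the `AZURE_OPENAI_API_KEY` environment variable without parsing the file.

```python
import os
import sys
from pathlib import Path

# Directories created by `mkdir -p data/raw data/processed` in the setup steps.
REQUIRED_DIRS = ["data/raw", "data/processed"]


def preflight(root=".", env=None, version=None):
    """Return a list of human-readable problems; an empty list means ready."""
    env = os.environ if env is None else env
    version = sys.version_info[:2] if version is None else tuple(version)
    base = Path(root)
    problems = []

    # The README states Python 3.8+ as the minimum.
    if version < (3, 8):
        problems.append(f"Python 3.8+ required, found {version[0]}.{version[1]}")

    # The key may come from the environment or from config.yaml.
    # Assumption: this sketch only checks that config.yaml exists.
    if not env.get("AZURE_OPENAI_API_KEY") and not (base / "config.yaml").is_file():
        problems.append("set AZURE_OPENAI_API_KEY or create config.yaml")

    for rel in REQUIRED_DIRS:
        if not (base / rel).is_dir():
            problems.append(f"missing directory {rel}")

    return problems
```

Calling `preflight()` from the repository root returns an empty list when the Python version, API key source, and data directories are all in place; otherwise each entry names one item still to fix.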
[Specify your license]
If you use HydroRAG in your research, please cite:
```bibtex
@software{hydrorag2025,
  title={HydroRAG: Water Quality Assistant Grounding LLM in Local Data and Logic},
  author={Your Name},
  year={2025},
  url={https://github.com/yourusername/hydrorag}
}
```