HanaMLiu/WaterBot


HydroRAG: Water Quality Assistant Grounding LLM persuasion in Local Data and Logic

HydroRAG is an agent that goes beyond generic RAG. It answers users' water quality questions by routing each query across semantic retrieval, LLM-generated SQL, deterministic logic verification, and zero-shot LLM persuasion. Unlike standard RAG systems that rely solely on similarity matching, HydroRAG compares local water test data against EPA regulatory standards to provide numerically accurate, regulation-grounded answers. The architecture routes queries through structured SQL extraction, vector similarity search, and two-tier web search. It then extracts LLM-generated persuasive strategies and validates their claims against authoritative standards before synthesizing the final response.

Architecture Flow

  1. Query Validation & Strategy Extraction: Validates water-related queries and rejects out-of-scope questions to save compute.
  2. RAG Retrieval Routing:
    • Structured (Text-to-SQL): Queries the local contamination database using LLM-generated SQL for precise measurements.
    • Data Sufficiency Check (Semantic Ranking): Ranks the retrieved results by embedding-based semantic similarity to the question, then evaluates whether the local data can fully answer it.
    • Conditional Web Search: If local data is insufficient, triggers a two-tier web search (curated trusted sources first, then the open web if needed).
  3. EPA Logic Check: Deterministic comparison against EPA Maximum Contaminant Levels (MCLs), with unit conversion.
  4. Strategy-Based Generation: Extracts persuasive strategies from LLM generations, fact-checks them, and merges them into the final response.
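
The flow above can be sketched as a small orchestration function. This is a minimal illustration, not the project's actual code: `run_pipeline`, `PipelineResult`, the keyword gate in `is_water_query`, and every injected callable are hypothetical names chosen for the example, and the real validation step is assumed to be more sophisticated than a keyword match.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence

@dataclass
class PipelineResult:
    answer: str
    steps: List[str] = field(default_factory=list)  # trace of the stages that ran

# Hypothetical keyword list standing in for the real query validator.
WATER_TERMS = ("water", "lead", "arsenic", "nitrate", "contaminant")

def is_water_query(question: str) -> bool:
    """Crude scope gate: accept only questions mentioning a water-related term."""
    q = question.lower()
    return any(term in q for term in WATER_TERMS)

def run_pipeline(
    question: str,
    query_local: Callable[[str], Sequence],          # stage 2a: Text-to-SQL lookup
    is_sufficient: Callable[[str, Sequence], bool],  # stage 2b: sufficiency check
    web_search: Callable[[str], Sequence],           # stage 2c: conditional web search
    epa_check: Callable[[Sequence], Sequence],       # stage 3: deterministic MCL check
    generate: Callable[[str, Sequence, Sequence], str],  # stage 4: final generation
) -> PipelineResult:
    # Stage 1: reject out-of-scope questions before spending any compute.
    if not is_water_query(question):
        return PipelineResult("Out of scope: please ask about water quality.",
                              ["validate:rejected"])
    steps = ["validate:ok"]

    rows = query_local(question)
    steps.append("local_sql")
    evidence = list(rows)

    # Only fall back to the web when local data cannot answer the question.
    if not is_sufficient(question, rows):
        evidence += list(web_search(question))
        steps.append("web_search")

    findings = epa_check(rows)
    steps.append("epa_check")

    answer = generate(question, evidence, findings)
    steps.append("generate")
    return PipelineResult(answer, steps)
```

Passing each stage in as a callable keeps the sketch independent of any particular LLM or search provider, and the `steps` trace makes it easy to see which stages were skipped for a given query.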

Key Features

📊 Local Data Priority

  • ZIP Code-Specific Queries: Extracts ZIP codes from queries and retrieves location-specific contamination measurements
  • Contaminant Extraction: Identifies specific contaminants mentioned in queries for targeted retrieval
  • LLM-Generated SQL: Uses LLM to generate precise SQL queries for complex contamination data retrieval
  • General Statistics: Provides aggregate statistics when specific location data isn't available
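
As a rough illustration of ZIP code and contaminant extraction feeding a local lookup, the sketch below uses a regular expression and a parameterized SQLite query. Note the substitution: HydroRAG generates its SQL with an LLM, whereas this example uses a fixed query template. The `measurements` table, its columns, the `KNOWN_CONTAMINANTS` list, and all function names are assumptions made for the example, not the project's schema.

```python
import re
import sqlite3
from typing import List, Optional, Tuple

# Five-digit US ZIP code, optionally followed by a ZIP+4 suffix.
ZIP_RE = re.compile(r"\b(\d{5})(?:-\d{4})?\b")

# Hypothetical vocabulary; the real extractor may recognize many more names.
KNOWN_CONTAMINANTS = ("arsenic", "lead", "nitrate")

def extract_zip(question: str) -> Optional[str]:
    """Return the first five-digit ZIP code in the question, if any."""
    match = ZIP_RE.search(question)
    return match.group(1) if match else None

def extract_contaminants(question: str) -> List[str]:
    """Return known contaminant names that appear as whole words."""
    q = question.lower()
    return [c for c in KNOWN_CONTAMINANTS if re.search(rf"\b{c}\b", q)]

def fetch_measurements(
    conn: sqlite3.Connection, zip_code: str, contaminants: List[str]
) -> List[Tuple[str, float, str]]:
    """Query the (assumed) measurements table using bound parameters only."""
    sql = "SELECT contaminant, value, unit FROM measurements WHERE zip_code = ?"
    params: List[str] = [zip_code]
    if contaminants:
        placeholders = ",".join("?" for _ in contaminants)
        sql += f" AND contaminant IN ({placeholders})"
        params.extend(contaminants)
    sql += " ORDER BY contaminant"
    return conn.execute(sql, params).fetchall()
```

Binding user-derived values as parameters, rather than interpolating them into the SQL string, is also a sensible guard to keep when the query text itself comes from an LLM.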

🛡️ EPA Logic Check

  • Deterministic Verification: All contamination measurements are automatically compared against EPA Maximum Contaminant Levels (MCLs)
  • Unit Conversion: Handles automatic unit normalization (mg/L, ppm, ppb) for accurate regulatory comparisons
  • Exceedance Detection: Identifies contaminants that exceed EPA limits with precise exceedance ratios
  • Safety Assessment: Provides risk-level classifications (CRITICAL, HIGH, MODERATE, LOW, SAFE) based on EPA standards
  • Transparent Reporting: Generates comprehensive summaries showing both exceedances and contaminants within safe limits
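
The deterministic check described above might look roughly like the following. This is a sketch under stated assumptions: the function and class names are hypothetical, the ratio thresholds behind each risk level are invented for illustration (the README does not state them), and only two MCLs are included (arsenic at 0.010 mg/L and nitrate at 10 mg/L measured as nitrogen). Treating ppm as mg/L and ppb as µg/L is the usual approximation for dilute aqueous solutions.

```python
from dataclasses import dataclass

# Conversion factors to mg/L (ppm ~ mg/L and ppb ~ ug/L in dilute water samples).
TO_MG_PER_L = {"mg/l": 1.0, "ppm": 1.0, "ug/l": 1e-3, "ppb": 1e-3}

# EPA Maximum Contaminant Levels in mg/L; deliberately a tiny subset.
MCL_MG_PER_L = {"arsenic": 0.010, "nitrate": 10.0}

@dataclass
class Finding:
    contaminant: str
    value_mg_l: float
    mcl_mg_l: float
    ratio: float      # measured value divided by the MCL
    exceeds: bool
    level: str

def to_mg_per_l(value: float, unit: str) -> float:
    """Normalize a measurement to mg/L; raises KeyError on unknown units."""
    return value * TO_MG_PER_L[unit.strip().lower()]

def classify(ratio: float) -> str:
    """Map an exceedance ratio to a risk level (thresholds are assumptions)."""
    if ratio >= 5:
        return "CRITICAL"
    if ratio >= 2:
        return "HIGH"
    if ratio > 1:
        return "MODERATE"
    if ratio >= 0.5:
        return "LOW"
    return "SAFE"

def check_against_mcl(contaminant: str, value: float, unit: str) -> Finding:
    """Compare one measurement with its MCL after unit normalization."""
    name = contaminant.strip().lower()
    mcl = MCL_MG_PER_L[name]
    value_mg_l = to_mg_per_l(value, unit)
    ratio = value_mg_l / mcl
    return Finding(name, value_mg_l, mcl, ratio, ratio > 1, classify(ratio))
```

For example, `check_against_mcl("arsenic", 25, "ppb")` normalizes the reading to 0.025 mg/L, which is about 2.5 times the MCL and therefore lands in the HIGH band under these illustrative thresholds.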

🔍 Two-Tier Web Search

  • Curated Sources First: Prioritizes trusted domains (EPA.gov, CDC.gov, WHO.int, academic sources) with LLM-based sufficiency evaluation
  • Open Web Fallback: Searches broader web only if curated sources are insufficient
  • Caching: Caches search results to reduce API calls and improve response time
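
A minimal sketch of the two-tier search with caching is shown below. The search backend is injected as a plain callable so the example does not depend on any provider's API; `TieredSearch`, `search_fn`, and `min_results` are hypothetical names. The sufficiency test here is a simple result count, standing in for the LLM-based sufficiency evaluation that HydroRAG actually uses.

```python
from typing import Callable, Dict, List, Optional, Sequence

# Curated domains named in the feature list above.
TRUSTED_DOMAINS = ("epa.gov", "cdc.gov", "who.int")

# search_fn(query, domains) -> list of result dicts; domains=None means open web.
SearchFn = Callable[[str, Optional[Sequence[str]]], List[dict]]

class TieredSearch:
    def __init__(self, search_fn: SearchFn, min_results: int = 2) -> None:
        self._search = search_fn
        self._min_results = min_results  # assumed stand-in for LLM sufficiency
        self._cache: Dict[str, dict] = {}

    def search(self, query: str) -> dict:
        # Normalize case and whitespace so trivially different queries share a cache entry.
        key = " ".join(query.lower().split())
        if key in self._cache:
            return self._cache[key]

        # Tier 1: curated sources only.
        results = list(self._search(query, TRUSTED_DOMAINS))
        tier = "curated"

        # Tier 2: widen to the open web only when tier 1 looks insufficient.
        if len(results) < self._min_results:
            results += list(self._search(query, None))
            tier = "open"

        outcome = {"tier": tier, "results": results}
        self._cache[key] = outcome
        return outcome
```

Because the cache sits in front of both tiers, a repeated question costs no API calls at all, which matches the stated goal of reducing calls and response time.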

🧠 Strategy Routing

  • Intelligent Query Routing: Automatically determines whether to use structured SQL queries, vector similarity search, or web search
  • Confidence Score: Shows a confidence score to users for transparency; the same score guides which pipeline steps run, saving compute
  • Early Exit: Generates the response directly with the LLM when the retrieved RAG data is sufficient
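
One way to picture confidence-guided routing is to score retrieved passages by cosine similarity to the question and exit early when the best score is high enough. The sketch below assumes this interpretation; the README does not say how the confidence score is computed. The function names, the route labels, and the 0.75 threshold are illustrative, and the vectors would come from an embedding model in practice.

```python
import math
from typing import List, Sequence, Tuple

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_by_similarity(
    query_vec: Sequence[float], docs: List[Tuple[str, Sequence[float]]]
) -> List[Tuple[float, str]]:
    """Return (score, text) pairs sorted from most to least similar."""
    scored = [(cosine(query_vec, vec), text) for text, vec in docs]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

def choose_route(ranked: List[Tuple[float, str]], threshold: float = 0.75) -> Tuple[str, float]:
    """Early exit when the top score clears the (assumed) threshold."""
    confidence = ranked[0][0] if ranked else 0.0
    route = "answer_from_local" if confidence >= threshold else "web_search"
    return route, confidence
```

Returning the confidence alongside the route lets the same number be shown to the user and used for control flow, as the feature list describes.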

👾 Persuasion & Fact Check

  • Strategy Extraction: Decomposes LLM-generated responses into distinct persuasive strategies using dynamic labeling
  • Fact-Check & Information Retrieval: Validates each strategy claim against retrieved evidence using IR methods
  • Strategy Merging: Intelligently combines verified strategies with facts using frequency-based scoring and semantic alignment
  • Multi-Turn Context: Maintains conversation history for natural follow-up questions with location awareness
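
The fact-check and merge steps can be illustrated with a deliberately simple stand-in. Here a claim counts as supported when enough of its words appear in a piece of retrieved evidence, and surviving strategies are ordered by how often their label occurred across candidate generations. Token overlap replaces the IR methods and semantic alignment mentioned above, and all names and the 0.5 overlap threshold are assumptions made for this sketch.

```python
import re
from collections import Counter
from typing import Dict, List, Sequence, Tuple

def tokens(text: str) -> set:
    """Lowercased alphanumeric word set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def is_supported(claim: str, evidence: Sequence[str], min_overlap: float = 0.5) -> bool:
    """True if some evidence passage covers at least min_overlap of the claim's words."""
    claim_tokens = tokens(claim)
    if not claim_tokens:
        return False
    return any(
        len(claim_tokens & tokens(passage)) / len(claim_tokens) >= min_overlap
        for passage in evidence
    )

def merge_strategies(
    candidates: List[Tuple[str, str]], evidence: Sequence[str]
) -> List[Tuple[str, str]]:
    """candidates holds (strategy_label, claim) pairs from LLM generations.

    Keeps the first supported claim per label, then sorts labels by how
    frequently they appeared (most frequent first, ties broken by label).
    """
    frequency = Counter(label for label, _ in candidates)
    kept: Dict[str, str] = {}
    for label, claim in candidates:
        if label not in kept and is_supported(claim, evidence):
            kept[label] = claim
    return sorted(kept.items(), key=lambda item: (-frequency[item[0]], item[0]))
```

Unsupported strategies are dropped entirely rather than down-weighted, which keeps unverified claims out of the final response.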

Setup

Installation

# Clone the repository
git clone <repository-url>
cd WaterBot

# Install dependencies
pip install -r requirements.txt

# Set up data directories
mkdir -p data/raw data/processed

Configuration

Configure API access by setting the environment variables in your shell before running main_ui.py:

export AZURE_OPENAI_API_KEY='YOUR_KEY_HERE'
# if needed:
export AZURE_OPENAI_ENDPOINT='https://hana2025aut.openai.azure.com/'
export AZURE_OPENAI_DEPLOYMENT_NAME='gpt-4.1'   # or your embedding deployment
python main_ui.py
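
Before launching, it can help to confirm that the variables above are actually set. The helper below is a small optional sketch, not part of the repository: `missing_vars` and `preflight` are hypothetical names, and the split into required and optional variables follows the "if needed" comment in the commands above.

```python
import os
from typing import List, Mapping, Sequence

REQUIRED_VARS = ("AZURE_OPENAI_API_KEY",)
OPTIONAL_VARS = ("AZURE_OPENAI_ENDPOINT", "AZURE_OPENAI_DEPLOYMENT_NAME")

def missing_vars(env: Mapping[str, str], names: Sequence[str] = REQUIRED_VARS) -> List[str]:
    """Return the names that are unset or blank in the given environment mapping."""
    return [name for name in names if not env.get(name, "").strip()]

def preflight(env: Mapping[str, str]) -> List[str]:
    """Build human-readable messages; an empty list means everything is set."""
    messages = [f"ERROR: {name} is required but not set" for name in missing_vars(env)]
    messages += [
        f"NOTE: {name} is not set"
        for name in missing_vars(env, OPTIONAL_VARS)
    ]
    return messages

# Usage: print any problems found in the current shell environment.
# for line in preflight(os.environ):
#     print(line)
```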

Create or edit config.yaml:

embeddings:
  enable_embeddings: true
  embedding_model: 'text-embedding-ada-002'  # Your Azure embedding deployment name

web_search:
  provider: 'tavily'
  api_key: 'your-tavily-key'

database:
  path: 'data/water_facts.db'

Load Local Data

# Load your contamination data
python scripts/load_local_data.py

Run

python main_UI.py

Then open http://localhost:7860 in your browser.

Terminal Interface

python main_terminal.py

Quick Start Checklist

  • Python 3.8+ installed
  • Dependencies installed (pip install -r requirements.txt)
  • config.yaml configured with API keys OR environment variables set
  • Data directories created (mkdir -p data/raw data/processed)
  • Local data loaded (python scripts/load_local_data.py)
  • Entry point chosen (python main_UI.py or python main_terminal.py)

Requirements

  • Python 3.8+ (Python 3.10+ recommended)
  • Azure OpenAI API key (required for LLM)
  • Tavily API key (optional, for web search - can use mock mode)
  • Excel file with contamination data (for local data features)

See requirements.txt for full dependency list.

License

[Specify your license]

Citation

If you use HydroRAG in your research, please cite:

@software{hydrorag2025,
  title={HydroRAG: Water Quality Assistant Grounding LLM in Local Data and Logic},
  author={Your Name},
  year={2025},
  url={https://github.com/yourusername/hydrorag}
}
