🇨🇦 Verity - Finding Truth Within the Noise


A Retrieval-Augmented Generation (RAG) system for analyzing 3M+ Reddit posts from r/Canada

Built for AI Hackathon in the North 2026

Features · Demo · Architecture · Setup · Tech Stack


🎯 What is Verity?

Verity is an AI-powered platform that helps users understand Canadian public discourse by analyzing millions of Reddit posts. Using advanced RAG (Retrieval-Augmented Generation) technology with Groq's ultra-fast LLM inference, it retrieves relevant discussions and generates insightful answers backed by real community conversations.

🌟 Key Highlights

  • 🔍 Semantic Search - Vector similarity search across 2.97M+ Reddit posts
  • ⚡ Lightning Fast - Powered by Groq's llama-3.3-70b-versatile (sub-second for cached queries)
  • 🤖 AI Insights - Context-aware answer generation with source attribution
  • 📊 Real-time Stats - Live API usage tracking and dataset statistics
  • 🎨 Beautiful UI - Modern React interface with gradient glassmorphic design

🚀 Live Demo

Backend API

🔗 https://shams0026-canadaconvo-backend.hf.space

Try the query endpoint:

Example Query

curl -X POST "https://shams0026-canadaconvo-backend.hf.space/api/v1/query" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What are Canadians saying about healthcare?",
    "top_k": 10
  }'
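The same query can be issued from Python. This is a minimal sketch using only the standard library against the public endpoint above; the payload fields match the documented API, but the helper names are illustrative, not part of the project:

```python
import json
import urllib.request

API_URL = "https://shams0026-canadaconvo-backend.hf.space/api/v1/query"

def build_query_payload(query: str, top_k: int = 10) -> dict:
    """Build the JSON body documented for POST /api/v1/query."""
    return {"query": query, "top_k": top_k}

def ask_verity(query: str, top_k: int = 10) -> dict:
    """POST the query to the live backend and return the parsed JSON response."""
    data = json.dumps(build_query_payload(query, top_k)).encode("utf-8")
    req = urllib.request.Request(
        API_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Example usage (performs a live network call):
#   result = ask_verity("What are Canadians saying about healthcare?")
#   print(result["answer"])
```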

✨ Features

🎯 Core Capabilities

🔍 Intelligent Search

  • Semantic understanding of natural language queries
  • Vector similarity search using sentence-transformers
  • Context-aware retrieval from 2.97M posts
  • Relevance scoring and ranking

🤖 AI-Powered Analysis

  • Groq API integration for ultra-fast inference
  • Gemini API fallback for reliability
  • Source-backed answers with citations
  • Comprehensive sentiment analysis

⚡ Performance Optimized

  • Response caching for instant results
  • Efficient batch processing
  • ChromaDB for fast vector operations
  • Sub-second response times for cached queries

🎨 Modern Interface

  • Gradient glassmorphic design
  • Real-time statistics dashboard
  • Responsive layout for all devices
  • Smooth animations and transitions

🏗️ Architecture

System Overview

┌─────────────────────────────────────────────────────────────┐
│                    FRONTEND (React + Vite)                  │
│  ┌────────────┐  ┌──────────────┐  ┌──────────────┐       │
│  │   Chat     │  │   Message    │  │    Source    │       │
│  │ Interface  │→ │   Display    │→ │    Cards     │       │
│  └────────────┘  └──────────────┘  └──────────────┘       │
│         ↓                                                   │
│    API Client (services/api.ts)                            │
└─────────────────────────────────────────────────────────────┘
                          ↓ HTTPS
┌─────────────────────────────────────────────────────────────┐
│              BACKEND (FastAPI on HF Space)                  │
│  ┌──────────────────────────────────────────────────────┐  │
│  │            API Layer (/api/v1)                       │  │
│  │   /query  |  /health  |  /stats                     │  │
│  └──────────────────────────────────────────────────────┘  │
│                          ↓                                  │
│  ┌──────────────────────────────────────────────────────┐  │
│  │                  RAG Engine                          │  │
│  │  ┌────────────┐  ┌────────────┐  ┌────────────┐    │  │
│  │  │  Embed     │→ │  Retrieve  │→ │  Generate  │    │  │
│  │  │  Query     │  │  Context   │  │  Response  │    │  │
│  │  └────────────┘  └────────────┘  └────────────┘    │  │
│  └──────────────────────────────────────────────────────┘  │
│                          ↓                                  │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐    │
│  │  Groq API    │  │   ChromaDB   │  │ Embeddings   │    │
│  │ (Primary)    │  │  Vector DB   │  │   Service    │    │
│  └──────────────┘  └──────────────┘  └──────────────┘    │
└─────────────────────────────────────────────────────────────┘

RAG Pipeline Flow

Step-by-Step:

  1. Query Embedding → Convert user query to 384-dim vector using sentence-transformers
  2. Vector Search → Find top-K similar posts from 2.97M embeddings in ChromaDB
  3. Context Building → Structure retrieved posts into a comprehensive prompt
  4. LLM Generation → Groq's llama-3.3-70b generates insights from context
  5. Response Formatting → Return answer with source citations and metadata
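In miniature, steps 1-2 reduce to embedding the query and ranking stored vectors by similarity. The toy sketch below uses hand-made 4-dim vectors standing in for the real 384-dim sentence-transformers embeddings stored in ChromaDB; all names here are illustrative, not the project's actual code:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, corpus, k=2):
    """Return the ids of the k posts most similar to the query vector."""
    scored = sorted(corpus.items(), key=lambda kv: cosine(query_vec, kv[1]), reverse=True)
    return [post_id for post_id, _ in scored[:k]]

# Toy "embeddings": in Verity these come from all-MiniLM-L6-v2 and live in ChromaDB.
corpus = {
    "post_healthcare": [0.9, 0.1, 0.0, 0.0],
    "post_housing":    [0.1, 0.9, 0.0, 0.0],
    "post_weather":    [0.0, 0.0, 1.0, 0.0],
}
query = [0.8, 0.2, 0.0, 0.0]  # a "healthcare"-like query vector
print(top_k(query, corpus, k=2))  # most similar post first
```

Steps 3-5 then format the retrieved posts into a prompt, call the LLM, and attach the posts as citations.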

🛠️ Tech Stack

Backend

| Technology | Purpose | Version |
| --- | --- | --- |
| FastAPI | Web framework | 0.104.1 |
| ChromaDB | Vector database | 0.4.18 |
| Groq API | LLM inference (Primary) | llama-3.3-70b-versatile |
| Gemini API | LLM inference (Fallback) | gemini-2.5-flash |
| sentence-transformers | Text embeddings | all-MiniLM-L6-v2 (384-dim) |
| Uvicorn | ASGI server | Latest |
| Pydantic | Data validation | 2.x |
| Hugging Face Spaces | Deployment | Docker SDK |

Frontend

| Technology | Purpose | Version |
| --- | --- | --- |
| React | UI framework | 19.2 |
| TypeScript | Type safety | 5.9 |
| Vite | Build tool | 7.2.4 |
| Tailwind CSS | Styling | 3.4.1 |
| Lucide React | Icons | Latest |

Data Pipeline

  • Python scripts for ETL
  • Pandas & NumPy for data processing
  • ChromaDB persistent vector store (16GB)
  • Pre-computed embeddings for 2.97M posts

📊 Dataset

Reddit Canada Dataset

  • Source: r/Canada subreddit
  • Total Posts: 2,972,749
  • Time Period: Historical discussions
  • Embedding Dimensions: 384
  • Vector Database Size: 16GB
  • Metadata: Titles, scores, comments, years

Statistics

{
  "total_posts": 2972749,
  "embedding_dimension": 384,
  "vector_db_size": "16GB",
  "model": "all-MiniLM-L6-v2",
  "database": "ChromaDB"
}

🚀 Quick Start

Prerequisites

  • Python 3.12+
  • Node.js 18+
  • Git
  • 8GB+ RAM (for running locally with full dataset)

Backend Setup

# 1. Clone repository
git clone https://github.com/Shams261/Verity.git
cd Verity/canadaconvo-backend

# 2. Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# 3. Install dependencies
pip install -r requirements.txt

# 4. Configure environment
cp .env.example .env
# Edit .env and add your API keys:
# - GROQ_API_KEY=your_groq_key_here
# - HACKATHON_API_KEY=your_gemini_key_here (optional fallback)

# 5. Start server
python app.py

Backend will be running at http://localhost:8000

API Documentation: http://localhost:8000/docs

Frontend Setup

# 1. Navigate to frontend
cd ../canadaconvo-frontend

# 2. Install dependencies
npm install

# 3. Configure API endpoint (if needed)
# Create .env.local with:
# VITE_API_URL=http://localhost:8000/api/v1

# 4. Start development server
npm run dev

Frontend will be running at http://localhost:5173


🔧 Configuration

Environment Variables

Backend (.env)

# LLM Configuration
GROQ_API_KEY=your_groq_api_key_here
GROQ_MODEL=llama-3.3-70b-versatile
USE_GROQ=true

# Fallback LLM (optional)
HACKATHON_API_KEY=your_gemini_api_key_here
GEMINI_MODEL=gemini-2.5-flash

# Embeddings
EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2
EMBEDDING_DIM=384

# RAG Settings
RETRIEVAL_TOP_K=50
RERANK_TOP_K=20

# Server
HOST=0.0.0.0
PORT=8000
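One plausible way the backend could read these variables is sketched below, stdlib-only with `os.getenv`. The real project keeps its configuration in `config/settings.py`, so the class and function names here are illustrative assumptions, not the actual module; the defaults match the values shown above:

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class Settings:
    groq_api_key: str
    groq_model: str
    use_groq: bool
    retrieval_top_k: int
    rerank_top_k: int

def load_settings() -> Settings:
    """Read the documented environment variables, falling back to the README's defaults."""
    return Settings(
        groq_api_key=os.getenv("GROQ_API_KEY", ""),
        groq_model=os.getenv("GROQ_MODEL", "llama-3.3-70b-versatile"),
        use_groq=os.getenv("USE_GROQ", "true").lower() == "true",
        retrieval_top_k=int(os.getenv("RETRIEVAL_TOP_K", "50")),
        rerank_top_k=int(os.getenv("RERANK_TOP_K", "20")),
    )
```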

Frontend (.env.local)

# Backend API URL
VITE_API_URL=https://shams0026-canadaconvo-backend.hf.space/api/v1

📚 API Documentation

Base URL

https://shams0026-canadaconvo-backend.hf.space/api/v1

Endpoints

1. Query Endpoint

POST /query

Submit a natural language query to analyze Canadian discourse.

Request:

{
  "query": "What are Canadians saying about housing affordability?",
  "top_k": 20
}

Response:

{
  "query": "What are Canadians saying about housing affordability?",
  "answer": "Based on analyzing discussions from r/Canada...",
  "sources": [
    {
      "id": "post_123",
      "title": "Housing crisis discussion",
      "text": "Post content...",
      "score": 1250,
      "year": "2023",
      "num_comments": 342
    }
  ],
  "metadata": {
    "num_sources": 20,
    "cached": false,
    "api_calls_used": 4,
    "api_calls_remaining": 999999
  }
}
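Given a response shaped like the JSON above, a client might render the citations one line per source. The field names below are taken from the documented response; the helper itself is a hedged sketch, not the frontend's actual rendering code:

```python
def format_sources(response: dict) -> list:
    """Turn the documented `sources` array into one numbered citation line per post."""
    return [
        f'[{i}] "{s["title"]}" ({s.get("year", "n/a")}, score {s["score"]}, '
        f'{s["num_comments"]} comments)'
        for i, s in enumerate(response.get("sources", []), start=1)
    ]

# A response trimmed to the fields the formatter uses:
example = {
    "answer": "Based on analyzing discussions from r/Canada...",
    "sources": [
        {"id": "post_123", "title": "Housing crisis discussion",
         "score": 1250, "year": "2023", "num_comments": 342}
    ],
}
print(format_sources(example)[0])
```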

2. Health Check

GET /health

Check service status.

Response:

{
  "status": "healthy",
  "service": "CanadaConvo",
  "version": "1.0.0"
}

3. Statistics

GET /stats

Get system statistics.

Response:

{
  "total_posts": 2972749,
  "api_calls_used": 4,
  "api_calls_remaining": 999999,
  "embedding_model": "all-MiniLM-L6-v2",
  "llm_provider": "Groq",
  "llm_model": "llama-3.3-70b-versatile",
  "using_groq": true
}

📁 Project Structure

Verity/
├── canadaconvo-backend/          # Python FastAPI Backend
│   ├── app/
│   │   ├── api/                  # API endpoints
│   │   │   └── v1/
│   │   │       └── endpoints/
│   │   │           ├── query.py      # POST /query
│   │   │           └── health.py     # GET /health, /stats
│   │   ├── core/                 # Core business logic
│   │   │   └── rag_engine.py        # Main RAG implementation
│   │   ├── db/                   # Database clients
│   │   │   └── chroma_client.py     # ChromaDB integration
│   │   ├── models/               # Pydantic models
│   │   │   ├── query.py             # Request/Response models
│   │   │   └── post.py              # Post model
│   │   ├── services/             # External services
│   │   │   ├── embedding_service.py # Embeddings
│   │   │   ├── groq_service.py      # Groq API client
│   │   │   └── gemini_service.py    # Gemini API client
│   │   └── utils/                # Utilities
│   │       └── data_loader.py       # Data management
│   ├── config/
│   │   └── settings.py           # Configuration
│   ├── data/                     # Data files (16GB+)
│   │   ├── processed/
│   │   │   ├── embeddings.npy       # Pre-computed embeddings
│   │   │   └── metadata.parquet     # Post metadata
│   │   └── chroma/               # ChromaDB persistent storage
│   ├── scripts/                  # Data processing scripts
│   │   ├── 02_clean_data.py
│   │   ├── 03_generate_embeddings.py
│   │   └── 04_build_indexes.py
│   ├── .env                      # Environment variables (gitignored)
│   ├── .env.example              # Template for env vars
│   ├── Dockerfile                # Docker configuration
│   ├── requirements.txt          # Python dependencies
│   ├── app.py                    # Server entry point
│   └── README.md                 # Backend documentation
│
├── canadaconvo-frontend/         # React + TypeScript Frontend
│   ├── src/
│   │   ├── components/
│   │   │   ├── ChatInterface.tsx    # Main chat UI
│   │   │   ├── MessageDisplay.tsx   # Message rendering
│   │   │   ├── SourceCard.tsx       # Citation cards
│   │   │   └── StatsPanel.tsx       # Header stats
│   │   ├── services/
│   │   │   └── api.ts               # API client
│   │   ├── types/
│   │   │   └── index.ts             # TypeScript types
│   │   ├── App.tsx                  # Root component
│   │   ├── main.tsx                 # Entry point
│   │   └── index.css                # Global styles
│   ├── public/                   # Static assets
│   ├── package.json
│   ├── tsconfig.json
│   ├── vite.config.ts            # Vite configuration
│   └── tailwind.config.js        # Tailwind config
│
├── .gitignore                    # Git ignore rules
└── README.md                     # This file

🔄 How It Works

End-to-End Flow

1. USER INPUT
   └─ User types: "What issues concern Canadians most?"

2. FRONTEND
   └─ POST /api/v1/query

3. BACKEND API
   └─ Receives QueryRequest

4. RAG ENGINE
   ├─ Embedding Generation
   │  └─ Convert query to 384-dim vector
   │
   ├─ Vector Search
   │  └─ Find top 20 similar posts in ChromaDB
   │
   ├─ Context Building
   │  └─ Structure prompt with retrieved posts
   │
   ├─ LLM Generation
   │  └─ Groq API generates insights
   │
   └─ Response Formatting
      └─ Return {answer, sources, metadata}

5. FRONTEND DISPLAY
   ├─ Render AI answer
   ├─ Show source cards
   └─ Display metadata

6. USER SEES RESULT
   └─ Comprehensive answer with citations

Performance Metrics

| Metric | Value |
| --- | --- |
| Dataset Size | 2.97M posts |
| Embedding Dimension | 384 |
| First Query | ~3-5s (with Groq) |
| Cached Query | <1s |
| Vector Search | ~100ms |
| LLM Inference | ~2-3s (Groq) |
| Default Top-K | 20 posts |

🎨 Features in Detail

1. Semantic Search

Uses sentence-transformers to convert text into 384-dimensional vectors, enabling semantic understanding beyond keyword matching.

2. Groq-Powered Inference

Leverages Groq's ultra-fast LLM infrastructure for rapid response generation with llama-3.3-70b-versatile (~2-3s per uncached query).

3. Source Attribution

Every answer includes citations to original Reddit posts, ensuring transparency and verifiability.

4. Smart Caching

Frequently asked questions are cached to provide instant responses and reduce API costs.
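The caching described above can be as simple as a dict keyed by a normalized form of the query, so trivially different phrasings share an entry. This is an illustrative sketch, not the project's actual cache implementation:

```python
_cache: dict = {}

def normalize(query: str) -> str:
    """Collapse case and whitespace so near-identical queries hit the same entry."""
    return " ".join(query.lower().split())

def answer_with_cache(query: str, generate):
    """Return (answer, cached_flag); `generate` stands in for the expensive RAG call."""
    key = normalize(query)
    if key in _cache:
        return _cache[key], True
    answer = _cache[key] = generate(query)
    return answer, False
```

On a hit the expensive embed-retrieve-generate path is skipped entirely, which is what makes cached queries return in under a second.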

5. Real-time Statistics

Live dashboard showing:

  • Total posts in dataset (2.97M)
  • API usage tracking
  • Active LLM provider (Groq/Gemini)

🚢 Deployment

Hugging Face Space (Backend)

The backend is deployed on Hugging Face Spaces using Docker:

title: Verity Backend API
emoji: 🍁
sdk: docker
app_port: 7860

Secrets Configuration:

  • GROQ_API_KEY - Your Groq API key
  • HACKATHON_API_KEY - Gemini API key (optional)
  • HF_TOKEN - Hugging Face token for dataset access

Frontend Deployment

The frontend can be deployed on:

  • Vercel (Recommended)
  • Netlify
  • GitHub Pages
  • Any static hosting service

Environment Variable:

VITE_API_URL=https://shams0026-canadaconvo-backend.hf.space/api/v1

🔐 Security

API Key Protection

  • ✅ All API keys stored in environment variables
  • ✅ .env files gitignored
  • ✅ No hardcoded credentials in code
  • ✅ Secrets managed via HF Space settings

CORS Configuration

Backend allows requests from:

  • localhost:3000 (React dev)
  • localhost:5173 (Vite dev)
  • *.vercel.app (Vercel deployments)
  • *.hf.space (HF Space frontend)

🤝 Contributing

We welcome contributions! Here's how:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'feat: Add amazing feature')
  4. Push to branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Commit Message Convention

  • feat: New feature
  • fix: Bug fix
  • docs: Documentation changes
  • style: Code style changes
  • refactor: Code refactoring
  • test: Test additions/changes
  • chore: Build process or tool changes

🐛 Troubleshooting

Backend Issues

Problem: ChromaDB collection not found

Solution: Ensure data files are downloaded
The app will auto-download on first run

Problem: Groq API errors

Solution: Check your GROQ_API_KEY in .env
Verify you have API credits remaining

Problem: Out of memory

Solution: Reduce RETRIEVAL_TOP_K in config
Or increase available RAM (8GB+ recommended)

Frontend Issues

Problem: API connection refused

Solution: Verify VITE_API_URL in .env.local
Check backend is running

Problem: Build errors

Solution: Clear node_modules and reinstall
rm -rf node_modules package-lock.json
npm install

📜 License

This project was created for the AI Hackathon in the North 2026.

Organized by

The AI Collective - Thunder Bay


🙏 Acknowledgments

  • The AI Collective - Thunder Bay - For organizing the AI Hackathon in the North
  • Hackathon Organizers - For providing API access and resources
  • r/Canada Community - For the rich discussion dataset
  • Groq - For ultra-fast LLM inference
  • ChromaDB - For efficient vector storage
  • Sentence Transformers - For high-quality embeddings
  • FastAPI - For the excellent Python web framework
  • React Team - For the amazing frontend library
  • Hugging Face - For hosting infrastructure

👥 Team

Built with ❤️ for the AI Hackathon in the North 2026


📧 Contact

For questions, suggestions, or feedback:


🌟 Star History

If you find this project useful, please consider giving it a ⭐!


Made with 🇨🇦 by the Verity Team
