A Retrieval-Augmented Generation (RAG) system for analyzing 3M+ Reddit posts from r/Canada
Built for AI Hackathon in the North 2026
Features • Demo • Architecture • Setup • Tech Stack
Verity is an AI-powered platform that helps users understand Canadian public discourse by analyzing millions of Reddit posts. Using Retrieval-Augmented Generation (RAG) with Groq's fast LLM inference, it retrieves relevant discussions and generates answers grounded in real community conversations.
- 🔍 Semantic Search - Vector similarity search across 2.97M+ Reddit posts
- ⚡ Lightning Fast - Powered by Groq's llama-3.3-70b-versatile for low-latency responses (under a second when cached)
- 🤖 AI Insights - Context-aware answer generation with source attribution
- 📊 Real-time Stats - Live API usage tracking and dataset statistics
- 🎨 Beautiful UI - Modern React interface with gradient glassmorphic design
🔗 https://shams0026-canadaconvo-backend.hf.space
Try these endpoints:
- 📚 API Docs: /docs
- ❤️ Health Check: /api/v1/health
- 📊 Stats: /api/v1/stats
curl -X POST "https://shams0026-canadaconvo-backend.hf.space/api/v1/query" \
-H "Content-Type: application/json" \
-d '{
"query": "What are Canadians saying about healthcare?",
"top_k": 10
}'
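The same request can be issued from Python using only the standard library. A minimal client sketch (the endpoint and JSON fields mirror the curl example above; error handling is omitted):

```python
import json
from urllib import request

API_URL = "https://shams0026-canadaconvo-backend.hf.space/api/v1/query"

def build_query(query: str, top_k: int = 10) -> bytes:
    """Serialize the request body expected by POST /query."""
    return json.dumps({"query": query, "top_k": top_k}).encode("utf-8")

def ask_verity(query: str, top_k: int = 10) -> dict:
    """POST the query and return the parsed JSON response."""
    req = request.Request(
        API_URL,
        data=build_query(query, top_k),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req, timeout=60) as resp:
        return json.loads(resp.read())
```

`ask_verity(...)` performs the live HTTP call, so it needs network access; `build_query` only serializes the request body.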
┌─────────────────────────────────────────────────────────────┐
│ FRONTEND (React + Vite) │
│ ┌────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Chat │ │ Message │ │ Source │ │
│ │ Interface │→ │ Display │→ │ Cards │ │
│ └────────────┘ └──────────────┘ └──────────────┘ │
│ ↓ │
│ API Client (services/api.ts) │
└─────────────────────────────────────────────────────────────┘
↓ HTTPS
┌─────────────────────────────────────────────────────────────┐
│ BACKEND (FastAPI on HF Space) │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ API Layer (/api/v1) │ │
│ │ /query | /health | /stats │ │
│ └──────────────────────────────────────────────────────┘ │
│ ↓ │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ RAG Engine │ │
│ │ ┌────────────┐ ┌────────────┐ ┌────────────┐ │ │
│ │ │ Embed │→ │ Retrieve │→ │ Generate │ │ │
│ │ │ Query │ │ Context │ │ Response │ │ │
│ │ └────────────┘ └────────────┘ └────────────┘ │ │
│ └──────────────────────────────────────────────────────┘ │
│ ↓ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Groq API │ │ ChromaDB │ │ Embeddings │ │
│ │ (Primary) │ │ Vector DB │ │ Service │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
└─────────────────────────────────────────────────────────────┘
Step-by-Step:
- Query Embedding → Convert user query to 384-dim vector using sentence-transformers
- Vector Search → Find top-K similar posts from 2.97M embeddings in ChromaDB
- Context Building → Structure retrieved posts into a comprehensive prompt
- LLM Generation → Groq's llama-3.3-70b generates insights from context
- Response Formatting → Return answer with source citations and metadata
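The vector-search step above reduces to cosine similarity between the query vector and the stored post vectors. A NumPy sketch of what ChromaDB computes at scale (illustrative only; the real index holds 2.97M 384-dim vectors):

```python
import numpy as np

def top_k_similar(query_vec: np.ndarray, post_vecs: np.ndarray, k: int = 3):
    """Return indices and cosine scores of the k posts nearest the query."""
    q = query_vec / np.linalg.norm(query_vec)                         # normalize query
    p = post_vecs / np.linalg.norm(post_vecs, axis=1, keepdims=True)  # normalize posts
    scores = p @ q                                                    # cosine similarities
    top = np.argsort(scores)[::-1][:k]                                # best k, descending
    return top, scores[top]

# Toy example with 4 "posts" in a 3-dim space (real embeddings are 384-dim)
posts = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0],
                  [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]])
idx, scores = top_k_similar(np.array([1.0, 0.05, 0.0]), posts, k=2)
```

ChromaDB performs the same ranking with an approximate-nearest-neighbor index rather than a brute-force scan, which is what keeps search around 100ms over millions of vectors.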
| Technology | Purpose | Version |
|---|---|---|
| FastAPI | Web framework | 0.104.1 |
| ChromaDB | Vector database | 0.4.18 |
| Groq API | LLM inference (Primary) | llama-3.3-70b-versatile |
| Gemini API | LLM inference (Fallback) | gemini-2.5-flash |
| sentence-transformers | Text embeddings | all-MiniLM-L6-v2 (384-dim) |
| Uvicorn | ASGI server | Latest |
| Pydantic | Data validation | 2.x |
| Hugging Face Spaces | Deployment | Docker SDK |
| Technology | Purpose | Version |
|---|---|---|
| React | UI framework | 19.2 |
| TypeScript | Type safety | 5.9 |
| Vite | Build tool | 7.2.4 |
| Tailwind CSS | Styling | 3.4.1 |
| Lucide React | Icons | Latest |
- Python scripts for ETL
- Pandas & NumPy for data processing
- ChromaDB persistent vector store (16GB)
- Pre-computed embeddings for 2.97M posts
- Source: r/Canada subreddit
- Total Posts: 2,972,749
- Time Period: Historical discussions
- Embedding Dimensions: 384
- Vector Database Size: 16GB
- Metadata: Titles, scores, comments, years
{
"total_posts": 2972749,
"embedding_dimension": 384,
"vector_db_size": "16GB",
"model": "all-MiniLM-L6-v2",
"database": "ChromaDB"
}
- Python 3.12+
- Node.js 18+
- Git
- 8GB+ RAM (for running locally with full dataset)
# 1. Clone repository
git clone https://github.com/Shams261/Verity.git
cd Verity/canadaconvo-backend
# 2. Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# 3. Install dependencies
pip install -r requirements.txt
# 4. Configure environment
cp .env.example .env
# Edit .env and add your API keys:
# - GROQ_API_KEY=your_groq_key_here
# - HACKATHON_API_KEY=your_gemini_key_here (optional fallback)
# 5. Start server
python app.py
Backend will be running at http://localhost:8000
API Documentation: http://localhost:8000/docs
# 1. Navigate to frontend
cd ../canadaconvo-frontend
# 2. Install dependencies
npm install
# 3. Configure API endpoint (if needed)
# Create .env.local with:
# VITE_API_URL=http://localhost:8000/api/v1
# 4. Start development server
npm run dev
Frontend will be running at http://localhost:5173
# LLM Configuration
GROQ_API_KEY=your_groq_api_key_here
GROQ_MODEL=llama-3.3-70b-versatile
USE_GROQ=true
# Fallback LLM (optional)
HACKATHON_API_KEY=your_gemini_api_key_here
GEMINI_MODEL=gemini-2.5-flash
# Embeddings
EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2
EMBEDDING_DIM=384
# RAG Settings
RETRIEVAL_TOP_K=50
RERANK_TOP_K=20
# Server
HOST=0.0.0.0
PORT=8000
# Backend API URL
VITE_API_URL=https://shams0026-canadaconvo-backend.hf.space/api/v1
Base URL: https://shams0026-canadaconvo-backend.hf.space/api/v1
POST /query
Submit a natural language query to analyze Canadian discourse.
Request:
{
"query": "What are Canadians saying about housing affordability?",
"top_k": 20
}
Response:
{
"query": "What are Canadians saying about housing affordability?",
"answer": "Based on analyzing discussions from r/Canada...",
"sources": [
{
"id": "post_123",
"title": "Housing crisis discussion",
"text": "Post content...",
"score": 1250,
"year": "2023",
"num_comments": 342
}
],
"metadata": {
"num_sources": 20,
"cached": false,
"api_calls_used": 4,
"api_calls_remaining": 999999
}
}
GET /health
Check service status.
Response:
{
"status": "healthy",
"service": "CanadaConvo",
"version": "1.0.0"
}
GET /stats
Get system statistics.
Response:
{
"total_posts": 2972749,
"api_calls_used": 4,
"api_calls_remaining": 999999,
"embedding_model": "all-MiniLM-L6-v2",
"llm_provider": "Groq",
"llm_model": "llama-3.3-70b-versatile",
"using_groq": true
}
Verity/
├── canadaconvo-backend/ # Python FastAPI Backend
│ ├── app/
│ │ ├── api/ # API endpoints
│ │ │ └── v1/
│ │ │ └── endpoints/
│ │ │ ├── query.py # POST /query
│ │ │ └── health.py # GET /health, /stats
│ │ ├── core/ # Core business logic
│ │ │ └── rag_engine.py # Main RAG implementation
│ │ ├── db/ # Database clients
│ │ │ └── chroma_client.py # ChromaDB integration
│ │ ├── models/ # Pydantic models
│ │ │ ├── query.py # Request/Response models
│ │ │ └── post.py # Post model
│ │ ├── services/ # External services
│ │ │ ├── embedding_service.py # Embeddings
│ │ │ ├── groq_service.py # Groq API client
│ │ │ └── gemini_service.py # Gemini API client
│ │ └── utils/ # Utilities
│ │ └── data_loader.py # Data management
│ ├── config/
│ │ └── settings.py # Configuration
│ ├── data/ # Data files (16GB+)
│ │ ├── processed/
│ │ │ ├── embeddings.npy # Pre-computed embeddings
│ │ │ └── metadata.parquet # Post metadata
│ │ └── chroma/ # ChromaDB persistent storage
│ ├── scripts/ # Data processing scripts
│ │ ├── 02_clean_data.py
│ │ ├── 03_generate_embeddings.py
│ │ └── 04_build_indexes.py
│ ├── .env # Environment variables (gitignored)
│ ├── .env.example # Template for env vars
│ ├── Dockerfile # Docker configuration
│ ├── requirements.txt # Python dependencies
│ ├── app.py # Server entry point
│ └── README.md # Backend documentation
│
├── canadaconvo-frontend/ # React + TypeScript Frontend
│ ├── src/
│ │ ├── components/
│ │ │ ├── ChatInterface.tsx # Main chat UI
│ │ │ ├── MessageDisplay.tsx # Message rendering
│ │ │ ├── SourceCard.tsx # Citation cards
│ │ │ └── StatsPanel.tsx # Header stats
│ │ ├── services/
│ │ │ └── api.ts # API client
│ │ ├── types/
│ │ │ └── index.ts # TypeScript types
│ │ ├── App.tsx # Root component
│ │ ├── main.tsx # Entry point
│ │ └── index.css # Global styles
│ ├── public/ # Static assets
│ ├── package.json
│ ├── tsconfig.json
│ ├── vite.config.ts # Vite configuration
│ └── tailwind.config.js # Tailwind config
│
├── .gitignore # Git ignore rules
└── README.md # This file
1. USER INPUT
└─ User types: "What issues concern Canadians most?"
2. FRONTEND
└─ POST /api/v1/query
3. BACKEND API
└─ Receives QueryRequest
4. RAG ENGINE
├─ Embedding Generation
│ └─ Convert query to 384-dim vector
│
├─ Vector Search
│ └─ Find top 20 similar posts in ChromaDB
│
├─ Context Building
│ └─ Structure prompt with retrieved posts
│
├─ LLM Generation
│ └─ Groq API generates insights
│
└─ Response Formatting
└─ Return {answer, sources, metadata}
5. FRONTEND DISPLAY
├─ Render AI answer
├─ Show source cards
└─ Display metadata
6. USER SEES RESULT
└─ Comprehensive answer with citations
| Metric | Value |
|---|---|
| Dataset Size | 2.97M posts |
| Embedding Dimension | 384 |
| First Query | ~3-5s (with Groq) |
| Cached Query | <1s |
| Vector Search | ~100ms |
| LLM Inference | ~2-3s (Groq) |
| Default Top-K | 20 posts |
Uses sentence-transformers to convert text into 384-dimensional vectors, enabling semantic understanding beyond keyword matching.
Leverages Groq's fast LLM infrastructure for low-latency response generation with llama-3.3-70b-versatile (about 2-3s per uncached query).
Every answer includes citations to original Reddit posts, ensuring transparency and verifiability.
Frequently asked questions are cached to provide instant responses and reduce API costs.
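One lightweight way to implement such a cache in Python is `functools.lru_cache`. This is a sketch assuming an in-process cache; `answer_query` is a hypothetical stub, not the project's actual pipeline:

```python
from functools import lru_cache

@lru_cache(maxsize=256)
def answer_query(query: str) -> str:
    """Expensive path: embed the query, retrieve posts, call the LLM.
    Stubbed here; a repeated identical query skips straight to the cache."""
    return f"answer for {query!r}"

answer_query("What issues concern Canadians most?")  # cache miss: runs the pipeline
answer_query("What issues concern Canadians most?")  # cache hit: returns instantly
```

Note that this keys on the exact query string, so a trivially rephrased question misses the cache; a production cache might normalize queries first.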
Live dashboard showing:
- Total posts in dataset (2.97M)
- API usage tracking
- Active LLM provider (Groq/Gemini)
The backend is deployed on Hugging Face Spaces using Docker:
title: Verity Backend API
emoji: 🍁
sdk: docker
app_port: 7860
Secrets Configuration:
- GROQ_API_KEY - Your Groq API key
- HACKATHON_API_KEY - Gemini API key (optional)
- HF_TOKEN - Hugging Face token for dataset access
The frontend can be deployed on:
- Vercel (Recommended)
- Netlify
- GitHub Pages
- Any static hosting service
Environment Variable:
VITE_API_URL=https://shams0026-canadaconvo-backend.hf.space/api/v1
- ✅ All API keys stored in environment variables
- ✅ .env files gitignored
- ✅ No hardcoded credentials in code
- ✅ Secrets managed via HF Space settings
Backend allows requests from:
- localhost:3000 (React dev)
- localhost:5173 (Vite dev)
- *.vercel.app (Vercel deployments)
- *.hf.space (HF Space frontend)
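Wildcard entries like `*.vercel.app` cannot be listed as literal origins in most CORS middlewares; FastAPI's `CORSMiddleware`, for instance, accepts an `allow_origin_regex` for this. A stdlib sketch of the matching logic (the regex is an illustration, not the project's actual configuration):

```python
import re

# Hypothetical allow-list mirroring the origins above
ALLOWED_ORIGIN = re.compile(
    r"^https?://("
    r"localhost:(3000|5173)"     # local React/Vite dev servers
    r"|[\w.-]+\.vercel\.app"     # Vercel deployments
    r"|[\w.-]+\.hf\.space"       # HF Space frontend
    r")$"
)

def origin_allowed(origin: str) -> bool:
    """Return True if the request's Origin header matches the allow-list."""
    return ALLOWED_ORIGIN.fullmatch(origin) is not None
```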
We welcome contributions! Here's how:
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'feat: Add amazing feature'`)
- Push to branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
- `feat:` New feature
- `fix:` Bug fix
- `docs:` Documentation changes
- `style:` Code style changes
- `refactor:` Code refactoring
- `test:` Test additions/changes
- `chore:` Build process or tool changes
Problem: ChromaDB collection not found
Solution: Ensure data files are downloaded
The app will auto-download on first run.
Problem: Groq API errors
Solution: Check your GROQ_API_KEY in .env.
Verify you have API credits remaining.
Problem: Out of memory
Solution: Reduce RETRIEVAL_TOP_K in config,
or increase available RAM (8GB+ recommended).
Problem: API connection refused
Solution: Verify VITE_API_URL in .env.local.
Check that the backend is running.
Problem: Build errors
Solution: Clear node_modules and reinstall:
rm -rf node_modules package-lock.json
npm install
This project was created for the AI Hackathon in the North 2026.
The AI Collective - Thunder Bay
- The AI Collective - Thunder Bay - For organizing the AI Hackathon in the North
- Hackathon Organizers - For providing API access and resources
- r/Canada Community - For the rich discussion dataset
- Groq - For ultra-fast LLM inference
- ChromaDB - For efficient vector storage
- Sentence Transformers - For high-quality embeddings
- FastAPI - For the excellent Python web framework
- React Team - For the amazing frontend library
- Hugging Face - For hosting infrastructure
Built with ❤️ for the AI Hackathon in the North 2026
For questions, suggestions, or feedback:
- GitHub Issues: Create an issue
- Repository: Shams261/Verity
If you find this project useful, please consider giving it a ⭐!