LegalScan is a full-stack intelligent web app that classifies uploaded documents as formal proposals or non-proposals, using a hybrid of vector similarity (FAISS), heuristic rules, and LLM-based summarization. Built for legal professionals.
- 🔍 Proposal Classification
  Uses FAISS and keyword-based heuristics to classify documents as PROPOSAL, MAYBE_PROPOSAL, or NON_PROPOSAL.
- 📋 LLM Summarization
  Generates concise summaries of documents labeled as proposals.
- 📊 Rule Breakdown Display
  Explains scoring based on headers, keyword presence, structure, and submission terms.
- 🗂 Historical Dashboard
  Scanned documents are saved in a local database and accessible via a searchable UI.
- ☁️ Hybrid OCR Support
  Choose between local Tesseract OCR or AWS Textract for PDF/image parsing.
- 🛡️ Secure, Modular Design
  Environment-controlled secrets, RESTful architecture, SQLite or PostgreSQL-ready.

Upload a government PDF. The app will:
- Extract text (local or cloud).
- Predict if it's a proposal.
- Show a summary and rule score.
- Log results to a dashboard for future reference.
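
The four steps above can be read as a small pipeline. The sketch below is only an illustration of that flow, not the project's actual code: `ScanResult`, `scan_document`, and the four callables passed into it are hypothetical names, and the stub lambdas stand in for the real OCR, classifier, LLM, and database layers. It assumes, as the feature list states, that summaries are generated only for documents labeled as proposals.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class ScanResult:
    label: str               # "PROPOSAL", "MAYBE_PROPOSAL", or "NON_PROPOSAL"
    confidence: float
    summary: Optional[str]   # filled in only for documents labeled PROPOSAL


def scan_document(
    path: str,
    extract_text: Callable[[str], str],
    classify: Callable[[str], Tuple[str, float]],
    summarize: Callable[[str], str],
    log_result: Callable[[str, ScanResult], None],
) -> ScanResult:
    """Run one document through extract -> classify -> summarize -> log."""
    text = extract_text(path)                  # local or cloud OCR
    label, confidence = classify(text)         # proposal prediction
    summary = summarize(text) if label == "PROPOSAL" else None
    result = ScanResult(label, confidence, summary)
    log_result(path, result)                   # keep a record for the dashboard
    return result


# Usage with stub components (all hypothetical placeholders).
history = []
result = scan_document(
    "rfp.pdf",
    extract_text=lambda p: "Request for Proposal: immigration legal services",
    classify=lambda t: ("PROPOSAL", 0.87),
    summarize=lambda t: "stub summary",
    log_result=lambda p, r: history.append((p, r.label)),
)
```

Passing the components in as callables keeps the OCR backend (Tesseract or Textract) and the LLM provider swappable without touching the pipeline itself.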
git clone https://github.com/your-username/legal-scan.git
cd legal-scan/backend
python -m venv venv
source venv/bin/activate # or venv\Scripts\activate on Windows
pip install -r requirements.txt

Set the following environment variables:

FLASK_SECRET_KEY=your-secret-key
AWS_ACCESS_KEY_ID=your-aws-key
AWS_SECRET_ACCESS_KEY=your-aws-secret
AWS_REGION=us-east-1
S3_BUCKET_NAME=your-textract-bucket

Start the app:

python app.py

Visit: http://localhost:5000
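
As a rough illustration of how these settings might be validated at startup, here is a minimal sketch using only the standard library. The `load_config` function and its `use_textract` flag are hypothetical and do not come from the project. It also assumes that the AWS variables are needed only when Textract is selected, which this README does not confirm.

```python
import os
from typing import Mapping, Optional

# Variable names are taken from the list above.
AWS_KEYS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_REGION", "S3_BUCKET_NAME")


def load_config(env: Optional[Mapping[str, str]] = None, use_textract: bool = False) -> dict:
    """Return the required settings, or raise if any are missing or empty."""
    env = os.environ if env is None else env
    required = ["FLASK_SECRET_KEY"]
    if use_textract:
        # Assumption: AWS credentials matter only for the cloud OCR path.
        required.extend(AWS_KEYS)
    missing = [key for key in required if not env.get(key)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {key: env[key] for key in required}
```

Failing fast on a missing key gives a clearer error than a failed Textract call later in a request.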
Each prediction is determined by:
- Vector similarity score (FAISS-trained embeddings)
- Rule-based boosters:
  - ✅ Header match (Request for Proposal, etc.)
  - ✅ Keyword density (budget, deliverables, etc.)
  - ✅ Structured sections (e.g., "Scope of Work", "Evaluation")
  - ✅ Submission language ("submit by", "deadline", etc.)

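
One way the similarity score and the rule boosters could be combined is sketched below. This is an illustrative assumption, not the logic in `qualifier_rules.py`: the regex patterns, the per-rule boost, and both thresholds are made-up values. The keyword rule is simplified to a presence check rather than a true density measure, and the similarity score is passed in as a plain float instead of being computed with FAISS.

```python
import re

# Hypothetical patterns and weights, chosen only for illustration.
RULES = {
    "header": re.compile(r"request for proposals?|\bRFP\b", re.IGNORECASE),
    "structure": re.compile(r"scope of work|evaluation", re.IGNORECASE),
    "submission": re.compile(r"submit by|deadline", re.IGNORECASE),
}
KEYWORDS = ("budget", "deliverables")
BOOST_PER_RULE = 0.05       # assumed boost added for each matched rule
PROPOSAL_THRESHOLD = 0.75   # assumed cut-offs, not taken from the project
MAYBE_THRESHOLD = 0.50


def match_rules(text: str) -> dict:
    """Report which heuristic rules fire on the text."""
    matched = {name: bool(pattern.search(text)) for name, pattern in RULES.items()}
    lowered = text.lower()
    matched["keywords"] = any(word in lowered for word in KEYWORDS)
    return matched


def classify(similarity: float, text: str) -> tuple:
    """Boost the similarity score by matched rules, then map it to a label."""
    matched = match_rules(text)
    score = min(1.0, similarity + BOOST_PER_RULE * sum(matched.values()))
    if score >= PROPOSAL_THRESHOLD:
        label = "PROPOSAL"
    elif score >= MAYBE_THRESHOLD:
        label = "MAYBE_PROPOSAL"
    else:
        label = "NON_PROPOSAL"
    return label, round(score, 2), matched


sample = "Request for Proposal. Scope of Work: legal services. Budget and deliverables are listed below."
label, score, matched = classify(0.62, sample)
```

Returning the `matched` dictionary alongside the label is what makes a per-rule breakdown possible in the UI.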
| Layer | Stack |
|---|---|
| Backend | Flask, SQLAlchemy |
| Frontend | Jinja2 + HTML/CSS |
| OCR | Tesseract (local) + AWS Textract (S3) |
| Classifier | FAISS + Regex Heuristics |
| Summary | LLM via Ollama/OpenAI/HuggingFace |
| Storage | SQLite (default) |
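
To make the storage row concrete, the sketch below saves scan results to SQLite and searches them. It is a stand-in only: the project uses SQLAlchemy, whereas this example uses the standard-library `sqlite3` module so that it runs without extra dependencies. The `scans` table, its columns, and the three function names are hypothetical.

```python
import sqlite3


def open_history(db_path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and create the (hypothetical) scans table if needed."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS scans (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               filename TEXT NOT NULL,
               label TEXT NOT NULL,
               confidence REAL NOT NULL,
               summary TEXT,
               scanned_at TEXT DEFAULT CURRENT_TIMESTAMP
           )"""
    )
    return conn


def save_scan(conn, filename, label, confidence, summary=None) -> int:
    """Insert one scan result and return its row id."""
    cur = conn.execute(
        "INSERT INTO scans (filename, label, confidence, summary) VALUES (?, ?, ?, ?)",
        (filename, label, confidence, summary),
    )
    conn.commit()
    return cur.lastrowid


def search_scans(conn, term: str) -> list:
    """Return scans whose filename or summary contains the term, newest first."""
    like = f"%{term}%"
    rows = conn.execute(
        "SELECT filename, label, confidence FROM scans "
        "WHERE filename LIKE ? OR summary LIKE ? ORDER BY id DESC",
        (like, like),
    )
    return rows.fetchall()


conn = open_history()
save_scan(conn, "rfp.pdf", "PROPOSAL", 0.87, "This RFP outlines requirements")
save_scan(conn, "memo.pdf", "NON_PROPOSAL", 0.12)
```

Parameterized queries (`?` placeholders) keep user-supplied search terms from being interpreted as SQL.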
backend/
├── app.py # Flask app entry
├── models.py # SQLAlchemy model
├── templates/ # Jinja2 templates
├── static/ # CSS & assets
├── faiss_utils.py # Vector search classifier
├── ocr_utils.py # Local Tesseract extraction
├── textract_s3_utils.py # AWS Textract integration
├── llm_utils.py # LLM summary agent
├── qualifier_rules.py # Heuristic rule scoring
Prediction: ✅ PROPOSAL (confidence: 0.87)
Matched Rules:
- keywords: ✅
- header: ✅
- structure: ✅
- submission: ❌
Summary:
This RFP outlines requirements for immigration legal services...
- Combines NLP, LLMs, and classic AI for explainable document classification
- Modular, production-ready codebase suitable for legal tech, GovTech, and AI/ML portfolios
- UX designed around decision making, not just classification
