Production-grade RAG pipeline and UI for exploring EU legislation, national laws, and international standards. This monorepo contains the data pipeline, backend API, and a React/Vite frontend (with mock mode for running without the backend).
- End-to-end: preprocessing → embeddings → Vertex AI Vector Search index → API → UI
- Multilingual embeddings (text-multilingual-embedding-002, 768-dim)
- Rich metadata (year, doc type, source, article), paragraph indices
- Frontend can run standalone in mock mode (no cloud credits required)
A UI to interact with the backend and explore the legislation in multiple ways.
cd frontend
npm install
VITE_USE_MOCK=true npm run dev

Open http://localhost:3000 and use the left sidebar to browse categories and subcategories. If you see a white screen, open DevTools and check the Console; runtime errors are surfaced by the error boundary that wraps App (already included in the codebase). If you later connect to an API, set VITE_API_URL.
This part requires Google Cloud access and credits.
# Python environment (macOS)
python -m venv .venv
source .venv/bin/activate
pip install -r backend/requirements.txt
# Start FastAPI (choose one)
python backend/api_server.py
# or
uvicorn backend.api_server:app --host 0.0.0.0 --port 8000
# Frontend → point to API
cd frontend
export VITE_API_URL=http://localhost:8000
npm run dev

Notes:
- The backend is a FastAPI server that exposes endpoints for regulations and analysis (a minimal shape is sketched below).
- CORS is permissive by default for local development.
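The sketch below shows that shape with permissive CORS wired in, which is how the local-development setup behaves; the endpoint path and response fields are illustrative placeholders, not the actual routes in backend/api_server.py.

```python
# Minimal FastAPI shape with permissive CORS (illustrative sketch, not the real api_server.py)
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

app = FastAPI(title="EU Legislation RAG API (sketch)")

# Permissive CORS: fine for local development; tighten allow_origins for production
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_methods=["*"],
    allow_headers=["*"],
)

@app.get("/regulations")  # hypothetical endpoint name
def list_regulations():
    return {"items": []}

# Run with: uvicorn <module>:app --host 0.0.0.0 --port 8000
```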
The pipeline processes documents, generates embeddings, and builds a Vertex AI Vector Search index.
# Install core requirements
pip install -r requirements.txt
# Preprocess documents (local, no upload)
python scripts/preprocessing/preprocess_local.py \
--config config.yaml \
--skip-upload
# Generate embeddings
python scripts/embeddings/generate_embeddings.py \
--input-prefix processed_chunks/ \
--output-prefix embeddings_vertexai/
# Build index (requires GCP setup)
python scripts/deployment/build_vector_index.py \
--embeddings-prefix embeddings_vertexai/ \
--index-display-name eu-legislation-index

Pipeline summary:
- Preprocessing produces chunked text with paragraph indices.
- Embeddings are generated in a Vertex AI-compatible format (768-dim); see the generation sketch after this list.
- Index is built in Vertex AI Vector Search with namespaces for filtering.
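For reference, here is a minimal sketch of calling text-multilingual-embedding-002 through the Vertex AI SDK; the project ID and input text are placeholders, and generate_embeddings.py may batch and store results differently.

```python
# Sketch: embedding chunks with Vertex AI's multilingual model (project ID and texts are placeholders)
import vertexai
from vertexai.language_models import TextEmbeddingInput, TextEmbeddingModel

vertexai.init(project="your-gcp-project", location="europe-west1")
model = TextEmbeddingModel.from_pretrained("text-multilingual-embedding-002")

chunks = ["Article 1: This Regulation applies to ..."]  # chunked text from preprocessing
inputs = [TextEmbeddingInput(text=c, task_type="RETRIEVAL_DOCUMENT") for c in chunks]

embeddings = model.get_embeddings(inputs)
vectors = [e.values for e in embeddings]  # each vector has 768 dimensions
print(len(vectors[0]))  # 768
```

Key features: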
- Paragraph indices for precise excerpt extraction
- Multi-source corpus: EU legislation, national laws (FI), international standards (Basel, IFRS)
- Vertex AI Vector Search formatted embeddings with namespaces (year, doc_type, source_type); see the record sketch after this list
- UI for visualizing overlaps and contradictions (with optional HTML/network views)
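The namespaces above correspond to the restricts field of Vertex AI Vector Search's JSON datapoint format. Below is a minimal sketch of writing one record per chunk; the helper name, ID, and metadata values are illustrative, not the exact output of the pipeline.

```python
# Sketch: one Vector Search datapoint per chunk, with namespace restricts for filtering
import json

def to_index_record(chunk_id: str, vector: list[float],
                    year: str, doc_type: str, source_type: str) -> dict:
    """Build a Vector Search record; restricts enable filtered queries."""
    return {
        "id": chunk_id,
        "embedding": vector,  # 768 floats from text-multilingual-embedding-002
        "restricts": [
            {"namespace": "year", "allow": [year]},
            {"namespace": "doc_type", "allow": [doc_type]},
            {"namespace": "source_type", "allow": [source_type]},
        ],
    }

# Illustrative values only
record = to_index_record("example-chunk-001", [0.0] * 768, "2023", "regulation", "eu_legislation")
with open("embeddings_vertexai/records.json", "w", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")
```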
Performance and testing:
- Embedding throughput is tuned around a 1,200-token target chunk size (bounded by the min/max values in config.yaml).
- Tests validate preprocessing correctness and Vertex AI output compliance; a minimal check is sketched below.
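A minimal sketch of such checks, assuming chunks are stored as JSON lines with a token_count field and embeddings as JSON records with id and embedding fields; the paths and field names are assumptions, and the authoritative tests live under scripts/testing/.

```python
# Sketch: chunk-size and embedding-format checks (paths and field names are assumptions)
import json
from pathlib import Path

MIN_TOKENS, MAX_TOKENS = 400, 1800  # bounds from config.yaml
EMBEDDING_DIM = 768                 # text-multilingual-embedding-002

def check_chunks(path: Path) -> None:
    for line in path.read_text(encoding="utf-8").splitlines():
        chunk = json.loads(line)
        # token_count is an assumed field name for the chunk's token length
        assert MIN_TOKENS <= chunk["token_count"] <= MAX_TOKENS, chunk["id"]

def check_embeddings(path: Path) -> None:
    for line in path.read_text(encoding="utf-8").splitlines():
        record = json.loads(line)
        assert "id" in record
        assert len(record["embedding"]) == EMBEDDING_DIM, record["id"]

if __name__ == "__main__":
    check_chunks(Path("processed_chunks/sample.jsonl"))        # hypothetical path
    check_embeddings(Path("embeddings_vertexai/sample.json"))  # hypothetical path
```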
Configuration essentials (YAML example):
gcp:
  bucket_name: "your-bucket"          # e.g., EU West 1
  output_prefix: "processed_chunks"

processing:
  chunk_target_tokens: 1200
  min_chunk_tokens: 400
  max_chunk_tokens: 1800

input_directories:
  - "output"                          # EU legislation
  - "other_national_laws"             # National laws
  - "other_regulation_standards"      # International standards
# Comprehensive preprocessing test
python scripts/testing/test_comprehensive.py
# Validate Vertex AI format
python scripts/testing/test_embedding_format.py
# End-to-end pipeline validation
python scripts/testing/validate_pipeline.py

Repository layout:

- frontend/ (frontend/README.md): React/Vite UI
- backend/ (backend/README.md): FastAPI server endpoints
- scripts/ (scripts/README.md): preprocessing, embeddings, deployment, and tests
- docs/: user-facing reports and supplementary guides
- deployment/: deployment helper scripts (optional)
.
├── README.md
├── config.yaml
├── Dockerfile
├── requirements.txt
├── QUICK_REFERENCE.md
├── CONFIG_QUICK_REF.md
│
├── backend/
│ ├── api_server.py
│ ├── cache_db.py
│ ├── rag_search.py
│ ├── Dockerfile
│ └── requirements.txt
│
├── frontend/
│ ├── index.html
│ ├── package.json
│ ├── vite.config.ts
│ └── src/
│ ├── main.tsx
│ ├── App.tsx
│ ├── api/
│ ├── components/
│ ├── data/
│ ├── styles/
│ └── types/
│
├── scripts/
│ ├── README.md
│ ├── requirements.txt
│ │
│ ├── preprocessing/
│ │ ├── preprocess_local.py
│ │ └── preprocess_and_upload.py
│ │
│ ├── embeddings/
│ │ ├── generate_embeddings.py
│ │ └── generate_embeddings_parallel.py
│ │
│ ├── deployment/
│ │ ├── build_vector_index.py
│ │ ├── deploy_quick.py
│ │ └── check_deployment.py
│ │
│ ├── testing/
│ │ ├── test_comprehensive.py
│ │ ├── test_embedding_format.py
│ │ ├── test_preprocessing.py
│ │ └── validate_pipeline.py
│ │
│ └── utilities/
│ ├── extract_paragraphs.py
│ ├── rag_search.py
│ ├── metadata_store.py
│ └── monitor_build.sh
│
├── deployment/
│ ├── README.md
│ ├── deploy-backend.sh
│ ├── deploy-frontend.sh
│ └── setup-deployment.sh
│
├── docs/
│ ├── QUICK_START.md
│ ├── IMPLEMENTATION_GUIDE.md
│ ├── VERTEX_AI_INTEGRATION.md
│ └── (additional reports and guides)
│
└── data/
└── AllRiskCategories.json
- Documents processed: 61,072
- Total chunks: 334,000+ (≈919 MB)
- Region: EU West 1 (europe-west1)
- Bucket: uniform bucket-level access with a retention policy