Production-grade RAG pipeline and UI for exploring EU legislation, national laws, and international standards. This monorepo contains the data pipeline, backend API, and a React/Vite frontend (with mock mode for running without the backend).
- End-to-end: preprocessing → embeddings → Vertex AI Vector Search index → API → UI
- Multilingual embeddings (text-multilingual-embedding-002, 768-dim)
- Rich metadata (year, doc type, source, article), paragraph indices
- Frontend can run standalone in mock mode (no cloud credits required)
A UI to interact with the backend and explore the legislation in multiple ways.
cd frontend
npm install
VITE_USE_MOCK=true npm run dev

Open http://localhost:3000 and use the left sidebar to browse categories and subcategories. If you see a white screen, open DevTools and check the Console; runtime errors are surfaced by the error boundary that wraps App (already included in the codebase). If you later connect to an API, set VITE_API_URL.
This part requires Google Cloud access and credits.
# Python environment (macOS)
python -m venv .venv
source .venv/bin/activate
pip install -r backend/requirements.txt
# Start FastAPI (choose one)
python backend/api_server.py
# or
uvicorn backend.api_server:app --host 0.0.0.0 --port 8000
# Frontend → point to API
cd frontend
export VITE_API_URL=http://localhost:8000
npm run dev

Notes:
- The backend is a FastAPI server that exposes endpoints for regulations and analysis (a minimal shape is sketched below).
- CORS is permissive by default for local development.
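The sketch below shows that shape with permissive CORS wired in, which is how the local-development setup behaves; the endpoint path and response fields are illustrative placeholders, not the actual routes in backend/api_server.py.

```python
# Minimal FastAPI shape with permissive CORS (illustrative sketch, not the real api_server.py)
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

app = FastAPI(title="EU Legislation RAG API (sketch)")

# Permissive CORS: fine for local development; tighten allow_origins for production
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_methods=["*"],
    allow_headers=["*"],
)

@app.get("/regulations")  # hypothetical endpoint name
def list_regulations():
    return {"items": []}

# Run with: uvicorn <module>:app --host 0.0.0.0 --port 8000
```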
The pipeline processes documents, generates embeddings, and builds a Vertex AI Vector Search index.
# Install core requirements
pip install -r requirements.txt
# Preprocess documents (local, no upload)
python scripts/preprocessing/preprocess_local.py \
--config config.yaml \
--skip-upload
# Generate embeddings
python scripts/embeddings/generate_embeddings.py \
--input-prefix processed_chunks/ \
--output-prefix embeddings_vertexai/
# Build index (requires GCP setup)
python scripts/deployment/build_vector_index.py \
--embeddings-prefix embeddings_vertexai/ \
--index-display-name eu-legislation-index

Pipeline summary:
- Preprocessing produces chunked text with paragraph indices.
- Embeddings are generated in a Vertex AI-compatible format (768-dim); see the generation sketch after this list.
- Index is built in Vertex AI Vector Search with namespaces for filtering.
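For reference, here is a minimal sketch of calling text-multilingual-embedding-002 through the Vertex AI SDK; the project ID and input text are placeholders, and generate_embeddings.py may batch and store results differently.

```python
# Sketch: embedding chunks with Vertex AI's multilingual model (project ID and texts are placeholders)
import vertexai
from vertexai.language_models import TextEmbeddingInput, TextEmbeddingModel

vertexai.init(project="your-gcp-project", location="europe-west1")
model = TextEmbeddingModel.from_pretrained("text-multilingual-embedding-002")

chunks = ["Article 1: This Regulation applies to ..."]  # chunked text from preprocessing
inputs = [TextEmbeddingInput(text=c, task_type="RETRIEVAL_DOCUMENT") for c in chunks]

embeddings = model.get_embeddings(inputs)
vectors = [e.values for e in embeddings]  # each vector has 768 dimensions
print(len(vectors[0]))  # 768
```

Key features: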
- Paragraph indices for precise excerpt extraction
- Multi-source corpus: EU legislation, national laws (FI), international standards (Basel, IFRS)
- Vertex AI Vector Search formatted embeddings with namespaces (year, doc_type, source_type); see the record sketch after this list
- UI for visualizing overlaps and contradictions (with optional HTML/network views)
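The namespaces above correspond to the restricts field of Vertex AI Vector Search's JSON datapoint format. Below is a minimal sketch of writing one record per chunk; the helper name, ID, and metadata values are illustrative, not the exact output of the pipeline.

```python
# Sketch: one Vector Search datapoint per chunk, with namespace restricts for filtering
import json

def to_index_record(chunk_id: str, vector: list[float],
                    year: str, doc_type: str, source_type: str) -> dict:
    """Build a Vector Search record; restricts enable filtered queries."""
    return {
        "id": chunk_id,
        "embedding": vector,  # 768 floats from text-multilingual-embedding-002
        "restricts": [
            {"namespace": "year", "allow": [year]},
            {"namespace": "doc_type", "allow": [doc_type]},
            {"namespace": "source_type", "allow": [source_type]},
        ],
    }

# Illustrative values only
record = to_index_record("example-chunk-001", [0.0] * 768, "2023", "regulation", "eu_legislation")
with open("embeddings_vertexai/records.json", "w", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")
```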
Performance and testing:
- Embedding throughput is tuned around a 1,200-token target chunk size (bounded by the min/max values in config.yaml).
- Tests validate preprocessing correctness and Vertex AI output compliance; a minimal check is sketched below.
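A minimal sketch of such checks, assuming chunks are stored as JSON lines with a token_count field and embeddings as JSON records with id and embedding fields; the paths and field names are assumptions, and the authoritative tests live under scripts/testing/.

```python
# Sketch: chunk-size and embedding-format checks (paths and field names are assumptions)
import json
from pathlib import Path

MIN_TOKENS, MAX_TOKENS = 400, 1800  # bounds from config.yaml
EMBEDDING_DIM = 768                 # text-multilingual-embedding-002

def check_chunks(path: Path) -> None:
    for line in path.read_text(encoding="utf-8").splitlines():
        chunk = json.loads(line)
        # token_count is an assumed field name for the chunk's token length
        assert MIN_TOKENS <= chunk["token_count"] <= MAX_TOKENS, chunk["id"]

def check_embeddings(path: Path) -> None:
    for line in path.read_text(encoding="utf-8").splitlines():
        record = json.loads(line)
        assert "id" in record
        assert len(record["embedding"]) == EMBEDDING_DIM, record["id"]

if __name__ == "__main__":
    check_chunks(Path("processed_chunks/sample.jsonl"))        # hypothetical path
    check_embeddings(Path("embeddings_vertexai/sample.json"))  # hypothetical path
```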
Configuration essentials (YAML example):
gcp:
  bucket_name: "your-bucket"          # e.g., EU West 1
  output_prefix: "processed_chunks"

processing:
  chunk_target_tokens: 1200
  min_chunk_tokens: 400
  max_chunk_tokens: 1800

input_directories:
  - "output"                          # EU legislation
  - "other_national_laws"             # National laws
  - "other_regulation_standards"      # International standards
# Comprehensive preprocessing test
python scripts/testing/test_comprehensive.py
# Validate Vertex AI format
python scripts/testing/test_embedding_format.py
# End-to-end pipeline validation
python scripts/testing/validate_pipeline.py

Repository layout:

- frontend/ (frontend/README.md): React/Vite UI
- backend/ (backend/README.md): FastAPI server endpoints
- scripts/ (scripts/README.md): preprocessing, embeddings, deployment, and tests
- docs/: user-facing reports and supplementary guides
- deployment/: deployment helper scripts (optional)
.
├── README.md
├── config.yaml
├── Dockerfile
├── requirements.txt
├── QUICK_REFERENCE.md
├── CONFIG_QUICK_REF.md
│
├── backend/
│ ├── api_server.py
│ ├── cache_db.py
│ ├── rag_search.py
│ ├── Dockerfile
│ └── requirements.txt
│
├── frontend/
│ ├── index.html
│ ├── package.json
│ ├── vite.config.ts
│ └── src/
│ ├── main.tsx
│ ├── App.tsx
│ ├── api/
│ ├── components/
│ ├── data/
│ ├── styles/
│ └── types/
│
├── scripts/
│ ├── README.md
│ ├── requirements.txt
│ │
│ ├── preprocessing/
│ │ ├── preprocess_local.py
│ │ └── preprocess_and_upload.py
│ │
│ ├── embeddings/
│ │ ├── generate_embeddings.py
│ │ └── generate_embeddings_parallel.py
│ │
│ ├── deployment/
│ │ ├── build_vector_index.py
│ │ ├── deploy_quick.py
│ │ └── check_deployment.py
│ │
│ ├── testing/
│ │ ├── test_comprehensive.py
│ │ ├── test_embedding_format.py
│ │ ├── test_preprocessing.py
│ │ └── validate_pipeline.py
│ │
│ └── utilities/
│ ├── extract_paragraphs.py
│ ├── rag_search.py
│ ├── metadata_store.py
│ └── monitor_build.sh
│
├── deployment/
│ ├── README.md
│ ├── deploy-backend.sh
│ ├── deploy-frontend.sh
│ └── setup-deployment.sh
│
├── docs/
│ ├── QUICK_START.md
│ ├── IMPLEMENTATION_GUIDE.md
│ ├── VERTEX_AI_INTEGRATION.md
│ └── (additional reports and guides)
│
└── data/
└── AllRiskCategories.json
- Documents processed: 61,072
- Total chunks: 334,000+ (≈919 MB)
- Region: EU West 1 (europe-west1)
- Bucket: uniform bucket-level access with a retention policy