A comprehensive Retrieval-Augmented Generation (RAG) system for analyzing the evolution of the German language across historical periods using the German Manchester Corpus (GerManC). This project combines traditional Natural Language Processing with modern vector databases and semantic search to enable sophisticated linguistic research.
The German language has undergone remarkable transformations over the past millennium. From Middle High German texts of the 12th century to modern standardized German, linguistic patterns, vocabulary, and grammatical structures have evolved continuously. Understanding these changes requires analyzing vast amounts of historical texts, a task perfectly suited to modern computational linguistics.
This project was born from the need to create an intelligent system that could:
- Automatically process thousands of historical German texts
- Extract linguistic features across different time periods and genres
- Enable semantic search through centuries of language evolution
- Provide intelligent answers about historical language patterns
The result is a production-ready RAG system that transforms raw historical texts into an interactive knowledge base, allowing researchers to ask natural-language questions about German language evolution and receive contextually aware answers backed by primary source material.
- Vector-based similarity search across historical texts
- Period-specific filtering (1050-2000 CE)
- Genre-aware analysis (Legal, Scientific, Literary, etc.)
- Multi-strategy word evolution tracking
- Natural language queries about historical German
- Context-aware responses with source citations
- Temporal analysis of linguistic phenomena
- Support for both simple retrieval and LLM-powered generation
- Spelling variant analysis across time periods
- Word frequency evolution tracking
- Linguistic feature extraction and comparison
- Statistical insights into language change patterns
- Modular 6-phase processing pipeline
- PostgreSQL for structured linguistic data
- ChromaDB for vector embeddings
- FastAPI REST endpoints
- Comprehensive validation and logging
```
┌──────────────────┐    ┌──────────────────┐    ┌──────────────────┐
│    Raw Texts     │ -> │  GATE Pipeline   │ -> │    PostgreSQL    │
│ (GerManC Corpus) │    │  (NLP Features)  │    │   (Structured)   │
└──────────────────┘    └──────────────────┘    └──────────────────┘
                                                         │
                                                         ▼
┌──────────────────┐    ┌──────────────────┐    ┌──────────────────┐
│  Web Interface   │ <- │   FastAPI REST   │ <- │     ChromaDB     │
│  (Coming Soon)   │    │       API        │    │  (Vector Store)  │
└──────────────────┘    └──────────────────┘    └──────────────────┘
```
```
historical-language-evolution-rag/
├── 📁 src/                          # Source code directory
│   ├── 📊 Data Pipeline Modules
│   │   ├── germanc_organizer/       # Phase 1: File organization by period/genre
│   │   ├── gate_preprocessor/       # Phase 2: NLP feature extraction
│   │   ├── validation_suite/        # Phase 3: Quality assurance
│   │   └── prepare_pipeline/        # Phase 4: Data chunking & preparation
│   │
│   ├── 🗄️ Backend Systems
│   │   ├── access_pipeline/         # Phase 5: PostgreSQL setup & REST API
│   │   └── rag_system/              # Phase 6: Vector DB & semantic search
│   │
│   └── 🚀 Execution Scripts
│       ├── organize.py              # Execute Phase 1
│       ├── preprocess.py            # Execute Phase 2
│       ├── validate.py              # Execute Phase 3
│       ├── prepare.py               # Execute Phase 4
│       ├── access.py                # Execute Phase 5
│       └── rag.py                   # Execute Phase 6
│
├── 📁 data/                         # Data storage
│   ├── raw_corpus/                  # Original GerManC files
│   ├── organized_corpus/            # Phase 1 output
│   ├── preprocessed/                # Phase 2 output
│   └── prepared/                    # Phase 4 output
│
├── 📁 german_corpus_vectordb/       # ChromaDB vector storage
├── 📁 docs/                         # Documentation
├── 📁 config/                       # Configuration files
├── 📁 tests/                        # Unit and integration tests
├── 📁 utils/                        # Utility scripts
├── 📁 Notebook/                     # Jupyter analysis notebooks
│
└── 🔧 Project Files
    ├── requirements.txt             # Python dependencies
    ├── pyproject.toml               # Project configuration
    ├── setup.py                     # Package setup
    └── README.md                    # This documentation
```
```
# System requirements
Python 3.9+
PostgreSQL 12+
Git LFS (for large corpus files)

# Required disk space
~10GB for full GerManC corpus
~5GB for processed embeddings
```

- Clone the repository

```bash
git clone https://github.com/yourusername/historical-language-evolution-rag.git
cd historical-language-evolution-rag
```

- Set up Python environment

```bash
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt
```

- Configure PostgreSQL

```bash
# Create database
createdb germanc_corpus

# Update database config in access_pipeline/config.py
# Set your PostgreSQL credentials
```

- Download GerManC Corpus

```bash
# Place your GerManC corpus files in:
mkdir data/raw_corpus
# Copy .txt files organized by period folders
```

**Phase 1: File organization by period/genre**

```bash
cd src
python organize.py ../data/raw_corpus ../data/organized_corpus
```

**What it does:** Sorts historical texts by time period and genre, validates file structure, and creates metadata catalogs.
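Purely as an illustration of this step, here is a minimal sketch that sorts files into period/genre folders. The filename pattern (`GENRE_PERIODCODE_...`) and the period-code table are assumptions invented for the example, not the organizer's actual logic:

```python
# Hypothetical sketch of Phase 1: copy corpus files into
# <period>/<genre>/ folders based on an assumed filename pattern.
from pathlib import Path
import shutil

# Assumed period codes; the real corpus metadata may differ.
PERIODS = {"P1": "1650-1700", "P2": "1700-1750", "P3": "1750-1800"}

def organize(raw_dir: str, out_dir: str) -> None:
    for path in Path(raw_dir).glob("*.txt"):
        parts = path.stem.split("_")
        if len(parts) < 2:
            continue  # skip files that don't match the assumed pattern
        genre, period_code = parts[0], parts[1]
        period = PERIODS.get(period_code, "unknown")
        target = Path(out_dir) / period / genre
        target.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, target / path.name)

organize("../data/raw_corpus", "../data/organized_corpus")
```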
**Phase 2: NLP feature extraction**

```bash
cd src
python preprocess.py ../data/organized_corpus ../data/preprocessed
```

**What it does:** Runs the GATE NLP pipeline to extract linguistic features, normalizes historical spelling, and performs tokenization and POS tagging.
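For intuition, here is a toy normalizer for a few Early New High German spelling patterns. The variant table and rules are invented for the example; the real normalization is produced by the GATE pipeline:

```python
# Toy historical-spelling normalizer (illustrative only; Phase 2's
# actual normalization comes from the GATE pipeline).
import re

VARIANTS = {
    "vnd": "und",     # "and"
    "vnnd": "und",
    "seyn": "sein",   # "to be"
    "theil": "teil",  # "part"
}

def normalize(token: str) -> str:
    token = token.replace("ſ", "s")           # long s -> round s
    if token in VARIANTS:
        return VARIANTS[token]
    return re.sub(r"^v(?=[nm])", "u", token)  # initial v before n/m -> u

print(normalize("vnnd"), normalize("ſeyn"), normalize("vmb"))  # und sein umb
```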
**Phase 3: Quality assurance**

```bash
cd src
python validate.py ../data/preprocessed
```

**What it does:** Validates preprocessing quality, checks feature-extraction completeness, and generates quality reports.
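A toy version of the kind of completeness check this phase performs; the field names are assumptions about the preprocessed JSON, not the validation suite's actual schema:

```python
# Hypothetical sketch: verify each preprocessed record carries the
# fields later phases depend on. Field names are assumptions.
import json
from pathlib import Path

REQUIRED = {"tokens", "pos_tags", "normalized_text", "period", "genre"}

def validate_dir(preprocessed_dir: str) -> None:
    bad = []
    for path in Path(preprocessed_dir).glob("**/*.json"):
        record = json.loads(path.read_text(encoding="utf-8"))
        missing = REQUIRED - record.keys()
        if missing:
            bad.append((path.name, sorted(missing)))
    print(f"{len(bad)} files with missing fields")
    for name, fields in bad[:10]:
        print(f"  {name}: missing {fields}")

validate_dir("../data/preprocessed")
```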
**Phase 4: Data chunking & preparation**

```bash
cd src
python prepare.py ../data/preprocessed ../data/prepared
```

**What it does:** Creates text chunks optimized for retrieval, builds word-frequency tables, and prepares database import files.
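A toy version of retrieval-oriented chunking with overlapping windows, so sentences aren't cut off at hard boundaries. The window and overlap sizes are illustrative, not the pipeline's actual settings:

```python
# Hypothetical chunker: fixed-size word windows with overlap, each chunk
# tagged with the metadata the chunks table stores (period, genre, word_count).
def chunk_document(text, period, genre, size=200, overlap=40):
    words = text.split()
    step = size - overlap
    chunks = []
    for i in range(0, len(words), step):
        window = words[i:i + size]
        chunks.append({
            "text": " ".join(window),
            "period": period,
            "genre": genre,
            "word_count": len(window),
        })
    return chunks

sample = chunk_document("Ein Beispieltext " * 300, "1650-1800", "Scientific")
print(len(sample), "chunks,", sample[0]["word_count"], "words in the first")
```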
**Phase 5: PostgreSQL setup & REST API**

```bash
cd src
python access.py ../data/prepared --start-api
```

**What it does:** Imports data into PostgreSQL, creates optimized indexes, and starts a REST API server for data access.
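A minimal sketch of the import-and-index step using psycopg2, modeled on the `chunks` table described under the database schema below. The actual `access_pipeline` code and column names may differ:

```python
# Hypothetical import step: create the chunks table, index the columns
# most queries filter on, and insert one row. Names are assumptions.
import psycopg2

conn = psycopg2.connect(host='localhost', port=5432, dbname='germanc_corpus',
                        user='your_username', password='your_password')
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS chunks (
            id SERIAL PRIMARY KEY,
            text TEXT NOT NULL,
            period TEXT,
            genre TEXT,
            word_count INTEGER
        )
    """)
    # Period/genre carry most query filters, so index them together.
    cur.execute("CREATE INDEX IF NOT EXISTS idx_chunks_period_genre "
                "ON chunks (period, genre)")
    cur.execute("INSERT INTO chunks (text, period, genre, word_count) "
                "VALUES (%s, %s, %s, %s)",
                ("Ein kurzer Beispieltext.", "1650-1800", "Legal", 4))
conn.close()
```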
**Phase 6: Vector DB & semantic search**

```bash
cd src
python rag.py --test --limit 1000
```

**What it does:** Creates vector embeddings, sets up ChromaDB, and enables semantic search and question answering.
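As a rough sketch of what the embedding step involves: encode chunks with the multilingual model named in `rag_system/config.py` and store them in ChromaDB. The collection name, IDs, and sample chunks here are invented for illustration; the real logic lives in `rag_system/`:

```python
# Hypothetical sketch of Phase 6's embedding step.
import chromadb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2')
client = chromadb.PersistentClient(path="./german_corpus_vectordb")
collection = client.get_or_create_collection("german_chunks")  # assumed name

chunks = [  # invented sample data
    {"id": "c1", "text": "Das Recht des Mittelalters ...", "period": "1350-1650", "genre": "Legal"},
    {"id": "c2", "text": "Die Sprache der Wissenschaft ...", "period": "1650-1800", "genre": "Scientific"},
]
collection.add(
    ids=[c["id"] for c in chunks],
    documents=[c["text"] for c in chunks],
    embeddings=model.encode([c["text"] for c in chunks]).tolist(),
    metadatas=[{"period": c["period"], "genre": c["genre"]} for c in chunks],
)
print(collection.count(), "chunks indexed")
```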
```python
# src/access_pipeline/config.py
DEFAULT_DB_CONFIG = {
    'host': 'localhost',
    'port': 5432,
    'database': 'germanc_corpus',
    'user': 'your_username',
    'password': 'your_password'
}
```

```python
# src/rag_system/config.py
EMBEDDING_MODEL = 'sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2'
DEFAULT_VECTOR_DB_PATH = "./chroma_db"
```

```python
import sys
sys.path.append('src')

from rag_system import GermanRAGPipeline

rag = GermanRAGPipeline(db_config, "./vectordb")  # db_config: dict like DEFAULT_DB_CONFIG above
rag.setup_qa_system()

# Search for language patterns ("medieval German language")
results = rag.semantic_search("mittelalterliche deutsche sprache", k=5)

# Period-specific search ("legal terms")
results = rag.semantic_search("rechtliche begriffe", period_filter="1350-1650")
```

```python
# Ask about language evolution
# ("How did German legal language develop in the Middle Ages?")
answer = rag.ask_question("Wie entwickelte sich die deutsche Rechtsprache im Mittelalter?")
print(answer['answer'])
print("Sources:", [doc['metadata'] for doc in answer['source_documents']])
```

```python
# Track word changes across periods
evolution = rag.analyze_language_evolution("recht", periods=["1350-1650", "1650-1800"])
for period, data in evolution['periods'].items():
    print(f"{period}: {data['context_count']} contexts found")
```
```bash
# Start API server
cd src
python access.py ../data/prepared --start-api

# Query endpoints
curl "http://localhost:8000/search/mittelalterliche%20sprache?period=1350-1650"
curl "http://localhost:8000/evolution/recht/1350-1650/1650-1800"
```

**PostgreSQL tables** (an illustrative query against them follows the list):

- `chunks`: Text segments with metadata (period, genre, word_count)
- `spelling_variants`: Historical spelling variations with normalizations
- `word_frequencies`: Term frequency across periods and genres
- `linguistic_features`: Extracted grammatical and syntactic features
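For example, a term-frequency query across periods might look like the following. The column names (`term`, `period`, `frequency`) are assumptions consistent with the table descriptions above, not read from the actual schema:

```python
# Hypothetical frequency-evolution query against word_frequencies.
import psycopg2

conn = psycopg2.connect(host='localhost', dbname='germanc_corpus',
                        user='your_username', password='your_password')
with conn.cursor() as cur:
    cur.execute("""
        SELECT period, SUM(frequency) AS total
        FROM word_frequencies
        WHERE term = %s
        GROUP BY period
        ORDER BY period
    """, ("recht",))
    for period, total in cur.fetchall():
        print(f"{period}: {total} occurrences")
conn.close()
```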
**Vector store** (a filtered-search sketch follows the list):

- ChromaDB Collection: Semantic embeddings of text chunks with metadata
- Embedding Model: Multilingual sentence transformer (384 dimensions)
- Search Index: Optimized for similarity queries and metadata filtering
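To illustrate the metadata filtering, here is a minimal ChromaDB similarity query restricted to one period. The collection name `german_chunks` and the `period` metadata key are assumptions, not the project's actual names:

```python
# Minimal sketch of a period-filtered similarity search in ChromaDB.
import chromadb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2')
client = chromadb.PersistentClient(path="./german_corpus_vectordb")
collection = client.get_or_create_collection("german_chunks")  # assumed name

results = collection.query(
    query_embeddings=model.encode(["rechtliche begriffe"]).tolist(),  # "legal terms"
    n_results=5,
    where={"period": "1350-1650"},  # restrict hits to one period
)
for doc, meta in zip(results["documents"][0], results["metadatas"][0]):
    print(meta["period"], meta.get("genre"), "->", doc[:60])
```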
```bash
# Run full test suite
python -m pytest tests/

# Test individual phases (from src directory)
cd src
python validate.py ../data/preprocessed --test-mode
python rag.py --test --limit 100

# Integration tests
python tests/test_full_pipeline.py
```

- Corpus Size: ~50,000 historical German texts
- Time Coverage: 1050-2000 CE (950 years)
- Processing Speed: ~500 texts/minute (Phase 2)
- Search Latency: <100ms for semantic queries
- Embedding Creation: ~1000 chunks/minute
- Database Size: ~5GB (structured) + ~2GB (vectors)
- Track phonological changes across centuries
- Analyze syntactic evolution patterns
- Study lexical semantic shifts
- Explore genre-specific language patterns
- Investigate sociolinguistic variations
- Support comparative historical analysis
- Benchmark historical NLP tools
- Develop diachronic language models
- Test cross-temporal retrieval systems
- Interactive web dashboard
- Visual timeline of language changes
- Advanced query builder
- Export functionality for research data
- Multi-language support (Historical English, Latin)
- Advanced temporal modeling
- Integration with linguistic databases
- Machine learning-based change detection
- API versioning and rate limiting
We welcome contributions from linguists, historians, and developers!
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
- Follow PEP 8 style guidelines
- Add unit tests for new features
- Update documentation for API changes
- Use type hints in Python code
- Write descriptive commit messages
- API Documentation - REST endpoint specifications
- Architecture Guide - System design details
- Linguistic Features - NLP processing overview
- Database Schema - Data structure documentation
- Deployment Guide - Production setup instructions
"Numpy is not available" error
pip install numpy==1.24.3 sentence-transformers==2.7.0PostgreSQL connection errors
# Check PostgreSQL service
sudo service postgresql status
# Update connection config in access_pipeline/config.pyGATE processing failures
# Verify Java installation
java -version
# Check GATE installation in gate_preprocessor/- π§ Email: [your-email@example.com]
- 🐛 Issues: GitHub Issues
- 💬 Discussions: GitHub Discussions
This project is licensed under the MIT License - see the LICENSE file for details.
- GerManC Corpus Team - For providing the historical German text corpus
- GATE Development Team - For the robust NLP processing framework
- Sentence Transformers - For multilingual embedding models
- ChromaDB Team - For the efficient vector database
- Digital Humanities Community - For inspiration and research collaboration
If you use this system in your research, please cite:
```bibtex
@software{german_rag_system,
  title={Historical German Language Evolution RAG System},
  author={Rohan Dhupar},
  year={2024},
  url={https://github.com/yourusername/historical-language-evolution-rag},
  note={A comprehensive system for analyzing German language evolution using RAG}
}
```

Built with ❤️ for Historical Linguistics Research