Historical German Language Evolution RAG System


A comprehensive Retrieval-Augmented Generation (RAG) system for analyzing the evolution of the German language across historical periods using the German Manchester Corpus (GerManC). This project combines traditional Natural Language Processing with modern vector databases and semantic search to enable sophisticated linguistic research.

📖 Project Story

The German language has undergone remarkable transformations over the past millennium. From Middle High German texts of the 12th century to modern standardized German, linguistic patterns, vocabulary, and grammatical structures have evolved continuously. Understanding these changes requires analyzing vast amounts of historical texts: a task well suited to modern computational linguistics.

This project was born from the need to create an intelligent system that could:

  • Automatically process thousands of historical German texts
  • Extract linguistic features across different time periods and genres
  • Enable semantic search through centuries of language evolution
  • Provide intelligent answers about historical language patterns

The result is a production-ready RAG system that transforms raw historical texts into an interactive knowledge base, allowing researchers to ask natural-language questions about German language evolution and receive contextually aware answers backed by primary source material.

🎯 Key Features

🔍 Intelligent Semantic Search

  • Vector-based similarity search across historical texts
  • Period-specific filtering (1050-2000 CE)
  • Genre-aware analysis (Legal, Scientific, Literary, etc.)
  • Multi-strategy word evolution tracking

🤖 Question-Answering System

  • Natural language queries about historical German
  • Context-aware responses with source citations
  • Temporal analysis of linguistic phenomena
  • Support for both simple retrieval and LLM-powered generation

📊 Comprehensive Analytics

  • Spelling variant analysis across time periods
  • Word frequency evolution tracking
  • Linguistic feature extraction and comparison
  • Statistical insights into language change patterns

πŸ—οΈ Production Architecture

  • Modular 6-phase processing pipeline
  • PostgreSQL for structured linguistic data
  • ChromaDB for vector embeddings
  • FastAPI REST endpoints
  • Comprehensive validation and logging

🛠️ System Architecture

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Raw Texts     │ -> │  GATE Pipeline  │ -> │   PostgreSQL    │
│ (GerManC Corpus)│    │ (NLP Features)  │    │  (Structured)   │
└─────────────────┘    └─────────────────┘    └─────────────────┘
                                                        │
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Web Interface │ <- │   FastAPI REST  │ <- │    ChromaDB     │
│  (Coming Soon)  │    │      API        │    │ (Vector Store)  │
└─────────────────┘    └─────────────────┘    └─────────────────┘

📁 Project Structure

historical-language-evolution-rag/
├── 📂 src/                        # Source code directory
│   ├── 📊 Data Pipeline Modules
│   │   ├── germanc_organizer/     # Phase 1: File organization by period/genre
│   │   ├── gate_preprocessor/     # Phase 2: NLP feature extraction
│   │   ├── validation_suite/      # Phase 3: Quality assurance
│   │   └── prepare_pipeline/      # Phase 4: Data chunking & preparation
│   │
│   ├── 🗄️ Backend Systems
│   │   ├── access_pipeline/       # Phase 5: PostgreSQL setup & REST API
│   │   └── rag_system/            # Phase 6: Vector DB & semantic search
│   │
│   └── 📝 Execution Scripts
│       ├── organize.py            # Execute Phase 1
│       ├── preprocess.py          # Execute Phase 2
│       ├── validate.py            # Execute Phase 3
│       ├── prepare.py             # Execute Phase 4
│       ├── access.py              # Execute Phase 5
│       └── rag.py                 # Execute Phase 6
│
├── 📂 data/                       # Data storage
│   ├── raw_corpus/                # Original GerManC files
│   ├── organized_corpus/          # Phase 1 output
│   ├── preprocessed/              # Phase 2 output
│   └── prepared/                  # Phase 4 output
│
├── 📂 german_corpus_vectordb/     # ChromaDB vector storage
├── 📂 docs/                       # Documentation
├── 📂 config/                     # Configuration files
├── 📂 tests/                      # Unit and integration tests
├── 📂 utils/                      # Utility scripts
├── 📂 Notebook/                   # Jupyter analysis notebooks
│
└── 🔧 Project Files
    ├── requirements.txt           # Python dependencies
    ├── pyproject.toml             # Project configuration
    ├── setup.py                   # Package setup
    └── README.md                  # This documentation

🚀 Quick Start

Prerequisites

# System requirements
Python 3.9+
PostgreSQL 12+
Git LFS (for large corpus files)

# Required disk space
~10GB for full GerManC corpus
~5GB for processed embeddings

Installation

  1. Clone the repository
git clone https://github.com/rohandhupar1996/historical-language-evolution-rag.git
cd historical-language-evolution-rag
  2. Set up Python environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt
  3. Configure PostgreSQL
# Create database
createdb germanc_corpus

# Update database config in access_pipeline/config.py
# Set your PostgreSQL credentials
  4. Download the GerManC corpus
# Place your GerManC corpus files in:
mkdir data/raw_corpus
# Copy the .txt files, organized into per-period folders

📋 Step-by-Step Execution

Phase 1: File Organization

cd src
python organize.py ../data/raw_corpus ../data/organized_corpus

What it does: Sorts historical texts by time period and genre, validates file structure, creates metadata catalogs.
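To make the idea concrete, here is a minimal sketch of this step (illustrative only; the real logic lives in src/germanc_organizer/). It assumes each filename begins with a genre code and contains a four-digit year, which may not match your corpus naming:

# Illustrative Phase 1 sketch; the filename convention is an assumption,
# not the actual germanc_organizer behavior.
import re
import shutil
from pathlib import Path

PERIODS = [(1050, 1350, "1050-1350"), (1350, 1650, "1350-1650"),
           (1650, 1800, "1650-1800"), (1800, 2000, "1800-2000")]

def period_label(year):
    for lo, hi, label in PERIODS:
        if lo <= year < hi:
            return label
    return "unknown"

def organize(raw_dir, out_dir):
    for path in Path(raw_dir).rglob("*.txt"):
        match = re.search(r"\d{4}", path.name)
        if match is None:
            continue  # no recognizable year; leave for manual review
        genre = path.name.split("_")[0]  # assumed GENRE_..._YEAR_... naming
        target = Path(out_dir) / period_label(int(match.group())) / genre
        target.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, target / path.name)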

Phase 2: Linguistic Preprocessing

cd src
python preprocess.py ../data/organized_corpus ../data/preprocessed

What it does: Runs GATE NLP pipeline to extract linguistic features, normalizes historical spelling, performs POS tagging and tokenization.
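The heavy lifting here is done by GATE (a Java toolkit), so the snippet below is only a toy illustration of the spelling-normalization idea; the variant map is invented for the example:

# Toy illustration of historical-spelling normalization; the real mappings
# come from the GATE pipeline, not this hand-written dictionary.
VARIANTS = {"vnnd": "und", "vnd": "und", "seyn": "sein", "theil": "teil"}

def normalize(tokens):
    return [VARIANTS.get(tok.lower(), tok.lower()) for tok in tokens]

print(normalize("Vnd das Theil soll seyn".split()))
# -> ['und', 'das', 'teil', 'soll', 'sein']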

Phase 3: Quality Validation

cd src
python validate.py ../data/preprocessed

What it does: Validates preprocessing quality, checks feature extraction completeness, generates quality reports.
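A sketch of the kind of completeness check this phase performs, assuming Phase 2 writes one JSON record per text (the field names here are assumptions; the actual output format is defined by gate_preprocessor):

import json
from pathlib import Path

REQUIRED = ("tokens", "pos_tags", "period", "genre")  # assumed field names

def validate(preprocessed_dir):
    report = {"ok": 0, "incomplete": []}
    for path in Path(preprocessed_dir).rglob("*.json"):
        doc = json.loads(path.read_text(encoding="utf-8"))
        if all(field in doc for field in REQUIRED):
            report["ok"] += 1
        else:
            report["incomplete"].append(str(path))
    return report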

Phase 4: Data Preparation

cd src
python prepare.py ../data/preprocessed ../data/prepared

What it does: Creates text chunks optimized for retrieval, builds word frequency tables, prepares database import files.
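The core of the chunking step is a sliding window over tokens, along these lines (the size and overlap values are illustrative; the real ones are set in prepare_pipeline):

def chunk_tokens(tokens, size=200, overlap=40):
    """Split a token list into overlapping chunks sized for retrieval."""
    chunks, step = [], size - overlap
    for start in range(0, len(tokens), step):
        window = tokens[start:start + size]
        if window:
            chunks.append(" ".join(window))
        if start + size >= len(tokens):
            break
    return chunks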

Phase 5: Database & API Setup

cd src
python access.py ../data/prepared --start-api

What it does: Imports data into PostgreSQL, creates optimized indexes, starts REST API server for data access.
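A minimal sketch of the import, assuming Phase 4 produced a CSV of chunks (the file name and column list are assumptions; the real loader lives in access_pipeline):

# Run from src/ so access_pipeline is importable.
import psycopg2
from access_pipeline.config import DEFAULT_DB_CONFIG

conn = psycopg2.connect(**DEFAULT_DB_CONFIG)
with conn, conn.cursor() as cur:
    with open("../data/prepared/chunks.csv", encoding="utf-8") as f:  # assumed file
        cur.copy_expert(
            "COPY chunks (chunk_id, period, genre, word_count, text) "
            "FROM STDIN WITH CSV HEADER", f)
    # Index the columns the REST API filters on.
    cur.execute("CREATE INDEX IF NOT EXISTS idx_chunks_period ON chunks (period)")
    cur.execute("CREATE INDEX IF NOT EXISTS idx_chunks_genre ON chunks (genre)")
conn.close()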

Phase 6: RAG System Deployment

cd src
python rag.py --test --limit 1000

What it does: Creates vector embeddings, sets up ChromaDB, enables semantic search and question-answering.
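Under the hood this amounts to embedding each chunk with the model named in src/rag_system/config.py and writing it to ChromaDB. A hand-rolled sketch (the collection name and metadata keys are assumptions):

import chromadb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("german_chunks")  # assumed name

chunks = [  # toy data standing in for the prepared corpus chunks
    {"id": "c1", "text": "Das Recht des Koenigs ...", "period": "1650-1800", "genre": "Legal"},
]
collection.add(
    ids=[c["id"] for c in chunks],
    documents=[c["text"] for c in chunks],
    embeddings=model.encode([c["text"] for c in chunks]).tolist(),
    metadatas=[{"period": c["period"], "genre": c["genre"]} for c in chunks],
)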

⚙️ Configuration

Database Configuration

# src/access_pipeline/config.py
DEFAULT_DB_CONFIG = {
    'host': 'localhost',
    'port': 5432,
    'database': 'germanc_corpus',
    'user': 'your_username',
    'password': 'your_password'
}

Embedding Model Configuration

# src/rag_system/config.py
EMBEDDING_MODEL = 'sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2'
DEFAULT_VECTOR_DB_PATH = "./chroma_db"

🔍 Usage Examples

Semantic Search

import sys
sys.path.append('src')
from rag_system import GermanRAGPipeline
from access_pipeline.config import DEFAULT_DB_CONFIG  # defined in src/access_pipeline/config.py

rag = GermanRAGPipeline(DEFAULT_DB_CONFIG, "./vectordb")
rag.setup_qa_system()

# Search for language patterns ("medieval German language")
results = rag.semantic_search("mittelalterliche deutsche sprache", k=5)

# Period-specific search ("legal terms")
results = rag.semantic_search("rechtliche begriffe", period_filter="1350-1650")

Question Answering

# Ask about language evolution
# ("How did German legal language develop in the Middle Ages?")
answer = rag.ask_question("Wie entwickelte sich die deutsche Rechtsprache im Mittelalter?")
print(answer['answer'])
print("Sources:", [doc['metadata'] for doc in answer['source_documents']])

Language Evolution Analysis

# Track word changes across periods ("recht" = law/right)
evolution = rag.analyze_language_evolution("recht", periods=["1350-1650", "1650-1800"])
for period, data in evolution['periods'].items():
    print(f"{period}: {data['context_count']} contexts found")

REST API Usage

# Start API server
cd src
python access.py ../data/prepared --start-api

# Query endpoints
curl "http://localhost:8000/search/mittelalterliche%20sprache?period=1350-1650"
curl "http://localhost:8000/evolution/recht/1350-1650/1650-1800"

📊 Data Schema

PostgreSQL Tables

  • chunks: Text segments with metadata (period, genre, word_count)
  • spelling_variants: Historical spelling variations with normalizations
  • word_frequencies: Term frequency across periods and genres
  • linguistic_features: Extracted grammatical and syntactic features
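A plausible DDL for the chunks table, inferred from the description above (the authoritative schema is created by access_pipeline):

CREATE_CHUNKS = """
CREATE TABLE IF NOT EXISTS chunks (
    chunk_id    TEXT PRIMARY KEY,
    period      TEXT NOT NULL,      -- e.g. '1650-1800'
    genre       TEXT NOT NULL,      -- e.g. 'Legal', 'Scientific'
    word_count  INTEGER,
    text        TEXT NOT NULL
);
"""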

Vector Database

  • ChromaDB Collection: Semantic embeddings of text chunks with metadata
  • Embedding Model: Multilingual sentence transformer (384 dimensions)
  • Search Index: Optimized for similarity queries and metadata filtering
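Metadata filtering happens at query time. Using the same assumed names as the Phase 6 sketch, a raw ChromaDB similarity query with a period filter looks like this:

import chromadb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
collection = chromadb.PersistentClient(path="./chroma_db").get_or_create_collection("german_chunks")

results = collection.query(
    query_embeddings=model.encode(["rechtliche begriffe"]).tolist(),  # "legal terms"
    n_results=5,
    where={"period": "1650-1800"},  # metadata filter alongside similarity search
)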

🧪 Testing

# Run full test suite
python -m pytest tests/

# Test individual phases (from src directory)
cd src
python validate.py ../data/preprocessed --test-mode
python rag.py --test --limit 100

# Integration tests (run from the repository root)
cd ..
python tests/test_full_pipeline.py

📈 Performance Metrics

  • Corpus Size: ~50,000 historical German texts
  • Time Coverage: 1050-2000 CE (950 years)
  • Processing Speed: ~500 texts/minute (Phase 2)
  • Search Latency: <100ms for semantic queries
  • Embedding Creation: ~1000 chunks/minute
  • Database Size: ~5GB (structured) + ~2GB (vectors)

🔬 Research Applications

Historical Linguistics

  • Track phonological changes across centuries
  • Analyze syntactic evolution patterns
  • Study lexical semantic shifts

Digital Humanities

  • Explore genre-specific language patterns
  • Investigate sociolinguistic variations
  • Support comparative historical analysis

Computational Linguistics

  • Benchmark historical NLP tools
  • Develop diachronic language models
  • Test cross-temporal retrieval systems

🛣️ Roadmap

Phase 7: Web Interface (In Progress)

  • Interactive web dashboard
  • Visual timeline of language changes
  • Advanced query builder
  • Export functionality for research data

Future Enhancements

  • Multi-language support (Historical English, Latin)
  • Advanced temporal modeling
  • Integration with linguistic databases
  • Machine learning-based change detection
  • API versioning and rate limiting

🤝 Contributing

We welcome contributions from linguists, historians, and developers!

How to Contribute

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Development Guidelines

  • Follow PEP 8 style guidelines
  • Add unit tests for new features
  • Update documentation for API changes
  • Use type hints in Python code
  • Write descriptive commit messages

📚 Documentation

Detailed guides live in the docs/ directory, and exploratory analyses are in the Notebook/ folder.

πŸ› Troubleshooting

Common Issues

"Numpy is not available" error

pip install numpy==1.24.3 sentence-transformers==2.7.0

PostgreSQL connection errors

# Check PostgreSQL service
sudo service postgresql status
# Update connection config in access_pipeline/config.py

GATE processing failures

# Verify Java installation
java -version
# Check GATE installation in gate_preprocessor/

Getting Help

If the fixes above don't resolve your problem, open an issue on the repository with the full error output and your environment details.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • GerManC Corpus Team - For providing the historical German text corpus
  • GATE Development Team - For the robust NLP processing framework
  • Sentence Transformers - For multilingual embedding models
  • ChromaDB Team - For the efficient vector database
  • Digital Humanities Community - For inspiration and research collaboration

📊 Citation

If you use this system in your research, please cite:

@software{german_rag_system,
  title={Historical German Language Evolution RAG System},
  author={Rohan Dhupar},
  year={2024},
  url={https://github.com/rohandhupar1996/historical-language-evolution-rag},
  note={A comprehensive system for analyzing German language evolution using RAG}
}

Built with ❤️ for Historical Linguistics Research

🌟 Star this repo | 🍴 Fork it | 📖 Read the docs
