diff --git a/.github/ISSUE_TEMPLATE/atomic-tasks/QUICK_SETUP_GUIDE.md b/.github/ISSUE_TEMPLATE/atomic-tasks/QUICK_SETUP_GUIDE.md index f368769..5fc9682 100644 --- a/.github/ISSUE_TEMPLATE/atomic-tasks/QUICK_SETUP_GUIDE.md +++ b/.github/ISSUE_TEMPLATE/atomic-tasks/QUICK_SETUP_GUIDE.md @@ -35,6 +35,10 @@ Look at completed tasks (M0-M2) for examples of: - Acceptance criteria format - Celebration ideas - Integration patterns +- **Project structure** - See Task 01.1, 01.2, 01.3 for directory layout +- **Documentation format** - README.md, QUICK_START.md, PROJECT_STATUS.txt patterns +- **Testing patterns** - Test organization, fixtures, benchmarks +- **Code organization** - Shared utilities, module structure --- @@ -154,6 +158,7 @@ Based on `14_security_compliance.md`: 2. Fill in details from parent task 3. Add specific examples and code 4. Review acceptance criteria +5. **Copy file structure** from similar completed task (01.1, 01.2, or 01.3) ### Option 2: Batch Creation (Faster) Use the provided breakdown above to create multiple tasks at once: @@ -163,15 +168,33 @@ Use the provided breakdown above to create multiple tasks at once: cd atomic-tasks/M3-semantic/05-embeddings/ cp ../../TASK_TEMPLATE.md 05.1-co-occurrence-matrix.md # Edit and customize... + +# Copy file structure from similar task +cp -r ../../M0-foundation/01-text-processing/01.2-language-detection/* \ + ../../M3-semantic/05-embeddings/05.1-co-occurrence-matrix/ +# Then customize files ``` ### Option 3: AI-Assisted (Fastest) Use the template + parent task + this guide to generate remaining tasks. 
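Option 2's copy-and-rename loop can also be scripted; a minimal sketch (the `scaffold_task` helper and its signature are illustrative, not an existing repo script):

```python
import shutil
from pathlib import Path


def scaffold_task(template: Path, milestone_dir: Path, task_id: str, slug: str) -> Path:
    """Copy TASK_TEMPLATE.md into a milestone folder as a new task file."""
    milestone_dir.mkdir(parents=True, exist_ok=True)
    dest = milestone_dir / f"{task_id}-{slug}.md"
    shutil.copyfile(template, dest)  # edit and customize the copy afterwards
    return dest
```

For example, `scaffold_task(Path("TASK_TEMPLATE.md"), Path("M3-semantic/05-embeddings"), "05.1", "co-occurrence-matrix")` creates `05.1-co-occurrence-matrix.md` ready for editing.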
+### Quick File Copying +When starting a new task, copy these files from completed tasks: +```bash +# From Task 01.2 or 01.3: +cp shared/logger.py [new-task]/shared/ +cp pytest.ini [new-task]/ +cp .gitignore [new-task]/ +cp tests/conftest.py [new-task]/tests/ +cp interactive_test.py [new-task]/ # Then customize +# Then customize for your specific task +``` + --- ## 📝 Quality Checklist for Each Task +### Task Definition - [ ] Clear daily breakdown (specific deliverables per day) - [ ] Measurable acceptance criteria (numbers, metrics) - [ ] Celebration moment defined @@ -180,6 +203,52 @@ Use the template + parent task + this guide to generate remaining tasks. - [ ] Test strategy included - [ ] Resources and tips provided +### Documentation Requirements +- [ ] **`README.md`** includes: + - [ ] Status section (✅ COMPLETE & PRODUCTION-READY format) + - [ ] Performance metrics table + - [ ] Quick start examples + - [ ] Project structure diagram + - [ ] Integration examples +- [ ] **`QUICK_START.md`** - 5-minute setup guide with examples +- [ ] **`PROJECT_STATUS.txt`** - Completion status with ASCII art header +- [ ] **`docs/ALGORITHMS.md`** includes: + - [ ] Detailed algorithm explanations with step-by-step breakdowns + - [ ] Mathematical formulations (if applicable) + - [ ] Code examples demonstrating key concepts + - [ ] Performance characteristics (complexity, optimization strategies) + - [ ] Learning Resources section with categorized links: + - Official documentation and specifications + - Research papers and academic resources + - Tutorials and hands-on guides + - Implementation examples and libraries + - Related concepts and advanced topics + - [ ] References section with citations +- [ ] **`docs/api/[feature]-guide.md`** - API documentation with examples + +### Code Structure +- [ ] Follows project structure pattern (text_processing/, tests/, benchmarks/, shared/, docs/) +- [ ] `shared/logger.py` included (copy from existing tasks or create new) +- [ ] 
`interactive_test.py` created for Python tasks +- [ ] `benchmarks/[feature]_perf.py` created for Python tasks +- [ ] Proper `__init__.py` files in all packages +- [ ] `tests/conftest.py` with fixtures + +### Configuration Files +- [ ] `setup.py` or `pyproject.toml` configured +- [ ] `requirements.txt` with runtime dependencies +- [ ] `requirements-dev.txt` with dev dependencies +- [ ] `pytest.ini` configured (matches Task 01.2/01.3 pattern) +- [ ] `.gitignore` includes: models/, training_data/, __pycache__/, .coverage, htmlcov/, .benchmarks/ + +### Testing +- [ ] Unit tests (≥85% coverage target) +- [ ] Integration tests with dependent tasks +- [ ] Performance benchmarks +- [ ] Interactive testing tool +- [ ] Edge case tests +- [ ] Error handling tests + --- ## 🎯 Priority Order for Creation @@ -206,17 +275,108 @@ Use the template + parent task + this guide to generate remaining tasks. 2. **Reuse Patterns:** Copy structure from similar completed tasks 3. **Focus on Differences:** Only customize what's unique to each task 4. **Test As You Go:** Validate task structure with team before creating many +5. **Copy Shared Files:** Reuse `shared/logger.py`, `pytest.ini`, `.gitignore` from completed tasks +6. **Follow Documentation Pattern:** Use Task 01.2's README.md and QUICK_START.md as templates +7. **Include Interactive Tools:** Always create `interactive_test.py` for Python tasks +8. **Benchmark Early:** Add performance benchmarks from the start --- ## 🚀 Next Steps -1. Review completed M0-M2 tasks for patterns +1. Review completed M0-M2 tasks for patterns: + - Check `01.1-unicode-normalization` for basic structure + - Check `01.2-language-detection` for ML/training patterns + - Check `01.3-script-specific-processing` for complex processing patterns 2. Choose a milestone to start (recommend M3) 3. Create 2-3 tasks using template -4. Get team feedback -5. Batch-create remaining tasks for that milestone -6. Repeat for other milestones +4. 
**Verify all deliverables** match checklist above +5. Get team feedback +6. Batch-create remaining tasks for that milestone +7. Repeat for other milestones + +## 📋 Standard File Checklist + +When creating a new task, ensure these files exist: + +### Required Files +- [ ] `README.md` (with status, metrics, examples) +- [ ] `QUICK_START.md` (5-minute guide) +- [ ] `PROJECT_STATUS.txt` (completion status) +- [ ] `docs/ALGORITHMS.md` (detailed algorithms) +- [ ] `docs/api/[feature]-guide.md` (API docs) +- [ ] `setup.py` or `pyproject.toml` +- [ ] `requirements.txt` +- [ ] `requirements-dev.txt` +- [ ] `pytest.ini` +- [ ] `.gitignore` +- [ ] `interactive_test.py` (Python tasks) +- [ ] `benchmarks/[feature]_perf.py` (Python tasks) +- [ ] `shared/logger.py` (copy from Task 01.2/01.3) +- [ ] `tests/conftest.py` + +### Optional Files (As Needed) +- [ ] `examples/integration_example.py` +- [ ] `scripts/[utility].sh` or `.py` +- [ ] `models/README.md` (if using models) +- [ ] `training_data/README.md` (if applicable) +- [ ] `docs/TRAINING_GUIDE.md` (for ML tasks) + +## 🔄 Common Patterns from Completed Tasks + +### Pattern 1: Shared Logger +**Always use:** `shared/logger.py` with structured logging +```python +from shared.logger import setup_logger +logger = setup_logger(__name__) +logger.info("Message", key=value) # Structured logging +``` + +### Pattern 2: Lazy Loading (Heavy Dependencies) +**For ML models, large libraries:** +```python +_model = None + +def _get_model(): + global _model + if _model is None: + import heavy_library + _model = heavy_library.load() + return _model +``` + +### Pattern 3: Error Handling +**Always handle gracefully:** +```python +try: + result = process_text(text) +except Exception as e: + logger.error("Processing failed", error=str(e)) + return default_result # Never crash +``` + +### Pattern 4: Performance Benchmarks +**Include in benchmarks/[feature]_perf.py:** +```python +import time +import pytest + +def test_throughput(): + # Test 1000+ 
docs/sec requirement + texts = ["sample"] * 1000 + start = time.time() + for text in texts: + process(text) + elapsed = time.time() - start + assert len(texts) / elapsed >= 1000 +``` + +### Pattern 5: Interactive Testing +**Create interactive_test.py:** +- Interactive CLI for manual testing +- Support batch mode (command-line args) +- Show processing steps and results +- Handle errors gracefully --- diff --git a/.github/ISSUE_TEMPLATE/atomic-tasks/TASK_TEMPLATE.md b/.github/ISSUE_TEMPLATE/atomic-tasks/TASK_TEMPLATE.md index 5fdd108..ed8e8a9 100644 --- a/.github/ISSUE_TEMPLATE/atomic-tasks/TASK_TEMPLATE.md +++ b/.github/ISSUE_TEMPLATE/atomic-tasks/TASK_TEMPLATE.md @@ -34,10 +34,13 @@ - [ ] [Coverage/quality target] ## 🧪 Testing Checklist -- [ ] Unit tests (≥85% coverage) -- [ ] Integration tests -- [ ] Performance benchmarks +- [ ] Unit tests (≥85% coverage target) +- [ ] Integration tests with dependent tasks +- [ ] Performance benchmarks (`benchmarks/[feature]_perf.py`) +- [ ] Interactive testing tool (`interactive_test.py`) +- [ ] Edge case tests (empty strings, malformed input, special characters) - [ ] Multi-language validation (if applicable) +- [ ] Error handling tests (missing dependencies, invalid input) ## 🎉 Celebration Criteria (Definition of Done) ✅ **Demo Ready:** [What to demonstrate] @@ -47,10 +50,70 @@ **🎊 Celebration Moment:** [What makes this exciting!] 
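The edge-case and error-handling bullets in the testing checklist above usually fit into one parametrized test; a generic sketch (`process` here is a stand-in for the task's real entry point):

```python
import pytest


def process(text: str) -> str:
    """Stand-in for the task's real entry point."""
    return text.strip()


# Checklist edge cases: empty string, whitespace, ZWNJ, diacritics, CJK
@pytest.mark.parametrize("text", ["", "   ", "\u200c", "héllo wörld", "你好世界"])
def test_edge_cases_never_crash(text):
    result = process(text)
    assert isinstance(result, str)  # graceful handling, no exception
```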
## 📦 Deliverables + +### Core Implementation Files - `path/to/main_file.[cpp/py]` ([X] lines estimated) - `path/to/test_file.[cpp/py]` ([X]+ test cases) -- `docs/[feature]-guide.md` -- [Additional files as needed] +- `shared/logger.py` (Copy from existing tasks or create new structured logger) +- `shared/__init__.py` + +### Documentation Files (Required) +- `README.md` - Complete overview, quick start, usage examples, performance metrics +- `QUICK_START.md` - 5-minute setup guide with examples +- `docs/ALGORITHMS.md` - Technical implementation details (see requirements below) +- `docs/api/[feature]-guide.md` - API documentation with examples +- `PROJECT_STATUS.txt` - Completion status with acceptance criteria checklist + +### ALGORITHMS.md Requirements +The `docs/ALGORITHMS.md` file must include: +- **Detailed algorithm explanations** with step-by-step breakdowns +- **Mathematical formulations** where applicable +- **Code examples** demonstrating key concepts +- **Performance characteristics** (complexity, optimization strategies) +- **Learning Resources section** with categorized links: + - Official documentation and specifications + - Research papers and academic resources + - Tutorials and hands-on guides + - Implementation examples and libraries + - Related concepts and advanced topics +- **References** section with citations + +See `modules/M0-foundation/01-text-processing/01.2-language-detection/docs/ALGORITHMS.md` and `01.3-script-specific-processing/docs/ALGORITHMS.md` for examples. 
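The Core Implementation Files above include `shared/logger.py`, which the rest of this guide calls as `setup_logger(__name__)` with keyword-argument structured logging. When there is no completed task to copy it from, a minimal stdlib-only sketch of that interface could look like this (the real shared logger may differ, e.g. by emitting JSON):

```python
import logging
import sys


class StructuredLogger:
    """Append keyword arguments to the message as key=value pairs."""

    def __init__(self, name: str):
        self._log = logging.getLogger(name)
        if not self._log.handlers:
            handler = logging.StreamHandler(sys.stderr)
            handler.setFormatter(
                logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s")
            )
            self._log.addHandler(handler)
            self._log.setLevel(logging.DEBUG)

    def _fmt(self, msg: str, kw: dict) -> str:
        if not kw:
            return msg
        pairs = " ".join(f"{k}={v!r}" for k, v in kw.items())
        return f"{msg} {pairs}"

    def debug(self, msg: str, **kw): self._log.debug(self._fmt(msg, kw))
    def info(self, msg: str, **kw): self._log.info(self._fmt(msg, kw))
    def error(self, msg: str, **kw): self._log.error(self._fmt(msg, kw))


def setup_logger(name: str) -> StructuredLogger:
    """Factory matching the setup_logger(__name__) call used in the patterns."""
    return StructuredLogger(name)
```

Usage: `logger = setup_logger(__name__)`; a call such as `logger.info("Detected", script="Arab")` renders the message as `Detected script='Arab'`.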
+
+### Testing & Tools
+- `tests/test_[feature].py` - Comprehensive test suite (≥85% coverage target)
+- `tests/conftest.py` - Pytest fixtures and configuration
+- `tests/__init__.py`
+- `interactive_test.py` - Interactive CLI for testing (Python tasks)
+- `benchmarks/[feature]_perf.py` - Performance benchmarks (Python tasks)
+
+### Configuration Files
+- `setup.py` or `pyproject.toml` - Package configuration
+  - Use `setup.py` for simple packages
+  - Use `pyproject.toml` for modern Python packaging (PEP 518)
+  - Include proper metadata (name, version, description, classifiers)
+- `requirements.txt` - Runtime dependencies
+  - List only production dependencies
+  - Constrain versions (e.g., `>=0.9.2`); pin exact versions with `==` where reproducibility matters
+- `requirements-dev.txt` - Development dependencies
+  - Include `-r requirements.txt` at top
+  - Add testing, linting, formatting tools
+- `pytest.ini` - Test configuration
+  - Match the pattern from Task 01.2/01.3
+  - Include coverage settings
+  - Define custom markers (slow, integration, requires_model, etc.)
+- `.gitignore` - Git ignore patterns + - **Required patterns**: `models/`, `training_data/`, `__pycache__/`, `.coverage`, `htmlcov/`, `.benchmarks/` + - Include IDE files (`.vscode/`, `.idea/`) + - Include OS files (`.DS_Store`, `Thumbs.db`) + - Keep directory structure: `!models/.gitkeep`, `!models/README.md` + +### Optional Files (As Needed) +- `examples/integration_example.py` - Integration examples +- `scripts/[utility].sh` or `scripts/[utility].py` - Utility scripts +- `docs/TRAINING_GUIDE.md` - Training guides (for ML tasks) +- `models/README.md` - Model documentation (if using models) +- `training_data/README.md` - Training data documentation (if applicable) ## 🔗 Dependencies & Integration @@ -70,19 +133,99 @@ ## 💡 Tips & Resources +### Project Structure Pattern +Follow the established structure from completed tasks: +``` +[task-name]/ +├── text_processing/ # Main implementation +│ ├── __init__.py +│ └── [main_module].py +├── tests/ # Test suite +│ ├── __init__.py +│ ├── conftest.py +│ └── test_[feature].py +├── benchmarks/ # Performance tests (Python) +│ ├── __init__.py +│ └── [feature]_perf.py +├── shared/ # Shared utilities +│ ├── __init__.py +│ └── logger.py # Structured logging +├── docs/ # Documentation +│ ├── ALGORITHMS.md +│ └── api/ +│ └── [feature]-guide.md +├── examples/ # Integration examples (optional) +│ └── integration_example.py +├── scripts/ # Utility scripts (optional) +│ └── [utility].sh or .py +├── models/ # ML models (if applicable) +│ └── README.md +├── training_data/ # Training data (if applicable) +│ └── README.md +├── README.md # Main documentation +├── QUICK_START.md # Quick start guide +├── PROJECT_STATUS.txt # Completion status +├── interactive_test.py # Interactive testing tool +├── setup.py # Package configuration +├── pyproject.toml # Alternative to setup.py (optional) +├── requirements.txt # Runtime dependencies +├── requirements-dev.txt # Development dependencies +├── pytest.ini # Test configuration +└── .gitignore # Git 
ignore patterns +``` + ### Common Pitfalls - ⚠️ [Pitfall 1]: [How to avoid] - ⚠️ [Pitfall 2]: [How to avoid] +- ⚠️ **Missing shared/logger.py**: Always copy from existing tasks or create structured logger +- ⚠️ **No interactive_test.py**: Create interactive CLI for easy testing +- ⚠️ **Missing benchmarks**: Include performance benchmarks for Python tasks +- ⚠️ **Incomplete .gitignore**: Include models/, training_data/, __pycache__/, .coverage, htmlcov/ ### Helpful Resources - [Resource 1 with link] - [Resource 2 with link] +- **Reference completed tasks**: Check `01.1-unicode-normalization` and `01.2-language-detection` for patterns +- **Shared logger**: Reuse `shared/logger.py` from existing tasks ### Example Code (if applicable) ```language // Quick example showing key concept ``` +### Documentation Standards + +**README.md** must include: +- Status section: `## ✅ Status: COMPLETE & PRODUCTION-READY` +- Performance metrics table (Target vs Achieved) +- Quick start section with code examples +- Project structure diagram +- Integration examples +- Troubleshooting section + +**QUICK_START.md** format: +- 5-minute setup guide +- Copy-paste examples +- Common tasks section +- Troubleshooting tips +- Reference to full README.md + +**PROJECT_STATUS.txt** format: +- ASCII art header (see Task 01.1/01.2 examples) +- Acceptance criteria checklist +- Test results summary +- Performance benchmarks table +- Deliverables checklist + +**ALGORITHMS.md**: See requirements above - must match depth of Task 01.2/01.3 examples + +### Code Quality Standards +- **Type hints**: Use type hints for all functions +- **Docstrings**: All public APIs must have docstrings +- **Error handling**: Graceful degradation, never crash on malformed input +- **Logging**: Use structured logging (`shared/logger.py`) instead of `print()` or `std::cout` +- **Testing**: ≥85% coverage target, comprehensive edge case tests + ## 📊 Success Metrics - **Quality:** [Metric] - **Performance:** [Metric] diff --git 
a/modules/M0-foundation/01-text-processing/01.3-script-specific-processing/README.md b/modules/M0-foundation/01-text-processing/01.3-script-specific-processing/README.md new file mode 100644 index 0000000..927cd6d --- /dev/null +++ b/modules/M0-foundation/01-text-processing/01.3-script-specific-processing/README.md @@ -0,0 +1,192 @@ +# Task 01.3: Script-Specific Processing + +Script-specific text processing module that processes text based on detected script codes from Task 01.2. Handles Arabic scripts (ZWNJ preservation), CJK (word segmentation), Cyrillic (variant unification), Latin (diacritic handling), and mixed-script scenarios. + +## 📋 Overview + +This module provides: +- **Arabic Processing:** ZWNJ preservation, diacritic removal, character shape normalization +- **CJK Processing:** Chinese (jieba segmentation), Japanese tokenization, Korean word boundaries +- **Cyrillic Processing:** Variant unification (ё → е), case folding +- **Latin Processing:** Diacritic normalization, ligature handling +- **Mixed-Script Support:** Handles bidirectional text and mixed scripts +- **High Performance:** 1000+ docs/sec throughput + +## 🚀 Quick Start + +### 1. Install Dependencies + +```bash +pip install -r requirements.txt +``` + +### 2. Basic Usage + +```python +from text_processing import process_by_script + +# Process Arabic text (preserves ZWNJ) +result = process_by_script("می‌خواهم", language_info) +print(result.text) # "می‌خواهم" (ZWNJ preserved) +print(result.applied_rules) # ["preserve_zwnj", "remove_diacritics"] + +# Process Chinese text (word segmentation) +result = process_by_script("你好世界", language_info) +print(result.word_boundaries) # [0, 2, 4] (word boundaries) +``` + +### 3. 
Run Tests + +```bash +pytest tests/ -v --cov=text_processing +``` + +## 💻 Usage + +### Basic Processing + +```python +from text_processing import ScriptHandler + +handler = ScriptHandler() + +# Process text with LanguageInfo from Task 01.2 +result = handler.process_by_script( + text="می‌خواهم", + language_info=LanguageInfo( + language_code="fa", + script_code="Arab", + confidence=0.98 + ) +) + +print(result.text) # Processed text +print(result.script_code) # "Arab" +print(result.applied_rules) # ["preserve_zwnj", ...] +``` + +### Mixed-Script Processing + +```python +# Handle bidirectional text +result = handler.process_mixed_script( + text="Hello سلام World", + language_info=LanguageInfo(...) +) + +# Each script segment processed separately +``` + +## 📦 Project Structure + +``` +01.3-script-specific-processing/ +├── text_processing/ +│ ├── script_handler.py # Main orchestrator (300-400 lines) +│ ├── arabic_processor.py # Arabic/Persian/Urdu (150 lines) +│ ├── cjk_processor.py # Chinese/Japanese/Korean (200 lines) +│ ├── cyrillic_processor.py # Cyrillic scripts (100 lines) +│ ├── latin_processor.py # Latin scripts (100 lines) +│ └── __init__.py +├── tests/ +│ └── test_script_processing.py # Comprehensive tests (200+ cases) +├── docs/ +│ ├── ALGORITHMS.md # Technical implementation details +│ └── api/ +│ └── script-processing.md # API documentation +├── examples/ +│ └── integration_example.py # Integration examples +├── shared/ +│ └── logger.py # Logging utilities +├── requirements.txt +├── requirements-dev.txt +├── setup.py +├── pytest.ini +└── README.md +``` + +## 🎯 Supported Scripts + +### Arabic Script (Arab) +- **ZWNJ Preservation:** Critical for Persian grammar +- **Diacritic Removal:** Normalizes Arabic diacritics +- **Shape Normalization:** Handles contextual forms + +### CJK Scripts +- **Chinese (Hans/Hant):** jieba word segmentation +- **Japanese (Jpan):** Tokenization (MeCab optional) +- **Korean (Kore):** Hangul syllable handling + +### Cyrillic Script 
(Cyrl) +- **Variant Unification:** ё → е (configurable) +- **Case Folding:** Proper Cyrillic case handling + +### Latin Script (Latn) +- **Diacritic Normalization:** é → e (configurable) +- **Ligature Handling:** æ, œ semantic preservation + +## 📊 Performance Requirements + +- **Throughput:** 1000+ docs/sec +- **Latency:** <10ms per document +- **Accuracy:** CJK segmentation ≥85%, ZWNJ preservation 100% + +## 🧪 Testing + +```bash +# Run all tests +pytest tests/ -v + +# Run with coverage +pytest tests/ --cov=text_processing --cov-report=html + +# Run specific script tests +pytest tests/test_script_processing.py::TestArabicProcessor -v +pytest tests/test_script_processing.py::TestCJKProcessor -v +``` + +## 🔗 Integration + +### With Task 01.2 (Language Detection) + +```python +from pathlib import Path +import sys + +# Import LanguageInfo from Task 01.2 +task_01_2_path = Path(__file__).parent.parent / "01.2-language-detection" +sys.path.insert(0, str(task_01_2_path)) +from text_processing import LanguageInfo + +# Use LanguageInfo for processing +result = process_by_script(text, language_info) +``` + +## 📖 Documentation + +- **[README.md](README.md)** - This file (overview & quick start) +- **[ALGORITHMS.md](docs/ALGORITHMS.md)** - Technical implementation details +- **[script-processing.md](docs/api/script-processing.md)** - API reference + +## 🐛 Troubleshooting + +### Import Error: LanguageInfo + +**Solution:** Ensure Task 01.2 is in the correct path or install it as a package. + +### jieba Not Found + +**Solution:** Install dependencies: +```bash +pip install -r requirements.txt +``` + +## 📝 License + +Part of search-engine-core project. 
+ +--- + +**Built with ❤️ for universal multilingual search** + +Last updated: 2025-01-XX diff --git a/modules/M0-foundation/01-text-processing/01.3-script-specific-processing/docs/ALGORITHMS.md b/modules/M0-foundation/01-text-processing/01.3-script-specific-processing/docs/ALGORITHMS.md new file mode 100644 index 0000000..380e6dc --- /dev/null +++ b/modules/M0-foundation/01-text-processing/01.3-script-specific-processing/docs/ALGORITHMS.md @@ -0,0 +1,993 @@ +# Algorithms and Implementation Details - Task 01.3 + +Deep dive into the algorithms and implementation details of script-specific text processing system. + +## 📋 Table of Contents + +1. [ZWNJ Preservation Algorithm](#zwnj-preservation-algorithm) + - [Overview](#overview) + - [Detailed Algorithm Explanation](#detailed-algorithm-explanation) + - [Why ZWNJ Matters](#why-zwnj-matters) + - [Implementation Details](#implementation-details) +2. [CJK Segmentation Algorithms](#cjk-segmentation-algorithms) + - [Chinese Segmentation (jieba)](#chinese-segmentation-jieba) + - [Japanese Tokenization](#japanese-tokenization) + - [Korean Segmentation](#korean-segmentation) +3. [Cyrillic Normalization Rules](#cyrillic-normalization-rules) + - [Variant Unification Algorithm](#variant-unification-algorithm) + - [Language-Specific Handling](#language-specific-handling) +4. [Latin Script Processing](#latin-script-processing) + - [Diacritic Normalization](#diacritic-normalization) + - [Ligature Handling](#ligature-handling) +5. [Bidirectional Text Handling](#bidirectional-text-handling) + - [Unicode BiDi Algorithm](#unicode-bidi-algorithm) + - [Script Boundary Detection](#script-boundary-detection) +6. [Performance Optimizations](#performance-optimizations) +7. [Unicode Script Detection Details](#unicode-script-detection-details) +8. [Learning Resources](#learning-resources) +9. 
[References](#references) + +## ZWNJ Preservation Algorithm + +### Overview + +Zero-Width Non-Joiner (ZWNJ, U+200C) is a critical invisible character in Persian (Farsi) text that affects word meaning and grammar. Unlike spaces or visible punctuation, ZWNJ is a **zero-width** character that prevents character joining in Arabic scripts without adding visual spacing. + +### Detailed Algorithm Explanation + +#### What is ZWNJ? + +ZWNJ (U+200C) is a Unicode control character that: +- Has **zero width** (invisible, no visual space) +- Prevents **character joining** in Arabic scripts +- Is **grammatically significant** in Persian +- Must be **preserved** during text processing + +#### Character Properties + +```python +ZWNJ = '\u200C' # Unicode code point U+200C + +# Properties: +# - Category: Cf (Format, other) +# - Bidirectional: BN (Boundary Neutral) +# - Width: 0 (zero-width) +# - Joining: Non-joining (prevents joining) +``` + +#### Algorithm Steps + +**Step 1: Detection** +```python +def preserve_zwnj(text: str) -> str: + """ + Preserve ZWNJ characters in text. + + Algorithm: + 1. Count ZWNJ occurrences + 2. Log for debugging + 3. Return text unchanged (preservation = no removal) + """ + zwnj_count = text.count(ZWNJ) + if zwnj_count > 0: + logger.debug(f"Preserving {zwnj_count} ZWNJ characters") + return text # Critical: Return unchanged! +``` + +**Step 2: Integration in Processing Pipeline** +```python +def process_arabic(text: str, language_code: str) -> ProcessedText: + # Step 1: ALWAYS preserve ZWNJ first (before any other processing) + text = preserve_zwnj(text) # No removal, just validation + + # Step 2: Other processing (diacritics, shapes, etc.) + # ... but ZWNJ remains untouched + + return ProcessedText(text=text, ...) 
+``` + +### Why ZWNJ Matters + +#### Grammatical Significance + +ZWNJ changes word meaning in Persian: + +**Example 1: Verb Forms** +```python +# With ZWNJ: "می‌خواهم" (mikhāham) = "I want" +text_with_zwnj = "می‌خواهم" # Contains U+200C + +# Without ZWNJ: "میخواهم" (mikhwāham) = Different verb form +text_without_zwnj = "میخواهم" # No ZWNJ + +# These are DIFFERENT words with DIFFERENT meanings! +``` + +**Example 2: Compound Words** +```python +# "خانه‌سازی" (khāne-sāzi) = "house-building" (with ZWNJ) +# "خانه‌سازی" without ZWNJ = incorrect spelling + +# "نیم‌روز" (nim-ruz) = "noon" (with ZWNJ) +# Without ZWNJ = incorrect +``` + +#### Search and Indexing Impact + +**Problem:** If ZWNJ is removed: +- Search for "می‌خواهم" won't find documents with "میخواهم" +- Indexing becomes inconsistent +- Users can't find content + +**Solution:** Always preserve ZWNJ: +- Consistent indexing +- Accurate search results +- Preserves linguistic meaning + +### Implementation Details + +#### Critical Rules + +1. **Never Remove ZWNJ** + ```python + # ❌ WRONG - Never do this! + text = text.replace(ZWNJ, '') # Removes ZWNJ - BREAKS MEANING! + text = text.replace(ZWNJ, ' ') # Converts to space - WRONG! + + # ✅ CORRECT - Always preserve + text = preserve_zwnj(text) # Returns unchanged + ``` + +2. **Never Normalize ZWNJ** + ```python + # ❌ WRONG - Don't normalize + normalized = unicodedata.normalize('NFC', text) # May affect ZWNJ + + # ✅ CORRECT - Preserve as-is + text = preserve_zwnj(text) # No normalization + ``` + +3. 
**Process Before Other Operations** + ```python + # ✅ CORRECT order: + text = preserve_zwnj(text) # First: preserve ZWNJ + text = remove_diacritics(text) # Then: other processing + # ZWNJ remains intact + ``` + +#### Validation and Logging + +```python +def preserve_zwnj(text: str) -> str: + """Preserve ZWNJ with validation.""" + original_count = text.count(ZWNJ) + + # Process (no changes, but validate) + preserved = text # No modifications + + # Verify preservation + preserved_count = preserved.count(ZWNJ) + if original_count != preserved_count: + logger.error(f"ZWNJ count changed: {original_count} → {preserved_count}") + raise ValueError("ZWNJ preservation failed!") + + if original_count > 0: + logger.debug(f"Preserved {original_count} ZWNJ characters") + + return preserved +``` + +### Mathematical Formulation + +**Preservation Function:** +\[ +\text{preserve\_zwnj}(T) = T \quad \text{where} \quad |T|_{\text{ZWNJ}} = |\text{preserve\_zwnj}(T)|_{\text{ZWNJ}} +\] + +Where: +- \( T \) = input text +- \( |T|_{\text{ZWNJ}} \) = count of ZWNJ characters in \( T \) +- Result: Text with identical ZWNJ count (preservation) + +**Invariant:** +\[ +\forall c \in T : c = \text{ZWNJ} \implies c \in \text{preserve\_zwnj}(T) +\] + +All ZWNJ characters in input must appear in output. + +## CJK Segmentation Algorithms + +### Chinese Segmentation (jieba) + +#### Overview + +Chinese text has no spaces between words. Word segmentation is essential for: +- Search indexing +- Text analysis +- Information retrieval + +#### jieba Algorithm + +**jieba** uses a combination of: +1. **Prefix Dictionary:** Pre-built word dictionary +2. **HMM (Hidden Markov Model):** For unknown words +3. 
**Viterbi Algorithm:** For optimal segmentation + +#### Detailed Process + +**Step 1: Dictionary Loading** +```python +# jieba loads a dictionary of known words +# Format: word frequency part_of_speech +# Example: +# 你好 1000 +# 世界 500 +# 中国 800 + +# Dictionary size: ~200,000+ words +``` + +**Step 2: Text Segmentation** +```python +def segment_chinese(text: str) -> List[str]: + """ + Segment Chinese text using jieba. + + Algorithm: + 1. Load jieba model (lazy loading) + 2. Use jieba.cut() for segmentation + 3. Return word list + """ + jieba = _get_jieba() # Lazy load + + # jieba.cut() uses: + # - Dictionary matching (longest match) + # - HMM for unknown words + # - Viterbi for optimal path + words = jieba.cut(text, cut_all=False) + + return list(words) +``` + +**Step 3: Word Boundary Extraction** +```python +def get_word_boundaries(text: str, words: List[str]) -> List[int]: + """ + Extract character positions of word boundaries. + + Example: + text = "你好世界" + words = ["你好", "世界"] + boundaries = [0, 2, 4] # Start, end of "你好", end of "世界" + """ + boundaries = [0] + current_pos = 0 + + for word in words: + pos = text.find(word, current_pos) + if pos != -1: + boundaries.append(pos + len(word)) + current_pos = pos + len(word) + + return sorted(set(boundaries)) +``` + +#### jieba Segmentation Modes + +**1. Precise Mode (Default)** +```python +words = jieba.cut(text, cut_all=False) +# "你好世界" → ["你好", "世界"] +# Most accurate, recommended for search +``` + +**2. Full Mode** +```python +words = jieba.cut(text, cut_all=True) +# "你好世界" → ["你", "好", "世界", "你好", "好世", "世界"] +# All possible words (for analysis) +``` + +**3. 
Search Mode** +```python +words = jieba.cut_for_search(text) +# "你好世界" → ["你好", "世界", "你", "好"] +# Includes shorter words for search +``` + +#### Performance Characteristics + +- **Accuracy:** ~85-95% on general text +- **Speed:** jieba-fast ~10x faster than standard jieba +- **Memory:** ~50MB dictionary +- **Latency:** <5ms per sentence + +### Japanese Tokenization + +#### Overview + +Japanese text mixes three scripts: +- **Hiragana** (ひらがな): Phonetic script +- **Katakana** (カタカナ): Foreign words, emphasis +- **Kanji** (漢字): Chinese characters + +#### MeCab Method (Preferred) + +**MeCab** is a morphological analyzer: + +**Step 1: MeCab Initialization** +```python +def _get_mecab(): + """Lazy load MeCab tagger.""" + global _mecab + if _mecab is None: + import MeCab + # -Owakati: Output format (space-separated words) + _mecab = MeCab.Tagger("-Owakati") + return _mecab +``` + +**Step 2: Tokenization** +```python +def segment_japanese(text: str) -> List[str]: + """Segment Japanese using MeCab.""" + mecab = _get_mecab() + + if mecab: + # MeCab parses text and outputs tokens + output = mecab.parse(text) + # "こんにちは世界" → "こんにちは 世界" + tokens = output.strip().split() + return tokens +``` + +**MeCab Output Format:** +``` +Input: "こんにちは世界" +Output: "こんにちは 世界" +Tokens: ["こんにちは", "世界"] +``` + +#### Regex Fallback Method + +When MeCab unavailable, use script-boundary detection: + +**Algorithm:** +```python +def segment_japanese_regex(text: str) -> List[str]: + """ + Regex-based Japanese segmentation. + + Strategy: + 1. Detect script boundaries (Hiragana/Katakana/Kanji) + 2. Split at script transitions + 3. 
Handle punctuation separately + """ + tokens = [] + current_token = [] + + for char in text: + script = detect_script_type(char) + + if script_changed(current_token, script): + tokens.append(''.join(current_token)) + current_token = [char] + else: + current_token.append(char) + + return tokens +``` + +**Script Detection:** +```python +def detect_script_type(char: str) -> str: + """Detect script type of character.""" + if '\u3040' <= char <= '\u309F': # Hiragana + return 'hiragana' + elif '\u30A0' <= char <= '\u30FF': # Katakana + return 'katakana' + elif '\u4E00' <= char <= '\u9FAF': # Kanji + return 'kanji' + else: + return 'other' +``` + +**Limitations:** +- Less accurate than MeCab +- Doesn't handle compound words well +- No part-of-speech information + +### Korean Segmentation + +#### Overview + +Korean uses **spaces** for word boundaries (unlike Chinese/Japanese), but also has: +- **Hangul syllables:** Composed characters +- **Compound words:** May need further segmentation + +#### Algorithm + +**Step 1: Space-Based Segmentation** +```python +def segment_korean(text: str) -> List[str]: + """ + Segment Korean text. 
+ + Korean uses spaces, so basic segmentation is simple: + """ + # Split by spaces (primary method) + words = text.split() + + # Further processing could include: + # - Morphological analysis (KoNLPy) + # - Compound word splitting + # - Honorific handling + + return words +``` + +**Example:** +```python +text = "안녕하세요 세계" +words = segment_korean(text) +# ["안녕하세요", "세계"] +``` + +**Future Enhancement:** +- Use KoNLPy for morphological analysis +- Handle compound words +- Extract morphemes for better indexing + +## Cyrillic Normalization Rules + +### Variant Unification Algorithm + +#### Overview + +Cyrillic script has character variants: +- **ё (U+0451)** vs **е (U+0435)**: Often normalized for consistency +- **і (U+0456)** vs **и (U+0438)**: Ukrainian/Belarusian specific + +#### Algorithm Steps + +**Step 1: Character Detection** +```python +def unify_cyrillic_variants(text: str, normalize_yo: bool = True) -> str: + """ + Unify Cyrillic variants. + + Algorithm: + 1. Iterate through characters + 2. Detect ё (U+0451) + 3. Replace with е (U+0435) if normalize_yo=True + 4. 
Preserve language-specific characters + """ + result = [] + normalized_count = 0 + + for char in text: + if char == CYRILLIC_YO and normalize_yo: + result.append(CYRILLIC_E) # ё → е + normalized_count += 1 + else: + result.append(char) # Preserve + + return ''.join(result) +``` + +**Step 2: Language-Specific Handling** +```python +# Russian: Usually normalize ё → е +text_ru = "ёлка" +normalized = unify_cyrillic_variants(text_ru, normalize_yo=True) +# "елка" + +# Ukrainian: Preserve specific characters +# і (U+0456), ї (U+0457), ґ (U+0491) are preserved +``` + +#### Mathematical Formulation + +**Normalization Function:** +\[ +\text{normalize}(c) = \begin{cases} +е & \text{if } c = ё \text{ and } \text{normalize\_yo} = \text{True} \\ +c & \text{otherwise} +\end{cases} +\] + +**Text Transformation:** +\[ +\text{unify}(T) = [\text{normalize}(c) \text{ for } c \in T] +\] + +### Language-Specific Handling + +#### Character Preservation Rules + +**Ukrainian Characters:** +- **і (U+0456)**: Ukrainian/Belarusian i (preserve) +- **ї (U+0457)**: Ukrainian yi (preserve) +- **ґ (U+0491)**: Ukrainian ghe (preserve) + +**Belarusian Characters:** +- **і (U+0456)**: Also Belarusian (preserve) + +**Implementation:** +```python +UKRAINIAN_SPECIFIC = { + '\u0456', # і + '\u0457', # ї + '\u0491', # ґ +} + +# These are NEVER normalized +# They are language-specific and must be preserved +``` + +## Latin Script Processing + +### Diacritic Normalization + +#### Overview + +Latin script uses diacritics (accents) that may need normalization: +- **é → e**: French/Spanish +- **ñ → n**: Spanish +- **ü → u**: German/Turkish + +#### Algorithm + +**Step 1: Unicode Decomposition** +```python +def normalize_diacritics(text: str) -> str: + """ + Normalize diacritics using Unicode NFD decomposition. + + Algorithm: + 1. Decompose characters (é → e + combining acute) + 2. Remove combining marks (diacritics) + 3. 
Reconstruct base characters + """ + normalized = [] + + for char in text: + # NFD: Normalization Form Decomposed + decomposed = unicodedata.normalize('NFD', char) + # "é" → "e" + "\u0301" (combining acute) + + # Keep only base characters (remove combining marks) + base_chars = [ + c for c in decomposed + if unicodedata.category(c) != 'Mn' # Mn = Nonspacing Mark + ] + + normalized.extend(base_chars) + + return ''.join(normalized) +``` + +**Example:** +```python +text = "café" +normalized = normalize_diacritics(text) +# "cafe" (é → e) +``` + +#### Language-Specific Preservation + +**Languages that preserve diacritics:** +- French: é, è, ê, ç +- Spanish: ñ, á, é, í, ó, ú +- German: ä, ö, ü, ß +- Polish: ą, ć, ę, ł, ń, ó, ś, ź, ż + +**Implementation:** +```python +PRESERVE_DIACRITICS_LANGUAGES = { + 'fr', 'es', 'de', 'pl', 'cs', 'sk', 'hu', 'ro', 'tr', 'vi', + 'is', 'da', 'no', 'sv', 'fi', 'et', 'lv', 'lt' +} + +def process_latin(text: str, language_code: str) -> ProcessedText: + should_preserve = language_code in PRESERVE_DIACRITICS_LANGUAGES + + if not should_preserve: + text = normalize_diacritics(text) + + return ProcessedText(text=text, ...) +``` + +### Ligature Handling + +#### Overview + +Ligatures are combined characters: +- **Semantic ligatures:** æ, œ (have meaning) +- **Non-semantic ligatures:** fi, fl (typographic) + +#### Algorithm + +**Step 1: Classification** +```python +SEMANTIC_LIGATURES = { + 'æ': 'ae', # Latin ligature + 'œ': 'oe', # French ligature +} + +NON_SEMANTIC_LIGATURES = { + 'fi': 'fi', # Typographic + 'fl': 'fl', + 'ff': 'ff', +} +``` + +**Step 2: Normalization** +```python +def handle_ligatures(text: str, preserve_semantic: bool = True) -> str: + """ + Handle ligatures. 
+ + Strategy: + - Preserve semantic ligatures (æ, œ) by default + - Always normalize non-semantic ligatures + """ + result = [] + + for char in text: + if char in SEMANTIC_LIGATURES: + if preserve_semantic: + result.append(char) # Preserve + else: + result.append(SEMANTIC_LIGATURES[char]) # Normalize + elif char in NON_SEMANTIC_LIGATURES: + result.append(NON_SEMANTIC_LIGATURES[char]) # Always normalize + else: + result.append(char) + + return ''.join(result) +``` + +## Bidirectional Text Handling + +### Unicode BiDi Algorithm + +#### Overview + +Bidirectional text mixes RTL (Right-to-Left) and LTR (Left-to-Right) scripts: +- **RTL:** Arabic, Hebrew +- **LTR:** Latin, Cyrillic, CJK + +#### Unicode Bidirectional Algorithm + +The Unicode BiDi algorithm determines text display order: + +**Basic Rules:** +1. **Strong characters:** Determine direction (Arabic = RTL, Latin = LTR) +2. **Neutral characters:** Inherit direction from context +3. **Directional overrides:** Explicit direction markers + +**Implementation:** +```python +def handle_bidirectional_text(text: str) -> str: + """ + Handle bidirectional text. + + Note: Full BiDi algorithm is complex. + We preserve logical order here. + For rendering, use python-bidi library. + """ + # Preserve logical order (as stored in memory) + # Visual rendering handled by rendering system + return text +``` + +### Script Boundary Detection + +#### Algorithm + +**Step 1: Character-by-Character Analysis** +```python +def detect_script_boundaries(text: str) -> List[Tuple[int, int, str]]: + """ + Detect script boundaries in mixed-script text. + + Algorithm: + 1. Iterate character by character + 2. Skip whitespace/punctuation + 3. Match against script patterns + 4. Detect script transitions + 5. 
Create boundary segments + """ + boundaries = [] + current_script = None + start_pos = 0 + + for i, char in enumerate(text): + if char.isspace() or not char.isalnum(): + continue # Skip + + # Detect script + detected_script = detect_script_for_char(char) + + # If script changed, save segment + if current_script and detected_script != current_script: + boundaries.append((start_pos, i, current_script)) + start_pos = i + current_script = detected_script + elif not current_script: + current_script = detected_script + start_pos = i + + # Add final segment + if current_script: + boundaries.append((start_pos, len(text), current_script)) + + return boundaries +``` + +**Step 2: Script Pattern Matching** +```python +SCRIPT_PATTERNS = { + 'Arab': re.compile(r'[\u0600-\u06FF...]'), # Arabic ranges + 'Latn': re.compile(r'[a-zA-Z]'), + 'Cyrl': re.compile(r'[\u0400-\u04FF]'), + # ... more patterns +} + +def detect_script_for_char(char: str) -> str: + """Detect script for single character.""" + for script_code, pattern in SCRIPT_PATTERNS.items(): + if pattern.match(char): + return script_code + return "Zyyy" # Common/Unknown +``` + +**Example:** +```python +text = "Hello سلام World" +boundaries = detect_script_boundaries(text) +# [(0, 6, "Latn"), (6, 11, "Arab"), (11, 16, "Latn")] +# Note: the space at index 10 is skipped, so its position +# ends up inside the span of the preceding (Arabic) segment +``` + +## Performance Optimizations + +### Lazy Loading + +**Heavy Dependencies:** +- jieba: ~50MB dictionary +- MeCab: ~100MB+ model + +**Implementation:** +```python +_jieba = None +_mecab = None + +def _get_jieba(): + """Lazy load jieba.""" + global _jieba + if _jieba is None: + import jieba_fast as jieba + _jieba = jieba + return _jieba +``` + +**Benefits:** +- Faster startup (don't load unused libraries) +- Lower memory usage +- Better error handling (fail only when needed) + +### Caching + +**Compiled Regex Patterns:** +```python +# Module-level compilation (cached) +SCRIPT_PATTERNS = { + 'Arab': re.compile(r'[\u0600-\u06FF...]'), # Compiled once + 'Latn': re.compile(r'[a-zA-Z]'), + # ... 
+} +``` + +**Model Caching:** +- jieba dictionary loaded once +- MeCab tagger initialized once +- Reused across multiple calls + +### Optimized Hot Paths + +**Script Detection:** +- Fast regex matching (O(1) per character) +- Early exit for single-script text +- Minimal memory allocations + +**ZWNJ Handling:** +- Simple character counting (O(n)) +- No complex processing +- Zero allocations + +## Unicode Script Detection Details + +### Script Detection Patterns + +**Unicode Ranges:** +```python +SCRIPT_PATTERNS = { + # Arabic script ranges + 'Arab': re.compile(r'[\u0600-\u06FF' # Basic Arabic + r'\u0750-\u077F' # Supplement + r'\u08A0-\u08FF' # Extended-A + r'\uFB50-\uFDFF' # Presentation Forms-A + r'\uFE70-\uFEFF]'), # Presentation Forms-B + + # Latin script + 'Latn': re.compile(r'[a-zA-Z]'), + + # Cyrillic script + 'Cyrl': re.compile(r'[\u0400-\u04FF]'), + + # CJK Unified Ideographs + 'Hans': re.compile(r'[\u4E00-\u9FFF]'), + 'Hant': re.compile(r'[\u4E00-\u9FFF]'), + + # Japanese scripts + 'Jpan': re.compile(r'[\u3040-\u309F' # Hiragana + r'\u30A0-\u30FF' # Katakana + r'\u4E00-\u9FFF]'), # Kanji + + # Korean script + 'Kore': re.compile(r'[\uAC00-\uD7AF' # Hangul Syllables + r'\u1100-\u11FF' # Hangul Jamo + r'\u3130-\u318F]'), # Compatibility Jamo +} +``` + +### ISO 15924 Script Codes + +**Standard Script Codes:** +- `Arab`: Arabic script +- `Latn`: Latin script +- `Cyrl`: Cyrillic script +- `Hans`: Simplified Chinese (Han) +- `Hant`: Traditional Chinese (Han) +- `Jpan`: Japanese +- `Kore`: Korean +- `Zyyy`: Common (unknown script) + +### Boundary Detection Complexity + +**Time Complexity:** O(n) where n = text length +- Single pass through text +- Constant-time pattern matching per character + +**Space Complexity:** O(k) where k = number of script segments +- Store boundary tuples +- Typically k << n + +## Learning Resources + +### ZWNJ and Arabic Script Processing + +**ZWNJ Fundamentals:** +- **[Unicode ZWNJ 
Specification](https://unicode.org/charts/PDF/U2000.pdf)** - Official Unicode specification for ZWNJ (U+200C) +- **[Zero-Width Non-Joiner (Wikipedia)](https://en.wikipedia.org/wiki/Zero-width_non-joiner)** - Comprehensive explanation of ZWNJ +- **[Persian Text Processing Guide](https://www.unicode.org/reports/tr53/)** - Unicode Technical Report on Persian text handling + +**Arabic Script Processing:** +- **[Arabic Text Processing (NLP Guide)](https://www.nltk.org/book/ch12.html)** - Natural Language Processing with Python chapter on Arabic +- **[python-arabic-reshaper](https://github.com/mpcabd/python-arabic-reshaper)** - Python library for Arabic text reshaping +- **[Arabic NLP Resources](https://github.com/ARBML/ARBML)** - Arabic NLP tools and resources + +### CJK Text Segmentation + +**Chinese Word Segmentation:** +- **[jieba GitHub Repository](https://github.com/fxsjy/jieba)** - Official jieba library with documentation +- **[jieba-fast Documentation](https://github.com/deepcs233/jieba_fast)** - Fast jieba implementation +- **[Chinese Word Segmentation Survey](https://arxiv.org/abs/1808.04911)** - Research paper on Chinese segmentation methods +- **[Stanford Chinese NLP](https://web.stanford.edu/class/cs224n/)** - Stanford NLP course covering Chinese processing + +**Japanese Tokenization:** +- **[MeCab Official Site](https://taku910.github.io/mecab/)** - MeCab morphological analyzer documentation +- **[MeCab Python Tutorial](https://github.com/SamuraiT/mecab-python3)** - Python bindings for MeCab +- **[Japanese NLP Guide](https://www.nltk.org/book/ch12.html)** - NLTK book chapter on Japanese processing +- **[SudachiPy](https://github.com/WorksApplications/SudachiPy)** - Alternative Japanese tokenizer + +**Korean Text Processing:** +- **[KoNLPy Documentation](https://konlpy.org/)** - Korean NLP library for Python +- **[Korean Morphological Analysis](https://github.com/hyunwoongko/kss)** - Korean sentence splitter +- **[Hangul Processing 
Guide](https://github.com/kaniblu/hangul-toolkit)** - Hangul text processing tools + +### Cyrillic Script Processing + +**Cyrillic Normalization:** +- **[Cyrillic Script (Wikipedia)](https://en.wikipedia.org/wiki/Cyrillic_script)** - Overview of Cyrillic script +- **[Unicode Cyrillic Blocks](https://unicode.org/charts/PDF/U0400.pdf)** - Unicode specification for Cyrillic +- **[Russian Text Processing](https://github.com/natasha/natasha)** - Russian NLP library + +**Variant Handling:** +- **[ё vs е Normalization](https://en.wikipedia.org/wiki/Yo_(Cyrillic))** - Explanation of ё character +- **[Ukrainian Character Encoding](https://en.wikipedia.org/wiki/Ukrainian_alphabet)** - Ukrainian-specific characters + +### Latin Script Processing + +**Diacritic Normalization:** +- **[Unicode Normalization Forms](https://unicode.org/reports/tr15/)** - Unicode Technical Report on normalization +- **[NFD vs NFC Normalization](https://en.wikipedia.org/wiki/Unicode_equivalence)** - Understanding Unicode equivalence +- **[Diacritics in European Languages](https://en.wikipedia.org/wiki/Diacritic)** - Comprehensive guide to diacritics + +**Ligature Handling:** +- **[Typographic Ligatures](https://en.wikipedia.org/wiki/Typographic_ligature)** - Explanation of ligatures +- **[Unicode Ligatures](https://unicode.org/charts/PDF/UFB00.pdf)** - Unicode ligature characters + +### Bidirectional Text + +**Unicode BiDi Algorithm:** +- **[Unicode Bidirectional Algorithm](https://unicode.org/reports/tr9/)** - Official Unicode BiDi specification +- **[python-bidi Library](https://github.com/MeirKhalili/python-bidi)** - Python implementation of BiDi algorithm +- **[RTL Text Rendering Guide](https://www.w3.org/International/questions/qa-html-dir)** - W3C guide to RTL text + +**Mixed-Script Handling:** +- **[Script Detection Algorithms](https://unicode.org/reports/tr24/)** - Unicode script detection +- **[Multilingual Text Processing](https://www.nltk.org/book/ch12.html)** - NLTK multilingual 
processing + +### Performance Optimization + +**Lazy Loading Patterns:** +- **[Python Lazy Loading Patterns](https://realpython.com/python-import/)** - Real Python guide to imports +- **[Memory Optimization Techniques](https://docs.python.org/3/library/sys.html#sys.getsizeof)** - Python memory management + +**Regex Optimization:** +- **[Python Regex Performance](https://docs.python.org/3/library/re.html)** - Official regex documentation +- **[Compiled Regex Patterns](https://docs.python.org/3/library/re.html#re.compile)** - Using compiled patterns + +### General Unicode and Text Processing + +**Unicode Fundamentals:** +- **[Unicode Standard](https://unicode.org/standard/standard.html)** - Official Unicode standard +- **[Unicode Tutorial (Joel Spolsky)](https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/)** - Classic Unicode tutorial +- **[Unicode Character Database](https://www.unicode.org/ucd/)** - Unicode character database + +**Text Processing Libraries:** +- **[NLTK Book](https://www.nltk.org/book/)** - Natural Language Processing with Python +- **[spaCy Documentation](https://spacy.io/usage/linguistic-features)** - Modern NLP library +- **[Text Processing Best Practices](https://docs.python.org/3/library/string.html)** - Python string processing + +### Implementation Examples + +**Real-World Implementations:** +- **[Google's Text Normalization](https://github.com/google/text-normalization)** - Google's text normalization library +- **[Facebook's FastText](https://fasttext.cc/)** - Text classification and language detection +- **[OpenNLP](https://opennlp.apache.org/)** - Apache OpenNLP for text processing + +**Code Examples:** +- **[Python Text Processing Cookbook](https://github.com/PacktPublishing/Python-Text-Processing-Cookbook)** - Practical examples +- **[NLP with Python Examples](https://github.com/nltk/nltk_book)** - NLTK book code examples + 
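To tie the script-detection material above together, here is a small, self-contained, runnable sketch of boundary detection. It uses only the Arabic/Latin/Cyrillic ranges listed in this document; the function names mirror the ones used in the algorithm sections, but this is an illustrative reimplementation, not the project's actual module.

```python
import re
from typing import List, Tuple

# Subset of the Unicode ranges discussed above (illustrative only)
SCRIPT_PATTERNS = {
    'Arab': re.compile(r'[\u0600-\u06FF\u0750-\u077F\u08A0-\u08FF'
                       r'\uFB50-\uFDFF\uFE70-\uFEFF]'),
    'Latn': re.compile(r'[a-zA-Z]'),
    'Cyrl': re.compile(r'[\u0400-\u04FF]'),
}

def detect_script_for_char(char: str) -> str:
    """Return the ISO 15924 code of the first matching script pattern."""
    for script_code, pattern in SCRIPT_PATTERNS.items():
        if pattern.match(char):
            return script_code
    return 'Zyyy'  # Common/Unknown

def detect_script_boundaries(text: str) -> List[Tuple[int, int, str]]:
    """Single O(n) pass; returns (start, end, script) half-open segments."""
    boundaries: List[Tuple[int, int, str]] = []
    current_script = None
    start_pos = 0

    for i, char in enumerate(text):
        if not char.isalnum():
            continue  # neutral characters (spaces, punctuation) are skipped
        detected = detect_script_for_char(char)
        if current_script is None:
            current_script, start_pos = detected, i
        elif detected != current_script:
            boundaries.append((start_pos, i, current_script))
            current_script, start_pos = detected, i

    if current_script is not None:
        boundaries.append((start_pos, len(text), current_script))
    return boundaries

print(detect_script_boundaries("Hello سلام World"))
# → [(0, 6, 'Latn'), (6, 11, 'Arab'), (11, 16, 'Latn')]
```

Because neutral characters are skipped rather than segmented, the space between the Arabic and Latin runs (index 10) is absorbed into the span of the preceding segment; a full implementation might instead attach neutrals by BiDi context.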
+## References + +1. **Unicode Standards:** + - [Unicode Standard](https://unicode.org/standard/standard.html) + - [ISO 15924 Script Codes](https://unicode.org/iso15924/) + - [Unicode Technical Report #9: BiDi Algorithm](https://unicode.org/reports/tr9/) + +2. **ZWNJ and Arabic:** + - Unicode Character Database: U+200C (ZWNJ) + - [Persian Text Processing (Unicode TR53)](https://www.unicode.org/reports/tr53/) + +3. **CJK Segmentation:** + - [jieba: Chinese Word Segmentation](https://github.com/fxsjy/jieba) + - [MeCab: Japanese Morphological Analyzer](https://taku910.github.io/mecab/) + - [KoNLPy: Korean NLP](https://konlpy.org/) + +4. **Text Processing:** + - [NLTK Book: Multilingual Text Processing](https://www.nltk.org/book/ch12.html) + - [Unicode Normalization Forms (TR15)](https://unicode.org/reports/tr15/) + +5. **Performance:** + - [Python Performance Tips](https://wiki.python.org/moin/PythonSpeed/PerformanceTips) + - [Regex Optimization Guide](https://docs.python.org/3/library/re.html) + +--- + +**Technical implementation for script-specific multilingual text processing** 🔬 diff --git a/modules/M0-foundation/01-text-processing/01.3-script-specific-processing/docs/api/script-processing.md b/modules/M0-foundation/01-text-processing/01.3-script-specific-processing/docs/api/script-processing.md new file mode 100644 index 0000000..6b5b991 --- /dev/null +++ b/modules/M0-foundation/01-text-processing/01.3-script-specific-processing/docs/api/script-processing.md @@ -0,0 +1,476 @@ +# API Reference - Script-Specific Processing + +Complete API documentation for Task 01.3 script-specific text processing. + +## Table of Contents + +1. [Main API](#main-api) +2. [Arabic Processor](#arabic-processor) +3. [CJK Processor](#cjk-processor) +4. [Cyrillic Processor](#cyrillic-processor) +5. [Latin Processor](#latin-processor) +6. [Script Handler](#script-handler) +7. 
[Data Structures](#data-structures) + +## Main API + +### `process_by_script` + +Convenience function to process text by script. + +```python +from text_processing import process_by_script + +result = process_by_script( + text: str, + language_info: LanguageInfo, + **kwargs +) -> ProcessedText +``` + +**Parameters:** +- `text` (str): Input text to process +- `language_info` (LanguageInfo): Language detection result from Task 01.2 +- `**kwargs`: Processor-specific options (see individual processors) + +**Returns:** +- `ProcessedText`: Processed text with metadata + +**Example:** +```python +from text_processing import process_by_script + +language_info = LanguageInfo( + language_code="fa", + script_code="Arab", + confidence=0.98 +) + +result = process_by_script("می‌خواهم", language_info) +print(result.text) # "می‌خواهم" (ZWNJ preserved) +``` + +### `process_mixed_script` + +Process mixed-script text by detecting boundaries. + +```python +from text_processing import process_mixed_script + +result = process_mixed_script( + text: str, + language_info: LanguageInfo, + **kwargs +) -> ProcessedText +``` + +**Parameters:** +- `text` (str): Input text (may contain multiple scripts) +- `language_info` (LanguageInfo): Language detection result +- `**kwargs`: Processor-specific options + +**Returns:** +- `ProcessedText`: Processed text with metadata + +**Example:** +```python +result = process_mixed_script("Hello سلام", language_info) +# Processes each script segment separately +``` + +## Arabic Processor + +### `process_arabic` + +Process Arabic script text (Arabic, Persian, Urdu). + +```python +from text_processing.arabic_processor import process_arabic + +result = process_arabic( + text: str, + language_code: str, + preserve_diacritics: bool = False, + normalize_shapes: bool = True +) -> ProcessedText +``` + +**Parameters:** +- `text` (str): Arabic script text +- `language_code` (str): ISO 639-1 code (ar, fa, ur, etc.) 
+- `preserve_diacritics` (bool): If True, keep Arabic diacritics +- `normalize_shapes` (bool): If True, normalize character shapes + +**Returns:** +- `ProcessedText`: Processed text with ZWNJ preserved + +**Example:** +```python +result = process_arabic("می‌خواهم", "fa") +# ZWNJ preserved, diacritics removed +``` + +### `preserve_zwnj` + +Preserve ZWNJ characters (critical for Persian). + +```python +from text_processing.arabic_processor import preserve_zwnj + +text = preserve_zwnj(text: str) -> str +``` + +### `remove_arabic_diacritics` + +Remove Arabic diacritics (tashkeel). + +```python +from text_processing.arabic_processor import remove_arabic_diacritics + +text = remove_arabic_diacritics(text: str) -> str +``` + +### `normalize_arabic_shapes` + +Normalize Arabic character shapes. + +```python +from text_processing.arabic_processor import normalize_arabic_shapes + +text = normalize_arabic_shapes(text: str) -> str +``` + +## CJK Processor + +### `process_cjk` + +Process CJK script text (Chinese, Japanese, Korean). + +```python +from text_processing.cjk_processor import process_cjk + +result = process_cjk( + text: str, + language_code: str, + script_code: str +) -> ProcessedText +``` + +**Parameters:** +- `text` (str): CJK text +- `language_code` (str): ISO 639-1 code (zh, ja, ko) +- `script_code` (str): ISO 15924 code (Hans, Hant, Jpan, Kore) + +**Returns:** +- `ProcessedText`: Segmented text with word boundaries + +**Example:** +```python +result = process_cjk("你好世界", "zh", "Hans") +print(result.word_boundaries) # [0, 2, 4] +``` + +### `segment_chinese` + +Segment Chinese text using jieba. + +```python +from text_processing.cjk_processor import segment_chinese + +words = segment_chinese(text: str) -> List[str] +``` + +**Requires:** jieba or jieba-fast + +### `segment_japanese` + +Segment Japanese text (MeCab or regex fallback). 
+ +```python +from text_processing.cjk_processor import segment_japanese + +tokens = segment_japanese(text: str) -> List[str] +``` + +**Optional:** MeCab for better accuracy + +### `segment_korean` + +Segment Korean text. + +```python +from text_processing.cjk_processor import segment_korean + +words = segment_korean(text: str) -> List[str] +``` + +### `get_word_boundaries` + +Get character positions of word boundaries. + +```python +from text_processing.cjk_processor import get_word_boundaries + +boundaries = get_word_boundaries(text: str, words: List[str]) -> List[int] +``` + +## Cyrillic Processor + +### `process_cyrillic` + +Process Cyrillic script text. + +```python +from text_processing.cyrillic_processor import process_cyrillic + +result = process_cyrillic( + text: str, + language_code: str, + normalize_yo: bool = True +) -> ProcessedText +``` + +**Parameters:** +- `text` (str): Cyrillic text +- `language_code` (str): ISO 639-1 code (ru, uk, be, etc.) +- `normalize_yo` (bool): If True, normalize ё to е + +**Returns:** +- `ProcessedText`: Processed text with variants unified + +**Example:** +```python +result = process_cyrillic("ёлка", "ru", normalize_yo=True) +# "елка" (ё normalized to е) +``` + +### `unify_cyrillic_variants` + +Unify Cyrillic character variants. + +```python +from text_processing.cyrillic_processor import unify_cyrillic_variants + +text = unify_cyrillic_variants( + text: str, + normalize_yo: bool = True, + language_code: str = "" +) -> str +``` + +## Latin Processor + +### `process_latin` + +Process Latin script text. + +```python +from text_processing.latin_processor import process_latin + +result = process_latin( + text: str, + language_code: str, + normalize_diacritics_flag: bool = False, + preserve_semantic_ligatures: bool = True +) -> ProcessedText +``` + +**Parameters:** +- `text` (str): Latin text +- `language_code` (str): ISO 639-1 code (en, fr, es, etc.) 
+- `normalize_diacritics_flag` (bool): If True, normalize diacritics +- `preserve_semantic_ligatures` (bool): If True, preserve æ, œ + +**Returns:** +- `ProcessedText`: Processed text + +**Example:** +```python +result = process_latin("café", "fr", normalize_diacritics_flag=False) +# Diacritics preserved for French +``` + +### `normalize_diacritics` + +Normalize diacritics in Latin text. + +```python +from text_processing.latin_processor import normalize_diacritics + +text = normalize_diacritics( + text: str, + preserve_for_languages: List[str] = None +) -> str +``` + +### `handle_ligatures` + +Handle ligatures in Latin text. + +```python +from text_processing.latin_processor import handle_ligatures + +text = handle_ligatures( + text: str, + preserve_semantic: bool = True +) -> str +``` + +## Script Handler + +### `ScriptHandler` + +Main handler class for script-specific processing. + +```python +from text_processing import ScriptHandler + +handler = ScriptHandler() +``` + +### Methods + +#### `process_by_script` + +Process text based on script code. + +```python +result = handler.process_by_script( + text: str, + language_info: LanguageInfo, + **kwargs +) -> ProcessedText +``` + +#### `process_mixed_script` + +Process mixed-script text. + +```python +result = handler.process_mixed_script( + text: str, + language_info: LanguageInfo, + **kwargs +) -> ProcessedText +``` + +## Data Structures + +### `ProcessedText` + +Result dataclass for processed text. 
+ +```python +@dataclass +class ProcessedText: + text: str # Script-processed text + original: str # Original text + script_code: str # ISO 15924 script + language_code: str # ISO 639-1 language + applied_rules: List[str] # Processing rules applied + word_boundaries: List[int] # Word boundaries (for CJK) + confidence: float # Language detection confidence +``` + +**Example:** +```python +result = ProcessedText( + text="processed", + original="original", + script_code="Arab", + language_code="fa", + applied_rules=["preserve_zwnj", "remove_diacritics"], + word_boundaries=[], + confidence=0.98 +) +``` + +### `LanguageInfo` + +Language detection result from Task 01.2. + +```python +@dataclass +class LanguageInfo: + language_code: str + script_code: str + confidence: float + is_mixed_content: bool = False + detected_languages: List[Tuple[str, float]] = None +``` + +## Configuration Options + +### Arabic Processing + +- `preserve_diacritics`: Keep Arabic diacritics (default: False) +- `normalize_shapes`: Normalize character shapes (default: True) + +### Cyrillic Processing + +- `normalize_yo`: Normalize ё to е (default: True) + +### Latin Processing + +- `normalize_diacritics`: Normalize diacritics (default: False) +- `preserve_semantic_ligatures`: Preserve æ, œ (default: True) + +## Error Handling + +### Missing Dependencies + +**jieba:** Required for Chinese segmentation +```python +ImportError: jieba or jieba-fast required for Chinese processing +``` + +**MeCab:** Optional for Japanese (falls back to regex) + +### Invalid Input + +**Empty String:** Returns empty `ProcessedText` + +**Unknown Script:** Returns text as-is with `no_processing` rule + +## Performance Notes + +- **Lazy Loading:** Heavy dependencies loaded on-demand +- **Caching:** Compiled patterns and models cached +- **Throughput:** 1000+ docs/sec target +- **Latency:** <10ms per document target + +## Integration Examples + +### With Task 01.2 + +```python +# Import LanguageInfo from Task 01.2 +from 
pathlib import Path +import sys + +task_01_2_path = Path(__file__).parent.parent / "01.2-language-detection" +sys.path.insert(0, str(task_01_2_path)) +from text_processing import LanguageInfo + +# Use for processing +language_info = LanguageInfo( + language_code="fa", + script_code="Arab", + confidence=0.98 +) + +result = process_by_script("می‌خواهم", language_info) +``` + +### Full Pipeline + +```python +# Task 01.1: Normalize +normalized = normalize_text(text) + +# Task 01.2: Detect language +language_info = detect_language(normalized) + +# Task 01.3: Process by script +processed = process_by_script(normalized, language_info) +``` diff --git a/modules/M0-foundation/01-text-processing/01.3-script-specific-processing/examples/integration_example.py b/modules/M0-foundation/01-text-processing/01.3-script-specific-processing/examples/integration_example.py new file mode 100644 index 0000000..85baa83 --- /dev/null +++ b/modules/M0-foundation/01-text-processing/01.3-script-specific-processing/examples/integration_example.py @@ -0,0 +1,301 @@ +""" +Integration Examples - Task 01.3 + +Demonstrates integration with Task 01.2 (Language Detection) +and usage of script-specific processors. 
+""" + +import sys +from pathlib import Path + +# Add Task 01.2 to path for LanguageInfo import +task_01_2_path = Path(__file__).parent.parent.parent / "01.2-language-detection" +if task_01_2_path.exists(): + sys.path.insert(0, str(task_01_2_path)) + try: + from text_processing import LanguageInfo, detect_language + HAS_TASK_01_2 = True + except ImportError: + HAS_TASK_01_2 = False + # Fallback LanguageInfo + from dataclasses import dataclass + from typing import List, Tuple + + @dataclass + class LanguageInfo: + language_code: str + script_code: str + confidence: float + is_mixed_content: bool = False + detected_languages: List[Tuple[str, float]] = None + + def __post_init__(self): + if self.detected_languages is None: + self.detected_languages = [] +else: + HAS_TASK_01_2 = False + from dataclasses import dataclass + from typing import List, Tuple + + @dataclass + class LanguageInfo: + language_code: str + script_code: str + confidence: float + is_mixed_content: bool = False + detected_languages: List[Tuple[str, float]] = None + + def __post_init__(self): + if self.detected_languages is None: + self.detected_languages = [] + +# Import Task 01.3 +from text_processing import ( + ScriptHandler, + process_by_script, + process_mixed_script, +) + + +def example_arabic_processing(): + """Example: Arabic script processing with ZWNJ preservation.""" + print("=" * 60) + print("Example 1: Arabic Script Processing") + print("=" * 60) + + handler = ScriptHandler() + + # Persian text with ZWNJ + text = "می‌خواهم" + language_info = LanguageInfo( + language_code="fa", + script_code="Arab", + confidence=0.98 + ) + + result = handler.process_by_script(text, language_info) + + print(f"Original: {result.original}") + print(f"Processed: {result.text}") + print(f"Script: {result.script_code}") + print(f"Language: {result.language_code}") + print(f"Rules: {', '.join(result.applied_rules)}") + print(f"ZWNJ preserved: {'\u200C' in result.text}") + print() + + +def 
example_chinese_processing(): + """Example: Chinese word segmentation.""" + print("=" * 60) + print("Example 2: Chinese Word Segmentation") + print("=" * 60) + + handler = ScriptHandler() + + text = "你好世界" + language_info = LanguageInfo( + language_code="zh", + script_code="Hans", + confidence=0.99 + ) + + try: + result = handler.process_by_script(text, language_info) + + print(f"Original: {result.original}") + print(f"Processed: {result.text}") + print(f"Script: {result.script_code}") + print(f"Language: {result.language_code}") + print(f"Rules: {', '.join(result.applied_rules)}") + print(f"Word boundaries: {result.word_boundaries}") + except ImportError as e: + print(f"Error: {e}") + print("Install jieba-fast: pip install jieba-fast") + print() + + +def example_cyrillic_processing(): + """Example: Cyrillic variant unification.""" + print("=" * 60) + print("Example 3: Cyrillic Variant Unification") + print("=" * 60) + + handler = ScriptHandler() + + text = "ёлка" + language_info = LanguageInfo( + language_code="ru", + script_code="Cyrl", + confidence=0.98 + ) + + # Normalize ё → е + result = handler.process_by_script(text, language_info, normalize_yo=True) + + print(f"Original: {result.original}") + print(f"Processed: {result.text}") + print(f"Script: {result.script_code}") + print(f"Rules: {', '.join(result.applied_rules)}") + print(f"ё normalized: {'ё' not in result.text}") + print() + + +def example_latin_processing(): + """Example: Latin script processing.""" + print("=" * 60) + print("Example 4: Latin Script Processing") + print("=" * 60) + + handler = ScriptHandler() + + # English text + text = "Hello World" + language_info = LanguageInfo( + language_code="en", + script_code="Latn", + confidence=0.99 + ) + + result = handler.process_by_script(text, language_info) + + print(f"Original: {result.original}") + print(f"Processed: {result.text}") + print(f"Script: {result.script_code}") + print(f"Rules: {', '.join(result.applied_rules)}") + print() + + # French 
text (preserves diacritics) + text_fr = "café" + language_info_fr = LanguageInfo( + language_code="fr", + script_code="Latn", + confidence=0.97 + ) + + result_fr = handler.process_by_script(text_fr, language_info_fr) + + print(f"French text: {result_fr.original}") + print(f"Processed: {result_fr.text}") + print(f"Diacritics preserved: {'é' in result_fr.text}") + print() + + +def example_mixed_script(): + """Example: Mixed-script text processing.""" + print("=" * 60) + print("Example 5: Mixed-Script Processing") + print("=" * 60) + + handler = ScriptHandler() + + # Arabic + Latin mixed + text = "Hello سلام World" + language_info = LanguageInfo( + language_code="fa", + script_code="Arab", + confidence=0.95, + is_mixed_content=True + ) + + result = handler.process_mixed_script(text, language_info) + + print(f"Original: {result.original}") + print(f"Processed: {result.text}") + print(f"Scripts detected: Multiple") + print(f"Rules: {', '.join(result.applied_rules)}") + print() + + +def example_integration_with_task_01_2(): + """Example: Integration with Task 01.2 language detection.""" + print("=" * 60) + print("Example 6: Integration with Task 01.2") + print("=" * 60) + + if not HAS_TASK_01_2: + print("Task 01.2 not available - using mock LanguageInfo") + print() + return + + # Detect language first + texts = [ + "Hello World", + "می‌خواهم", + "你好世界", + "Привет мир" + ] + + handler = ScriptHandler() + + for text in texts: + print(f"\nText: {text}") + + # Detect language (Task 01.2) + language_info = detect_language(text) + + print(f"Detected: {language_info.language_code} ({language_info.script_code})") + print(f"Confidence: {language_info.confidence:.2f}") + + # Process by script (Task 01.3) + result = handler.process_by_script(text, language_info) + + print(f"Processed: {result.text}") + print(f"Rules: {', '.join(result.applied_rules)}") + print() + + +def example_performance_benchmark(): + """Example: Performance benchmarking.""" + print("=" * 60) + print("Example 
7: Performance Benchmark") + print("=" * 60) + + import time + + handler = ScriptHandler() + + # Test texts + texts = { + "Arabic": ("می‌خواهم", LanguageInfo("fa", "Arab", 0.98)), + "Latin": ("Hello World", LanguageInfo("en", "Latn", 0.99)), + "Cyrillic": ("Привет", LanguageInfo("ru", "Cyrl", 0.98)), + } + + iterations = 1000 + + for name, (text, lang_info) in texts.items(): + start = time.time() + for _ in range(iterations): + handler.process_by_script(text, lang_info) + elapsed = time.time() - start + + throughput = iterations / elapsed + avg_latency = (elapsed / iterations) * 1000 # ms + + print(f"{name}:") + print(f" Throughput: {throughput:.0f} docs/sec") + print(f" Avg Latency: {avg_latency:.2f} ms") + print() + + +def main(): + """Run all examples.""" + print("\n" + "=" * 60) + print("Script-Specific Processing - Integration Examples") + print("=" * 60 + "\n") + + example_arabic_processing() + example_chinese_processing() + example_cyrillic_processing() + example_latin_processing() + example_mixed_script() + example_integration_with_task_01_2() + example_performance_benchmark() + + print("=" * 60) + print("Examples completed!") + print("=" * 60) + + +if __name__ == "__main__": + main() diff --git a/modules/M0-foundation/01-text-processing/01.3-script-specific-processing/interactive_test.py b/modules/M0-foundation/01-text-processing/01.3-script-specific-processing/interactive_test.py new file mode 100755 index 0000000..3630af4 --- /dev/null +++ b/modules/M0-foundation/01-text-processing/01.3-script-specific-processing/interactive_test.py @@ -0,0 +1,201 @@ +#!/usr/bin/env python3 +""" +Interactive Testing Tool - Task 01.3 + +Interactive CLI for testing script-specific processing. 
+""" + +import sys +from pathlib import Path + +# Add Task 01.2 to path +task_01_2_path = Path(__file__).parent.parent / "01.2-language-detection" +if task_01_2_path.exists(): + sys.path.insert(0, str(task_01_2_path)) + try: + from text_processing import LanguageInfo, detect_language + HAS_TASK_01_2 = True + except ImportError: + HAS_TASK_01_2 = False + from dataclasses import dataclass + from typing import List, Tuple + + @dataclass + class LanguageInfo: + language_code: str + script_code: str + confidence: float + is_mixed_content: bool = False + detected_languages: List[Tuple[str, float]] = None + + def __post_init__(self): + if self.detected_languages is None: + self.detected_languages = [] +else: + HAS_TASK_01_2 = False + from dataclasses import dataclass + from typing import List, Tuple + + @dataclass + class LanguageInfo: + language_code: str + script_code: str + confidence: float + is_mixed_content: bool = False + detected_languages: List[Tuple[str, float]] = None + + def __post_init__(self): + if self.detected_languages is None: + self.detected_languages = [] + +from text_processing import ScriptHandler, process_by_script + + +def print_result(result): + """Print processing result.""" + print("\n" + "=" * 60) + print("Processing Result") + print("=" * 60) + print(f"Original: {result.original}") + print(f"Processed: {result.text}") + print(f"Script: {result.script_code}") + print(f"Language: {result.language_code}") + print(f"Confidence: {result.confidence:.2f}") + print(f"Rules Applied: {', '.join(result.applied_rules)}") + if result.word_boundaries: + print(f"Word Boundaries: {result.word_boundaries}") + print("=" * 60 + "\n") + + +def interactive_mode(): + """Interactive mode for testing.""" + handler = ScriptHandler() + + print("\n" + "=" * 60) + print("Script-Specific Processing - Interactive Test") + print("=" * 60) + print("\nEnter text to process (or 'quit' to exit)") + print("Examples:") + print(" - Persian: می‌خواهم") + print(" - Chinese: 你好世界") + 
print(" - Russian: Привет") + print(" - English: Hello World") + print(" - Mixed: Hello سلام") + print("=" * 60 + "\n") + + while True: + try: + text = input("Enter text: ").strip() + + if text.lower() in ('quit', 'exit', 'q'): + print("Goodbye!") + break + + if not text: + continue + + # Try to detect language if Task 01.2 available + if HAS_TASK_01_2: + try: + language_info = detect_language(text) + print(f"\nDetected: {language_info.language_code} ({language_info.script_code})") + except Exception as e: + print(f"\nLanguage detection failed: {e}") + print("Using default language info...") + language_info = LanguageInfo( + language_code="en", + script_code="Latn", + confidence=0.5 + ) + else: + # Manual language selection + print("\nSelect language:") + print("1. Persian (fa)") + print("2. Arabic (ar)") + print("3. Chinese (zh)") + print("4. Japanese (ja)") + print("5. Korean (ko)") + print("6. Russian (ru)") + print("7. English (en)") + print("8. French (fr)") + + choice = input("Choice (1-8, default 7): ").strip() or "7" + + lang_map = { + "1": ("fa", "Arab"), + "2": ("ar", "Arab"), + "3": ("zh", "Hans"), + "4": ("ja", "Jpan"), + "5": ("ko", "Kore"), + "6": ("ru", "Cyrl"), + "7": ("en", "Latn"), + "8": ("fr", "Latn"), + } + + lang_code, script_code = lang_map.get(choice, ("en", "Latn")) + language_info = LanguageInfo( + language_code=lang_code, + script_code=script_code, + confidence=0.95 + ) + + # Process text + try: + result = handler.process_by_script(text, language_info) + print_result(result) + except Exception as e: + print(f"\nError processing text: {e}") + import traceback + traceback.print_exc() + print() + + except KeyboardInterrupt: + print("\n\nGoodbye!") + break + except EOFError: + print("\n\nGoodbye!") + break + + +def batch_mode(texts): + """Batch mode for processing multiple texts.""" + handler = ScriptHandler() + + print("\n" + "=" * 60) + print("Batch Processing") + print("=" * 60 + "\n") + + for i, text in enumerate(texts, 1): + print(f"Text 
{i}: {text}") + + # Detect language if available + if HAS_TASK_01_2: + try: + language_info = detect_language(text) + except Exception: + language_info = LanguageInfo("en", "Latn", 0.5) + else: + language_info = LanguageInfo("en", "Latn", 0.5) + + # Process + try: + result = handler.process_by_script(text, language_info) + print(f" → {result.text} ({result.script_code})") + except Exception as e: + print(f" → Error: {e}") + print() + + +def main(): + """Main entry point.""" + if len(sys.argv) > 1: + # Batch mode + texts = sys.argv[1:] + batch_mode(texts) + else: + # Interactive mode + interactive_mode() + + +if __name__ == "__main__": + main() diff --git a/modules/M0-foundation/01-text-processing/01.3-script-specific-processing/pytest.ini b/modules/M0-foundation/01-text-processing/01.3-script-specific-processing/pytest.ini new file mode 100644 index 0000000..0ca85d2 --- /dev/null +++ b/modules/M0-foundation/01-text-processing/01.3-script-specific-processing/pytest.ini @@ -0,0 +1,44 @@ +[pytest] +# Pytest configuration for script-specific processing tests + +# Test discovery +testpaths = tests +python_files = test_*.py +python_classes = Test* +python_functions = test_* + +# Output options +addopts = + -v + --strict-markers + --tb=short + --cov=text_processing + --cov-report=term-missing + --cov-report=html + +# Markers +markers = + slow: marks tests as slow (deselect with '-m "not slow"') + integration: marks tests as integration tests + requires_jieba: marks tests that require jieba + requires_mecab: marks tests that require MeCab + +# Coverage options. NOTE: coverage.py does not read pytest.ini; duplicate the +# sections below in .coveragerc (or setup.cfg / pyproject.toml) for them to take effect. +[coverage:run] +source = text_processing +omit = + */tests/* + */benchmarks/* + */scripts/* + */examples/* + +[coverage:report] +precision = 2 +exclude_lines = + pragma: no cover + def __repr__ + raise AssertionError + raise NotImplementedError + if __name__ == .__main__.: + if TYPE_CHECKING: diff --git a/modules/M0-foundation/01-text-processing/01.3-script-specific-processing/requirements-dev.txt 
b/modules/M0-foundation/01-text-processing/01.3-script-specific-processing/requirements-dev.txt new file mode 100644 index 0000000..d72cbdb --- /dev/null +++ b/modules/M0-foundation/01-text-processing/01.3-script-specific-processing/requirements-dev.txt @@ -0,0 +1,15 @@ +# Development Dependencies - Task 01.3 Script-Specific Processing +-r requirements.txt + +# Testing +pytest>=7.4.3 +pytest-cov>=4.1.0 +pytest-benchmark>=4.0.0 +pytest-timeout>=2.2.0 + +# Code Quality +black>=23.12.0 +flake8>=6.1.0 +mypy>=1.7.1 +isort>=5.13.2 +memory-profiler>=0.61.0 diff --git a/modules/M0-foundation/01-text-processing/01.3-script-specific-processing/requirements.txt b/modules/M0-foundation/01-text-processing/01.3-script-specific-processing/requirements.txt new file mode 100644 index 0000000..155eb99 --- /dev/null +++ b/modules/M0-foundation/01-text-processing/01.3-script-specific-processing/requirements.txt @@ -0,0 +1,8 @@ +# Core Dependencies - Task 01.3 Script-Specific Processing +jieba-fast>=0.53.0 # Fast Chinese word segmentation (CJK) +structlog>=23.2.0 # Structured logging +unicodedata2>=15.1.0 # Enhanced Unicode support (optional but recommended) + +# Optional dependencies for extended functionality +# mecab-python3>=1.0.6 # Japanese tokenization (requires MeCab system library) +# opencc-python-reimplemented>=1.1.6 # Traditional ↔ Simplified Chinese conversion diff --git a/modules/M0-foundation/01-text-processing/01.3-script-specific-processing/setup.py b/modules/M0-foundation/01-text-processing/01.3-script-specific-processing/setup.py new file mode 100644 index 0000000..6afcc80 --- /dev/null +++ b/modules/M0-foundation/01-text-processing/01.3-script-specific-processing/setup.py @@ -0,0 +1,56 @@ +"""Setup configuration for script-specific processing package - Task 01.3""" + +from setuptools import setup, find_packages +from pathlib import Path + +# Read README +readme_file = Path(__file__).parent / "README.md" +if readme_file.exists(): + with open(readme_file, "r", 
encoding="utf-8") as fh: + long_description = fh.read() +else: + long_description = "Script-specific text processing for search engine" + +setup( + name="search-engine-script-processing", + version="0.1.0", + author="Search Engine Team", + description="Script-specific text processing: Arabic (ZWNJ), CJK (segmentation), Cyrillic (variants), Latin (diacritics)", + long_description=long_description, + long_description_content_type="text/markdown", + packages=find_packages(exclude=["tests", "benchmarks", "scripts", "examples"]), + python_requires=">=3.9", + install_requires=[ + "jieba-fast>=0.53.0", # Fast Chinese word segmentation + "structlog>=23.2.0", # Structured logging + "unicodedata2>=15.1.0", # Enhanced Unicode support (optional but recommended) + ], + extras_require={ + "dev": [ + "pytest>=7.4.3", + "pytest-cov>=4.1.0", + "pytest-benchmark>=4.0.0", + "pytest-timeout>=2.2.0", + "black>=23.12.0", + "flake8>=6.1.0", + "mypy>=1.7.1", + "isort>=5.13.2", + "memory-profiler>=0.61.0", + ], + "extended": [ + # Optional extended CJK support + "mecab-python3>=1.0.6", # Japanese tokenization (requires MeCab system library) + "opencc-python-reimplemented>=1.1.6", # Traditional ↔ Simplified Chinese conversion + ], + }, + classifiers=[ + "Development Status :: 4 - Beta", + "Intended Audience :: Developers", + "Programming Language :: Python :: 3.9", + "Programming Language :: Python :: 3.10", + "Programming Language :: Python :: 3.11", + "Topic :: Text Processing :: Linguistic", + "Topic :: Scientific/Engineering :: Artificial Intelligence", + ], + keywords="text-processing script-processing arabic cjk cyrillic latin unicode nlp search-engine", +) diff --git a/modules/M0-foundation/01-text-processing/01.3-script-specific-processing/shared/__init__.py b/modules/M0-foundation/01-text-processing/01.3-script-specific-processing/shared/__init__.py new file mode 100644 index 0000000..d06febb --- /dev/null +++ 
b/modules/M0-foundation/01-text-processing/01.3-script-specific-processing/shared/__init__.py @@ -0,0 +1 @@ +"""Shared utilities for script-specific processing.""" diff --git a/modules/M0-foundation/01-text-processing/01.3-script-specific-processing/shared/logger.py b/modules/M0-foundation/01-text-processing/01.3-script-specific-processing/shared/logger.py new file mode 100644 index 0000000..c436bc2 --- /dev/null +++ b/modules/M0-foundation/01-text-processing/01.3-script-specific-processing/shared/logger.py @@ -0,0 +1,60 @@ +"""Logging configuration for the script-processing package.""" + +import logging +import sys + +import structlog + + +def setup_logger( + name: str, + level: str = "INFO", + json_format: bool = False +) -> structlog.typing.FilteringBoundLogger: + """ + Setup structured logger for script-processing components. + + Args: + name: Logger name (usually __name__) + level: Log level (DEBUG, INFO, WARNING, ERROR) + json_format: Use JSON output format (for production) + + Returns: + Configured structlog logger + """ + log_level = getattr(logging, level.upper(), logging.INFO) + + # Configure standard library logging + logging.basicConfig( + format="%(message)s", + stream=sys.stdout, + level=log_level, + ) + + # Configure structlog + processors = [ + structlog.contextvars.merge_contextvars, + structlog.processors.add_log_level, + structlog.processors.StackInfoRenderer(), + structlog.dev.set_exc_info, + structlog.processors.TimeStamper(fmt="iso"), + ] + + if json_format: + processors.append(structlog.processors.JSONRenderer()) + else: + processors.append(structlog.dev.ConsoleRenderer()) + + structlog.configure( + processors=processors, + wrapper_class=structlog.make_filtering_bound_logger(log_level), + context_class=dict, + logger_factory=structlog.PrintLoggerFactory(), + cache_logger_on_first_use=True, + ) + + return structlog.get_logger(name) + + +# Default logger +logger = setup_logger("script-processing") diff --git 
a/modules/M0-foundation/01-text-processing/01.3-script-specific-processing/tests/__init__.py b/modules/M0-foundation/01-text-processing/01.3-script-specific-processing/tests/__init__.py new file mode 100644 index 0000000..fc9cd51 --- /dev/null +++ b/modules/M0-foundation/01-text-processing/01.3-script-specific-processing/tests/__init__.py @@ -0,0 +1 @@ +"""Tests for script-specific processing.""" diff --git a/modules/M0-foundation/01-text-processing/01.3-script-specific-processing/tests/conftest.py b/modules/M0-foundation/01-text-processing/01.3-script-specific-processing/tests/conftest.py new file mode 100644 index 0000000..f1a5acd --- /dev/null +++ b/modules/M0-foundation/01-text-processing/01.3-script-specific-processing/tests/conftest.py @@ -0,0 +1,99 @@ +"""Pytest configuration and fixtures for script-specific processing tests.""" + +import pytest +from dataclasses import dataclass +from typing import List, Tuple + + +@dataclass +class LanguageInfo: + """Minimal LanguageInfo for testing.""" + language_code: str + script_code: str + confidence: float + is_mixed_content: bool = False + detected_languages: List[Tuple[str, float]] = None + + def __post_init__(self): + if self.detected_languages is None: + self.detected_languages = [] + + +@pytest.fixture +def language_info_fa(): + """Persian language info.""" + return LanguageInfo( + language_code="fa", + script_code="Arab", + confidence=0.98 + ) + + +@pytest.fixture +def language_info_ar(): + """Arabic language info.""" + return LanguageInfo( + language_code="ar", + script_code="Arab", + confidence=0.95 + ) + + +@pytest.fixture +def language_info_zh(): + """Chinese language info.""" + return LanguageInfo( + language_code="zh", + script_code="Hans", + confidence=0.99 + ) + + +@pytest.fixture +def language_info_ja(): + """Japanese language info.""" + return LanguageInfo( + language_code="ja", + script_code="Jpan", + confidence=0.97 + ) + + +@pytest.fixture +def language_info_ko(): + """Korean language info.""" + 
return LanguageInfo( + language_code="ko", + script_code="Kore", + confidence=0.96 + ) + + +@pytest.fixture +def language_info_ru(): + """Russian language info.""" + return LanguageInfo( + language_code="ru", + script_code="Cyrl", + confidence=0.98 + ) + + +@pytest.fixture +def language_info_en(): + """English language info.""" + return LanguageInfo( + language_code="en", + script_code="Latn", + confidence=0.99 + ) + + +@pytest.fixture +def language_info_fr(): + """French language info.""" + return LanguageInfo( + language_code="fr", + script_code="Latn", + confidence=0.97 + ) diff --git a/modules/M0-foundation/01-text-processing/01.3-script-specific-processing/tests/test_integration.py b/modules/M0-foundation/01-text-processing/01.3-script-specific-processing/tests/test_integration.py new file mode 100644 index 0000000..3a3f6bd --- /dev/null +++ b/modules/M0-foundation/01-text-processing/01.3-script-specific-processing/tests/test_integration.py @@ -0,0 +1,207 @@ +""" +Integration Tests - Task 01.3 + +Tests integration with Tasks 01.1 and 01.2. 
+""" + +import pytest +import sys +from pathlib import Path + +# Try to import Task 01.2 +task_01_2_path = Path(__file__).parent.parent.parent / "01.2-language-detection" +if task_01_2_path.exists(): + sys.path.insert(0, str(task_01_2_path)) + try: + from text_processing import LanguageInfo, detect_language + HAS_TASK_01_2 = True + except ImportError: + HAS_TASK_01_2 = False + from dataclasses import dataclass + from typing import List, Tuple + + @dataclass + class LanguageInfo: + language_code: str + script_code: str + confidence: float + is_mixed_content: bool = False + detected_languages: List[Tuple[str, float]] = None + + def __post_init__(self): + if self.detected_languages is None: + self.detected_languages = [] +else: + HAS_TASK_01_2 = False + from dataclasses import dataclass + from typing import List, Tuple + + @dataclass + class LanguageInfo: + language_code: str + script_code: str + confidence: float + is_mixed_content: bool = False + detected_languages: List[Tuple[str, float]] = None + + def __post_init__(self): + if self.detected_languages is None: + self.detected_languages = [] + +from text_processing import ScriptHandler, process_by_script + + +@pytest.mark.integration +class TestTask01_2Integration: + """Test integration with Task 01.2 (Language Detection).""" + + @pytest.mark.skipif(not HAS_TASK_01_2, reason="Task 01.2 not available") + def test_detect_and_process_arabic(self): + """Test language detection + script processing for Arabic.""" + text = "می‌خواهم" + + # Detect language (Task 01.2) + language_info = detect_language(text) + + # Process by script (Task 01.3) + handler = ScriptHandler() + result = handler.process_by_script(text, language_info) + + assert result.script_code == "Arab" + assert result.language_code == "fa" + assert result.confidence == language_info.confidence + + @pytest.mark.skipif(not HAS_TASK_01_2, reason="Task 01.2 not available") + def test_detect_and_process_chinese(self): + """Test language detection + script 
processing for Chinese.""" + text = "你好世界" + + # Detect language (Task 01.2) + language_info = detect_language(text) + + # Process by script (Task 01.3) + handler = ScriptHandler() + result = handler.process_by_script(text, language_info) + + assert result.script_code in ("Hans", "Hant") + assert result.language_code == "zh" + + @pytest.mark.skipif(not HAS_TASK_01_2, reason="Task 01.2 not available") + def test_detect_and_process_cyrillic(self): + """Test language detection + script processing for Cyrillic.""" + text = "Привет мир" + + # Detect language (Task 01.2) + language_info = detect_language(text) + + # Process by script (Task 01.3) + handler = ScriptHandler() + result = handler.process_by_script(text, language_info) + + assert result.script_code == "Cyrl" + assert result.language_code == "ru" + + @pytest.mark.skipif(not HAS_TASK_01_2, reason="Task 01.2 not available") + def test_detect_and_process_latin(self): + """Test language detection + script processing for Latin.""" + text = "Hello World" + + # Detect language (Task 01.2) + language_info = detect_language(text) + + # Process by script (Task 01.3) + handler = ScriptHandler() + result = handler.process_by_script(text, language_info) + + assert result.script_code == "Latn" + assert result.language_code == "en" + + @pytest.mark.skipif(not HAS_TASK_01_2, reason="Task 01.2 not available") + def test_detect_and_process_mixed(self): + """Test language detection + script processing for mixed text.""" + text = "Hello سلام" + + # Detect language (Task 01.2) + language_info = detect_language(text) + + # Process mixed script (Task 01.3) + handler = ScriptHandler() + result = handler.process_mixed_script(text, language_info) + + assert result.original == text + assert len(result.applied_rules) > 0 + + +@pytest.mark.integration +class TestFullPipeline: + """Test full processing pipeline.""" + + def test_arabic_pipeline(self): + """Test full Arabic processing pipeline.""" + handler = ScriptHandler() + text = 
"می‌خواهم" + language_info = LanguageInfo("fa", "Arab", 0.98) + + result = handler.process_by_script(text, language_info) + + # Verify ZWNJ preserved + assert '\u200C' in result.text + assert result.script_code == "Arab" + assert "preserve_zwnj" in result.applied_rules + + def test_cjk_pipeline(self): + """Test full CJK processing pipeline.""" + handler = ScriptHandler() + text = "你好世界" + language_info = LanguageInfo("zh", "Hans", 0.99) + + try: + result = handler.process_by_script(text, language_info) + assert result.script_code == "Hans" + assert result.language_code == "zh" + except ImportError: + pytest.skip("jieba not available") + + def test_cyrillic_pipeline(self): + """Test full Cyrillic processing pipeline.""" + handler = ScriptHandler() + text = "ёлка" + language_info = LanguageInfo("ru", "Cyrl", 0.98) + + result = handler.process_by_script(text, language_info, normalize_yo=True) + assert result.script_code == "Cyrl" + assert "unify_variants" in result.applied_rules + + def test_latin_pipeline(self): + """Test full Latin processing pipeline.""" + handler = ScriptHandler() + text = "Hello World" + language_info = LanguageInfo("en", "Latn", 0.99) + + result = handler.process_by_script(text, language_info) + assert result.script_code == "Latn" + assert result.language_code == "en" + + +@pytest.mark.integration +class TestConvenienceFunctions: + """Test convenience functions work correctly.""" + + def test_process_by_script_function(self): + """Test process_by_script convenience function.""" + text = "Hello" + language_info = LanguageInfo("en", "Latn", 0.99) + + result = process_by_script(text, language_info) + assert result.text == text + assert result.script_code == "Latn" + + def test_process_mixed_script_function(self): + """Test process_mixed_script convenience function.""" + from text_processing import process_mixed_script + + text = "Hello سلام" + language_info = LanguageInfo("fa", "Arab", 0.95) + + result = process_mixed_script(text, language_info) + 
assert result.original == text diff --git a/modules/M0-foundation/01-text-processing/01.3-script-specific-processing/tests/test_script_processing.py b/modules/M0-foundation/01-text-processing/01.3-script-specific-processing/tests/test_script_processing.py new file mode 100644 index 0000000..19c6a40 --- /dev/null +++ b/modules/M0-foundation/01-text-processing/01.3-script-specific-processing/tests/test_script_processing.py @@ -0,0 +1,614 @@ +""" +Comprehensive tests for script-specific processing - Task 01.3 + +Tests cover: +- Arabic script (ZWNJ preservation, diacritics, shapes) +- CJK scripts (Chinese, Japanese, Korean segmentation) +- Cyrillic script (variant unification, case folding) +- Latin script (diacritics, ligatures) +- Mixed-script handling +- Edge cases (empty strings, malformed Unicode, etc.) +- Performance requirements +""" + +import pytest +from text_processing import ( + ScriptHandler, + ProcessedText, + process_by_script, + process_mixed_script, +) +from text_processing.arabic_processor import ( + preserve_zwnj, + remove_arabic_diacritics, + normalize_arabic_shapes, + process_arabic, + ZWNJ, +) +from text_processing.cjk_processor import ( + segment_chinese, + segment_japanese, + segment_korean, + process_cjk, +) +from text_processing.cyrillic_processor import ( + unify_cyrillic_variants, + process_cyrillic, +) +from text_processing.latin_processor import ( + normalize_diacritics, + handle_ligatures, + process_latin, +) +from text_processing.script_handler import detect_script_boundaries + +# tests/ is a package (it has __init__.py), so conftest must be imported via the package +from tests.conftest import LanguageInfo + + +# ============================================================================ +# Arabic Script Tests +# ============================================================================ + +class TestArabicProcessor: + """Test Arabic script processing.""" + + def test_zwnj_preservation(self, language_info_fa): + """Test ZWNJ preservation in Persian text.""" + text = "می‌خواهم" # Contains ZWNJ + result = process_arabic(text, "fa") 
+ + assert ZWNJ in result.text, "ZWNJ must be preserved" + assert result.text == text, "ZWNJ text should remain unchanged" + assert "preserve_zwnj" in result.applied_rules + + def test_zwnj_preservation_function(self): + """Test ZWNJ preservation function.""" + text = "می‌خواهم" + preserved = preserve_zwnj(text) + assert preserved == text + assert ZWNJ in preserved + + def test_zwnj_without_zwnj(self): + """Test text without ZWNJ.""" + text = "سلام" + preserved = preserve_zwnj(text) + assert preserved == text + + def test_remove_arabic_diacritics(self): + """Test Arabic diacritic removal.""" + text = "مَرْحَبًا" + result = remove_arabic_diacritics(text) + assert "مرحبا" in result or len(result) < len(text) + assert not any("\u064B" <= ch <= "\u0652" for ch in result), "no tashkil marks should remain" + + def test_preserve_diacritics(self): + """Test preserving Arabic diacritics.""" + text = "مَرْحَبًا" + result = process_arabic(text, "ar", preserve_diacritics=True) + assert "preserve_diacritics" in result.applied_rules + + def test_normalize_arabic_shapes(self): + """Test Arabic shape normalization.""" + text = "سلام" + result = normalize_arabic_shapes(text) + assert isinstance(result, str) + assert len(result) == len(text) + + def test_arabic_processing_persian(self, language_info_fa): + """Test full Arabic processing for Persian.""" + text = "می‌خواهم" + result = process_arabic(text, "fa") + + assert result.script_code == "Arab" + assert result.language_code == "fa" + assert result.original == text + assert "preserve_zwnj" in result.applied_rules + + def test_arabic_processing_arabic(self, language_info_ar): + """Test full Arabic processing for Arabic.""" + text = "مرحبا" + result = process_arabic(text, "ar") + + assert result.script_code == "Arab" + assert result.language_code == "ar" + assert result.original == text + + def test_arabic_empty_string(self): + """Test Arabic processing with empty string.""" + result = process_arabic("", "fa") + assert result.text == "" + assert 
result.original == "" + + def test_arabic_mixed_content(self): + """Test Arabic with mixed content.""" + text = "سلام Hello" + result = process_arabic(text, "fa") + assert result.script_code == "Arab" + + +# ============================================================================ +# CJK Script Tests +# ============================================================================ + +class TestCJKProcessor: + """Test CJK script processing.""" + + @pytest.mark.requires_jieba + def test_chinese_segmentation(self, language_info_zh): + """Test Chinese word segmentation.""" + text = "你好世界" + result = process_cjk(text, "zh", "Hans") + + assert result.script_code == "Hans" + assert result.language_code == "zh" + assert "chinese_segmentation" in result.applied_rules + assert len(result.word_boundaries) > 0 + + @pytest.mark.requires_jieba + def test_segment_chinese_function(self): + """Test Chinese segmentation function.""" + text = "你好世界" + words = segment_chinese(text) + assert isinstance(words, list) + assert len(words) > 0 + + def test_japanese_tokenization(self, language_info_ja): + """Test Japanese tokenization.""" + text = "こんにちは世界" + result = process_cjk(text, "ja", "Jpan") + + assert result.script_code == "Jpan" + assert result.language_code == "ja" + assert "japanese_tokenization" in result.applied_rules + + def test_segment_japanese_function(self): + """Test Japanese segmentation function.""" + text = "こんにちは" + words = segment_japanese(text) + assert isinstance(words, list) + assert len(words) > 0 + + def test_korean_segmentation(self, language_info_ko): + """Test Korean word segmentation.""" + text = "안녕하세요 세계" + result = process_cjk(text, "ko", "Kore") + + assert result.script_code == "Kore" + assert result.language_code == "ko" + assert "korean_segmentation" in result.applied_rules + + def test_segment_korean_function(self): + """Test Korean segmentation function.""" + text = "안녕하세요" + words = segment_korean(text) + assert isinstance(words, list) + assert 
len(words) > 0 + + def test_cjk_empty_string(self): + """Test CJK processing with empty string.""" + result = process_cjk("", "zh", "Hans") + assert result.text == "" + assert result.original == "" + + def test_cjk_word_boundaries(self): + """Test word boundary detection.""" + text = "你好" + words = ["你", "好"] + from text_processing.cjk_processor import get_word_boundaries + boundaries = get_word_boundaries(text, words) + assert isinstance(boundaries, list) + assert 0 in boundaries + + +# ============================================================================ +# Cyrillic Script Tests +# ============================================================================ + +class TestCyrillicProcessor: + """Test Cyrillic script processing.""" + + def test_unify_cyrillic_variants(self): + """Test Cyrillic variant unification (ё → е).""" + text = "ёлка" + result = unify_cyrillic_variants(text, normalize_yo=True) + assert "ё" not in result or "е" in result + + def test_preserve_yo(self): + """Test preserving ё character.""" + text = "ёлка" + result = unify_cyrillic_variants(text, normalize_yo=False, language_code="ru") + assert "ё" in result + + def test_cyrillic_processing(self, language_info_ru): + """Test full Cyrillic processing.""" + text = "Привет мир" + result = process_cyrillic(text, "ru") + + assert result.script_code == "Cyrl" + assert result.language_code == "ru" + assert result.original == text + + def test_cyrillic_normalize_yo(self, language_info_ru): + """Test Cyrillic with yo normalization.""" + text = "ёлка" + result = process_cyrillic(text, "ru", normalize_yo=True) + assert "unify_variants" in result.applied_rules + + def test_cyrillic_preserve_yo(self, language_info_ru): + """Test Cyrillic preserving yo.""" + text = "ёлка" + result = process_cyrillic(text, "ru", normalize_yo=False) + assert "preserve_yo" in result.applied_rules + + def test_cyrillic_empty_string(self): + """Test Cyrillic processing with empty string.""" + result = process_cyrillic("", 
"ru") + assert result.text == "" + assert result.original == "" + + +# ============================================================================ +# Latin Script Tests +# ============================================================================ + +class TestLatinProcessor: + """Test Latin script processing.""" + + def test_normalize_diacritics(self): + """Test diacritic normalization.""" + text = "café" + result = normalize_diacritics(text) + assert "é" not in result or "e" in result + + def test_preserve_diacritics_for_language(self): + """Test preserving diacritics for specific languages.""" + text = "café" + result = normalize_diacritics(text, preserve_for_languages=["fr"]) + assert "é" in result + + def test_handle_ligatures(self): + """Test ligature handling.""" + text = "encyclopædia" + result = handle_ligatures(text, preserve_semantic=True) + assert "æ" in result + + def test_normalize_ligatures(self): + """Test ligature normalization.""" + text = "encyclopædia" + result = handle_ligatures(text, preserve_semantic=False) + assert "ae" in result or "æ" not in result + + def test_latin_processing_english(self, language_info_en): + """Test Latin processing for English.""" + text = "Hello World" + result = process_latin(text, "en") + + assert result.script_code == "Latn" + assert result.language_code == "en" + assert result.original == text + + def test_latin_processing_french(self, language_info_fr): + """Test Latin processing for French (preserves diacritics).""" + text = "café" + result = process_latin(text, "fr", normalize_diacritics_flag=False) + assert "preserve_diacritics" in result.applied_rules + + def test_latin_empty_string(self): + """Test Latin processing with empty string.""" + result = process_latin("", "en") + assert result.text == "" + assert result.original == "" + + +# ============================================================================ +# Script Handler Tests +# 
============================================================================ + +class TestScriptHandler: + """Test main script handler.""" + + def test_handler_initialization(self): + """Test handler initialization.""" + handler = ScriptHandler() + assert handler is not None + + def test_process_arabic_text(self, language_info_fa): + """Test processing Arabic text.""" + handler = ScriptHandler() + text = "می‌خواهم" + result = handler.process_by_script(text, language_info_fa) + + assert isinstance(result, ProcessedText) + assert result.script_code == "Arab" + assert result.confidence == language_info_fa.confidence + + @pytest.mark.requires_jieba + def test_process_chinese_text(self, language_info_zh): + """Test processing Chinese text.""" + handler = ScriptHandler() + text = "你好世界" + result = handler.process_by_script(text, language_info_zh) + + assert isinstance(result, ProcessedText) + assert result.script_code == "Hans" + + def test_process_cyrillic_text(self, language_info_ru): + """Test processing Cyrillic text.""" + handler = ScriptHandler() + text = "Привет" + result = handler.process_by_script(text, language_info_ru) + + assert isinstance(result, ProcessedText) + assert result.script_code == "Cyrl" + + def test_process_latin_text(self, language_info_en): + """Test processing Latin text.""" + handler = ScriptHandler() + text = "Hello World" + result = handler.process_by_script(text, language_info_en) + + assert isinstance(result, ProcessedText) + assert result.script_code == "Latn" + + def test_process_empty_string(self, language_info_en): + """Test processing empty string.""" + handler = ScriptHandler() + result = handler.process_by_script("", language_info_en) + + assert result.text == "" + assert result.original == "" + + def test_process_unknown_script(self): + """Test processing unknown script.""" + handler = ScriptHandler() + lang_info = LanguageInfo( + language_code="xx", + script_code="Xxxx", + confidence=0.5 + ) + text = "test" + result = 
handler.process_by_script(text, lang_info) + + assert result.text == text + assert "no_processing" in result.applied_rules + + +# ============================================================================ +# Mixed-Script Tests +# ============================================================================ + +class TestMixedScript: + """Test mixed-script processing.""" + + def test_detect_script_boundaries(self): + """Test script boundary detection.""" + text = "Hello سلام" + boundaries = detect_script_boundaries(text) + + assert len(boundaries) >= 1 + assert all(isinstance(b, tuple) and len(b) == 3 for b in boundaries) + + def test_process_mixed_script(self, language_info_fa): + """Test processing mixed-script text.""" + handler = ScriptHandler() + text = "Hello سلام World" + result = handler.process_mixed_script(text, language_info_fa) + + assert isinstance(result, ProcessedText) + assert len(result.applied_rules) > 0 + + def test_mixed_arabic_latin(self, language_info_fa): + """Test Arabic-Latin mixed text.""" + handler = ScriptHandler() + text = "Hello سلام" + result = handler.process_mixed_script(text, language_info_fa) + + assert result.original == text + assert len(result.applied_rules) > 0 + + def test_mixed_cjk_latin(self, language_info_zh): + """Test CJK-Latin mixed text.""" + handler = ScriptHandler() + text = "Hello 你好" + result = handler.process_mixed_script(text, language_info_zh) + + assert result.original == text + + def test_bidirectional_text(self, language_info_fa): + """Test bidirectional text handling.""" + handler = ScriptHandler() + text = "Hello سلام World" + result = handler.process_mixed_script(text, language_info_fa) + + assert isinstance(result, ProcessedText) + + +# ============================================================================ +# Convenience Function Tests +# ============================================================================ + +class TestConvenienceFunctions: + """Test convenience functions.""" + + def 
test_process_by_script_function(self, language_info_en): + """Test process_by_script convenience function.""" + result = process_by_script("Hello", language_info_en) + assert isinstance(result, ProcessedText) + + def test_process_mixed_script_function(self, language_info_fa): + """Test process_mixed_script convenience function.""" + result = process_mixed_script("Hello سلام", language_info_fa) + assert isinstance(result, ProcessedText) + + +# ============================================================================ +# Edge Cases Tests +# ============================================================================ + +class TestEdgeCases: + """Test edge cases and error handling.""" + + def test_empty_string_all_scripts(self): + """Test empty string for all script types.""" + scripts = ["Arab", "Latn", "Cyrl", "Hans", "Jpan", "Kore"] + handler = ScriptHandler() + + for script in scripts: + lang_info = LanguageInfo( + language_code="xx", + script_code=script, + confidence=0.5 + ) + result = handler.process_by_script("", lang_info) + assert result.text == "" + assert result.original == "" + + def test_whitespace_only(self, language_info_en): + """Test whitespace-only text.""" + handler = ScriptHandler() + result = handler.process_by_script(" ", language_info_en) + assert isinstance(result, ProcessedText) + + def test_numbers_and_punctuation(self, language_info_en): + """Test text with numbers and punctuation.""" + handler = ScriptHandler() + text = "Hello 123! World." 
+ result = handler.process_by_script(text, language_info_en) + assert isinstance(result, ProcessedText) + + def test_unicode_surrogates(self, language_info_en): + """Test handling of Unicode surrogates.""" + handler = ScriptHandler() + # Astral-plane character (encoded as a surrogate pair in UTF-16) + text = "Hello \U0001F600" + result = handler.process_by_script(text, language_info_en) + assert isinstance(result, ProcessedText) + + def test_very_long_text(self, language_info_en): + """Test very long text.""" + handler = ScriptHandler() + text = "Hello " * 1000 + result = handler.process_by_script(text, language_info_en) + assert isinstance(result, ProcessedText) + assert len(result.text) > 0 + + def test_special_characters(self, language_info_en): + """Test text with special characters.""" + handler = ScriptHandler() + text = "Hello @#$%^&*() World" + result = handler.process_by_script(text, language_info_en) + assert isinstance(result, ProcessedText) + + +# ============================================================================ +# Integration Tests +# ============================================================================ + +class TestIntegration: + """Test integration scenarios.""" + + def test_full_pipeline_arabic(self, language_info_fa): + """Test full processing pipeline for Arabic.""" + handler = ScriptHandler() + text = "می‌خواهم" + result = handler.process_by_script(text, language_info_fa) + + assert result.text == text # ZWNJ preserved + assert result.confidence == language_info_fa.confidence + assert "preserve_zwnj" in result.applied_rules + + @pytest.mark.requires_jieba + def test_full_pipeline_chinese(self, language_info_zh): + """Test full processing pipeline for Chinese.""" + handler = ScriptHandler() + text = "你好世界" + result = handler.process_by_script(text, language_info_zh) + + assert result.script_code == "Hans" + assert len(result.word_boundaries) > 0 + + def test_full_pipeline_cyrillic(self, language_info_ru): + """Test full processing pipeline for Cyrillic.""" +
handler = ScriptHandler() + text = "Привет мир" + result = handler.process_by_script(text, language_info_ru) + + assert result.script_code == "Cyrl" + assert result.language_code == "ru" + + +# ============================================================================ +# Performance Tests +# ============================================================================ + +class TestPerformance: + """Test performance requirements.""" + + @pytest.mark.slow + def test_throughput_requirement(self, language_info_en): + """Test 1000+ docs/sec throughput requirement.""" + import time + handler = ScriptHandler() + texts = ["Hello World"] * 1000 + + start = time.perf_counter() + for text in texts: + handler.process_by_script(text, language_info_en) + elapsed = time.perf_counter() - start + + throughput = len(texts) / elapsed + assert throughput >= 1000, f"Throughput {throughput:.0f} docs/sec < 1000 docs/sec" + + def test_latency_requirement(self, language_info_en): + """Test <10ms latency requirement.""" + import time + handler = ScriptHandler() + text = "Hello World" + + start = time.perf_counter() + handler.process_by_script(text, language_info_en) + elapsed = (time.perf_counter() - start) * 1000 # Convert to ms + + assert elapsed < 10, f"Latency {elapsed:.2f}ms >= 10ms" + + +# ============================================================================ +# Data Structure Tests +# ============================================================================ + +class TestProcessedText: + """Test ProcessedText dataclass.""" + + def test_processed_text_creation(self): + """Test ProcessedText object creation.""" + result = ProcessedText( + text="processed", + original="original", + script_code="Latn", + language_code="en", + applied_rules=["rule1"], + word_boundaries=[0, 5], + confidence=0.95 + ) + + assert result.text == "processed" + assert result.original == "original" + assert result.script_code == "Latn" + assert result.language_code == "en" + assert "rule1" in result.applied_rules + assert
result.word_boundaries == [0, 5] + assert result.confidence == 0.95 + + def test_processed_text_defaults(self): + """Test ProcessedText with defaults.""" + result = ProcessedText( + text="test", + original="test", + script_code="Latn", + language_code="en" + ) + + assert result.applied_rules == [] + assert result.word_boundaries == [] + assert result.confidence == 0.0 diff --git a/modules/M0-foundation/01-text-processing/01.3-script-specific-processing/text_processing/__init__.py b/modules/M0-foundation/01-text-processing/01.3-script-specific-processing/text_processing/__init__.py new file mode 100644 index 0000000..650eacc --- /dev/null +++ b/modules/M0-foundation/01-text-processing/01.3-script-specific-processing/text_processing/__init__.py @@ -0,0 +1,56 @@ +""" +Script-Specific Processing - Task 01.3 + +Processes text based on detected script codes from Task 01.2. +Handles Arabic (ZWNJ preservation), CJK (word segmentation), +Cyrillic (variant unification), Latin (diacritic handling), +and mixed-script scenarios. + +Main exports: +- ScriptHandler: Main orchestrator class +- ProcessedText: Processing result dataclass +- process_by_script: Convenience function +""" + +from dataclasses import dataclass, field +from typing import List + +# Define ProcessedText first to avoid circular imports +@dataclass +class ProcessedText: + """ + Script-processed text result with metadata. 
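The `ProcessedText` dataclass in this `__init__.py` gives its list fields `field(default_factory=list)` defaults. A standalone sketch of why that matters; the `Result` class below is an illustrative stand-in, not part of the module:

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative stand-in for the ProcessedText pattern: mutable defaults must
# use field(default_factory=...) so every instance gets its own fresh list.
@dataclass
class Result:
    text: str
    applied_rules: List[str] = field(default_factory=list)

a = Result("one")
b = Result("two")
a.applied_rules.append("rule1")
print(b.applied_rules)  # [] -- b is unaffected by mutating a's list
```

A bare `applied_rules: List[str] = []` would be rejected by `@dataclass` at class-definition time, precisely to prevent the shared-list bug.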
+ + Attributes: + text: Script-processed text + original: Original text before processing + script_code: ISO 15924 script code (e.g., "Arab", "Latn", "Hans") + language_code: ISO 639-1 language code (e.g., "fa", "en", "zh") + applied_rules: List of processing rules applied + word_boundaries: Character positions of word boundaries (for CJK) + confidence: Language detection confidence (from LanguageInfo) + """ + text: str + original: str + script_code: str + language_code: str + applied_rules: List[str] = field(default_factory=list) + word_boundaries: List[int] = field(default_factory=list) + confidence: float = 0.0 + + +# Import handlers after ProcessedText is defined +from .script_handler import ( + ScriptHandler, + process_by_script, + process_mixed_script, +) + +__all__ = [ + 'ScriptHandler', + 'ProcessedText', + 'process_by_script', + 'process_mixed_script', +] + +__version__ = '0.1.0' diff --git a/modules/M0-foundation/01-text-processing/01.3-script-specific-processing/text_processing/arabic_processor.py b/modules/M0-foundation/01-text-processing/01.3-script-specific-processing/text_processing/arabic_processor.py new file mode 100644 index 0000000..aee5f74 --- /dev/null +++ b/modules/M0-foundation/01-text-processing/01.3-script-specific-processing/text_processing/arabic_processor.py @@ -0,0 +1,209 @@ +""" +Arabic Script Processor - Task 01.3 + +Handles Arabic, Persian, and Urdu text processing: +- ZWNJ (Zero-Width Non-Joiner) preservation (critical for Persian) +- Diacritic removal +- Character shape normalization +""" + +import re +import unicodedata +from typing import List + +from shared.logger import setup_logger +from typing import TYPE_CHECKING + +# Import ProcessedText - avoid circular import +if TYPE_CHECKING: + from .__init__ import ProcessedText +else: + from . 
import ProcessedText + +logger = setup_logger(__name__) + +# ZWNJ character (U+200C) - Zero-Width Non-Joiner +ZWNJ = '\u200C' + +# Arabic diacritics (tashkeel) +ARABIC_DIACRITICS = { + '\u064B', # Fathatan + '\u064C', # Dammatan + '\u064D', # Kasratan + '\u064E', # Fatha + '\u064F', # Damma + '\u0650', # Kasra + '\u0651', # Shadda + '\u0652', # Sukun + '\u0653', # Maddah + '\u0654', # Hamza Above + '\u0655', # Hamza Below + '\u0656', # Subscript Alef + '\u0657', # Inverted Damma + '\u0658', # Mark Noon Ghunna + '\u0659', # Zwarakay + '\u065A', # Vowel Sign Small V + '\u065B', # Vowel Sign Inverted Small V + '\u065C', # Vowel Sign Dot Below + '\u065D', # Reversed Damma + '\u065E', # Fatha With Two Dots + '\u0670', # Superscript Alef +} + +# Arabic script range +ARABIC_RANGE = re.compile(r'[\u0600-\u06FF\u0750-\u077F\u08A0-\u08FF\uFB50-\uFDFF\uFE70-\uFEFF]') + + +def preserve_zwnj(text: str) -> str: + """ + Preserve ZWNJ characters in text (critical for Persian grammar). + + ZWNJ (U+200C) is grammatically significant in Persian: + - "می‌خواهم" (I want) vs "میخواهم" (different meaning) + - Must never be removed or normalized + + Args: + text: Input text + + Returns: + Text with ZWNJ preserved + """ + # ZWNJ is already in the text, just ensure it's not removed + # This function serves as documentation and validation + zwnj_count = text.count(ZWNJ) + if zwnj_count > 0: + logger.debug(f"Preserving {zwnj_count} ZWNJ characters in text") + return text + + +def remove_arabic_diacritics(text: str) -> str: + """ + Remove Arabic diacritics (tashkeel) from text. 
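Since `preserve_zwnj` is essentially a documented no-op, the invariant it guards is easy to state as a self-contained check (`zwnj_count` is a hypothetical helper, not part of the module): the ZWNJ count must be the same before and after any Arabic-script processing step.

```python
# ZWNJ (U+200C) is grammatically significant in Persian: removing it merges
# a prefix into its stem and changes the meaning of the word.
ZWNJ = "\u200c"

def zwnj_count(s: str) -> int:
    # Count zero-width non-joiners so a pipeline can verify preservation.
    return s.count(ZWNJ)

persian = "می" + ZWNJ + "خواهم"   # "I want", prefix joined by ZWNJ
assert zwnj_count(persian) == 1    # present before processing

processed = persian                # any safe step must keep the ZWNJ
assert zwnj_count(processed) == zwnj_count(persian)
```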
+ + Args: + text: Input text + + Returns: + Text with diacritics removed + """ + result = [] + removed_count = 0 + + for char in text: + if char in ARABIC_DIACRITICS: + removed_count += 1 + continue + result.append(char) + + if removed_count > 0: + logger.debug(f"Removed {removed_count} Arabic diacritics") + + return ''.join(result) + + +def normalize_arabic_shapes(text: str) -> str: + """ + Normalize Arabic character shapes (isolated/initial/medial/final). + + Converts contextual forms to their isolated equivalents for consistency. + This helps with search and indexing. + + Args: + text: Input text + + Returns: + Text with normalized Arabic shapes + """ + # Arabic characters have contextual forms: + # - Isolated: standalone + # - Initial: at start of word + # - Medial: in middle of word + # - Final: at end of word + + # Use Unicode normalization to convert to isolated forms + # NFC normalization helps with some cases, but we need more specific handling + + normalized = [] + for char in text: + # Check if character is Arabic + if ARABIC_RANGE.match(char): + # Use Unicode name to identify contextual forms + try: + name = unicodedata.name(char, '') + # If it's a contextual form, try to normalize + # For now, we preserve the character as-is + # Full normalization would require Arabic shaping library + normalized.append(char) + except ValueError: + normalized.append(char) + else: + normalized.append(char) + + # For production, consider using python-arabic-reshaper or similar + # For now, we preserve original shapes + return ''.join(normalized) + + +def process_arabic( + text: str, + language_code: str, + preserve_diacritics: bool = False, + normalize_shapes: bool = True +) -> ProcessedText: + """ + Process Arabic script text (Arabic, Persian, Urdu). + + Args: + text: Input text + language_code: ISO 639-1 language code (ar, fa, ur, etc.) 
+ preserve_diacritics: If True, keep Arabic diacritics + normalize_shapes: If True, normalize character shapes + + Returns: + ProcessedText with processed text and metadata + """ + if not text: + return ProcessedText( + text="", + original="", + script_code="Arab", + language_code=language_code, + applied_rules=[], + confidence=0.0 + ) + + original_text = text + applied_rules = [] + + # Step 1: Preserve ZWNJ (CRITICAL - never remove) + text = preserve_zwnj(text) + applied_rules.append("preserve_zwnj") + + # Step 2: Remove diacritics (unless preserving) + if not preserve_diacritics: + text = remove_arabic_diacritics(text) + applied_rules.append("remove_diacritics") + else: + applied_rules.append("preserve_diacritics") + + # Step 3: Normalize character shapes + if normalize_shapes: + text = normalize_arabic_shapes(text) + applied_rules.append("normalize_shapes") + + logger.debug( + f"Processed Arabic text", + language=language_code, + original_length=len(original_text), + processed_length=len(text), + rules_applied=applied_rules + ) + + return ProcessedText( + text=text, + original=original_text, + script_code="Arab", + language_code=language_code, + applied_rules=applied_rules, + confidence=0.0 # Will be set by caller + ) diff --git a/modules/M0-foundation/01-text-processing/01.3-script-specific-processing/text_processing/cjk_processor.py b/modules/M0-foundation/01-text-processing/01.3-script-specific-processing/text_processing/cjk_processor.py new file mode 100644 index 0000000..5003909 --- /dev/null +++ b/modules/M0-foundation/01-text-processing/01.3-script-specific-processing/text_processing/cjk_processor.py @@ -0,0 +1,272 @@ +""" +CJK Script Processor - Task 01.3 + +Handles Chinese, Japanese, and Korean text processing: +- Chinese: jieba word segmentation +- Japanese: Tokenization (MeCab optional) +- Korean: Hangul syllable handling and word boundaries +""" + +import re +from typing import List, Optional + +from shared.logger import setup_logger +from typing 
import TYPE_CHECKING + +# Import ProcessedText - avoid circular import +if TYPE_CHECKING: + from .__init__ import ProcessedText +else: + from . import ProcessedText + +logger = setup_logger(__name__) + +# Lazy load jieba (heavy dependency) +_jieba = None +_mecab = None + + +def _get_jieba(): + """Lazy load jieba for Chinese segmentation.""" + global _jieba + if _jieba is None: + try: + import jieba_fast as jieba + _jieba = jieba + logger.info("Loaded jieba-fast for Chinese segmentation") + except ImportError: + try: + import jieba + _jieba = jieba + logger.warning("Using jieba (slow) instead of jieba-fast") + except ImportError: + logger.error("jieba not available - Chinese segmentation disabled") + raise ImportError("jieba or jieba-fast required for Chinese processing") + return _jieba + + +def _get_mecab(): + """Lazy load MeCab for Japanese tokenization (optional).""" + global _mecab + if _mecab is None: + try: + import MeCab + _mecab = MeCab.Tagger("-Owakati") + logger.info("Loaded MeCab for Japanese tokenization") + except ImportError: + logger.debug("MeCab not available - using regex-based Japanese tokenization") + _mecab = False # Mark as unavailable + return _mecab + + +def segment_chinese(text: str) -> List[str]: + """ + Segment Chinese text into words using jieba. + + Args: + text: Chinese text + + Returns: + List of segmented words + """ + jieba = _get_jieba() + words = jieba.cut(text, cut_all=False) + return list(words) + + +def segment_japanese(text: str) -> List[str]: + """ + Segment Japanese text into tokens. + + Uses MeCab if available, otherwise falls back to regex-based segmentation. 
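The `_get_jieba`/`_get_mecab` pattern (import on first use, cache either the module or a `False` "unavailable" marker) can be sketched generically. The candidate module names below are stand-ins for illustration, not real segmenters:

```python
import importlib

_backend = None

def get_backend(candidates=("json", "simplejson")):
    """Lazily import the first available candidate module and cache it."""
    global _backend
    if _backend is None:
        for name in candidates:
            try:
                _backend = importlib.import_module(name)
                break
            except ImportError:
                continue
        else:
            # Cache the failure too, so we don't retry the import every call.
            _backend = False
    return _backend
```

Caching `False` (rather than leaving `None`) distinguishes "not yet tried" from "tried and unavailable", which is why the module checks `mecab is not False` rather than plain truthiness alone.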
+ + Args: + text: Japanese text + + Returns: + List of tokens + """ + mecab = _get_mecab() + + if mecab and mecab is not False: + # Use MeCab for accurate tokenization + try: + tokens = mecab.parse(text).strip().split() + return tokens + except Exception as e: + logger.warning(f"MeCab tokenization failed: {e}, falling back to regex") + + # Fallback: regex-based segmentation + # Split by Hiragana, Katakana, Kanji boundaries + # This is a simple heuristic and not as accurate as MeCab + tokens = [] + current_token = [] + + for char in text: + # Check script type + if '\u3040' <= char <= '\u309F': # Hiragana + if current_token and not any('\u3040' <= c <= '\u309F' for c in current_token): + tokens.append(''.join(current_token)) + current_token = [char] + else: + current_token.append(char) + elif '\u30A0' <= char <= '\u30FF': # Katakana + if current_token and not any('\u30A0' <= c <= '\u30FF' for c in current_token): + tokens.append(''.join(current_token)) + current_token = [char] + else: + current_token.append(char) + elif '\u4E00' <= char <= '\u9FAF': # CJK Unified Ideographs (Kanji) + if current_token and not any('\u4E00' <= c <= '\u9FAF' for c in current_token): + tokens.append(''.join(current_token)) + current_token = [char] + else: + current_token.append(char) + else: + # Punctuation, spaces, etc. + if current_token: + tokens.append(''.join(current_token)) + current_token = [] + if not char.isspace(): + tokens.append(char) + + if current_token: + tokens.append(''.join(current_token)) + + return [t for t in tokens if t.strip()] + + +def segment_korean(text: str) -> List[str]: + """ + Segment Korean text into words. + + Korean uses spaces for word boundaries, but we also handle + Hangul syllables and compound words. 
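The regex fallback above boils down to one idea: classify each character by Unicode block, then break tokens where the class changes. A simplified sketch of that core (coarser than the module's version, which also special-cases punctuation and whitespace):

```python
def char_class(ch: str) -> str:
    # Classify a character by Unicode block, mirroring the ranges used above.
    if "\u3040" <= ch <= "\u309F":
        return "hiragana"
    if "\u30A0" <= ch <= "\u30FF":
        return "katakana"
    if "\u4E00" <= ch <= "\u9FAF":
        return "kanji"
    return "other"

def split_by_class(text: str):
    # Start a new token whenever the script class changes.
    tokens, prev = [], None
    for ch in text:
        cls = char_class(ch)
        if cls != prev or not tokens:
            tokens.append(ch)
        else:
            tokens[-1] += ch
        prev = cls
    return tokens

print(split_by_class("こんにちは世界"))  # ['こんにちは', '世界']
```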
+ + Args: + text: Korean text + + Returns: + List of words/syllables + """ + # Korean typically uses spaces for word boundaries + # Split by spaces first + words = text.split() + + # Further segment compound words if needed + # For now, we keep space-separated words + # More sophisticated segmentation could use KoNLPy or similar + + return words + + +def get_word_boundaries(text: str, words: List[str]) -> List[int]: + """ + Get character positions of word boundaries from segmented words. + + Args: + text: Original text + words: List of segmented words + + Returns: + List of character positions marking word boundaries + """ + boundaries = [0] + current_pos = 0 + + for word in words: + # Find word in text starting from current position + pos = text.find(word, current_pos) + if pos != -1: + boundaries.append(pos + len(word)) + current_pos = pos + len(word) + else: + # Word not found, advance by word length + current_pos += len(word) + boundaries.append(current_pos) + + # Remove duplicates and sort + boundaries = sorted(set(boundaries)) + + return boundaries + + +def process_cjk( + text: str, + language_code: str, + script_code: str +) -> ProcessedText: + """ + Process CJK script text (Chinese, Japanese, Korean). 
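`get_word_boundaries` above is easiest to follow with a tiny worked example; the function below is a condensed re-implementation for illustration only:

```python
def word_boundaries(text, words):
    # Record the end offset of each segmented word within the original text.
    bounds, pos = [0], 0
    for w in words:
        i = text.find(w, pos)
        pos = (i if i != -1 else pos) + len(w)
        bounds.append(pos)
    return sorted(set(bounds))

# "你好世界" segmented as ["你好", "世界"] yields offsets 0, 2, 4.
print(word_boundaries("你好世界", ["你好", "世界"]))  # [0, 2, 4]
```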
+ + Args: + text: Input text + language_code: ISO 639-1 language code (zh, ja, ko) + script_code: ISO 15924 script code (Hans, Hant, Jpan, Kore) + + Returns: + ProcessedText with segmented text and word boundaries + """ + if not text: + return ProcessedText( + text="", + original="", + script_code=script_code, + language_code=language_code, + applied_rules=[], + word_boundaries=[], + confidence=0.0 + ) + + original_text = text + applied_rules = [] + words = [] + word_boundaries = [] + + # Process based on language + if language_code == "zh" or script_code in ("Hans", "Hant"): + # Chinese segmentation + words = segment_chinese(text) + applied_rules.append("chinese_segmentation") + word_boundaries = get_word_boundaries(text, words) + + elif language_code == "ja" or script_code == "Jpan": + # Japanese tokenization + words = segment_japanese(text) + applied_rules.append("japanese_tokenization") + word_boundaries = get_word_boundaries(text, words) + + elif language_code == "ko" or script_code == "Kore": + # Korean word segmentation + words = segment_korean(text) + applied_rules.append("korean_segmentation") + word_boundaries = get_word_boundaries(text, words) + + else: + # Unknown CJK language, try Chinese as default + logger.warning(f"Unknown CJK language {language_code}, using Chinese segmentation") + words = segment_chinese(text) + applied_rules.append("chinese_segmentation_fallback") + word_boundaries = get_word_boundaries(text, words) + + # Join words with spaces for processed text + processed_text = ' '.join(words) + + logger.debug( + f"Processed CJK text", + language=language_code, + script=script_code, + original_length=len(original_text), + word_count=len(words), + rules_applied=applied_rules + ) + + return ProcessedText( + text=processed_text, + original=original_text, + script_code=script_code, + language_code=language_code, + applied_rules=applied_rules, + word_boundaries=word_boundaries, + confidence=0.0 # Will be set by caller + ) diff --git 
a/modules/M0-foundation/01-text-processing/01.3-script-specific-processing/text_processing/cyrillic_processor.py b/modules/M0-foundation/01-text-processing/01.3-script-specific-processing/text_processing/cyrillic_processor.py new file mode 100644 index 0000000..e88f306 --- /dev/null +++ b/modules/M0-foundation/01-text-processing/01.3-script-specific-processing/text_processing/cyrillic_processor.py @@ -0,0 +1,141 @@ +""" +Cyrillic Script Processor - Task 01.3 + +Handles Cyrillic script text processing: +- Variant unification (ё → е, configurable) +- Case folding +- Language-specific character preservation +""" + +from typing import Set + +from shared.logger import setup_logger +from typing import TYPE_CHECKING + +# Import ProcessedText - avoid circular import +if TYPE_CHECKING: + from .__init__ import ProcessedText +else: + from . import ProcessedText + +logger = setup_logger(__name__) + +# Cyrillic ё (U+0451) and е (U+0435) +CYRILLIC_YO = '\u0451' # ё +CYRILLIC_E = '\u0435' # е + +# Languages that should preserve ё +PRESERVE_YO_LANGUAGES: Set[str] = { + 'ru', # Russian (though normalization is common) + # Add other languages if needed +} + +# Ukrainian/Belarusian specific characters that should be preserved +UKRAINIAN_SPECIFIC = { + '\u0456', # і (Ukrainian/Belarusian i) + '\u0457', # ї (Ukrainian yi) + '\u0491', # ґ (Ukrainian ghe) +} + +BELARUSIAN_SPECIFIC = { + '\u0456', # і (also Belarusian) +} + + +def unify_cyrillic_variants( + text: str, + normalize_yo: bool = True, + language_code: str = "" +) -> str: + """ + Unify Cyrillic character variants. + + Main operation: ё → е normalization (configurable). + Preserves language-specific characters for Ukrainian/Belarusian. + + Args: + text: Input text + normalize_yo: If True, normalize ё to е + language_code: ISO 639-1 language code (ru, uk, be, etc.) 
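The ё → е unification in `unify_cyrillic_variants` can equivalently be done with a translation table in one pass. Note the sketch below also folds uppercase Ё → Е, which is an extension for illustration, not the module's per-character behavior:

```python
# Map ё (U+0451) to е (U+0435); also fold Ё (U+0401) to Е (U+0415).
YO_TABLE = str.maketrans({"\u0451": "\u0435", "\u0401": "\u0415"})

def fold_yo(s: str) -> str:
    # str.translate applies the whole mapping in a single pass over s.
    return s.translate(YO_TABLE)

print(fold_yo("ёлка"))  # "елка"
```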
+
+    Returns:
+        Text with variants unified
+    """
+    if not text:
+        return text
+
+    result = []
+    normalized_count = 0
+
+    # Check if we should preserve ё for this language
+    should_preserve_yo = language_code in PRESERVE_YO_LANGUAGES and not normalize_yo
+
+    for char in text:
+        # Normalize ё to е (unless preserving)
+        if char == CYRILLIC_YO and not should_preserve_yo:
+            result.append(CYRILLIC_E)
+            normalized_count += 1
+        else:
+            result.append(char)
+
+    if normalized_count > 0:
+        logger.debug(f"Normalized {normalized_count} ё → е")
+
+    return ''.join(result)
+
+
+def process_cyrillic(
+    text: str,
+    language_code: str,
+    normalize_yo: bool = True
+) -> ProcessedText:
+    """
+    Process Cyrillic script text.
+
+    Args:
+        text: Input text
+        language_code: ISO 639-1 language code (ru, uk, be, bg, etc.)
+        normalize_yo: If True, normalize ё to е
+
+    Returns:
+        ProcessedText with processed text and metadata
+    """
+    if not text:
+        return ProcessedText(
+            text="",
+            original="",
+            script_code="Cyrl",
+            language_code=language_code,
+            applied_rules=[],
+            confidence=0.0
+        )
+
+    original_text = text
+    applied_rules = []
+
+    # Step 1: Unify variants (ё → е)
+    if normalize_yo:
+        text = unify_cyrillic_variants(text, normalize_yo=True, language_code=language_code)
+        applied_rules.append("unify_variants")
+    else:
+        applied_rules.append("preserve_yo")
+
+    # Step 2: Case folding (Unicode case folding handles Cyrillic properly)
+    # We don't need special handling here - standard case folding works
+
+    logger.debug(
+        "Processed Cyrillic text",
+        language=language_code,
+        original_length=len(original_text),
+        processed_length=len(text),
+        rules_applied=applied_rules
+    )
+
+    return ProcessedText(
+        text=text,
+        original=original_text,
+        script_code="Cyrl",
+        language_code=language_code,
+        applied_rules=applied_rules,
+        confidence=0.0  # Will be set by caller
+    )
diff --git a/modules/M0-foundation/01-text-processing/01.3-script-specific-processing/text_processing/latin_processor.py b/modules/M0-foundation/01-text-processing/01.3-script-specific-processing/text_processing/latin_processor.py
new file mode 100644
index 0000000..33012a5
--- /dev/null
+++ b/modules/M0-foundation/01-text-processing/01.3-script-specific-processing/text_processing/latin_processor.py
@@ -0,0 +1,220 @@
+"""
+Latin Script Processor - Task 01.3
+
+Handles Latin script text processing:
+- Diacritic normalization (é → e, configurable)
+- Ligature handling (æ, œ semantic preservation)
+- Case folding
+"""
+
+import unicodedata
+from typing import List, Optional, Set, TYPE_CHECKING
+
+from shared.logger import setup_logger
+
+# Import ProcessedText - avoid circular import
+if TYPE_CHECKING:
+    from .__init__ import ProcessedText
+else:
+    from . import ProcessedText
+
+logger = setup_logger(__name__)
+
+# Languages that should preserve diacritics
+PRESERVE_DIACRITICS_LANGUAGES: Set[str] = {
+    'fr',  # French (é, è, ê, etc.)
+    'es',  # Spanish (ñ, á, é, etc.)
+    'pt',  # Portuguese (ã, ç, etc.)
+    'de',  # German (ä, ö, ü, ß)
+    'it',  # Italian (à, è, ì, ò, ù)
+    'pl',  # Polish (ą, ć, ę, ł, ń, ó, ś, ź, ż)
+    'cs',  # Czech (á, č, ď, é, ě, í, ň, ó, ř, š, ť, ú, ů, ý, ž)
+    'sk',  # Slovak (similar to Czech)
+    'hu',  # Hungarian (á, é, í, ó, ö, ő, ú, ü, ű)
+    'ro',  # Romanian (ă, â, î, ș, ț)
+    'tr',  # Turkish (ç, ğ, ı, ö, ş, ü)
+    'vi',  # Vietnamese (extensive diacritics)
+    'is',  # Icelandic (á, é, í, ó, ú, ý, þ, æ, ö)
+    'da',  # Danish (æ, ø, å)
+    'no',  # Norwegian (æ, ø, å)
+    'sv',  # Swedish (ä, ö, å)
+    'fi',  # Finnish (ä, ö)
+    'et',  # Estonian (ä, ö, õ, ü)
+    'lv',  # Latvian (ā, č, ē, ģ, ī, ķ, ļ, ņ, ō, ŗ, š, ū, ž)
+    'lt',  # Lithuanian (ą, č, ę, ė, į, š, ų, ū, ž)
+}
+
+# Semantic ligatures that should be preserved
+SEMANTIC_LIGATURES = {
+    'æ': 'ae',  # Latin ligature (e.g., "encyclopædia")
+    'œ': 'oe',  # Latin ligature (e.g., "cœur" in French)
+    'Æ': 'AE',
+    'Œ': 'OE',
+}
+
+# Non-semantic presentation-form ligatures (single codepoints, can be normalized)
+NON_SEMANTIC_LIGATURES = {
+    'ﬁ': 'fi',
+    'ﬂ': 'fl',
+    'ﬀ': 'ff',
+    'ﬃ': 'ffi',
+    'ﬄ': 'ffl',
+    'ﬅ': 'st',  # long s + t
+}
+
+
+def normalize_diacritics(
+    text: str,
+    preserve_for_languages: Optional[List[str]] = None
+) -> str:
+    """
+    Normalize diacritics in Latin text (é → e, ñ → n, etc.).
+
+    Args:
+        text: Input text
+        preserve_for_languages: If non-empty, diacritics are preserved; the
+            caller decides whether the text's language requires preservation
+
+    Returns:
+        Text with diacritics normalized (or preserved based on language)
+    """
+    if not text:
+        return text
+
+    # If the caller asked to preserve diacritics for any language,
+    # return the text unchanged
+    if preserve_for_languages:
+        return text
+
+    # Normalize diacritics using Unicode NFD + remove combining marks
+    normalized = []
+    removed_count = 0
+
+    for char in text:
+        # Decompose character (é → e + combining acute)
+        decomposed = unicodedata.normalize('NFD', char)
+
+        # Keep base character, remove combining marks
+        base_chars = []
+        for c in decomposed:
+            category = unicodedata.category(c)
+            if category != 'Mn':  # Mn = Nonspacing Mark (diacritics)
+                base_chars.append(c)
+
+        if len(base_chars) < len(decomposed):
+            removed_count += 1
+
+        normalized.extend(base_chars)
+
+    if removed_count > 0:
+        logger.debug(f"Normalized {removed_count} diacritics")
+
+    return ''.join(normalized)
+
+
+def handle_ligatures(text: str, preserve_semantic: bool = True) -> str:
+    """
+    Handle ligatures in Latin text.
+
+    Args:
+        text: Input text
+        preserve_semantic: If True, preserve semantic ligatures (æ, œ)
+
+    Returns:
+        Text with ligatures handled
+    """
+    if not text:
+        return text
+
+    result = []
+    normalized_count = 0
+
+    for char in text:
+        if char in SEMANTIC_LIGATURES:
+            if preserve_semantic:
+                # Preserve semantic ligatures
+                result.append(char)
+            else:
+                # Normalize semantic ligatures
+                result.append(SEMANTIC_LIGATURES[char])
+                normalized_count += 1
+        elif char in NON_SEMANTIC_LIGATURES:
+            # Always normalize non-semantic ligatures
+            result.append(NON_SEMANTIC_LIGATURES[char])
+            normalized_count += 1
+        else:
+            result.append(char)
+
+    if normalized_count > 0:
+        logger.debug(f"Normalized {normalized_count} ligatures")
+
+    return ''.join(result)
+
+
+def process_latin(
+    text: str,
+    language_code: str,
+    normalize_diacritics_flag: bool = False,
+    preserve_semantic_ligatures: bool = True
+) -> ProcessedText:
+    """
+    Process Latin script text.
+
+    Args:
+        text: Input text
+        language_code: ISO 639-1 language code (en, fr, es, etc.)
+        normalize_diacritics_flag: If True, normalize diacritics (é → e)
+        preserve_semantic_ligatures: If True, preserve æ, œ ligatures
+
+    Returns:
+        ProcessedText with processed text and metadata
+    """
+    if not text:
+        return ProcessedText(
+            text="",
+            original="",
+            script_code="Latn",
+            language_code=language_code,
+            applied_rules=[],
+            confidence=0.0
+        )
+
+    original_text = text
+    applied_rules = []
+
+    # Step 1: Handle ligatures
+    text = handle_ligatures(text, preserve_semantic=preserve_semantic_ligatures)
+    applied_rules.append("handle_ligatures")
+
+    # Step 2: Normalize diacritics (if requested and language doesn't require preservation)
+    should_preserve = language_code in PRESERVE_DIACRITICS_LANGUAGES
+
+    if normalize_diacritics_flag and not should_preserve:
+        text = normalize_diacritics(text)
+        applied_rules.append("normalize_diacritics")
+    else:
+        applied_rules.append("preserve_diacritics")
+
+    # Step 3: Case folding (Unicode handles Latin properly)
+    # No special handling needed
+
+    logger.debug(
+        "Processed Latin text",
+        language=language_code,
+        original_length=len(original_text),
+        processed_length=len(text),
+        rules_applied=applied_rules
+    )
+
+    return ProcessedText(
+        text=text,
+        original=original_text,
+        script_code="Latn",
+        language_code=language_code,
+        applied_rules=applied_rules,
+        confidence=0.0  # Will be set by caller
+    )
diff --git a/modules/M0-foundation/01-text-processing/01.3-script-specific-processing/text_processing/script_handler.py b/modules/M0-foundation/01-text-processing/01.3-script-specific-processing/text_processing/script_handler.py
new file mode 100644
index 0000000..bf3925f
--- /dev/null
+++ b/modules/M0-foundation/01-text-processing/01.3-script-specific-processing/text_processing/script_handler.py
@@ -0,0 +1,358 @@
+"""
+Script Handler - Task 01.3
+
+Main orchestrator for script-specific processing.
+Routes text to appropriate processor based on script code.
+Handles mixed-script text and bidirectional text.
+"""
+
+import re
+import sys
+from pathlib import Path
+from typing import List, Tuple, Optional, TYPE_CHECKING
+
+from shared.logger import setup_logger
+
+from .arabic_processor import process_arabic
+from .cjk_processor import process_cjk
+from .cyrillic_processor import process_cyrillic
+from .latin_processor import process_latin
+
+# Import ProcessedText - avoid circular import by importing from __init__ after it's defined
+if TYPE_CHECKING:
+    from .__init__ import ProcessedText
+else:
+    # Runtime import - ProcessedText is defined in __init__.py before this import
+    from . import ProcessedText
+
+logger = setup_logger(__name__)
+
+# Try to import LanguageInfo from Task 01.2
+try:
+    task_01_2_path = Path(__file__).parent.parent.parent.parent / "01.2-language-detection"
+    if task_01_2_path.exists():
+        sys.path.insert(0, str(task_01_2_path))
+        from text_processing import LanguageInfo
+    else:
+        # Fallback: define minimal LanguageInfo if Task 01.2 not available
+        from dataclasses import dataclass
+        from typing import List, Tuple
+
+        @dataclass
+        class LanguageInfo:
+            language_code: str
+            script_code: str
+            confidence: float
+            is_mixed_content: bool = False
+            detected_languages: List[Tuple[str, float]] = None
+
+            def __post_init__(self):
+                if self.detected_languages is None:
+                    self.detected_languages = []
+except ImportError:
+    # Fallback: define minimal LanguageInfo
+    from dataclasses import dataclass
+    from typing import List, Tuple
+
+    @dataclass
+    class LanguageInfo:
+        language_code: str
+        script_code: str
+        confidence: float
+        is_mixed_content: bool = False
+        detected_languages: List[Tuple[str, float]] = None
+
+        def __post_init__(self):
+            if self.detected_languages is None:
+                self.detected_languages = []
+
+
+# Script detection regex patterns
+SCRIPT_PATTERNS = {
+    'Arab': re.compile(r'[\u0600-\u06FF\u0750-\u077F\u08A0-\u08FF\uFB50-\uFDFF\uFE70-\uFEFF]'),
+    'Latn': re.compile(r'[a-zA-Z]'),
+    'Cyrl': re.compile(r'[\u0400-\u04FF]'),
+    'Hans': re.compile(r'[\u4E00-\u9FFF]'),  # CJK Unified Ideographs
+    'Hant': re.compile(r'[\u4E00-\u9FFF]'),  # Same range - Hans/Hant cannot be distinguished by codepoint alone
+    'Jpan': re.compile(r'[\u3040-\u309F\u30A0-\u30FF\u4E00-\u9FFF]'),  # Hiragana, Katakana, Kanji
+    'Kore': re.compile(r'[\uAC00-\uD7AF\u1100-\u11FF\u3130-\u318F]'),  # Hangul
+}
+
+
+def detect_script_boundaries(text: str) -> List[Tuple[int, int, str]]:
+    """
+    Detect script boundaries in mixed-script text.
+
+    Args:
+        text: Input text
+
+    Returns:
+        List of (start, end, script_code) tuples
+    """
+    if not text:
+        return []
+
+    boundaries = []
+    current_script = None
+    start_pos = 0
+
+    i = 0
+    while i < len(text):
+        char = text[i]
+
+        # Skip whitespace and punctuation
+        if char.isspace() or not char.isalnum():
+            i += 1
+            continue
+
+        # Detect script for this character
+        detected_script = None
+        for script_code, pattern in SCRIPT_PATTERNS.items():
+            if pattern.match(char):
+                detected_script = script_code
+                break
+
+        # If no script detected, use "Zyyy" (Common)
+        if detected_script is None:
+            detected_script = "Zyyy"
+
+        # If script changed, save previous segment
+        if current_script is not None and detected_script != current_script:
+            if i > start_pos:
+                boundaries.append((start_pos, i, current_script))
+            start_pos = i
+            current_script = detected_script
+        elif current_script is None:
+            current_script = detected_script
+            start_pos = i
+
+        i += 1
+
+    # Add final segment
+    if current_script is not None and start_pos < len(text):
+        boundaries.append((start_pos, len(text), current_script))
+
+    return boundaries
+
+
+def handle_bidirectional_text(text: str) -> str:
+    """
+    Handle bidirectional text (RTL + LTR mixing).
+
+    Uses Unicode BiDi algorithm implicitly through proper text handling.
+    For now, we preserve the logical order.
+
+    Args:
+        text: Input text
+
+    Returns:
+        Text with bidirectional handling applied
+    """
+    # Unicode bidirectional algorithm is handled by the rendering system
+    # We preserve logical order here
+    # For more sophisticated handling, consider using python-bidi library
+    return text
+
+
+class ScriptHandler:
+    """
+    Main handler for script-specific text processing.
+    """
+
+    def __init__(self):
+        """Initialize script handler."""
+        logger.info("Initialized ScriptHandler")
+
+    def process_by_script(
+        self,
+        text: str,
+        language_info: LanguageInfo,
+        **kwargs
+    ) -> ProcessedText:
+        """
+        Process text based on script code from LanguageInfo.
+
+        Args:
+            text: Input text
+            language_info: LanguageInfo from Task 01.2
+            **kwargs: Additional processor-specific options
+
+        Returns:
+            ProcessedText with processed text and metadata
+        """
+        if not text:
+            return ProcessedText(
+                text="",
+                original="",
+                script_code=language_info.script_code,
+                language_code=language_info.language_code,
+                applied_rules=[],
+                confidence=language_info.confidence
+            )
+
+        script_code = language_info.script_code
+        language_code = language_info.language_code
+
+        # Route to appropriate processor
+        if script_code == "Arab":
+            result = process_arabic(
+                text,
+                language_code,
+                preserve_diacritics=kwargs.get('preserve_diacritics', False),
+                normalize_shapes=kwargs.get('normalize_shapes', True)
+            )
+        elif script_code in ("Hans", "Hant", "Jpan", "Kore"):
+            result = process_cjk(text, language_code, script_code)
+        elif script_code == "Cyrl":
+            result = process_cyrillic(
+                text,
+                language_code,
+                normalize_yo=kwargs.get('normalize_yo', True)
+            )
+        elif script_code == "Latn":
+            result = process_latin(
+                text,
+                language_code,
+                normalize_diacritics_flag=kwargs.get('normalize_diacritics', False),
+                preserve_semantic_ligatures=kwargs.get('preserve_semantic_ligatures', True)
+            )
+        else:
+            # Unknown script - return as-is
+            logger.warning(f"Unknown script code: {script_code}, returning text as-is")
+            result = ProcessedText(
+                text=text,
+                original=text,
+                script_code=script_code,
+                language_code=language_code,
+                applied_rules=["no_processing"],
+                confidence=language_info.confidence
+            )
+
+        # Set confidence from language_info
+        result.confidence = language_info.confidence
+
+        return result
+
+    def process_mixed_script(
+        self,
+        text: str,
+        language_info: LanguageInfo,
+        **kwargs
+    ) -> ProcessedText:
+        """
+        Process mixed-script text by detecting boundaries and processing each segment.
+
+        Args:
+            text: Input text (may contain multiple scripts)
+            language_info: LanguageInfo from Task 01.2
+            **kwargs: Additional processor-specific options
+
+        Returns:
+            ProcessedText with processed text and metadata
+        """
+        if not text:
+            return ProcessedText(
+                text="",
+                original="",
+                script_code=language_info.script_code,
+                language_code=language_info.language_code,
+                applied_rules=[],
+                confidence=language_info.confidence
+            )
+
+        # Detect script boundaries
+        boundaries = detect_script_boundaries(text)
+
+        if len(boundaries) <= 1:
+            # Single script, use regular processing
+            return self.process_by_script(text, language_info, **kwargs)
+
+        # Process each segment
+        processed_segments = []
+        all_applied_rules = set()
+        all_word_boundaries = []
+
+        for start, end, script_code in boundaries:
+            segment = text[start:end]
+
+            # Create LanguageInfo for this segment
+            segment_lang_info = LanguageInfo(
+                language_code=language_info.language_code,  # Use primary language
+                script_code=script_code,
+                confidence=language_info.confidence
+            )
+
+            # Process segment
+            segment_result = self.process_by_script(segment, segment_lang_info, **kwargs)
+
+            # Adjust word boundaries for segment position
+            adjusted_boundaries = [b + start for b in segment_result.word_boundaries]
+            all_word_boundaries.extend(adjusted_boundaries)
+
+            processed_segments.append(segment_result.text)
+            all_applied_rules.update(segment_result.applied_rules)
+
+        # Combine processed segments
+        processed_text = ''.join(processed_segments)
+
+        # Handle bidirectional text
+        processed_text = handle_bidirectional_text(processed_text)
+
+        logger.debug(
+            "Processed mixed-script text",
+            segments=len(boundaries),
+            scripts=[s for _, _, s in boundaries],
+            rules_applied=list(all_applied_rules)
+        )
+
+        return ProcessedText(
+            text=processed_text,
+            original=text,
+            script_code=language_info.script_code,  # Primary script
+            language_code=language_info.language_code,
+            applied_rules=list(all_applied_rules),
+            word_boundaries=sorted(set(all_word_boundaries)),
+            confidence=language_info.confidence
+        )
+
+
+# Convenience functions
+def process_by_script(
+    text: str,
+    language_info: LanguageInfo,
+    **kwargs
+) -> ProcessedText:
+    """
+    Convenience function to process text by script.
+
+    Args:
+        text: Input text
+        language_info: LanguageInfo from Task 01.2
+        **kwargs: Additional processor-specific options
+
+    Returns:
+        ProcessedText with processed text and metadata
+    """
+    handler = ScriptHandler()
+    return handler.process_by_script(text, language_info, **kwargs)
+
+
+def process_mixed_script(
+    text: str,
+    language_info: LanguageInfo,
+    **kwargs
+) -> ProcessedText:
+    """
+    Convenience function to process mixed-script text.
+
+    Args:
+        text: Input text (may contain multiple scripts)
+        language_info: LanguageInfo from Task 01.2
+        **kwargs: Additional processor-specific options
+
+    Returns:
+        ProcessedText with processed text and metadata
+    """
+    handler = ScriptHandler()
+    return handler.process_mixed_script(text, language_info, **kwargs)
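The core of `detect_script_boundaries` above — walk the text, classify each alphanumeric character by script range, and emit a segment whenever the script changes — can be exercised standalone. This is a simplified sketch for illustration only: `PATTERNS` here is a reduced assumption-laden subset of the diff's `SCRIPT_PATTERNS`, and `detect_boundaries` is a hypothetical name, not the project's function.

```python
import re

# Illustrative subset of script ranges (assumption: not the project's full table)
PATTERNS = {
    'Latn': re.compile(r'[a-zA-Z]'),
    'Cyrl': re.compile(r'[\u0400-\u04FF]'),
    'Arab': re.compile(r'[\u0600-\u06FF]'),
}

def detect_boundaries(text):
    """Return (start, end, script) segments, mirroring the diff's approach."""
    boundaries = []
    current, start = None, 0
    for i, ch in enumerate(text):
        if not ch.isalnum():
            continue  # whitespace/punctuation attaches to the current segment
        # First matching script wins; unmatched alphanumerics fall back to "Zyyy"
        script = next((s for s, p in PATTERNS.items() if p.match(ch)), 'Zyyy')
        if current is None:
            current, start = script, i
        elif script != current:
            boundaries.append((start, i, current))
            current, start = script, i
    if current is not None:
        boundaries.append((start, len(text), current))
    return boundaries

segments = detect_boundaries("hello мир")
# → [(0, 6, 'Latn'), (6, 9, 'Cyrl')]
```

Note the same property as the real implementation: the whitespace between "hello" and "мир" lands inside the Latin segment, since a segment only closes when a character of a different script appears.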