Merged

26 commits
8f29d9a
Adds symbol parsers, reference index, and fuzzy matching.
aaronstevenwhite Sep 27, 2025
0d0cd75
Merge remote-tracking branch 'origin/main' into feature/symbol-parsin…
aaronstevenwhite Sep 28, 2025
caf1e79
Adds filters.
aaronstevenwhite Sep 28, 2025
5d22c94
Adds fuzzy search options to CLI and new cross-reference resolution i…
aaronstevenwhite Sep 28, 2025
98f1c72
Adds symbol parsers and upgraded search to use them.
aaronstevenwhite Sep 29, 2025
dec55c3
Adds docker spec.
aaronstevenwhite Sep 29, 2025
078050e
Bumps version and updates documentation.
aaronstevenwhite Sep 29, 2025
3b19df0
Adds API documentation and module docstrings.
aaronstevenwhite Sep 29, 2025
42e4282
Fixes range validation issue.
aaronstevenwhite Sep 29, 2025
ab9762d
Ensures GLAZING_DATA_DIR is used as the default if defined.
aaronstevenwhite Sep 29, 2025
d89af3c
Fixes range validation issue.
aaronstevenwhite Sep 29, 2025
e59d0fe
Fixes JSON serialization issue.
aaronstevenwhite Sep 29, 2025
5bccc5d
Downloads and converts data on docker image build.
aaronstevenwhite Sep 29, 2025
49e40a2
Fixes list formatting.
aaronstevenwhite Sep 29, 2025
2a002ed
Fixes incorrect python API documentation.
aaronstevenwhite Sep 29, 2025
ad1e485
Adds syntax-based search utilities.
aaronstevenwhite Sep 29, 2025
15fa6d8
Makes syntax-based search utilities more abstract and flexible.
aaronstevenwhite Sep 29, 2025
cbd5e6d
Refactors dataset-specific search tests.
aaronstevenwhite Sep 30, 2025
a620541
Adds syntax-based search documentation.
aaronstevenwhite Sep 30, 2025
8b7e883
Edits documentation/docstring wording.
aaronstevenwhite Sep 30, 2025
e207222
Fixes cross-referencing and converts all dataset references to lower-…
aaronstevenwhite Sep 30, 2025
f08c2e2
Adds JSON schemas and full examples.
aaronstevenwhite Sep 30, 2025
7159219
Adds fuzzy and syntax search documentation and makes thematic role an…
aaronstevenwhite Sep 30, 2025
42514b3
Makes download commands case-insensitive.
aaronstevenwhite Sep 30, 2025
087a204
Makes full CLI case-insensitive for dataset names.
aaronstevenwhite Sep 30, 2025
353a060
Adds docker information to README.
aaronstevenwhite Sep 30, 2025
98 changes: 97 additions & 1 deletion CHANGELOG.md
@@ -7,6 +7,102 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

## [0.2.0] - 2025-09-30

### Added

#### Symbol Parsing System
- **Symbol parsers** for all four linguistic resources (FrameNet, PropBank, VerbNet, WordNet)
- **Structured symbol extraction** for parsing and normalizing entity identifiers
- **Type-safe parsed symbol representations** using TypedDict patterns
- **Symbol parser documentation** - Complete API documentation for all symbol parser modules
- **Symbol parser caching** - LRU cache decorators on all parsing functions for better performance
- Support for parsing complex symbols like ARG1-PPT, ?Theme_i, and Core[Agent] (illustrated in the sketch below)
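
As a rough illustration of the kind of normalization these parsers perform (not glazing's actual `symbol_parser` implementation), a PropBank-style label such as `ARG1-PPT` can be split into an argument number and a function tag:

```python
import re

# Illustrative only: a minimal PropBank-style argument label parser,
# not the actual glazing.propbank.symbol_parser implementation.
ARG_PATTERN = re.compile(r"^ARG(?P<number>\d|M|A)(?:-(?P<function_tag>[A-Z]+))?$")


def parse_arg_label(label: str) -> dict[str, str | None]:
    """Split a label like 'ARG1-PPT' into its number and function-tag parts."""
    match = ARG_PATTERN.match(label.upper())
    if match is None:
        raise ValueError(f"Not a PropBank-style argument label: {label!r}")
    return {"arg_number": match.group("number"), "function_tag": match.group("function_tag")}


print(parse_arg_label("ARG1-PPT"))  # {'arg_number': '1', 'function_tag': 'PPT'}
print(parse_arg_label("ARGM-LOC"))  # {'arg_number': 'M', 'function_tag': 'LOC'}
```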

#### Fuzzy Search and Matching
- **Fuzzy search capability** with Levenshtein distance-based matching (see the sketch after this list)
- **Configurable similarity thresholds** for controlling match precision
- **Multi-field fuzzy matching** across names, descriptions, and identifiers
- **Search result ranking** - New ranking module for scoring search results by match type and field relevance
- **Batch search methods** - `batch_by_lemma` method in UnifiedSearch for processing multiple queries
- `--fuzzy` flag in CLI commands with `--threshold` parameter
- `search_with_fuzzy()` method in UnifiedSearch and dataset-specific search classes
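
A minimal sketch of how a similarity threshold gates candidate matches, using the standard library's `difflib` for scoring rather than glazing's Levenshtein-based implementation (the exact scores will differ):

```python
from difflib import SequenceMatcher


def fuzzy_candidates(
    query: str, names: list[str], threshold: float = 0.8
) -> list[tuple[str, float]]:
    """Return (name, score) pairs whose similarity to the query clears the threshold."""
    scored = [(name, SequenceMatcher(None, query.lower(), name.lower()).ratio()) for name in names]
    return sorted(
        [(name, score) for name, score in scored if score >= threshold],
        key=lambda pair: pair[1],
        reverse=True,
    )


# The typo "instrment" still surfaces "instrument" at a 0.8 threshold.
print(fuzzy_candidates("instrment", ["instrument", "agent", "theme"]))
```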

#### Syntax-Based Search
- **Unified syntax patterns** for searching by syntactic structure
- **Hierarchical pattern matching** where general patterns match specific subtypes (see the sketch after this list)
- **Syntax parser** for converting string patterns to unified format
- **Support for wildcards** and optional elements in patterns
- New CLI command: `glazing search syntax`
- `search_by_syntax()` method in UnifiedSearch class
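
A minimal sketch of the hierarchical idea, in which a general element such as `PP` accepts dotted subtypes such as `PP.instrument`; this is illustrative only, not the unified pattern format glazing actually uses:

```python
def element_matches(pattern_elem: str, candidate_elem: str) -> bool:
    """A general element ('PP') matches itself or any dotted subtype ('PP.instrument')."""
    return candidate_elem == pattern_elem or candidate_elem.startswith(pattern_elem + ".")


def pattern_matches(pattern: str, candidate: str) -> bool:
    """Match space-separated syntactic patterns position by position."""
    pattern_elems, candidate_elems = pattern.split(), candidate.split()
    if len(pattern_elems) != len(candidate_elems):
        return False
    return all(element_matches(p, c) for p, c in zip(pattern_elems, candidate_elems, strict=True))


print(pattern_matches("NP V PP", "NP V PP.instrument"))  # True: general PP matches the subtype
print(pattern_matches("NP V PP.instrument", "NP V PP"))  # False: the specific pattern needs the subtype
```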

#### Cross-Reference Enhancements
- **Automatic cross-reference extraction** on first use with progress indicators
- **Fuzzy resolution** for cross-references with typo tolerance
- **Confidence scoring** for mapping quality (0.0 to 1.0 scale)
- **Transitive mapping support** for indirect relationships (see the sketch after this list)
- **Reverse lookup capabilities** for bidirectional navigation
- New CLI commands: `glazing xref resolve`, `glazing xref extract`, `glazing xref clear-cache`
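
One plausible way to score a transitive mapping chain is to multiply per-hop confidences; this is a sketch of the concept, not necessarily how glazing combines its scores:

```python
from math import prod


def chained_confidence(hop_scores: list[float]) -> float:
    """Combine per-hop mapping confidences (each in [0.0, 1.0]) into a single score."""
    if not all(0.0 <= score <= 1.0 for score in hop_scores):
        raise ValueError("Confidence scores must be between 0.0 and 1.0")
    return prod(hop_scores)


# e.g. a PropBank -> VerbNet hop at 0.95 followed by a VerbNet -> FrameNet hop at 0.9
print(chained_confidence([0.95, 0.9]))  # 0.855 (up to floating-point rounding)
```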

#### Structured Role/Argument Search
- **Property-based role search** for VerbNet thematic roles (optional, required, etc.)
- **Argument type filtering** for PropBank arguments (ARGM-LOC, ARGM-TMP, etc.)
- **Frame element search** by core type in FrameNet
- Support for complex queries with multiple property filters

#### Docker Support
- **Dockerfile** for containerized usage without local installation
- Full CLI exposed through Docker container
- Volume support for persistent data storage
- Docker Compose configuration example
- Interactive Python session support via container

#### CLI Improvements
- `--json` output mode for all search and xref commands
- `--progress` flag for long-running operations
- `--force` flag for cache clearing and re-extraction
- Better error messages with actionable suggestions
- Support for batch operations

### Changed

#### Type System Improvements
- Expanded `ArgumentNumber` type to include all modifier patterns (M-LOC, M-TMP, etc.)
- Added "C" and "R" prefixes to `FunctionTag` for continuation/reference support
- Stricter validation for `ThematicRoleType` with proper indexed variants
- More precise TypedDict definitions for parsed symbols

#### API Refinements
- `CrossReferenceIndex` now supports fuzzy matching in `resolve()` method
- `UnifiedSearch` class (renamed from `Search` for clarity)
- Consistent `None` returns for missing values (not empty strings or -1)
- Better separation of concerns between extraction, mapping, and resolution

### Fixed

- **CacheBase abstract methods** now have default implementations instead of raising `NotImplementedError`
- **VerbNet class ID generation** now uses deterministic pattern-based generation instead of hash-based fallback
- **Backward compatibility code removed** from PropBank symbol parser - no longer checks for argnum attribute
- **Legacy MappingSource removed** - "legacy" value no longer accepted in types
- **Documentation language** - removed promotional terms from fuzzy-match.md
- **Test compatibility** - Fixed PropBank symbol parser tests to work without the backward-compatibility code
- PropBank `ArgumentNumber` type corrected to match actual data (removed invalid values like "7", "M-ADJ")
- ARGA argument in PropBank now handled with the correct arg_number value
- VerbNet member `verbnet_key` validation fixed to require proper format (e.g., "give#1")
- ThematicRole validation properly handles indexed role types (Patient_i, Theme_j)
- Import paths corrected for UnifiedSearch class
- Modifier type extraction returns `None` for non-modifiers consistently
- Frame element parsing handles abbreviations correctly
- Test fixtures updated to use correct data models and validation rules

### Technical Improvements

- Full mypy strict mode compliance across all modules
- Comprehensive test coverage for new symbol parsing features
- Performance optimizations for fuzzy matching with large datasets
- Better memory management for cross-reference extraction
- Caching improvements for repeated fuzzy searches

## [0.1.1] - 2025-09-27

### Fixed
@@ -29,7 +125,7 @@ Initial release of `glazing`, a package containing unified data models and inter
- **Unified data models** for all four linguistic resources using Pydantic v2
- **One-command initialization** with `glazing init` to download and convert all datasets
- **JSON Lines format** for efficient storage and streaming of large datasets
- **Type-safe interfaces** with comprehensive type hints for Python 3.13+
- **Type-safe interfaces** with comprehensive type hints using Python 3.13+ conventions
- **Cross-reference resolution** between FrameNet, PropBank, VerbNet, and WordNet
- **Memory-efficient streaming** support for processing large datasets

25 changes: 18 additions & 7 deletions CONTRIBUTING.md
@@ -31,7 +31,7 @@ glazing init

## Code Style

We use `ruff` for code quality:
We use `ruff` for code quality and `mypy` for type checking:

```bash
# Format code
@@ -40,23 +40,34 @@ ruff format src/ tests/
# Lint code
ruff check src/ tests/

# Type checking
# Type checking (strict mode required)
mypy --strict src/
```

## Testing

```bash
# Run all tests
pytest
# Run all tests with verbose output
pytest tests/ -v

# Run with coverage
pytest --cov=glazing
pytest tests/ -v --cov=src/glazing --cov-report=term-missing

# Run specific test
pytest tests/test_verbnet/
# Run specific test module
pytest tests/test_verbnet/test_models.py -v

# Run specific test with debugging output
pytest tests/test_base.py::TestBaseModel::test_model_validation -xvs
```

### Testing Requirements

- All new features must have tests (a minimal example follows this list)
- Tests should cover edge cases and error conditions
- Use descriptive test names that explain what is being tested
- Mock external dependencies and file I/O where appropriate
- Maintain or improve code coverage (aim for >90%)
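
A minimal sketch of a test written to these conventions; the function under test here is a stand-in defined inline, not part of glazing's actual test suite:

```python
# Illustrative only: the code under test is a hypothetical stand-in, shown to
# demonstrate descriptive naming, edge cases, and parametrization conventions.
import pytest


def normalize_dataset_name(name: str) -> str:
    """Stand-in for code under test: dataset names are matched case-insensitively."""
    cleaned = name.strip().lower()
    if cleaned not in {"framenet", "propbank", "verbnet", "wordnet"}:
        raise ValueError(f"Unknown dataset: {name!r}")
    return cleaned


class TestNormalizeDatasetName:
    @pytest.mark.parametrize("raw", ["VerbNet", "VERBNET", "  verbnet "])
    def test_name_is_case_insensitive_and_trimmed(self, raw: str) -> None:
        assert normalize_dataset_name(raw) == "verbnet"

    def test_unknown_dataset_raises_value_error(self) -> None:
        with pytest.raises(ValueError, match="Unknown dataset"):
            normalize_dataset_name("framenot")
```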

## Documentation

47 changes: 47 additions & 0 deletions Dockerfile
@@ -0,0 +1,47 @@
# Use official Python 3.13 slim image as base
FROM python:3.13-slim

# Set working directory
WORKDIR /app

# Set environment variables
ENV PYTHONDONTWRITEBYTECODE=1 \
PYTHONUNBUFFERED=1 \
PIP_NO_CACHE_DIR=1 \
PIP_DISABLE_PIP_VERSION_CHECK=1

# Install system dependencies required for building packages
RUN apt-get update && apt-get install -y --no-install-recommends \
gcc \
g++ \
&& rm -rf /var/lib/apt/lists/*

# Copy only requirements first to leverage Docker cache
COPY pyproject.toml README.md ./
COPY src/glazing/__version__.py src/glazing/

# Install package dependencies
RUN pip install --upgrade pip && \
pip install -e .

# Copy the rest of the application code
COPY src/ src/
COPY tests/ tests/

# Create data directory for datasets
RUN mkdir -p /data

# Set environment variable for data directory
ENV GLAZING_DATA_DIR=/data

# Initialize datasets during build
RUN glazing init --data-dir /data

# Expose data directory as volume
VOLUME ["/data"]

# Set the entrypoint to the glazing CLI
ENTRYPOINT ["glazing"]

# Default command shows help
CMD ["--help"]
90 changes: 70 additions & 20 deletions README.md
@@ -14,16 +14,43 @@ Unified data models and interfaces for syntactic and semantic frame ontologies.
- 🚀 **One-command setup**: `glazing init` downloads and prepares all datasets
- 📦 **Type-safe models**: Pydantic v2 validation for all data structures
- 🔍 **Unified search**: Query across all datasets with consistent API
- 🔗 **Cross-references**: Automatic mapping between resources
- 🔗 **Cross-references**: Automatic mapping between resources with confidence scores
- 🎯 **Fuzzy search**: Find matches even with typos or partial queries
- 🐳 **Docker support**: Use via Docker without local installation
- 💾 **Efficient storage**: JSON Lines format with streaming support
- 🐍 **Modern Python**: Full type hints, Python 3.13+ support

## Installation

### Via pip

```bash
pip install glazing
```

### Via Docker

Build and run Glazing in a containerized environment:

```bash
# Build the image
git clone https://github.com/aaronstevenwhite/glazing.git
cd glazing
docker build -t glazing:latest .

# Initialize datasets (persisted in volume)
docker run --rm -v glazing-data:/data glazing:latest init

# Use the CLI
docker run --rm -v glazing-data:/data glazing:latest search query "give"
docker run --rm -v glazing-data:/data glazing:latest search query "transfer" --fuzzy

# Interactive Python session
docker run --rm -it -v glazing-data:/data --entrypoint python glazing:latest
```

See the [installation docs](https://glazing.readthedocs.io/en/latest/installation/#docker-installation) for more Docker usage examples.

## Quick Start

Initialize all datasets (one-time setup, ~54MB download):
@@ -56,8 +83,23 @@ glazing search query "abandon"
# Search specific dataset
glazing search query "run" --dataset verbnet

# Use fuzzy search for typos
glazing search query "giv" --fuzzy
glazing search query "instrment" --fuzzy --threshold 0.7
```

Resolve cross-references:

```bash
# Extract cross-reference index (one-time setup)
glazing xref extract

# Find cross-references
glazing search cross-ref --source propbank --id "give.01" --target verbnet
glazing xref resolve "give.01" --source propbank
glazing xref resolve "give-13.1" --source verbnet

# Use fuzzy matching
glazing xref resolve "giv.01" --source propbank --fuzzy
```

## Python API
@@ -79,24 +121,32 @@ verb_classes = list(vn_loader.classes.values())
Cross-reference resolution:

```python
from glazing.references.extractor import ReferenceExtractor
from glazing.verbnet.loader import VerbNetLoader
from glazing.propbank.loader import PropBankLoader

# Load datasets
vn_loader = VerbNetLoader()
pb_loader = PropBankLoader()

# Extract references
extractor = ReferenceExtractor()
extractor.extract_verbnet_references(list(vn_loader.classes.values()))
extractor.extract_propbank_references(list(pb_loader.framesets.values()))

# Access PropBank cross-references
if "give.01" in extractor.propbank_refs:
refs = extractor.propbank_refs["give.01"]
vn_classes = refs.get_verbnet_classes()
print(f"VerbNet classes for give.01: {vn_classes}")
from glazing.references.index import CrossReferenceIndex

# Automatic extraction on first use (cached for future runs)
xref = CrossReferenceIndex()

# Resolve references for a PropBank roleset
refs = xref.resolve("give.01", source="propbank")
print(f"VerbNet classes: {refs['verbnet_classes']}")
print(f"Confidence scores: {refs['confidence_scores']}")

# Use fuzzy matching for typos
refs = xref.resolve("giv.01", source="propbank", fuzzy=True)
print(f"Found match with fuzzy search: {refs['verbnet_classes']}")
```

Fuzzy search in Python:

```python
from glazing.search import UnifiedSearch

# Use fuzzy search to handle typos
search = UnifiedSearch()
results = search.search_with_fuzzy("instrment", fuzzy_threshold=0.8)

for result in results[:5]:
print(f"{result.dataset}: {result.name} (score: {result.score:.2f})")
```

## Supported Datasets
5 changes: 5 additions & 0 deletions docs/api/framenet/symbol-parser.md
@@ -0,0 +1,5 @@
# glazing.framenet.symbol_parser

FrameNet symbol parsing utilities for frame and frame element names.

::: glazing.framenet.symbol_parser
2 changes: 1 addition & 1 deletion docs/api/index.md
@@ -118,7 +118,7 @@ except ValidationError as e:

## Version Compatibility

This documentation covers Glazing version 0.1.1. Check your installed version:
This documentation covers Glazing version 0.2.0. Check your installed version:

```python
import glazing
```
5 changes: 5 additions & 0 deletions docs/api/propbank/symbol-parser.md
@@ -0,0 +1,5 @@
# glazing.propbank.symbol_parser

PropBank symbol parsing utilities for roleset IDs and argument labels.

::: glazing.propbank.symbol_parser