diff --git a/CHANGELOG.md b/CHANGELOG.md index eecc15d..7437768 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -5,6 +5,12 @@ All notable changes to this project will be documented in this file. The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/), and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html). +## [Unreleased] + +### Changed + +- Improved documentation clarity and conciseness + ## [0.1.0] - 2025-09-23 Initial release of `glazing`, a package containing unified data models and interfaces for syntactic and semantic frame ontologies. diff --git a/docs/api/index.md b/docs/api/index.md index 27bf0e3..88b8d30 100644 --- a/docs/api/index.md +++ b/docs/api/index.md @@ -89,7 +89,7 @@ results = search.search("abandon") ## Type Safety -All models use Pydantic v2 for validation and provide comprehensive type hints. This ensures: +All models use Pydantic v2 for validation and provide complete type hints. This ensures: - Runtime validation of data - IDE autocomplete support diff --git a/docs/index.md b/docs/index.md index 244bff7..53cc5c7 100644 --- a/docs/index.md +++ b/docs/index.md @@ -16,9 +16,9 @@ Glazing provides a unified, type-safe interface for working with FrameNet, PropB - 🚀 **One-command initialization:** Download and convert all datasets with `glazing init` - 📦 **Type-safe data models:** Using Pydantic v2 for validation and serialization -- 🔍 **Comprehensive CLI:** Download, convert, and search datasets from the command line +- 🔍 **Command-line interface:** Download, convert, and search datasets from the command line - 🔗 **Cross-dataset references:** Find connections between different linguistic resources -- 🐍 **Python 3.13+:** Modern Python with comprehensive type hints +- 🐍 **Python 3.13+:** Modern Python with full type hints - 📊 **Efficient storage:** JSON Lines format for fast loading and streaming ## Supported Datasets @@ -60,33 +60,17 @@ for result in results[:5]: print(f"{result.dataset}: {result.name} - {result.description}") ``` -## Documentation Structure +## Documentation -- **[Installation](installation.md):** System requirements and installation options -- **[Quick Start](quick-start.md):** Get up and running in minutes -- **[User Guide](user-guide/cli.md):** Detailed usage instructions -- **[API Reference](api/index.md):** Complete API documentation -- **[Contributing](contributing.md):** How to contribute to the project +Start with [Installation](installation.md) for system requirements, then follow the [Quick Start](quick-start.md) to get running in minutes. The [User Guide](user-guide/cli.md) covers detailed usage, while the [API Reference](api/index.md) documents all classes and methods. See [Contributing](contributing.md) if you'd like to help improve the project. ## Why Glazing? -Working with linguistic resources traditionally requires: - -- Understanding different data formats (XML, custom databases, etc.) -- Writing custom parsers for each resource -- Managing cross-references manually -- Dealing with inconsistent APIs - -Glazing solves these problems by providing: - -- Unified data models across all resources -- Automatic data conversion to efficient formats -- Built-in cross-reference resolution -- Consistent search and access patterns +Working with linguistic resources traditionally requires understanding different data formats (XML, custom databases), writing custom parsers for each resource, managing cross-references manually, and dealing with inconsistent APIs. 
Glazing solves these problems by providing unified data models across all resources, automatic data conversion to efficient formats, built-in cross-reference resolution, and consistent search and access patterns. ## Project Status -Glazing is actively maintained and welcomes contributions. The project follows semantic versioning and maintains comprehensive test coverage. +Glazing is actively maintained and welcomes contributions. The project follows semantic versioning and includes extensive test coverage. ## Links diff --git a/docs/quick-start.md b/docs/quick-start.md index 57fc9dd..98d2b95 100644 --- a/docs/quick-start.md +++ b/docs/quick-start.md @@ -1,142 +1,74 @@ # Quick Start -Get up and running with Glazing in minutes. This guide assumes you've already [installed](installation.md) the package. +Get Glazing running in minutes. This guide assumes you have Python 3.13+ and pip installed. -## Initialize Datasets - -Start by downloading and converting all datasets: +## Installation and Setup ```bash -glazing init +pip install glazing +glazing init # Downloads ~54MB, creates ~130MB of data ``` -This one-time setup downloads ~54MB of data and prepares it for use (~130MB total after conversion). - -## CLI Usage +The `init` command downloads all four datasets and converts them to an efficient format. This can take a few minutes but only needs to be done once. -### Search for a Word +## Command Line -Find entries across all datasets: +Search across all datasets: ```bash -# Search for "give" in all datasets (uses default data directory) glazing search query "give" - -# Search only in VerbNet -glazing search query "give" --dataset verbnet - -# Get JSON output -glazing search query "give" --json +glazing search query "give" --dataset verbnet # Limit to one dataset ``` -### Find Cross-References - -Discover connections between datasets: +Find cross-references between datasets: ```bash -# Find VerbNet classes for a PropBank roleset -glazing search cross-ref --source propbank --target verbnet --id "give.01" +glazing search cross-ref --source propbank --id "give.01" --target verbnet ``` -### Get Dataset Information - -Learn about available datasets: - -```bash -# List all datasets -glazing download list - -# Get info about VerbNet -glazing download info verbnet -``` - -## Python API Usage - -### Basic Search +## Python API ```python from glazing.search import UnifiedSearch -# Initialize search (automatically uses default paths) +# Search all datasets search = UnifiedSearch() - -# Search across all datasets results = search.search("abandon") for result in results[:5]: - print(f"{result.dataset}: {result.name}") - print(f" Type: {result.type}") - print(f" Description: {result.description[:100]}...") - print() -``` - -### Load Individual Datasets - -```python -from glazing.framenet.loader import FrameNetLoader -from glazing.verbnet.loader import VerbNetLoader - -# Loaders automatically use default paths and load data after 'glazing init' -fn_loader = FrameNetLoader() # Data is already loaded -frames = fn_loader.frames -print(f"Loaded {len(frames)} frames") - -vn_loader = VerbNetLoader() # Data is already loaded -verb_classes = list(vn_loader.classes.values()) -print(f"Loaded {len(verb_classes)} verb classes") + print(f"{result.dataset}: {result.name} - {result.description}") ``` -### Work with VerbNet Classes +Load specific datasets: ```python from glazing.verbnet.loader import VerbNetLoader -# Loader automatically uses default path and loads data loader = VerbNetLoader() - -# Access already loaded 
verb classes -classes = list(loader.classes.values()) +verb_classes = list(loader.classes.values()) # Find a specific class -give_class = next( - (vc for vc in classes if vc.id == "give-13.1"), - None -) - +give_class = next((vc for vc in verb_classes if vc.id == "give-13.1"), None) if give_class: - print(f"Class: {give_class.id}") - print(f"Members: {[m.name for m in give_class.members[:5]]}") - print(f"Thematic Roles: {[tr.role_type for tr in give_class.themroles]}") - - # Examine frames - for frame in give_class.frames[:2]: - print(f"\nFrame: {frame.description.primary}") - print(f" Example: {frame.examples[0] if frame.examples else 'N/A'}") + print(f"Members: {[m.name for m in give_class.members]}") + print(f"Roles: {[tr.role_type for tr in give_class.themroles]}") ``` -### Work with PropBank +Work with WordNet synsets: ```python -from glazing.propbank.loader import PropBankLoader - -# Loader automatically uses default path and loads data -loader = PropBankLoader() - -# Access already loaded framesets -framesets = list(loader.framesets.values()) +from glazing.wordnet.loader import WordNetLoader -# Find rolesets for "give" -give_framesets = [fs for fs in framesets if fs.lemma == "give"] +loader = WordNetLoader() +synsets = list(loader.synsets.values()) -for frameset in give_framesets: - print(f"Frameset: {frameset.lemma}") - for roleset in frameset.rolesets: - print(f" Roleset: {roleset.id} - {roleset.name}") - for role in roleset.roles: - print(f" {role.argnum}: {role.description}") +# Find synsets for "dog" +dog_synsets = [s for s in synsets if any(l.lemma == "dog" for l in s.lemmas)] +for synset in dog_synsets[:3]: + print(f"{synset.id}: {synset.definition}") ``` -### Cross-Reference Resolution +Extract cross-references: ```python from glazing.references.extractor import ReferenceExtractor @@ -144,119 +76,21 @@ from glazing.references.resolver import ReferenceResolver from glazing.verbnet.loader import VerbNetLoader from glazing.propbank.loader import PropBankLoader -# Load datasets -vn_loader = VerbNetLoader() # Automatically loads data -pb_loader = PropBankLoader() # Automatically loads data +vn_loader = VerbNetLoader() +pb_loader = PropBankLoader() -# Extract references extractor = ReferenceExtractor() extractor.extract_verbnet_references(list(vn_loader.classes.values())) extractor.extract_propbank_references(list(pb_loader.framesets.values())) -# Resolve references for a PropBank roleset resolver = ReferenceResolver(extractor.mapping_index) related = resolver.resolve("give.01", source="propbank") - -print(f"PropBank roleset: give.01") print(f"VerbNet classes: {related.verbnet_classes}") -print(f"FrameNet frames: {related.framenet_frames}") -print(f"WordNet senses: {related.wordnet_senses}") -``` - -### WordNet Synsets and Relations - -```python -from glazing.wordnet.loader import WordNetLoader - -# Loader automatically uses default path and loads data -loader = WordNetLoader() -synsets = list(loader.synsets.values()) # Already loaded - -# Find synsets for "dog" -dog_synsets = [s for s in synsets if any( - l.lemma == "dog" for l in s.lemmas -)] - -for synset in dog_synsets[:3]: - print(f"Synset: {synset.id}") - print(f" POS: {synset.pos}") - print(f" Definition: {synset.definition}") - print(f" Lemmas: {[l.lemma for l in synset.lemmas]}") - - # Show hypernyms - if synset.relations: - hypernyms = [r for r in synset.relations if r.type == "hypernym"] - if hypernyms: - print(f" Hypernyms: {[h.target_id for h in hypernyms]}") -``` - -### Streaming Large Files - -For memory-efficient 
processing: - -```python -from glazing.verbnet.loader import VerbNetLoader - -# For memory-efficient streaming, use lazy loading -loader = VerbNetLoader(lazy=True, autoload=False) - -# Stream verb classes one at a time -for verb_class in loader.iter_verb_classes(): - # Process each class without loading all into memory - if "run" in [m.name for m in verb_class.members]: - print(f"Found 'run' in class: {verb_class.id}") - break -``` - -## Common Patterns - -### Find Semantic Roles - -```python -from glazing.verbnet.search import VerbNetSearch -from glazing.verbnet.loader import VerbNetLoader - -# Loader automatically loads data -loader = VerbNetLoader() -search = VerbNetSearch(list(loader.classes.values())) - -# Find all classes with an Agent role -agent_classes = [] -for vc in search.verb_classes: - if any(tr.role_type == "Agent" for tr in vc.themroles): - agent_classes.append(vc.id) - -print(f"Classes with Agent role: {len(agent_classes)}") -``` - -### Export to Custom Format - -```python -import json -from glazing.framenet.loader import FrameNetLoader - -# Loader automatically uses default path and loads data -loader = FrameNetLoader() -frames = loader.frames # Already loaded - -# Export as simple JSON -simple_frames = [] -for frame in frames[:10]: - simple_frames.append({ - "id": frame.id, - "name": frame.name, - "definition": frame.definition.plain_text if frame.definition else "", - "frame_elements": [fe.name for fe in frame.frame_elements] - }) - -# Save to file -with open("frames_simple.json", "w") as f: - json.dump(simple_frames, f, indent=2) ``` ## Next Steps -- Explore the [CLI documentation](user-guide/cli.md) for advanced command-line usage -- Read the [Python API guide](user-guide/python-api.md) for detailed programming examples -- Check the [API Reference](api/index.md) for complete documentation -- Learn about [cross-references](user-guide/cross-references.md) between datasets +- [CLI Documentation](user-guide/cli.md) for command-line options +- [Python API Guide](user-guide/python-api.md) for programming details +- [Cross-References](user-guide/cross-references.md) for connecting datasets +- [API Reference](api/index.md) for complete documentation diff --git a/docs/user-guide/cli.md b/docs/user-guide/cli.md index e96cd0f..58e3159 100644 --- a/docs/user-guide/cli.md +++ b/docs/user-guide/cli.md @@ -1,180 +1,98 @@ # CLI Usage -Detailed command-line interface documentation for Glazing. +The `glazing` command provides access to all functionality from the terminal. -## Overview - -The glazing CLI provides commands for downloading, converting, and searching linguistic datasets. - -## Global Options +## Quick Reference ```bash -glazing [OPTIONS] COMMAND [ARGS]... 
- -Options: - --version Show version - --verbose Enable verbose output - --quiet Suppress non-essential output - --help Show help message +glazing init # Download and set up all datasets +glazing search query "give" # Search across datasets +glazing search entity give.01 # Look up specific entity +glazing download list # Show available datasets ``` -## Commands - -### init - -Initialize all datasets with one command: - -```bash -glazing init [OPTIONS] - -Options: - --data-dir PATH Directory to store datasets - --force Force re-download even if data exists -``` - -### download - -Download datasets from official sources: - -```bash -glazing download dataset [OPTIONS] - -Options: - --dataset TEXT Dataset to download (framenet|propbank|verbnet|wordnet|all) - --output-dir PATH Output directory for raw data -``` +## Initialization -List available datasets: +The first step is always `glazing init`, which downloads and prepares all datasets. You can specify a custom location: ```bash -glazing download list +glazing init --data-dir /my/data/path ``` -Get dataset information: +Or use an environment variable: ```bash -glazing download info DATASET +export GLAZING_DATA_DIR=/my/data/path +glazing init ``` -### convert +## Searching -Convert raw datasets to JSON Lines format: +The search command is the main way to explore the data. Search by text query: ```bash -glazing convert dataset [OPTIONS] - -Options: - --dataset TEXT Dataset to convert - --input-dir PATH Directory containing raw data - --output-dir PATH Directory for converted data +glazing search query "abandon" +glazing search query "run" --dataset verbnet +glazing search query "give" --limit 10 --json ``` -### search - -Search across datasets: +Look up specific entities by their IDs: ```bash -# Search by query -glazing search query QUERY [OPTIONS] - -# Search for specific entity -glazing search entity ID [OPTIONS] - -# Find cross-references -glazing search cross-ref [OPTIONS] - -Common Options: - --dataset TEXT Limit to specific dataset - --data-dir PATH Directory containing converted data - --json Output as JSON - --limit INTEGER Limit number of results +glazing search entity give-13.1 --dataset verbnet +glazing search entity 01772306 --dataset wordnet ``` -## Examples - -### Full Workflow +Find cross-references between datasets: ```bash -# 1. Initialize everything -glazing init - -# 2. Search for a concept (uses default data directory) -glazing search query "give" - -# 3. 
Find cross-references glazing search cross-ref --source propbank --id "give.01" --target verbnet +glazing search cross-ref --source verbnet --id "give-13.1" --target all ``` -### Custom Data Directory +## Downloading and Converting -```bash -# Set custom directory -export GLAZING_DATA_DIR=/my/data/path - -# Initialize there -glazing init - -# All commands will use this directory -glazing search query "run" -``` - -### Processing Individual Datasets +If you need to work with individual datasets or update them: ```bash -# Download only VerbNet glazing download dataset --dataset verbnet +glazing convert dataset --dataset verbnet --input-dir raw --output-dir converted +``` -# Convert it -glazing convert dataset --dataset verbnet +To see what's available: -# Search it (uses default converted directory) -glazing search query "run" --dataset verbnet +```bash +glazing download list +glazing download info verbnet ``` ## Output Formats -### Default (Human-Readable) - -``` -Dataset: verbnet -Entity: give-13.1 -Type: VerbClass -Description: Transfer of possession -``` - -### JSON Output +By default, output is formatted for human reading. Add `--json` for programmatic use: ```bash -glazing search query "give" --json +glazing search query "give" --json | jq '.results[0]' ``` -Returns structured JSON for programmatic use. - -## Environment Variables - -- `GLAZING_DATA_DIR`: Default data directory -- `GLAZING_SKIP_INIT_CHECK`: Skip initialization check -- `XDG_DATA_HOME`: Base data directory (Linux/macOS) - ## Troubleshooting -### Data Not Found +If searches return no results, check that initialization completed: ```bash -# Check if initialized -glazing init - -# Verify data location ls ~/.local/share/glazing/converted/ ``` -### Permission Issues +For permission issues, use a different directory: ```bash -# Use different directory glazing init --data-dir ~/my-data +export GLAZING_DATA_DIR=~/my-data ``` -### Memory Issues +Memory issues with large datasets can be avoided by processing them individually rather than using `--dataset all`. + +## Environment Variables -Process datasets individually instead of using `--dataset all`. +- `GLAZING_DATA_DIR`: Where to store/find data (default: `~/.local/share/glazing`) +- `GLAZING_SKIP_INIT_CHECK`: Skip the initialization warning when importing the package +- `XDG_DATA_HOME`: Alternative base directory on Linux/macOS diff --git a/docs/user-guide/cross-references.md b/docs/user-guide/cross-references.md index d2199a4..0525fb5 100644 --- a/docs/user-guide/cross-references.md +++ b/docs/user-guide/cross-references.md @@ -1,52 +1,23 @@ # Cross-References -Guide to working with cross-references between datasets. +Glazing connects FrameNet, PropBank, VerbNet, and WordNet through their internal cross-references. While these connections exist in the original datasets, extracting and using them typically requires understanding each dataset's format. Glazing provides a unified interface to work with these references. -## Overview +## How References Work -Glazing provides tools to find connections between FrameNet, PropBank, VerbNet, and WordNet. +The four datasets reference each other in different ways. PropBank rolesets often specify their corresponding VerbNet classes. VerbNet members include WordNet sense keys. FrameNet lexical units sometimes reference WordNet synsets. These connections allow you to trace a concept across different linguistic representations. 
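+
+The resolver API introduced under Basic Usage below can follow these links for you. As a
+compact preview — a minimal sketch, assuming the datasets have been fetched with `glazing init`:
+
+```python
+from glazing.propbank.loader import PropBankLoader
+from glazing.verbnet.loader import VerbNetLoader
+from glazing.references.extractor import ReferenceExtractor
+from glazing.references.resolver import ReferenceResolver
+
+# Index the cross-references embedded in each dataset
+extractor = ReferenceExtractor()
+extractor.extract_verbnet_references(list(VerbNetLoader().classes.values()))
+extractor.extract_propbank_references(list(PropBankLoader().framesets.values()))
+
+# Follow the links outward from a single PropBank roleset
+resolver = ReferenceResolver(extractor.mapping_index)
+related = resolver.resolve("give.01", source="propbank")
+print(related.verbnet_classes)
+print(related.wordnet_senses)
+```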
-## Reference Types +For example, the PropBank roleset `give.01` maps to VerbNet class `give-13.1`, which contains members linked to WordNet senses like `give%2:40:00::`. This lets you connect PropBank's argument structure to VerbNet's thematic roles and WordNet's semantic hierarchy. -### PropBank → VerbNet +## Basic Usage -PropBank rolesets map to VerbNet classes: - -```python -# PropBank: give.01 → VerbNet: give-13.1 -``` - -### VerbNet → WordNet - -VerbNet members link to WordNet senses: - -```python -# VerbNet: give (member) → WordNet: give%2:40:00:: -``` - -### FrameNet → WordNet - -FrameNet lexical units reference WordNet: - -```python -# FrameNet: give.v → WordNet: give%2:40:00:: -``` - -## Using References - -### CLI +The simplest way to find cross-references is through the CLI: ```bash -# Find all references for a PropBank roleset -glazing search cross-ref --source propbank --id "give.01" \ - --target all --data-dir ~/.local/share/glazing/converted - -# Find VerbNet classes for PropBank -glazing search cross-ref --source propbank --id "give.01" \ - --target verbnet --data-dir ~/.local/share/glazing/converted +# Find what VerbNet classes correspond to a PropBank roleset +glazing search cross-ref --source propbank --id "give.01" --target verbnet ``` -### Python API +In Python, the process requires extracting references from the loaded datasets: ```python from glazing.references.extractor import ReferenceExtractor @@ -54,189 +25,59 @@ from glazing.references.resolver import ReferenceResolver from glazing.verbnet.loader import VerbNetLoader from glazing.propbank.loader import PropBankLoader -# Load datasets -vn_loader = VerbNetLoader() # Automatically loads data -pb_loader = PropBankLoader() # Automatically loads data +# Load and extract references +vn_loader = VerbNetLoader() +pb_loader = PropBankLoader() -# Extract references extractor = ReferenceExtractor() extractor.extract_verbnet_references(list(vn_loader.classes.values())) extractor.extract_propbank_references(list(pb_loader.framesets.values())) -# Resolve for specific item +# Resolve references resolver = ReferenceResolver(extractor.mapping_index) related = resolver.resolve("give.01", source="propbank") - -print(f"VerbNet: {related.verbnet_classes}") -print(f"WordNet: {related.wordnet_senses}") -print(f"FrameNet: {related.framenet_frames}") -``` - -## Reference Extraction - -### Automatic Extraction - -```python -from glazing.references.extractor import ReferenceExtractor -from glazing.verbnet.loader import VerbNetLoader - -# Load dataset -vn_loader = VerbNetLoader() - -# Extract references -extractor = ReferenceExtractor() - -# Extract from specific dataset -extractor.extract_verbnet_references(list(vn_loader.classes.values())) - -# Access extracted references -vn_refs = extractor.verbnet_refs -``` - -### Manual Mapping - -```python -from glazing.references.mapper import ReferenceMapper - -mapper = ReferenceMapper() - -# Map PropBank to VerbNet -vn_classes = mapper.propbank_to_verbnet("give.01") - -# Map VerbNet to WordNet -wn_senses = mapper.verbnet_to_wordnet("give-13.1") -``` - -## Reference Resolution - -### Simple Resolution - -```python -resolver = ReferenceResolver(references) - -# Get all related items -related = resolver.resolve("give.01", source="propbank") -``` - -### Transitive Resolution - -```python -# Follow chains of references -# PropBank → VerbNet → WordNet -chain = resolver.resolve_transitive("give.01", source="propbank") -``` - -### Batch Resolution - -```python -rolesets = ["give.01", "take.01", "run.02"] 
- -results = {} -for roleset in rolesets: - results[roleset] = resolver.resolve(roleset, source="propbank") +print(f"VerbNet classes: {related.verbnet_classes}") ``` -## Examples +## Working with References -### Finding Semantic Equivalents +The extraction step scans the datasets for embedded cross-references and builds an index. This is computationally expensive, so you'll want to do it once and reuse the results. The resolver then uses this index to find connections between datasets. -```python -def find_semantic_equivalents(word): - # Search all datasets - search = UnifiedSearch() - results = search.search_by_lemma(word) +When you resolve references for an item, you get back all the related items across datasets. Not every item has cross-references to all other datasets. Some connections are direct (explicitly stated in the data) while others are transitive (following chains of references). - # Group by dataset - by_dataset = {} - for result in results: - if result.dataset not in by_dataset: - by_dataset[result.dataset] = [] - by_dataset[result.dataset].append(result) +## Practical Examples - return by_dataset - -equivalents = find_semantic_equivalents("give") -``` - -### Building Reference Graph +To find semantic equivalents across datasets, search each one and collect the results: ```python -import networkx as nx - -def build_reference_graph(references): - G = nx.Graph() +from glazing.search import UnifiedSearch - for ref in references: - # Add nodes - G.add_node(ref.source_id, dataset=ref.source) - G.add_node(ref.target_id, dataset=ref.target) +search = UnifiedSearch() +results = search.search_by_lemma("give") - # Add edge - G.add_edge(ref.source_id, ref.target_id) - - return G +# Group results by dataset +by_dataset = {} +for result in results: + by_dataset.setdefault(result.dataset, []).append(result) ``` -### Cross-Dataset Analysis +For analyzing coverage of a concept across datasets: ```python -def analyze_coverage(lemma): +def check_coverage(lemma): search = UnifiedSearch() - resolver = ReferenceResolver(references) - - # Find in each dataset - coverage = { - 'propbank': False, - 'verbnet': False, - 'wordnet': False, - 'framenet': False - } - results = search.search_by_lemma(lemma) - for result in results: - coverage[result.dataset] = True - - # Check cross-references - related = resolver.resolve(result.id, source=result.dataset) - for dataset in coverage: - if getattr(related, f"{dataset}_ids"): - coverage[dataset] = True + coverage = set(r.dataset for r in results) + missing = {'propbank', 'verbnet', 'wordnet', 'framenet'} - coverage + if missing: + print(f"{lemma} not found in: {', '.join(missing)}") return coverage ``` -## Reference Data Model - -```python -@dataclass -class Reference: - source: str # Source dataset - source_id: str # ID in source - target: str # Target dataset - target_id: str # ID in target - confidence: float # Match confidence - -@dataclass -class ResolvedReferences: - source_id: str - propbank_ids: list[str] - verbnet_classes: list[str] - wordnet_senses: list[str] - framenet_frames: list[str] -``` - -## Best Practices - -1. **Cache references**: Extract once and reuse -2. **Validate IDs**: Check IDs exist before resolving -3. **Handle missing**: Not all items have cross-references -4. **Consider confidence**: Some matches are approximate -5. 
**Use batch operations**: Resolve multiple items together - ## Limitations -- Not all entries have cross-references -- Some references may be approximate matches -- Reference quality varies by dataset pair -- Transitive references may introduce noise +Cross-references in these datasets are incomplete and sometimes approximate. VerbNet members don't always have WordNet mappings. PropBank rolesets may lack VerbNet mappings. The quality and coverage of references varies between dataset pairs. Transitive references (A→B→C) can introduce errors if the intermediate mapping is incorrect. + +The current API requires manual extraction before resolution, which we plan to improve in future versions to match the ergonomics of the data loaders. diff --git a/docs/user-guide/python-api.md b/docs/user-guide/python-api.md index d6b1f7d..0175b85 100644 --- a/docs/user-guide/python-api.md +++ b/docs/user-guide/python-api.md @@ -1,250 +1,162 @@ # Python API -Comprehensive guide to using Glazing's Python API. - -## Installation - -```python -pip install glazing -``` +After running `glazing init`, you can use the Python API to work with the linguistic datasets. ## Basic Usage -### Unified Search +The simplest entry point is unified search, which queries all datasets at once: ```python from glazing.search import UnifiedSearch -# Initialize search (automatically uses default data directory) search = UnifiedSearch() - -# Search across all datasets results = search.search("abandon") -# Search with filters -results = search.search("run") - -# Search by lemma with POS filter -results = search.by_lemma("run", pos="v") +# Filter by part of speech +verb_results = search.by_lemma("run", pos="v") ``` -### Loading Datasets +## Loading Individual Datasets + +Each dataset has its own loader that provides access to the full data: ```python -from glazing.framenet.loader import FrameNetLoader -from glazing.propbank.loader import PropBankLoader from glazing.verbnet.loader import VerbNetLoader from glazing.wordnet.loader import WordNetLoader -# All loaders automatically use default paths and load data after 'glazing init' -fn_loader = FrameNetLoader() # Data is already loaded -frames = fn_loader.frames - -pb_loader = PropBankLoader() # Data is already loaded -framesets = list(pb_loader.framesets.values()) - -vn_loader = VerbNetLoader() # Data is already loaded +# Loaders automatically find and load data from the default location +vn_loader = VerbNetLoader() verb_classes = list(vn_loader.classes.values()) -wn_loader = WordNetLoader() # Data is already loaded +wn_loader = WordNetLoader() synsets = list(wn_loader.synsets.values()) ``` -## Advanced Features - -### Streaming Large Files +The loaders handle the JSON Lines format transparently. For large datasets, you can iterate instead of loading everything into memory: ```python -from glazing.verbnet.loader import VerbNetLoader - -loader = VerbNetLoader() # Automatically uses default path -# Process one at a time without loading all into memory +loader = VerbNetLoader() for verb_class in loader.iter_verb_classes(): if "run" in [m.name for m in verb_class.members]: - print(f"Found in: {verb_class.id}") + print(f"Found in class: {verb_class.id}") break ``` -### Cross-Reference Resolution +## Data Models + +All data structures are Pydantic models with full type hints and validation. 
This gives you IDE autocomplete, type checking with mypy, and automatic JSON serialization: ```python -from glazing.references.extractor import ReferenceExtractor -from glazing.references.resolver import ReferenceResolver -from glazing.initialize import get_default_data_path +from glazing.verbnet.models import Member -# Extract references (uses default data directory) -extractor = ReferenceExtractor() -references = extractor.extract_from_datasets(get_default_data_path()) +member = Member(name="run", grouping="run.01") +print(member.name) # IDE knows this is a string -# Resolve for a specific item -resolver = ReferenceResolver(references) -related = resolver.resolve("give.01", source="propbank") +# Export to various formats +data_dict = member.model_dump() +json_str = member.model_dump_json() ``` -### Custom Searching +## Searching Within Datasets + +Each dataset has specialized search capabilities beyond simple text matching. VerbNet lets you search by thematic roles or syntactic patterns: ```python from glazing.verbnet.search import VerbNetSearch from glazing.verbnet.loader import VerbNetLoader -from glazing.verbnet.types import ThematicRoleType -# Loader automatically loads data loader = VerbNetLoader() -verb_classes = list(loader.classes.values()) # Already loaded - -# Initialize search with loaded data -search = VerbNetSearch(verb_classes) - -# Find by members -classes = search.by_members(["run"]) +search = VerbNetSearch(list(loader.classes.values())) -# Find by thematic roles -agent_classes = search.by_themroles([ThematicRoleType.AGENT]) +# Find classes with specific thematic roles +agent_classes = search.by_themroles(["Agent", "Theme"]) -# Find by syntax pattern +# Find by syntactic pattern motion_classes = search.by_syntax("NP V PP") ``` -## Data Models +## Cross-References -All models use Pydantic v2 for validation: +Cross-references between datasets require extraction before use. 
This scans the data for embedded references and builds an index: ```python -from glazing.verbnet.models import VerbClass, Member - -# Models are validated -member = Member(name="run", grouping="run.01") - -# Access attributes with IDE support -print(member.name) -print(member.grouping) - -# Export to dict/JSON -data = member.model_dump() -json_str = member.model_dump_json() -``` - -## Error Handling - -```python -from pydantic import ValidationError -from glazing.framenet.loader import FrameNetLoader - -try: - loader = FrameNetLoader() # Automatically loads data - frames = loader.frames -except FileNotFoundError: - print("Data not found - run 'glazing init'") -except ValidationError as e: - print(f"Invalid data format: {e}") -``` - -## Performance Tips - -### Use Caching - -```python -from glazing.utils.cache import cached_search - -# Results are cached automatically -@cached_search -def find_related(lemma): - return search.search_by_lemma(lemma) -``` - -### Batch Processing - -```python -# Process multiple items efficiently -lemmas = ["give", "take", "run", "walk"] -all_results = {} - -for lemma in lemmas: - all_results[lemma] = search.by_lemma(lemma) -``` +from glazing.references.extractor import ReferenceExtractor +from glazing.references.resolver import ReferenceResolver -### Memory Management +# Extract references (expensive operation, do once) +extractor = ReferenceExtractor() +extractor.extract_verbnet_references(verb_classes) +extractor.extract_propbank_references(framesets) -```python -# Use generators for large datasets -def process_large_dataset(): - loader = VerbNetLoader() # Automatically loads data - for batch in loader.iter_verb_classes(): - for verb_class in batch: - yield process_class(verb_class) +# Resolve references (fast lookup) +resolver = ReferenceResolver(extractor.mapping_index) +related = resolver.resolve("give.01", source="propbank") ``` -## Integration Examples +## Integration with NLP Tools -### Flask Web API +The standardized data models make it easy to integrate with other NLP libraries. For spaCy: ```python -from flask import Flask, jsonify +import spacy from glazing.search import UnifiedSearch -app = Flask(__name__) +nlp = spacy.load("en_core_web_sm") search = UnifiedSearch() -@app.route('/api/search/') -def search_endpoint(query): - results = search.search(query) - return jsonify([r.__dict__ for r in results[:10]]) +doc = nlp("The dog ran quickly") +for token in doc: + if token.pos_ == "VERB": + results = search.by_lemma(token.lemma_, pos="v") + # Use results to enhance token with frame information ``` -### Pandas DataFrames +For pandas users, the models convert cleanly to DataFrames: ```python import pandas as pd -from glazing.wordnet.loader import WordNetLoader - -loader = WordNetLoader() # Automatically loads data -synsets = list(loader.synsets.values()) -# Convert to DataFrame -df = pd.DataFrame([ +synset_data = [ { 'id': s.id, 'pos': s.pos, 'definition': s.definition, 'lemmas': ', '.join([l.lemma for l in s.lemmas]) } - for s in synsets -]) + for s in synsets[:100] +] +df = pd.DataFrame(synset_data) ``` -### NLP Pipelines +## Error Handling + +The most common error is missing data files. 
The loaders raise clear exceptions:

```python
-import spacy
-from glazing.search import UnifiedSearch
+from glazing.framenet.loader import FrameNetLoader

-nlp = spacy.load("en_core_web_sm")
-search = UnifiedSearch()
+try:
+    loader = FrameNetLoader()
+except FileNotFoundError:
+    print("Run 'glazing init' first to download the data")
+```

-def enrich_with_frames(text):
-    doc = nlp(text)
-    enriched = []
+Validation errors from Pydantic show exactly what went wrong:

-    for token in doc:
-        if token.pos_ == "VERB":
-            results = search.by_lemma(token.lemma_, pos="v")
-            enriched.append({
-                'token': token.text,
-                'lemma': token.lemma_,
-                'frames': [r.frames[0].name if r.frames else "" for r in results[:3]]
-            })
+```python
+from pydantic import ValidationError
+from glazing.verbnet.models import Member

-    return enriched
+try:
+    member = Member(name="")  # Invalid: empty name
+except ValidationError as e:
+    print(e)  # Shows field name and constraint violated
```

-## Best Practices
+## Performance Considerations

-1. **Initialize once**: Create loaders/searchers once and reuse
-2. **Use type hints**: Leverage IDE support and type checking
-3. **Handle errors**: Always handle file and validation errors
-4. **Stream when possible**: Use streaming for large datasets
-5. **Cache results**: Cache expensive operations
+The loaders cache data in memory after the first access. If you're processing large datasets, use iteration instead of loading everything at once. For repeated searches, consider caching results or using the built-in cache decorators in `glazing.utils.cache`.
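+
+For instance, here is a sketch that streams one pass over VerbNet and hand-rolls a small cache
+for a repeated lookup. Only the loader calls shown above are assumed; the dict cache and the
+`classes_for_member` helper are illustrations, not package API:
+
+```python
+from glazing.verbnet.loader import VerbNetLoader
+
+loader = VerbNetLoader()
+
+# One streaming pass: keep only the IDs you need, not every class object
+agent_class_ids = [
+    vc.id
+    for vc in loader.iter_verb_classes()
+    if any(tr.role_type == "Agent" for tr in vc.themroles)
+]
+
+# Memoize a repeated lookup instead of rescanning on every call
+_cache: dict[str, list[str]] = {}
+
+def classes_for_member(name: str) -> list[str]:
+    if name not in _cache:
+        _cache[name] = [
+            vc.id
+            for vc in loader.classes.values()
+            if any(m.name == name for m in vc.members)
+        ]
+    return _cache[name]
+```

-## API Reference
+## Further Reading

-See the [API Reference](../api/index.md) for complete documentation.
+See the [API Reference](../api/index.md) for detailed documentation of all classes and methods.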
diff --git a/src/glazing/base.py b/src/glazing/base.py
index 9ba63e4..7da2e1c 100644
--- a/src/glazing/base.py
+++ b/src/glazing/base.py
@@ -247,7 +247,7 @@ class CrossReferenceBase(GlazingBaseModel):
    @field_validator("source_id", "target_id")
    @classmethod
    def validate_ids(cls, v: str | list[str]) -> str | list[str]:
-        """Validate that IDs are non-empty strings."""
+        """IDs must be non-empty strings."""
        if isinstance(v, str):
            if not v.strip():
                raise ValueError("ID cannot be empty")
@@ -261,7 +261,7 @@ def validate_ids(cls, v: str | list[str]) -> str | list[str]:
    @model_validator(mode="after")
    def validate_datasets(self) -> Self:
-        """Validate that source and target datasets are different."""
+        """Source and target must be different datasets."""
        if self.source_dataset == self.target_dataset and self.mapping_type not in (
            "inherited",
            "transitive",
@@ -330,7 +330,7 @@ class MappingBase(GlazingBaseModel):
    @model_validator(mode="after")
    def validate_modification(self) -> Self:
-        """Ensure modification fields are consistent."""
+        """modified_by and modified_date must be set together."""
        if self.modified_date and not self.modified_by:
            raise ValueError("modified_by required when modified_date is set")
        if self.modified_by and not self.modified_date:
@@ -393,58 +393,58 @@ def validate_pattern(value: str, pattern: str, field_name: str) -> str:

def validate_frame_id(value: int | str) -> str:
-    """Validate a FrameNet frame ID."""
+    """Check FrameNet frame ID format (positive integer)."""
    str_value = str(value)
    return validate_pattern(str_value, FRAME_ID_PATTERN, "frame ID")

def validate_frame_name(value: str) -> str:
-    """Validate a FrameNet frame name."""
+    """Check FrameNet frame name format."""
    return validate_pattern(value, FRAME_NAME_PATTERN, "frame name")

def validate_fe_name(value: str) -> str:
-    """Validate a FrameNet frame element name."""
+    """Check FrameNet FE name format."""
    return validate_pattern(value, FE_NAME_PATTERN, "frame element name")

def validate_verbnet_class(value: str) -> str:
-    """Validate a VerbNet class ID."""
+    """Check VerbNet class ID format (e.g., give-13.1)."""
    return validate_pattern(value, VERBNET_CLASS_PATTERN, "VerbNet class ID")

def validate_verbnet_key(value: str) -> str:
-    """Validate a VerbNet member key."""
+    """Check VerbNet member key format."""
    return validate_pattern(value, VERBNET_KEY_PATTERN, "VerbNet key")

def validate_propbank_roleset(value: str) -> str:
-    """Validate a PropBank roleset ID."""
+    """Check PropBank roleset ID format (lemma.##)."""
    return validate_pattern(value, PROPBANK_ROLESET_PATTERN, "PropBank roleset ID")

def validate_wordnet_offset(value: str) -> str:
-    """Validate a WordNet synset offset."""
+    """Check WordNet synset offset format."""
    return validate_pattern(value, WORDNET_OFFSET_PATTERN, "WordNet offset")

def validate_wordnet_sense_key(value: str) -> str:
-    """Validate a WordNet sense key."""
+    """Check WordNet sense key format."""
    return validate_pattern(value, WORDNET_SENSE_KEY_PATTERN, "WordNet sense key")

def validate_percentage_notation(value: str) -> str:
-    """Validate VerbNet's WordNet percentage notation."""
+    """Check VerbNet's WordNet notation (lemma%#:#:#::)."""
    return validate_pattern(value, PERCENTAGE_NOTATION_PATTERN, "percentage notation")

def validate_lemma(value: str) -> str:
-    """Validate a word lemma."""
+    """Check that lemma contains valid characters."""
    return validate_pattern(value, LEMMA_PATTERN, "lemma")

def validate_hex_color(value: str) -> str:
-    """Validate a 6-digit hex color code."""
+    """Check hex color format (#RRGGBB)."""
    return validate_pattern(value, HEX_COLOR_PATTERN, "hex color")
@@ -497,7 +497,7 @@ class ConflictResolution(GlazingBaseModel):
    @model_validator(mode="after")
    def validate_resolution(self) -> Self:
-        """Ensure resolution is consistent."""
+        """Resolution must have either selected or rejected mappings."""
        if not self.selected_mapping and not self.rejected_mappings:
            raise ValueError("Resolution must have either selected or rejected mappings")
        return self
diff --git a/src/glazing/cli/__init__.py b/src/glazing/cli/__init__.py
index 5c25bfd..c2fd976 100644
--- a/src/glazing/cli/__init__.py
+++ b/src/glazing/cli/__init__.py
@@ -1,6 +1,6 @@
"""Command-line interface for the glazing package.

-This module provides a comprehensive CLI for managing linguistic datasets
+This module provides a CLI for managing linguistic datasets
including downloading, converting, searching, and information commands.

Commands
@@ -85,7 +85,7 @@ def cli(ctx: click.Context, verbose: bool, quiet: bool) -> None:
@click.option("--force", is_flag=True, help="Force re-download even if data exists")
@click.pass_context
def init(ctx: click.Context, data_dir: str | Path | None, force: bool) -> None:
-    """Initialize all datasets by downloading and converting them."""
+    """Download and convert all linguistic datasets for first-time setup."""
    quiet = ctx.obj.get("quiet", False)

    # Convert to Path if provided
diff --git a/src/glazing/cli/search.py b/src/glazing/cli/search.py
index ca1fefc..194e69a 100644
--- a/src/glazing/cli/search.py
+++ b/src/glazing/cli/search.py
@@ -39,7 +39,7 @@
def _display_verbnet_details(entity: VerbClass) -> None:
-    """Display VerbNet-specific entity details."""
+    """Show VerbNet class members, roles, and frames."""
    if hasattr(entity, "members"):
        console.print(f"[white]Members:[/white] {len(entity.members)}")
    if hasattr(entity, "themroles"):
@@ -51,7 +51,7 @@ def _display_verbnet_details(entity: VerbClass) -> None:

def _display_propbank_details(entity: Frameset) -> None:
-    """Display PropBank-specific entity details."""
+    """Show PropBank frameset rolesets."""
    if hasattr(entity, "rolesets"):
        console.print(f"[white]Rolesets:[/white] {len(entity.rolesets)}")
        for rs in entity.rolesets[:3]:  # Show first 3
@@ -59,7 +59,7 @@ def _display_propbank_details(entity: Frameset) -> None:

def _display_wordnet_details(entity: Synset) -> None:
-    """Display WordNet-specific entity details."""
+    """Show WordNet synset words and definition."""
    if hasattr(entity, "words"):
        entity_words = getattr(entity, "words", [])
        words = ", ".join(getattr(w, "lemma", str(w)) for w in entity_words[:5])
@@ -69,7 +69,7 @@ def _display_wordnet_details(entity: Synset) -> None:

def _display_framenet_details(entity: Frame) -> None:
-    """Display FrameNet-specific entity details."""
+    """Show FrameNet frame elements."""
    if hasattr(entity, "frame_elements"):
        console.print(f"[white]Frame Elements:[/white] {len(entity.frame_elements)}")
        for fe in entity.frame_elements[:5]:  # Show first 5
@@ -166,7 +166,7 @@ def load_search_index(data_dir: str | Path, datasets: list[str] | None = None) -
@click.group()
def search() -> None:
-    """Search across converted linguistic datasets."""
+    """Query linguistic datasets from the command line."""

@search.command(name="query")
diff --git a/src/glazing/initialize.py b/src/glazing/initialize.py
index 6336fc7..b043cae 100644
--- a/src/glazing/initialize.py
+++ b/src/glazing/initialize.py
@@ -263,7 +263,7 @@ def check_initialization(data_dir: Path | None = None) -> bool:
@click.option("--force", is_flag=True, help="Force re-download even if data exists") @click.option("--quiet", is_flag=True, help="Suppress output messages") def main(data_dir: str | Path | None, force: bool, quiet: bool) -> None: - """Initialize glazing datasets by downloading and converting them.""" + """Set up all datasets. Downloads raw data and converts to JSON Lines format.""" # Convert to Path if provided if data_dir is not None: data_dir = Path(data_dir) diff --git a/src/glazing/references/models.py b/src/glazing/references/models.py index ba2ae13..d0db563 100644 --- a/src/glazing/references/models.py +++ b/src/glazing/references/models.py @@ -21,7 +21,7 @@ VerbNetFrameNetRoleMapping Role-level mapping between VerbNet and FrameNet. VerbNetCrossRefs - Enhanced VerbNet member cross-references. + VerbNet member cross-references. PropBankCrossRefs PropBank roleset cross-references with confidence. PropBankRoleMapping @@ -150,7 +150,7 @@ def validate_score(cls, v: float) -> float: class CrossReference(BaseModel): - """Enhanced cross-dataset reference with full metadata. + """Cross-dataset reference with full metadata. Attributes ---------- @@ -319,7 +319,7 @@ class VerbNetFrameNetRoleMapping(BaseModel): class VerbNetCrossRefs(BaseModel): - """Enhanced VerbNet member cross-references. + """VerbNet member cross-references. Attributes ---------- @@ -713,7 +713,7 @@ def get_combined_score(self) -> float: class FERelation(BaseModel): - """Enhanced FE mapping between related frames with alignment metadata. + """FE mapping between related frames with alignment metadata. Attributes ---------- diff --git a/src/glazing/verbnet/gl_models.py b/src/glazing/verbnet/gl_models.py index 365f87d..dc41676 100644 --- a/src/glazing/verbnet/gl_models.py +++ b/src/glazing/verbnet/gl_models.py @@ -7,9 +7,9 @@ Classes ------- GLVerbClass - VerbNet class enhanced with Generative Lexicon features. + VerbNet class with Generative Lexicon features. GLFrame - Frame enhanced with GL event structure and qualia. + Frame with GL event structure and qualia. Subcategorization GL subcategorization with variable assignments. SubcatMember @@ -266,7 +266,7 @@ class Subcategorization(GlazingBaseModel): class GLFrame(GlazingBaseModel): - """Frame enhanced with GL event structure and qualia. + """Frame with GL event structure and qualia. Attributes ---------- @@ -300,14 +300,14 @@ class GLFrame(GlazingBaseModel): class GLVerbClass(GlazingBaseModel): - """VerbNet class enhanced with Generative Lexicon features. + """VerbNet class with Generative Lexicon features. Attributes ---------- verb_class : VerbClass Base VerbNet class. gl_frames : list[GLFrame] - List of GL-enhanced frames. + List of GL frames. Methods ------- diff --git a/src/glazing/verbnet/models.py b/src/glazing/verbnet/models.py index a0bd865..47d6c33 100644 --- a/src/glazing/verbnet/models.py +++ b/src/glazing/verbnet/models.py @@ -21,7 +21,7 @@ MappingMetadata Metadata for cross-dataset mappings. Member - VerbNet member with comprehensive cross-references. + VerbNet member with cross-references. VerbClass A VerbNet verb class with members and frames. VNFrame @@ -347,7 +347,7 @@ def from_percentage_notation(cls, notation: str) -> Self: class Member(GlazingBaseModel): - """VerbNet member with comprehensive cross-references. + """VerbNet member with cross-references. 
Attributes ---------- diff --git a/src/glazing/wordnet/relations.py b/src/glazing/wordnet/relations.py index 52050d3..78e1a37 100644 --- a/src/glazing/wordnet/relations.py +++ b/src/glazing/wordnet/relations.py @@ -1,6 +1,6 @@ """WordNet relation traversal functionality. -This module provides comprehensive relation traversal capabilities for WordNet, +This module provides relation traversal capabilities for WordNet, including hypernym/hyponym chains, meronym/holonym navigation, entailment and causation relations, and similarity measure calculations. """