feat: Add symbol parsing, fuzzy search, and syntax-based search #3

Merged: aaronstevenwhite merged 26 commits into main from feature/symbol-parsing-and-fuzzy-search on Sep 30, 2025
@aaronstevenwhite
Collaborator

Description

This PR introduces symbol parsing capabilities, fuzzy string matching, and syntax-based search functionality to the Glazing package. These features enable users to parse linguistic identifiers, perform approximate string matching, and search across datasets using syntactic patterns.

Type of Change

  • New feature (non-breaking change which adds functionality)
  • Bug fix
  • Breaking change
  • Documentation update

Changes Made

Core Features

  • Symbol parsers for all four datasets (FrameNet, PropBank, VerbNet, WordNet)
  • Fuzzy string matching using Levenshtein distance algorithm
  • Syntax-based search interface for pattern matching
  • Transitive cross-reference resolution with MappingIndex

Implementation Details

  • Symbol Parsing: Parse identifiers like give.01 (PropBank) or 02756558-n (WordNet)
  • Fuzzy Matching: Configurable similarity thresholds with Levenshtein distance
  • Syntax Search: Pattern-based search (e.g., "NP V NP PP[to]")
  • Cross-References: New MappingIndex class with file-based caching
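
To illustrate what transitive cross-reference resolution means here, the following is a minimal sketch and not the MappingIndex implementation. It assumes the direct mappings are available as a plain dictionary from one identifier to a list of identifiers; the function name `resolve_transitive`, the `dataset:identifier` key format, and the sample data are all hypothetical.

```python
from collections import deque


def resolve_transitive(mappings, start):
    """Return every identifier reachable from `start` by following mappings.

    `mappings` is an assumed dict of direct cross-references; this is an
    illustration only and does not mirror the MappingIndex API.
    """
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for target in mappings.get(current, ()):
            if target not in seen:  # guards against cycles in the mapping graph
                seen.add(target)
                queue.append(target)
    seen.discard(start)  # report only the identifiers that were reached
    return seen


# Hypothetical direct mappings, chosen for illustration.
direct = {
    "propbank:give.01": ["verbnet:give-13.1"],
    "verbnet:give-13.1": ["framenet:Giving"],
}
reachable = resolve_transitive(direct, "propbank:give.01")
```

A breadth-first walk with a visited set keeps the traversal finite even when two datasets map to each other in both directions.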

New Files Added (12 new modules)

  • src/glazing/syntax/ - Syntactic pattern models and parser
  • src/glazing/utils/fuzzy_match.py - Fuzzy matching utilities
  • src/glazing/utils/ranking.py - Result ranking functions
  • src/glazing/references/index.py - Mapping index implementation
  • src/glazing/symbols.py - Symbol type definitions
  • src/glazing/{dataset}/symbol_parser.py - Dataset parsers (4 files)

Documentation Added (18 docs)

  • docs/user-guide/fuzzy-search.md - Fuzzy search guide
  • docs/user-guide/syntax-search.md - Syntax search guide
  • docs/api/{dataset}/symbol-parser.md - API docs (4 files)
  • docs/api/utils/fuzzy-match.md - Fuzzy match API
  • docs/api/symbols.md - Symbols API
  • Updated data-formats.md with JSON schemas

Testing

  • 1,333 total tests pass
  • 28 test files modified/added
  • Type checking: mypy --strict src/ passes
  • Linting: ruff check passes
  • Formatting: ruff format validated

Dependencies

  • python-Levenshtein>=0.20.0 added to dependencies

Docker Support

  • Dockerfile added with multi-stage build
  • Python 3.13 base image
  • Automated data download during build

Code Examples

Symbol Parsing

>>> from glazing.propbank.symbol_parser import parse_roleset_id
>>> result = parse_roleset_id("give.01")
>>> result.lemma
'give'
>>> result.sense_number
1
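
For readers unfamiliar with the identifier formats, here is a rough regex-based sketch of how strings such as give.01 and 02756558-n could be split into fields. It is not the parser added in this PR: the names `Roleset`, `SynsetOffset`, `parse_roleset`, and `parse_synset_offset` are hypothetical, and the sketch assumes two-digit numeric sense suffixes and eight-digit WordNet offsets only.

```python
import re
from typing import NamedTuple


class Roleset(NamedTuple):
    lemma: str
    sense_number: int


class SynsetOffset(NamedTuple):
    offset: str
    pos: str


# Assumed formats: "<lemma>.<two digits>" and "<eight digits>-<pos letter>".
_ROLESET_RE = re.compile(r"^(?P<lemma>[^.\s]+)\.(?P<sense>\d{2})$")
_OFFSET_RE = re.compile(r"^(?P<offset>\d{8})-(?P<pos>[nvars])$")


def parse_roleset(text):
    """Split a PropBank-style roleset ID into lemma and sense number."""
    match = _ROLESET_RE.match(text)
    if match is None:
        raise ValueError(f"not a roleset ID: {text!r}")
    return Roleset(match.group("lemma"), int(match.group("sense")))


def parse_synset_offset(text):
    """Split a WordNet-style offset ID into offset and part of speech."""
    match = _OFFSET_RE.match(text)
    if match is None:
        raise ValueError(f"not a synset offset: {text!r}")
    # The offset stays a string so that leading zeros are preserved.
    return SynsetOffset(match.group("offset"), match.group("pos"))
```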

Fuzzy Search

>>> from glazing.search import UnifiedSearch
>>> searcher = UnifiedSearch()
>>> results = searcher.search_with_fuzzy("comunication", fuzzy_threshold=0.8)
>>> # Results include frames with similar names like "Communication"
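
The thresholding idea behind fuzzy search can be sketched in pure Python. The PR itself depends on python-Levenshtein; the version below swaps in a small dynamic-programming implementation so that it runs without extra packages. The names `levenshtein`, `similarity`, and `fuzzy_filter`, the length-based normalization, and the case-insensitive comparison are assumptions made for illustration, not the package's documented behavior.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a, b):
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a.lower(), b.lower()) / longest


def fuzzy_filter(query, candidates, threshold=0.8):
    """Keep candidates at or above the threshold, best match first."""
    scored = [(name, similarity(query, name)) for name in candidates]
    kept = [pair for pair in scored if pair[1] >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)
```

Under this normalization, "comunication" is one edit away from "Communication" (13 characters), so it scores 1 - 1/13, which clears a threshold of 0.8.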

Syntax Search

>>> from glazing.search import UnifiedSearch
>>> searcher = UnifiedSearch()
>>> # Search for ditransitive patterns (e.g., "NP gives NP to NP")
>>> results = searcher.search_by_syntax(
...     pattern="NP V NP PP[to]",
...     allow_wildcards=True
... )
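
As a rough picture of how a pattern string such as "NP V NP PP[to]" might be tokenized and compared, consider the sketch below. It is not the parser in src/glazing/syntax/: the `Element` type, the `parse_pattern` and `matches` helpers, the `*` wildcard token, and the position-by-position matching rule are all hypothetical simplifications.

```python
import re
from typing import NamedTuple, Optional


class Element(NamedTuple):
    category: str          # e.g. "NP", "V", "PP", or "*" as a wildcard
    value: Optional[str]   # bracketed detail such as "to" in "PP[to]"


# Assumed token shape: letters (or "*") with an optional "[...]" suffix.
_ELEMENT_RE = re.compile(r"^(?P<cat>[A-Za-z]+|\*)(?:\[(?P<val>[^\]]+)\])?$")


def parse_pattern(pattern):
    """Split a whitespace-separated pattern into Element tuples."""
    elements = []
    for token in pattern.split():
        match = _ELEMENT_RE.match(token)
        if match is None:
            raise ValueError(f"unrecognized pattern element: {token!r}")
        elements.append(Element(match.group("cat"), match.group("val")))
    return elements


def matches(pattern, frame):
    """Position-by-position comparison; an unspecified value matches any."""
    if len(pattern) != len(frame):
        return False
    for want, have in zip(pattern, frame):
        if want.category == "*":
            continue  # wildcard accepts any single element
        if want.category != have.category:
            return False
        if want.value is not None and want.value != have.value:
            return False
    return True
```

In this simplified scheme a bare PP in the query matches PP[to], while PP[to] in the query does not match PP[for].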

…d morphological feature matching case-insensitive.
@aaronstevenwhite aaronstevenwhite merged commit f708c7d into main Sep 30, 2025
9 checks passed
@aaronstevenwhite aaronstevenwhite deleted the feature/symbol-parsing-and-fuzzy-search branch September 30, 2025 20:08