
fix: Fixes FrameNet lexical unit loading #5

Merged
aaronstevenwhite merged 1 commit into main from
fix/framenet-lexical-units-loading
Oct 28, 2025

@aaronstevenwhite aaronstevenwhite commented Oct 28, 2025

Fix FrameNet Lexical Unit Loading

Fixes #4

Description

This PR fixes a critical data completeness issue where FrameNet lexical units were not being loaded during dataset conversion. All frames had empty lexical_units fields despite the raw FrameNet data containing 13,575 lexical units. This fix parses lexical units from luIndex.xml and properly associates them with their frames during conversion.

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update

Key Changes

Lexical Unit Loading

  • Added LU parsing: New methods to parse lexical units from luIndex.xml
  • Frame association: LUs now properly associated with frames by frame_id
  • Complete metadata: Preserves POS tags, annotation status, sentence counts, lexeme structures
  • High success rate: 13,572 out of 13,575 LUs successfully parsed (99.98%)

Validation Updates

  • Relaxed patterns: Updated validators to handle real-world FrameNet data
  • Proper nouns: Now accepts "April.n", "Monday.n"
  • Multi-word expressions: Now accepts "a bit.n", "give up.v"
  • Special characters: Now accepts "(can't) help.v", "American [N and S Am].n"

Version Bump

  • Version: 0.2.0 → 0.2.1 (patch release)
  • Updated: All version references across package and documentation

Problem

Frames had empty lexical_units fields after conversion:

>>> from glazing.framenet.loader import FrameNetLoader
>>> loader = FrameNetLoader()
>>> index = loader.build_frame_index(loader.frames)
>>> frame = index.get_frame_by_name("Abandonment")
>>> len(frame.lexical_units)
0  # Should be 5!

The converter only parsed frame XML files and never loaded lexical unit data from luIndex.xml.

Solution

1. Added LU Parsing Methods

_parse_lu_from_index() - Parse individual LU from XML element:

  • Extract metadata (ID, name, POS, frame association, annotation status)
  • Create lexemes from multi-word expressions
  • Build sentence count statistics

convert_lu_index_file() - Convert entire luIndex.xml:

  • Parse all LU elements from index
  • Handle errors gracefully with warnings
  • Return list of LexicalUnit models
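The two methods above can be sketched roughly as follows. This is a minimal illustration, not the code in converter.py: the `LexicalUnitSketch` dataclass, the `parse_lu` and `convert_lu_index` names, the XML namespace, the attribute names (`ID`, `name`, `frameID`, `status`), and the idea of deriving the POS from the name suffix are all assumptions made for the example and should be checked against the real luIndex.xml.

```python
import warnings
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Hypothetical stand-in for the project's LexicalUnit model.
@dataclass
class LexicalUnitSketch:
    id: int
    name: str
    pos: str
    frame_id: int
    status: str
    lexemes: list[str]

# Assumed namespace; verify against the actual FrameNet release.
NS = {"fn": "http://framenet.icsi.berkeley.edu"}

# Toy index with one deliberately malformed entry (non-numeric ID).
SAMPLE = """<luIndex xmlns="http://framenet.icsi.berkeley.edu">
  <lu ID="1" name="abandon.v" frameID="10" status="Finished_Initial"/>
  <lu ID="2" name="give up.v" frameID="10" status="Created"/>
  <lu ID="oops" name="broken.v" frameID="11" status="Created"/>
</luIndex>"""

def parse_lu(element):
    """Build one LU from an <lu> element (assumed attribute names)."""
    name = element.attrib["name"]
    # Split "give up.v" into the stem "give up" and the POS suffix "v".
    stem, _, pos = name.rpartition(".")
    return LexicalUnitSketch(
        id=int(element.attrib["ID"]),
        name=name,
        pos=pos.upper(),
        frame_id=int(element.attrib["frameID"]),
        status=element.get("status", ""),
        lexemes=stem.split(),  # one lexeme per whitespace-separated word
    )

def convert_lu_index(xml_text):
    """Parse every <lu> element, warning on and skipping bad entries."""
    root = ET.fromstring(xml_text)
    units = []
    for element in root.findall("fn:lu", NS):
        try:
            units.append(parse_lu(element))
        except (KeyError, ValueError) as exc:
            warnings.warn(f"Skipping LU {element.get('ID')!r}: {exc}")
    return units

units = convert_lu_index(SAMPLE)
```

In this sketch a single bad entry produces a warning and is skipped, so the remaining entries still load.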

2. Updated Frame Conversion

Modified convert_frames_directory() to:

  1. Parse all frame XML files
  2. Load lexical units from luIndex.xml (located in the parent of the frames directory)
  3. Group LUs by frame_id
  4. Associate LUs with their frames
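Steps 3 and 4 amount to a group-by on frame_id followed by an assignment. A minimal sketch, assuming simplified `Frame` and `LU` stand-ins (both hypothetical, as is the `attach_lexical_units` helper) rather than the project's real models:

```python
from collections import defaultdict
from dataclasses import dataclass, field

# Hypothetical, simplified stand-ins for the real models.
@dataclass
class LU:
    name: str
    frame_id: int

@dataclass
class Frame:
    id: int
    name: str
    lexical_units: list = field(default_factory=list)

def attach_lexical_units(frames, lexical_units):
    """Group LUs by frame_id, then assign each group to its frame."""
    by_frame = defaultdict(list)
    for lu in lexical_units:
        by_frame[lu.frame_id].append(lu)
    for frame in frames:
        # Frames with no LUs get an empty list; LUs whose frame_id
        # matches no frame are silently left unattached in this sketch.
        frame.lexical_units = by_frame.get(frame.id, [])
    return frames

# Toy data: the IDs are illustrative, not real FrameNet IDs.
frames = [Frame(1, "Abandonment"), Frame(2, "Empty_frame")]
lus = [LU("abandon.v", 1), LU("leave.v", 1), LU("orphan.v", 99)]
attach_lexical_units(frames, lus)
```

Grouping once up front keeps the association linear in the number of LUs plus frames, instead of scanning the full LU list for every frame.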

3. Relaxed Validators

Before:

LU_NAME_PATTERN = r"^[a-z][a-z0-9_\'-]*\.[a-z]+$"  # Too strict
LEXEME_NAME_PATTERN = r"^[A-Za-z][a-zA-Z0-9\'-]*$"  # Too strict

After:

LU_NAME_PATTERN = r"^.+\.[a-z]+$"  # Permissive: anything.pos
LEXEME_NAME_PATTERN = r"^.+$"      # Permissive: any non-empty string
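To see what the relaxation changes in practice, the old and new LU name patterns can be run against the example names listed under Validation Updates. The `OLD_LU` and `NEW_LU` variable names are chosen for this sketch only; the patterns themselves are copied from above.

```python
import re

OLD_LU = re.compile(r"^[a-z][a-z0-9_\'-]*\.[a-z]+$")
NEW_LU = re.compile(r"^.+\.[a-z]+$")

names = [
    "abandon.v",                # plain lowercase name
    "April.n",                  # proper noun (capital letter)
    "a bit.n",                  # multi-word expression
    "give up.v",                # multi-word expression
    "(can't) help.v",           # parentheses and a space
    "American [N and S Am].n",  # brackets and spaces
]

for name in names:
    old_ok = bool(OLD_LU.match(name))
    new_ok = bool(NEW_LU.match(name))
    print(f"{name!r}: old={old_ok} new={new_ok}")
```

Only "abandon.v" passes the old pattern, while all six pass the new one. The new pattern still requires a lowercase POS suffix after the final dot, so a bare "abandon" or "abandon.V" is rejected, but it no longer constrains the part before the dot at all.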

Impact

Before This Fix

  • ✗ All frames had empty lexical_units fields
  • ✗ Could not query frames by lexical unit name
  • ✗ More than 1,100 LUs rejected by the strict validators (91.3% success)
  • ✗ Missing critical FrameNet data

After This Fix

  • ✓ All frames include their lexical units with complete metadata
  • ✓ Can query frames by lexical unit name via frame index
  • ✓ Only 3 LUs rejected (99.98% success)
  • ✓ Complete FrameNet data coverage
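The percentages above follow from the counts given in this description. A quick arithmetic check, where the "before" rejection count is only what the rounded 91.3% figure implies and is not a number reported in this PR:

```python
TOTAL_LUS = 13_575      # LUs in the raw FrameNet data
PARSED_AFTER = 13_572   # LUs parsed with the relaxed validators

rejected_after = TOTAL_LUS - PARSED_AFTER
rate_after = round(100 * PARSED_AFTER / TOTAL_LUS, 2)

# Implied by the rounded 91.3% success rate; an estimate, not a reported count.
implied_rejected_before = round(TOTAL_LUS * (1 - 0.913))

print(rejected_after)           # 3
print(rate_after)               # 99.98
print(implied_rejected_before)  # 1181
```

The implied figure of about 1,181 is consistent with the "more than 1,100" rejections quoted for the strict validators.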

Example

>>> from glazing.framenet.loader import FrameNetLoader
>>> loader = FrameNetLoader()
>>> index = loader.build_frame_index(loader.frames)
>>> frame = index.get_frame_by_name("Abandonment")
>>> len(frame.lexical_units)
5
>>> frame.lexical_units[0].name
'abandon.v'
>>> frame.lexical_units[0].pos
'V'

Files Changed

Core Implementation

  • src/glazing/framenet/converter.py - Added LU parsing and frame association
  • src/glazing/framenet/types.py - Relaxed validation patterns
  • src/glazing/cli/convert.py - Updated CLI progress messages

Tests

  • tests/test_framenet/test_converter.py - Added 5 new LU parsing tests
  • tests/test_framenet/test_types.py - Updated validation tests
  • tests/test_framenet/test_models.py - Updated model tests

Version and Documentation

  • pyproject.toml - Version bump to 0.2.1
  • src/glazing/__version__.py - Version bump to 0.2.1
  • CHANGELOG.md - Added 0.2.1 entry
  • docs/ - Updated all version references (9 files)
  • .pre-commit-config.yaml - Fixed Python 3.13 compatibility for hooks

Testing

All tests, type checks, and lint checks pass; code coverage is 81%:

pytest tests/ -v --tb=short -q
# 1,338 tests passed, 81% code coverage

mypy --strict src/
# Success: no issues found

ruff check src/ tests/
# All checks passed

ruff format src/ tests/
# All files formatted correctly

Compatibility

  • ✓ Fully backwards compatible with v0.2.0
  • ✓ No API changes
  • ✓ No breaking changes
  • ⚠️ Users must reconvert FrameNet data to populate lexical units:
    glazing init --force

Checklist

  • Code follows project style guidelines (ruff, mypy)
  • All tests pass locally
  • Added tests for new functionality
  • Updated documentation
  • Updated CHANGELOG.md
  • Version bumped appropriately (0.2.0 → 0.2.1)
  • Release notes prepared

…xing validators to handle all actual data.
@aaronstevenwhite aaronstevenwhite merged commit 25044e1 into main Oct 28, 2025
9 checks passed
@aaronstevenwhite aaronstevenwhite deleted the fix/framenet-lexical-units-loading branch October 28, 2025 15:48
Development

Successfully merging this pull request may close these issues.

FrameNet Lexical Units Not Loaded in Converted Data
