Conversation

Contributor

@PeterReid PeterReid commented May 13, 2025

Previously, two spaces (for example, between sentences) caused the token following the spaces to be prefixed by a space. The prefixed token then failed the lexicon lookup and was passed, leading space and all, into the fallback.

There are a few places this strip() call could have gone, but I tried to choose the most sensible one. Wherever it goes, it needs to be somewhere that affects both the lexicon check and the fallback. As an example of what was going on:

"Sentence(SPACE)one.(SPACE)(SPACE)And(SPACE)two." would end up trying to find " And" in the lexicon, not finding it, and then passing that string to the fallback. The transformer-based fallback has no idea what to make of that leading space (as it's not in the training set at all) and was making it into random-seeming phonemes.

joshwhiton added a commit to joshwhiton/misaki that referenced this pull request Dec 30, 2025
- PR hexgrad#90: Restrict spacy<4 to avoid pre-release/yanked versions
  Fixes Python 3.13 compatibility issues with thinc/blis dependencies

- PR hexgrad#79: Strip whitespace from merged tokens
  Fixes lexicon lookup failures when multiple spaces appear between words