Conversation

Contributor

@PeterReid PeterReid commented May 13, 2025

Previously, two spaces (for example, between sentences) caused the token following the spaces to be prefixed by a space. The prefixed token then failed the lexicon lookup and was passed, leading space and all, into the fallback.

There are a few places this strip() call could have gone, but I tried to choose the most sensible one. Wherever it goes, it needs to be somewhere that affects both the lexicon check and the fallback. As an example of what was going on:

"Sentence(SPACE)one.(SPACE)(SPACE)And(SPACE)two." would end up trying to find " And" in the lexicon, not finding it, and then passing that string to the fallback. The transformer-based fallback has no idea what to make of that leading space (as it's not in the training set at all) and was making it into random-seeming phonemes.

joshwhiton added a commit to joshwhiton/misaki that referenced this pull request Dec 30, 2025
- PR hexgrad#90: Restrict spacy<4 to avoid pre-release/yanked versions
  Fixes Python 3.13 compatibility issues with thinc/blis dependencies

- PR hexgrad#79: Strip whitespace from merged tokens
  Fixes lexicon lookup failures when multiple spaces appear between words