Ensure tokens don't end up with leading or trailing whitespace #79
Previously, two consecutive spaces (for example, between sentences) left the token following the spaces prefixed with a space. The prefixed token then failed the lexicon lookup and was passed, space and all, into the fallback.
There are a few places this strip() call could have gone, but I tried to choose the most sensible one. Wherever it goes, it needs to be somewhere that affects both the lexicon check and the fallback. As an example of what was going on:
"Sentence(SPACE)one.(SPACE)(SPACE)And(SPACE)two." would end up trying to find " And" in the lexicon, not finding it, and then passing that string to the fallback. The transformer-based fallback has no idea what to make of that leading space (as it's not in the training set at all) and was making it into random-seeming phonemes.