Update regex.rs to not split words on combining marks and diacritics #4

Open
ajaykg wants to merge 1 commit into gnp:master from ajaykg:patch-1

ajaykg commented May 6, 2024

More than half of the world's population appears to speak languages written with Unicode combining marks, such as accents and matras, in the middle of words. The GPT / tiktoken regular expressions seem to split such words at those marks, which prevents merges of characters that should merge. I edited the regular expression so that it no longer splits on combining characters.
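
To make the reported splitting concrete, here is a minimal sketch using only the Rust standard library. It is not the code in regex.rs and not the actual tiktoken pattern. It models a letter-run rule with hand-written predicates and compares treating combining marks as non-word characters against treating them as word characters. The names `is_mark`, `is_word_char`, and `split_runs` are hypothetical, and `is_mark` covers only a small, illustrative subset of the Unicode Mark category (the Combining Diacritical Marks block and the Devanagari signs).

```rust
/// Illustrative subset of Unicode combining marks (general category M).
/// Assumption: only these ranges are covered; a real tokenizer would rely
/// on a full Unicode-aware regex class instead.
fn is_mark(c: char) -> bool {
    matches!(
        c,
        '\u{0300}'..='\u{036F}'       // Combining Diacritical Marks block
            | '\u{0900}'..='\u{0903}' // Devanagari candrabindu, anusvara, visarga
            | '\u{093A}'..='\u{093C}' // Devanagari vowel signs OE, OOE and nukta
            | '\u{093E}'..='\u{094F}' // Devanagari matras and virama
    )
}

/// Rough stand-in for "letter", optionally widened to "letter or mark".
fn is_word_char(c: char, keep_marks: bool) -> bool {
    let mark = is_mark(c);
    (c.is_alphabetic() && !mark) || (keep_marks && mark)
}

/// Splits `text` into maximal runs of word and non-word characters.
/// With `keep_marks == false`, every combining mark ends the current word.
fn split_runs(text: &str, keep_marks: bool) -> Vec<&str> {
    let mut pieces = Vec::new();
    let mut start = 0;
    let mut prev: Option<bool> = None;
    for (i, c) in text.char_indices() {
        let class = is_word_char(c, keep_marks);
        if let Some(p) = prev {
            if p != class {
                // The classification changed, so close the current run.
                pieces.push(&text[start..i]);
                start = i;
            }
        }
        prev = Some(class);
    }
    if start < text.len() {
        pieces.push(&text[start..]);
    }
    pieces
}
```

Under these assumptions, the Hindi word "\u{0939}\u{093F}\u{0902}\u{0926}\u{0940}" comes back as four pieces from `split_runs(word, false)` and as a single piece from `split_runs(word, true)`. A byte-pair merge step that only merges within a piece could therefore never join a consonant with its matra in the first case.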

gnp (Owner) commented May 17, 2024

Previously, the regex tokenizer defaulted to behavior that was (hopefully) identical to GPT-4. Does your proposed change fix cases where we were not compatible with GPT-4 and just didn't have the test cases to prove it?
