Update regex.rs to not split words on combining marks and diacritics by ajaykg · Pull Request #4 · gnp/minbpe-rs

ajaykg · 2024-05-06T00:41:15Z

Over half the world's population seems to speak languages that use Unicode combining marks, such as accents and matras, inside words. The GPT / tiktoken regular expressions seem to split such words at those marks, which prevents merges of characters that should merge. I edited the regular expression so that it no longer splits on such combining characters.
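To illustrate the splitting problem described above, here is a minimal std-only sketch. It is not the code from this PR and not the full GPT-4 split pattern: `is_mark`, `is_letter`, and `chunks` are hypothetical helpers, `is_mark` covers only a few Unicode ranges as a rough stand-in for `\p{M}`, and `is_letter` only approximates `\p{L}`. It compares a letters-only word class with one that also accepts combining marks, on the Hindi word "हिन्दी".

```rust
/// Rough stand-in for `\p{M}` (assumption: only a few ranges, for illustration).
fn is_mark(c: char) -> bool {
    matches!(
        c as u32,
        0x0300..=0x036F       // Combining Diacritical Marks
            | 0x0900..=0x0903 // Devanagari candrabindu, anusvara, visarga
            | 0x093E..=0x094F // Devanagari vowel signs (matras) and virama
    )
}

/// Rough stand-in for `\p{L}`: alphabetic characters that are not marks.
fn is_letter(c: char) -> bool {
    c.is_alphabetic() && !is_mark(c)
}

/// Split `text` into maximal runs of characters that all satisfy,
/// or all fail, the `is_word` predicate.
fn chunks(text: &str, is_word: impl Fn(char) -> bool) -> Vec<String> {
    let mut out: Vec<String> = Vec::new();
    let mut prev: Option<bool> = None;
    for c in text.chars() {
        let cur = is_word(c);
        if prev == Some(cur) {
            // Same class as the previous character: extend the current run.
            out.last_mut().unwrap().push(c);
        } else {
            // Class changed (or first character): start a new run.
            out.push(c.to_string());
        }
        prev = Some(cur);
    }
    out
}

fn main() {
    // "हिन्दी" spelled out by code point: ha, vowel sign i, na, virama, da, vowel sign ii.
    let word = "\u{0939}\u{093F}\u{0928}\u{094D}\u{0926}\u{0940}";

    // Letters-only word class: every mark ends the current run (6 chunks).
    let letters_only = chunks(word, is_letter);
    // Letters plus marks: the word stays in one piece (1 chunk).
    let with_marks = chunks(word, |c| is_letter(c) || is_mark(c));

    println!("{} chunks: {:?}", letters_only.len(), letters_only);
    println!("{} chunk: {:?}", with_marks.len(), with_marks);
}
```

Because BPE merges only happen inside a chunk, the letters-only split never lets a consonant merge with its own vowel sign. The real pattern has more alternatives, so its exact chunk boundaries differ from this simplified run splitter.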

gnp · 2024-05-17T01:04:15Z

Previously, the regex tokenizer defaulted to behavior that was (hopefully) identical to GPT-4. Does your proposed change fix cases where we were not compatible with GPT-4 and simply lacked the test cases to prove it?

ajaykg wants to merge 1 commit into gnp:master from ajaykg:patch-1