Commit 97df007

Adopt uax29 segmenter
Replacing blevesearch/segment. ~2x throughput improvement. Refactor allocations, now ~O(1). Add tests & multilingual sample text to ensure identical behavior.

Known differences from previous segmenter:

- The original segmenter splits runs of spaces into separate tokens; uax29 concatenates runs into a single token.
- The original segmenter doesn't handle emoji skin tone modifiers; the new one does, attributable to the Unicode version update.
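The first known difference (how runs of spaces are tokenized) can be illustrated with a small standard-library sketch. This is not the code of either segmenter and does not implement the UAX #29 word-boundary rules. The `tokenize` function and its `mergeSpaces` flag are hypothetical names introduced only for illustration, and the tokenizer simply groups runes by whether they are whitespace.

```go
package main

import (
	"fmt"
	"unicode"
)

// tokenize is a hypothetical, simplified tokenizer (NOT real UAX #29).
// Consecutive non-space runes always form one token. Whitespace runs are
// emitted one rune per token when mergeSpaces is false (the behavior the
// commit message attributes to the original segmenter), or as a single
// token when mergeSpaces is true (the behavior attributed to uax29).
func tokenize(s string, mergeSpaces bool) []string {
	var tokens []string
	runes := []rune(s)
	start := 0
	for i := 1; i <= len(runes); i++ {
		if i < len(runes) {
			prevSpace := unicode.IsSpace(runes[i-1])
			currSpace := unicode.IsSpace(runes[i])
			// Extend the current token while both runes are in the same
			// class, unless this is a space run and spaces are split.
			if prevSpace == currSpace && (!prevSpace || mergeSpaces) {
				continue
			}
		}
		tokens = append(tokens, string(runes[start:i]))
		start = i
	}
	return tokens
}

func main() {
	input := "a  b" // two spaces between the letters
	fmt.Printf("split:  %q\n", tokenize(input, false))
	fmt.Printf("merged: %q\n", tokenize(input, true))
}
```

For the input `"a  b"`, the split mode yields four tokens (two of them single spaces) and the merged mode yields three (the middle one being the two-space run). Code that counts tokens or indexes into them may therefore see different results after a change like this.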
1 parent 195a44a commit 97df007

5 files changed

+566
-109
lines changed
