Commit 97df007
Adopt uax29 segmenter
Replace blevesearch/segment, with a ~2x throughput improvement. Refactor allocations, which are now ~O(1).
Add tests & multilingual sample text to verify that behavior matches the previous segmenter. Known differences from the previous segmenter:
- The original segmenter splits runs of spaces into separate tokens; uax29 concatenates runs into a single token.
- The original segmenter doesn’t handle emoji skin tone modifiers; the new one does, which is attributable to the Unicode version update.
uax 29 | 1 parent 195a44a | commit 97df007
5 files changed, +566 -109 lines changed
File tree:
- analysis/tokenizer/unicode
- testdata