Minimal refactoring to use char[] instead of string for core processing by EslaMx7 · Pull Request #14 · Lokad/Tokenizers

EslaMx7 · 2024-10-15T05:33:46Z

Summary

This is a minimal change that keeps the unit tests passing after switching to char[]. It is meant to highlight the areas of the code that rely heavily on string. In some of these areas we will need to reconstruct or allocate new strings; in others we may have to rethink the logic so that it works with char[].

Open Questions:

Should we convert the string properties of all other classes to char[], similar to Node.Text?
What methods or tools should we use to measure the performance impact (speed, memory allocation, etc.) of this refactoring?

"If you can’t measure it, you can’t improve it." – Peter Drucker

Currently, I can only track the total duration of all unit tests, which is not a precise metric because each tokenization test case runs for less than one second.
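As a rough starting point for the measurement question, one option is to time a call in a loop with Stopwatch and read GC.GetAllocatedBytesForCurrentThread() before and after. The sketch below is only an illustration: `AllocationProbe` and `Measure` are hypothetical names that do not exist in this repository, and a dedicated harness such as BenchmarkDotNet with its memory diagnoser would likely give more trustworthy numbers.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper, not part of Lokad/Tokenizers.
public static class AllocationProbe
{
    // Runs `action` `iterations` times and returns the elapsed time and the
    // number of bytes allocated on the current thread during the loop.
    public static (TimeSpan Elapsed, long AllocatedBytes) Measure(Action action, int iterations)
    {
        action(); // warm-up call so first-call JIT cost is not counted

        var stopwatch = new Stopwatch(); // created before sampling so it is not counted
        long before = GC.GetAllocatedBytesForCurrentThread();
        stopwatch.Start();

        for (int i = 0; i < iterations; i++)
        {
            action();
        }

        stopwatch.Stop();
        long after = GC.GetAllocatedBytesForCurrentThread();
        return (stopwatch.Elapsed, after - before);
    }
}
```

Wrapping one tokenizer call in the lambda on master and again on this branch would give a first comparison of elapsed time and allocated bytes. The figures are only indicative, since nothing here controls for tiered JIT compilation or garbage collection pauses.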

Changes

Refactor Token class to use char[] instead of string

Updated Token.Text from string to char[], requiring extensive changes across the codebase. Methods now handle char[] and ReadOnlySpan<char> for text manipulation.
Updated constructors, methods, and utilities to support the new type.
Added extension methods for SpanRuneEnumerator to List<Rune> conversion.
Ensured all text processing functions in TokenizationUtils, BaseTokenizer, XLMRobertaTokenizer, and SentencePieceModel are compatible with char[].
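For reviewers unfamiliar with SpanRuneEnumerator, an extension of the kind mentioned in the list above could look roughly like this. It is a minimal sketch, not the code in this PR: the class name `SpanRuneEnumeratorExtensions` is an assumption, and the body is one plausible way to do the conversion.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical sketch; the actual extension in this PR may differ.
public static class SpanRuneEnumeratorExtensions
{
    // SpanRuneEnumerator is a ref struct that does not implement IEnumerable<Rune>,
    // so LINQ's ToList() does not apply and the runes are copied by hand.
    public static List<Rune> ToList(this SpanRuneEnumerator runes)
    {
        var list = new List<Rune>();
        foreach (Rune rune in runes)
        {
            list.Add(rune);
        }
        return list;
    }
}
```

With a char[] named `text`, the call would read `new ReadOnlySpan<char>(text).EnumerateRunes().ToList()`. Note that the resulting list allocates, so it only makes sense where random access to runes is really needed.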

Refactored various methods and classes to use `char[]` instead of `string` for improved performance and memory efficiency. Updated method signatures, parameter types, and internal logic accordingly.
- `BaseTokenizer.cs`: Updated `SplitOnSpecialTokens` and `SplitOnSubstr` to use `Func<char[], (int, int, Mask)>`.
- `TokenizationUtils`: Removed and reintroduced `ToList` extension for `SpanRuneEnumerator`. Refactored methods like `SubstringRunes`, `GetUtf8BytesCount`, and `SubstringByByteOffset` to work with `char[]`.
- `SentencePieceUnigramModel.cs`: Modified token text processing to use `char[]`, including `SubstringByByteOffset` method calls.
- `TokenizationUtilsTests`: Updated test cases to convert `string` to `char[]`.
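To illustrate what byte-offset slicing over `char[]` involves, here is a hedged sketch of helpers in the spirit of `GetUtf8BytesCount` and `SubstringByByteOffset`. The signatures, the `Utf8OffsetSketch` class name, and the choice to throw on offsets that fall inside a rune are all assumptions; the real methods in `TokenizationUtils` may behave differently. Taking `ReadOnlySpan<char>` lets callers pass a `char[]` directly and get a slice back without allocating a new string.

```csharp
using System;
using System.Text;

// Hypothetical sketch; not the implementation in TokenizationUtils.
public static class Utf8OffsetSketch
{
    // Number of bytes the text would occupy once encoded as UTF-8.
    public static int GetUtf8BytesCount(ReadOnlySpan<char> text)
        => Encoding.UTF8.GetByteCount(text);

    // Returns the slice of `text` covering UTF-8 bytes [byteStart, byteEnd).
    public static ReadOnlySpan<char> SubstringByByteOffset(
        ReadOnlySpan<char> text, int byteStart, int byteEnd)
    {
        if (byteStart < 0 || byteEnd < byteStart)
            throw new ArgumentOutOfRangeException(nameof(byteStart));

        int bytePos = 0, charPos = 0;
        int charStart = -1, charEnd = -1;

        // Walk rune by rune, tracking the UTF-8 and UTF-16 positions in parallel.
        foreach (Rune rune in text.EnumerateRunes())
        {
            if (bytePos == byteStart) charStart = charPos;
            if (bytePos == byteEnd) { charEnd = charPos; break; }
            bytePos += rune.Utf8SequenceLength;
            charPos += rune.Utf16SequenceLength;
        }

        // Offsets that point exactly at the end of the text.
        if (charStart < 0 && bytePos == byteStart) charStart = charPos;
        if (charEnd < 0 && bytePos == byteEnd) charEnd = charPos;

        if (charStart < 0 || charEnd < 0)
            throw new ArgumentException("Byte offsets must fall on rune boundaries within the text.");

        return text.Slice(charStart, charEnd - charStart);
    }
}
```

The returned span is a view over the caller's array, so it stays valid only as long as that array is not modified. Calling `ToString()` on it reintroduces the allocation this PR is trying to avoid.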

EslaMx7 added 2 commits October 14, 2024 13:58

EslaMx7 added the enhancement New feature or request label Oct 15, 2024

EslaMx7 requested review from ClementLokad and vermorel October 15, 2024 05:33

EslaMx7 self-assigned this Oct 15, 2024

EslaMx7 wants to merge 2 commits into master from perf-string-to-chars
