Minimal refactoring to user char[] instead of string for core processing#14
Open
Minimal refactoring to user char[] instead of string for core processing#14
Conversation
Updated Token.Text from string to char[], requiring extensive changes across the codebase. Methods now handle char[] and ReadOnlySpan<char> for text manipulation. Updated constructors, methods, and utilities to support the new type. Added extension methods for SpanRuneEnumerator to List<Rune> conversion. Ensured all text processing functions in TokenizationUtils, BaseTokenizer, XLMRobertaTokenizer, and SentencePieceModel are compatible with char[].
Refactored various methods and classes to use `char[]` instead of `string` for improved performance and memory efficiency. Updated method signatures, parameter types, and internal logic accordingly. - `BaseTokenizer.cs`: Updated `SplitOnSpecialTokens` and `SplitOnSubstr` to use `Func<char[], (int, int, Mask)>`. - `TokenizationUtils`: Removed and reintroduced `ToList` extension for `SpanRuneEnumerator`. Refactored methods like `SubstringRunes`, `GetUtf8BytesCount`, and `SubstringByByteOffset` to work with `char[]`. - `SentencePieceUnigramModel.cs`: Modified token text processing to use `char[]`, including `SubstringByByteOffset` method calls. - `TokenizationUtilsTests`: Updated test cases to convert `string` to `char[]`.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
This is a basic minimal effort to maintain unit test functionality after transitioning to
char[]. I aim to highlight the specific areas in the code that are heavily reliant onstring. For some of these areas, we will need to reconstruct/allocate new strings, while for others, we might have to rethink the logic to accommodatechar[].Open Questions:
char[], similar toNode.Text?Currently, I can track the total duration of all unit tests, though it's not a precise metric, as each tokenization test case runs for less than one second.
Changes
Refactor Token class to use char[] instead of string
char[], requiring extensive changes across the codebase. Methods now handlechar[]andReadOnlySpan<char>for text manipulation.SpanRuneEnumeratorto List conversion.TokenizationUtils,BaseTokenizer,XLMRobertaTokenizer, andSentencePieceModelare compatible withchar[].