Parallelize training for RegexTokenizer by SingularityT3 · Pull Request #5 · gnp/minbpe-rs

SingularityT3 · 2025-08-13T17:16:42Z

When training a RegexTokenizer, both bigram frequency counting (update_stats()) and merge() are easy to parallelize: split the ids vector into chunks and have multiple threads process the chunks simultaneously. The order of the stats IndexMap is still preserved, because the threads are joined sequentially and their results are combined in chunk order.

On a 75 MB dataset, this change reduced the training time for 400 merges from ~8 min to ~3.5 min on a 6-core, 12-thread CPU.
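
For illustration, here is a minimal sketch of the chunked approach described above. It is not the code from this PR. It assumes that ids is a list of per-piece token sequences (so no bigram spans two pieces), it uses std::thread::scope rather than rayon, and it swaps in a small std-only `OrderedCounts` type as a stand-in for IndexMap. All names here (`OrderedCounts`, `count_pairs`, `parallel_count_pairs`, `merge_piece`, `parallel_merge`) are hypothetical.

```rust
use std::collections::HashMap;
use std::thread;

type Pair = (u32, u32);

/// Insertion-ordered pair counts: a std-only stand-in for an IndexMap (assumption).
#[derive(Default)]
struct OrderedCounts {
    index: HashMap<Pair, usize>, // pair -> position in `entries`
    entries: Vec<(Pair, usize)>, // (pair, count) in first-seen order
}

impl OrderedCounts {
    fn add(&mut self, pair: Pair, n: usize) {
        match self.index.get(&pair) {
            Some(&i) => self.entries[i].1 += n,
            None => {
                self.index.insert(pair, self.entries.len());
                self.entries.push((pair, n));
            }
        }
    }
}

/// Sequential bigram count over a slice of pieces.
fn count_pairs(pieces: &[Vec<u32>]) -> OrderedCounts {
    let mut counts = OrderedCounts::default();
    for piece in pieces {
        for w in piece.windows(2) {
            counts.add((w[0], w[1]), 1);
        }
    }
    counts
}

/// Chunk length so that at most `n_threads` chunks are produced (never zero).
fn chunk_len(len: usize, n_threads: usize) -> usize {
    let n = n_threads.max(1);
    ((len + n - 1) / n).max(1)
}

/// Count bigrams per chunk in parallel, then combine the partial results in
/// chunk order so that first-seen order matches the sequential version.
fn parallel_count_pairs(ids: &[Vec<u32>], n_threads: usize) -> Vec<(Pair, usize)> {
    let partials: Vec<OrderedCounts> = thread::scope(|s| {
        let handles: Vec<_> = ids
            .chunks(chunk_len(ids.len(), n_threads))
            .map(|chunk| s.spawn(move || count_pairs(chunk)))
            .collect();
        // Joining in spawn order keeps the partial results in chunk order.
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });
    let mut total = OrderedCounts::default();
    for part in partials {
        for (pair, n) in part.entries {
            total.add(pair, n);
        }
    }
    total.entries
}

/// Replace every non-overlapping occurrence of `pair` in one piece with `new_id`.
fn merge_piece(piece: &[u32], pair: Pair, new_id: u32) -> Vec<u32> {
    let mut out = Vec::with_capacity(piece.len());
    let mut i = 0;
    while i < piece.len() {
        if i + 1 < piece.len() && (piece[i], piece[i + 1]) == pair {
            out.push(new_id);
            i += 2;
        } else {
            out.push(piece[i]);
            i += 1;
        }
    }
    out
}

/// Apply the merge to each chunk of pieces in parallel, preserving piece order.
fn parallel_merge(ids: &[Vec<u32>], pair: Pair, new_id: u32, n_threads: usize) -> Vec<Vec<u32>> {
    thread::scope(|s| {
        let handles: Vec<_> = ids
            .chunks(chunk_len(ids.len(), n_threads))
            .map(|chunk| {
                s.spawn(move || {
                    chunk.iter().map(|p| merge_piece(p, pair, new_id)).collect::<Vec<_>>()
                })
            })
            .collect();
        handles.into_iter().flat_map(|h| h.join().unwrap()).collect()
    })
}
```

In this sketch, a training loop would call `parallel_count_pairs` to pick the most frequent pair and then `parallel_merge` to apply it. Because the partial counts are combined in chunk order, ties between equally frequent pairs should resolve the same way as in a sequential pass.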

SingularityT3 added 2 commits August 13, 2025 22:15

Parallelize training for RegexTokenizer

479bab4

Switch to rayon, parallel encoding

62dc9b6

SingularityT3 wants to merge 2 commits into gnp:master from SingularityT3:master
