Parallelize training for RegexTokenizer #5

Open
SingularityT3 wants to merge 2 commits into gnp:master from SingularityT3:master

@SingularityT3
Calculating bigram frequencies (update_stats()) and applying merges (merge()) can easily be done in parallel when training a RegexTokenizer: split the ids vector into chunks and have multiple threads perform the operations on the chunks simultaneously. The order of the stats IndexMap is still preserved because the threads are joined sequentially, in chunk order.
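
For readers who want to see the shape of the idea, here is a minimal std-only sketch. It is not the code in this PR. It assumes ids is a vector of token sequences (one per regex-split piece of text), so no bigram spans a chunk boundary. OrderedCounts is a hypothetical stand-in for the IndexMap, and count_pairs_parallel, merge_parallel, and slice_len are illustrative names rather than functions from the crate.

```rust
use std::collections::HashMap;
use std::thread;

type Pair = (u32, u32);

/// Insertion-ordered pair counts: a hypothetical std-only stand-in for an IndexMap.
#[derive(Default, Debug, PartialEq)]
struct OrderedCounts {
    entries: Vec<(Pair, usize)>,  // (pair, count) in first-seen order
    index: HashMap<Pair, usize>,  // pair -> position in `entries`
}

impl OrderedCounts {
    fn add(&mut self, pair: Pair, n: usize) {
        match self.index.get(&pair) {
            Some(&i) => self.entries[i].1 += n,
            None => {
                self.index.insert(pair, self.entries.len());
                self.entries.push((pair, n));
            }
        }
    }
}

/// Sequential baseline: count adjacent pairs inside each sequence.
fn count_pairs(seqs: &[Vec<u32>]) -> OrderedCounts {
    let mut counts = OrderedCounts::default();
    for seq in seqs {
        for w in seq.windows(2) {
            counts.add((w[0], w[1]), 1);
        }
    }
    counts
}

/// Replace every non-overlapping occurrence of `pair` in `seq` with `new_id`.
fn merge(seq: &[u32], pair: Pair, new_id: u32) -> Vec<u32> {
    let mut out = Vec::with_capacity(seq.len());
    let mut i = 0;
    while i < seq.len() {
        if i + 1 < seq.len() && (seq[i], seq[i + 1]) == pair {
            out.push(new_id);
            i += 2;
        } else {
            out.push(seq[i]);
            i += 1;
        }
    }
    out
}

/// Sequences per thread: ceil(total / n_threads), never 0.
fn slice_len(total: usize, n_threads: usize) -> usize {
    let n = n_threads.max(1);
    ((total + n - 1) / n).max(1)
}

/// Count pairs on several threads, then combine the partial results in chunk order.
fn count_pairs_parallel(seqs: &[Vec<u32>], n_threads: usize) -> OrderedCounts {
    let size = slice_len(seqs.len(), n_threads);
    thread::scope(|s| {
        // Spawn every worker first; `collect` forces all spawns before any join.
        let handles: Vec<_> = seqs
            .chunks(size)
            .map(|part| s.spawn(move || count_pairs(part)))
            .collect();
        // Joining in spawn order keeps the first-seen order of the sequential pass.
        let mut total = OrderedCounts::default();
        for h in handles {
            let local = h.join().expect("worker thread panicked");
            for (pair, n) in local.entries {
                total.add(pair, n);
            }
        }
        total
    })
}

/// Apply one merge to all sequences, each thread owning a disjoint slice.
fn merge_parallel(seqs: &mut [Vec<u32>], pair: Pair, new_id: u32, n_threads: usize) {
    let size = slice_len(seqs.len(), n_threads);
    thread::scope(|s| {
        for part in seqs.chunks_mut(size) {
            s.spawn(move || {
                for seq in part.iter_mut() {
                    let merged = merge(seq.as_slice(), pair, new_id);
                    *seq = merged;
                }
            });
        }
    });
}
```

Because each worker records pairs in the order it first sees them and the partial results are folded in chunk order, the combined counts should list pairs in the same order as a single-threaded pass. That matters if ties between equally frequent pairs are broken by insertion order.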

On a 75 MB dataset, this change reduced the training time for 400 merges from ~8 min to ~3.5 min on a 6-core, 12-thread CPU.
