Simple, fast Burmese text tokenization. No fancy stuff, just gets the job done.
## Install

```bash
pip install burmese-tokenizer
```

## Usage

```python
from burmese_tokenizer import BurmeseTokenizer

tokenizer = BurmeseTokenizer()
text = "မင်္ဂလာပါ။ နေကောင်းပါသလား။"

# tokenize
tokens = tokenizer.encode(text)
print(tokens["pieces"])
# ['▁မင်္ဂလာ', '▁ပါ', '။', '▁နေ', '▁ကောင်း', '▁ပါ', '▁သလား', '။']

# decode back
text = tokenizer.decode(tokens["pieces"])
print(text)
# မင်္ဂလာပါ။ နေကောင်းပါသလား။
```
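The leading `▁` on each piece is the SentencePiece-style word-boundary marker. As a rough mental model only (a sketch of the general convention, not this library's actual `decode`, which restores the original Burmese spacing), pieces are concatenated and `▁` becomes a space:

```python
# Sketch of the generic SentencePiece joining convention.
# Assumption: burmese-tokenizer's real decode applies Burmese-specific
# spacing on top of this, so its output differs from this naive join.
pieces = ['▁မင်္ဂလာ', '▁ပါ', '။', '▁နေ', '▁ကောင်း', '▁ပါ', '▁သလား', '။']

joined = "".join(pieces).replace("▁", " ").strip()
print(joined)  # a space at every ▁ boundary; the library's decode is smarter
```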
## CLI

```bash
# tokenize
burmese-tokenizer "မင်္ဂလာပါ။"

# show details
burmese-tokenizer -v "မင်္ဂလာပါ။"

# decode tokens
burmese-tokenizer -d -t "▁မင်္ဂလာ,▁ပါ,။"
```

## API

- `encode(text)` - tokenize text
- `decode(pieces)` - convert tokens back to text
- `decode_ids(ids)` - convert ids to text
- `get_vocab_size()` - vocabulary size
- `get_vocab()` - full vocabulary
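Since `decode_ids(ids)` maps ids back to pieces through the vocabulary, it can be pictured as a lookup followed by the usual join step. A toy sketch, where the vocab entries and ids are made up for illustration (the real ids come from the trained model, and the real decode handles spacing itself):

```python
# Hypothetical stand-in for the tokenizer's vocabulary; real ids differ.
vocab = {"▁မင်္ဂလာ": 5, "▁ပါ": 9, "။": 3}
id_to_piece = {i: p for p, i in vocab.items()}

def decode_ids_sketch(ids):
    # look up each id, then join pieces SentencePiece-style
    return "".join(id_to_piece[i] for i in ids).replace("▁", " ").strip()

print(decode_ids_sketch([5, 9, 3]))
```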
## License

MIT - Do whatever you want with it.