Description
@chengchingwen, would you mind sharing some pointers on implementing the ModernBERT tokenizer?
I've tried it both from scratch and with your package, and I can't get either to work -- example below.
Source: https://huggingface.co/answerdotai/ModernBERT-base/blob/main/tokenizer_config.json
My attempt: https://github.com/svilupp/ModernBert.jl
With your package, I loaded the merges and tried to mimic how you use it in some of the tests/examples: https://github.com/svilupp/ModernBert.jl/blob/01819a6a762eb0d6ff8ca0c63e3d0418b9a48ce9/src/bytepair.jl#L164
Failing examples: https://github.com/svilupp/ModernBert.jl/blob/01819a6a762eb0d6ff8ca0c63e3d0418b9a48ce9/examples/verify.jl#L34
text = "The capital of France is [MASK]."
tokens = tokenize(tokenizer, text)
@test tokens ==
    ["[CLS]", "The", "Ġcapital", "Ġof", "ĠFrance", "Ġis", " [MASK]", ".", "[SEP]"]

I'm struggling to capture the special token [MASK] correctly.
Test Failed at /Users/simljx/Documents/GitHub/ModernBert.jl/examples/verify.jl:34
Expression: tokens2 == ["[CLS]", "The", "Ġcapital", "Ġof", "ĠFrance", "Ġis", " [MASK]", ".", "[SEP]"]
Evaluated: ["[CLS]", "The", "Ġcapital", "Ġof", "ĠFrance", "Ġis", "Ġ", **"[MASK]",** ".", "[SEP]"] == ["[CLS]", "The", "Ġcapital", "Ġof", "ĠFrance", "Ġis", **" [MASK]",** ".", "[SEP]"]
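One possible reading of the mismatch above: the expected output has a single `" [MASK]"` token that includes the preceding space, while my pipeline hands that space to the BPE (which emits `"Ġ"`) and only then matches `"[MASK]"`. A minimal sketch of splitting out special tokens before BPE, letting selected tokens swallow the whitespace in front of them. The names `split_specials` and `lstrip_tokens` are hypothetical (they are not part of TextEncodeBase or BytePairEncoding), and the assumption that the reference tokenizer strips whitespace to the left of `[MASK]` is inferred only from the expected tokens:

```julia
# Hypothetical helper, not a TextEncodeBase API.
# Splits `text` into (piece, is_special) pairs. Tokens listed in
# `lstrip_tokens` also absorb any whitespace directly before them
# (ASSUMPTION: this is what the reference tokenizer does for [MASK]).
function split_specials(text::AbstractString, specials::Vector{String};
                        lstrip_tokens::Vector{String} = String[])
    # One alternation over all special tokens; \Q...\E makes "[" and "]" literal.
    alts = map(specials) do tok
        quoted = "\\Q" * tok * "\\E"
        tok in lstrip_tokens ? "\\s*" * quoted : quoted
    end
    pattern = Regex(join(alts, "|"))

    pieces = Tuple{String,Bool}[]
    pos = firstindex(text)
    for m in eachmatch(pattern, text)
        # Ordinary text before the special token goes to the BPE later.
        if m.offset > pos
            push!(pieces, (text[pos:prevind(text, m.offset)], false))
        end
        push!(pieces, (String(m.match), true))
        pos = m.offset + ncodeunits(m.match)
    end
    # Trailing ordinary text after the last special token.
    if pos <= lastindex(text)
        push!(pieces, (text[pos:end], false))
    end
    return pieces
end

pieces = split_specials("The capital of France is [MASK].",
                        ["[CLS]", "[SEP]", "[MASK]"];
                        lstrip_tokens = ["[MASK]"])
```

With this split, only the pieces flagged `false` would go through the BPE, so the text before the mask ends at `is` and there is no stray space left to become `"Ġ"`. Whether `MatchTokenization` can be configured to behave this way is exactly what I'm unsure about.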
I tried using MatchTokenization and adding the special tokens. I also tried introducing my own tokenization layer (MaskTokenization) to fix it manually, but it sits too high up in the stack -- it makes no difference:
# Create tokenizer pipeline
base_tokenizer = BPE(bpe_merges)
tokenizer = BPETokenizer(
    TextEncodeBase.MatchTokenization(
        MaskTokenization(
            CodeNormalizer(
                BPETokenization(
                    GPT2Tokenization(),
                    base_tokenizer
                ),
                gpt2_codemap()
            ),
            "[MASK]"),
        collect(keys(special_tokens))
    )
)

Without match tokenization, it splits up the special tokens, e.g.,
"Ġ[", "MASK", "]."
Tokenizer settings:
"normalizer": {
    "type": "NFC"
},
"pre_tokenizer": {
    "type": "ByteLevel",
    "add_prefix_space": false,
    "trim_offsets": true,
    "use_regex": true
},
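For context on what these two settings imply, here is a small sketch of NFC normalization followed by a byte-level mapping. It assumes `ByteLevel` follows the GPT-2 convention (printable bytes map to themselves, all other bytes are shifted to code points from 256 upward), which is where `Ġ` for a space comes from. The function names `byte_to_char_table` and `byte_level` are hypothetical and only for illustration; in the pipeline above this role is presumably played by `gpt2_codemap()`:

```julia
using Unicode

# GPT-2 style byte-to-character table (ASSUMPTION: ByteLevel uses this scheme).
# Printable bytes keep their own code point; the rest are remapped to 256, 257, ...
function byte_to_char_table()
    printable = vcat(collect(0x21:0x7e), collect(0xa1:0xac), collect(0xae:0xff))
    table = Dict{UInt8,Char}()
    n = 0
    for b in 0x00:0xff
        if b in printable
            table[b] = Char(b)
        else
            table[b] = Char(256 + n)
            n += 1
        end
    end
    return table
end

# Apply the "normalizer" (NFC), then map each UTF-8 byte to its stand-in character.
function byte_level(text::AbstractString; table = byte_to_char_table())
    normalized = Unicode.normalize(text, :NFC)
    return join(table[b] for b in codeunits(normalized))
end

mapped = byte_level(" capital")
```

Since `add_prefix_space` is `false`, nothing is prepended to the input, which is consistent with the first expected token being `"The"` rather than `"ĠThe"`.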
Would you have any pointers on where to start? I'm not sure what else to try.