
Help with ModernBert tokenizer (BPE+special tokens) #8

@svilupp

Description

@chengchingwen, would you mind sharing some pointers on implementing the ModernBERT tokenizer?

I've tried building it both from scratch and with your package, and I can't get either to work -- example below.

Source: https://huggingface.co/answerdotai/ModernBERT-base/blob/main/tokenizer_config.json
My attempt: https://github.com/svilupp/ModernBert.jl

I tried using your package to load the merges and to mimic how it's used in some of the tests/examples: https://github.com/svilupp/ModernBert.jl/blob/01819a6a762eb0d6ff8ca0c63e3d0418b9a48ce9/src/bytepair.jl#L164

Failing examples: https://github.com/svilupp/ModernBert.jl/blob/01819a6a762eb0d6ff8ca0c63e3d0418b9a48ce9/examples/verify.jl#L34

text = "The capital of France is [MASK]."
tokens = tokenize(tokenizer, text)
@test tokens ==
      ["[CLS]", "The", "Ġcapital", "Ġof", "ĠFrance", "Ġis", " [MASK]", ".", "[SEP]"

I'm struggling to catch the special token [MASK].

Test Failed at /Users/simljx/Documents/GitHub/ModernBert.jl/examples/verify.jl:34
  Expression: tokens2 == ["[CLS]", "The", "Ġcapital", "Ġof", "ĠFrance", "Ġis", " [MASK]", ".", "[SEP]"]
   Evaluated: ["[CLS]", "The", "Ġcapital", "Ġof", "ĠFrance", "Ġis", "Ġ", "[MASK]", ".", "[SEP]"] == ["[CLS]", "The", "Ġcapital", "Ġof", "ĠFrance", "Ġis", " [MASK]", ".", "[SEP]"]
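
The only real difference is the stray "Ġ": my pipeline emits the space as its own token instead of folding it into the special token as " [MASK]". I can paper over it after the fact -- a naive sketch below, which assumes the stray piece is always exactly "Ġ" -- but that treats the symptom rather than the cause:

# Naive post-processing workaround (sketch): fold a standalone "Ġ" into an
# immediately following special token, mimicking the expected " [MASK]".
# `specials` is assumed to be a collection of the literal special-token strings.
function merge_space_into_specials(tokens::Vector{String}, specials)
    out = String[]
    for tok in tokens
        if !isempty(out) && out[end] == "Ġ" && tok in specials
            out[end] = " " * tok   # "Ġ", "[MASK]" -> " [MASK]"
        else
            push!(out, tok)
        end
    end
    return out
end

merge_space_into_specials(tokens2, ("[MASK]",))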

I tried using MatchTokenization and adding the special tokens. I also tried introducing my own tokenization layer (MaskTokenization) to manually fix it, but it sits too high up in the stack -- it makes no difference:

# Create tokenizer pipeline
base_tokenizer = BPE(bpe_merges)
tokenizer = BPETokenizer(
    TextEncodeBase.MatchTokenization(
        MaskTokenization(
            CodeNormalizer(
                BPETokenization(GPT2Tokenization(), base_tokenizer),
                gpt2_codemap()
            ),
            "[MASK]"
        ),
        collect(keys(special_tokens))
    )
)
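
(In the snippet, bpe_merges is parsed from merges.txt, and special_tokens is the usual ModernBERT special-token map, roughly as below. Only the keys matter for MatchTokenization; the ids are the ones I read out of tokenizer_config.json, so worth double-checking.)

# Hypothetical stand-in for the `special_tokens` used above; ids from my
# reading of tokenizer_config.json -- verify against the HF repo.
special_tokens = Dict(
    "[UNK]"  => 50280,
    "[CLS]"  => 50281,
    "[SEP]"  => 50282,
    "[PAD]"  => 50283,
    "[MASK]" => 50284,
)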

Without MatchTokenization, it splits up the special tokens, e.g.:

 "Ġ[", "MASK", "]."

Tokenizer settings (from the model's tokenizer.json):

"normalizer": {
"type": "NFC"
},
"pre_tokenizer": {
"type": "ByteLevel",
"add_prefix_space": false,
"trim_offsets": true,
"use_regex": true
},
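
(A quick way to double-check the flags locally, assuming JSON3.jl and a downloaded copy of the repo's tokenizer.json. The per-token lstrip flag on the added_tokens entries is what I suspect matters here: if [MASK] carries lstrip: true, the preceding space is absorbed into the special-token match, which would explain the expected " [MASK]" above.)

using JSON3

cfg = JSON3.read(read("tokenizer.json", String))
cfg.normalizer.type                   # "NFC"
cfg.pre_tokenizer.add_prefix_space    # false

# added_tokens entries carry per-token flags (lstrip, rstrip, normalized, ...)
for t in cfg.added_tokens
    t.content == "[MASK]" && @show t.lstrip
end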

Would you have any pointers on where to start? I'm not sure what else to try.
