
Help with ModernBert tokenizer (BPE+special tokens) #8

@svilupp

Description

@chengchingwen, would you mind sharing some pointers on implementing the ModernBERT tokenizer?

I've tried building it both from scratch and with your package, and I can't get either to work -- example below.

Source: https://huggingface.co/answerdotai/ModernBERT-base/blob/main/tokenizer_config.json
My attempt: https://github.com/svilupp/ModernBert.jl

I tried using your package to load the merges and to mimic how it's used in some of the tests/examples: https://github.com/svilupp/ModernBert.jl/blob/01819a6a762eb0d6ff8ca0c63e3d0418b9a48ce9/src/bytepair.jl#L164

Failing examples: https://github.com/svilupp/ModernBert.jl/blob/01819a6a762eb0d6ff8ca0c63e3d0418b9a48ce9/examples/verify.jl#L34

text = "The capital of France is [MASK]."
tokens = tokenize(tokenizer, text)
@test tokens ==
      ["[CLS]", "The", "Ġcapital", "Ġof", "ĠFrance", "Ġis", " [MASK]", ".", "[SEP]"

I'm struggling to catch the special token [MASK].

Test Failed at /Users/simljx/Documents/GitHub/ModernBert.jl/examples/verify.jl:34
  Expression: tokens2 == ["[CLS]", "The", "Ġcapital", "Ġof", "ĠFrance", "Ġis", " [MASK]", ".", "[SEP]"]
   Evaluated: ["[CLS]", "The", "Ġcapital", "Ġof", "ĠFrance", "Ġis", "Ġ", "[MASK]", ".", "[SEP]"] == ["[CLS]", "The", "Ġcapital", "Ġof", "ĠFrance", "Ġis", " [MASK]", ".", "[SEP]"]
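
The only real difference is the stray "Ġ": my pipeline emits the space as its own token instead of folding it into the special token as " [MASK]". I can paper over it after the fact -- a naive sketch below, which assumes the stray piece is always exactly "Ġ" -- but that treats the symptom rather than the cause:

# Naive post-processing workaround (sketch): fold a standalone "Ġ" into an
# immediately following special token, mimicking the expected " [MASK]".
# `specials` is assumed to be a collection of the literal special-token strings.
function merge_space_into_specials(tokens::Vector{String}, specials)
    out = String[]
    for tok in tokens
        if !isempty(out) && out[end] == "Ġ" && tok in specials
            out[end] = " " * tok   # "Ġ", "[MASK]" -> " [MASK]"
        else
            push!(out, tok)
        end
    end
    return out
end

merge_space_into_specials(tokens2, ("[MASK]",))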

I tried using MatchTokenization and adding the special tokens. I also tried introducing my own tokenization layer (MaskTokenization) to manually fix it, but it sits too high up in the stack -- it makes no difference:

# Create tokenizer pipeline
base_tokenizer = BPE(bpe_merges)
tokenizer = BPETokenizer(
    TextEncodeBase.MatchTokenization(
        MaskTokenization(
            CodeNormalizer(
                BPETokenization(GPT2Tokenization(), base_tokenizer),
                gpt2_codemap()
            ),
            "[MASK]"
        ),
        collect(keys(special_tokens))
    )
)
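
(In the snippet, bpe_merges is parsed from merges.txt, and special_tokens is the usual ModernBERT special-token map, roughly as below. Only the keys matter for MatchTokenization; the ids are the ones I read out of tokenizer_config.json, so worth double-checking.)

# Hypothetical stand-in for the `special_tokens` used above; ids from my
# reading of tokenizer_config.json -- verify against the HF repo.
special_tokens = Dict(
    "[UNK]"  => 50280,
    "[CLS]"  => 50281,
    "[SEP]"  => 50282,
    "[PAD]"  => 50283,
    "[MASK]" => 50284,
)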

Without MatchTokenization, it splits up the special tokens, e.g.:

 "Ġ[", "MASK", "]."

Tokenizer settings (from the model's tokenizer.json):

"normalizer": {
"type": "NFC"
},
"pre_tokenizer": {
"type": "ByteLevel",
"add_prefix_space": false,
"trim_offsets": true,
"use_regex": true
},
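
(A quick way to double-check the flags locally, assuming JSON3.jl and a downloaded copy of the repo's tokenizer.json. The per-token lstrip flag on the added_tokens entries is what I suspect matters here: if [MASK] carries lstrip: true, the preceding space is absorbed into the special-token match, which would explain the expected " [MASK]" above.)

using JSON3

cfg = JSON3.read(read("tokenizer.json", String))
cfg.normalizer.type                   # "NFC"
cfg.pre_tokenizer.add_prefix_space    # false

# added_tokens entries carry per-token flags (lstrip, rstrip, normalized, ...)
for t in cfg.added_tokens
    t.content == "[MASK]" && @show t.lstrip
end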

Would you have any pointers on where to start? I'm not sure what else to try.
