Look into tokenization #2

@DavidNemeskey

Description

The original BERT was trained on raw text, in which punctuation marks generally appear attached to words. In emBERT, however, we take the output of emToken, so punctuation marks arrive as tokens in their own right. This discrepancy might hurt performance.

  1. Check whether this is really a problem. BERT's basic tokenization step already splits punctuation off from words, so the mismatch might not be as acute as it seems at first sight.
  2. Merge punctuation tokens back onto the preceding words before sending the text to the BERT model.
  3. Alternatively, skip emToken altogether?
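A minimal sketch of option 2, assuming the input is a plain list of emToken tokens. The helper name `merge_punctuation` and the heuristic (re-attach any punctuation-only token to the token before it) are my own illustration, not existing emBERT code; the real implementation would also need to track the merge positions to map BERT's output labels back to the original tokens.

```python
import string

def merge_punctuation(tokens):
    """Re-attach punctuation-only tokens to the preceding word,
    approximating the raw-text spacing BERT was pre-trained on."""
    merged = []
    for tok in tokens:
        if merged and tok and all(ch in string.punctuation for ch in tok):
            merged[-1] += tok  # glue "," onto "Hello" -> "Hello,"
        else:
            merged.append(tok)
    return merged

# emToken-style output -> raw-text-like tokens
print(merge_punctuation(["Hello", ",", "world", "!"]))  # ['Hello,', 'world!']
```

Note that this only approximates raw text: sentence-initial punctuation (e.g. quotes) would still be merged leftward, so a production version would likely special-case opening punctuation.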

Labels: enhancement (New feature or request)