Special token formatting #3

@tdozat

The config file dictates which special tokens each vocabulary uses. This is because the parser needs to know which token in the training file(s) is the root. In SD and UD this is root, but in CTB and some CoNLL 2009 treebanks it's ROOT, so we can't just hardcode the label string that indicates the root relation. In a previous implementation of the parser, the root string could be specified directly. In this one, you specify the format of all special tokens, for consistency. However, this makes it possible to leave out special tokens that the code assumes are there, or to include ones that the code never uses.

A better approach is to hardcode in what the special tokens are for each vocabulary but let the configuration file specify what the format for them is, allowing for the following possibilities:

  1. Upper (e.g. ROOT)
  2. Proper (e.g. Root)
  3. Lower (e.g. root)
  4. Upper HTML (e.g. <ROOT>)
  5. Proper HTML (e.g. <Root>)
  6. Lower HTML (e.g. <root>)

Replacing the special_tokens option with two options, special_token_case and special_token_html, should fix this, but it'll break older models.
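
A minimal sketch of how the two proposed options could combine to produce the six formats listed above. The token names in SPECIAL_TOKENS, the function name format_special_token, and the accepted option values ("upper", "proper", "lower", plus a boolean for HTML) are assumptions made for illustration, not the parser's actual code:

```python
# Hypothetical hardcoded special tokens for one vocabulary (illustrative names).
SPECIAL_TOKENS = ("pad", "root", "unk")


def format_special_token(name, case="upper", html=False):
    """Render a special-token name in the configured case and bracket style.

    `case` stands in for the proposed special_token_case option and
    `html` for the proposed special_token_html option.
    """
    if case == "upper":
        token = name.upper()
    elif case == "proper":
        token = name.capitalize()
    elif case == "lower":
        token = name.lower()
    else:
        raise ValueError("unknown special_token_case: %r" % (case,))
    # Wrap in angle brackets only when the HTML style is requested.
    return "<" + token + ">" if html else token


def build_special_tokens(case="upper", html=False):
    """Format every hardcoded special token the same way."""
    return [format_special_token(name, case, html) for name in SPECIAL_TOKENS]
```

With this split, the code always knows which special tokens exist, and the config only controls how they are spelled, e.g. build_special_tokens("lower", False) for SD/UD-style root and build_special_tokens("upper", False) for CTB-style ROOT.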
