
Multi-segment graphemes #34

@xrotwang


Handling of multi-segment graphemes is too opaque. Whenever I look at the segments code, I find it difficult to wrap my head around the layers of string splitting and concatenating. To me, it would seem natural for the data the tokenizer creates internally to be lists of lists:

[
    ['f', 'i', 'r', 's', 't'],
    ['w', 'o', 'r', 'd']
]

And whenever I try to implement it this way, I hit a wall, because the fact that, even internally, the data looks like

"f i r s t # w o r d"

is actually exploited (and relied upon) in the orthography profiles: To specify a grapheme that is to be split into two segments, you could use this profile:

Grapheme Out
sch      s ch

But that's cheating, or at least hacky, because when the parser encounters sch, it should append two segments to the output; instead it appends one "segment", s ch, which just happens to look exactly like two segments in the output.
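A minimal sketch of why the current scheme "works" anyway (illustrative names, not the actual segments API): with string-valued Out columns, the replacement s ch is appended as one element, and only the space-joined output makes it look like two segments.

```python
# Hypothetical string-based mapping, mimicking the current behaviour.
MAPPING = {'sch': 's ch', 'u': 'u', 'l': 'l', 'e': 'e'}

def tokenize(word):
    """Greedy longest-match replacement, collecting Out *strings*."""
    out, pos = [], 0
    while pos < len(word):
        for grapheme in sorted(MAPPING, key=len, reverse=True):
            if word.startswith(grapheme, pos):
                out.append(MAPPING[grapheme])  # one element, even for 's ch'
                pos += len(grapheme)
                break
        else:
            raise ValueError('no grapheme matches %r' % word[pos:])
    return out

tokenize('schule')            # ['s ch', 'u', 'l', 'e'] -- four elements
' '.join(tokenize('schule'))  # 's ch u l e' -- reads like five segments
```

The list and the joined string disagree about how many segments there are, which is exactly the intransparency described above.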

Even worse, there is no way to specify special cases using multi-segment graphemes in profiles. E.g. to differentiate the segmentation of sch in German "bischen" from "naschen", one has to use something like

Grapheme Out
bischen  b i s ch e n
sch      sch

Wouldn't it be cool if the same could be had with a profile like

Grapheme
b-i-s-ch-e-n
sch
ch
b
i
s
e
n

I think that with CSVW and its nice separator property, multi-segment graphemes could be handled fully transparently:

The profile above could be described as

{
    "name": "Grapheme",
    "propertyUrl": "http://cldf.clld.org/grapheme",
    "separator": "-"
}
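Under CSVW semantics, the separator property means each cell is split into a list of atoms. A sketch of that behaviour (parse_cell is an illustrative name, not part of any library):

```python
def parse_cell(value, separator='-'):
    """Split a profile cell on the CSVW column separator into segments."""
    return value.split(separator)

parse_cell('b-i-s-ch-e-n')  # ['b', 'i', 's', 'ch', 'e', 'n']
parse_cell('sch')           # ['sch'] -- a single-segment grapheme
```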

Then the parser would read the first line as

grapheme = ['b', 'i', 's', 'ch', 'e', 'n']

Processing would happen as follows:

  • use ''.join(grapheme) for matching
  • for each match append the list of segments (i.e. grapheme) to the output
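The two steps above could be sketched like this (illustrative names, not the segments API; the profile entries are the parsed lists from above, plus the single letters needed for the example words):

```python
# Each profile entry is a grapheme as a list of segments.
PROFILE = [
    ['b', 'i', 's', 'ch', 'e', 'n'],   # from the cell "b-i-s-ch-e-n"
    ['sch'], ['ch'],
    ['a'], ['b'], ['e'], ['i'], ['n'], ['s'],
]
# Match the longest grapheme string first, as orthography profiles do.
PROFILE.sort(key=lambda g: len(''.join(g)), reverse=True)

def tokenize(word):
    segments, pos = [], 0
    while pos < len(word):
        for grapheme in PROFILE:
            matchable = ''.join(grapheme)      # step 1: match on the join
            if word.startswith(matchable, pos):
                segments.extend(grapheme)      # step 2: append the *list*
                pos += len(matchable)
                break
        else:
            raise ValueError('no grapheme matches %r' % word[pos:])
    return segments

tokenize('bischen')  # ['b', 'i', 's', 'ch', 'e', 'n'] -- the special case
tokenize('naschen')  # ['n', 'a', 'sch', 'e', 'n'] -- sch stays one segment
```

No string splitting of output is needed: the segment boundaries come directly from the profile's lists.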

With this scheme, we could even have cross-word graphemes, e.g.

Grapheme Out
u k      u-k

would tokenize zu klein as ['z', 'u k', 'l', 'e', 'i', 'n'] and transliterate it as ['z', 'u', 'k', 'l', 'e', 'i', 'n']. While somewhat artificial, this could be used to deal with degenerate cases in lexibank, where we sometimes get multi-word expressions when we only expect a single lexeme.
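A sketch of the cross-word case under the same scheme: the grapheme "u k" (with a space) maps to the segments ['u', 'k']; tokenization keeps the matched grapheme intact, while transliteration flattens the segment lists. Again, all names are illustrative, not the segments API.

```python
# Grapheme string -> list of segments, including one cross-word grapheme.
PROFILE = {
    'u k': ['u', 'k'],
    'z': ['z'], 'l': ['l'], 'e': ['e'], 'i': ['i'], 'n': ['n'],
}

def tokenize(text):
    """Return the matched graphemes, longest match first."""
    graphemes, pos = [], 0
    while pos < len(text):
        for grapheme in sorted(PROFILE, key=len, reverse=True):
            if text.startswith(grapheme, pos):
                graphemes.append(grapheme)
                pos += len(grapheme)
                break
        else:
            raise ValueError('no grapheme matches %r' % text[pos:])
    return graphemes

def transliterate(text):
    """Flatten each matched grapheme into its list of segments."""
    return [seg for g in tokenize(text) for seg in PROFILE[g]]

tokenize('zu klein')       # ['z', 'u k', 'l', 'e', 'i', 'n']
transliterate('zu klein')  # ['z', 'u', 'k', 'l', 'e', 'i', 'n']
```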
