Handling of multi-segment graphemes is too opaque. Whenever I look at the `segments` code, I find it difficult to wrap my head around the layers of string splitting and concatenating. To me, it would seem natural that internally, the data the tokenizer creates are lists of lists:
```python
[
    ['f', 'i', 'r', 's', 't'],
    ['w', 'o', 'r', 'd']
]
```

And whenever I try to implement it this way, I hit a wall, because the fact that even internally, the data looks like
"f i r s t # w o r d"is actually exploited (and relied upon) in the orthography profiles: To specify a grapheme that is to be split into two segments, you could use this profile:
| Grapheme | Out |
|----------|-----|
| sch      | s ch |
But that's cheating, or at least hacky, because when the parser encounters `sch`, it should append two segments to the output, but instead appends the one "segment" `s ch`, which just happens to look exactly like two segments in the output.
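For illustration, here is a minimal sketch of a greedy longest-match tokenizer over the flat string representation (hypothetical code, not the actual `segments` implementation); the `s ch` trick only works because the output is a space-joined string:

```python
def tokenize(word, profile):
    """Greedy longest-match tokenization; `profile` maps Grapheme -> Out."""
    graphemes = sorted(profile, key=len, reverse=True)  # longest match first
    out, i = [], 0
    while i < len(word):
        for g in graphemes:
            if word.startswith(g, i):
                out.append(profile[g])  # one appended element per match
                i += len(g)
                break
        else:  # no grapheme matched: emit a replacement marker
            out.append('\ufffd')
            i += 1
    return ' '.join(out)

profile = {'sch': 's ch', 'u': 'u', 'l': 'l', 'e': 'e'}
print(tokenize('schule', profile))  # 's ch u l e'
```

The single Out value `s ch` is appended as one element, yet in the joined output it is indistinguishable from two segments.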
Even worse, there is no way to specify special cases involving multi-segment graphemes in profiles. E.g. to differentiate the segmentation of `sch` in German "bischen" from "naschen", one has to use something like
| Grapheme | Out |
|----------|-----|
| bischen  | b i s ch e n |
| sch      | sch |
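With the sketch above, this workaround does yield the intended segmentations, but only by spelling out the whole word as a single "grapheme" (again hypothetical code):

```python
profile = {
    'bischen': 'b i s ch e n',  # the entire word as one "grapheme"
    'sch': 'sch',
    'a': 'a', 'e': 'e', 'n': 'n',
}
print(tokenize('bischen', profile))  # 'b i s ch e n'
print(tokenize('naschen', profile))  # 'n a sch e n'
```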
Wouldn't it be cool if the same could be had with a profile like
| Grapheme |
|----------|
| b-i-s-ch-e-n |
| sch |
| ch |
| b |
| i |
| s |
| e |
| n |
I think that with CSVW and the nice `separator` property, multi-segment graphemes could be handled fully transparently:
The profile above could be described as

```json
{
    "name": "Grapheme",
    "propertyUrl": "http://cldf.clld.org/grapheme",
    "separator": "-"
}
```

Then the parser would read the first line as
```python
grapheme = ['b', 'i', 's', 'ch', 'e', 'n']
```

Processing would happen as follows:
- use `''.join(grapheme)` for matching
- for each match, append the list of segments (i.e. `grapheme`) to the output, as sketched below
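A minimal sketch of this proposed processing (hypothetical code, assuming the profile has already been read into segment lists via the `separator`):

```python
def tokenize_segments(word, profile):
    """Tokenize `word` into a list of segments.

    `profile` maps a grapheme's match form to its list of output segments;
    for the single-column proposal this is just {''.join(g): g}.
    """
    forms = sorted(profile, key=len, reverse=True)  # longest match first
    out, i = [], 0
    while i < len(word):
        for form in forms:
            if word.startswith(form, i):
                out.extend(profile[form])  # genuinely appends multiple segments
                i += len(form)
                break
        else:  # no grapheme matched: emit a replacement marker
            out.append('\ufffd')
            i += 1
    return out

# Graphemes as read from the dash-separated Grapheme column
# ('a' added here so the naschen example works):
graphemes = [['b', 'i', 's', 'ch', 'e', 'n'], ['sch'], ['ch'],
             ['b'], ['i'], ['s'], ['e'], ['n'], ['a']]
profile = {''.join(g): g for g in graphemes}
print(tokenize_segments('bischen', profile))  # ['b', 'i', 's', 'ch', 'e', 'n']
print(tokenize_segments('naschen', profile))  # ['n', 'a', 'sch', 'e', 'n']
```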
With this scheme, we could even have cross-word graphemes, e.g.
| Grapheme | Out |
|----------|-----|
| u k      | u-k |
would tokenize *zu klein* as `['z', 'u k', 'l', 'e', 'i', 'n']` and transliterate it as `['z', 'u', 'k', 'l', 'e', 'i', 'n']`. While somewhat artificial, this could be used to deal with degenerate cases in lexibank, where we sometimes get multi-word expressions when we only expect a single lexeme.
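Under the same sketch, a cross-word grapheme is just a match form that contains a space while its segment list does not (hypothetical; for tokenization one would map `'u k'` to `['u k']` instead, to keep it as a single segment):

```python
profile = {'z': ['z'], 'u k': ['u', 'k'], 'l': ['l'],
           'e': ['e'], 'i': ['i'], 'n': ['n']}
print(tokenize_segments('zu klein', profile))  # ['z', 'u', 'k', 'l', 'e', 'i', 'n']
```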