Handling of multi-segment graphemes is too opaque. Whenever I look at the `segments` code, I find it difficult to wrap my head around the layers of string splitting and concatenating. To me, it would seem natural that internally, the data the tokenizer creates are lists of lists:
```python
[
    ['f', 'i', 'r', 's', 't'],
    ['w', 'o', 'r', 'd']
]
```

And whenever I try to implement it this way, I hit a wall, because the fact that even internally, the data looks like
"f i r s t # w o r d"is actually exploited (and relied upon) in the orthography profiles: To specify a grapheme that is to be split into two segments, you could use this profile:
| Grapheme | Out |
|----------|-----|
| sch      | s ch |
But that's cheating, or at least hacky, because when the parser encounters `sch`, it should append two segments to the output, but instead appends the one "segment" `s ch`, which just happens to look exactly like two segments in the output.
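For illustration, here is a minimal sketch of a greedy longest-match tokenizer over the flat string representation (hypothetical code, not the actual `segments` implementation); the `s ch` trick only works because the output is a space-joined string:

```python
def tokenize(word, profile):
    """Greedy longest-match tokenization; `profile` maps Grapheme -> Out."""
    graphemes = sorted(profile, key=len, reverse=True)  # longest match first
    out, i = [], 0
    while i < len(word):
        for g in graphemes:
            if word.startswith(g, i):
                out.append(profile[g])  # one appended element per match
                i += len(g)
                break
        else:  # no grapheme matched: emit a replacement marker
            out.append('\ufffd')
            i += 1
    return ' '.join(out)

profile = {'sch': 's ch', 'u': 'u', 'l': 'l', 'e': 'e'}
print(tokenize('schule', profile))  # 's ch u l e'
```

The single Out value `s ch` is appended as one element, yet in the joined output it is indistinguishable from two segments.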
Even worse, there is no way to specify special cases involving multi-segment graphemes in profiles. E.g. to differentiate the segmentation of `sch` in German "bischen" from "naschen", one has to use something like
| Grapheme | Out |
|----------|-----|
| bischen  | b i s ch e n |
| sch      | sch |
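With the sketch above, this workaround does yield the intended segmentations, but only by spelling out the whole word as a single "grapheme" (again hypothetical code):

```python
profile = {
    'bischen': 'b i s ch e n',  # the entire word as one "grapheme"
    'sch': 'sch',
    'a': 'a', 'e': 'e', 'n': 'n',
}
print(tokenize('bischen', profile))  # 'b i s ch e n'
print(tokenize('naschen', profile))  # 'n a sch e n'
```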
Wouldn't it be cool if the same could be had with a profile like
| Grapheme |
|----------|
| b-i-s-ch-e-n |
| sch |
| ch |
| b |
| i |
| s |
| e |
| n |
I think that with CSVW and the nice `separator` property, multi-segment graphemes could be handled fully transparently:
The profile above could be described as

```json
{
    "name": "Grapheme",
    "propertyUrl": "http://cldf.clld.org/grapheme",
    "separator": "-"
}
```

Then the parser would read the first line as
```python
grapheme = ['b', 'i', 's', 'ch', 'e', 'n']
```

Processing would happen as follows:
- use `''.join(grapheme)` for matching
- for each match, append the list of segments (i.e. `grapheme`) to the output, as sketched below
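A minimal sketch of this proposed processing (hypothetical code, assuming the profile has already been read into segment lists via the `separator`):

```python
def tokenize_segments(word, profile):
    """Tokenize `word` into a list of segments.

    `profile` maps a grapheme's match form to its list of output segments;
    for the single-column proposal this is just {''.join(g): g}.
    """
    forms = sorted(profile, key=len, reverse=True)  # longest match first
    out, i = [], 0
    while i < len(word):
        for form in forms:
            if word.startswith(form, i):
                out.extend(profile[form])  # genuinely appends multiple segments
                i += len(form)
                break
        else:  # no grapheme matched: emit a replacement marker
            out.append('\ufffd')
            i += 1
    return out

# Graphemes as read from the dash-separated Grapheme column
# ('a' added here so the naschen example works):
graphemes = [['b', 'i', 's', 'ch', 'e', 'n'], ['sch'], ['ch'],
             ['b'], ['i'], ['s'], ['e'], ['n'], ['a']]
profile = {''.join(g): g for g in graphemes}
print(tokenize_segments('bischen', profile))  # ['b', 'i', 's', 'ch', 'e', 'n']
print(tokenize_segments('naschen', profile))  # ['n', 'a', 'sch', 'e', 'n']
```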
With this scheme, we could even have cross-word graphemes, e.g.
| Grapheme | Out |
|----------|-----|
| u k      | u-k |
would tokenize *zu klein* as `['z', 'u k', 'l', 'e', 'i', 'n']` and transliterate it as `['z', 'u', 'k', 'l', 'e', 'i', 'n']`. While somewhat artificial, this could be used to deal with degenerate cases in lexibank, where we sometimes get multi-word expressions when we only expect a single lexeme.
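Under the same sketch, a cross-word grapheme is just a match form that contains a space while its segment list does not (hypothetical; for tokenization one would map `'u k'` to `['u k']` instead, to keep it as a single segment):

```python
profile = {'z': ['z'], 'u k': ['u', 'k'], 'l': ['l'],
           'e': ['e'], 'i': ['i'], 'n': ['n']}
print(tokenize_segments('zu klein', profile))  # ['z', 'u', 'k', 'l', 'e', 'i', 'n']
```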