Proposition for parsing multi-character diacritics (see issue #45) #46
XachaB wants to merge 1 commit into cldf-clts:master
Conversation
```python
features[self._feature_values[feature]] = feature
grapheme += dia[1]
sound += self.features[base_sound.type][feature][1]
for dia in post_dia_regex.split(post):
```
I considered refactoring these two nearly identical blocks to avoid repetition, but since I am not a regular contributor to pyclts, I opted to stay as close as possible to the current implementation.
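For what it's worth, here is a hedged sketch of the kind of helper such a refactor might extract; the name `apply_diacritics` and its parameters are purely illustrative, not pyclts internals:

```python
def apply_diacritics(dia_regex, residue, known, accumulate):
    """Split `residue` with `dia_regex` and feed each recognized
    diacritic to `accumulate`; return any unrecognized leftovers.

    Assumes `dia_regex` wraps its alternation in a capturing group,
    so that `split()` keeps the matched diacritics in the result.
    """
    unknown = []
    for dia in dia_regex.split(residue):
        if not dia:
            continue  # re.split yields '' between adjacent matches
        if dia in known:
            accumulate(dia, known[dia])
        else:
            unknown.append(dia)
    return ''.join(unknown)
```

The same helper could then be called twice, once with the pre-diacritic regex and once with the post-diacritic one.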
Sorry, @XachaB, I only now saw this PR. I have been very busy lately. I'll check this later next week and also answer on the issue.
Codecov Report
```diff
@@            Coverage Diff            @@
##           master      #46    +/-   ##
=========================================
  Coverage   94.65%   94.66%
=========================================
  Files          33       33
  Lines        1760     1780    +20
=========================================
+ Hits         1666     1685    +19
- Misses         94       95     +1
```

Continue to review the full report at Codecov.
LinguList left a comment
@XachaB, I am hesitant about this extension. While I have no objections against the code so far (although I'd have to check it more closely, which I cannot do right now), I do not know whether it is useful to allow combined diacritics. Given the number of sounds we have there, it may not be needed for now, if it is only for the single ultra-long diacritic.
So I'd ask @xrotwang for an additional opinion, and also @cormacanderson, who is involved in coding up things in the data, what they think about handling diacritic combinations as a feature per se in the diacritics table.
Generally, it makes sense, but I am always hesitant to add more code, as it is already not easy to keep track of what the system can generate and where it fails.
I think I'd agree with @LinguList that explicitly listing ultra-long phonemes in the respective data files would be more transparent than wholesale acceptance of stacked diacritics. While CLTS BIPA isn't as strict as e.g. Concepticon in only aggregating phonemes that have been encountered "in the wild", I'd still say that stacked diacritics may more often signal problems with the data than actual phonemes. In addition, I think that the code in
Ok, that makes sense. I fully understand the wish not to complicate the system, especially for a single two-character diacritic. Then I won't waste time trying to find out why one test isn't passing -- glad I didn't do so earlier, too. Can I then make a PR to CLTS (not pyclts) for a set of extra-long sounds in the respective bipa data files? I do need some way for these extra-long diacritics not to be ignored by pyclts.
Sure, @XachaB, thanks in advance for your help here! We look forward to the PR in CLTS!
This is a potential implementation for parsing multi-character diacritics, such as ultra-long (see issue #45).
Currently, diacritics are parsed by iterating over each character left over after matching sounds, which cannot recognize multi-character diacritics.
One alternative is to construct regexes for diacritics (for each type of sound, one regex for pre-diacritics and one for post-diacritics) and use them to split the string of remaining diacritics. This is what I am doing here, with otherwise exactly the same functionality; a sketch of the idea follows below.
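As an illustration, here is a minimal, self-contained sketch of the regex-splitting idea. The diacritic inventory and the names `POST_DIACRITICS` and `split_post_diacritics` are assumptions made for this sketch, not the actual pyclts code; only `post_dia_regex` echoes the name used in the PR.

```python
import re

# Hypothetical post-diacritic inventory; in pyclts this would be derived
# from the BIPA diacritics table, per sound type. 'ːː' marks ultra-long.
POST_DIACRITICS = ['ːː', 'ː', 'ʰ', '̃']

# Sort longest-first so the alternation tries 'ːː' before its
# single-character prefix 'ː'.
post_dia_regex = re.compile('({})'.format('|'.join(
    re.escape(d) for d in sorted(POST_DIACRITICS, key=len, reverse=True))))

def split_post_diacritics(leftover):
    """Split the residue after the base sound into known diacritics.

    re.split with a capturing group keeps the matched diacritics in the
    result; the empty strings between adjacent matches are dropped.
    """
    return [part for part in post_dia_regex.split(leftover) if part]

print(split_post_diacritics('ːː'))  # ['ːː'] -- one ultra-long diacritic
print(split_post_diacritics('ʰː'))  # ['ʰ', 'ː'] -- two separate diacritics
```

Sorting the alternation longest-first is what lets the two-character ultra-long marker win over its one-character prefix; a per-character loop can never make that choice.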