
Proposition for parsing multi-character diacritics (see issue #45) #46

Open
XachaB wants to merge 1 commit into cldf-clts:master from XachaB:multichar_diacritic_parsing

Conversation


@XachaB XachaB commented Nov 1, 2021

This is a potential implementation for parsing multi-character diacritics, such as ultra-long (see issue #45).

Currently, diacritics are parsed by iterating over each character left over after matching sounds. This does not allow recognizing multi-character diacritics.

One alternative is to construct regexes for diacritics (for each type of sound, one regex for pre-diacritics and one for post-diacritics) and use them to split the string of remaining diacritics. This is what I am doing here, with otherwise exactly the same functionality.
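The difference between the two approaches can be sketched as follows. This is a minimal illustration, not the actual pyclts code: the inventory and function names are hypothetical, and the real regexes are built per sound type and per position from the BIPA data.

```python
import re

# Hypothetical diacritic inventory; the real lists come from the BIPA data.
known = {"ːː", "ː", "ʰ"}

def split_per_char(leftover, known):
    # Current behaviour: iterate character by character, so "ːː"
    # can only ever be seen as two separate "ː" diacritics.
    return [c for c in leftover if c in known]

def split_with_regex(leftover, known):
    # Proposed behaviour: an alternation sorted longest-first, so the
    # multi-character "ːː" wins over its single-character prefix "ː".
    pattern = "|".join(re.escape(d) for d in sorted(known, key=len, reverse=True))
    # The capturing group makes re.split keep the matched diacritics.
    return [p for p in re.split(f"({pattern})", leftover) if p]

print(split_per_char("ːːʰ", known))    # ['ː', 'ː', 'ʰ']
print(split_with_regex("ːːʰ", known))  # ['ːː', 'ʰ']
```

Sorting the alternation longest-first matters: Python's `re` alternation takes the first branch that matches, so `ː|ːː` would never match the two-character diacritic.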

features[self._feature_values[feature]] = feature
grapheme += dia[1]
sound += self.features[base_sound.type][feature][1]
for dia in post_dia_regex.split(post):
@XachaB (Author)

I considered refactoring these two nearly identical blocks to avoid repetition, but since I am not a regular contributor to pyclts, I opted to stay as close as possible to the current implementation.
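The refactoring considered here might look something like the following sketch. It is purely illustrative: the helper name, the `lookup` mapping, and the return convention are invented, and the real pyclts code also updates the grapheme and sound strings, which is omitted for brevity.

```python
import re

def apply_diacritics(remainder, dia_regex, lookup):
    """Split `remainder` with `dia_regex` and collect the matched features.

    Hypothetical helper factoring out the logic shared by the
    pre- and post-diacritic blocks; not the actual pyclts code.
    """
    matched = []
    for dia in dia_regex.split(remainder):
        if not dia:
            continue  # re.split yields empty strings around matches
        feature = lookup.get(dia)
        if feature is None:
            return None  # an unparsable fragment invalidates the sound
        matched.append(feature)
    return matched

# One lookup/regex pair would exist per sound type and position (pre/post):
lookup = {"ʰ": "aspirated", "ːː": "ultra-long"}
dia_regex = re.compile("(" + "|".join(
    re.escape(d) for d in sorted(lookup, key=len, reverse=True)) + ")")

print(apply_diacritics("ʰːː", dia_regex, lookup))  # ['aspirated', 'ultra-long']
print(apply_diacritics("xʰ", dia_regex, lookup))   # None: 'x' is unknown
```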

@LinguList (Contributor)

Sorry, @XachaB, I only now saw this PR. I have been very busy lately. I'll check this later next week and also answer on the issue.


codecov-commenter commented Nov 12, 2021

Codecov Report

Merging #46 (07b85b7) into master (c7420d9) will increase coverage by 0.00%.
The diff coverage is 97.05%.


@@           Coverage Diff           @@
##           master      #46   +/-   ##
=======================================
  Coverage   94.65%   94.66%           
=======================================
  Files          33       33           
  Lines        1760     1780   +20     
=======================================
+ Hits         1666     1685   +19     
- Misses         94       95    +1     
Impacted Files                       Coverage Δ
src/pyclts/transcriptionsystem.py    96.55% <97.05%> (-0.17%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@LinguList LinguList (Contributor) left a comment

@XachaB, I am hesitant about this extension. While I have no objections to the code so far (although I'd have to check it more closely, which I cannot do right now), I do not know whether it is useful to allow cases of combined diacritics. Given the number of sounds that we have there, it does not seem to be needed for now, if it is only for the single ultra-long diacritic.

So I'd ask @xrotwang for an additional opinion, and also @cormacanderson, who is involved in coding up things in the data, to say what they think about handling diacritic combinations as a feature per se in the diacritics table.

Generally, it makes sense, but I am always hesitant to add more code, as it is already not easy to keep track of what the system can generate and where it fails.

@xrotwang (Contributor)

I think I'd agree with @LinguList that explicitly listing ultra-long phonemes in the respective data files would be more transparent than wholesale acceptance of stacked diacritics. While CLTS BIPA isn't as strict as e.g. Concepticon in only aggregating phonemes that have been encountered "in the wild", I'd still say that stacked diacritics more often signal problems with the data than actual phonemes. In addition, the code in transcriptionsystems is already more complex than I'd like it to be, so any additions to it feel like a step in the wrong direction.


XachaB commented Nov 18, 2021

Ok, that makes sense.

I understand the wish not to complexify the system, especially for a single two-character diacritic. Then I won't spend time trying to find out why one test isn't passing -- glad I didn't do so earlier. Can I then make a PR to CLTS (not pyclts) adding a set of extra-long sounds to the respective bipa data files?

I do need some way for these extra-long diacritics not to be ignored by pyclts.
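A data-side addition of this kind might look like the following sketch, assuming the BIPA sounds file is a tab-separated table with GRAPHEME and NAME columns. The column layout, the exact feature names, and their ordering are assumptions here and should be checked against the actual CLTS repository files.

```tsv
GRAPHEME	NAME
aːː	ultra-long unrounded open front vowel
uːː	ultra-long rounded close back vowel
```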

@LinguList (Contributor)

Sure, @XachaB, thanks in advance for your help here! We look forward to the PR in CLTS!


4 participants