
Proposition for parsing multi-character diacritics (see issue #45) #46

Open
XachaB wants to merge 1 commit into cldf-clts:master from XachaB:multichar_diacritic_parsing

Conversation


@XachaB XachaB commented Nov 1, 2021

This is a potential implementation for parsing multi-character diacritics, such as ultra-long (see issue #45).

Currently, diacritics are parsed by iterating over each character left over after matching sounds. This does not allow recognizing multi-character diacritics.

One alternative is to construct regexes for diacritics (for each type of sound, one regex for pre-diacritics and one for post-diacritics) and use them to split the string of remaining diacritics. This is what I am doing here, with otherwise exactly the same functionality.
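The difference between the two approaches can be sketched as follows. This is a minimal illustration, not the actual pyclts code: the inventory and function names are hypothetical, and the real regexes are built per sound type and per position from the BIPA data.

```python
import re

# Hypothetical diacritic inventory; the real lists come from the BIPA data.
known = {"ːː", "ː", "ʰ"}

def split_per_char(leftover, known):
    # Current behaviour: iterate character by character, so "ːː"
    # can only ever be seen as two separate "ː" diacritics.
    return [c for c in leftover if c in known]

def split_with_regex(leftover, known):
    # Proposed behaviour: an alternation sorted longest-first, so the
    # multi-character "ːː" wins over its single-character prefix "ː".
    pattern = "|".join(re.escape(d) for d in sorted(known, key=len, reverse=True))
    # The capturing group makes re.split keep the matched diacritics.
    return [p for p in re.split(f"({pattern})", leftover) if p]

print(split_per_char("ːːʰ", known))    # ['ː', 'ː', 'ʰ']
print(split_with_regex("ːːʰ", known))  # ['ːː', 'ʰ']
```

Sorting the alternation longest-first matters: Python's `re` alternation takes the first branch that matches, so `ː|ːː` would never match the two-character diacritic.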

features[self._feature_values[feature]] = feature
grapheme += dia[1]
sound += self.features[base_sound.type][feature][1]
for dia in post_dia_regex.split(post):
@XachaB (Author)

I considered refactoring these two nearly identical blocks to avoid repetition, but since I am not a regular contributor to pyclts, I opted to stay as close as possible to the current implementation.
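The refactoring considered here might look something like the following sketch. It is purely illustrative: the helper name, the `lookup` mapping, and the return convention are invented, and the real pyclts code also updates the grapheme and sound strings, which is omitted for brevity.

```python
import re

def apply_diacritics(remainder, dia_regex, lookup):
    """Split `remainder` with `dia_regex` and collect the matched features.

    Hypothetical helper factoring out the logic shared by the
    pre- and post-diacritic blocks; not the actual pyclts code.
    """
    matched = []
    for dia in dia_regex.split(remainder):
        if not dia:
            continue  # re.split yields empty strings around matches
        feature = lookup.get(dia)
        if feature is None:
            return None  # an unparsable fragment invalidates the sound
        matched.append(feature)
    return matched

# One lookup/regex pair would exist per sound type and position (pre/post):
lookup = {"ʰ": "aspirated", "ːː": "ultra-long"}
dia_regex = re.compile("(" + "|".join(
    re.escape(d) for d in sorted(lookup, key=len, reverse=True)) + ")")

print(apply_diacritics("ʰːː", dia_regex, lookup))  # ['aspirated', 'ultra-long']
print(apply_diacritics("xʰ", dia_regex, lookup))   # None: 'x' is unknown
```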

@LinguList (Contributor)

Sorry, @XachaB, I only now saw this PR. I have been very busy lately. I'll check this later next week and also answer on the issue.


codecov-commenter commented Nov 12, 2021

Codecov Report

Merging #46 (07b85b7) into master (c7420d9) will increase coverage by 0.00%.
The diff coverage is 97.05%.


@@           Coverage Diff           @@
##           master      #46   +/-   ##
=======================================
  Coverage   94.65%   94.66%           
=======================================
  Files          33       33           
  Lines        1760     1780   +20     
=======================================
+ Hits         1666     1685   +19     
- Misses         94       95    +1     
Impacted Files                       Coverage Δ
src/pyclts/transcriptionsystem.py    96.55% <97.05%> (-0.17%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@LinguList LinguList (Contributor) left a comment

@XachaB, I am hesitant about this extension. While I have no objections to the code so far (although I'd have to check it more closely, which I cannot do right now), I do not know whether it is useful to allow cases of combined diacritics. Given the number of sounds that we have there, it does not seem to be needed for now, if it is only for the single ultra-long diacritic.

So I'd ask @xrotwang for an additional opinion, and also @cormacanderson, who is involved in coding up things in the data, to say what they think about handling diacritic combinations as a feature per se in the diacritics table.

Generally, it makes sense, but I am always hesitant to add more code, as it is already not easy to keep track of what the system can generate and where it fails.

@xrotwang (Contributor)

I think I'd agree with @LinguList that explicitly listing ultra-long phonemes in the respective data files would be more transparent than wholesale acceptance of stacked diacritics. While CLTS BIPA isn't as strict as e.g. Concepticon in only aggregating phonemes that have been encountered "in the wild", I'd still say that stacked diacritics more often signal problems with the data than actual phonemes. In addition, the code in transcriptionsystems is already more complex than I'd like it to be, so any additions to it feel like a step in the wrong direction.


XachaB commented Nov 18, 2021

Ok, that makes sense.

I understand the wish not to complexify the system, especially for a single two-character diacritic. Then I won't spend time trying to find out why one test isn't passing -- glad I didn't do so earlier. Can I then make a PR to CLTS (not pyclts) adding a set of extra-long sounds to the respective bipa data files?

I do need some way for these extra-long diacritics not to be ignored by pyclts.
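A data-side addition of this kind might look like the following sketch, assuming the BIPA sounds file is a tab-separated table with GRAPHEME and NAME columns. The column layout, the exact feature names, and their ordering are assumptions here and should be checked against the actual CLTS repository files.

```tsv
GRAPHEME	NAME
aːː	ultra-long unrounded open front vowel
uːː	ultra-long rounded close back vowel
```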

@LinguList (Contributor)

Sure, @XachaB, thanks in advance for your help here! We look forward to the PR in CLTS!


4 participants