Normalize IPA strings to NFC for consistency

Hi, I noticed that the dataset contains a mixture of NFC and NFD Unicode forms for IPA strings. For example:

* Row 277, Col 7: `ãː` (NFD: `a` + COMBINING TILDE) vs. `ãː` (NFC: single precomposed `ã`).

Out of ~5.1 million cells, ~6,200 are not in NFC. This causes issues with string matching, e.g., `"ã" != "ã"` even though they look identical.
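
To make the mismatch concrete, here is a minimal sketch using only the standard library. The two strings are built from explicit escapes so the form of each is unambiguous; the `inventory` set is a hypothetical stand-in for any lookup table keyed on IPA strings.

```python
import unicodedata

nfd = "a\u0303\u02d0"  # a + COMBINING TILDE + length mark (NFD)
nfc = "\u00e3\u02d0"   # precomposed a-tilde + length mark (NFC)

# The strings render identically but differ in code points.
print(nfd == nfc)          # False
print(len(nfd), len(nfc))  # 3 2

# Hypothetical lookup table keyed on NFC strings.
inventory = {nfc}
print(nfd in inventory)                                # False
print(unicodedata.normalize("NFC", nfd) in inventory)  # True
```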

To fix this, I applied NFC normalization across the CSV like this:

```python
import csv
import unicodedata

# newline="" lets the csv module handle line endings itself.
with open("input.csv", "r", encoding="utf-8", newline="") as infile, \
     open("output.csv", "w", encoding="utf-8", newline="") as outfile:
    reader = csv.reader(infile)
    writer = csv.writer(outfile)
    for row in reader:
        # Normalize every cell; cells already in NFC are returned unchanged.
        writer.writerow([unicodedata.normalize("NFC", cell) for cell in row])
```
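
One way to check the result afterwards is to list any cells that are still not in NFC. The sketch below is only an illustration: `find_non_nfc` is a hypothetical helper, the 1-based row and column numbering is an assumption (it may not match the numbering used in this issue), and `unicodedata.is_normalized` requires Python 3.8 or later. The demo runs on a throwaway file, not on the dataset.

```python
import csv
import os
import tempfile
import unicodedata

def find_non_nfc(path):
    """Yield (row, col, cell) for every cell that is not in NFC (1-based, assumed)."""
    with open(path, "r", encoding="utf-8", newline="") as f:
        for r, row in enumerate(csv.reader(f), start=1):
            for c, cell in enumerate(row, start=1):
                if not unicodedata.is_normalized("NFC", cell):
                    yield r, c, cell

# Demo on a temporary CSV with one NFD cell and one NFC cell.
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "demo.csv")
    with open(demo, "w", encoding="utf-8", newline="") as f:
        csv.writer(f).writerows([
            ["id", "ipa"],
            ["1", "a\u0303\u02d0"],  # NFD
            ["2", "\u00e3\u02d0"],   # NFC
        ])
    leftovers = list(find_non_nfc(demo))

print(leftovers)  # one entry: row 2, column 2
```

Running the same helper on `output.csv` should yield nothing if the normalization pass covered every cell.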

Example of a normalized cell:

```
Row 1747, Col 8
  Original   : o̞˞ o̞ õ̞ ɔ   [U+006F U+031E U+02DE U+0020 U+006F U+031E U+0020 U+006F U+031E U+0303 U+0020 U+0254]
  Normalized : o̞˞ o̞ õ̞ ɔ   [U+006F U+031E U+02DE U+0020 U+006F U+031E U+0020 U+00F5 U+031E U+0020 U+0254]
```
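
A dump like the one above can be reproduced with a few lines of Python. This is a sketch, not the exact script used for the issue; `codepoints` is a hypothetical helper, and the input string is rebuilt from the code points listed in the example.

```python
import unicodedata

def codepoints(s):
    """Format a string as space-separated U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in s)

# Row 1747, Col 8, rebuilt from the code points listed above.
original = "o\u031e\u02de o\u031e o\u031e\u0303 \u0254"
normalized = unicodedata.normalize("NFC", original)

print("Original   :", codepoints(original))
print("Normalized :", codepoints(normalized))
```

Note that `o` + U+031E + U+0303 becomes U+00F5 + U+031E: the tilde composes with the base letter across the intervening down tack because U+031E has a lower combining class (220) than U+0303 (230) and so does not block composition.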

Normalize IPA strings to NFC for consistency #382
