Hi, I noticed that the dataset contains a mixture of NFC and NFD Unicode forms for IPA strings. For example:
- Row 277, Col 7: ãː (NFD: a + COMBINING TILDE, i.e. U+0061 U+0303) vs. ãː (NFC: single precomposed ã, U+00E3).
Out of ~5.1 million cells, ~6,200 are not in NFC. This causes issues with string matching, e.g., "ã" != "ã" even though they look identical.
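The mismatch can be reproduced in a few lines of Python. This is only a minimal sketch; it uses escape sequences so the two forms stay distinct however the page renders them, and the variable names `nfd` and `nfc` are just illustrative:

```python
import unicodedata

nfd = "a\u0303\u02d0"   # a + COMBINING TILDE + length mark (decomposed)
nfc = "\u00e3\u02d0"    # precomposed a-with-tilde + length mark

print(nfd == nfc)                                 # False
print(len(nfd), len(nfc))                         # 3 2
print(unicodedata.normalize("NFC", nfd) == nfc)   # True
```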
To fix this, I applied NFC normalization across the CSV like this:
    import csv
    import unicodedata

    # Rewrite the CSV with every cell normalized to NFC.
    with open("input.csv", "r", encoding="utf-8", newline="") as infile, \
         open("output.csv", "w", encoding="utf-8", newline="") as outfile:
        reader = csv.reader(infile)
        writer = csv.writer(outfile)
        for row in reader:
            writer.writerow([unicodedata.normalize("NFC", cell) for cell in row])

Example of a normalized cell:
Row 1747, Col 8
Original : o̞˞ o̞ õ̞ ɔ [U+006F U+031E U+02DE U+0020 U+006F U+031E U+0020 U+006F U+031E U+0303 U+0020 U+0254]
Normalized : o̞˞ o̞ õ̞ ɔ [U+006F U+031E U+02DE U+0020 U+006F U+031E U+0020 U+00F5 U+031E U+0020 U+0254]
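For anyone who wants to locate the affected cells before rewriting the file, something along these lines might work. It is a rough sketch, not the exact script I used: `codepoints` and `find_non_nfc` are hypothetical helper names, the in-memory sample stands in for the real CSV, and the 1-based row/column numbering (header counted as row 1) is an assumption. `unicodedata.is_normalized` requires Python 3.8 or newer.

```python
import csv
import io
import unicodedata

def codepoints(text):
    """Format a string as space-separated U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)

def find_non_nfc(rows):
    """Yield (row, col, cell) for every cell that is not in NFC (1-based indices)."""
    for r, row in enumerate(rows, start=1):
        for c, cell in enumerate(row, start=1):
            if not unicodedata.is_normalized("NFC", cell):
                yield r, c, cell

# Tiny in-memory sample standing in for the real CSV (hypothetical data).
sample = io.StringIO("id,ipa\n1,o\u031e\u0303\n2,\u0254\n")
hits = list(find_non_nfc(csv.reader(sample)))

for r, c, cell in hits:
    fixed = unicodedata.normalize("NFC", cell)
    print(f"Row {r}, Col {c}: [{codepoints(cell)}] -> [{codepoints(fixed)}]")
```

With a real file, the `io.StringIO` sample would be replaced by a handle from `open(path, encoding="utf-8", newline="")`.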