Unicode confusables and normalization #23

@LinguList

We have more or less clarified this in code already:

  • normalize is a one-to-many conversion procedure in which only single characters are allowed; it is transcription-system specific, since different systems may normalize in different ways
  • confusables that go beyond this are excluded from normalization and placed into the alias section
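
To make the rule above concrete, here is a minimal sketch of per-system, single-character normalization. It is not the pyclts implementation: the `NORMALIZE` dictionary, the system name "x-demo", the function names, and the mappings themselves are assumptions chosen for illustration.

```python
# Hypothetical per-system tables (an assumption, not pyclts data).
# Each key must be exactly one character; the value is its replacement
# and may be one or more characters.
NORMALIZE = {
    "bipa": {
        ":": "\u02d0",  # ASCII colon -> IPA length mark
        "g": "\u0261",  # Latin g -> IPA script g
    },
    "x-demo": {
        ":": "\u02d0",  # this made-up system keeps the plain Latin g
    },
}


def check_table(table):
    """Raise ValueError if any key is not a single character."""
    bad = sorted(key for key in table if len(key) != 1)
    if bad:
        raise ValueError(f"keys must be single characters: {bad!r}")


def normalize(text, system="bipa"):
    """Replace each character of text using the table of the given system."""
    table = NORMALIZE[system]
    check_table(table)
    return "".join(table.get(char, char) for char in text)


bipa_form = normalize("ga:", "bipa")
demo_form = normalize("ga:", "x-demo")
```

The same input gives different results depending on the system, which is why the table cannot be shared globally. Anything that needs a multi-character key is rejected by `check_table` and would go to the alias section instead.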

But we also started to collect examples in cldf/multicode. Many of them belong to what we would use to normalize a dataset, but not all.

I think we can drop multicode: it was never really followed up, and we would have to think about how to integrate it into any of our tools. (Maybe one could use it for normalization in linse, where we also have a small normalization procedure for bipa only, so that linse can be used without depending on pyclts.) But before dropping it, we should thoroughly check that we have harvested all major characters from the Unicode confusables list.
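
The check against the confusables list could be sketched roughly as follows. This assumes the semicolon-separated layout of the Unicode `confusables.txt` file (source code points, target code points, type, then a `#` comment). The `SAMPLE` lines are written in that layout for illustration rather than copied from the file, and the `table` and `inventory` arguments are hypothetical stand-ins for a system's normalize table and its set of valid symbols.

```python
# Sample lines in the layout of confusables.txt (illustrative only).
SAMPLE = (
    "# source ; target ; type # comment\n"
    "0261 ;\t0067 ;\tMA\t# LATIN SMALL LETTER SCRIPT G -> LATIN SMALL LETTER G\n"
    "0441 ;\t0063 ;\tMA\t# CYRILLIC SMALL LETTER ES -> LATIN SMALL LETTER C\n"
    "0430 ;\t0061 ;\tMA\t# CYRILLIC SMALL LETTER A -> LATIN SMALL LETTER A\n"
    "0072 006E ;\t006D ;\tMA\t# r + n -> m (multi-character source)\n"
)


def parse_confusables(lines):
    """Yield (source, target) strings from confusables-style lines."""
    for line in lines:
        # Drop a possible byte-order mark and any trailing comment.
        line = line.lstrip("\ufeff").split("#", 1)[0].strip()
        fields = [field.strip() for field in line.split(";")]
        if len(fields) < 2 or not fields[0] or not fields[1]:
            continue
        source = "".join(chr(int(cp, 16)) for cp in fields[0].split())
        target = "".join(chr(int(cp, 16)) for cp in fields[1].split())
        yield source, target


def uncovered(pairs, table, inventory):
    """Return (lookalike, wanted) pairs that the table does not handle.

    A pair is reported when one member is a valid symbol of the system
    and its single-character lookalike is neither valid itself nor a key
    of the normalize table. Multi-character entries are skipped, since
    they would belong to the alias section.
    """
    found = set()
    for source, target in pairs:
        if len(source) != 1 or len(target) != 1:
            continue
        for wanted, lookalike in ((source, target), (target, source)):
            if (wanted in inventory and lookalike not in inventory
                    and lookalike not in table):
                found.add((lookalike, wanted))
    return sorted(found)


# Hypothetical system: a tiny inventory and a one-entry normalize table.
inventory = {"\u0261", "a", "c"}
table = {"g": "\u0261"}
gaps = uncovered(parse_confusables(SAMPLE.splitlines()), table, inventory)
```

Both directions of each pair are tested because the confusables list maps towards its own prototype character, which is not necessarily the symbol a transcription system prefers. Here the Cyrillic lookalikes of "a" and "c" are reported as gaps, while the "g" pair is already covered by the table.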
