Internationalization of page titles #140

@replaid

As a wiki author working in a language with a non-Latin script, I want to be able to link to a page like [[Гильдии]] (the Russian equivalent of [[Guilds]]). Currently wiki strips all non-Latin characters from page titles, so Гильдии is converted to an empty slug.
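To make the failure concrete, here is a minimal Python sketch of the behavior described above. It assumes the current slug rule amounts to "turn whitespace into hyphens, drop everything except ASCII letters, digits, and hyphens, then lowercase"; the function name `current_as_slug` is made up for illustration, and the real asSlug may differ in detail.

```python
import re

def current_as_slug(title: str) -> str:
    """Assumed approximation of today's slug rule (not the actual wiki code)."""
    hyphenated = re.sub(r"\s", "-", title)                 # whitespace -> hyphen
    ascii_only = re.sub(r"[^A-Za-z0-9-]", "", hyphenated)  # drop everything else
    return ascii_only.lower()

print(repr(current_as_slug("Guilds")))     # 'guilds'
print(repr(current_as_slug("Гильдии")))    # '' (every letter is discarded)
print(repr(current_as_slug("по-русски")))  # '-' (only the hyphen survives)
```

Under that assumption, every all-Cyrillic title without a hyphen collapses to the same empty slug.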

In the specific case of Russian, the alphabet is very phonetic, so many Russian websites solve this problem by transliterating Cyrillic letters to Latin letters or clusters of Latin letters, which in this case gives Gil'dii. However, it seems likely that such a "Russian mode" would not be the best solution for wiki.
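For comparison, a transliteration-based "Russian mode" could look roughly like the sketch below. The `TRANSLIT` table is a deliberately partial, hypothetical mapping that covers only a handful of letters; real sites use complete tables and differ in their conventions.

```python
# Partial, illustrative Cyrillic-to-Latin table (an assumption, not a standard).
TRANSLIT = {
    "г": "g", "и": "i", "л": "l", "ь": "'", "д": "d",
    "ш": "sh", "щ": "shch", "я": "ya",
}

def transliterate(text: str) -> str:
    """Map each known Cyrillic letter to its Latin cluster, preserving case."""
    out = []
    for ch in text:
        latin = TRANSLIT.get(ch.lower())
        if latin is None:
            out.append(ch)                  # unknown characters pass through
        elif ch.isupper():
            out.append(latin.capitalize())  # keep the capitalization of the input
        else:
            out.append(latin)
    return "".join(out)

print(transliterate("Гильдии"))  # Gil'dii
```

The obvious drawback is that every script needs its own table, and the mapping is not generally reversible.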

It looks to me like there is a fork in the road. I will call the two general paths I see "incompatible slug" and "compatible slug" as a best effort to describe this.

Incompatible slug

We could opt to present non-Latin page title characters directly in the URLs and let the backend map those to and from some kind of encoding for the filenames (or even filenames in a subset of UTF-8). This is the general approach Wikipedia takes. When someone links to [[Гильдии]], Wikipedia percent-encodes the UTF-8 bytes of that page title in the link, like <a href="/wiki/%D0%93%D0%B8%D0%BB%D1%8C%D0%B4%D0%B8%D0%B8">. Users see /wiki/Гильдии in the URL bar. I don't know exactly what happens on the back end.
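The percent-encoding step is easy to reproduce with Python's standard library. This only illustrates the URL form shown above; `page_href` is a hypothetical helper, and nothing here reflects how Wikipedia's backend actually stores pages.

```python
from urllib.parse import quote, unquote

def page_href(title: str) -> str:
    """Build a /wiki/ link by percent-encoding the UTF-8 bytes of the title."""
    return "/wiki/" + quote(title, safe="")

href = page_href("Гильдии")
print(href)           # /wiki/%D0%93%D0%B8%D0%BB%D1%8C%D0%B4%D0%B8%D0%B8
print(unquote(href))  # /wiki/Гильдии
```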

I suspect this approach has a different texture than the current approach of Federated Wiki. I think it would represent a change in direction and would have repercussions that would upset the "just enough" ethos that characterizes this project.

Compatible slug

Alternatively, we could convert non-Latin letters to some slug that is compatible with the current backend.

This is pretty much exactly how domain names with non-Latin letters are handled: they are lowercased and mapped to ASCII letters, digits, and hyphens using Punycode (Гильдии becomes xn--c1aclbap3j), and are then compatible with existing DNS. Slugs that are not based primarily on Latin characters are unreadable to humans, but they are unambiguously decodable to a lowercased version of the non-Latin input.
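The round trip can be tried with Python's built-in `punycode` codec. This is a minimal sketch: the helper names `encode_slug` and `decode_slug` are hypothetical, and the `xn--` prefix is added by hand because the codec only produces the part after it.

```python
PREFIX = "xn--"

def encode_slug(title: str) -> str:
    """Lowercase the title and Punycode-encode it behind the xn-- prefix."""
    return PREFIX + title.lower().encode("punycode").decode("ascii")

def decode_slug(slug: str) -> str:
    """Recover the lowercased title; slugs without the prefix are returned as-is."""
    if not slug.startswith(PREFIX):
        return slug
    return slug[len(PREFIX):].encode("ascii").decode("punycode")

slug = encode_slug("Гильдии")
print(slug)               # xn--c1aclbap3j
print(decode_slug(slug))  # гильдии
```

Note that the original capitalization is lost: decoding yields гильдии, not Гильдии.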

If we were to add Punycode encoding to the asSlug method, I think a lot of things would just work. I base this on how well wiki already works when I create a page by typing [[по-русски]]: the letters are all discarded and the slug is just -, yet everything works fine as long as I don't make another such page.
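One possible shape for such a change, sketched in Python rather than in the wiki's own code: keep the existing rule for titles that are pure ASCII, and fall back to Punycode otherwise. The function `as_slug` below and the ASCII-only rule it preserves are assumptions for illustration, not the project's actual implementation, and how mixed Latin and non-Latin titles should be handled is left open.

```python
import re

def as_slug(title: str) -> str:
    """Hypothetical slug rule: unchanged for ASCII titles, Punycode otherwise."""
    hyphenated = re.sub(r"\s", "-", title)
    if hyphenated.isascii():
        # Assumed current behavior: keep ASCII letters, digits, hyphens; lowercase.
        return re.sub(r"[^A-Za-z0-9-]", "", hyphenated).lower()
    # Non-ASCII title: lowercase, then Punycode-encode behind the xn-- prefix.
    return "xn--" + hyphenated.lower().encode("punycode").decode("ascii")

print(as_slug("Guilds"))     # guilds
print(as_slug("Гильдии"))    # xn--c1aclbap3j
print(as_slug("по-русски") == as_slug("Гильдии"))  # False: no more collisions
```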

I have not yet dug into the code for sitemaps and searches, but I would imagine that this code would need to learn to Punycode-decode slugs that begin with xn--. Once that is in place, I think the encoding would be quite transparent.

  1. Author links to [[Гильдии]]
  2. Slug is calculated to be xn--c1aclbap3j
  3. Click to create the page, the file is created with filename xn--c1aclbap3j
  4. Reader types гил into the search bar
  5. Search code has seen xn--c1aclbap3j in the sitemap information and decoded that to гильдии for search matching purposes
  6. гил matches as a substring of гильдии just as guil matches as a substring of guilds for a page named Guilds.
  7. Reader is given a link to the page with slug xn--c1aclbap3j and clicks it
  8. Wiki sees the Punycode and, while the page loads, displays the decoded slug in lowercased non-Latin characters (гильдии); this is replaced with Гильдии once the page title loads

So the only real user-visible wart is the presence of the Punycode in the link itself.
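Steps 4 to 6 of the walkthrough above can be sketched as follows. The sitemap is modeled here as a plain list of slugs, and `searchable_name` and `search` are hypothetical names; I have not checked how the real sitemap and search code are structured.

```python
def searchable_name(slug: str) -> str:
    """Decode xn-- slugs for matching; leave ordinary slugs untouched."""
    if not slug.startswith("xn--"):
        return slug
    try:
        return slug[4:].encode("ascii").decode("punycode")
    except UnicodeError:
        return slug  # malformed Punycode: fall back to the raw slug

def search(sitemap_slugs: list[str], query: str) -> list[str]:
    """Return the slugs whose decoded name contains the lowercased query."""
    needle = query.lower()
    return [s for s in sitemap_slugs if needle in searchable_name(s)]

sitemap = ["guilds", "xn--c1aclbap3j", "welcome-visitors"]
print(search(sitemap, "гил"))   # ['xn--c1aclbap3j']
print(search(sitemap, "guil"))  # ['guilds']
```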

This could even be a stepping stone toward eventually keeping the whole client experience in native languages.

I may be missing other uses of the slug that need to be accounted for.

This issue is an offshoot of conversations at fedwiki/wiki-client#103 and #139.
