Talk:Localized Language Names

From CrossWire Bible Society
Revision as of 21:16, 13 November 2009 by Osk (Talk | contribs) (Native Forms)

Specifying Scripts

Some languages have multiple scripts. What is the proper way to show that? E.g. Traditional vs Simplified Chinese? And I think Azeri has multiple scripts.

For BibleDesktop, we have localized zh (traditional) and zh_CN (simplified). This is not quite right, but fits pragmatically based on Java's locale and resource bundle mechanism.

For Java, a locale is lang, lang_country, lang_country_dialect, or lang__dialect (where the country is unstated). Java calls the dialect field the "variant". There is no standard for it, so it could be used for script.

I think there is a standard for scripts and that CLDR and ICU are starting to do something with it. Any thoughts??? --Dmsmith 00:37, 13 November 2009 (UTC)
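As a rough illustration of the four locale forms listed above, here is a minimal Java sketch. The class and method names are made up for this example, and putting "Hant" in the variant slot is only the workaround floated in this comment, not an established convention.

```java
import java.util.Locale;

final class LegacyLocaleForms {
    /** Builds a Locale from the classic (language, country, variant) triple. */
    static Locale legacy(String language, String country, String variant) {
        // This constructor predates script support. Newer JDKs mark it
        // deprecated, but it still shows the legacy naming scheme.
        return new Locale(language, country, variant);
    }

    public static void main(String[] args) {
        System.out.println(legacy("zh", "", ""));       // lang
        System.out.println(legacy("zh", "CN", ""));     // lang_country
        System.out.println(legacy("zh", "TW", "Hant")); // lang_country_variant
        System.out.println(legacy("zh", "", "Hant"));   // lang__variant (no country)
    }
}
```

The last form is the one with the double underscore: the country is empty, so the variant follows two separators.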

Script codes come from ISO 15924, and according to BCP 47, the way to include them in a locale is between the language and region, so en-Latn-US would be English written in Latin script in/for the US. (This is a bad example because BCP 47 says not to overspecify by naming a script when it should be obvious.) BCP 47 does specify using a hyphen to separate subtags, but I would guess that Java would be happy with the same style of tags if you just map the hyphens to underscores.
Traditional Chinese is zh-Hant, simplified is zh-Hans. Mongolian in Mongolian script, Cyrillic, and Latin would be mn-Mong, mn-Cyrl, and mn-Latn respectively.
This is all incorporated into CLDR and ICU, but more importantly is officially recognized/maintained by ISO, Unicode, and IANA. --Osk 03:11, 13 November 2009 (UTC)
Many thanks. For Chinese you gave the simplified form. When we add support to SWORD for these language codes, how do you see us supporting/specifying both Hans and Hant? --Dmsmith 10:32, 13 November 2009 (UTC)
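For reference, the BCP 47 tags mentioned in this thread can be parsed with `java.util.Locale` on Java 7 or later, which is newer than the Java discussed here. The sketch below is only an assumption about how a frontend might read such tags. The class name and the `parse` helper are hypothetical, and the underscore mapping reflects the guess made above rather than any documented convention.

```java
import java.util.Locale;

final class ScriptTags {
    /** Parses a BCP 47 tag, also accepting underscores as separators. */
    static Locale parse(String tag) {
        // Locale.forLanguageTag expects hyphens, so map underscores first.
        return Locale.forLanguageTag(tag.replace('_', '-'));
    }

    public static void main(String[] args) {
        String[] tags = {"zh-Hant", "zh-Hans", "mn_Cyrl", "en-Latn-US"};
        for (String tag : tags) {
            Locale locale = parse(tag);
            // Print the canonical tag followed by its separate subtags.
            System.out.printf("%s: language=%s script=%s region=%s%n",
                    locale.toLanguageTag(), locale.getLanguage(),
                    locale.getScript(), locale.getCountry());
        }
    }
}
```

With this approach Hans and Hant stay distinct through `getScript()`, so zh-Hans and zh-Hant can be told apart without borrowing a country code.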

Native Forms

Perhaps you may have gathered from my response on sword-devel that I'm not really keen on showing the native name. I think there are too many difficulties.

  • Many of these show up as boxes when browsing the page. What font should be used? (In CSS, it is possible to suggest fonts. That way if the user had them installed, they could view the page.)
  • Is grc really koine? I didn't think they spoke koine in 1400. And when koine was spoken, I think they used all caps and didn't use accents. And spaces between words did not exist.
  • In Middle English or Old English, would they really have spelled Middle, Old and English that way? I think these are modern forms, not native forms. And would they have described their language as Old, or Middle? After all, it is really difficult to read Middle, let alone Old, English.
  • In awc, Western Acipa, do they really say Western?
  • In cak, Central Cakchiquel, do they really say Central?
  • If a module is in the ug language, when there are several script choices, which is it: Uyghurche or ئۇيغۇرچە? That is, does a person who speaks ug know both scripts? Or is one of them "squiggly worms"?

For these reasons, I think it is a good idea to have native forms, but to leave frontends free to decide how to use them.

JSword uses language names as nodes in a tree. If they are localized to the end user, then it is clear what they are getting. If they can't read it, perhaps because of the script or because of a wrong font, then all they know is that it doesn't make sense to them and to avoid it. (To me it makes sense in viewing a conf to show the native representation, because that gives it context.)

May I recommend upgrading to Windows 7? :) Somehow it actually has font support for all of these scripts.
There is some noise and there are some errors in the data from Xiphos. The contents of the page here simply consisted of Xiphos' data minus all entries that were exactly identical to the ISO 639-3 names. Almost all of the ancient language names are incorrect. I think labeling grc as koine was well-intentioned, but it is certainly incorrect. If we especially want to tag resources as koine, we could do so by adding our own subtag, e.g. grc-koine. But I think it serves no good purpose to differentiate koine resources like the GNT & LXX from non-koine resources like Liddell-Scott. I think both upper & lower case forms existed, though they weren't used as casing pairs. I'm not sure about the use of diacritics either. But I think these are tolerable anachronisms.
The names of Old/Middle English and descriptions of various African and American languages need to be weeded out or corrected. I think these were just missed by the diff process because they're slightly modified. In some cases, there's definitely been an attempt to nativize the names, but I've found one so far that is actually slightly offensive in its language. In other words, I think I need to hand-check everything.
Regarding the two (now three) forms of ug, this is a case where a language is commonly written in one of various scripts. Depending on the ethnic group or location of a speaker, one or another script might be dominant. Some speakers may know more than one orthography. We should probably re-tag these as ug-Latn, ug-Arab, and ug-Cyrl. --Osk 21:16, 13 November 2009 (UTC)
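To make the re-tagging idea concrete, here is a minimal Java sketch of how a frontend might look up a native name by script-qualified tag and fall back to a localized name when it has none. Everything here is hypothetical: the class, the table, and the fallback string are illustrations, and only the two ug forms quoted earlier in this section are filled in.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

final class NativeNames {
    // Hypothetical table keyed by BCP 47 tag. The ug-Cyrl form is left out
    // because this discussion does not quote it.
    private static final Map<String, String> NAMES = new LinkedHashMap<>();
    static {
        NAMES.put("ug-Latn", "Uyghurche");
        NAMES.put("ug-Arab", "ئۇيغۇرچە");
    }

    /** Returns the native name for a tag, or the fallback if none is recorded. */
    static String nativeName(String tag, String fallback) {
        // Round-trip through Locale so the casing of the tag is canonical.
        String canonical = Locale.forLanguageTag(tag).toLanguageTag();
        return NAMES.getOrDefault(canonical, fallback);
    }

    public static void main(String[] args) {
        System.out.println(nativeName("ug-Latn", "Uyghur"));
        System.out.println(nativeName("ug-Arab", "Uyghur"));
        System.out.println(nativeName("ug-Cyrl", "Uyghur")); // not in table, falls back
    }
}
```

Keeping the fallback as a parameter leaves the display decision to each frontend, which fits the earlier suggestion that frontends decide how to present native forms.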