Talk:Localized Language Names

From CrossWire Bible Society
Revision as of 01:58, 11 February 2013 by Dmsmith (talk | contribs) (Language codes not yet having a native form)


Specifying Scripts

Some languages have multiple scripts. What is the proper way to show that? E.g. Traditional vs Simplified Chinese? And I think Azeri has multiple scripts.

For BibleDesktop, we have localized zh (traditional) and zh_CN (simplified). This is not quite right, but it is a pragmatic fit for Java's locale and resource bundle mechanism.

For Java a locale is lang, lang_country, lang_country_dialect, or lang__dialect (where country is unstated). There is no standard for dialect, so it could be used for script.

I think there is a standard for scripts and that CLDR and ICU are starting to do something with it. Any thoughts??? --Dmsmith 00:37, 13 November 2009 (UTC)

Script codes come from ISO 15924, and according to BCP 47, the way to include them in a locale is between the language and region, so en-Latn-US would be English written in Latin script in/for the US. (This is a bad example because BCP 47 says not to overspecify by naming a script when it should be obvious.) BCP 47 does specify using hyphen to separate tags, but I would guess that Java would be happy with the same style of tags if you just map the hyphens to underscores.
Traditional Chinese is zh-Hant, simplified is zh-Hans. Mongolian in Mongolian script, Cyrillic, and Latin would be mn-Mong, mn-Cyrl, and mn-Latn respectively.
This is all incorporated into CLDR and ICU, but more importantly is officially recognized/maintained by ISO, Unicode, and IANA. --Osk 03:11, 13 November 2009 (UTC)
Many thanks. For Chinese you gave the simplified form. When we add support to SWORD for these language codes, how do you see us supporting/specifying both Hans and Hant? --Dmsmith 10:32, 13 November 2009 (UTC)
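
To make the subtag order described above concrete, here is a minimal Python sketch that splits a simple tag into language, script and region, and maps the hyphens to underscores. The names `split_tag` and `to_underscore` are invented for this illustration. It handles only the language[-Script][-REGION] shapes discussed in this thread, not the full BCP 47 grammar, and it makes no claim about what Java's locale handling actually accepts.

```python
import re

# Assumption: only language[-Script][-REGION] is supported here.
# Variants, extensions and private-use subtags are deliberately ignored.
_TAG = re.compile(
    r"^([A-Za-z]{2,3})(?:-([A-Za-z]{4}))?(?:-([A-Za-z]{2}|[0-9]{3}))?$"
)

def split_tag(tag):
    """Return (language, script, region); missing parts are None."""
    m = _TAG.match(tag)
    if m is None:
        raise ValueError(f"unsupported tag: {tag!r}")
    lang, script, region = m.groups()
    return (
        lang.lower(),                        # language: lowercase, e.g. "zh"
        script.title() if script else None,  # script: title case, e.g. "Hant"
        region.upper() if region else None,  # region: uppercase, e.g. "US"
    )

def to_underscore(tag):
    """Join the recognised subtags with underscores instead of hyphens."""
    return "_".join(part for part in split_tag(tag) if part)

print(split_tag("zh-Hant"))
print(split_tag("en-Latn-US"))
print(to_underscore("mn-Cyrl"))
```

Keeping the script as its own field means a frontend can still tell zh-Hant from zh-Hans even when no region is given.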

Native Forms

As you may have gathered from my response on sword-devel, I'm not really keen on showing the native name. I think there are too many difficulties.

  • Many of these show up as boxes when browsing the page. What font should be used? (In CSS, it is possible to suggest fonts. That way if the user had them installed, they could view the page.)
  • Is grc really koine? I didn't think they spoke koine in 1400. And when koine was spoken, I think they used all caps and didn't use accents. And spaces between words did not exist.
  • In Middle English or Old English, would they really have spelled Middle, Old and English that way? I think these are modern forms, not native forms. And would they have described their language as Old, or Middle? After all, it is really difficult to read Middle, let alone Old, English.
  • In awc, Western Acipa, do they really say Western?
  • In cak, Central Cakchiquel, do they really say Central?
  • If a module is in the ug language, for which there are several script choices, which is it: Uyghurche or ئۇيغۇرچە? That is, does a person who speaks ug know both scripts? Or is one of them "squiggly worms"?

For these reasons, I think it is a good idea to have native forms available, but frontends should be free to decide how to use them.

JSword uses language names as nodes in a tree. If they are localized to the end user, then it is clear what they are getting. If they can't read it, perhaps because of the script or because of a wrong font, then all they know is that it doesn't make sense to them and to avoid it. (To me it makes sense in viewing a conf to show the native representation, because that gives it context.)

May I recommend upgrading to Windows 7? :) Somehow it actually has font support for all of these scripts.
There is some noise and there are some errors in the data from Xiphos. The contents of the page here simply consisted of Xiphos' data minus all entries that were exactly identical to the ISO 639-3 names. Almost all of the ancient language names are incorrect. I think labeling grc as koine was well-intentioned, but it is certainly incorrect. If we especially want to tag resources as koine, we could do so by adding our own subtag, e.g. grc-koine. But I think it serves no good purpose to differentiate koine resources like the GNT & LXX from non-koine resources like Liddell-Scott. I think both upper & lower case forms existed, though they weren't used as casing pairs. I'm not sure about the use of diacritics either. But I think these are tolerable anachronisms.
The names of Old/Middle English and descriptions of various African and American languages need to be weeded out or corrected. I think these were just missed by the diff process because they're slightly modified. In some cases, there's definitely been an attempt to nativize the names, but I've found one so far that is actually slightly offensive in its language. In other words, I think I need to hand-check everything.
Regarding the two (now three) forms of ug, this is a case where a language is commonly written in one of various scripts. Depending on the ethnic group or location of a speaker, one or another script might be dominant. Some speakers may know more than one orthography. We should probably re-tag these as ug-Latn, ug-Arab, and ug-Cyrl. --Osk 21:16, 13 November 2009 (UTC)
Xiphos also uses language names as nodes in a tree in its module manager. David Haslam 17:35, 21 December 2009 (UTC)
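
As a rough illustration of the re-tagging idea above, and of leaving the display choice to each frontend, here is a small Python sketch. The tables and the `display_name` function are hypothetical; the two native forms are the ones quoted in this discussion, the English name "Uyghur" is only a placeholder for whatever localized name a frontend already has, and no Cyrillic entry is included because none is given here.

```python
# Hypothetical lookup tables for illustration only.
NATIVE_NAMES = {
    "ug-Latn": "Uyghurche",
    "ug-Arab": "ئۇيغۇرچە",
}
ENGLISH_NAMES = {"ug": "Uyghur"}  # placeholder for a frontend's localized names

def display_name(tag, prefer_native=True):
    """Pick a label for a language tag.

    If native names are wanted and one is known for the exact tag, use it.
    Otherwise fall back to the name of the base language, and finally to
    the tag itself.
    """
    if prefer_native and tag in NATIVE_NAMES:
        return NATIVE_NAMES[tag]
    base = tag.split("-")[0]
    return ENGLISH_NAMES.get(base, tag)

print(display_name("ug-Latn"))
print(display_name("ug-Cyrl"))                       # no native form known here
print(display_name("ug-Arab", prefer_native=False))  # frontend opts out
```

A frontend that builds a tree of language nodes could call this with `prefer_native=False` for the node labels and with the default when showing a module's conf.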

Scots language is missing from the table

Although we don't yet have any SWORD module for the Scots language, this language is currently missing from the table.
NB. The existing ScotsGaelic module is not for this language! See issue MOD-121. David Haslam 17:40, 21 December 2009 (UTC)

I think there are many other languages not in this table. I think the table covers those for which we have modules in the CrossWire repositories. I don't know if we want much more than that. I can see having localizations for SWORD modules that are in other repositories, for GoBible modules and for interesting texts that we would like to have but for whatever reason don't currently have (e.g. requested modules). Perhaps Chris will elaborate by putting an introduction on the page? --Dmsmith 14:51, 22 December 2009 (UTC)
I'd be happy to have any languages added to the table. The table is just meant to be a collaborative editing facility for anyone willing to dig through the various sites listed at the bottom of the page and extract their data. Later processes (the perl scripts at ) can handle paring the list down to those that are currently employed by Sword. However, Scots isn't a candidate for inclusion on this page since we only include those languages whose native names differ from the English ISO name. Scots is called Scots in Scots, so there's no need for an entry. --Osk 18:12, 22 December 2009 (UTC)
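
The inclusion rule described above (list a language only when its native name differs from the English ISO name) can be sketched in a few lines of Python. This is not the Perl scripts mentioned in the comment, only an illustration; the function name and the sample entries are made up for the example.

```python
def names_worth_listing(entries):
    """Keep only languages whose native name differs from the English name.

    `entries` maps a language code to (english_iso_name, native_name).
    """
    return {
        code: native
        for code, (english, native) in entries.items()
        if native.strip() != english.strip()
    }

# Illustrative sample data, not taken from the actual table.
sample = {
    "sco": ("Scots", "Scots"),    # identical, so it is dropped
    "de": ("German", "Deutsch"),
    "fr": ("French", "français"),
}
print(names_worth_listing(sample))
```

Note that an exact comparison like this one misses entries that differ only slightly from the ISO name, so those would still need checking by hand.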

Explanation still needed

The top of the main page still has, "insert explanation here". Please would someone add a suitable Introduction section before the table. Best to separate the table into its own section for ease of future edits. David Haslam 14:24, 22 December 2009 (UTC)

Gentle reminder! David Haslam 04:18, 28 July 2012 (MDT)
I'll probably delete this page and move everything to SVN. A very large database of language codes & names is simply a bit impractical to store in a Wiki. Ultimately, I'd just like to move the table into the API so that we can request a language name based on a code and get the answer from the API. --Osk 22:26, 29 July 2012 (MDT)

Xiphos languages file

Xiphos has a very comprehensive languages file. For Win32 this is installed in "C:\Program Files (x86)\CrossWire\Xiphos\share\xiphos" David Haslam 04:11, 28 October 2012 (MDT)

Other Language Codes

If anyone wants to do the legwork, here are the language codes on the CrossWire server (as of Feb 10, 2013) under ~ftp/pub that are not in the table yet:

amu=Amuzgo, Guerrero
ava=Avaric -- Not actually used in a module but is an alternate code for av
cco=Chinantec, Comaltepec
chd=Chontal, Highland Oaxaca
chq=Chinantec, Quiotepec
chz=Chinantec, Ozumacín
ckb=Kurdish, Central
ckw=Cakchiquel, Western
cnl=Chinantec, Lalana
cnt=Chinantec, Tepetotutla
crh=Tatar, Crimean
cso=Chinantec, Sochiapan
cti=Chol, Tila
ctp=Chatino, Western Highland
enm=English, Middle [1100-1500]
huv=Huave, San Mateo Del Mar
ixl=Ixil, San Juan Cotzal
jac=Jacalteco, Eastern
jvn=Javanese, Caribbean
lzh=Chinese, Literary
mir=Mixe, Isthmus
miz=Mixtec, Coatzospan
mks=Mixtec, Silacayoapan
mvc=Mam, Central
mvj=Mam, Todos Santos Cuchumatán
mxq=Mixe, Juquila
mxt=Mixtec, Jamiltepec
ncl=Nahuatl, Michoacán
ngu=Nahuatl, Guerrero
nhy=Nahuatl, Northern Oaxaca
otq=Otomi, Querétaro
sml=Sama, Central
tzz=Tzotzil, Zinacantán
xtd=Mixtec, Diuxi-Tilantongo
zab=Zapotec, San Juan Guelavía
zaw=Zapotec, Mitla
zpo=Zapotec, Amatlán
zpq=Zapotec, Zoogocho
zpu=Zapotec, Yalálag
zpv=Zapotec, Chichicapan
zsr=Zapotec, Southern Rincon
ztq=Zapotec, Quioquitani-Quierí
zty=Zapotec, Yatee

These are also present, but I don't think that these should be in the table.

en-GB=English (UK)
en-US=English (US)
mul=Multiple languages
zxx=No linguistic content
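
For anyone doing the legwork, a comparison like the one above could be automated along these lines. This is a minimal Python sketch under stated assumptions: the `code=Name` layout and the ` -- ` note separator are taken from the listing above, while the function names, the `table_codes` set and the `ignore` set are hypothetical stand-ins for however the table is actually stored.

```python
def parse_codes(text):
    """Parse lines of the form code=Name into a dict, dropping ' -- ' notes."""
    codes = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue  # skip blank or malformed lines
        code, name = line.split("=", 1)
        name = name.split(" -- ", 1)[0]  # discard a trailing remark, if any
        codes[code.strip()] = name.strip()
    return codes

def missing_from_table(server_codes, table_codes, ignore=()):
    """Codes seen on the server that are neither in the table nor ignored."""
    return sorted(set(server_codes) - set(table_codes) - set(ignore))

listing = """\
amu=Amuzgo, Guerrero
ava=Avaric -- Not actually used in a module but is an alternate code for av
mul=Multiple languages
zxx=No linguistic content
"""

server = parse_codes(listing)
# table_codes here is a made-up stand-in for the codes already in the table.
print(missing_from_table(server, table_codes={"de", "fr"}, ignore={"mul", "zxx"}))
```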