Difference between revisions of "Encoding"

From CrossWire Bible Society

Latest revision as of 18:57, 11 January 2018

Because many people make mistakes with encodings, here is a short text in which I try to explain what exists.

Introduction

Computers work on words of bits, usually 8, 16, 32 or 64 bits wide today. By themselves these words are meaningless to the computer; they can be interpreted in many ways: as numbers, as instructions, as characters, as true/false values...

An encoding is a set of rules defining how to translate a sequence of words into a sequence of characters.
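As a small illustration of the point above, here is a sketch in Python (the byte values are arbitrary and chosen only for the example) that reads the same four bytes as a number, as characters, and as true/false values:

```python
import struct

# Four arbitrary bytes; nothing in the bytes themselves says what they mean.
raw = bytes([0x48, 0x69, 0x21, 0x00])

# Interpretation 1: one little-endian unsigned 32-bit integer.
as_number = struct.unpack("<I", raw)[0]

# Interpretation 2: the first three bytes as ASCII characters.
as_text = raw[:3].decode("ascii")

# Interpretation 3: each byte as a true/false value (non-zero means true).
as_flags = [b != 0 for b in raw]

print(as_number, as_text, as_flags)
```

Only the rule applied to the bytes changes between the three readings; the bytes stay the same.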

Single byte encoding

As computers can easily work with groups of bits called bytes, one of the first solutions was to use a different byte value for every character. This is easy to understand and to program: every byte encodes a single character, it is space efficient (memory was really expensive), you can measure the length of a string by counting bytes... You can have up to 256 characters, as a byte can have 256 different values.

When computing was mostly a Western affair, ASCII was one of the first and most successful encodings. It defines 128 characters, including the upper-case and lower-case letters of the Latin alphabet, digits, punctuation and control characters. Values above 127 are not defined (leaving room for extensions or control values). It is a good set of characters for writing English.

EBCDIC is another byte encoding, defined by IBM at about the same time. It includes mostly the same characters, but the values above 127 are not free.

As many languages use characters that are not included in the original ASCII set, many countries, or groups of countries, defined new encodings based on ASCII. The missing characters are placed in the 128 unused values (above 127). As 128 values are not enough for every alphabet on earth, there are many of these encodings. Among the most successful is the ISO-8859 family, and in particular ISO-8859-1, which includes characters for 29 western European and African languages.

Unicode

These encodings, in which 1 byte corresponds to 1 character, are really great for English and work well for languages based on the Latin alphabet, but they are really bad for other writing systems. Some of those use more than 256 signs.

You cannot write a text with characters from different encodings, and when sending a text you always have to agree on a common encoding.

Enter Unicode. But first, learn this: Unicode is not an encoding.

Unicode is an industry standard in which every character in any real language, past or present (now over 100,000 characters), is assigned a distinct numeric value called a code-point. This code-point can be recorded in many ways. The first 128 code-points have the same values as in the ASCII encoding.
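To make the idea of a code-point concrete, here is a minimal Python sketch. The sample characters are chosen for illustration only; Python's built-in `ord()` returns the code-point of a character.

```python
# A few sample characters, written as escapes so the source file's own
# encoding cannot affect them: A, e-acute, Greek alpha, Hebrew alef.
samples = ["A", "\u00e9", "\u03b1", "\u05d0"]

for ch in samples:
    # ord() gives the code-point as an integer; U+XXXX is the usual notation.
    print(ch, "U+%04X" % ord(ch))

# For ASCII text, the code-points are the same numbers as the ASCII byte values.
ascii_bytes = list("Genesis".encode("ascii"))
code_points = [ord(c) for c in "Genesis"]
```

Note that nothing here says how a code-point is stored in a file; that is the job of an encoding.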

Do you remember that Unicode is not an encoding? That said, the Unicode standard also defines several different encodings for storing code-points.

Multi-byte encoding

One word = one character was a great rule, so let's keep it. When Unicode was young, code-points were all below 65536, so every character could be represented with a 2-byte (16-bit) word. This is called UCS-2. UCS-2 is deprecated.

Today code-points run up to U+10FFFF, which needs 21 bits, so every character can be represented with a 4-byte (32-bit) word. This is called UCS-4, or UTF-32.

For an English text, because every character is encoded with 4 bytes, the size is four times that of the same text in ASCII encoding.
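The size difference can be checked with a short Python sketch. The sample sentence is only an example, and the `utf-32-le` codec is used so that no byte order mark is added to the count.

```python
text = "In the beginning God created the heaven and the earth."

# One byte per character in ASCII.
ascii_size = len(text.encode("ascii"))

# Four bytes per character in UTF-32 (little-endian variant, no byte order mark).
utf32_size = len(text.encode("utf-32-le"))

print(ascii_size, utf32_size, utf32_size // ascii_size)
```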

Variable byte length encoding

UCS-2 and UCS-4 are great, but they take too much space, and an old program that assumes 1 byte = 1 character will not work with them. We need an encoding in which ASCII characters stay the same, but which can still encode any Unicode character.

So UTF-8 was created. It is a variable-length encoding: ASCII characters are encoded exactly as in an ASCII text, and all other characters are encoded using 2, 3 or 4 bytes. You should use this for all your SWORD files now.
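The variable length is easy to observe. The following Python sketch uses sample characters picked for illustration, one from each length class.

```python
# One sample character per UTF-8 length class, written as escapes.
samples = {
    "A": 1,           # U+0041, ASCII
    "\u00e9": 2,      # U+00E9, e with acute accent
    "\u20ac": 3,      # U+20AC, euro sign
    "\U00010330": 4,  # U+10330, a Gothic letter outside the 16-bit range
}

# Number of bytes each character really takes in UTF-8.
utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

# Pure ASCII text is byte-for-byte identical in ASCII and in UTF-8.
same_as_ascii = "Genesis".encode("utf-8") == "Genesis".encode("ascii")
```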

For programs written to use UCS-2 there is also an extension, UTF-16, which uses a single 2-byte word for code-points below 65536 and a pair of words (a surrogate pair) for those above.
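How the pair of words is derived can be sketched in Python. The helper `utf16_units` is a hypothetical name written for this page, not part of any library. It ignores the reserved surrogate range U+D800 to U+DFFF, and its result is compared against Python's own `utf-16-be` codec.

```python
def utf16_units(code_point):
    """Return the list of 16-bit words that UTF-16 uses for one code-point."""
    if code_point < 0x10000:
        # Below 65536: a single 16-bit word holding the code-point itself.
        return [code_point]
    # Above: subtract 0x10000 and split the remaining 20 bits in two halves.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return [high, low]


def codec_units(ch):
    """Same result, obtained from Python's built-in UTF-16 encoder."""
    data = ch.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


print([hex(u) for u in utf16_units(0x1F600)])
```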

Common mistakes

If you use the wrong encoding, some characters will be displayed incorrectly because the computer decodes them using the wrong rules.

The two most common encodings are ISO-8859-1 and UTF-8.

A UTF-8 text decoded as ISO-8859-1 (or ISO-8859-*)

Each non-ASCII character will show up as two or three characters; for accented Latin letters these usually begin with Ã, for example Ã© in place of é.

Et Dieu vit la lumière, qu'elle était bonne; et Dieu sépara la lumière d'avec les ténèbres.

will show as

Et Dieu vit la lumiÃ¨re, qu'elle Ã©tait bonne; et Dieu sÃ©para la lumiÃ¨re d'avec les tÃ©nÃ¨bres.
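This mistake can be reproduced, and in this simple case undone, with a few lines of Python. The sketch below assumes the text has been mis-decoded only once and that no bytes were lost along the way.

```python
# The beginning of the verse above, with accented letters written as escapes.
original = "Et Dieu vit la lumi\u00e8re, qu'elle \u00e9tait bonne"

# Correct UTF-8 bytes, wrongly decoded with the ISO-8859-1 rules.
garbled = original.encode("utf-8").decode("iso-8859-1")

# Reversing the wrong step gives the bytes back; decoding them properly repairs the text.
repaired = garbled.encode("iso-8859-1").decode("utf-8")

print(garbled)
print(repaired)
```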

An ISO-8859-1 text decoded as UTF-8

The UTF-8 decoder will find an invalid multi-byte sequence. It can either report an error or put question marks or squares in place of the offending bytes:

Et Dieu vit la lumière, qu'elle était bonne; et Dieu sépara la lumière d'avec les ténèbres.

will show as

Et Dieu vit la lumi?re, qu'elle ?tait bonne; et Dieu s?para la lumi?re d'avec les t?n?bres.
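Both behaviours can be seen in Python, whose UTF-8 decoder is strict by default and substitutes the replacement character U+FFFD when asked to be lenient. This is only a sketch of one decoder; other programs may show a question mark or a square instead.

```python
# "lumiere" with a grave accent, stored as ISO-8859-1: the accented letter is byte 0xE8.
latin1_bytes = "lumi\u00e8re".encode("iso-8859-1")

# Strict decoding: the decoder reports an error.
try:
    latin1_bytes.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# Lenient decoding: the invalid byte is replaced by U+FFFD.
lenient = latin1_bytes.decode("utf-8", errors="replace")

print(strict_failed, lenient)
```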

Warning: If you see squares, it can also be a font problem! This happens if you are using a font which does not define some of the characters you are using.

A Mac:Central Europe text decoded as ANSI

Example: (found in the 2TGreek module conf file)

the individual Gšttingen editions that have appeared since 1935. 
...
(Stuttgart: WŸrttembergische Bibelanstalt, 1935; repr. in 9th ed., 1971).

This was converted with EditPad Lite and re-encoded as UTF-8:

the individual Göttingen editions that have appeared since 1935. 
...
(Stuttgart: Württembergische Bibelanstalt, 1935; repr. in 9th ed., 1971).
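The same repair can be sketched in Python instead of a text editor. Two assumptions are made here: that "ANSI" means Windows-1252 (Python codec `cp1252`), and that the Mac Central Europe encoding corresponds to Python's `mac_latin2` codec.

```python
# The garbled words from the example, written as escapes (s-caron and Y-diaeresis).
garbled = "G\u0161ttingen, W\u0178rttembergische"

# Undo the wrong decoding to recover the original bytes (assumption: Windows-1252),
# then decode those bytes with the Mac Central Europe rules (assumption: mac_latin2).
repaired = garbled.encode("cp1252").decode("mac_latin2")

# Finally, store the repaired text as UTF-8.
utf8_bytes = repaired.encode("utf-8")

print(repaired)
```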

Using a font that does not include the characters you want

If you use accented Greek characters, for example, but do not use a Unicode font that supports precomposed Greek characters (like Arial Unicode MS), the accented characters will appear as empty rectangles in your text. In BibleCS you can change the font in Options/Preferences/Display.

Normalization

For module making, it is strongly recommended that Unicode source files encoded as UTF-8 are normalized to NFC. Combining characters are permitted in source text, e.g. for diacriticals where there is no precomposed character in the Unicode Standard.[1]
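As a minimal sketch of what NFC normalization does, Python's standard `unicodedata` module can be used. This is one possible tool for the job, not necessarily the one used by module makers.

```python
import unicodedata

# "e" followed by a combining acute accent: two code-points, one visible character.
decomposed = "e\u0301"

# NFC composes the pair into the single precomposed code-point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))
```

A whole source file could be handled the same way: read it as UTF-8, normalize the text, and write it back as UTF-8.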

Searching in modules

The simple rule is that if a search request and the indexed text are not normalized the same, there will not be a hit.

Modules are prepared for search using StripFilter mechanisms. Front-ends should be sure to call SWModule::StripText() on the user input before passing to the search method to make sure both are normalized the same.
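The rule can be demonstrated without SWORD itself. The Python sketch below does not use `SWModule::StripText()`; the helper `normalize_for_search` is a hypothetical stand-in that merely applies NFC to both sides.

```python
import unicodedata


def normalize_for_search(text):
    """Hypothetical stand-in for a shared normalization step (here: NFC)."""
    return unicodedata.normalize("NFC", text)


# Indexed text: the Greek word "logos" with omicron + combining acute accent.
indexed = ["\u03bb\u03bf\u0301\u03b3\u03bf\u03c2"]

# User input: the same word, but keyed with the precomposed letter U+03CC.
query = "\u03bb\u03cc\u03b3\u03bf\u03c2"

# Without normalization the two strings differ, so there is no hit.
naive_hits = [v for v in indexed if query in v]

# With the same normalization on both sides, the word is found.
normalized_hits = [
    v for v in indexed
    if normalize_for_search(query) in normalize_for_search(v)
]

print(len(naive_hits), len(normalized_hits))
```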

Regarding rendering, no front-end should assume that the module is encoded in a way that works for it. When we did experiments, NFC was the best across the widest variety of front-ends, but no one way was best for every script, font or display engine. It would be best for each front-end to normalize the text before display. This normalization could differ from the one used for search.

The situation is made even more complicated by changes in the Unicode standard with respect to normalization. How the characters of some languages normalize to NFC has changed in later Unicode versions. One example is in the Myanmar block of the BMP. This has implications for existing and requested modules for Bible translations that use the Burmese script, e.g. the Judson 1835 translation and the S'gaw Karen translation.[2][3]
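Canonical ordering, the reordering step that normalization applies to combining marks, can be observed with Python's `unicodedata` module. The sketch uses a Latin letter with two dots rather than a Burmese example, simply because its combining classes are well known. The results depend on the Unicode version shipped with the Python interpreter.

```python
import unicodedata

# "q" followed by dot above (U+0307) and then dot below (U+0323).
typed = "q\u0307\u0323"

# Canonical combining classes: dot above is 230, dot below is 220.
ccc_above = unicodedata.combining("\u0307")
ccc_below = unicodedata.combining("\u0323")

# Normalization sorts the marks by ascending combining class,
# so the dot below moves in front of the dot above.
reordered = unicodedata.normalize("NFD", typed)

print(ccc_above, ccc_below, [hex(ord(c)) for c in reordered])
```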

Composition Exclusions

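There are certain precomposed Indic, Tibetan and Hebrew letters that decompose but do not recompose under any normalization form. The list of these characters is given here:

  • Composition Exclusions (http://unicode.org/Public/UNIDATA/CompositionExclusions.txt)

The decomposed forms are preferred over the precomposed forms, as indicated in:

  • Primary Exclusion List Table (http://www.unicode.org/reports/tr15/#Primary_Exclusion_List_Table)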

Composition exclusions are in these Indic scripts: Devanagari[4], Bengali, Gurmukhi[5], Oriya.
These particular scripts have some canonically decomposable characters that are generally not the preferred form. This can cause serious problems for SWORD search if the front-end users are accustomed to keying these composite characters. They will simply not find any matches!
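The effect on search can be sketched in Python with one such character, DEVANAGARI LETTER QA (U+0958), which appears in the Composition Exclusions list. The "indexed text" below is a made-up string for illustration.

```python
import unicodedata

# What a user might key: the precomposed letter QA.
keyed = "\u0958"

# NFC does not keep it; it becomes KA (U+0915) followed by NUKTA (U+093C).
nfc = unicodedata.normalize("NFC", keyed)

# An indexed text that was normalized to NFC therefore contains only the decomposed form.
indexed_text = unicodedata.normalize("NFC", "\u0958\u093e")  # hypothetical sample text

# Searching for the raw keyed character finds nothing ...
raw_hit = keyed in indexed_text
# ... while searching for its NFC form succeeds.
normalized_hit = nfc in indexed_text

print([hex(ord(c)) for c in nfc], raw_hit, normalized_hit)
```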


Notes:

  1. Be aware, however, that many fonts in many operating systems lack such ability, and that many of our front-ends use rendering widgets which could easily barf on such functionality. Xiphos and BibleTime ought to be OK from a renderer's perspective, provided a good system font can be found, but it may be that BibleDesktop would suffer. BibleCS probably has access to good fonts, but it uses very different technologies, so check that one carefully. BPBible, especially older versions, might have a tough time with advanced character sets. Eloquent is probably fine!
  2. The process of converting a string to NFC or NFD requires a stage called "canonical ordering", whereby codepoints are reordered in ascending order according to their canonical combining class [ccc].
  3. Unicode normalization can easily break Biblical Hebrew text. See page 9 of the SBL Hebrew Font User Manual.
  4. The writing system for Hindi and several other languages, including Nepali.
  5. The writing system for Eastern Punjabi.

End-of-line characters

Some SWORD utilities may barf when used on text files with Mac-style EOLs. Most Unicode text editors include a menu option to convert EOLs to Unix or Windows style.
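If no suitable editor is at hand, the conversion is simple enough to sketch in Python. The function name `to_unix_eols` is hypothetical. The function works on raw bytes, which is safe for UTF-8 and ASCII-based single-byte encodings but not for UTF-16 or UTF-32.

```python
def to_unix_eols(data):
    """Convert Windows (CR LF) and old Mac (CR) line endings to Unix (LF)."""
    # Handle CR LF first so that it does not turn into two line feeds.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


sample = b"In the beginning\rGod created\r\nthe heaven\n"
converted = to_unix_eols(sample)
print(converted)
```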

See also

  • Conversion to Unicode
  • File Formats
  • Fonts

External links

  • TECkit – a Text Encoding Conversion toolkit.
As well as being able to convert between encodings, it can also be used to convert text from one script to another, for any language that can be represented by more than one script. Example: Some languages in Central Asia can be represented in either Latin or Cyrillic script.